Very Random Numbers



  • I've only just started working on this large existing code base, so admittedly, maybe there's something I'm missing about this, but something tells me this may be a bit excessive.

    /* Generate a _very_ random number */
    $unique_id = md5((md5(rand(0,100000))*rand(0,100000)));


  • I'd like to see more background, but this has the makings of an outstanding WTF.

    First off, I'm guessing that this is PHP?

    1. So the first rand generates 100,001 numbers. This is far less entropy than just generating a random float or random int.
    2. Md5(x) doesn't do anything to randomize, in fact it could limit the entropy (if the source entropy wasn't already way under 128 bits) since it maps an infinitely large domain into a finite range.
    3. Then he multiplies the hash by another random number!
      1. For those who don't know PHP, md5() returns a string (of hexadecimal characters). That string is casted to int to multiple by the 2nd rand().
      2. String times int in PHP is well-defined, but has strange effects. If the string begins with a numeric, then PHP casts it to an integer with that value. Ie. (int)"2 French Hens" evaluates to 2. "7 Marry 3" evaluates to 7, and so on.If the string doesn't begin with a numeric, then it evaluates to zero.
      3. So in 1/6 of the cases (*disclaimer: math is provided with no warantees*), the "very random unique id" will actually just be md5(0).
      4. In the remaining cases, the entropy of md5()*rand() is greatly reduced since a hash is very unlikely to have more than just a few numeric characters in a row at the start of the string.
    So unless I made a logical mistake, this has great WTF characteristics: trying to "outsmart" the standard library, inventing a cockamamie scheme that reflects no actual knowledge on the subject, failing miserably but still getting a signoff and putting the code into production anyway.


  •  I think that trying to improve the randomness of built-in RNGs is a very common problem. Especially in PHP scripts. People just fail to see that in most cases, it's just getting worse and usually never improves the results.



  • Which is odd, because PHP has mt_rand().

    savar, if by "generates 100,001 numbers" you meant one of 100,0001, then yes. A brillant function indeed.

    whaps CS for deleting my post



  •  Out of curiosity, I used this method in a script to generate 100,000 IDs. Here are the results:

    Generated 100000 unique IDs, resulting with 56983 UNIQUE IDs.
    Sorted by increasing popularity:
    ...
    9cdfa26181dd4af6f5585eae4f57970b x3
    e31ba89e87cc2fcc8605c20618b3d5b6 x3
    25c23fb3191ef61cba0e9b6b1db8dfff x3
    f5b8c2566771a688a65cf97202fb5f0f x3
    8f81fa8534d98e0f3bbaf3bda5a1c6c5 x3
    244c636d9a0ef0dbe7c373431e41ebdf x3
    199723158503538b8a685601d083511f x3
    f7c67babedafc73bfd48c1d5fac222a8 x4
    9517fd0bf8faa655990a4dffe358e13e x2234
    cfcd208495d565ef66e7dff9f98764da x39982


  • @savar said:

    "7 Marry 3"

    That's "7 Mary 3" from one of us who actually gets the original reference.



  • @bartleby84 said:

    I've only just started working on this large existing code base, so admittedly, maybe there's something I'm missing about this, but something tells me this may be a bit excessive.

    /* Generate a _very_ random number */
    $unique_id = md5((md5(rand(0,100000))*rand(0,100000)));

     Clearly, they should have used the one true RNG:



  • @samanddeanus said:

     Clearly, they should have used the one true RNG: [xkcd reference]
     

    I'm rolling in pennies.

     

    It's safe to assume that on boards such as this, all xckd's have already been posted and played out.



  • @superjer said:

     Out of curiosity, I used this method in a script to generate 100,000 IDs. Here are the results:

    Generated 100000 unique IDs, resulting with 56983 UNIQUE IDs.
    Sorted by increasing popularity:
    ...
    9cdfa26181dd4af6f5585eae4f57970b x3
    e31ba89e87cc2fcc8605c20618b3d5b6 x3
    25c23fb3191ef61cba0e9b6b1db8dfff x3
    f5b8c2566771a688a65cf97202fb5f0f x3
    8f81fa8534d98e0f3bbaf3bda5a1c6c5 x3
    244c636d9a0ef0dbe7c373431e41ebdf x3
    199723158503538b8a685601d083511f x3
    f7c67babedafc73bfd48c1d5fac222a8 x4
    9517fd0bf8faa655990a4dffe358e13e x2234
    cfcd208495d565ef66e7dff9f98764da x39982

     

    And a quick Google reveals that cfcd208495d565ef66e7dff9f98764da=md5(0) and 9517fd0bf8faa655990a4dffe358e13e=md5("INF").



  • @dhromed said:

    It's safe to assume that on boards such as this, all xckd's have already been posted and played out.

    I think there's an xkcd about this... 



  • @mallard said:

    @superjer said:

     Out of curiosity, I used this method in a script to generate 100,000 IDs. Here are the results:

    Generated 100000 unique IDs, resulting with 56983 UNIQUE IDs.
    Sorted by increasing popularity:
    ...
    9cdfa26181dd4af6f5585eae4f57970b x3
    e31ba89e87cc2fcc8605c20618b3d5b6 x3
    25c23fb3191ef61cba0e9b6b1db8dfff x3
    f5b8c2566771a688a65cf97202fb5f0f x3
    8f81fa8534d98e0f3bbaf3bda5a1c6c5 x3
    244c636d9a0ef0dbe7c373431e41ebdf x3
    199723158503538b8a685601d083511f x3
    f7c67babedafc73bfd48c1d5fac222a8 x4
    9517fd0bf8faa655990a4dffe358e13e x2234
    cfcd208495d565ef66e7dff9f98764da x39982

     

    And a quick Google reveals that cfcd208495d565ef66e7dff9f98764da=md5(0) and 9517fd0bf8faa655990a4dffe358e13e=md5("INF").

    FEAR MY AWESOME GDATA SKILLZ!!!!!11
    cfcd208495d565ef66e7dff9f98764da = empty string
    9517fd0bf8faa655990a4dffe358e13e = INF
    f7c67babedafc73bfd48c1d5fac222a8 = 81956
    199723158503538b8a685601d083511f = 80685
    244c636d9a0ef0dbe7c373431e41ebdf = 186534
    8f81fa8534d98e0f3bbaf3bda5a1c6c5 = 84252
    f5b8c2566771a688a65cf97202fb5f0f = 81384
    25c23fb3191ef61cba0e9b6b1db8dfff = 71535
    e31ba89e87cc2fcc8605c20618b3d5b6 = 470180
    9cdfa26181dd4af6f5585eae4f57970b = 284634
    


  • @superjer said:

    Generated 100000 unique IDs, resulting with 56983 UNIQUE IDs.
    unique and UNIQUE are different?

    I assume that the IDs that are unique but not UNIQUE have hash collisions, yes?



  • @savar said:

    2. Md5(x) doesn't do anything to randomize, in fact it could limit the entropy (if the source entropy wasn't already way under 128 bits) since it maps an infinitely large domain into a finite range.
    Hash functions like md5 or sha1 can be used to increase the entropy [i]concentration[/i] for a poor entropy source.  They never produce more entropy than is already there, and may reduce the entropy.  However, if fed with 65536 bits with a total entropy of 100, a good hash function might produce 256 bits with total entropy of 80 - much higher entropy per bit.  The Linux kernel uses this for /dev/random.

    @savar said:
    3. So in 1/6 of the cases (*disclaimer: math is provided with no warantees*), the "very random unique id" will actually just be md5(0).
    In the spirit of Open Source, I provide a fix for this bug: the correct answer is sum(i=1..32, 6/(16^i)), which evaluates to 40%.  This is backed up by the empirical test seen previously.


  • @superjer said:

     Out of curiosity, I used this method in a script to generate 100,000 IDs. Here are the results:

    Generated 100000 unique IDs, resulting with 56983 UNIQUE IDs.
    Sorted by increasing popularity:
    ...
    9cdfa26181dd4af6f5585eae4f57970b x3
    e31ba89e87cc2fcc8605c20618b3d5b6 x3
    25c23fb3191ef61cba0e9b6b1db8dfff x3
    f5b8c2566771a688a65cf97202fb5f0f x3
    8f81fa8534d98e0f3bbaf3bda5a1c6c5 x3
    244c636d9a0ef0dbe7c373431e41ebdf x3
    199723158503538b8a685601d083511f x3
    f7c67babedafc73bfd48c1d5fac222a8 x4
    9517fd0bf8faa655990a4dffe358e13e x2234
    cfcd208495d565ef66e7dff9f98764da x39982

    This piqued my interest, so I made some research of my own in Python.  I started by defining some elementary random number sources.  The first one uses Python's built-in RNG directly for numbers in the range [0, 256):

    def rand():
    return random.randrange(0, 256)

    Second one is a homebrew RNG with "randomly" selected constants (i.e. bash my fingers on the number row, the only constraint being that the constants have to be odd):

    pr_seed=0
    def poorrand():
    global pr_seed
    pr_seed=(pr_seed*8243+73)%256
    return pr_seed

    Last one uses Python's built-in random again, but in a way that strongly biases it towards the low-end (inspiration from [url=http://thedailywtf.com/Articles/Random-Stupidity.aspx]this frontpage article[/url]): 

    def biasedrand():
    return random.randrange(0, 256)*random.randrange(0, 256)/255

    Next I defined two hash-random functions.  The first one is almost identical to the one in the OP, but in Python str*int produced a concatenation of that many copies of the string, so it works better:

    def hashrand(func):
    return hashlib.md5(hashlib.md5(str(func())).hexdigest()*func()).hexdigest()

    Second one is an improved one by me:

    def hashrand2(func):
    return hashlib.md5("".join([str(func()) for i in xrange(13)])).hexdigest()

    And on to the test results.  I tested each of the elementary functions alone, and with each of the hash-based randomness enhancers.  Sample set was 262144 (2**18).  I will present some general statistics, plus the least and most frequent values.  Let's start with the built-in random:

    Total 256 different values, average freq 1024.0
    13 (936), 7 (941), 201 (952)
    103 (1089), 52 (1107), 182 (1118)

    As expected, the output is well centered at the theoretical average, but there is considerable variation.  Now feed it through the first enhancer:

    Total 64091 different values, average freq 4.1
    30ad00f7125b587ffed290c185c6c4fc (1), 55438816aaa7d0f33d8875572fc8e6a5 (1), 72813ac7e0adf9a2d85c9197189f3d87 (1)
    8b2aea2e81d77b0dd74e0af98cf44030 (14), b0db8220f0f13cf8fb8b58274890d315 (17), d41d8cd98f00b204e9800998ecf8427e (1034)

    More different values, but a strong concentration to a single value, which occurs when the second RNG call in the "enhancer" returns 0.  No surprise there.  Next up, my improved enhancer:

    Total 262144 different values, average freq 1.0
    b1e3ddb43909344bb6bfdc9bb862edfd (1), 5a41e6867a37c556274b96a26415e89c (1), a06dbb750cc17c7a3c078d8237691caa (1)
    853d0c4808bc7e8344b4055f8f9bbe62 (1), 76876b56c4308faea9bd8b2b7d011354 (1), 2d224d0d42590e031b77e77ec40ac6ba (1)

    Now this is something.  The enhancer returns a different value every single time.  Which is to be expected, as I'm feeding the hash the concatenation of 13 independent calls to the base function.  Now we'll move on to the homebrew random number source:

    Total 128 different values, average freq 2048.0
    0 (2048), 1 (2048), 4 (2048)
    249 (2048), 252 (2048), 253 (2048)

    My hand-mashed constants only manage to produce half of the available values, but at least the distribution is dead even.  But what happens when wefeed it through a hash?

    Total 64 different values, average freq 4096.0
    70699d3da243b04bdefb014f440a86ba (4096), 6872df0c22b150c7ec1ac61a0fa49686 (4096), 416d85f143d5cec6d0df8bcee1fc1842 (4096)
    c249a1cd66769d3b191a62a6ed9af60e (4096), b676ba518b58e2b1da510b604674f330 (4096), d41d8cd98f00b204e9800998ecf8427e (4096)

    Oops!  The hash function actually reduces the selection of values here.  The random function has a rather short sequence due to only having 128 internal states, and since the hash calls the function twice, the sequence length of call pairs is 64.

    Total 128 different values, average freq 2048.0
    6778a79056633ab7a46728c90f00e340 (2048), b960a28a3792ecac8270a7df39d68215 (2048), 961aacbdb7cbe2554d0d9edc74a828a6 (2048)
    a9e5257e645e5e4be42a70641536708f (2048), c824f9a8f8bf344f8b0bba0c701a5737 (2048), ba3fddbf6eedf290dcbd73fab2bc42a5 (2048)

    The improved hash doesn't fare much better, suffering from the same sequence length problem.  But since it performs 13 calls to the base function, it at least manages to retain the number of different values.  And how about a biased generator?

    Total 256 different values, average freq 1024.0
    255 (4), 254 (9), 253 (16)
    2 (4757), 1 (5042), 0 (7703)

    Results are as expected.  Nearly 3% of the time the product of two random numbers will be small enough to round to 0.

    Total 43202 different values, average freq 6.1
    3f051c5889da43a3322705f4cd1d6afd (1), b1b4e79f63405be543b93a8d32eecf1a (1), 8ee7e9c45cec603b7cab5b3961e31913 (1)
    dcfcd07e645d245babe887e5e2daa016 (134), 60de623798082c910b03aa3d3e609bd0 (154), d41d8cd98f00b204e9800998ecf8427e (7731)

    Again the bad hash function produces bad results.

    Total 262144 different values, average freq 1.0
    7ff09356369c37b92481ff5fcf28ed28 (1), 006001a36b60f11fda70d546a81cde82 (1), 7d735393e2f0e8825b94dac97777fe3f (1)
    c2deb38a24cee02f331e18870b4d82d8 (1), c4571187f2cacea83ffa55d6f54417ea (1), 25ead6d33883c0042fff47efff7ada8c (1)

    Even in this strongly biased environment, the good hash function is able to produce totally unique values. 

    So what have we learned here?  Hash functions are viable in unique ID generation, but either a bad choice of base RNG or a bad hashing implementation can botch the results badly.

    Another useful application is that a hash function can hide the internal state of the RNG.  With knowledge of the RNG algorithm, an attacker may be able to deduce the internal state from enough output data and then predict future output.  With a good hash function applied to the output, this can be made infeasible.


Log in to reply