How to benchmark random number generators?

Today at work I felt like the pseudo random number generator I am using (C++; rand()) is far from producing good random numbers. I remember having read that it is better to use

(rand()/(RAND_MAX+1) )*N

than it is to use

rand()%N

in order to get a number in the range [0, N-1].

Basically I wanted to know how I could compare the outcome of both methods in terms of randomness. A bunch of ideas came ot mind, like plotting a histogram, computing average and standard deviation, counting the number of identical consecutive numbers but I think all those methods falls short. Is there any proven scientific method of comparing the randomness of a sequence of numbers?

DrJokepu

@seriousJoker said:

(rand()/(RAND_MAX+1) )*N

Am I the only one seeing serious problems with this?

I did omit the cast to higher precision on purpose. The purpose is better readability. In practise one would probably use

(int)(((double)rand()/((double)RAND_MAX+1.0) )*(double)N )

Better now?

gremlin

@seriousJoker said:

I did omit the cast to higher precision on purpose. The purpose is better readability. In practise one would probably use
(int)(((double)rand()/((double)RAND_MAX+1.0) )*(double)N )
Better now?

Actually, a simple decimal point after the 1 (and a cast to int at the end) would have been enough:

(int)(rand()/(RAND_MAX+1.0)*N)

Now that we have cleared up this issue, do you have any ideas to contribute to the original problem which is benchmarking random number generators?

@seriousJoker said:

Today at work I felt like the pseudo random number generator I am using (C++;
rand()) is far from producing good random numbers.

You have interesting senses. I'm able to feel, however, that an xkcd reference is coming. Then a Dilbert one.

Try [url=http://www.fourmilab.ch/random/][b]ent[/b][/url].

morbiuswilters

@Spectre said:

I'm able to feel, however, that an xkcd reference is coming.

I sense that anyone who makes a reference to that comic will have their face cut by Morb.

dhromed

@seriousJoker said:

Basically I wanted to know how I could compare the outcome of both methods in terms of randomness.

rand() ===== rand() ====== rand()

You're not going to get more randomness from an algorithm if the randomness component is the same as the old one. You can scale up a photo, but you won't get any more detail.

Nonetheless, your question is still valid. If your randomess components are actually different, how does one compare randomness? I never really troubled myself to find out, because any language you'll use has randomness that will exceed your randomness needs for most non-serious practical purposes (such as selecting a bunch of random ads for a website).

If you want true chaos, get your data from real sources, such as http://www.random.org/

@Spectre said:

...
Try ent.

At last a useful answer. Thanks.

@dhromed said:

@seriousJoker said:
Basically I wanted to know how I could compare the outcome of both methods in terms of randomness.

rand() ===== rand() ====== rand()
...

http://members.cox.net/srice1/random/crandom.html

"Do NOT use y = rand() % M; as this focuses on the lower bits of rand(). For linear congruential random number generators, which rand() often is, the lower bytes are much less random than the higher bytes. In fact the lowest bit cycles between 0 and 1. Thus rand() may cycle between even and odd (try it out). Note rand() does not have to be a linear congruential random number generator. It's perfectly permissible for it to be something better which does not have this problem."

I did not make this up. As I said in my first post I read this up. Using the tool "ent" however showed me that both methods basically generate about the same result in terms of entropy and serial correlation. So I recon that by now there are more sophisticated methods of rand() shipped with the latest compilers.

morbiuswilters

@seriousJoker said:

"Do NOT use y = rand() % M; as this focuses on the lower bits of rand(). For linear congruential random number generators, which rand() often is, the lower bytes are much less random than the higher bytes. In fact the lowest bit cycles between 0 and 1. Thus rand() may cycle between even and odd (try it out). Note rand() does not have to be a linear congruential random number generator. It's perfectly permissible for it to be something better which does not have this problem."

I did not make this up. As I said in my first post I read this up. Using the tool "ent" however showed me that both methods basically generate about the same result in terms of entropy and serial correlation. So I recon that by now there are more sophisticated methods of rand() shipped with the latest compilers.

If you're using glibc rand(), it is a linear congruential PRNG and will suffer this problem. Quite frankly, if you are concerned enough with the quality of the random output, you should not be using rand().

Generally only cryptographic applications are going to need this much randomness, which means you should be using a cryptographically secure PRNG. If you are on L00nix, then /dev/urandom should be a good source and there should be libraries for most languages. One note about using /dev/urandom: the kernel entropy tables are generated over time by sampling data from various sources. As such, if the computer is freshly started the kernel will not have much entropy. Most libs that try to read /dev/urandom just block if the requested number of random bits are not available, which will leave your program waiting until the kernel entropy tables build up.

This can be a problem with a machine that is using SSL/TLS from OpenSSL and is starting services on boot, as the OpenSSL libs will just sit and wait for the kernel to build up the entropy tables, which can take several minutes at times. What's worse is that it is damn near impossible to tell what is going on as there are no error messages -- the daemon just appears to be locked up. To relieve this, there are daemons that will keep /dev/urandom at minimum amount of available random data. They do this either by sampling additional sources of pseudo-random data that the kernel does not or by sacrificing some randomness when the number of available bits gets to low by sampling more frequently than the kernel does. The daemon I'm running is just called "rngd" but I believe there are a few others out there.

morbiuswilters

@morbiuswilters said:

If you're using glibc rand(), it is a linear congruential PRNG and will suffer this problem. Quite frankly, if you are concerned enough with the quality of the random output, you should not be using rand().

I should also add: you should not be benchmarking this stuff yourself. If the randomness of the data is that important, you should be using a proven, standard PRNG. If you are running glibc and ent did not detect a lack of randomness in the low bits of rand(), then something is amiss. However, the only time randomness is critical in an application is with cryptography, in which case you should be using a standard crypto library and PRNG. If you're just trying to learn, there are plenty of good resources out there, but just keep in mind that only a fool would roll their own crypto stack or try to "tweak" their PRNG. If your PRNG is not cryptographically secure, don't use it for cryptography, period.

I don't mean to sound like a jerk, I'm just pointing out the number one rule of practical crypto, which is: Never implement your own crypto system. Always use a proven, standard library.

@morbiuswilters said:

I don't mean to sound like a jerk, I'm just pointing out the number one rule of practical crypto, which is: Never implement your own crypto system. Always use a proven, standard library.

That's cool, I'm always willing to learn.

But don't worry, the work I deal with is not security related at all and I'm far from implementing my own rand function. It's just something I got curious about for no specific reason. Now that I am curious however I'm wondering about this issue of linear congruental PRNGs. The page I posted earlier tells that these sorts of PRNGs would always return an alternating even/odd pattern. I could not find this issue (using Visual Studio 2005 btw, not glibc). So I found this page [1] which claims to show the code for one such specific PRNG. However running that code didn't show the even/odd pattern either. So I'm wondering if the even/odd pattern is indeed a problem with linear congruental prngs or if that other page was just telling bs?

[1] http://www.xoax.net/comp/cpp/console/Lesson22.php

@seriousJoker said:

But don't worry, the work I deal with is not security related at all and I'm far from implementing my own rand function. It's just something I got curious about for no specific reason. Now that I am curious however I'm wondering about this issue of linear congruental PRNGs. The page I posted earlier tells that these sorts of PRNGs would always return an alternating even/odd pattern. I could not find this issue (using Visual Studio 2005 btw, not glibc). So I found this page [1] which claims to show the code for one such specific PRNG. However running that code didn't show the even/odd pattern either. So I'm wondering if the even/odd pattern is indeed a problem with linear congruental prngs or if that other page was just telling bs?

Oh and this would have quite an impact even on the low security stuff I'm doing. After all it would mean that rand()%2 would always output 0,1,0,1,0,1,0,... if I got that right. And in response to an older posting in here: It doesn't take quite any special senses to feel like something is wrong with that random number sequence, right? No hard feelings about the posting though.

@morbiuswilters said:

One note about using /dev/urandom: the kernel entropy tables are generated over
time by sampling data from various sources. As such, if the computer is
freshly started the kernel will not have much entropy. Most libs that try
to read /dev/urandom just block if the requested number of random bits are not
available, which will leave your program waiting until the kernel entropy tables
build up.

I object.

@random(4) said:

When the entropy pool is empty, reads from <font size="+1">[i]/dev/random[/i]</font>
will block until additional environmental noise is gathered.
   A read from the [i]/dev/urandom[/i] device <font size="+1">will not block waiting</font> for more entropy.</blockquote>
Emphasis mine; verified empirically.

morbiuswilters

@Spectre said:

I object.

Oh noes, I made a mistake and said /dev/urandom instead of /dev/random!

@Spectre said:

Emphasis mine; verified empirically.

You didn't really need to do that: the "u" in urandom means that it won't block. I was just lazy and distracted when I wrote that and didn't bother re-reading it carefully. Anyway, thanks for pointing it out.

bjolling

@seriousJoker said:

And in response to an older posting in here: It doesn't take quite any special senses to feel like something is wrong with that random number sequence, right? No hard feelings about the posting though.

I would say it is the other way around. The human brain is trained to see patterns even if there are none. So you need 'special' senses if you DON'T feel that something is wrong with random functions.

Hey, my iPod prefers to play Bruce Springsteen everytime I'm in the train!

@bjolling said:

So you need 'special' senses if you DON'T feel that something is wrong with
random functions.

Heh, I'm special then. I hear people complaining about RNGs all the time, and I only reply with "WTF are you talking about? So what if it produced 2 five times in a row, it's perfectly random".

Daid

@bjolling said:

Hey, my iPod prefers to play Bruce Springsteen everytime I'm in the train!

Put more then 1 song/album/artist on it.

It totally depends on what you need it for, a few bits but they need to be very random, use /dev/random
Lots of bits but not too predictable, /dev/urandom.
I don't care, as long as it's fast: rand()

Ilya_Ehrenburg

@morbiuswilters said:

@seriousJoker said:
"Do NOT use y = rand() % M; as this focuses on the lower bits of rand(). For linear congruential random number generators, which rand() often is, the lower bytes are much less random than the higher bytes. In fact the lowest bit cycles between 0 and 1. Thus rand() may cycle between even and odd (try it out). Note rand() does not have to be a linear congruential random number generator. It's perfectly permissible for it to be something better which does not have this problem."

I did not make this up. As I said in my first post I read this up. Using the tool "ent" however showed me that both methods basically generate about the same result in terms of entropy and serial correlation. So I recon that by now there are more sophisticated methods of rand() shipped with the latest compilers.
If you're using glibc rand(), it is a linear congruential PRNG and will suffer this problem. Quite frankly, if you are concerned enough with the quality of the random output, you should not be using rand().

I wonder how much that is still true. The man page says

NOTES
       The versions of rand() and srand() in the Linux C Library use the
       same random number generator as random() and srandom(), [b]so the lower-
       order bits should be as random as the higher-order bits[/b]. However, on
       older rand() implementations, the lower-order bits are much less ran
       dom than the higher-order bits.

Has anybody recent reliable info what is correct?

And regarding the rand()%M vs (int)((M*rand())/(1.0+RAND_MAX)) issue: if M is odd, the higher-order bits contribute as much to rand()%M as the lower order bits.

@Ilya Ehrenburg said:

Has anybody recent reliable info what is correct?

Cut from Glibc CVS HEAD:

[code]
/* Return a random integer between 0 and RAND_MAX.  */
int
rand ()
{
  return (int) __random ();
}

long int
__random ()
{
  int32_t retval;

  __libc_lock_lock (lock);

  (void) __random_r (&unsafe_state, &retval);

  __libc_lock_unlock (lock);

  return retval;
}

static struct random_data unsafe_state =
  {
/* FPTR and RPTR are two pointers into the state info, a front and a rear
   pointer.  These two pointers are always rand_sep places aparts, as they
   cycle through the state information.  (Yes, this does mean we could get
   away with just one pointer, but the code for random is more efficient
   this way).  The pointers are left positioned as they would be from the call:
        initstate(1, randtbl, 128);
   (The position of the rear pointer, rptr, is really 0 (as explained above
   in the initialization of randtbl) because the state table pointer is set
   to point to randtbl[1] (as explained below).)  */

    .fptr = &randtbl[SEP_3 + 1],
    .rptr = &randtbl[1],

/* The following things are the pointer to the state information table,
   the type of the current generator, the degree of the current polynomial
   being used, and the separation between the two pointers.
   Note that for efficiency of random, we remember the first location of
   the state information, not the zeroth.  Hence it is valid to access
   state[-1], which is used to store the type of the R.N.G.
   Also, we remember the last location, since this is more efficient than
   indexing every time to find the address of the last element to see if
   the front and rear pointers have wrapped.  */

    .state = &randtbl[1],

    .rand_type = TYPE_3,
    .rand_deg = DEG_3,
    .rand_sep = SEP_3,

    .end_ptr = &randtbl[sizeof (randtbl) / sizeof (randtbl[0])]
};

int
__random_r (buf, result)
     struct random_data *buf;
     int32_t *result;
{
  int32_t *state;

  if (buf == NULL || result == NULL)
    goto fail;

  state = buf->state;

  if (buf->rand_type == TYPE_0)
    {
      int32_t val = state[0];
      val = ((state[0] * 1103515245) + 12345) & 0x7fffffff;
      state[0] = val;
      *result = val;
    }
  else
    {
      int32_t *fptr = buf->fptr;
      int32_t *rptr = buf->rptr;
      int32_t *end_ptr = buf->end_ptr;
      int32_t val;

      val = *fptr += *rptr;
      /* Chucking least random bit.  */
      *result = (val >> 1) & 0x7fffffff;
      ++fptr;
      if (fptr >= end_ptr)
        {
          fptr = state;
          ++rptr;
        }
      else
        {
          ++rptr;
          if (rptr >= end_ptr)
            rptr = state;
        }
      buf->fptr = fptr;
      buf->rptr = rptr;
    }
  return 0;

 fail:
  __set_errno (EINVAL);
  return -1;
}


[/code]

I say TRWTF is the use of old-style parameter declarations.

morbiuswilters

@Spectre said:

Heh, I'm special then. I hear people complaining about RNGs all the time, and I only reply with "WTF are you talking about? So what if it produced 2 five times in a row, it's perfectly random".

That's incorrect. If the random data was actually random, then sure. However, "random" number generators are deterministic and many that were believed to be random were later found to be predictable and weak. So producing 2 five times in a row might actually be a sign of a much bigger problem.

ajp

@morbiuswilters said:

@Spectre said:
Heh, I'm special then. I hear people complaining about RNGs all the time, and I only reply with "WTF are you talking about? So what if it produced 2 five times in a row, it's perfectly random".
That's incorrect. If the random data was actually random, then sure. However, "random" number generators are deterministic and many that were believed to be random were later found to be predictable and weak. So producing 2 five times in a row might actually be a sign of a much bigger problem.

anyone who takes only two points of data in order to make any conclusion should be led out into the street and shot. whether that conclusion happens to be correct or not.

anyway. has a distinction been made between RNGs and PRNGs here? or are we assuming then to be equivalent.

bjolling

@Daid said:

@bjolling said:
Hey, my iPod prefers to play Bruce Springsteen everytime I'm in the train!
Put more then 1 song/album/artist on it.

It totally depends on what you need it for, a few bits but they need to be very random, use /dev/random
Lots of bits but not too predictable, /dev/urandom.
I don't care, as long as it's fast: rand()

I can't be the only one to remember this from a few years ago: iPod Shuffle and Randomness

“Our brains aren’t wired to understand randomness - there’s even a huge industry that takes advantage of people’s inability to deal with random distributions. It’s called gambling.”

morbiuswilters

@ajp said:

anyone who takes only two points of data in order to make any conclusion should be led out into the street and shot. whether that conclusion happens to be correct or not.

Nobody is talking about drawing a conclusion from 2 points of data. What we are talking about is whether having a "sense" something is wrong due to seemingly non-random distributions of numbers is rational. Spectre said it is not because the data is perfectly random. I say it can be because many PRNG algos have ended up having flaws that have given them non-random distributions.

@ajp said:

anyway. has a distinction been made between RNGs and PRNGs here? or are we assuming then to be equivalent.

We're only talking about PRNGs here. However, even an RNG that uses some physical source of information is not truly random and can possibly be interfered with or predicted.

@ajp said:

anyone who takes only two points of data in order to make any conclusion should be led out into the street and shot.

O RLY?

Daid

@Ilya Ehrenburg said:

Has anybody recent reliable info what is correct?
And
regarding the rand()%M vs (int)((M*rand())/(1.0+RAND_MAX)) issue: if M
is odd, the higher-order bits contribute as much to rand()%M as the
lower order bits.

The reason to use (int)((M*rand())/(1.0+RAND_MAX)) instead of (rand()%M) is simple.

Immagine the silly case where M is 10. RAND_MAX is 65535. Now most if not all PRND functions have a uniform distribution, so the chance for every number is just as large. But with M = 10 on rand()%M the distribution is as follows:
0:6554
1:6554
2:6554
3:6554
4:6554
5:6554
6:6553
7:6553
8:6553
9:6553
While with (int)((M*rand())/(1.0+RAND_MAX)) it is:
0:6554
1:6554
2:6553
3:6554
4:6553
5:6554
6:6554
7:6553
8:6554
9:6553
So rand()%M favors lower numbers more then the more 'complex' (int)((M*rand())/(1.0+RAND_MAX))

HypocriteWorld

well, you still have the same flaw right? now 0/1/3/5/6/8 is favored instead of 0/1/2/3/4/5...

Ilya_Ehrenburg

@HypocriteWorld said:

well, you still have the same flaw right? now 0/1/3/5/6/8 is favored instead of 0/1/2/3/4/5...

The difference is that using rand()%M, you have a systematic overrepresentation of the numbers 0 .. RAND_MAX%M, while using (int)(M*rand()/(1.0+RAND_MAX)) distributes the necessary bias over the whole range. So using the latter is indeed preferable*. My point, however, was that the often given reason that "the lower order bits are less random than the higher order bits", doesn't hold if M is odd, even if the claim is true.

[*]However, in my libc, RAND_MAX == INT_MAX == 0x7fffffff, so the skew of rand()%M is negligible for small enough M if you're just doing a simple simulation.

dstozek

On actual benchmarking of random generators:

try playing with creating empirical distributions for different values obtained from the RNG (be it the generated value, diff from the last value, number of bit transitions from the last value, and so on) for large data sets. Then plot histogram. Maybe you can get some interesting results.

Blat

IDK about "proven scientific" anything very much, however the best practical solution to this one I have come accross is using your generator to plot "random"pixels onto bitmaps of varying dimensions. The human eye will qucikly pick up visible patterns from many "random" number generators.

PJH

@Blat said:

The human eye will qucikly pick up visible patterns from many "random" number generators.

And the human brain is fairly adept at 'recognising visible patterns' in truly random distributions.

Blat

True, but when your ink blot looks like a zebra crossing it's time to ask questions

viraptor

@Blat said:

... using your generator to plot "random"pixels onto bitmaps of varying dimensions. The human eye will qucikly pick up visible patterns from many "random" number generators.

Just thought that a presentation may be useful: http://lcamtuf.coredump.cx/newtcp/

dhromed

@viraptor said:

Just thought that a presentation may be useful: http://lcamtuf.coredump.cx/newtcp/

+1 wba

savar

I was looking at the NIST website today and came across this:

http://csrc.nist.gov/publications/nistpubs/800-22-rev1/SP800-22rev1.pdf

Reminded me of this thread

savar

@savar said:

I was looking at the NIST website today and came across this:

http://csrc.nist.gov/publications/nistpubs/800-22-rev1/SP800-22rev1.pdf

Reminded me of this thread

GOD why doesn't the forum format plain-text posts?!!

I was looking at the NIST website today and came across this:

http://csrc.nist.gov/publications/nistpubs/800-22-rev1/SP800-22rev1.pdf

Reminded me of this thread

chrismcb

@PJH said:

@Blat said:
The human eye will qucikly pick up visible patterns from many "random" number generators.
And the human brain is fairly adept at 'recognising visible patterns' in truly random distributions.

Well we aren't talking about Rorschach here. But plotting your random numbers is a good way to see if your random routine is random enough. What you are looking for is a fairly even distribution of the dots over your sample space. You aren't really looking for patterns, but you shouldn't be seeing any straight lines.

Yes there are a LOT of times when you care that you are getting random enough results, but aren't using a cryptosecure random number.

chrismcb

@seriousJoker said:

Today at work I felt like the pseudo random number generator I am using (C++; rand()) is far from producing good random numbers. I remember having read that it is better to use
(rand()/(RAND_MAX+1) )*N
than it is to use
rand()%N
in order to get a number in the range [0, N-1].

To answer the original question, you should never use rand()%N, no matter how random the lower bits are, unless RAND_MAX is evenly divisible by N.

Lets assume that RAND_MAX = 8 so rand() generates 0 - 7.

If N is 5 then you get 0,1,2,3,4,0,1,2 as possible values. Yuo should be able to spot the problem.

gremlin

@chrismcb said:

@seriousJoker said:
I remember having read that it is better to use
(rand()/(RAND_MAX+1) )*N
than it is to use
rand()%N
in order to get a number in the range [0, N-1].

To answer the original question, you should never use rand()%N, no matter how random the lower bits are, unless RAND_MAX is evenly divisible by N.
Lets assume that RAND_MAX = 8 so rand() generates 0 - 7.
If N is 5 then you get 0,1,2,3,4,0,1,2 as possible values. Yuo should be able to spot the problem.

But if you use (rand()/(RAND_MAX+1) )*N, you get 0,0,1,1,2,3,3,4 as possible values, which, although slightly better, still has problems. If RAND_MAX is not divisible by N, you have to do smething more complex (e.g. throw away some values, and regenerate another number).

CDarklock

@gremlin said:

But if you use (rand()/(RAND_MAX+1) )*N, you get 0,0,1,1,2,3,3,4 as possible values, which, although slightly better, still has problems. If RAND_MAX is not divisible by N, you have to do smething more complex (e.g. throw away some values, and regenerate another number).

Actually, the problem is in which bits you are using. With a % operation, you are always using only the lowest-order bits. With (rand()/(RAND_MAX+1) )*N, you're using all of them. So the overall entropy of the PRNG is in your calculation, no matter to which bits that entropy may have been localised. Many PRNGs have localised entropy at certain points in the cycle, and when you're only using the lower order bits, you may get no entropy at all for a while.

Imagine a PRNG which textually reorders digits. If it doesn't happen to reorder the final digit in a given call, any N <= 10 will generate the same number as the previous call.

gremlin

@CDarklock said:

Actually, the problem is in which bits you are using. With a % operation, you are always using only the lowest-order bits.

Not always, only if N is a divisor of (RAND_MAX+1). Otherwise you are using all the bits. But then you are biased towards certain numbers (the lower ones).

@CDarklock said:

With (rand()/(RAND_MAX+1) )*N, you're using all of them.

Again, not always. If N is a divisor of (RAND_MAX+1), then you are only using the highest-order bits. And if not, you are still biased towards certain numbers, but they are more evenly spread throughout the range.

@CDarklock said:

So the overall entropy of the PRNG is in your calculation, no matter to which bits that entropy may have been localised. Many PRNGs have localised entropy at certain points in the cycle, and when you're only using the lower order bits, you may get no entropy at all for a while.

No, neither method guarantees you use all of the bits. But many PRNGs have less entropy in the lower-order bits, so restricting yourself the the higher-order bits is better than the lower-order bits.

@CDarklock said:

Imagine a PRNG which textually reorders digits. If it doesn't happen to reorder the final digit in a given call, any N <= 10 will generate the same number as the previous call.

Only if N==10, not for all N<=10. If N=3 it would use all the digits. And if it doesn't happen to reorder the first digit in a given call, then the other method would also generate the same number twice for N==10, but that's not necessarily a problem, a good PRNG should generate the same number twice sometimes, otherwise it's not very random.