www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - [Issue 11598] New: std.random.uniform could be faster for integrals

reply d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11598

           Summary: std.random.uniform could be faster for integrals
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: zshazz gmail.com



(may be related to https://d.puremagic.com/issues/show_bug.cgi?id=10287 but
probably not)

std.random.uniform performs two division operations, but it's possible to make
a uniform random number using just one division operation. This change could
dramatically speed up every usage of uniform.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Nov 24 2013
next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11598




Pull request sent:
https://github.com/D-Programming-Language/phobos/pull/1717

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Nov 24 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11598


bearophile_hugs eml.cc changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bearophile_hugs eml.cc



Do you have simple statistical unittests that assert the distribution is
uniform?

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Nov 24 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11598




Perhaps. There's an easy way to test this method to check whether this uniform
introduces any kind of skew. Just do uniform(ulong.min, (ulong.max/2 + 2)) and
count how many times the first and last number shows up over a long period of
time. Since there are two (giant) buckets, if something is wrong then the last
number will show up roughly half as often as the first.

Even better is to edit the source and make it so it acts on ushorts instead and
you can easily build an array to count those occurrences. Visually it looks
fine to me (I did this when developing this). While your rigging the source,
you can also change it so that it doesn't reroll and you'll see the effect I
described above clearly (that the first number shows up 2x more than the last).

Another alternative is I wrote up an informal proof on it a few months ago on
it (I've intended on submitting this sooner but just haven't been able to get
around to it). Here it is in pdf format:

https://dl.dropboxusercontent.com/u/2206555/uniformUpgrade.pdf

So that all said, I'm almost certain of the quality of the approach. If you
find any problems with it, though, I would love to fix them.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Nov 25 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11598


Joseph Rushton Wakeling <joseph.wakeling webdrake.net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |joseph.wakeling webdrake.ne
                   |                            |t



2013-12-01 04:20:47 PST ---
It should be simple enough to do a few different cases of rolling uniform(0, n)
a sufficient number of times (say, n * 10 million or n * 100 million) to
confirm that each rollable value has no systematic bias.

Do it for n = 2, 3, ..., 10, 20, 50, 100, and post the results here and in the
pull request.  This doesn't need to be part of formal unittests (too heavy) but
should be a documented part of your testing procedure.

There have been bugs in the past in std.random (I won't say which) which would
never have got there if the developer responsible had done a simple test like
this before submitting a pull request.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 01 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11598




It's a fair request to have documented rigor for such an important function in
std.random. I'll document my entire process here.

For tests doing what you describe, refer to this gist:

https://gist.github.com/Zshazz/1bffebf9cb7cb97cfabc

One thing to note, knowing the implementation for this change (a bit of
information on the idea behind the implementation is at the end of this post),
is that this wouldn't be sufficient for detecting all of the potential
problems. For instance, if you used a bad uniform implementation with _just_
`rnum % upperDist` (using the notation of this implementation), then the
distribution error over `uniform(0, 100)` would be indetectable (even using a
32-bit rnum).

In that case, there would be `((2^32) / 100) + 1` buckets (around 43 million)
and the last bucket would only cover up to `((2^32) % 100) - 1`, which is 95.
So, 96-99 each would be off by about 1 in 43 million rolls. Not really
detectable visually. That would essentially appear to be background noise at
that point and explainable just by the nature of random rolls.

So, let's define `test1`, a test that is divised to guard against that
particular type of error. That's easy enough: just pick your numbers in such a
way to minimize the number of buckets while still having some elements missing
from the last bucket. So, picking `((2^32) / 4) * 3` would result in two
buckets, with about 2/3 of the numbers in that range not being represented in
the second bucket. Consequently, this makes it _very_ easy to count and _very_
easy to tell that something is off. Roll 100_000 times or so and count how many
of those rolls fall into the first third of the range, and then count how many
of those rolls fall into the last third of the range. If the counts differ (and
they will in the case of `badUniform`), then you know there's an issue.

Now, let's define `test2` to answer the following question: what if the reroll
condition was chosen in such a way that it was off-by-one? Well, we'd have a
very, very tough time figuring that out since, even with 32-bit numbers, a
count array holding _all_ numbers in that space would be too massive.
Thoroughly analyzing that data would be even more of a challenge. So, to be
fully rigorous about this, `test2` has to _slightly_ modify the types internal
to the new uniform implementation to ensure that the rnum is generated using
ubytes to make it so that it's easy to analyze (it uses
`typeof(unsigned(b-lower))` by default which is minimally a uint, which
prevents excessive and expensive rerolls when smaller types are used). This
does _not_ change overall algorithm, so it still applies to the bigger version
of the new uniform as well. Now `test2` just tries out several `i` in
`uniform(0, i)` s.t. `i` is a number around `128` (which is the exact middle of
the range of 8-bit integers) to make sure everything behaves properly around
that critical point.

Here is the code and results from running the code for these two tests using
the new uniform and a bad uniform:

https://gist.github.com/Zshazz/450f57a5374bf0b80fb5

---

So all that said, these tests aren't necessarily going to cover every possible
question about statistical issues that this new uniform could bring up, either.
Hence a general argument for correctness is probably a better approach. For a
complete description, I refer to my document on my dropbox:

https://dl.dropboxusercontent.com/u/2206555/uniformUpgrade.pdf

Because it's possible (and likely eventually) that this document will no longer
appear on my dropbox, I'll reproduce the most important information here (not
exactly what is in the document, but close enough):

The modulus operator maps an integer to a small, finite space. For instance, `x
% 3` will map whatever x is into the range [0 .. 3). 0 maps to 0, 1 maps to 1,
2 maps to 2, 3 maps to 0, and so on infinitely. As long as the integer is
uniformly chosen from the infinite space of all non-negative integers then `x %
3` will uniformly fall into that range.

(Non-negative is important in this case because some definitions of modulus,
namely the one used in computers generally, map negative numbers differently to
(-3 .. 0]. `uniform` does not use negative number modulus, thus we can safely
ignore that fact.)

The issue with computers is that integers have a finite space they must fit in,
and our uniformly chosen random number is picked in that finite space. So, that
method is not sufficient. You can look at it as the integer space being divided
into "buckets" and every bucket after the first bucket maps directly into that
first bucket. `[0, 1, 2]`, `[3, 4, 5]`, ... When integers are finite, then the
last bucket has the chance to be "incomplete": `[uint.max - 3, uint.max - 2,
uint.max - 1]`, `[uint.max]` ... (Uh oh, the last bucket only has 1!). The
issue here is that _every_ bucket maps _completely_ to the first bucket except
for that last one. The last one doesn't have corresponding mappings to 1 or 2,
in this case, which makes it unfair.

So, the answer is to simply "reroll" if you're in that last bucket, since it's
the only unfair one. Eventually you'll roll into a fair bucket. Simply, instead
of the meaning of the last bucket being "maps to `[0]`", it changes to "maps to
`[0, 1, 2]`", which is precisely what we want.

To generalize, `upperDist` represents the size of our buckets (and, thus, the
exclusive upper bound for our desired uniform number). `rnum` is a uniformly
random number picked from the space of integers that a computer can hold (we'll
say `UpperType` represents that type).

We'll first try to do the mapping into the first bucket by doing `offset = rnum
% upperDist`. We can figure out the position of the front of the bucket we're
in by `bucketFront = rnum - offset`.

If we start at `UpperType.max` and walk backwards `upperDist - 1` spaces, then
the space we land on is the last acceptable position where a full bucket can
fit:

```
   bucketFront     UpperType.max
      v                 v
[..., 0, 1, 2, ..., upperDist - 1]
      ^~~ upperDist - 1 ~~^
```

If the bucket starts any later, then it must have lost at least one number and
at least that number won't be represented fairly.

```
                bucketFront     UpperType.max
                     v                v
[..., upperDist - 1, 0, 1, 2, ..., upperDist - 2]
          ^~~~~~~~ upperDist - 1 ~~~~~~~^
```

Hence, our condition to reroll is
`bucketFront > (UpperType.max - (upperDist - 1))`

---

OK, I think that about covers everything I've done on this. I hope that wasn't
too much, but I did want to document the work put into developing this.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 01 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11598




Commits pushed to master at https://github.com/D-Programming-Language/phobos

https://github.com/D-Programming-Language/phobos/commit/30ced820e56c9088d08f6254c068df40a995706f
Fix issue 11598 - uniform could be faster for integrals

These changes retain the uniform quality of this function while, in the
common case, cutting the number of division operations in half.

Additionally, extra unittests were added to assure that all of the
bounds were tested properly.

https://github.com/D-Programming-Language/phobos/commit/fc48d56284f19bf171780554b63b4ae83808b894


Fix issue 11598 - uniform could be faster for integrals

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 16 2013
prev sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11598




Commits pushed to 2.065 at https://github.com/D-Programming-Language/phobos

https://github.com/D-Programming-Language/phobos/commit/30ced820e56c9088d08f6254c068df40a995706f
Fix issue 11598 - uniform could be faster for integrals

https://github.com/D-Programming-Language/phobos/commit/fc48d56284f19bf171780554b63b4ae83808b894


-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 19 2013