www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - [Issue 10322] New: std.random.RandomSample.index() returns wrong value if called before front()

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10322

           Summary: std.random.RandomSample.index() returns wrong value if
                    called before front()
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: joseph.wakeling webdrake.net


--- Comment #0 from Joseph Rushton Wakeling <joseph.wakeling webdrake.net>
2013-06-10 05:26:46 PDT ---
RandomSample contains an index() property which returns the index value of the
currently selected item.  So, if the currently-selected item is the 4th item
from the input range, index() should return 4.

However, if index() is called before the first call to front(), index returns
0.  The attached example code shows this bug in action: the index value
_should_ in this example be identical to the value returned by front().

This bug is a regression caused by the fix to Issue #7936.  Prior to that fix,
the first value of front() was determined in the constructor, so index() could
be guaranteed to be initialized: however, front(), and hence also index(),
cannot be initialized in the constructor because this would bias the
statistical independence of the sample.

The correct solution is therefore to ensure that front() and index() are both
initialized with the first call to either of them.  Probably the simplest
solution is to ensure that index() calls front() internally, either always or
conditional on the private boolean variable _first.

So, either

    index()
    {
        this.front;
        return _index;
    }

or,

    index()
    {
        if(!_first)
            this.front;
        return _index;
    }

Either will work, the question is which is preferable. :-)

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jun 10 2013
next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10322



--- Comment #1 from Joseph Rushton Wakeling <joseph.wakeling webdrake.net>
2013-06-10 05:30:47 PDT ---
Created an attachment (id=1222)
Unittest variant of test code

This alternative version provides a unittest that can be used to check for this
bug.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jun 10 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10322



--- Comment #2 from Joseph Rushton Wakeling <joseph.wakeling webdrake.net>
2013-06-10 09:01:04 PDT ---
The problem also extends to popFront(), which also requires front() to be
initialized before it can be called.  If not, the output may correspond to
sampling (n-1) items from (N-1), whereas it should involve sampling (n-1) items
from whatever remains of the input range after the first sample point is taken.

I'm not sure how to reliably unittest this, but example code will be attached
shortly.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jun 10 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10322



--- Comment #3 from Joseph Rushton Wakeling <joseph.wakeling webdrake.net>
2013-06-10 09:06:33 PDT ---
Created an attachment (id=1223)
Example of error if popFront() is called before first call to front().

This code averages over multiple samples with the following cases:

  (1) a regular sample of 5 items from 100;

  (2) a sample initialized to request 5 items from 100 (labelled 0-99); .front
is called, followed by .popFront(), and then the remains of the sample is
taken.

  (3) a sample initialized to request 5 items from 100; .popFront() alone is
called, and then the remains of the sample is taken.

In the last case it is clear that the remains of the sampling process is in
practice taking 4 items with equal probability from the items 1-99, when what
_should_ be happening is that it is taking 4 items from the items remaining to
consider after the first sample point has been taken.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jun 10 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10322



--- Comment #4 from Joseph Rushton Wakeling <joseph.wakeling webdrake.net>
2013-06-10 12:58:16 PDT ---
Note: there's an interesting discrepancy in results if the samplepop.d test
code is tweaked to sample 5 from 10 instead of 5 from 100.  The key difference
here is that because the ratio of sample points to available points is smaller,
Algorithm A will be used instead of Algorithm D.

If we do this, we observe that the sample with popFront() called first does not
reflect an even selection of items from 1 to 99; rather, it looks like the case
where .front is called prior to .popFront().

The explanation for this is as follows.  First, let's cover the 5-from-10 case.
 If .popFront() is called before .front, then the value of _Vprime (used by
Algorithm D) has not been initialized, and is equal to nan.

When _Vprime is used in RandomSample.skip() (std.random(1835)), the cast means
that the value of s is set to 0, and _Vprime is reset (line 1853) to a value of
-nan.

This in turn registers false to the if(_Vprime > 1.0) test, which takes us down
to line 1899 to return s, which is 0.

So, it follows that the call to prime() in the initial popFront() does not
change anything -- the only change to internal values occurs in the body of
popFront() itself with its calls to _input.popFront() and its increments of
_toSelect, _available and _index.  So, at the end of popFront() we are sitting
on the second element of the original _input range, _toSelect and _available
are 4 and 99 respectively, and the internal variable _first is still true.

Now when we read front() it will check _first and, finding it to be true (line
1682) will call prime(), which in turn calls skip() and hence selects as if it
were selecting the first point of the sample (with _toSelect == 4 and
_available == 99).  Hence the even distribution of selection from points 1-99.

If instead we're selecting 5 from 10, the initial call to popFront() and thence
prime() will instead call skipA() (via skip(), lines 1820-1824), which (not
being dependent on any uninitialized variables) will generate a true, typically
non-zero skip value s.

Then, when we call front(), with _first still true, the second call to prime()
will generate _another_ real skip.

It follows that in this second case the algorithm is genuinely taking two
sample points, and hence the non-uniform distribution of selection across
points 1-99.

Sad to say this would all have been caught much easier if skip() had used
std.conv.to instead of cast() .... :-(

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jun 10 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10322



--- Comment #5 from Joseph Rushton Wakeling <joseph.wakeling webdrake.net>
2013-08-29 10:24:37 PDT ---
Fix submitted: https://github.com/D-Programming-Language/phobos/pull/1533

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Aug 29 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10322



--- Comment #6 from github-bugzilla puremagic.com 2013-09-26 12:36:08 PDT ---
Commits pushed to master at https://github.com/D-Programming-Language/phobos

https://github.com/D-Programming-Language/phobos/commit/ce8155c7adf906582926b068cac33856f9cb26fa
Fix Issue 10322 - ensure RandomSample is initialized before use

This patch fixes a problem where the public methods .index()
or .popFront() might be called without the first value of the
sample having been determined, which would then cause spurious
results.  The runtime initialization check currently performed
in .front has been extended to those methods.

The private boolean checks in the previous implementation have
been replaced with an enum indicating the algorithm to be used
(A, D or None) with None indicating that the sample has not
been initialized.

Step D1 of Algorithm D has been moved to the skip() function,
which results in a significant performance boost.

Unittests have been introduced to cover the cases where .index
or .popFront() are called before .front.

Finally, the .index method has been made a  property, which I
take to be an oversight of the original code.

https://github.com/D-Programming-Language/phobos/commit/cedee1617a82c60da38511c297d05e7a146ce341
Merge pull request #1533 from WebDrake/randomsample-init

Fix Issue 10322 - ensure RandomSample is initialized before use

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Sep 26 2013
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=10322


Joseph Rushton Wakeling <joseph.wakeling webdrake.net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Sep 27 2013