
digitalmars.D - [OT] Algorithm question

reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
I've been thinking about the following problem, and thought I'd pick the
brains of the bright people around these parts...

Given a set A of n elements (let's say it's a random-access range of
size n, where n is relatively large), and a predicate P(x) that
specifies some subset of A of elements that we're interested in, what's
the best algorithm (in terms of big-O time complexity) for selecting a
random element x satisfying P(x), such that elements that satisfy P(x)
have equal probability of being chosen? (Elements that do not satisfy
P(x) are disregarded.)

Which elements of A satisfy P(x) is not known ahead of time, nor is the
relative proportion of elements that satisfy P(x) or not.

Note that A is considered to be immutable (swapping elements or
otherwise changing A is not allowed).

So far, I have come up with the following algorithms:

1) "Random shooting":

	auto r = ... /* elements of A */
	for (;;) {
		auto i = uniform(0, r.length);
		if (P(r[i])) return r[i];
	}

   Advantages: If a large percentage of elements in A satisfy P(x), then
   the loop will terminate within a small number of iterations; if the
   majority of elements satisfy P(x), this will terminate well below n
   iterations.

   Disadvantages: If only a small percentage of elements satisfy P(x),
   then this loop could take arbitrarily long to terminate, and it will
   not terminate at all if no elements satisfy P(x).

2) "Turtle walk":

	auto r = ... /* elements of A */
	int nSatisfy = 0;
	ElementType!A currentChoice;
	bool found = false;
	foreach (e; r) {
		if (P(e)) {
			nSatisfy++;
			// the k-th matching element replaces the current
			// choice with probability 1/k
			if (uniform(0, nSatisfy) == 0)
			{
				currentChoice = e;
				found = true;
			}
		}
	}
	if (found) return currentChoice;
	else ... /* no elements satisfy P(x) */

   Advantages: If there is any element at all in A that satisfies P(x),
   it will be found. The loop always terminates after n iterations.

   Disadvantages: Even if 99% of elements in A satisfy P(x), we still
   have to traverse the entire data set before we terminate. (We cannot
   terminate before that, otherwise the probability of elements
   satisfying P(x) being selected will not be even.)

3) Permutation walk:

	auto r = ... /* elements of A */
	foreach (i; iota(0, r.length).randomPermutation) {
		if (P(r[i])) return r[i];
	}
	/* no elements satisfy P(x) */

   Advantages: if an element that satisfies P(x) is found early, the
   loop will terminate before n iterations. This seems like the best of
   both worlds of (1) and (2), except:

   Disadvantages: AFAIK, generating a random permutation of indices from
   0 .. n requires at least O(n) time, so any advantage we may have had
   seems to be negated.

Is there an algorithm for *incrementally* generating a random
permutation of indices? If there is, we could use that in (3) and thus
achieve the best of both worlds between early termination if an element
satisfying P(x) is found, and guaranteeing termination after n
iterations if no elements satisfying P(x) exists.


T

-- 
The early bird gets the worm. Moral: ewww...
Apr 30 2017
Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
 2) "Turtle walk":
I'd just like to point out that this algorithm doesn't satisfy

 such that elements that satisfy P(x) have equal probability of
 being chosen

as the first element found in A will be chosen, and then each
subsequent element has a decreasing probability, P = 1/nSatisfy, of
replacing the current choice.
Apr 30 2017
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, May 01, 2017 at 05:03:17AM +0000, Jack Stouffer via Digitalmars-d wrote:
 On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
 2) "Turtle walk":
 I'd just like to point out that this algorithm doesn't satisfy

  such that elements that satisfy P(x) have equal probability of being
  chosen

 as the first element found in A will be chosen, and then each
 subsequent element has a decreasing probability, P = 1/nSatisfy, of
 replacing the current choice.
Actually, this is exactly why it *does* satisfy the requirement.

If there is exactly 1 element that satisfies P(x), then the probability
of it being picked is 1. If there are 2 elements, the first element is
picked with probability 1 initially, but the 2nd element has a 1/2
chance of replacing it, meaning the overall probability distribution is
1/2 and 1/2.

If there are 3 elements, then by the time the 2nd element is processed,
the first 2 elements are equally likely to be currently chosen (with
probability 1/2 for each); when the 3rd element is found, it has 1/3
chance of replacing the previous choice, meaning there is a 2/3 chance
the previous choice prevails. In that case, the 2/3 chance is equally
distributed between the first two elements (they were picked with 1/2
and 1/2 probability), so the effective probability is 1/3 each after
the 3rd element is processed.

In general, by the time you process the n'th element, the previous
elements would have been picked with 1/(n-1) chance each, and the n'th
element would be picked with 1/n chance, meaning there is a (n-1)/n
chance the previous choice prevails. That (n-1)/n chance is equally
distributed among the previous (n-1) elements, so effectively they are
picked with 1/n chance each.

The key here is that the probability that the n'th element is *not*
picked is exactly (n-1)/n, which, by inductive hypothesis, must be
evenly distributed among the previous (n-1) candidates, thereby making
their *eventual* probability of remaining as the chosen element exactly
1/n.


T

-- 
Two wrongs don't make a right; but three rights do make a left...
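For anyone who wants to check the argument empirically, here is a small
self-contained simulation of the "turtle walk" selection (a sketch: the
0/1 test array and the trial count are arbitrary choices, not from the
thread):

	import std.random : uniform;
	import std.stdio : writeln;

	void main()
	{
	    // 1 = satisfies P, 0 = does not (arbitrary test data)
	    immutable int[] a = [1, 0, 1, 1, 0];
	    auto hits = new int[a.length];
	    enum trials = 1_000_000;

	    foreach (t; 0 .. trials)
	    {
	        int nSatisfy = 0;
	        size_t choice = size_t.max;
	        foreach (i, e; a)
	        {
	            if (!e) continue;
	            ++nSatisfy;
	            // keep the k-th match with probability 1/k
	            if (uniform(0, nSatisfy) == 0)
	                choice = i;
	        }
	        ++hits[choice];
	    }

	    // indices 0, 2 and 3 each end up with roughly trials/3 hits
	    writeln(hits);
	}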
Apr 30 2017
Ola Fosheim Grøstad writes:
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
 Which elements of A satisfy P(x) is not known ahead of time, 
 nor is the
 relative proportion of elements that satisfy P(x) or not.
O(N) given P(x)===false for all x... What you want is probably average case analysis, not worst case?
May 01 2017
John Colvin <john.loughran.colvin gmail.com> writes:
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
 I've been thinking about the following problem, and thought I'd 
 pick the brains of the bright people around these parts...

 Given a set A of n elements (let's say it's a random-access 
 range of
 size n, where n is relatively large), and a predicate P(x) that
 specifies some subset of A of elements that we're interested 
 in, what's
 the best algorithm (in terms of big-O time complexity) for 
 selecting a
 random element x satisfying P(x), such that elements that 
 satisfy P(x)
 have equal probability of being chosen? (Elements that do not 
 satisfy
 P(x) are disregarded.)

 Which elements of A satisfy P(x) is not known ahead of time, 
 nor is the
 relative proportion of elements that satisfy P(x) or not.

 Note that A is considered to be immutable (swapping elements or
 otherwise changing A is not allowed).

 So far, I have come up with the following algorithms:

 1) "Random shooting":

 	auto r = ... /* elements of A */
 	for (;;) {
 		auto i = uniform(0, r.length);
 		if (P(r[i])) return r[i];
 	}

    Advantages: If a large percentage of elements in A satisfy 
 P(x), then
    the loop will terminate within a small number of iterations; 
 if the
    majority of elements satisfy P(x), this will terminate well 
 below n
    iterations.

    Disadvantages: If only a small percentage of elements 
 satisfy P(x),
    then this loop could take arbitrarily long to terminate, and 
 it will
    not terminate at all if no elements satisfy P(x).
You can recover O(n) calls to P by caching the results.

You can recover O(n) for both P and rng by progressively removing
elements from r or any other method that results in an easily samplable
set of all elements of r not yet visited (and therefore possible
candidates).

I like this one (warning, just thought of, untested):

auto r = ... /* elements of A */
auto nRemaining = r.length;
while (nRemaining) {
	auto i = uniform(0, nRemaining);
	if (P(r[i])) return r[i];
	else swap(r[i], r[--nRemaining]);
}
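For reference, the same sketch filled out into a self-contained
function (illustrative only: the name pickRandomSatisfying, the alias
predicate parameter, and the Nullable return for the no-match case are
choices made here, not in the post; note that a call reorders r, though
the contents stay the same):

import std.algorithm.mutation : swap;
import std.random : uniform;
import std.typecons : Nullable;

// Uniformly pick an element of r satisfying pred; empty result if none.
// Visited non-matching elements are swapped behind nRemaining.
Nullable!T pickRandomSatisfying(alias pred, T)(T[] r)
{
    auto nRemaining = r.length;
    while (nRemaining)
    {
        auto i = uniform(0, nRemaining);
        if (pred(r[i]))
            return Nullable!T(r[i]);
        swap(r[i], r[--nRemaining]);
    }
    return Nullable!T.init;
}

// e.g.:  auto hit = pickRandomSatisfying!(x => x % 2 == 0)(numbers);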
 3) Permutation walk:

 	auto r = ... /* elements of A */
 	foreach (i; iota(0, r.length).randomPermutation) {
 		if (P(r[i])) return r[i];
 	}
 	/* no elements satisfy P(x) */

    Advantages: if an element that satisfies P(x) is found 
 early, the
    loop will terminate before n iterations. This seems like the 
 best of
    both worlds of (1) and (2), except:

    Disadvantages: AFAIK, generating a random permutation of 
 indices from
    0 .. n requires at least O(n) time, so any advantage we may 
 have had
    seems to be negated.

 Is there an algorithm for *incrementally* generating a random 
 permutation of indices? If there is, we could use that in (3) 
 and thus achieve the best of both worlds between early 
 termination if an element satisfying P(x) is found, and 
 guaranteeing termination after n iterations if no elements 
 satisfying P(x) exists.


 T
Yes. As a matter of fact it would be the logical extension of my
algorithm above for when you can't or don't want to change r (again,
only just thought of this, untested...):

auto r = ... /* elements of A */
auto indices = iota(r.length).array;
auto nRemaining = r.length;
while (nRemaining) {
	auto i = uniform(0, nRemaining);
	if (P(r[indices[i]])) return r[indices[i]];
	else swap(indices[i], indices[--nRemaining]);
}

You could also do the same thing but with an array of pointers to
elements of r instead of indices.
May 01 2017
Ola Fosheim Grøstad writes:
On Monday, 1 May 2017 at 08:42:05 UTC, John Colvin wrote:
 I like this one (warning, just thought of, untested):

 auto r = ... /* elements of A */
 auto nRemaining = r.length;
 while (nRemaining) {
 	auto i = uniform(0, nRemaining);
 	if (P(r[i])) return r[i];
 	else swap(r[i], r[--nRemaining]);
 }
Yes, this is the standard text-book way of doing it, still O(N) of
course. The other standard text-book way is to generate an array of
indexes and mutate that instead, still O(N). If you cache in a heap you
get O(N log N).

Anyway, this kind of worst-case analysis doesn't really help. And
neither does average case, which will be O(1) assuming 50% match P. So
you need to specify what kind of average-case analysis you want, for
instance expressed as f(N,S) where N is the number of elements and S is
the number of elements that satisfy P.
May 01 2017
Ola Fosheim Grøstad writes:
On Monday, 1 May 2017 at 09:01:33 UTC, Ola Fosheim Grøstad wrote:
 array of indexes and mutate that instead, still O(N). If you 
 cache in a heap you get O(N log N).
To avoid confusion: I didn't mean a literal "heap", but some kind of
search structure that tracks and draws from unused index ranges in a
way that does not require O(N) initialization and can be implemented as
a single array on the stack. I don't remember what it is called, but
you draw on each node based on the number of children (kind of like the
ones used in markov-chain type situations). I assume this can be
maintained in O(log N) per iteration with O(1) initialization. E.g.

1. push_indices(0,N-1)                 // data (0,N-1)
2. i = draw_and_remove_random_index()  // data could now be (0,i-1), (i+1,N-1)
3. if( i == -1 ) return
4. dostuff(i)
5. goto 2

Even though it is O(N log N) it could outperform other solutions if you
only want to draw a small number of elements. And a small array with a
linear search (tracking used indices) would outperform that if you only
want to draw say 8 elements. If you want to draw close to N elements
then you would be better off just randomizing a full array (with
elements from 0 to N-1). Or you could do all of these: start with
linear search, then convert to a tree-like search, then convert to full
enumeration.

In general I don't think asymptotic analysis will do much good. I think
you need to look at actual costs for typical N, P and different
distributions of satisfied P, e.g.

- cost of loop iteration: 10
- average cost of P(x)=>false: 1
- average cost of P(x)=>true: 1000
- probability of P(a[i]) for each i
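For illustration, here is a flat-array version of that idea (a sketch
under stated assumptions: the names are made up, the draw weights each
interval by its length so the selection stays uniform, and each draw
costs O(number of intervals) rather than the O(log N) the tree
structure described above would give):

import std.random : uniform;

struct Interval { long lo, hi; }   // inclusive range of unused indices

struct IntervalDrawer
{
    Interval[] ivals;
    long remaining;

    this(long n)                   // assumes n >= 1
    {
        ivals = [Interval(0, n - 1)];
        remaining = n;
    }

    // Draw a uniformly random not-yet-drawn index; -1 when exhausted.
    long draw()
    {
        if (remaining == 0) return -1;
        auto t = uniform(0L, remaining);   // weight intervals by length
        foreach (k, iv; ivals)
        {
            immutable len = iv.hi - iv.lo + 1;
            if (t >= len) { t -= len; continue; }
            immutable idx = iv.lo + t;
            // split the chosen interval around idx
            Interval[] repl;
            if (iv.lo < idx) repl ~= Interval(iv.lo, idx - 1);
            if (idx < iv.hi) repl ~= Interval(idx + 1, iv.hi);
            ivals = ivals[0 .. k] ~ repl ~ ivals[k + 1 .. $];
            --remaining;
            return idx;
        }
        assert(0);
    }
}

Replacing the linear scan with a tree keyed on subtree counts, as
described above, would bring each draw down to O(log N).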
May 01 2017
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, May 01, 2017 at 08:42:05AM +0000, John Colvin via Digitalmars-d wrote:
[...]
 You can recover O(n) calls to P by caching the results.
Sorry, I forgot to specify that P(x) can be different each time. Also, the input A may also be different each time, albeit remain the same length (though the expectation is that it will be mostly the same -- not that it matters, though, I don't think any of the proposed solutions here count on that).
 You can recover O(n) for both P and rng by progressively removing
 elements from r or any other method that results in an easily
 samplable set of all elements of r not yet visited (and therefore
 possible candidates).
[...] Yes, I realize that if r was mutable, there are existing simple solutions. The problem is that it's not mutable, or rather, should not be mutated by this selection algorithm. [...]
 Yes. As a matter of fact it would be the logical extension of my
 algorithm above for when you can't or don't want to change r (again,
 only just thought of this, untested...):
 
 auto r = ... /* elements of A */
 auto indices = iota(r.length).array;
 auto nRemaining = r.length;
 while (nRemaining) {
 	auto i = uniform(0, nRemaining);
 	if (P(r[indices[i]])) return r[indices[i]];
 	else swap(indices[i], indices[--nRemaining]);
 }
 
 You could also do the same thing but with an array of pointers to
 elements of r instead of indices.
The trouble with this is that you have to allocate an array of indices,
and r.length is rather large, so creating this array is O(n) each time.

But I think Ivan Kazmenko has an excellent idea to cache this index
array. Since, by symmetry, it doesn't matter if the starting indices
array was the identity permutation, we could just initialize this array
once, and the next time just skip to the loop directly (reusing the
previous (possibly partially) permuted indices as the starting point).
It does require O(n) memory, which is unfortunate, but at least we only
pay for that once.

I was hoping for something more along the lines of a random permutation
algorithm with sublinear (if not O(1)) space complexity that can return
elements of the permutation consecutively on-demand, like a lazy range.
Something along the lines of:

	auto seed = ... /* get random number */
	auto lazyRange = generatePermutation(seed, n);

where generatePermutation is some incremental algorithm that produces a
unique permutation of 0 .. n per seed value. Ideally, this algorithm
would only do as much work as is necessary to produce the first k
elements of the permutation, so that if the main loop terminates early
(we found an element satisfying P(x)) we don't waste time generating
the rest of the permutation.


T

-- 
"No, John. I want formats that are actually useful, rather than
over-featured megaliths that address all questions by piling on
ridiculous internal links in forms which are hideously over-complex."
-- Simon St. Laurent on xml-dev
May 01 2017
Ola Fosheim Grøstad writes:
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
 Is there an algorithm for *incrementally* generating a random 
 permutation of indices?
Does it exist? Yes, because you can build permutation tables for it,
but you'll have to find the closed form for it that is efficient... Can
you do that without selecting a specific N? It is easy for N=2 and N=3
at least.

E.g.

For array length 2 you get the 2 permutations: 0 1, 1 0
Then you just select one of them at random (you know the number of
permutations)

So if it exists you should probably do a search for permutations. "How
do I efficiently generate permutation number N"

Of course for smaller N you could generate a big array of bytes and
compress it by having arrays of slices into it (as many permutations
will share index sequences).
May 01 2017
Ola Fosheim Grøstad writes:
On Monday, 1 May 2017 at 08:44:25 UTC, Ola Fosheim Grøstad wrote:
 Does it exist? Yes, because you can build permutation tables 
 for it, but you'll have to find the closed form for it that is 
 efficient... Can you do that without selecting a specific N? It 
 is easy for N=2 and N=3 at least.

 E.g.

 For array length 2 you get the 2 permutations: 0 1, 1 0
 Then you just select one of them at random (you know the number 
 of permutations)

 So if it exists you should probably do a search for 
 permutations. "How do I efficiently generate permutation number 
 N"

 Of course for smaller N you could generate a big array of bytes 
 and compress it by having arrays of slices into it (as many 
 permutations will share index sequences).
Keep also in mind that you can split the original array in two, so it
might be sufficient to have the permutations for various 2^M sizes if
those are easier to generate (my guess, maybe through some xor magic?).

E.g. if N = 21 then it can be split into 16 + (4 + 1).

So you can draw like this:

1. randomly select group_16 over group_5 (with 16/21 probability)

2. if group_5 selected: randomly select group_4 over group_1 (with 4/5
probability)

etc.

That way you only need log(N) permutation sequences to keep track of.
Which for all practical purposes are going to stay pretty low (<20?)

Right?
May 01 2017
Ola Fosheim Grøstad writes:
On Monday, 1 May 2017 at 10:51:46 UTC, Ola Fosheim Grøstad wrote:
 Keep also in mind that you can split the original array in two, 
 so it might be sufficient to  have the permutations for various 
 2^M sizes if those are easier to generate (my guess, maybe 
 through some xor magic?)

 E.g. if N = 21 then it can be split into 16 + (4 + 1)

 So you can draw like this:

 1. randomly select group_16 over group_5 (with 16/21
 probability)

 2. if group_5 selected: randomly select group_4 over group_1
 (with 4/5 probability)

 etc.

 That way you only need log(N) permutation sequences to keep 
 track of. Which for all practical purposes are going to stay 
 pretty low (<20?)

 Right?
But if you don't use a hardware random generator and are happy with
more sloppy pseudo-randomness then you would be better off ditching the
whole concept of permutations and replacing it with some cyclic random
generators instead, using the same technique I just outlined.

E.g. find a set of cyclic random generators and break down N into a sum
of these cycle-sizes, then just keep track of the seed for each. If
they are 2^N then you could use xor to get more randomness between
runs.

Also in the algorithms above you need to change the probabilities each
time you take away one index from a group (you don't want to draw from
an empty group).
May 01 2017
Ola Fosheim Grøstad writes:
On Monday, 1 May 2017 at 11:08:56 UTC, Ola Fosheim Grøstad wrote:
 E.g. find a set of cyclic random generators and break down N 
 into a sum of these cycle-sizes, then just keep track of the 
 seed for each. If they are 2^N then you could use xor to get 
 more randomness between runs.

 Also in the algorithms above you need to change the 
 probabilities each time you take away one index from a group 
 (you don't want to draw from an empty group).
Well, actually, just select the single cyclic generator that is larger
than or equal to N, then just redraw if it returns a value >= N. Duh!
Sloppy thinking.

I would think you should be able to find some prime-sized ones that
will get the next index with an insignificant number of redraws.

But permutations are the way to go if you want scientific quality.
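For concreteness, here is a sketch of this single-generator redraw idea
(all of the following are assumptions made for illustration, not from
the post: a linear congruential generator with power-of-two modulus
m >= n has full period by the Hull-Dobell theorem when the increment is
odd and the multiplier is congruent to 1 mod 4, so skipping states >= n
yields a permutation of 0 .. n-1; since m < 2n the expected number of
redraws per step stays below one; and only a tiny family of
permutations is reachable -- nowhere near all n! -- which is exactly
the "sloppy pseudo-randomness" trade-off):

// An input range lazily yielding a permutation of 0 .. n-1.
struct CycleWalk
{
    ulong n, m, a, c, state, emitted;

    this(ulong n, ulong seed)
    {
        this.n = n;
        m = 1;
        while (m < n) m <<= 1;        // smallest power of two >= n
        a = 6364136223846793005UL;    // a % 4 == 1: full period mod 2^k
        c = 2 * seed + 1;             // any odd increment works
        state = seed & (m - 1);
        if (n) skip();
    }

    private void skip()               // redraw while out of range
    {
        while (state >= n)
            state = (a * state + c) & (m - 1);
    }

    bool empty() const { return emitted == n; }
    ulong front() const { return state; }

    void popFront()
    {
        if (++emitted == n) return;
        state = (a * state + c) & (m - 1);
        skip();
    }
}

Used against the original problem (r and P as in the opening post, seed
from std.random):

    import std.random : uniform;

    foreach (i; CycleWalk(r.length, uniform!ulong()))
        if (P(r[i])) return r[i];
    /* no element satisfies P(x) */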
May 01 2017
Ola Fosheim Grøstad writes:
On Monday, 1 May 2017 at 11:15:28 UTC, Ola Fosheim Grøstad wrote:
 But permutations are the way to go if you want scientific 
 quality.
I take that back:

«Polynomial pseudo-random number generator via cyclic phase»
http://www.sciencedirect.com/science/article/pii/S0378475409001463

Might be what you are looking for, and if not it probably references
other papers that fit the bill.
May 01 2017
Ola Fosheim Grøstad writes:
Ah, found a sensible search term: «pseudorandom permutation»

http://cse.iitkgp.ac.in/~debdeep/courses_iitkgp/FCrypto/slides/LubyRackoff.pdf

If you go down this path, please share links. Useful stuff 
actually.
May 01 2017
Ola Fosheim Grøstad writes:
On Monday, 1 May 2017 at 11:57:32 UTC, Ola Fosheim Grøstad wrote:
 Ah, found a sensible search term: «pseudorandom permutation»

 http://cse.iitkgp.ac.in/~debdeep/courses_iitkgp/FCrypto/slides/LubyRackoff.pdf

 If you go down this path, please share links. Useful stuff 
 actually.
A readable paper for most people:

http://www.pcg-random.org/pdf/toms-oneill-pcg-family-v1.02.pdf

«some users would prefer a much stronger property analogous to
uniformity: that over the full period of the generator, every possible
k-tuple will occur, and it will occur the same number of times.»

PCG claims to have arbitrary cycle sizes. I haven't looked at it, but
it is available as C/C++.
May 01 2017
Moritz Maxeiner <moritz ucworks.org> writes:
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
 Given a set A of n elements (let's say it's a random-access 
 range of
 size n, where n is relatively large), and a predicate P(x) that
 specifies some subset of A of elements that we're interested 
 in, what's
 the best algorithm (in terms of big-O time complexity) for 
 selecting a
 random element x satisfying P(x), such that elements that 
 satisfy P(x)
 have equal probability of being chosen? (Elements that do not 
 satisfy
 P(x) are disregarded.)

 Which elements of A satisfy P(x) is not known ahead of time, 
 nor is the
 relative proportion of elements that satisfy P(x) or not.
Works only with a random access range and I haven't done the full
analysis (and I'm a bit rusty on probability), but my first thought was
random divide and conquer:

ElementType!A* select(A,P)(A r) {
  // Recursion anchor
  if (r.length == 1) {
    if (P(r[0])) return r[0];
    else return null;
  // Recurse randomly with p=0.5 into either the left, or right half of r
  } else {
    ElementType!A* e;
    ElementType!A[][2] half;
    half[0] = r[0..floor(r.length/2)];
    half[1] = r[ceil(r.length/2)..$];
    ubyte coinFlip = uniform(0,1) > 0;
    // Recurse in one half and terminate if e is found there
    e = select(half[coinFlip]);
    if (e) return e;
    // Otherwise, recurse into other half
    return select(half[1 - coinFlip]);
  }
}

As stated above, I haven't done the full analysis, but intuitively
speaking (which can be wrong, of course), the p=0.5 recursion ought to
satisfy the condition of elements satisfying P(x) being chosen
uniformly; also intuitively, I'd expect the expected runtime for a
uniform distribution of elements satisfying P(x) to be around O(log N).
Worst-case would be that it has to inspect every element in r once =>
O(N).
May 01 2017
Timon Gehr <timon.gehr gmx.ch> writes:
On 01.05.2017 11:51, Moritz Maxeiner wrote:
 On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
 Given a set A of n elements (let's say it's a random-access range of
 size n, where n is relatively large), and a predicate P(x) that
 specifies some subset of A of elements that we're interested in, what's
 the best algorithm (in terms of big-O time complexity) for selecting a
 random element x satisfying P(x), such that elements that satisfy P(x)
 have equal probability of being chosen? (Elements that do not satisfy
 P(x) are disregarded.)

 Which elements of A satisfy P(x) is not known ahead of time, nor is the
 relative proportion of elements that satisfy P(x) or not.
 Works only with a random access range and I haven't done the full
 analysis (and I'm a bit rusty on probability), but my first thought
 was random divide and conquer:

 ElementType!A* select(A,P)(A r) {
   // Recursion anchor
   if (r.length == 1) {
     if (P(r[0])) return r[0];
     else return null;
   // Recurse randomly with p=0.5 into either the left, or right half of r
   } else {
     ElementType!A* e;
     ElementType!A[][2] half;
     half[0] = r[0..floor(r.length/2)];
     half[1] = r[ceil(r.length/2)..$];
     ubyte coinFlip = uniform(0,1) > 0;
This deterministically chooses 0. (uniform is right-exclusive.) I assume you mean uniform(0,2).
     // Recurse in one half and terminate if e is found there
     e = select(half[coinFlip]);
     if (e) return e;
     // Otherwise, recurse into other half
     return select(half[1 - coinFlip]);
   }
 }

 As stated above, I haven't done the full analysis, but intuitively
 speaking (which can be wrong, of course), the p=0.5 recursion ought to
 satisfy the condition of elements satisfying P(x) being chosen
 uniformly;
Counterexample: [1,0,1,1]. The first element is chosen with probability 1/2, which is not 1/3.
 also intuitively, I'd expect the expected runtime for a
 uniform distribution of elements satisfying P(x)  to be around O(log N).
I don't understand this input model.
 Worst-case would be that it has to inspect every element in r once => O(N)
May 01 2017
Moritz Maxeiner <moritz ucworks.org> writes:
On Monday, 1 May 2017 at 09:58:39 UTC, Timon Gehr wrote:
 On 01.05.2017 11:51, Moritz Maxeiner wrote:
     [...]
This deterministically chooses 0. (uniform is right-exclusive.) I assume you mean uniform(0,2).
Yes.
 [...]
Counterexample: [1,0,1,1]. The first element is chosen with probability 1/2, which is not 1/3.
Dang. Serves me right for not doing the full analysis. Thanks for the counter example and sorry for wasting your time.
 [...]
I don't understand this input model.
Since the idea was nonsense, it doesn't matter, anyway.
May 01 2017
Mike B Johnson <Mikey Ikes.com> writes:
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
 I've been thinking about the following problem, and thought I'd 
 pick the brains of the bright people around these parts...

 [...]
Since most real world problems would require selecting elements more
than once, it may be far more efficient to sort by P(x) (filter) then
simply select a random element. E.g.,

a = A.filter(x->P(x))  // Creates a new set a where only the elements
                       // of A that satisfy P(x) are added

...

e = a.random;

This is O(1) if you only have to filter once (you can create a
container that always "sorts" on P(x), so to speak, like a sorted
dictionary).
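In Phobos terms the filter-then-pick idea might be sketched like this
(A and P as in the thread's pseudocode; std.random.choice does the
uniform pick):

    import std.algorithm.iteration : filter;
    import std.array : array;
    import std.random : choice;

    // Pay O(n) once per distinct P to materialize the matching elements...
    auto satisfying = A.filter!(x => P(x)).array;

    // ...then each subsequent query is O(1).
    if (satisfying.length)
    {
        auto e = satisfying.choice;
        // use e
    }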
May 01 2017
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, May 01, 2017 at 11:32:19AM +0000, Mike B Johnson via Digitalmars-d
wrote:
[...]
 Since most real world problems would require selecting elements more
 than once it may be far more efficient to sort by P(x)(filter) then
 simply select a random element.
 
 e.g.,
 
 a = A.filter(x->P(x))  // Creates a new set a where only the elements
 of A that satisfy P(x) are added
 
 ...
 
 e = a.random;
 
 
 this is O(1) if you only have to filter once(you can create a
 container that always "sorts" on P(x) so to speak.. like a sorted
 dictionary).
[...]

Sorry, I forgot to specify that P(x) may change each time, as may the
input A. So caching would be of little help in this case.


T

-- 
This is a tpyo.
May 01 2017
Andrea Fontana <nospam example.com> writes:
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
 1) "Random shooting":

 	auto r = ... /* elements of A */
 	for (;;) {
 		auto i = uniform(0, r.length);
 		if (P(r[i])) return r[i];
 	}

    Advantages: If a large percentage of elements in A satisfy 
 P(x), then
    the loop will terminate within a small number of iterations; 
 if the
    majority of elements satisfy P(x), this will terminate well 
 below n
    iterations.

    Disadvantages: If only a small percentage of elements 
 satisfy P(x),
    then this loop could take arbitrarily long to terminate, and 
 it will
    not terminate at all if no elements satisfy P(x).
If you find an element that doesn't satisfy P(x) move it on top of
array (swap!) and then try again with uniform(1, r.length - 1) and so
on. It terminates in this way.

Andrea
May 01 2017
Andrea Fontana <nospam example.com> writes:
On Monday, 1 May 2017 at 12:31:48 UTC, Andrea Fontana wrote:
 If you find an element that doesn't satisfy P(x) move it on top 
 of array (swap!) and then try again with uniform(1, r.length - 
 1) and so on.
Whoops, I mean uniform(1, r.length) of course :)
May 01 2017
Ola Fosheim Grøstad writes:
On Monday, 1 May 2017 at 12:38:32 UTC, Andrea Fontana wrote:
 On Monday, 1 May 2017 at 12:31:48 UTC, Andrea Fontana wrote:
 If you find an element that doesn't satisfy P(x) move it on 
 top of array (swap!) and then try again with uniform(1, 
 r.length - 1) and so on.
Whoops i mean uniform(1, r.length) of course :)
I guess you meant to the end, but if mutating the array is ok, then you
don't have to swap. Just move in the last element and chop 1 off the
length.

Another cache-friendly mutation solution (assuming the PRNG is cheap)
is to have two phases:

Phase 1: linear walk over the array and select elements with
P(a[i])==true with probability 1/N, then overwrite the ones with
P(a[i])==false. When done, truncate the array.

Phase 2: regular random selection from the array where all elements
satisfy P.

The OP underspecified the requirements... There is a gazillion
variations... :-P
May 01 2017
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, May 01, 2017 at 12:31:48PM +0000, Andrea Fontana via Digitalmars-d
wrote:
 On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
 1) "Random shooting":
 
 	auto r = ... /* elements of A */
 	for (;;) {
 		auto i = uniform(0, r.length);
 		if (P(r[i])) return r[i];
 	}
 
    Advantages: If a large percentage of elements in A satisfy P(x), then
    the loop will terminate within a small number of iterations; if the
    majority of elements satisfy P(x), this will terminate well below n
    iterations.
 
    Disadvantages: If only a small percentage of elements satisfy P(x),
    then this loop could take arbitrarily long to terminate, and it will
    not terminate at all if no elements satisfy P(x).
 
 If you find an element that doesn't satisfy P(x) move it on top of
 array (swap!) and then try again with uniform(1, r.length - 1) and so
 on. It terminates in this way.

[...]

The problem is that swapping elements is not allowed for this
particular problem. If swapping was allowed this would be a much easier
problem to solve. :-)


T

-- 
Chance favours the prepared mind. -- Louis Pasteur
May 01 2017
Ivan Kazmenko <gassa mail.ru> writes:
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
 Given a set A of n elements (let's say it's a random-access 
 range of
 size n, where n is relatively large), and a predicate P(x) that
 specifies some subset of A of elements that we're interested 
 in, what's
 the best algorithm (in terms of big-O time complexity) for 
 selecting a
 random element x satisfying P(x), such that elements that 
 satisfy P(x)
 have equal probability of being chosen? (Elements that do not 
 satisfy
 P(x) are disregarded.)
I'd like to note here that, if you make use of the same P(x) many times (instead of different predicates on each call), it makes sense to spend O(n) time and memory filtering by that predicate and storing the result, and then answer each query in O(1).
 3) Permutation walk:

 	auto r = ... /* elements of A */
 	foreach (i; iota(0, r.length).randomPermutation) {
 		if (P(r[i])) return r[i];
 	}
 	/* no elements satisfy P(x) */

    Advantages: if an element that satisfies P(x) is found 
 early, the
    loop will terminate before n iterations. This seems like the 
 best of
    both worlds of (1) and (2), except:

    Disadvantages: AFAIK, generating a random permutation of 
 indices from
    0 .. n requires at least O(n) time, so any advantage we may 
 have had
    seems to be negated.

 Is there an algorithm for *incrementally* generating a random 
 permutation of indices? If there is, we could use that in (3) 
 and thus achieve the best of both worlds between early 
 termination if an element satisfying P(x) is found, and 
 guaranteeing termination after n iterations if no elements 
 satisfying P(x) exists.
Yes, there is.

There are actually two variations of Fisher-Yates shuffle:
(https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle)

1.
    auto p = n.iota.array;
    foreach (pos; 0..n) {
        auto otherPos = uniform (0, pos + 1);
        swap (p[pos], p[otherPos]);
    }

When we look at this after k-th iteration, the first k elements are
randomly and uniformly permuted, and the rest (n-k) are left untouched.

2.
    auto p = n.iota.array;
    foreach (pos; 0..n) {
        auto otherPos = uniform (pos, n);
        swap (p[pos], p[otherPos]);
    }

When we look at this after k-th iteration, the first k elements are a
random combination of all n elements, and this combination is randomly
and uniformly permuted.  So, the second variation is what we need: each
new element is randomly and uniformly selected from all the elements
left.  Once we get the first element satisfying the predicate, we can
just terminate the loop.  If there are m out of n elements satisfying
the predicate, the average number of steps is n/m.

Now, the problem is that both of these allocate n "size_t"-s of memory
to start with.  And your problem does not allow to shuffle the elements
of the original array in place, so we do need an external permutation
for these algorithms.  However, there are at least two ways to mitigate
that:

(I)
We can allocate the permutation once using n time and memory, and then,
on every call, just reuse it in its current state in n/m time.  It does
not matter if the permutation is not identity permutation: by symmetry
argument, any starting permutation will do just fine.

(II)
We can store the permutation p in an associative array instead of a
regular array, actually storing only the elements accessed at least
once, and assuming other elements to satisfy the identity p[x] = x.
So, if we finish in n/m steps on average, the time and extra memory
used will be O(n/m) too.  I can put together an example implementation
if this best satisfies your requirements.

Ivan Kazmenko.
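In the spirit of that offer, here is a minimal sketch of option (II)
(hedged: untested, and the struct name is made up; it runs variation 2
lazily, holding the sparse permutation p in a built-in associative
array, so time and extra memory stay proportional to the number of
positions actually drawn):

import std.random : uniform;

struct LazyPermutation
{
    size_t n;            // a random permutation of 0 .. n-1
    size_t pos;          // number of elements emitted so far
    size_t[size_t] p;    // sparse storage; an absent key x means p[x] == x

    this(size_t n) { this.n = n; }

    bool empty() const { return pos == n; }

    // Emit the next element, doing one Fisher-Yates swap lazily.
    size_t next()
    {
        assert(pos < n);
        auto otherPos = uniform(pos, n);
        auto a = p.get(pos, pos);             // value currently at pos
        auto b = p.get(otherPos, otherPos);   // value currently at otherPos
        p[otherPos] = a;                      // the swap, stored sparsely
        p[pos] = b;
        ++pos;
        return b;
    }
}

Against the original problem (r and P as in the opening post):

    auto perm = LazyPermutation(r.length);
    while (!perm.empty)
    {
        auto i = perm.next();
        if (P(r[i])) return r[i];   // expected O(n/m) time and memory
    }
    /* no element satisfies P(x) */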
May 01 2017
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, May 01, 2017 at 02:38:09PM +0000, Ivan Kazmenko via Digitalmars-d wrote:
 On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
 Given a set A of n elements (let's say it's a random-access range of
 size n, where n is relatively large), and a predicate P(x) that
 specifies some subset of A of elements that we're interested in,
 what's the best algorithm (in terms of big-O time complexity) for
 selecting a random element x satisfying P(x), such that elements
 that satisfy P(x) have equal probability of being chosen? (Elements
 that do not satisfy P(x) are disregarded.)
I'd like to note here that, if you make use of the same P(x) many times (instead of different predicates on each call), it makes sense to spend O(n) time and memory filtering by that predicate and storing the result, and then answer each query in O(1).
Unfortunately, P(x) could be different each time (sorry, should have stated this explicitly), and A may also change in between calls to this random selection algorithm. So filtering would not be a good approach. [...]
 There are actually two variations of Fisher-Yates shuffle:
 (https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle)
 
 1.
     auto p = n.iota.array;
     foreach (pos; 0..n) {
         auto otherPos = uniform (0, pos + 1);
         swap (p[pos], p[otherPos]);
     }
 
 When we look at this after k-th iteration, the first k elements are
 randomly and uniformly permuted, and the rest (n-k) are left
 untouched.
 
 2.
     auto p = n.iota.array;
     foreach (pos; 0..n) {
         auto otherPos = uniform (pos, n);
         swap (p[pos], p[otherPos]);
     }
 
 When we look at this after k-th iteration, the first k elements are a
 random combination of all n elements, and this combination is randomly
 and uniformly permuted.  So, the second variation is what we need:
 each new element is randomly and uniformly selected from all the
 elements left.  Once we get the first element satisfying the
 predicate, we can just terminate the loop.  If there are m out of n
 elements satisfying the predicate, the average number of steps is n/m.
But wouldn't creating the array p already be O(n)? Albeit, somewhat faster than n iterations of the main loop since copying iota ought to be somewhat cheaper than generating a random number and swapping two elements.
 Now, the problem is that both of these allocate n "size_t"-s of memory
 to start with.  And your problem does not allow to shuffle the
 elements of the original array in place, so we do need an external
 permutation for these algorithms.  However, there are at least two
 ways to mitigate that:
 
 (I)
 We can allocate the permutation once using n time and memory, and
 then, on every call, just reuse it in its current state in n/m time.
 It does not matter if the permutation is not identity permutation: by
 symmetry argument, any starting permutation will do just fine.
I like this idea very much. This lets us only pay once for creating the index array, and reuse it almost "for free" thereafter. Still, the O(n) space requirement could potentially be onerous, since n is expected to be rather large.
 (II)
 We can store the permutation p in an associative array instead of a
 regular array, actually storing only the elements accessed at least
 once, and assuming other elements to satisfy the identity p[x] = x.
 So, if we finish in n/m steps on average, the time and extra memory
 used will be O(n/m) too.  I can put together an example implementation
 if this best satisfies your requirements.
[...]

This is interesting, and relaxes the O(n) space requirement initially.
After enough queries, though, I'd expect the AA to eventually grow to n
elements (as there is a chance some queries will end up finding no
elements that satisfy P(x) so it would generate the entire
permutation).

But I like your idea of reusing the index array. It's definitely a
possibility!


T

-- 
Your inconsistency is the only consistent thing about you! -- KD
May 01 2017
MysticZach <reachzach ggmail.com> writes:
On Monday, 1 May 2017 at 04:15:35 UTC, H. S. Teoh wrote:
 Given a set A of n elements (let's say it's a random-access 
 range of
 size n, where n is relatively large), and a predicate P(x) that
 specifies some subset of A of elements that we're interested 
 in, what's
 the best algorithm (in terms of big-O time complexity) for 
 selecting a
 random element x satisfying P(x), such that elements that 
 satisfy P(x)
 have equal probability of being chosen? (Elements that do not 
 satisfy
 P(x) are disregarded.)
Here's how I would do it:

// returns a random index of array satisfying P(x), -1 if not found
int randomlySatisfy(A[] array)
{
   if (array.length == 0)
      return -1;
   int i = uniform(0, array.length);
   return randomlySatisfyImpl(array, i);
}

// recursive function
private int randomlySatisfyImpl(A[] da, int i)
{
   if (P(da[i]))
      return i;
   if (da.length == 1)
      return -1;

   // choose a random partition proportionally
   auto j = uniform(da.length - 1);
   if (j < i) {
      // the lower partition
      int a = randomlySatisfyImpl(da[0..i], j);
      if (a != -1) return a;
      else return randomlySatisfyImpl(da[i+1 .. da.length], j - (i + 1));
   }
   else {
      // higher partition, investigate in reverse order
      int a = randomlySatisfyImpl(da[i+1 .. da.length], j - (i + 1));
      if (a != -1) return i + 1 + a;
      else return i + 1 + randomlySatisfyImpl(da[0..i], j);
   }
}

The goal is to have the first hit be the one you return. The method: if
a random pick doesn't satisfy, randomly choose the partition greater
than or less than based on uniform(0..array.length-1), and do the same
procedure on that partition, reusing the random index to avoid having
to call uniform twice on each recursion (one to choose a partition and
one to choose an index within that partition). If the probability of
choosing a partition is precisely proportional to the number of
elements in that partition, it seems to me that the result will be
truly random, but I'm not sure.
May 01 2017
MysticZach <reachzach ggmail.com> writes:
On Monday, 1 May 2017 at 16:56:58 UTC, MysticZach wrote:
    // choose a random partition proportionally
    auto j = uniform(da.length - 1);
    if (j < i) {
       // the lower partition
       int a = randomlySatisfyImpl(da[0..i], j);
       if (a != -1) return a;
       else return randomlySatisfyImpl(da[i+1 .. da.length], j - 
 (i + 1));
    }
    else {
       // higher partition, investigate in reverse order
       int a = randomlySatisfyImpl(da[i+1 .. da.length], j - (i 
 + 1));
       if (a != -1) return i +1 + a;
       else return i + 1 + randomlySatisfyImpl(da[0..i], j);
The line above has a bug. Replace it with:

   else {
      a = randomlySatisfyImpl(da[0..i], j);
      return (a == -1) ? -1 : i + 1 + a;
   }
    }
 }
But the idea's the same. Hopefully it's clear.
May 01 2017
MysticZach <reachzach ggmail.com> writes:
On Monday, 1 May 2017 at 16:56:58 UTC, MysticZach wrote:
 The goal is to have the first hit be the one you return. The 
 method: if a random pick doesn't satisfy, randomly choose the 
 partition greater than or less than based on 
 uniform(0..array.length-1), and do the same procedure on that 
 partition, reusing the random index to avoid having to call 
 uniform twice on each recursion (one to choose a partition and 
 one to choose an index within that partition). If the 
 probability of choosing a partition is precisely proportional 
 to the number of elements in that partition, it seems to me 
 that the result will be truly random, but I'm not sure.
Now I'm questioning this, because if the positive cases are unevenly distributed, i.e., [111111000100], the last one has about 50% chance to get picked instead of a 1 in 7 chance with my method. I guess you'd need to build a whole new array like the others are saying.
May 01 2017
Ivan Kazmenko <gassa mail.ru> writes:
On Monday, 1 May 2017 at 21:54:43 UTC, MysticZach wrote:
 On Monday, 1 May 2017 at 16:56:58 UTC, MysticZach wrote:
 The goal is to have the first hit be the one you return. The 
 method: if a random pick doesn't satisfy, randomly choose the 
 partition greater than or less than based on 
 uniform(0..array.length-1), and do the same procedure on that 
 partition, reusing the random index to avoid having to call 
 uniform twice on each recursion (one to choose a partition and 
 one to choose an index within that partition). If the 
 probability of choosing a partition is precisely proportional 
 to the number of elements in that partition, it seems to me 
 that the result will be truly random, but I'm not sure.
Now I'm questioning this, because if the positive cases are unevenly distributed, i.e., [111111000100], the last one has about 50% chance to get picked instead of a 1 in 7 chance with my method. I guess you'd need to build a whole new array like the others are saying.
A pity; it sounded nice!  But yeah, once we pick the right ~half, we
have to completely traverse it before paying any attention to the left
~half.

I hope some part of the idea is still salvageable.  For example, what
if we put the intervals in a queue instead of a stack?

Ivan Kazmenko.
May 02 2017
Ivan Kazmenko <gassa mail.ru> writes:
On Tuesday, 2 May 2017 at 10:35:46 UTC, Ivan Kazmenko wrote:
 I hope some part of the idea is still salvageable.
 For example, what if we put the intervals in a queue instead of 
 a stack?
I tried to implement a similar approach, but instead of a queue or a
stack, I used a random-access array of intervals.  Sadly, it is still
far from uniform, since it does not take interval lengths into account,
and I don't see how to do that without at least O(log n) time per
interval insertion or deletion.

Implementation and empiric frequencies for n=5 elements in a
permutation: http://ideone.com/3zSxLN

Ivan Kazmenko.
May 02 2017
MysticZach <reachzach ggmail.com> writes:
On Tuesday, 2 May 2017 at 11:27:17 UTC, Ivan Kazmenko wrote:
 On Tuesday, 2 May 2017 at 10:35:46 UTC, Ivan Kazmenko wrote:
 I hope some part of the idea is still salvageable.
 For example, what if we put the intervals in a queue instead 
 of a stack?
 I tried to implement a similar approach, but instead of a queue or a
 stack, I used a random-access array of intervals.  Sadly, it is still
 far from uniform, since it does not take interval lengths into
 account, and I don't see how to do that without at least O(log n)
 time per interval insertion or deletion.

 Implementation and empiric frequencies for n=5 elements in a
 permutation: http://ideone.com/3zSxLN

 Ivan Kazmenko.
Well, I thought of another idea that may not be technically random, but
which I imagine would get pretty close in real world use cases:

1. Test a random element in the array. If it satisfies P, return it.

2. Choose a hopper interval, h, that is relatively prime to the number
of elements in the array, n. You could do this by randomly selecting
from a pre-made list of primes of various orders of magnitude larger
than n, with h = the prime % n.

3. Hop along the array, testing each element as you go. Increment a
counter. If you reach the end of the array, cycle back to the beginning
starting with the remainder of h that you didn't use. I think that h
being relatively prime means it will thereby hit every element in the
array.

4. Return the first hit. If the counter reaches n, return failure.

The way I see it, with a random start position, and a *mostly* random
interval, the only way to defeat the randomness of the result would be
if the elements that satisfy P were dispersed precisely according to
the random interval specified for that particular invocation of the
function -- in which case the first in the dispersed set would be
chosen more often. This would require a rare convergence depending on
both n and h, and would not repeat from one call to the next.
May 02 2017
MysticZach <reachzach ggmail.com> writes:
On Tuesday, 2 May 2017 at 13:44:03 UTC, MysticZach wrote:
 1. Test a random element in the array. If it satisfies P, 
 return it.
 2. Choose a hopper interval, h, that is relatively prime to the 
 number of elements in the array, n. You could do this by 
 randomly selecting from a pre-made list of primes of various 
 orders of magnitude larger than n, with h = the prime % n.
 3. Hop along the array, testing each element as you go. 
 Increment a counter. If you reach the end of the array, cycle 
 back to the beginning starting with the remainder of h that you 
 didn't use. I think that h being relatively prime means it will 
 thereby hit every element in the array.
 4. Return the first hit. If the counter reaches n, return 
 failure.
Taking this a step further, it occurred to me that you could use *any*
hopping interval from 1 to array.length, if the length of the array
were prime. So artificially extend the array and just skip a jump when
you land in the extended area. And since you'd skip lots of jumps if
the extension were any bigger than the hopping interval, reduce its
length to modulo the hopping interval.

// returns a random index of array satisfying P(x), -1 if not found
int randomlySatisfy(A[] array)
{
   auto length = array.length;
   if (!length)
      return -1;
   auto i = uniform(0, length);
   auto hop = uniform(1, length);
   // greatest prime < 2^^31, for simplicity
   enum prime = 2147483477;
   assert (length <= prime);
   auto skipLength = ((prime - length) % hop) + length;
   auto count = 0;
   for (;;) {
      if (P(a[i]))
         return i;
      ++count;
      if (count == length)
         return -1;
      i += hop;
      if (i < length)
         continue;
      if (i < skipLength)
         i += hop;
      i -= skipLength;
   }
   return -1;
}

This solution is stupidly simple and I haven't tested it. But I believe
it's truly random, since the hopping interval is arbitrary. Who knows,
it might work.
May 02 2017
Timon Gehr <timon.gehr gmx.ch> writes:
On 02.05.2017 22:42, MysticZach wrote:
 On Tuesday, 2 May 2017 at 13:44:03 UTC, MysticZach wrote:
 1. Test a random element in the array. If it satisfies P, return it.
 2. Choose a hopper interval, h, that is relatively prime to the number
 of elements in the array, n. You could do this by randomly selecting
 from a pre-made list of primes of various orders of magnitude larger
 than n, with h = the prime % n.
 3. Hop along the array, testing each element as you go. Increment a
 counter. If you reach the end of the array, cycle back to the
 beginning starting with the remainder of h that you didn't use. I
 think that h being relatively prime means it will thereby hit every
 element in the array.
 4. Return the first hit. If the counter reaches n, return failure.
 Taking this a step further, it occurred to me that you could use *any*
 hopping interval from 1 to array.length, if the length of the array
 were prime. So artificially extend the array and just skip a jump when
 you land in the extended area. And since you'd skip lots of jumps if
 the extension were any bigger than the hopping interval, reduce its
 length to modulo the hopping interval.

 // returns a random index of array satisfying P(x), -1 if not found
 int randomlySatisfy(A[] array)
 {
    auto length = array.length;
    if (!length)
       return -1;
    auto i = uniform(0, length);
    auto hop = uniform(1, length);
    // greatest prime < 2^^31, for simplicity
    enum prime = 2147483477;
    assert (length <= prime);
    auto skipLength = ((prime - length) % hop) + length;
    auto count = 0;
    for (;;) {
       if (P(a[i]))
          return i;
       ++count;
       if (count == length)
          return -1;
       i += hop;
       if (i < length)
          continue;
       if (i < skipLength)
          i += hop;
(I guess this should be 'while'.)
       i -= skipLength;
    }
    return -1;
 }

 This solution is stupidly simple and I haven't tested it. But I believe
 it's truly random, since the hopping interval is arbitrary. Who knows,
 it might work.
Counterexample: [1,1,1,0,0]

Your algorithm returns 0 with probability 7/20, 1 with probability 6/20
and 2 with probability 7/20.

Note that there is a simple reason why your algorithm cannot work for
this case: it picks one of 20 numbers at random. The resulting
probability mass cannot be evenly distributed among the three elements,
because 20 is not divisible by 3.
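The 7/20, 6/20, 7/20 figures can be reproduced by brute force over all
20 equally likely (start, hop) pairs (a small check program, not from
the thread; it uses the fact that for length 5 the skipLength
expression equals the length for every hop, so the walk reduces to
i -> (i + hop) % length):

import std.stdio : writeln;

void main()
{
    immutable int[] a = [1, 1, 1, 0, 0];   // the counterexample above
    immutable n = cast(int) a.length;
    auto counts = new int[n];

    foreach (start; 0 .. n)        // every value of uniform(0, length)
        foreach (hop; 1 .. n)      // every value of uniform(1, length)
        {
            int i = start;
            foreach (step; 0 .. n)
            {
                if (a[i]) { ++counts[i]; break; }
                i = (i + hop) % n;
            }
        }

    writeln(counts);   // prints [7, 6, 7, 0, 0] across the 20 pairs
}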
May 02 2017
MysticZach <reachzach ggmail.com> writes:
On Tuesday, 2 May 2017 at 21:00:36 UTC, Timon Gehr wrote:
 On 02.05.2017 22:42, MysticZach wrote:
 On Tuesday, 2 May 2017 at 13:44:03 UTC, MysticZach wrote:
    for (;;) {
       if (P(a[i]))
          return i;
       ++count;
       if (count == length)
          return -1;
       i += hop;
       if (i < length)
          continue;
       if (i < skipLength)
          i += hop;
(I guess this should be 'while'.)
skipLength is determined modulo hop, thus it can't be more than one hop away.
 Counterexample: [1,1,1,0,0]

 Your algorithm returns 0 with probability 7/20, 1 with 
 probability 6/20 and 2 with probability 7/20.

 Note that there is a simple reason why your algorithm cannot 
 work for this case: it picks one of 20 numbers at random. The 
 resulting probability mass cannot be evenly distributed among 
 the three elements, because 20 is not divisible by 3.
That's true. Two points, though:

1. If the range of error is within 1/(n*(n-1)), with array length n, it
may be close enough for a given real world use case.

2. n is known to be large, which makes 1/(n*(n-1)) even smaller.

You'd probably have to trade accuracy for performance. I wonder how
much less performant a truly accurate algorithm would be. Also, whether
there's a way to make an algorithm of this style completely accurate.
May 02 2017
MysticZach <reachzach ggmail.com> writes:
On Tuesday, 2 May 2017 at 23:09:54 UTC, MysticZach wrote:
 On Tuesday, 2 May 2017 at 21:00:36 UTC, Timon Gehr wrote:
 On 02.05.2017 22:42, MysticZach wrote:
 On Tuesday, 2 May 2017 at 13:44:03 UTC, MysticZach wrote:
    for (;;) {
       if (P(a[i]))
          return i;
       ++count;
       if (count == length)
          return -1;
       i += hop;
       if (i < length)
          continue;
       if (i < skipLength)
          i += hop;
(I guess this should be 'while'.)
skipLength is determined modulo hop, thus it can't be more than one hop away.
Actually, I did botch the implementation. The hopping interval must be
based on the prime rather than on n. But you could still short-circuit
the while loop, I think:

   if (prime - n > hop) {
      skipLength = n + ((prime - n) % hop);
   }
May 03 2017
Timon Gehr <timon.gehr gmx.ch> writes:
On 03.05.2017 01:09, MysticZach wrote:
 Counterexample: [1,1,1,0,0]

 Your algorithm returns 0 with probability 7/20, 1 with probability
 6/20 and 2 with probability 7/20.

 Note that there is a simple reason why your algorithm cannot work for
 this case: it picks one of 20 numbers at random. The resulting
 probability mass cannot be evenly distributed among the three
 elements, because 20 is not divisible by 3.
That's true. Two points, though: If the range of error is within 1/(n*(n-1)), with array length n,
It's not though. For example, [1,1,1,0,...,0] (length 29), you get 0 and 2 each with probability 43/116, but 1 only with probability 30/116. It might be interesting to figure out how far from uniform the distribution can get.
 it may be close enough for a given
 real world use case. 2. n is known to be large, which makes 1/(n*(n-1))
 even smaller.  You'd probably have to trade accuracy for performance. I
 wonder how much less performant a truly accurate algorithm would be.
 Also, whether there's a way to make an algorithm of this style
 completely accurate.
For an accurate algorithm based on (unconditional) uniformly random sampling, the number of outcomes from sampling needs to be divisible by all numbers from 1 to n. (Because each of those could conceivably be the number of elements satisfying the predicate.)
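To make that divisibility requirement concrete (standard number theory,
not stated in the post): the size of the uniform sample space must be a
multiple of lcm(1, 2, ..., n), and by the prime number theorem

    lcm(1, 2, ..., n) = e^((1 + o(1)) n),

so any scheme that draws from a sample space of fixed polynomial size
cannot be exactly uniform for every possible count of satisfying
elements.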
May 04 2017
MysticZach <reachzach ggmail.com> writes:
On Thursday, 4 May 2017 at 08:04:22 UTC, Timon Gehr wrote:
 On 03.05.2017 01:09, MysticZach wrote:
 Counterexample: [1,1,1,0,0]

 Your algorithm returns 0 with probability 7/20, 1 with 
 probability
 6/20 and 2 with probability 7/20.

 Note that there is a simple reason why your algorithm cannot 
 work for
 this case: it picks one of 20 numbers at random. The resulting
 probability mass cannot be evenly distributed among the three
 elements, because 20 is not divisible by 3.
That's true. Two points, though: If the range of error is within 1/(n*(n-1)), with array length n,
It's not though. For example, [1,1,1,0,...,0] (length 29), you get 0 and 2 each with probability 43/116, but 1 only with probability 30/116. It might be interesting to figure out how far from uniform the distribution can get.
Or how close it can get, depending on the range of intervals used. My math skill is shaky here. Maybe there's no way to deterministically jump to every element of an array with equal probability of hitting any given element satisfying a given predicate first. It sure would be cool if there were.
May 04 2017
MysticZach <reachzach ggmail.com> writes:
On Thursday, 4 May 2017 at 13:19:43 UTC, MysticZach wrote:
 On Thursday, 4 May 2017 at 08:04:22 UTC, Timon Gehr wrote:
 On 03.05.2017 01:09, MysticZach wrote:
 That's true. Two points, though: If the range of error is 
 within
 1/(n*(n-1)), with array length n,
It's not though. For example, [1,1,1,0,...,0] (length 29), you get 0 and 2 each with probability 43/116, but 1 only with probability 30/116. It might be interesting to figure out how far from uniform the distribution can get.
Or how close it can get, depending on the range of intervals used. My math skill is shaky here. Maybe there's no way to deterministically jump to every element of an array with equal probability of hitting any given element satisfying a given predicate first. It sure would be cool if there were.
Within a small range of error, I mean.
May 04 2017
prev sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, May 04, 2017 at 01:19:43PM +0000, MysticZach via Digitalmars-d wrote:
[...]
 Maybe there's no way to deterministically jump to every element of an
 array with equal probability of hitting any given element satisfying a
 given predicate first. It sure would be cool if there were.
Of course there is.  The random permutation approach works.  In this
case, the data itself is not allowed to be permuted, but equivalent
semantics can be obtained by permuting an array of indices into the
data.

But the challenge is how efficiently you can generate a random
permutation so that the cost of generating one doesn't outweigh the
O(n) linear scan algorithm.  I'm surprised there are no (known)
incremental algorithms for generating a random permutation of 0..n that
require less than O(n) space.


T

-- 
Life is complex. It consists of real and imaginary parts. -- YHL
May 12 2017
Ola Fosheim Grostad <ola.fosheim.grostad gmail.com> writes:
On Friday, 12 May 2017 at 18:43:53 UTC, H. S. Teoh wrote:
 I'm surprised there are no (known) incremental algorithms for
 generating a random permutation of 0..n that require less than O(n)
 space.
I've told you all you need to know...
May 12 2017