
digitalmars.D - [OT] Generating distribution of N dice rolls

reply "H. S. Teoh" <hsteoh qfbox.info> writes:
This is technically OT, but I thought I'd pick the smart brains here for
my project, which happens to be written in D. ;-)

Basically, I want to write a function that takes 2 uint arguments k and
N, and simulates rolling N k-sided dice and counting how many 1's, 2's,
3's, ... k's were rolled. Something like this:

	uint[k] diceDistrib(uint k)(uint N)
		in(k > 0)
		in(N > 0)
		out(r; r[].sum == N)
	{
		uint[k] result;
		foreach (i; 0 .. N) {
			result[uniform(0, k)]++;
		}
		return result;
	}

The above code works and does what I want, but since N may be large, I'd
like to refactor the code to loop over k instead of N. I.e., instead of
actually rolling N dice and tallying the results, the function would
generate the elements of the output array directly, such that the
distribution of the array elements follows the same probabilities as the
above code.

Note that in all cases, the output array must sum to N; it is not enough
to merely simulate the roll distribution probabilistically.

Any ideas?  (Or links if this is a well-studied problem with a known
solution.)

<ObDReference> I love how D's new contract syntax makes it so conducive
to expressing programming problem requirements. ;-) </ObDReference>


T

-- 
When you breathe, you inspire. When you don't, you expire. -- The Weekly Reader
Nov 09 2022
Ali Çehreli <acehreli yahoo.com> writes:
On 11/9/22 18:10, H. S. Teoh wrote:

 Note that in all cases, the output array must sum to N;
I don't know what the standard deviation should be and whether it is a
function of N, but the following makes sense (a minimal sketch follows
below):

- for all k, pick a random number with mean == N/k and standard
  deviation == ???
- if the sum > N, remove an item from a random bucket for sum - N times
- if the sum < N, add an item to a random bucket for N - sum times

Ali
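A minimal D sketch of this pick-then-adjust idea (the deviation is left
as a parameter since the message leaves it open; the name
`pickThenAdjust` and the Box-Muller normal sampling are illustrative
assumptions, not from the thread):

```D
import std;

// Hypothetical illustration of the idea above: sample each bucket around
// the mean N/k, then add/remove items at random until the sum is exactly N.
uint[k] pickThenAdjust(uint k)(uint N, double deviation)
{
    uint[k] result;
    const mean = cast(double) N / k;

    // For all k buckets, pick a random number with mean N/k; here via a
    // Box-Muller normal sample, clamped at zero.
    foreach (ref r; result) {
        auto u1 = uniform01();
        auto u2 = uniform01();
        auto z = sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
        r = cast(uint) max(0.0, round(mean + deviation * z));
    }

    // If the sum > N, remove an item from a random bucket (sum - N) times;
    // if the sum < N, add an item to a random bucket (N - sum) times.
    auto total = result[].sum;
    while (total > N) {
        auto i = uniform(0, k);
        if (result[i] > 0) { result[i]--; total--; }
    }
    while (total < N) {
        result[uniform(0, k)]++;
        total++;
    }
    return result;
}

void main()
{
    writeln(pickThenAdjust!6(100, 10.0)); // deviation chosen arbitrarily
}
```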
Nov 09 2022
Sergey <kornburn yandex.ru> writes:
On Thursday, 10 November 2022 at 02:10:32 UTC, H. S. Teoh wrote:
 This is technically OT, but I thought I'd pick the smart brains 
 here for my project, which happens to be written in D. ;-)

 Basically, I want to write a function that takes 2 uint 
 arguments k and N, and simulates rolling N k-sided dice and 
 counting how many 1's, 2's, 3's, ... k's were rolled. Something 
 like this:

 	uint[k] diceDistrib(uint k)(uint N)
 		in(k > 0)
 		in(N > 0)
 		out(r; r[].sum == N)
 	{
 		uint[k] result;
 		foreach (i; 0 .. N) {
 			result[uniform(0, k)]++;
 		}
 		return result;
 	}

 The above code works and does what I want, but since N may be 
 large, I'd like to refactor the code to loop over k instead of 
 N. I.e., instead of actually rolling N dice and tallying the 
 results, the function would generate the elements of the output 
 array directly, such that the distribution of the array 
 elements follows the same probabilities as the above code.

 Note that in all cases, the output array must sum to N; it is 
 not enough to merely simulate the roll distribution 
 probabilistically.

 Any ideas?  (Or links if this is a well-studied problem with a 
 known
 solution.)

 <ObDReference> I love how D's new contract syntax makes it so 
 conducive to expressing programming problem requirements. ;-) 
 </ObDReference>


 T
They should have a uniform distribution with well-known mean and std.
To have exactly N rolls, you can estimate the numbers for (k-1) sides
with the distribution function (let their sum be M), and put the rest
(N-M) into the k-th side (a literal sketch follows below).

https://scientificgems.wordpress.com/2015/11/03/mathematics-in-action-probability/
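A literal D sketch of this suggestion, assuming a clamped normal
approximation for each of the first (k-1) counts (the name
`estimateThenRemainder` is hypothetical, and the sketch deliberately
leaves open what to do if the running sum M overshoots N):

```D
import std;

// Hypothetical illustration: estimate k-1 sides from the (approximately)
// normal distribution of each count, then put the rest into the k-th side.
uint[k] estimateThenRemainder(uint k)(uint N)
{
    uint[k] result;
    const mean = cast(double) N / k;
    // Standard deviation of a single side's count, binomial(N, 1/k).
    const sigma = sqrt(N * (1.0 / k) * (1.0 - 1.0 / k));

    uint M = 0; // running sum of the first k-1 sides
    foreach (i; 0 .. k - 1) {
        auto u1 = uniform01();
        auto u2 = uniform01();
        auto z = sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
        result[i] = cast(uint) max(0.0, round(mean + sigma * z));
        M += result[i];
    }

    // The rest (N - M) goes into the k-th side. If M > N, the suggestion
    // leaves the case open; clamping to zero breaks the sum-to-N invariant,
    // so a real implementation would have to retry or redistribute.
    result[k - 1] = M <= N ? N - M : 0;
    return result;
}

void main()
{
    writeln(estimateThenRemainder!6(100_000_000));
}
```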
Nov 09 2022
Timon Gehr <timon.gehr gmx.ch> writes:
On 10.11.22 03:10, H. S. Teoh wrote:
 This is technically OT, but I thought I'd pick the smart brains here for
 my project, which happens to be written in D. ;-)
 
 Basically, I want to write a function that takes 2 uint arguments k and
 N, and simulates rolling N k-sided dice and counting how many 1's, 2's,
 3's, ... k's were rolled. Something like this:
 
 	uint[k] diceDistrib(uint k)(uint N)
 		in(k > 0)
 		in(N > 0)
 		out(r; r[].sum == N)
 	{
 		uint[k] result;
 		foreach (i; 0 .. N) {
 			result[uniform(0, k)]++;
 		}
 		return result;
 	}
 
 The above code works and does what I want, but since N may be large, I'd
 like to refactor the code to loop over k instead of N. I.e., instead of
 actually rolling N dice and tallying the results, the function would
 generate the elements of the output array directly, such that the
 distribution of the array elements follows the same probabilities as the
 above code.
 
 Note that in all cases, the output array must sum to N; it is not enough
 to merely simulate the roll distribution probabilistically.
 
 Any ideas?  (Or links if this is a well-studied problem with a known
 solution.)
 ...
The distribution of the output is multinomial(n,k,[1/k,1/k,...,1/k]),
i.e., n trials, k events, and the events follow a uniform distribution.
https://en.wikipedia.org/wiki/Multinomial_distribution

It is at least as hard to sample as binomial(n,1/k), because that is
the marginal distribution of each component of the result.

I guess one approach is this (if you can find a way to sample from
binomial distributions that is good enough for your use case):

auto tot=n;
foreach(i;0..k){
    result[i]=binomial(tot,1.0/(k-i));
    tot-=result[i];
}
return result;

This samples the result one component at a time.
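Phobos has no binomial sampler, so as a hedged stand-in for the
`binomial` call above, here is a naive O(n) sketch (fine for moderate n,
far too slow for the large-N case that motivated the thread):

```D
import std.random : uniform01;

// Count successes in n Bernoulli(p) trials: a direct O(n) binomial sample.
uint binomial(uint n, double p)
{
    uint successes = 0;
    foreach (i; 0 .. n)
        if (uniform01() < p)
            successes++;
    return successes;
}
```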
 <ObDReference> I love how D's new contract syntax makes it so conducive
 to expressing programming problem requirements. ;-) </ObDReference>
 
:)
Nov 10 2022
jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 10 November 2022 at 10:45:50 UTC, Timon Gehr wrote:
 [snip]

 The distribution of the output is 
 multinomial(n,k,[1/k,1/k,...,1/k]), i.e., n trials, k events, 
 and the events follow a uniform distribution.
 https://en.wikipedia.org/wiki/Multinomial_distribution

 It is at least as hard to sample as binomial(n,1/k), because 
 that is the marginal distribution of each component of the 
 result.

 I guess one approach is this (if you can find a way to sample 
 from binomial distributions that is good enough for your use 
 case):

 [snip]
mir-random has the ability to sample from both the binomial [1] and
multinomial [2] distributions (you would want the multinomial here,
unless I'm missing something). I don't know how well mir-random handles
large values of N or K, but there are also well-known normal and
poisson approximations of the binomial distribution (I make some use of
them here [3] for binomials) at least.

I wasn't exactly sure what you meant by "the events follow a uniform
distribution", but it's certainly possible to simulate from the
multinomial to get the counts. You should also be able to calculate the
PMF for it to compare the simulations against. The PMF shouldn't be
uniform.

[1] http://mir-random.libmir.org/mir_random_variable.html#BinomialVariable
[2] http://mir-random.libmir.org/mir_random_ndvariable.html#MultinomialVariable
[3] https://github.com/libmir/mir-stat/blob/master/source/mir/stat/distribution/binomial.d
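For instance, a single draw from the binomial marginal might look like
the sketch below (based on the docs linked in [1]; treat the exact
constructor signature as an assumption to verify):

```D
/+dub.sdl: dependency "mir-random" version="~>2.2.19" +/
import std.stdio;
import mir.random.engine;                      // default thread-local engine `rne`
import mir.random.variable : BinomialVariable;

void main()
{
    // One bucket of N fair d6 rolls is marginally binomial(N, 1/6),
    // so a draw should hover around N/6 = 16_666_666.
    auto rv = BinomialVariable!uint(100_000_000, 1.0 / 6);
    writeln(rv(rne));
}
```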
Nov 10 2022
Timon Gehr <timon.gehr gmx.ch> writes:
On 10.11.22 12:59, jmh530 wrote:
 On Thursday, 10 November 2022 at 10:45:50 UTC, Timon Gehr wrote:
 [snip]

 The distribution of the output is multinomial(n,k,[1/k,1/k,...,1/k]), 
 i.e., n trials, k events, and the events follow a uniform distribution.
 https://en.wikipedia.org/wiki/Multinomial_distribution

 It is at least as hard to sample as binomial(n,1/k), because that is 
 the marginal distribution of each component of the result.

 I guess one approach is this (if you can find a way to sample from 
 binomial distributions that is good enough for your use case):

 [snip]
 mir-random has the ability to sample from both the binomial [1] and
 multinomial [2] distributions (you would want the multinomial here,
 unless I'm missing something).

Well, I said that he needs to sample from a multinomial distribution. I
just pointed out that you can sample from multinomial(n,k,...) by
sampling from binomial distributions k times, and that to be able to
sample from multinomial, you need to be able to sample from binomial
anyway.

It turns out mir-random is doing precisely what I suggested:
https://github.com/libmir/mir-random/blob/master/source/mir/random/ndvariable.d#L344-L365
 I don't know how well mir-random handles large values of N or K,
FWIW this is what it does (haven't looked at it closely though):
https://github.com/libmir/mir-random/blob/master/source/mir/random/variable.d#L1730-L1770
 but there are also well-known normal and poisson 
 approximations of the binomial distribution (I make some use of them 
 here [3] for binomials) at least.
 ...
Well, that seems a bit more involved. In case that is more
accurate/efficient than mir-random, I guess one can sample from the
multinomial using that.
 I wasn't exactly sure what you meant by the events follow a uniform 
 distribution,
A multinomial is the distribution of a frequency table obtained after drawing n times from a categorical distribution. The OP is only interested in the special case where that categorical distribution is uniform. n is the number of samples, k is the number of events. It's the terminology used by the Wikipedia article I linked.
 but it's certainly possible to simulate from the 
 multinomial to get the counts. You should also be able to calculate the 
 PMF for it to compare the simulations against. The PMF shouldn't be 
 uniform.
 ...
Well, it should be multinomial. :)
 
 [1] http://mir-random.libmir.org/mir_random_variable.html#BinomialVariable
 [2] 
 http://mir-random.libmir.org/mir_random_ndvariable.html#MultinomialVariable
 [3] 
 https://github.com/libmir/mir-stat/blob/master/source/mir/stat/distribution/binomial.d
Nov 10 2022
Timon Gehr <timon.gehr gmx.ch> writes:
On 10.11.22 13:21, Timon Gehr wrote:
 n is the number of samples, k is the number of events. It's the 
 terminology used by the Wikipedia article I linked.
Actually it uses "trials" instead of "samples".
Nov 10 2022
jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 10 November 2022 at 12:21:30 UTC, Timon Gehr wrote:
 [snip]
 A multinomial is the distribution of a frequency table obtained 
 after drawing n times from a categorical distribution. The OP 
 is only interested in the special case where that categorical 
 distribution is uniform. n is the number of samples, k is the 
 number of events. It's the terminology used by the Wikipedia 
 article I linked.
 [snip]
Ah, I see what you're saying now. When p = 1 / K, a categorical distribution is the same as a uniform distribution. I tend to think of multinomial as a generalization of binomial, but it's really a generalization of both binomial and categorical.
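For reference, the PMF from the linked Wikipedia article makes the
double generalization explicit:

	P(X_1 = x_1, ..., X_k = x_k) = n! / (x_1! * ... * x_k!) * p_1^x_1 * ... * p_k^x_k,  where x_1 + ... + x_k = n

With k = 2 it reduces to the binomial; with n = 1 it reduces to the
categorical. For fair dice (p_i = 1/k), the probability factor
collapses to k^(-n).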
Nov 10 2022
prev sibling parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Thu, Nov 10, 2022 at 11:45:50AM +0100, Timon Gehr via Digitalmars-d wrote:
[...]
 The distribution of the output is multinomial(n,k,[1/k,1/k,...,1/k]),
 i.e., n trials, k events, and the events follow a uniform
 distribution.  https://en.wikipedia.org/wiki/Multinomial_distribution
I have to admit, the linked article is way above my head. I just don't grok probability. :-(
 It is at least as hard to sample as binomial(n,1/k), because that is
 the marginal distribution of each component of the result.
[...]

Hmm. I'm thinking one way to approximate this, if there's a fast way to
compute the standard deviation of the multinomial, is to use the
Box-Muller transform to do the initial population of the result array,
then distribute the remaining (N - sum(array)) elements by running
individual trials and adjusting the array accordingly. The initial
population should bring us reasonably close to the mean, so
(N - sum(array)) should be small enough that running individual trials
won't be excessively expensive.

Or if (N - sum(array)) is still large, maybe do another estimate using
the standard deviation of the remaining trials to reduce it, and then
run the remaining trials manually?


T

-- 
Doubt is a self-fulfilling prophecy.
Nov 10 2022
Ali Çehreli <acehreli yahoo.com> writes:
On 11/10/22 08:57, H. S. Teoh wrote:

 if there's a fast way to
 compute the standard deviation of the multinomial
I am happy with my solution, which includes scientific terms like
normalDistributianness. :p

import std;

uint[k] diceDistrib(uint k)(uint N)
in(k > 0)
in(N > 0)
out(r; r[].sum == N)
{
    uint[k] result;
    const average = N / k;

    // foreach (i; 0 .. N) {
    //     result[uniform(0, k)]++;
    // }

    // Larger this value, closer the results to normal distribution
    enum normalDistributianness = 1000;

    // Window around the average of values to pick randomly within
    const halfWindow = average / 2;

    foreach (ref r; result) {
        r = iota(normalDistributianness)
            .map!(_ => uniform(average - halfWindow, average + halfWindow + 1))
            .mean
            .to!uint;
    }

    uint total = result[].sum;

    while (true) {
        if (total == N) {
            break;
        }

        auto which = &result[uniform(0, k)];

        if (total < N) {
            ++(*which);
            ++total;
        } else if (total > N) {
            --(*which);
            --total;
        }
    }

    return result;
}

void main() {
    writeln(diceDistrib!6(100_000_000));
}

Error checking etc. are missing, but a hundred million dice throws in a
split second... Now that's fast! :p

Ali
Nov 10 2022
Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Thursday, 10 November 2022 at 17:46:59 UTC, Ali Çehreli wrote:
 On 11/10/22 08:57, H. S. Teoh wrote:

 if there's a fast way to
 compute the standard deviation of the multinomial
I am happy with my solution which includes scientific terms like normalDistributianness. :p [...] Error checking etc. are missing but hundred million dice throws in a split second... Now that's fast! :p
That's fast, but unfortunately doesn't seem to be correct. Your code
outputs something like this:

    [16548205, 16741274, 16600777, 16792301, 16585808, 16731635]

While the original slow reference implementation by H. S. Teoh outputs:

    [16669388, 16665882, 16668019, 16671186, 16664718, 16660807]

It's easy to see even with the naked eye that the statistical
properties are obviously not the same (your implementation has a
significantly higher variance). And for comparison, a [Python/SciPy
implementation](https://github.com/scipy/scipy/issues/17388#issuecomment-1310913940)
outputs:

    [16664115, 16666144, 16677962, 16660692, 16663181, 16667906]
Nov 10 2022
Ali Çehreli <acehreli yahoo.com> writes:
On 11/10/22 13:45, Siarhei Siamashka wrote:

 (your implementation has a significantly
 higher variance)
Agreed. I didn't think the following "random" values would magically be
correct. :)

    // Larger this value, closer the results to normal distribution
    enum normalDistributianness = 1000;

    // Window around the average of values to pick randomly within
    const halfWindow = average / 2;

I was half joking anyway, but the values 4_000 and 'average / 20',
respectively, produce closer values:

[16663617, 16663620, 16662052, 16671587, 16663818, 16675306]

 And for comparison, a [Python/SciPy
 implementation](https://github.com/scipy/scipy/issues/17388#issuecomment-1310913940)
 outputs:

     [16664115, 16666144, 16677962, 16660692, 16663181, 16667906]

But I am still half joking. :)

Ali
Nov 10 2022
Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Thursday, 10 November 2022 at 22:18:47 UTC, Ali Çehreli wrote:
 I was half joking anyway but the values 4_000 and 'average / 
 20', respectively produce closer values:

 [16663617, 16663620, 16662052, 16671587, 16663818, 16675306]
Yes, this particular result looks okish (when tested by the XNomial R
package):

    P value (LLR) = 0.11807 ± 0.00102

You can modify your code this way:

```D
void main() {
    auto results = diceDistrib!6(100_000_000);
    auto expected_probabilities = [1.0 / 6].replicate(6);
    writefln("# paste the following into https://rdrr.io/cran/XNomial/");
    writefln("library(XNomial)");
    writefln("experimental_results <- c(%(%s, %))", results);
    writefln("expected_probabilities <- c(%(%.18f, %))", expected_probabilities);
    writefln("xmonte(experimental_results, expected_probabilities)");
}
```

And then paste its output to https://rdrr.io/cran/XNomial/ to calculate
the p-value. If the reported p-value is smaller than 0.05, then we can
be 95% confident that the generator is working incorrectly. A p-value
higher than 0.05 means that the test can't see anything wrong, but this
doesn't guarantee anything and isn't a proof of correctness.

Even a correct generator can sporadically have p-values smaller than
0.05, but this happens only in roughly 1 out of 20 runs (in other
words, there's a 5% false positive chance). This situation is explained
by https://xkcd.com/882/
Nov 10 2022
"H. S. Teoh" <hsteoh qfbox.info> writes:
On Thu, Nov 10, 2022 at 09:45:56PM +0000, Siarhei Siamashka via Digitalmars-d
wrote:
 On Thursday, 10 November 2022 at 17:46:59 UTC, Ali Çehreli wrote:
 On 11/10/22 08:57, H. S. Teoh wrote:
 
 if there's a fast way to compute the standard deviation of the
 multinomial
According to the Wikipedia page on multinomial distribution (linked by
Timon), the variance of X_i for n rolls of a k-sided die (with
probability p_i), where i is a specific outcome, is:

	Var(X_i) = n*p_i*(1 - p_i)

Don't really understand where this formula came from (as I said, that
page is way above my head), but we can make use of it. In this case,
p_i = 1/k since we're assuming unbiased dice rolls. So this reduces to:

	Var(X_i) = (n/k)*(1 - 1/k)

Now, since the standard deviation is just the square root of the
variance, we have:

	sigma = sqrt((n/k)*(1 - 1/k))

which is easy to compute. Furthermore, the mean is just:

	E(X_i) = n*p_i = n/k

We can then plug this into the Box-Muller transform (which I found on
Wikipedia :-P):

	import std.random, std.math;
	const sigma = sqrt((n/k)*(1.0 - 1.0/k));
	auto u1 = uniform01();
	auto u2 = uniform01();
	auto mag = sigma*sqrt(-2*log(u1));
	auto z0 = mag * cos(2*PI*u2) + n/k;
	auto z1 = mag * sin(2*PI*u2) + n/k;

which sets z0 and z1 to be two random variables having mean n/k and
variance Var(X_i). So they can be used to populate two elements of the
resulting array, and should have the correct distribution. Repeat this
procedure until all array elements are populated. (If the output array
length is odd, i.e., k is odd, we can just discard the last z1.)

Then compute the difference d between the sum of the array and n (the
expected sum), which should be relatively small, and run d iterations
of individual dice rolls, adjusting the array (++ if the sum is too
small, or -- if the sum is too large) to make it sum exactly to n.
 I am happy with my solution which includes scientific terms like
 normalDistributianness. :p
 
 [...]
 
 Error checking etc. are missing but hundred million dice throws in a
 split second... Now that's fast! :p
 That's fast, but unfortunately doesn't seem to be correct. Your code
 outputs something like this:

     [16548205, 16741274, 16600777, 16792301, 16585808, 16731635]

 While the original slow reference implementation by H. S. Teoh outputs:

     [16669388, 16665882, 16668019, 16671186, 16664718, 16660807]

 It's easy to see even with the naked eye that the statistical
 properties are obviously not the same (your implementation has a
 significantly higher variance). And for comparison, a [Python/SciPy
 implementation](https://github.com/scipy/scipy/issues/17388#issuecomment-1310913940)
 outputs:

     [16664115, 16666144, 16677962, 16660692, 16663181, 16667906]
[...]

I implemented what I described above, and got very good-looking
results. Output:
-----------
1 modules passed unittests
[16666837, 16666500, 16664692, 16666306, 16663568, 16672097] sum=100000000
[16661247, 16666674, 16668562, 16673476, 16665867, 16664174] sum=100000000
[16663070, 16665221, 16664850, 16668355, 16668246, 16670258] sum=100000000
[16671423, 16665663, 16667209, 16669164, 16663916, 16662625] sum=100000000
[16666194, 16669936, 16662697, 16666535, 16666671, 16667967] sum=100000000
[16665977, 16667094, 16662753, 16670737, 16666762, 16666677] sum=100000000
[16665518, 16666940, 16671829, 16664487, 16665344, 16665882] sum=100000000
[16658486, 16666671, 16667778, 16665901, 16667471, 16673693] sum=100000000
[16668149, 16669431, 16661462, 16668241, 16668286, 16664431] sum=100000000
[16660879, 16671401, 16659154, 16671245, 16667720, 16669601] sum=100000000
[16665213, 16670794, 16665401, 16667548, 16663484, 16667560] sum=100000000
[16663091, 16671196, 16669946, 16670791, 16661162, 16663814] sum=100000000
[16664305, 16670070, 16662123, 16663467, 16673867, 16666168] sum=100000000
[16664395, 16667406, 16657623, 16673838, 16670227, 16666511] sum=100000000
[16671401, 16662181, 16672923, 16657766, 16674160, 16661569] sum=100000000
[16663059, 16669574, 16662338, 16668319, 16670541, 16666169] sum=100000000
[16672366, 16662296, 16666526, 16664284, 16669201, 16665327] sum=100000000
[16665787, 16669038, 16666956, 16668728, 16660096, 16669395] sum=100000000
[16669162, 16669034, 16662788, 16663672, 16669127, 16666217] sum=100000000
[16667016, 16669649, 16661551, 16670882, 16668304, 16662598] sum=100000000
-----------

(I've no idea how accurate it is, but visual inspection certainly
indicates that it's doing the right thing. And it's lightning-fast.
:-P)

Code (compile with `dmd -unittest -main`):
------------------------------------------
import std;

/// Box-Muller transform.
double[2] gaussianPoint(double[2] mean, double deviation)
{
    import std.math : cos, log, sin, sqrt, PI, round;
    import std.random : uniform01;

    auto u = uniform01();
    auto v = uniform01();
    auto x0 = sqrt(-2.0*log(u)) * cos(2.0*PI*v);
    auto y0 = sqrt(-2.0*log(u)) * sin(2.0*PI*v);

    double[2] result;
    result[0] = mean[0] + cast(int)round(deviation * x0);
    result[1] = mean[1] + cast(int)round(deviation * y0);
    return result;
}

/// Roll k-sided dice N times and tally each output.
int[] diceDistrib(int k, int N)
in (k > 0)
in (N > 0)
out (r; r.sum == N)
{
    import std.math : sqrt, round;
    import std.range : chunks;
    import std.random : uniform;

    const mean = cast(double)N / k;
    const deviation = sqrt(mean * (1.0 - 1.0 / k));
    double[2] means = [ mean, mean ];

    // Populate output array with initial values with the correct mean and
    // standard deviation.
    auto result = new int[k];
    foreach (chunk; result.chunks(2)) {
        auto z = gaussianPoint(means, deviation);
        chunk[0] = cast(int) round(z[0]);
        if (chunk.length > 1)
            chunk[1] = cast(int) round(z[1]);
    }

    // Tweak resulting array until it sums exactly to N.
    auto total = result.sum;
    while (total != N) {
        auto i = uniform(0, k);
        if (total < N) {
            result[i]++;
            total++;
        } else {
            result[i]--;
            total--;
        }
    }

    return result;
}

unittest {
    foreach (i; 0 .. 20) {
        import std.stdio;
        auto N = 100_000_000;
        auto dist = diceDistrib(6, N);
        writefln("%s sum=%d", dist, dist[].sum);
        assert(dist.sum == N);
    }
}
------------------------------------------

Now, just for fun, I added a writeln before the `while (total != N)`
loop to print out just how big of a discrepancy to N the array sum is.
Turns out, quite a bit: for N = 100_000_000 as above, the discrepancies
range from approximately -25000 to 25000, with the typical discrepancy
being about 3-4 digits long, which means that the code is spending
quite a bit of time in that final loop. Which suggests that a possible
improvement is to recursively run the initial approximation step,
setting sub_N = (N - array.sum).

I'll leave that to a future iteration, though. Being able to compute a
hundred million dice rolls in a split second is already good enough for
what I need. :-D


T

-- 
Answer: Because it breaks the logical sequence of discussion. / Question: Why is top posting bad?
Nov 10 2022
Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Thursday, 10 November 2022 at 23:15:24 UTC, H. S. Teoh wrote:
 Being able to compute a hundred million dice rolls in a split 
 second is already good enough for what I need. :-D
How important is it for you to actually have a statistically correct
solution to this particular problem? If something is off, then this may
eventually be discovered by somebody in the future. Here's one famous
example: https://www.wondriumdaily.com/gregor-mendel-fake-data/
Nov 11 2022
parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Fri, Nov 11, 2022 at 10:52:47AM +0000, Siarhei Siamashka via Digitalmars-d
wrote:
 On Thursday, 10 November 2022 at 23:15:24 UTC, H. S. Teoh wrote:
 Being able to compute a hundred million dice rolls in a split second
 is already good enough for what I need. :-D
 How important is it for you to actually have a statistically correct
 solution to this particular problem? If something is off, then this
 may eventually be discovered by somebody in the future. Here's one
 famous example: https://www.wondriumdaily.com/gregor-mendel-fake-data/
Relax, this isn't for generating fake data. :-D It's for a simulation, and actually my use case mainly involves small values of N. So technically I don't need to optimize it to this extent; it's just a nice-to-have and a fun exercise to make it resistant to performance degradation by unusually large inputs. T -- Mediocrity has been pushed to extremes.
Nov 11 2022
Abdulhaq <alynch4048 gmail.com> writes:
On Friday, 11 November 2022 at 13:27:45 UTC, H. S. Teoh wrote:
 On Fri, Nov 11, 2022 at 10:52:47AM +0000, Siarhei Siamashka via 
 Digitalmars-d wrote:
 Relax, this isn't for generating fake data. :-D  It's for a 
 simulation, and actually my use case mainly involves small 
 values of N. So technically I don't need to optimize it to this 
 extent; it's just a nice-to-have and a fun exercise to make it 
 resistant to performance degradation by unusually large inputs.


 T
I guess it simulates the simulator.
Nov 11 2022
Timon Gehr <timon.gehr gmx.ch> writes:
On 11.11.22 14:27, H. S. Teoh wrote:
 On Fri, Nov 11, 2022 at 10:52:47AM +0000, Siarhei Siamashka via Digitalmars-d
wrote:
 On Thursday, 10 November 2022 at 23:15:24 UTC, H. S. Teoh wrote:
 Being able to compute a hundred million dice rolls in a split second
 is already good enough for what I need. :-D
 How important is it for you to actually have a statistically correct
 solution to this particular problem? If something is off, then this
 may eventually be discovered by somebody in the future. Here's one
 famous example: https://www.wondriumdaily.com/gregor-mendel-fake-data/
Relax, this isn't for generating fake data. :-D It's for a simulation, and actually my use case mainly involves small values of N. So technically I don't need to optimize it to this extent; it's just a nice-to-have and a fun exercise to make it resistant to performance degradation by unusually large inputs. T
I think the question was more: does it matter to you whether or not the
simulation models an accurate distribution of results? What kind of
code is consuming those dice roll frequency tables?

Note that even though the results are random, the distribution of
results itself is not random and can in principle be compared
precisely.

It's pretty clear that you are not getting the right distribution,
although I have not investigated in detail how much what you simulate
deviates from the true multinomial distribution that you actually
attempt to simulate. Still, I guess for some use cases, the deviation
would be significant.
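A minimal sketch of one such comparison, using Pearson's chi-square
statistic against the expected uniform counts (the `chiSquare` helper
is written here purely for illustration; the sample counts are reused
from earlier in the thread):

```D
import std.algorithm : sum;
import std.stdio;

// Pearson's chi-square statistic for observed counts against a uniform
// expectation of N/k per bucket; larger values mean a worse fit.
double chiSquare(const uint[] observed)
{
    const n = observed.sum;
    const expected = cast(double) n / observed.length;
    double stat = 0.0;
    foreach (o; observed) {
        const d = o - expected;
        stat += d * d / expected;
    }
    return stat;
}

void main()
{
    // Counts quoted earlier from the reference implementation.
    const uint[] counts = [16669388, 16665882, 16668019,
                           16671186, 16664718, 16660807];
    // With k - 1 = 5 degrees of freedom, values far above ~11 (the 5%
    // critical value) would be suspicious.
    writefln("chi-square = %.3f", chiSquare(counts));
}
```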
Nov 12 2022
Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Thursday, 10 November 2022 at 23:15:24 UTC, H. S. Teoh wrote:
 According to the Wikipedia page on multinomial distribution 
 (linked by
 Timon), it states that the variance of X_i for n rolls of a 
 k-sided dice
 (with probability p_i), where i is a specific outcome, is:

 	Var(X_i) = n*p_i*(1 - p_i)

 Don't really understand where this formula came from (as I 
 said, that page is way above my head), but we can make use of 
 it.
This is where things take a wrong turn. In reality you need more than
just a matching mean and variance to correctly simulate some arbitrary
probability distribution:
https://en.wikipedia.org/wiki/Moment_(mathematics)

Every n-th moment needs to be correct too. Some of these moments have
special names (n=1 mean, n=2 variance, n=3 skewness, n=4 kurtosis,
...). If you only take care of the mean and variance for simulating a
random distribution, then it's somewhat similar to approximating
"sin(x) = x - (x^3 / 3!)" by taking only the first few terms of the
Taylor series.

I wonder what's the reason for not using the mir-random library as
suggested in the earlier comments? Do you want to avoid having an extra
dependency?

```D
/+dub.sdl:
dependency "mir-random" version="~>2.2.19"
+/
import std, mir.random.engine, mir.random.ndvariable;

uint[k] diceDistrib(uint k)(uint N)
    in(k > 0)
    in(N > 0)
    out(r; r[].sum == N)
{
    uint[k] result;
    double[k] p;
    p[] = 1.0 / k;
    auto rv = multinomialVar(N, p);
    rv(rne, result[]);
    return result;
}

void main() {
    writeln(diceDistrib!6(100_000_000));
}
```
Nov 12 2022
prev sibling next sibling parent "H. S. Teoh" <hsteoh qfbox.info> writes:
On Thu, Nov 10, 2022 at 03:15:24PM -0800, H. S. Teoh via Digitalmars-d wrote:
[...]
 Now, just for fun, I added a writeln before the `while (total != N)`
 to print out just how big of a discrepancy to N the array sum is.
 Turns out, quite a bit: for N = 100_000_000 as above, the
 discrepancies range from approximately -25000 to 25000, with the
 typical discrepancy being about 3-4 digits long, which means that the
 code is spending quite a bit of time in that final loop.  Which
 suggests that a possible improvement is to recursively run the initial
 approximation step, setting sub_N = (N - array.sum).
 
 I'll leave that to a future iteration, though.  Being able to compute
 a hundred million dice rolls in a split second is already good enough
 for what I need. :-D
[...]

OK, I couldn't resist, the idea is so tempting. So I made a new
implementation that iteratively uses the Box-Muller transform to close
the gap between array.sum and N, resorting to individual rolls only
when the difference is < k. For N = 100_000_000, it typically takes
about 3-5 iterations to bring the difference down to < k, so the entire
algorithm takes about k+5 iterations to compute the result.

Now, in theory, I could just run the Box-Muller estimate until the
difference is 0, but I found that once the difference is small, it
tends to bounce around 0 for several iterations before converging on 0.
So I arbitrarily decided to stop when the difference is < k, and use
individual rolls to do the rest.

One wrinkle that came up is that for small values of N, sometimes the
result array can end up with negative elements, either due to
overcompensation during the iterative Box-Muller stage (discrepancy > 0
and the selected z values happen to be larger than the current array
element), or during the final adjustment loop (it picks an array
element that's already 0 and tries to decrement it). So I had to insert
extra checks to discard generated z values if they would cause the
result array to have negative counts. Generally, this happens only for
small values of N; for large N the initial estimate is large enough
that it's extremely unlikely for a later adjustment to overshoot into
negative values. To ensure the output is never negative, I added
another out-contract to check this.

Code:
------------
import std.algorithm;
import std.conv : text;

double[2] gaussianPoint(double[2] mean, double deviation)
{
    import std.math : cos, log, sin, sqrt, PI, round;
    import std.random : uniform01;

    auto u = uniform01();
    auto v = uniform01();
    auto x0 = sqrt(-2.0*log(u)) * cos(2.0*PI*v);
    auto y0 = sqrt(-2.0*log(u)) * sin(2.0*PI*v);

    double[2] result;
    result[0] = mean[0] + cast(int)round(deviation * x0);
    result[1] = mean[1] + cast(int)round(deviation * y0);
    return result;
}

/**
 * Simulates rolling N k-sided dice.
 *
 * Returns: A static array representing how many of each of 1..(k+1) were
 * obtained by the rolls.
 *
 * Bugs: The current implementation uses a naïve algorithm that's rather
 * inefficient for large N.  For our purposes, though, which involve only
 * relatively small N, this is Good Enough(tm) for the time being. We can
 * look into improving this if it becomes a performance bottleneck.
 */
int[] diceDistrib(int k, int N)
in (k > 0)
in (N > 0)
out (r; r.sum == N)
out (r; r.all!(c => c >= 0), r.text)
{
    import std.math : abs, sgn, sqrt, round;
    import std.range : chunks;
    import std.random : uniform;
    debug import std.stdio;

    // Populate output array with initial values with the correct mean and
    // standard deviation.
    auto result = new int[k];
    auto discrepancy = N - result.sum;
    while (abs(discrepancy) > k) {
        debug writefln("discrepancy=%d", discrepancy);
        const mean = cast(double)abs(discrepancy) / k;
        const deviation = sqrt(mean * (1.0 - 1.0 / k));
        double[2] means = [ mean, mean ];
        double sign = sgn(discrepancy);
        foreach (chunk; result.chunks(2)) {
            do {
                auto z = gaussianPoint(means, deviation);
                auto c0 = chunk[0] + cast(int) round(sign*z[0]);
                if (c0 < 0)
                    continue;   // discard nonsensical result
                if (chunk.length > 1) {
                    auto c1 = chunk[1] + cast(int) round(sign*z[1]);
                    if (c1 < 0)
                        continue;   // discard nonsensical result
                    chunk[1] = c1;
                }
                chunk[0] = c0;
            } while (false);
        }
        discrepancy = N - result.sum;
    }
    debug writefln("final discrepancy=%d", discrepancy);

    // Tweak resulting array until it sums exactly to N.
    auto total = result.sum;
    while (total != N) {
        auto i = uniform(0, k);
        if (total < N) {
            result[i]++;
            total++;
        } else if (result[i] > 0) {
            result[i]--;
            total--;
        }
    }

    return result;
}

unittest {
    import std.stdio;
    foreach (i; 0 .. 5) {
        auto N = 100_000_000;
        auto dist = diceDistrib(6, N);
        debug writefln("%s sum=%d", dist, dist[].sum);
        assert(dist.sum == N);
    }
    foreach (i; 0 .. 5) {
        auto N = 10;
        auto dist = diceDistrib(6, N);
        debug writefln("%s sum=%d", dist, dist[].sum);
        assert(dist.sum == N);
    }
}
------------

Typical output:
------------
1 modules passed unittests
discrepancy=1000000000
discrepancy=-94251
discrepancy=287
final discrepancy=5
[166665591, 166651866, 166681043, 166654104, 166673559, 166673837] sum=1000000000
discrepancy=1000000000
discrepancy=20314
discrepancy=-88
discrepancy=-9
final discrepancy=6
[166668490, 166665588, 166648866, 166658164, 166684475, 166674417] sum=1000000000
discrepancy=1000000000
discrepancy=14062
discrepancy=18
final discrepancy=5
[166658497, 166674399, 166652111, 166667954, 166672801, 166674238] sum=1000000000
discrepancy=1000000000
discrepancy=15926
discrepancy=-175
discrepancy=-19
final discrepancy=-4
[166664463, 166659610, 166655243, 166678783, 166673507, 166668394] sum=1000000000
discrepancy=1000000000
discrepancy=39916
discrepancy=201
discrepancy=-16
final discrepancy=2
[166666774, 166663400, 166672276, 166686269, 166652312, 166658969] sum=1000000000
discrepancy=10
final discrepancy=-1
[2, 2, 2, 3, 0, 1] sum=10
discrepancy=10
final discrepancy=-5
[2, 1, 3, 2, 1, 1] sum=10
discrepancy=10
final discrepancy=-1
[1, 0, 1, 1, 4, 3] sum=10
discrepancy=10
final discrepancy=-3
[3, 2, 2, 1, 0, 2] sum=10
discrepancy=10
final discrepancy=-2
[1, 1, 2, 2, 2, 2] sum=10
------------

Note that the last 5 outputs are for a different test case (N=10), just
to make sure that we don't produce nonsensical outputs when N is small.


T

-- 
Leather is waterproof. Ever see a cow with an umbrella?
Nov 10 2022
prev sibling parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Thu, Nov 10, 2022 at 05:27:07PM -0800, H. S. Teoh via Digitalmars-d wrote:
[...]
 /**
  * Simulates rolling N k-sided dice.
  *
  * Returns: A static array representing how many of each of 1..(k+1) were
  * obtained by the rolls.
  *
  * Bugs: The current implementation uses a naïve algorithm that's rather
  * inefficient for large N.  For our purposes, though, which involve only
  * relatively small N, this is Good Enough(tm) for the time being. We can look
  * into improving this if it becomes a performance bottleneck.
  */
[...]

Argh, that last paragraph should be deleted. :-(

See, this is proof that documentation goes out of sync with the code.
We need a new innovation, along the lines of built-in unittests, to
revolutionize keeping docs up-to-date with the code.


T

-- 
The best compiler is between your ears. -- Michael Abrash
Nov 10 2022
Max Haughton <maxhaton gmail.com> writes:
On Friday, 11 November 2022 at 01:47:54 UTC, H. S. Teoh wrote:
 On Thu, Nov 10, 2022 at 05:27:07PM -0800, H. S. Teoh via 
 Digitalmars-d wrote: [...]
 /**
  * Simulates rolling N k-sided dice.
  *
  * Returns: A static array representing how many of each of 
 1..(k+1) were
  * obtained by the rolls.
  *
  * Bugs: The current implementation uses a naïve algorithm 
 that's rather
  * inefficient for large N.  For our purposes, though, which 
 involve only
  * relatively small N, this is Good Enough(tm) for the time 
 being. We can look
  * into improving this if it becomes a performance bottleneck.
  */
[...] Argh, that last paragraph should be deleted. :-( See, this is proof that documentation goes out of sync with the code. We need a new innovation, along the lines of built-in unittests, to revolutionize keeping docs up-to-date with the code. T
Built-in AI model? (I joke, but this is probably the future; whoever
gets it right will make a lot of money.)

Describing what code *does* is actually already relatively doable using
contemporary AI techniques, but that isn't documentation (or at least
isn't good documentation; being told that `x + y` adds two numbers
together doesn't tell me anything useful).
Nov 10 2022
parent "H. S. Teoh" <hsteoh qfbox.info> writes:
On Fri, Nov 11, 2022 at 03:34:31AM +0000, Max Haughton via Digitalmars-d wrote:
 On Friday, 11 November 2022 at 01:47:54 UTC, H. S. Teoh wrote:
[...]
 See, this is proof that documentation goes out of sync with the
 code. We need a new innovation, along the lines of built-in
 unittests, to revolutionize keeping docs up-to-date with the code.
[...]
 Built in AI model? (I joke but this is probably the future, whoever
 gets it right will make a lot of money).
 
 Describing what code *does* is actually already relatively doable
 using contemporary AI techniques, but that isn't documentation (or at
 least isn't good documentation; being told that `x + y` adds two
 numbers together doesn't tell me anything useful).
I don't think AI will ever be able to replace real documentation. It
can describe what the code does, but like you said, that isn't good
documentation. I mean, I already know `x + y` adds two numbers (if I
didn't, I shouldn't be touching the code in the first place). The AI
saying this for me doesn't really add any value. What makes good
documentation is to explain *why* I'm adding two numbers. I don't think
any AI will be able to divine my intention in adding two numbers; that
has to be written by the programmer.

One of the factors that make D unittests so awesome is the fact that
you can put them right in the same source file as you're coding in,
and in fact, quite often right next to the function they're testing.
This proximity is what makes people more likely to write unittests,
and more inclined to keep them up-to-date (as opposed to, y'know,
opening up an external unittest file and commenting out / disabling a
troublesome test just to get things to run). Even if somebody
`version(none)`'d out a unittest, it's still sitting right there in
the source file, sticking out like a sore thumb and begging silently
but persistently "fix me, fix me, you know you should!".

One idea that comes to mind w.r.t. documentation is to allow the docs
to be written not only in the comment header, but also inside the
function body, next to the block of code responsible for some
particular task. It would have some tags to help ddoc figure out where
in the doc block it goes, so that there is some control over the order
of presentation. But the comment would be sitting next to the offending
code, so when somebody changes the code, the comment would be staring
right at them: "update me, update me, c'mon I know you know you
should!"


T

-- 
Why do conspiracy theories always come from the same people??
Nov 10 2022
Abdulhaq <alynch4048 gmail.com> writes:
On Monday, 21 November 2022 at 10:44:48 UTC, Cabal Powel wrote:
 On Thursday, 10 November 2022 at 02:10:32 UTC, H. S. Teoh wrote:
 Hope this will help you.
The original article that you copied this from has some useful images: https://towardsdatascience.com/modelling-the-probability-distributions-of-dice-b6ecf87b24ea
Nov 21 2022
parent "H. S. Teoh" <hsteoh qfbox.info> writes:
On Mon, Nov 21, 2022 at 11:08:24AM +0000, Abdulhaq via Digitalmars-d wrote:
 On Monday, 21 November 2022 at 10:44:48 UTC, Cabal Powel wrote:
 On Thursday, 10 November 2022 at 02:10:32 UTC, H. S. Teoh wrote:
 
 Hope this will help you.
The original article that you copied this from has some useful images: https://towardsdatascience.com/modelling-the-probability-distributions-of-dice-b6ecf87b24ea
It's a copy-n-paste spambot promoting a link and masquerading as something legitimate by posting (an incomplete version of) a relevant article. T -- In a world without fences, who needs Windows and Gates? -- Christian Surchi
Nov 21 2022