www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - [Bench!][Mir] +54%..+185% performance boost for Mersenne Twister.

reply Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
Hey,

32-bit Mt19937 random number Generator is default in Phobos.
It is default in Mir too, except that 64-bit targets use 64-bit 
Mt19937 instead.

The last Mir Random beta improves performance for Mt19937.

The goal was to:

1. Improve RNG generation performance by making code more 
friendly for CPU pipelining. Tempering (finalization) operations 
was mixed with internal payload update operations.

2. Reduce number of required comparison operations using 
unification with add/sub operations. The order of payload 
iteration was reversed (the engine generates the same numbers as 
before).

3. Reduce memory access twice both for generation (required for 
1) and initialization. This will make the algorithm more 
cache-friendly for real world applications. Ditto (1)

Bench results:
----
mir.random 32-bit Mt19937:
     6.80851 Gb/s

mir.random 64-bit Mt19937:
    12.5984 Gb/s

std.random 32-bit Mt19937:
    4.41989 Gb/s

std.random 64-bit Mt19937:
     wrong initialization and tempering (finalization)
----
54%: 1.54 = 6.80851 / 4.41989
185%: 2.85 = 12.5984 / 4.41989

The benchmark can be found here:
https://github.com/libmir/mir-random/blob/master/bench/Mt19937_bench.d

Ads:
We are looking for new contributors and partners! Star and 
contribute!

https://github.com/libmir :

  - mir - Generic numeric library
  - dcv - D Computer Vision library
  - mir-glas - Linear Algebra Subprograms (written in D)
  - mir-random - random numbers generators, includes non-uniform 
distributions.
  - mir-cpuid - CPU identification

Future:
  - mir-runtime - lightweight always inlined `nothrow  nogc` Dlang 
Runtime
  - mir-neural - neural network library [WIP]
  - mir-fft - Fast and multidimensional FFT
  - mir-svm - support vector machines

Cheers,
Ilya & Mir Team
Nov 26 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 11/26/16 11:31 AM, Ilya Yaroshenko wrote:
 Hey,

 32-bit Mt19937 random number Generator is default in Phobos.
 It is default in Mir too, except that 64-bit targets use 64-bit Mt19937
 instead.

 The last Mir Random beta improves performance for Mt19937.

 The goal was to:
Congrats! Also thanks for using the Boost license which would allow backporting the improvements to Phobos. Who'd be up for it? Also I'm thinking of removing std.random's dependency on druntime, e.g. by removing the uses of enforce. Thoughts? Andrei
Nov 26 2016
next sibling parent Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
On Saturday, 26 November 2016 at 20:13:36 UTC, Andrei 
Alexandrescu wrote:
 On 11/26/16 11:31 AM, Ilya Yaroshenko wrote:
 Hey,

 32-bit Mt19937 random number Generator is default in Phobos.
 It is default in Mir too, except that 64-bit targets use 
 64-bit Mt19937
 instead.

 The last Mir Random beta improves performance for Mt19937.

 The goal was to:
Congrats! Also thanks for using the Boost license which would allow backporting the improvements to Phobos. Who'd be up for it?
Yep, I am using Phobos parts and we can do backports. In other hand it is requires efforts (I have experience with ndslice, and I will be happy to remove ndslice from Mir after LDC 1.2.0 release).
 Also I'm thinking of removing std.random's dependency on 
 druntime, e.g. by removing the uses of enforce. Thoughts?

 Andrei
Another one DRuntime dependency is core time. BTW, I would be happy to have core.time + std.datetime as unified nothrow nogc library without DRuntime dependency (ahah, it is BetterC propaganda!) Ilya
Nov 26 2016
prev sibling next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Saturday, 26 November 2016 at 20:13:36 UTC, Andrei 
Alexandrescu wrote:
 Also I'm thinking of removing std.random's dependency on 
 druntime, e.g. by removing the uses of enforce. Thoughts?
There's no strong reason for those checks to be done via `enforce` except for a design decision that user input should be treated as external input. The user is in complete control of those parameters, there's no reason they can't be `assert` checks.
Nov 26 2016
prev sibling next sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Saturday, 26 November 2016 at 20:13:36 UTC, Andrei 
Alexandrescu wrote:
 Congrats! Also thanks for using the Boost license which would 
 allow backporting the improvements to Phobos. Who'd be up for 
 it?

 Also I'm thinking of removing std.random's dependency on 
 druntime, e.g. by removing the uses of enforce. Thoughts?

 Andrei
Without druntime, global ctor/dtor and TLS can't be used too. Besides nothrow nogc, std.random relies on a global ctor for creating the default RNG, which is used liberally. A useful intermediate step is to have these "[shared] static this" ctor call a function instead, so that programs without druntime can call them too.
Nov 27 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 11/27/2016 08:28 AM, Guillaume Piolat wrote:
 On Saturday, 26 November 2016 at 20:13:36 UTC, Andrei Alexandrescu wrote:
 Congrats! Also thanks for using the Boost license which would allow
 backporting the improvements to Phobos. Who'd be up for it?

 Also I'm thinking of removing std.random's dependency on druntime,
 e.g. by removing the uses of enforce. Thoughts?

 Andrei
Without druntime, global ctor/dtor and TLS can't be used too. Besides nothrow nogc, std.random relies on a global ctor for creating the default RNG, which is used liberally.
I see, so TLS is a problem.
 A useful intermediate step is to have these "[shared] static this" ctor
 call a function instead, so that programs without druntime can call them
 too.
That would be progress. Andrei
Nov 27 2016
parent reply Guillaume Piolat <first.last gmail.com> writes:
On Sunday, 27 November 2016 at 13:35:48 UTC, Andrei Alexandrescu 
wrote:
 A useful intermediate step is to have these "[shared] static 
 this" ctor
 call a function instead, so that programs without druntime can 
 call them
 too.
That would be progress. Andrei
Same story for core.cpuid which is initialized by a shared static constructor. https://github.com/dlang/druntime/blob/master/src/core/cpuid.d It seems the "shared static this" is allowed to break immutable but a function wouldn't. So maybe the variables here should be private __gshared?.
Nov 27 2016
next sibling parent Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
On Sunday, 27 November 2016 at 13:58:29 UTC, Guillaume Piolat 
wrote:
 On Sunday, 27 November 2016 at 13:35:48 UTC, Andrei 
 Alexandrescu wrote:
 A useful intermediate step is to have these "[shared] static 
 this" ctor
 call a function instead, so that programs without druntime 
 can call them
 too.
That would be progress. Andrei
Same story for core.cpuid which is initialized by a shared static constructor. https://github.com/dlang/druntime/blob/master/src/core/cpuid.d It seems the "shared static this" is allowed to break immutable but a function wouldn't. So maybe the variables here should be private __gshared?.
They are in Mir CPUID, but methods are not pure. https://github.com/libmir/mir-cpuid/blob/master/source/cpuid/x86_any.d
Nov 27 2016
prev sibling parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 11/27/16 8:58 AM, Guillaume Piolat wrote:
 On Sunday, 27 November 2016 at 13:35:48 UTC, Andrei Alexandrescu wrote:
 A useful intermediate step is to have these "[shared] static this" ctor
 call a function instead, so that programs without druntime can call them
 too.
That would be progress.
Same story for core.cpuid which is initialized by a shared static constructor. https://github.com/dlang/druntime/blob/master/src/core/cpuid.d It seems the "shared static this" is allowed to break immutable but a function wouldn't. So maybe the variables here should be private __gshared?.
How we do a similar thing for core.time: https://github.com/dlang/druntime/blob/master/src/core/time.d#L2388 This function is called from the druntime startup (before any static ctors). Not sure if it helps or not. Note that we can cheat the type system here because we know that nothing else will access the data before it's initialized. There is also a check in the function to ensure it's only done once. -Steve
Nov 28 2016
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2016-11-27 14:28, Guillaume Piolat wrote:

 Without druntime, global ctor/dtor and TLS can't be used too.
Isn't TLS initialized by the OS? -- /Jacob Carlborg
Nov 28 2016
prev sibling next sibling parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Saturday, 26 November 2016 at 20:13:36 UTC, Andrei 
Alexandrescu wrote:
 Congrats! Also thanks for using the Boost license which would 
 allow backporting the improvements to Phobos. Who'd be up for 
 it?
I've finally found a moment to look into this (I'm at home recovering from a seasonal virus, which ironically puts some time and mental space in my hands). It makes for an interesting comparison. The traditional Mersenne Twister algorithm generates N random words in one go, and then iterates over them, before generating another N random words, and so on. (Word == unsigned integer type.) By contrast Ilya's implementation just updates the single next entry in the static array that stores the state of the engine. This would seem to work out cheaper on average: perhaps because the update-a-variate code stays hotter, perhaps because it involves less branching and looping. It certainly makes for a simpler, easier to follow update. However, it might be a good idea to benchmark this in a context where there are a lot of random variates being generated, but where other things are also happening that might render the update code less "hot" than a straightforward `foreach` loop benchmark. (Maybe try it out with a Monte Carlo simulation ...?) The tradeoff between many cheap updates and one expensive one, versus all updates being a little more expensive but less on average, might not be viable in some practical use-cases. The Phobos implementation also includes a check on whether the generator is properly seeded, which is performed in every `popFront()`, which has a reasonably significant impact. Ilya's implementation can get away with not performing that check because it uses ` disable this();` to remove the possibility of initializing the generator without it being properly seeded. The impact of that check can possibly be lessened by making it a boolean comparison rather than the existing `size_t` equality comparison. -- action item here: could we deprecate `this()` in Phobos as a precursor to removing the opportunity to instantiate a generator without properly initializing it? I'm going to try to put together a range-based version to see if this also makes any difference. I'll post some benchmarks of my own once that's done, and if all looks good I'll try to put a Phobos PR together. A few questions -- not blockers, but nice to understand: * Ilya, is this implementation your own design, or do you have a reference for the rationale behind this revised implementation of MT? * Is there a particular reason why the `index` variable is a `Uint` (i.e. the word type of the generator) rather than a `size_t`? I presume there's a motivation, since the casting back and forth in `opCall` would otherwise seem inconvenient. * Is there a motivation for the reverse iteration through the RNG state array?
Dec 13 2016
next sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Tuesday, 13 December 2016 at 18:15:25 UTC, Joseph Rushton 
Wakeling wrote:
 I'm going to try to put together a range-based version to see 
 if this also makes any difference.  I'll post some benchmarks 
 of my own once that's done, and if all looks good I'll try to 
 put a Phobos PR together.
Benchmark is here: https://github.com/WebDrake/mersenne-twister-range Overall these benchmarks seem to have reasonably wide variance. I note that the original benchmark's attempt to make the code "hot" had quite a strong impact in terms of the ordering of individual benchmarks; I've tried to work around that by doing the "make it hot" loop immediately before the benchmark. This is still just a plain old foreach, so it may be worth exploring some benchmarks where other stuff is done with the random variates. As a first step I've added some benchmarks using `uniform`. Sample output from 2 different runs of these benchmarks: ----- Running ./mir-mt-range opCall Mir 32-bit Mt19937: 6.68058 Gb/s check sum = 429472035921730457 range Mir 32-bit Mt19937: 6.639 Gb/s check sum = 429472031909629874 opCall Mir 64-bit Mt19937: 14.0351 Gb/s check sum = 5606740663277085587 range Mir 64-bit Mt19937: 13.8229 Gb/s check sum = 3636092631268073456 Phobos 32-bit Mt19937: 4.22164 Gb/s check sum = 429472035921730457 Phobos 64-bit Mt19937: wrong initialization and tempering range Mir 32-bit Mt19937: 95.4198 U(0.0, 1.0) variates/s range Mir 64-bit Mt19937: 87.184 U(0.0, 1.0) variates/s Phobos 32-bit Mt19937: 80.8407 U(0.0, 1.0) variates/s ---- ---- Running ./mir-mt-range opCall Mir 32-bit Mt19937: 6.36183 Gb/s check sum = 429472035921730457 range Mir 32-bit Mt19937: 6.54397 Gb/s check sum = 429472031909629874 opCall Mir 64-bit Mt19937: 13.7339 Gb/s check sum = 5606740663277085587 range Mir 64-bit Mt19937: 13.8229 Gb/s check sum = 3636092631268073456 Phobos 32-bit Mt19937: 4.18848 Gb/s check sum = 429472035921730457 Phobos 64-bit Mt19937: wrong initialization and tempering range Mir 32-bit Mt19937: 95.2381 U(0.0, 1.0) variates/s range Mir 64-bit Mt19937: 97.0874 U(0.0, 1.0) variates/s Phobos 32-bit Mt19937: 88.1057 U(0.0, 1.0) variates/s ----
Dec 13 2016
prev sibling parent reply Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
On Tuesday, 13 December 2016 at 18:15:25 UTC, Joseph Rushton 
Wakeling wrote:
 On Saturday, 26 November 2016 at 20:13:36 UTC, Andrei 
 Alexandrescu wrote:
 Congrats! Also thanks for using the Boost license which would 
 allow backporting the improvements to Phobos. Who'd be up for 
 it?
I've finally found a moment to look into this (I'm at home recovering from a seasonal virus, which ironically puts some time and mental space in my hands). It makes for an interesting comparison. The traditional Mersenne Twister algorithm generates N random words in one go, and then iterates over them, before generating another N random words, and so on. (Word == unsigned integer type.) By contrast Ilya's implementation just updates the single next entry in the static array that stores the state of the engine. This would seem to work out cheaper on average: perhaps because the update-a-variate code stays hotter, perhaps because it involves less branching and looping. It certainly makes for a simpler, easier to follow update. However, it might be a good idea to benchmark this in a context where there are a lot of random variates being generated, but where other things are also happening that might render the update code less "hot" than a straightforward `foreach` loop benchmark. (Maybe try it out with a Monte Carlo simulation ...?) The tradeoff between many cheap updates and one expensive one, versus all updates being a little more expensive but less on average, might not be viable in some practical use-cases. The Phobos implementation also includes a check on whether the generator is properly seeded, which is performed in every `popFront()`, which has a reasonably significant impact. Ilya's implementation can get away with not performing that check because it uses ` disable this();` to remove the possibility of initializing the generator without it being properly seeded. The impact of that check can possibly be lessened by making it a boolean comparison rather than the existing `size_t` equality comparison. -- action item here: could we deprecate `this()` in Phobos as a precursor to removing the opportunity to instantiate a generator without properly initializing it? I'm going to try to put together a range-based version to see if this also makes any difference. I'll post some benchmarks of my own once that's done, and if all looks good I'll try to put a Phobos PR together. A few questions -- not blockers, but nice to understand: * Ilya, is this implementation your own design, or do you have a reference for the rationale behind this revised implementation of MT?
My own.
   * Is there a particular reason why the `index` variable is a 
 `Uint` (i.e.
     the word type of the generator) rather than a `size_t`?  I 
 presume there's
     a motivation, since the casting back and forth in `opCall` 
 would otherwise
     seem inconvenient.
If I remember correctly it is used with Using, so I use the same type.
   * Is there a motivation for the reverse iteration through the 
 RNG state array?
Yes, it composes add/sub with cmp operations. Ilya
Dec 13 2016
next sibling parent Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
 If I remember correctly it is used with Using, so I use the 
 same type.
Using -> Uint Sorry, it is phone keyboard
Dec 13 2016
prev sibling parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Tuesday, 13 December 2016 at 23:18:26 UTC, Ilya Yaroshenko 
wrote:
   *  Ilya, is this implementation your own design, or do you 
 have a reference
     for the rationale behind this revised implementation of MT?
My own.
Congratulations, then -- I think this is a very interesting rewrite of the algorithm and I would encourage you to write up a paper on your choice of optimizations, if that is of interest to you.
   * Is there a particular reason why the `index` variable is a 
 `Uint` (i.e.
     the word type of the generator) rather than a `size_t`?  I 
 presume there's
     a motivation, since the casting back and forth in `opCall` 
 would otherwise
     seem inconvenient.
If I remember correctly it is used with Uint, so I use the same type.
Not anywhere that I can find. `n`, the template parameter from which `this.index` is initialized, is a `size_t`, while the value used inside `opCall` is a `sizediff_t`. TBH I assumed it was related to the bit-packing of the struct, rather than any question of usage.
   * Is there a motivation for the reverse iteration through 
 the RNG state array?
Yes, it composes add/sub with cmp operations.
Cool. Last question: IIUC you use the private `_z` parameter as a cache for the most recent `data[index]` value (and AFAICT that's the only use it has). Is there a good reason for doing this, rather than just setting `z = data[index]` inside `opCall` (where `z` is the local parameter inside the method)? I ask because it doesn't seem to make any practical difference to the benchmarking outcomes, and you have to look up `data[index]` anyway when you write it, so it's not obvious to me that it really saves anything in practice. (I confess I haven't looked at the assembly, so I might be missing something.) Thanks for the answers, and apologies for my slow follow-up on some of the things we discussed earlier related to mir.random -- it's been a bit of a busy patch ... :-\
Dec 13 2016
parent Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
On Wednesday, 14 December 2016 at 00:18:06 UTC, Joseph Rushton 
Wakeling wrote:
 Cool.  Last question: IIUC you use the private `_z` parameter 
 as a cache for the most recent `data[index]` value (and AFAICT 
 that's the only use it has).  Is there a good reason for doing 
 this, rather than just setting `z = data[index]` inside 
 `opCall` (where `z` is the local parameter inside the method)?
Because memory access locality. z is always hot.
Dec 14 2016
prev sibling parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Saturday, 26 November 2016 at 20:13:36 UTC, Andrei 
Alexandrescu wrote:
 On 11/26/16 11:31 AM, Ilya Yaroshenko wrote:
 Hey,

 32-bit Mt19937 random number Generator is default in Phobos.
 It is default in Mir too, except that 64-bit targets use 
 64-bit Mt19937 instead.
Congrats! Also thanks for using the Boost license which would allow backporting the improvements to Phobos. Who'd be up for it?
PR is now on Phobos, and in the process of looking at this together, Ilya and I resolved a number of issues both in this code and in mir.random: https://github.com/dlang/phobos/pull/5011 There is unfortunately one serious blocker to this PR. In order to allow it to be a drop-in replacement for the previous Phobos implementation, I had to find a means to default-initialize its internal state at compile time so that it would match what would be there if it was explicitly seeded with the default seed. The method I developed works fine with LDC, but fails with DMD: the internal state of the generator winds up as all zeros, except for the `State.z` parameter which mysteriously ends up at the correct value. This would suggest that somehow the generated state is being overwritten _after_ being correctly generated, rather than not correctly generated in the first place. Any chance that someone with more CTFE-fu than myself could take a glance and advise what might be wrong (or how to work around it)?
Jan 07 2017
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/7/17 6:16 PM, Joseph Rushton Wakeling wrote:
 The method I developed works fine with LDC, but fails with DMD: the
 internal state of the generator winds up as all zeros, except for the
 `State.z` parameter which mysteriously ends up at the correct value.
 This would suggest that somehow the generated state is being overwritten
 _after_ being correctly generated, rather than not correctly generated
 in the first place.

 Any chance that someone with more CTFE-fu than myself could take a
 glance and advise what might be wrong (or how to work around it)?
This indicates a compiler bug in dmd itself, so the best outcome for this library work would be a reduced compiler bug + a simple library workaround. -- Andrei
Jan 07 2017
parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Sunday, 8 January 2017 at 02:51:51 UTC, Andrei Alexandrescu 
wrote:
 This indicates a compiler bug in dmd itself, so the best 
 outcome for this library work would be a reduced compiler bug + 
 a simple library workaround. -- Andrei
Yes, that much is clear -- sorry if I wasn't clear enough myself. I'm asking for eyes on the problem because reducing it to a minimal example appears non-trivial, while the bug itself looks serious beyond its effect on this PR. For what it's worth: * it appears to be a backend issue, not frontend (using dmd with the same frontend as the latest ldc doesn't make a difference); * creating an `enum` manifest constant from the CTFE function works fine, but assigning to the runtime struct field from that enum produces the same mostly-zeros result; * it definitely doesn't appear to be something as trivial as just default initializing a struct instance (or a struct instance including a static array of more than a certain size) being the problem. I will keep trying to reduce it to a minimal example, but CTFE-related backend bugs are not something I have a lot of experience with tracking down, hence the request for help. Would filing an issue still be helpful at this stage, even without a minimal example?
Jan 08 2017
parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Sunday, 8 January 2017 at 13:16:29 UTC, Joseph Rushton 
Wakeling wrote:
 I'm asking for eyes on the problem because reducing it to a 
 minimal example appears non-trivial, while the bug itself looks 
 serious beyond its effect on this PR.
I underestimated myself :-P Minimal example is as follows: ///////////////////////////////////////////////////////////////// struct Inner { uint value = void; // <=== removing this `void` initializer // removes the problem } struct Outer { Inner inner = defaultInner(); static Inner defaultInner() { if (!__ctfe) assert(false); Inner inn; inn.value = 23; return inn; } } void main() { import std.stdio : writeln; Outer outer; outer.inner.writeln; // outputs `Inner(0)` with dmd // but should be `Inner(23)` // ldc gets it right ;-) } ///////////////////////////////////////////////////////////////// So, the bug comes down to priority of default initialization. I'll follow an issue later today and fix the PR accordingly.
Jan 08 2017
prev sibling next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Saturday, 26 November 2016 at 16:31:40 UTC, Ilya Yaroshenko 
wrote:
 https://github.com/libmir :

  - mir - Generic numeric library
  - dcv - D Computer Vision library
  - mir-glas - Linear Algebra Subprograms (written in D)
  - mir-random - random numbers generators, includes non-uniform 
 distributions.
  - mir-cpuid - CPU identification

 Future:
  - mir-runtime - lightweight always inlined `nothrow  nogc` 
 Dlang Runtime
  - mir-neural - neural network library [WIP]
  - mir-fft - Fast and multidimensional FFT
  - mir-svm - support vector machines

 Cheers,
 Ilya & Mir Team
How does a mir-lapack fit in to the plan going forward?
Nov 26 2016
parent reply Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
On Sunday, 27 November 2016 at 02:28:06 UTC, jmh530 wrote:
 On Saturday, 26 November 2016 at 16:31:40 UTC, Ilya Yaroshenko 
 wrote:
 https://github.com/libmir :

  - mir - Generic numeric library
  - dcv - D Computer Vision library
  - mir-glas - Linear Algebra Subprograms (written in D)
  - mir-random - random numbers generators, includes 
 non-uniform distributions.
  - mir-cpuid - CPU identification

 Future:
  - mir-runtime - lightweight always inlined `nothrow  nogc` 
 Dlang Runtime
  - mir-neural - neural network library [WIP]
  - mir-fft - Fast and multidimensional FFT
  - mir-svm - support vector machines

 Cheers,
 Ilya & Mir Team
How does a mir-lapack fit in to the plan going forward?
LAPACK API will be a part of Mir GLAS https://github.com/libmir/mir-glas --Ilya
Nov 26 2016
parent jmh530 <john.michael.hall gmail.com> writes:
On Sunday, 27 November 2016 at 06:46:20 UTC, Ilya Yaroshenko 
wrote:
 LAPACK API will be a part of Mir GLAS
 https://github.com/libmir/mir-glas --Ilya
Great. I might have asked before, apologies if so.
Nov 27 2016
prev sibling next sibling parent reply =?UTF-8?B?Tm9yZGzDtnc=?= <per.nordlow gmail.com> writes:
On Saturday, 26 November 2016 at 16:31:40 UTC, Ilya Yaroshenko 
wrote:
 Bench results:
 ----
 mir.random 32-bit Mt19937:
     6.80851 Gb/s
Does Gb mean Gigabytes or Gigabits?
Nov 29 2016
parent Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
On Tuesday, 29 November 2016 at 16:54:55 UTC, Nordlöw wrote:
 On Saturday, 26 November 2016 at 16:31:40 UTC, Ilya Yaroshenko 
 wrote:
 Bench results:
 ----
 mir.random 32-bit Mt19937:
     6.80851 Gb/s
Does Gb mean Gigabytes or Gigabits?
Gigabits
Nov 29 2016
prev sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Saturday, 26 November 2016 at 16:31:40 UTC, Ilya Yaroshenko 
wrote:
 1. Improve RNG generation performance by making code more 
 friendly for CPU pipelining. Tempering (finalization) 
 operations was mixed with internal payload update operations.
A note on this. The `opCall` (or, in the range version, `popFront`) of Ilya's implementation mixes together two superficially independent actions: (1) calculating the current random variate from the current index of the internal state array; (2) updating the current index of the internal state array, and moving to the next entry. It's straightforward to split out these two procedures into two separate methods (or at least two clearly separated sequences within the `opCall`), but doing so results in a notable performance hit (on my machine, something in the order of 1 GB/s less random bits). Intertwining these steps in this way is therefore a very smart optimization (although TBH it feels a little worrying that it's necessary).
Dec 14 2016