digitalmars.D - std.benchmark is in reviewable state

Andrei Alexandrescu (16/16) Sep 25 2011 I've had a good time with std.benchmark today and made it ready for

bearophile (4/5) Sep 25 2011 Is the micron symbol visible on Windows too?

Adam D. Ruppe (6/7) Sep 25 2011 Yes, of course.

Vladimir Panteleev (13/20) Sep 25 2011 Have you tried it? It's not a matter of installed fonts.

Adam D. Ruppe (4/5) Sep 25 2011 I was thinking of the website documentation, not the program itself.
Andrej Mitrovic (4/4) Sep 25 2011 I'm thinking we could shorten the function names. benchmarkResume is a

so (6/10) Sep 25 2011 All the workarounds to this (that i know of) instead of solving the issu...
Regan Heath (5/9) Sep 26 2011 What's wrong with "benchResume"?

Andrej Mitrovic (2/3) Sep 26 2011 I'm ok with that.

Andrei Alexandrescu (4/7) Sep 26 2011 I see negative progress from "benchmarkResume" to "benchResume". I think...

Andrei Alexandrescu (3/6) Sep 25 2011 Ionno, haven't tried.

Jonathan M Davis (6/13) Sep 25 2011 core.time.Duration has alwayse used the micron symbol in its toString. W...

Robert Jacques (8/30) Sep 25 2011 Wikipedia also lists some of the challenges to modern benchmarking: http...

Andrei Alexandrescu (12/70) Sep 25 2011 std.benchmark repeats the measurement in a loop, discounts the times

Robert Jacques (6/80) Sep 25 2011 First, a core jump is guaranteed to, essentially, poison the cache. Whic...

Andrei Alexandrescu (18/62) Sep 26 2011 Well it will be one of many runs. The questions are if (a) core jumps

Jens Mueller (4/19) Sep 26 2011 500?

Robert Jacques (2/21) Sep 26 2011 The minimum, automatic time window is 10 ms. Given the scaling factor, t...

Robert Jacques (14/79) Sep 26 2011 Great, but MS still recommends benchmarking be done on a single core. An...

Jonathan M Davis (8/11) Sep 26 2011 And how would you ensure that? Beyond having your application be single-

Timon Gehr (2/13) Sep 26 2011 There are API functions to do that on any reasonable OS.

Robert Jacques (2/17) Sep 27 2011 Such as SetThreadAffinityMask, which I mentioned in my original post.

Marco Leise (3/28) Sep 28 2011 sched_setaffinity (per process / Linux)

Andrei Alexandrescu (7/18) Sep 26 2011 I guess this may have relevance only on multithreaded apps, but it's

Robert Jacques (4/21) Sep 27 2011 No problem. 10 is a good default and is generally what I use.

travert phare.normalesup.org (Christophe) (11/27) Sep 28 2011 Sorry to question you, but what makes you say that the minimum is more

Robert Jacques (5/30) Sep 28 2011 The reason that the mean + std is so pervasive in science is that it's p...

travert phare.normalesup.org (Christophe) (4/4) Sep 28 2011 I am reassured now.

Don (27/45) Sep 29 2011 I think that's quite misleading. Microsoft has made some assumptions in

so (17/33) Sep 25 2011 I know it does same thing under the hood. Yet, wouldn't
Walter Bright (4/7) Sep 25 2011 We already have std.perf:

Jerry Quinn (3/8) Sep 26 2011 std.perf doesn't show up in the nav bar to the left. I didn't know it e...

Jonathan M Davis (8/16) Sep 26 2011 I don't think that it ever showed up in the online docs. I believe that ...

Jose Armando Garcia (20/21) Sep 26 2011 Very nice. Really like the functionality. Really like how you

Andrei Alexandrescu (12/35) Sep 26 2011 Not sure whether the compiler allows it, but it shouldn't work if one

Jonathan M Davis (5/24) Sep 26 2011 You might want to check the compilability of your code. I merged your
Jonathan M Davis (5/29) Sep 26 2011 That is, the unittest target fails to compile (due to a template error)....
travert phare.normalesup.org (Christophe) (10/30) Sep 28 2011 Nice piece of work.
Andrei Alexandrescu (14/16) Sep 28 2011 [snip]

Jens Mueller (97/122) Oct 03 2011 Overall I like this module very much.

Andrei Alexandrescu (49/104) Oct 03 2011 On 10/3/11 8:59 AM, Jens Mueller wrote:

Jens Mueller (16/103) Oct 03 2011 I just mean that the relative speed ups printed in the table does not
Andrej Mitrovic (3/3) Oct 03 2011 With this new module we might as well implement __MODULE__ or
Jacob Carlborg (16/27) Oct 03 2011 Not necessarily. The user could provide a delegate as a callback,

Andrei Alexandrescu (6/30) Oct 03 2011 That's what I meant when I said "we'd need to expose the benchmark

Jacob Carlborg (5/38) Oct 03 2011 Hehe, Ok. I don't think it gets THAT much more complicated, just call

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

I've had a good time with std.benchmark today and made it ready for 
submission to Phobos. As a perk I've added the %h format specifier, 
which formats a number in human-readable form using SI prefixes.

For those interested in discussing this prior to the formal review 
process, the code is at:

https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
https://github.com/andralex/phobos/blob/benchmark/std/format.d

and the dox at

http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
http://erdani.com/d/web/phobos-prerelease/std_format.html

Comments and suggestions are welcome. If possible, it would be great if 
this submission were given a bump in the priority queue. This is because 
with robust benchmarking in Phobos we can ask new, performance-sensitive 
submissions, to add benchmarking.


Thanks,

Andrei

Sep 25 2011

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:

 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html

Is the micron symbol visible on Windows too?

Bye,
bearophile

Sep 25 2011

Adam D. Ruppe <destructionator gmail.com> writes:

bearophile wrote:
 Is the micron symbol visible on Windows too?

Yes, of course.

It's a matter of fonts though - if you don't have the character
in your font (regardless of operating system) it might not show.

The micro sign is very common though, so unlikely to have
any font without it.

Sep 25 2011

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Mon, 26 Sep 2011 05:03:32 +0300, Adam D. Ruppe  
<destructionator gmail.com> wrote:

 bearophile wrote:
 Is the micron symbol visible on Windows too?

 Yes, of course.

 It's a matter of fonts though - if you don't have the character
 in your font (regardless of operating system) it might not show.

 The micro sign is very common though, so unlikely to have
 any font without it.

Have you tried it? It's not a matter of installed fonts.

There are two problems with Unicode characters in console D programs:

1. The default code page on Windows is practically never UTF-8.

You could change the code page to UTF-8 using SetConsoleOutputCP, but  
weird things happen if you don't change it back before your program exits:  
if your program was called from a batch file, all batch files are  
instantly aborted.

2. The default font for the Windows console is not a Unicode font.

-- 
Best regards,
  Vladimir                            mailto:vladimir thecybershadow.net

Sep 25 2011

Adam D. Ruppe <destructionator gmail.com> writes:

Vladimir Panteleev wrote:
 There are two problems with Unicode characters in console D programs:

I was thinking of the website documentation, not the program itself.

The symbols works for the website, but not really for the program.
(even most my linux terminals aren't UTF-8)

Sep 25 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

I'm thinking we could shorten the function names. benchmarkResume is a
mouthful, and we already came to the conclusion that we shouldn't
repeat module names in function names. But I'm all out of ideas for
naming things, sorry. :)

Sep 25 2011

so <so so.so> writes:

On Mon, 26 Sep 2011 06:14:59 +0300, Andrej Mitrovic  
<andrej.mitrovich gmail.com> wrote:

 I'm thinking we could shorten the function names. benchmarkResume is a
 mouthful, and we already came to the conclusion that we shouldn't
 repeat module names in function names. But I'm all out of ideas for
 naming things, sorry. :)

All the workarounds to this (that i know of) instead of solving the issue,  
makes the situation worse. I guess the C way was the best.
benchmark::Resume, benchmark.Resume benchmarkResume. This is palatable to  
me, what i hate to see is things like array.Array.

Sep 25 2011

"Regan Heath" <regan netmail.co.nz> writes:

On Mon, 26 Sep 2011 04:14:59 +0100, Andrej Mitrovic  
<andrej.mitrovich gmail.com> wrote:

 I'm thinking we could shorten the function names. benchmarkResume is a
 mouthful, and we already came to the conclusion that we shouldn't
 repeat module names in function names. But I'm all out of ideas for
 naming things, sorry. :)

What's wrong with "benchResume"?

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/

Sep 26 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 9/26/11, Regan Heath <regan netmail.co.nz> wrote:
 What's wrong with "benchResume"?

I'm ok with that.

Sep 26 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/26/11 8:07 AM, Andrej Mitrovic wrote:
 On 9/26/11, Regan Heath<regan netmail.co.nz>  wrote:
 What's wrong with "benchResume"?

 I'm ok with that.

I see negative progress from "benchmarkResume" to "benchResume". I think 
I'll simply name them "pause" and "resume".

Andrei

Sep 26 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/25/11 8:41 PM, bearophile wrote:
 Andrei Alexandrescu:

 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html

 Is the micron symbol visible on Windows too?

Ionno, haven't tried.

Andrei

Sep 25 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, September 25, 2011 21:51:49 Andrei Alexandrescu wrote:
 On 9/25/11 8:41 PM, bearophile wrote:
 Andrei Alexandrescu:
 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html

 
 Is the micron symbol visible on Windows too?

 
 Ionno, haven't tried.

core.time.Duration has alwayse used the micron symbol in its toString. Worst 
case, I would think that you'd get a box or other symbol which indicates that 
the symbol is unknown, and since the micron symbol is the only one used which 
isn't ASCII, it'll be fairly obvious what it was meant to be.

- Jonathan M Davis

Sep 25 2011

"Robert Jacques" <sandford jhu.edu> writes:

On Sun, 25 Sep 2011 21:08:45 -0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 I've had a good time with std.benchmark today and made it ready for
 submission to Phobos. As a perk I've added the %h format specifier,
 which formats a number in human-readable form using SI prefixes.

 For those interested in discussing this prior to the formal review
 process, the code is at:

 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d

 and the dox at

 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html

 Comments and suggestions are welcome. If possible, it would be great if
 this submission were given a bump in the priority queue. This is because
 with robust benchmarking in Phobos we can ask new, performance-sensitive
 submissions, to add benchmarking.


 Thanks,

 Andrei

Andrei, good job on the framework design, but you've left all of the design
flaws of the current benchmark routine. To quote myself from last Tuesday:

 On Tue, 20 Sep 2011 14:01:05 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 [snip]

 Thank you for making this more meaningful! I assumed the standard
 library benchmark function would take care of those things. Should it?

Yes and no. Benchmark provides a good way to make a single measurement of
 a function, as for really short functions you do have to loop many timesto be
able to get a reliable reading. However, actual benchmarking requiresa) tuning
the benchmark() call time to about 10-20 ms and b) runningbenchmark() many
times, taking the minimum. The idea is that on any givenrun you could hit a
context switch, etc. so if you make multiple run, onewill get lucky and not be
interrupted. Worse, if a core switch happensbetween StopWatch start and end,
the number of ticks returned is random.Hence, the comment to manually limit
execution to a single core. So, itmight be nice if benchmark() also took a
second parameter, denoting thenumber of times to benchmark the function and had
some better documentationon how to do good benchmarking.

Wikipedia also lists some of the challenges to modern benchmarking:
http://en.wikipedia.org/wiki/Time_Stamp_Counter

So, std.benchmark should
[ ] Set the affinity to a single thread at start (i.e. use
SetProcessAffinityMask, etc)
[ ] Repeat the measurement N times, taking the min.
[X] Adaptively increase the number of loop iterations to ensure a valid reading.
[ ] Adaptively decrease the number of loop iterations to ensure minimal context
switching.

Sep 25 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/25/11 9:02 PM, Robert Jacques wrote:
 On Sun, 25 Sep 2011 21:08:45 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 I've had a good time with std.benchmark today and made it ready for
 submission to Phobos. As a perk I've added the %h format specifier,
 which formats a number in human-readable form using SI prefixes.

 For those interested in discussing this prior to the formal review
 process, the code is at:

 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d

 and the dox at

 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html

 Comments and suggestions are welcome. If possible, it would be great if
 this submission were given a bump in the priority queue. This is because
 with robust benchmarking in Phobos we can ask new, performance-sensitive
 submissions, to add benchmarking.


 Thanks,

 Andrei

 Andrei, good job on the framework design, but you've left all of the
 design flaws of the current benchmark routine. To quote myself from last
 Tuesday:

 On Tue, 20 Sep 2011 14:01:05 -0400, Timon Gehr <timon.gehr gmx.ch>
 wrote:

 [snip]

 Thank you for making this more meaningful! I assumed the standard
 library benchmark function would take care of those things. Should it?

 Yes and no. Benchmark provides a good way to make a single measurement of
 a function, as for really short functions you do have to loop many
 timesto be able to get a reliable reading. However, actual
 benchmarking requiresa) tuning the benchmark() call time to about
 10-20 ms and b) runningbenchmark() many times, taking the minimum. The
 idea is that on any givenrun you could hit a context switch, etc. so
 if you make multiple run, onewill get lucky and not be interrupted.
 Worse, if a core switch happensbetween StopWatch start and end, the
 number of ticks returned is random.Hence, the comment to manually
 limit execution to a single core. So, itmight be nice if benchmark()
 also took a second parameter, denoting thenumber of times to benchmark
 the function and had some better documentationon how to do good
 benchmarking.

 Wikipedia also lists some of the challenges to modern benchmarking:
 http://en.wikipedia.org/wiki/Time_Stamp_Counter

 So, std.benchmark should
 [ ] Set the affinity to a single thread at start (i.e. use
 SetProcessAffinityMask, etc)

I think that won't influence most measurements measurably.

 [ ] Repeat the measurement N times, taking the min.

std.benchmark repeats the measurement in a loop, discounts the times 
spent in the iteration proper, and divides total time by the number of 
iterations to figure the time per iteration. This has the advantage that 
works with even very fast functions without letting the counter itself 
affect the timing.

 [X] Adaptively increase the number of loop iterations to ensure a valid
 reading.

The current system tries 10, 100, 1000 iterations until it gets to a 
total run time of at least 0.5 seconds. That's quite a generous margin, 
I plan to tighten it.

 [ ] Adaptively decrease the number of loop iterations to ensure minimal
 context switching.

OK.


Andrei

Sep 25 2011

"Robert Jacques" <sandford jhu.edu> writes:

On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 On 9/25/11 9:02 PM, Robert Jacques wrote:
 On Sun, 25 Sep 2011 21:08:45 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 I've had a good time with std.benchmark today and made it ready for
 submission to Phobos. As a perk I've added the %h format specifier,
 which formats a number in human-readable form using SI prefixes.

 For those interested in discussing this prior to the formal review
 process, the code is at:

 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d

 and the dox at

 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html

 Comments and suggestions are welcome. If possible, it would be great if
 this submission were given a bump in the priority queue. This is because
 with robust benchmarking in Phobos we can ask new, performance-sensitive
 submissions, to add benchmarking.


 Thanks,

 Andrei

 Andrei, good job on the framework design, but you've left all of the
 design flaws of the current benchmark routine. To quote myself from last
 Tuesday:

 On Tue, 20 Sep 2011 14:01:05 -0400, Timon Gehr <timon.gehr gmx.ch>
 wrote:

 [snip]

 Thank you for making this more meaningful! I assumed the standard
 library benchmark function would take care of those things. Should it?

 Yes and no. Benchmark provides a good way to make a single measurement of
 a function, as for really short functions you do have to loop many
 timesto be able to get a reliable reading. However, actual
 benchmarking requiresa) tuning the benchmark() call time to about
 10-20 ms and b) runningbenchmark() many times, taking the minimum. The
 idea is that on any givenrun you could hit a context switch, etc. so
 if you make multiple run, onewill get lucky and not be interrupted.
 Worse, if a core switch happensbetween StopWatch start and end, the
 number of ticks returned is random.Hence, the comment to manually
 limit execution to a single core. So, itmight be nice if benchmark()
 also took a second parameter, denoting thenumber of times to benchmark
 the function and had some better documentationon how to do good
 benchmarking.

 Wikipedia also lists some of the challenges to modern benchmarking:
 http://en.wikipedia.org/wiki/Time_Stamp_Counter

 So, std.benchmark should
 [ ] Set the affinity to a single thread at start (i.e. use
 SetProcessAffinityMask, etc)

 I think that won't influence most measurements measurably.

First, a core jump is guaranteed to, essentially, poison the cache. Which, at
least in my eyes, invalidates that particular timing run.

Second, timing generally relies on the CPUs Time Stamp Counter, which is not
multi-thread safe; a core switch invalidates all previous TSC values, and
hence, the time measurement itself. Furthermore, the TSC is not even guaranteed
to have a fixed frequency on some CPUs. Now there are ways around the problems
of the TSC, but even so:

(From the Wikipedia)
"Under Windows platforms, Microsoft strongly discourages using the TSC for
high-resolution timing for exactly these reasons, providing instead the Windows
APIs QueryPerformanceCounter and QueryPerformanceFrequency.[2] Even when using
these functions, Microsoft recommends the code to be locked to a single CPU."

 [ ] Repeat the measurement N times, taking the min.

 std.benchmark repeats the measurement in a loop, discounts the times
 spent in the iteration proper, and divides total time by the number of
 iterations to figure the time per iteration. This has the advantage that
 works with even very fast functions without letting the counter itself
 affect the timing.

Which is necessity but not sufficient for proper benchmarking. The timers
themselves are noisy, to say nothing of the effects of context switches,
warm-up, etc on a run. During the ptr+length vs ptr+ptr discussion on Tuesday,
the naive use benchmark lead to some very wrong conclusions simply because any
single run had up to ~5% of error. The only way to get rid of this error, is to
make multiple runs.

 [X] Adaptively increase the number of loop iterations to ensure a valid
 reading.

 The current system tries 10, 100, 1000 iterations until it gets to a
 total run time of at least 0.5 seconds. That's quite a generous margin,
 I plan to tighten it.

 [ ] Adaptively decrease the number of loop iterations to ensure minimal
 context switching.

 OK.


 Andrei

Sep 25 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/25/11 11:56 PM, Robert Jacques wrote:
 On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 So, std.benchmark should
 [ ] Set the affinity to a single thread at start (i.e. use
 SetProcessAffinityMask, etc)

 I think that won't influence most measurements measurably.

 First, a core jump is guaranteed to, essentially, poison the cache.
 Which, at least in my eyes, invalidates that particular timing run.

Well it will be one of many runs. The questions are if (a) core jumps 
are likely to happen during a benchmark run, (b) many enough will happen 
to influence a 500ms time window.

 Second, timing generally relies on the CPUs Time Stamp Counter, which is
 not multi-thread safe; a core switch invalidates all previous TSC
 values, and hence, the time measurement itself. Furthermore, the TSC is
 not even guaranteed to have a fixed frequency on some CPUs. Now there
 are ways around the problems of the TSC, but even so:

 (From the Wikipedia)
 "Under Windows platforms, Microsoft strongly discourages using the TSC
 for high-resolution timing for exactly these reasons, providing instead
 the Windows APIs QueryPerformanceCounter and
 QueryPerformanceFrequency.[2] Even when using these functions, Microsoft
 recommends the code to be locked to a single CPU."

std.benchmark uses QueryPerformanceCounter on Windows and 
clock_gettime/gettimeofday on Unix.

 [ ] Repeat the measurement N times, taking the min.

 std.benchmark repeats the measurement in a loop, discounts the times
 spent in the iteration proper, and divides total time by the number of
 iterations to figure the time per iteration. This has the advantage that
 works with even very fast functions without letting the counter itself
 affect the timing.

 Which is necessity but not sufficient for proper benchmarking. The
 timers themselves are noisy, to say nothing of the effects of context
 switches, warm-up, etc on a run. During the ptr+length vs ptr+ptr
 discussion on Tuesday, the naive use benchmark lead to some very wrong
 conclusions simply because any single run had up to ~5% of error. The
 only way to get rid of this error, is to make multiple runs.

You may want to take a look at the code. This is a long argument ending 
with "...make multiple runs" as if std.benchmark is not doing them and 
you suggest it should. Again, std.benchmark runs the tested function in 
a loop in powers of 10 until the entire run lasts at least 500ms.

 [X] Adaptively increase the number of loop iterations to ensure a valid
 reading.

 The current system tries 10, 100, 1000 iterations until it gets to a
 total run time of at least 0.5 seconds. That's quite a generous margin,
 I plan to tighten it.

 [ ] Adaptively decrease the number of loop iterations to ensure minimal
 context switching.

 OK.


In fact upon further thinking the iterations should not be too few. 
Context switches and other activity is inevitable, but the effects are 
negligible over long periods (as half a second would be).

To conclude, I'm unsure what steps you suggest to take to improve 
std.benchmark, and I'd be grateful if you could clarify them after 
perusing its code and documentation.


Thanks,

Andrei

Sep 26 2011

Jens Mueller <jens.k.mueller gmx.de> writes:

Andrei Alexandrescu wrote:
 On 9/25/11 11:56 PM, Robert Jacques wrote:
On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
So, std.benchmark should
[ ] Set the affinity to a single thread at start (i.e. use
SetProcessAffinityMask, etc)

I think that won't influence most measurements measurably.

First, a core jump is guaranteed to, essentially, poison the cache.
Which, at least in my eyes, invalidates that particular timing run.

 
 Well it will be one of many runs. The questions are if (a) core
 jumps are likely to happen during a benchmark run, (b) many enough
 will happen to influence a 500ms time window.

500?
To me it looks as if the time window is 10ms long (line 447)?

Jens

Sep 26 2011

"Robert Jacques" <sandford jhu.edu> writes:

On Mon, 26 Sep 2011 09:56:27 -0400, Jens Mueller <jens.k.mueller gmx.de> wrote:

 Andrei Alexandrescu wrote:
 On 9/25/11 11:56 PM, Robert Jacques wrote:
On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
So, std.benchmark should
[ ] Set the affinity to a single thread at start (i.e. use
SetProcessAffinityMask, etc)

I think that won't influence most measurements measurably.

First, a core jump is guaranteed to, essentially, poison the cache.
Which, at least in my eyes, invalidates that particular timing run.

 Well it will be one of many runs. The questions are if (a) core
 jumps are likely to happen during a benchmark run, (b) many enough
 will happen to influence a 500ms time window.

 500?
 To me it looks as if the time window is 10ms long (line 447)?

 Jens

The minimum, automatic time window is 10 ms. Given the scaling factor, the
actual time window will be between 10 ms and 100 ms. IIRC A context switch
occurs every 15 ms or less. And given that you'd be starting the time window at
a random point inside the active time slice, at the minimum (i.e. 10 ms time
window) you have a 66% chance of a switch happening.

Sep 26 2011

"Robert Jacques" <sandford jhu.edu> writes:

On Mon, 26 Sep 2011 09:39:28 -0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 On 9/25/11 11:56 PM, Robert Jacques wrote:
 On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 So, std.benchmark should
 [ ] Set the affinity to a single thread at start (i.e. use
 SetProcessAffinityMask, etc)

 I think that won't influence most measurements measurably.

 First, a core jump is guaranteed to, essentially, poison the cache.
 Which, at least in my eyes, invalidates that particular timing run.

 Well it will be one of many runs. The questions are if (a) core jumps
 are likely to happen during a benchmark run, (b) many enough will happen
 to influence a 500ms time window.

I don't know exactly how likely core jump are, but context switches definitely
happen. And yes, in my experience, they influence a 50 ms time window.
Remember, context switches have two outcomes: your thread continues running or
your thread is paused. But the timer doesn't pause, when your thread pauses. So
all you need is one bad switch to see an effect.

 Second, timing generally relies on the CPUs Time Stamp Counter, which is
 not multi-thread safe; a core switch invalidates all previous TSC
 values, and hence, the time measurement itself. Furthermore, the TSC is
 not even guaranteed to have a fixed frequency on some CPUs. Now there
 are ways around the problems of the TSC, but even so:

 (From the Wikipedia)
 "Under Windows platforms, Microsoft strongly discourages using the TSC
 for high-resolution timing for exactly these reasons, providing instead
 the Windows APIs QueryPerformanceCounter and
 QueryPerformanceFrequency.[2] Even when using these functions, Microsoft
 recommends the code to be locked to a single CPU."

 std.benchmark uses QueryPerformanceCounter on Windows and
 clock_gettime/gettimeofday on Unix.

Great, but MS still recommends benchmarking be done on a single core. And if MS
thinks that is how benchmarking should be done, I think that's how we should do
it.

 From MSDN documentation on QueryPerformanceCounter:
"On a multiprocessor computer, it should not matter which processor is called.
However, you can get different results on different processors due to bugs in
the basic input/output system (BIOS) or the hardware abstraction layer (HAL).
To specify processor affinity for a thread, use the SetThreadAffinityMask
function."

 [ ] Repeat the measurement N times, taking the min.

 std.benchmark repeats the measurement in a loop, discounts the times
 spent in the iteration proper, and divides total time by the number of
 iterations to figure the time per iteration. This has the advantage that
 works with even very fast functions without letting the counter itself
 affect the timing.

 Which is necessity but not sufficient for proper benchmarking. The
 timers themselves are noisy, to say nothing of the effects of context
 switches, warm-up, etc on a run. During the ptr+length vs ptr+ptr
 discussion on Tuesday, the naive use benchmark lead to some very wrong
 conclusions simply because any single run had up to ~5% of error. The
 only way to get rid of this error, is to make multiple runs.

 You may want to take a look at the code. This is a long argument ending
 with "...make multiple runs" as if std.benchmark is not doing them and
 you suggest it should. Again, std.benchmark runs the tested function in
 a loop in powers of 10 until the entire run lasts at least 500ms.

*sigh* By multiple runs, I mean multiple runs of the benchmark function,
_including_ the loops. From my post last Tuesday:

     double min_f1 = int.max;
     double min_f2 = int.max;
     foreach(i;0..10_000) {
         auto b=benchmark!(f1,f2)(3);
         min_f1 = min(min_f1, b[0].to!("seconds",double));
         min_f2 = min(min_f2, b[1].to!("seconds",double));
     }

All the looping that benchmark currently does is to lengthen the total time a
function takes. This makes it possible to avoid sampling errors with regard to
the performance counter itself. But you are still only making a single, noisy
measurement of a (meta)function. In order to get a robust result, you need to
make multiple measurements and take the minimum.

 [X] Adaptively increase the number of loop iterations to ensure a valid
 reading.

 The current system tries 10, 100, 1000 iterations until it gets to a
 total run time of at least 0.5 seconds. That's quite a generous margin,
 I plan to tighten it.

 [ ] Adaptively decrease the number of loop iterations to ensure minimal
 context switching.

 OK.


 In fact upon further thinking the iterations should not be too few.
 Context switches and other activity is inevitable, but the effects are
 negligible over long periods (as half a second would be).

 To conclude, I'm unsure what steps you suggest to take to improve
 std.benchmark, and I'd be grateful if you could clarify them after
 perusing its code and documentation.


 Thanks,

 Andrei

Sep 26 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Monday, September 26, 2011 11:43:52 Robert Jacques wrote:
 Great, but MS still recommends benchmarking be done on a single core. And if
 MS thinks that is how benchmarking should be done, I think that's how we
 should do it.

And how would you ensure that? Beyond having your application be single-
threaded, I'm not aware of any way to have any control over whether a 
particular process is run on a single core or not. Generally, that stuff is 
completely managed by the OS and the hardware and is outside the control of 
the programmer. I'm not saying that there definitively isn't a way to ensure 
that a process only runs on one core, but I'm not aware of one.

- Jonathan M Davis

Sep 26 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/26/2011 05:59 PM, Jonathan M Davis wrote:
 On Monday, September 26, 2011 11:43:52 Robert Jacques wrote:
 Great, but MS still recommends benchmarking be done on a single core. And if
 MS thinks that is how benchmarking should be done, I think that's how we
 should do it.

 And how would you ensure that? Beyond having your application be single-
 threaded, I'm not aware of any way to have any control over whether a
 particular process is run on a single core or not. Generally, that stuff is
 completely managed by the OS and the hardware and is outside the control of
 the programmer. I'm not saying that there definitively isn't a way to ensure
 that a process only runs on one core, but I'm not aware of one.

 - Jonathan M Davis

There are API functions to do that on any reasonable OS.

Sep 26 2011

"Robert Jacques" <sandford jhu.edu> writes:

On Mon, 26 Sep 2011 13:20:01 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 09/26/2011 05:59 PM, Jonathan M Davis wrote:
 On Monday, September 26, 2011 11:43:52 Robert Jacques wrote:
 Great, but MS still recommends benchmarking be done on a single core. And if
 MS thinks that is how benchmarking should be done, I think that's how we
 should do it.

 And how would you ensure that? Beyond having your application be single-
 threaded, I'm not aware of any way to have any control over whether a
 particular process is run on a single core or not. Generally, that stuff is
 completely managed by the OS and the hardware and is outside the control of
 the programmer. I'm not saying that there definitively isn't a way to ensure
 that a process only runs on one core, but I'm not aware of one.

 - Jonathan M Davis

 There are API functions to do that on any reasonable OS.

Such as SetThreadAffinityMask, which I mentioned in my original post.

Sep 27 2011

"Marco Leise" <Marco.Leise gmx.de> writes:

Am 27.09.2011, 09:08 Uhr, schrieb Robert Jacques <sandford jhu.edu>:

 On Mon, 26 Sep 2011 13:20:01 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 09/26/2011 05:59 PM, Jonathan M Davis wrote:
 On Monday, September 26, 2011 11:43:52 Robert Jacques wrote:
 Great, but MS still recommends benchmarking be done on a single core.  
 And if
 MS thinks that is how benchmarking should be done, I think that's how  
 we
 should do it.

 And how would you ensure that? Beyond having your application be  
 single-
 threaded, I'm not aware of any way to have any control over whether a
 particular process is run on a single core or not. Generally, that  
 stuff is
 completely managed by the OS and the hardware and is outside the  
 control of
 the programmer. I'm not saying that there definitively isn't a way to  
 ensure
 that a process only runs on one core, but I'm not aware of one.

 - Jonathan M Davis

 There are API functions to do that on any reasonable OS.

 Such as SetThreadAffinityMask, which I mentioned in my original post.

sched_setaffinity      (per process / Linux)
pthread_setaffinity_np (per thread / Posix)

Sep 28 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/26/11 8:43 AM, Robert Jacques wrote:
 Great, but MS still recommends benchmarking be done on a single core.
 And if MS thinks that is how benchmarking should be done, I think that's
 how we should do it.

  From MSDN documentation on QueryPerformanceCounter:
 "On a multiprocessor computer, it should not matter which processor is
 called. However, you can get different results on different processors
 due to bugs in the basic input/output system (BIOS) or the hardware
 abstraction layer (HAL). To specify processor affinity for a thread, use
 the SetThreadAffinityMask function."

I guess this may have relevance only on multithreaded apps, but it's 
best to do things by the book. I'll look into it.

 *sigh* By multiple runs, I mean multiple runs of the benchmark function,
 _including_ the loops. From my post last Tuesday:

I now got it, sorry for being slow. How about taking the minimum of 10 
consecutive bigloop runs while committing to the same overall timing budget?


Thanks,

Andrei

Sep 26 2011

"Robert Jacques" <sandford jhu.edu> writes:

On Mon, 26 Sep 2011 17:56:26 -0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 On 9/26/11 8:43 AM, Robert Jacques wrote:
 Great, but MS still recommends benchmarking be done on a single core.
 And if MS thinks that is how benchmarking should be done, I think that's
 how we should do it.

  From MSDN documentation on QueryPerformanceCounter:
 "On a multiprocessor computer, it should not matter which processor is
 called. However, you can get different results on different processors
 due to bugs in the basic input/output system (BIOS) or the hardware
 abstraction layer (HAL). To specify processor affinity for a thread, use
 the SetThreadAffinityMask function."

 I guess this may have relevance only on multithreaded apps, but it's
 best to do things by the book. I'll look into it.

Well, this applies to any multi-core PC; whether the app is single-threaded or
multi-threaded makes no difference.

 *sigh* By multiple runs, I mean multiple runs of the benchmark function,
 _including_ the loops. From my post last Tuesday:

 I now got it, sorry for being slow. How about taking the minimum of 10
 consecutive bigloop runs while committing to the same overall timing budget?

No problem. 10 is a good default and is generally what I use.

Although, there should be at least one function which allows manual control out
the number of bigloops. i.e. for the ptr+length discussion I ultimately used a
lot more than 10 iterations, as this was a very precise comparison. And people
wanting to test network or cold-start disk performance may only want to use a
single bigloop.

Sep 27 2011

travert phare.normalesup.org (Christophe) writes:

"Robert Jacques" , dans le message (digitalmars.D:145443), a écrit :
 *sigh* By multiple runs, I mean multiple runs of the benchmark function,
_including_ the loops. From my post last Tuesday:
 
      double min_f1 = int.max;
      double min_f2 = int.max;
      foreach(i;0..10_000) {
          auto b=benchmark!(f1,f2)(3);
          min_f1 = min(min_f1, b[0].to!("seconds",double));
          min_f2 = min(min_f2, b[1].to!("seconds",double));
      }
 
 All the looping that benchmark currently does is to lengthen the total 
 time a function takes. This makes it possible to avoid sampling errors 
 with regard to the performance counter itself. But you are still only 
 making a single, noisy measurement of a (meta)function. In order to 
 get a robust result, you need to make multiple measurements and take 
 the minimum.

Sorry to question you, but what makes you say that the minimum is more 
interesting than, let's say, the mean + the standard deviation. Is there 
any paper supporting this ?

I am not a benchmark specialist, but an average statistician who knows 
that the minimum is not always very informative (a very low value can be 
a piece of luck). I am sure you may have good reasons to choose the 
minimum, but I just want to make sure we make the right choice by 
choosing to use the minimun of consecutive benchmarks.

-- 
Christophe

Sep 28 2011

"Robert Jacques" <sandford jhu.edu> writes:

On Wed, 28 Sep 2011 08:19:16 -0400, Christophe <travert phare.normalesup.org>
wrote:
 "Robert Jacques" , dans le message (digitalmars.D:145443), a �crit :
 *sigh* By multiple runs, I mean multiple runs of the benchmark function,
_including_ the loops. From my post last Tuesday:

      double min_f1 = int.max;
      double min_f2 = int.max;
      foreach(i;0..10_000) {
          auto b=benchmark!(f1,f2)(3);
          min_f1 = min(min_f1, b[0].to!("seconds",double));
          min_f2 = min(min_f2, b[1].to!("seconds",double));
      }

 All the looping that benchmark currently does is to lengthen the total
 time a function takes. This makes it possible to avoid sampling errors
 with regard to the performance counter itself. But you are still only
 making a single, noisy measurement of a (meta)function. In order to
 get a robust result, you need to make multiple measurements and take
 the minimum.

 Sorry to question you, but what makes you say that the minimum is more
 interesting than, let's say, the mean + the standard deviation. Is there
 any paper supporting this ?

 I am not a benchmark specialist, but an average statistician who knows
 that the minimum is not always very informative (a very low value can be
 a piece of luck). I am sure you may have good reasons to choose the
 minimum, but I just want to make sure we make the right choice by
 choosing to use the minimun of consecutive benchmarks.

The reason that the mean + std is so pervasive in science is that it's perfect
for a Gaussian noise model. In most physical measurements, you expect some
readings to be both above and below the actual value and for most processes we
assume the error of these measurements is normally distributed. Importantly,
when we know the process not to produce normally distibuted error, we look to
other statistics for meaning. Now, for computer timing measurements, you have
access to the physical number of clock cycles taken. There is some noise in
this, but it's on the order of 1-2 cycles (i.e. ns). As a benchmark is trying
to measure the number of cycles something takes to complete, and it is
physically impossible to complete that task in less than a fixed number of
cycles, 'noise' in the measurement above 1-2 ns is by definition purely
additive due to OS or hardware effects (i.e. a context switch occurred, a page
fault happened, you put your computer to sleep, etc.). So the best way to get  
to the true number of cycles a task takes to complete is to take the minimum
recorded value, on the premise that you did get lucky and those other issues
which you are not trying to measure, had little to no impact.

So that's the rational. As for academic support, I've published peer-reviewed
papers using this methodology, but I can't remember where I originally got the
advice. Also, there are certain types of benchmarks for which the mean + std is
better (networking comes to mind), etc.

P.S. A little Google-fu turned up: The science of computer benchmarking By
Roger W. Hockney, which does recommend using the minimum, particularly to get
around the problem of time-slicing (i.e. context switching) by the OS.

Sep 28 2011

travert phare.normalesup.org (Christophe) writes:

I am reassured now.

Thanks a lot.

-- 
Christophe

Sep 28 2011

Don <nospam nospam.com> writes:

On 26.09.2011 17:43, Robert Jacques wrote:
 Second, timing generally relies on the CPUs Time Stamp Counter, which is
 not multi-thread safe; a core switch invalidates all previous TSC
 values, and hence, the time measurement itself. Furthermore, the TSC is
 not even guaranteed to have a fixed frequency on some CPUs. Now there
 are ways around the problems of the TSC, but even so:

 (From the Wikipedia)
 "Under Windows platforms, Microsoft strongly discourages using the TSC
 for high-resolution timing for exactly these reasons, providing instead
 the Windows APIs QueryPerformanceCounter and
 QueryPerformanceFrequency.[2] Even when using these functions, Microsoft
 recommends the code to be locked to a single CPU."

 std.benchmark uses QueryPerformanceCounter on Windows and
 clock_gettime/gettimeofday on Unix.

 Great, but MS still recommends benchmarking be done on a single core.
 And if MS thinks that is how benchmarking should be done, I think that's
 how we should do it.

I think that's quite misleading. Microsoft has made some assumptions in 
there about what you are measuring. QueryPerformanceCounter gives you 
the wall clock time. But, with modern CPUs (esp, core i7 Sandy Bridge, 
but also as far back as Pentium M) the CPU frequency isn't constant.
So, the wall clock time depends both on the number of clock cycles 
required to execute the code, AND on the temperature of the CPU!
If you run the same benchmark code enough times, it'll eventually get 
slower as the CPU heats up!

Personally I always use the TSC, and then discard any results where the 
TSC is inconsistent. Specifically I use the hardware performance 


1 cache misses, etc. I've managed to get rock-solid data that way, but 
it takes a lot of work (eg, you have to make sure that your stack is 
aligned to 16 bytes).
BUT this sort of thing is only relevant to really small sections of code 
(< 1 time slice).

If you're timing something which involves hardware other than the CPU 
(eg, database access, network access) then you really want the wall 
clock time. And if the section of code is long enough, then maybe you do 
care how much it heats up the CPU!
But OTOH once you're in that regime, I don't think you care about 
processor affinity. In real life, you WILL get core transitions.

So I don't think it's anywhere near as simple as what Microsoft makes 
out. It's very important to be clear about what kind of stuff you intend 
to measure.

Sep 29 2011

so <so so.so> writes:

On Mon, 26 Sep 2011 04:08:45 +0300, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 I've had a good time with std.benchmark today and made it ready for  
 submission to Phobos. As a perk I've added the %h format specifier,  
 which formats a number in human-readable form using SI prefixes.

 For those interested in discussing this prior to the formal review  
 process, the code is at:

 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d

 and the dox at

 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html

 Comments and suggestions are welcome. If possible, it would be great if  
 this submission were given a bump in the priority queue. This is because  
 with robust benchmarking in Phobos we can ask new, performance-sensitive  
 submissions, to add benchmarking.


 Thanks,

 Andrei

I know it does same thing under the hood. Yet, wouldn't

this(AutoStart autostart)
     {
         if(autostart)
             start();
     }

be better off something explicit like:

this(AutoStart autostart=AutoStart.no)
     {
         if(Autostart.yes == autostart)
             start();
     }

It overwrites default constructor but then again it is explicit.
Maybe i am completely of the hook and there is a phobos/D consensus like  
"leave the def. ctor alone!"

Sep 25 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 9/25/2011 6:08 PM, Andrei Alexandrescu wrote:
 I've had a good time with std.benchmark today and made it ready for submission
 to Phobos. As a perk I've added the %h format specifier, which formats a number
 in human-readable form using SI prefixes.


We already have std.perf:

  http://www.d-programming-language.org/phobos/std_perf.html

A review should compare the two.

Sep 25 2011

Jerry Quinn <jlquinn optonline.net> writes:

Walter Bright Wrote:

 We already have std.perf:
 
   http://www.d-programming-language.org/phobos/std_perf.html
 
 A review should compare the two.

std.perf doesn't show up in the nav bar to the left.  I didn't know it existed
until your post :-)

Jerry

Sep 26 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Monday, September 26, 2011 11:30:58 Jerry Quinn wrote:
 Walter Bright Wrote:
 We already have std.perf:
   http://www.d-programming-language.org/phobos/std_perf.html
 
 A review should compare the two.

 
 std.perf doesn't show up in the nav bar to the left.  I didn't know it
 existed until your post :-)

I don't think that it ever showed up in the online docs. I believe that the 
current StopWatch and benchmarking functions in std.datetime are supposed to 
have replaced it as well, so it's been scheduled for deprecation for a while 
(and probably should have been deprecated by now). Glancing at it though, it 
does appear to have more functionality that the StopWatch and benchmarking 
functions in std.datetime currently provide.

- Jonathan M Davis

Sep 26 2011

Jose Armando Garcia <jsancio gmail.com> writes:

On Sun, Sep 25, 2011 at 6:08 PM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 Comments and suggestions are welcome.

Very nice. Really like the functionality. Really like how you
implemented "void benchmarkModule(string mod)(File target = stdout)".
I have a few comments.

My main concern on how this function works is that it looks at the
module for methods starting with benchmark_. I like the idea in
general but I am worry that it would pollute the symbol space. Does
this work with private symbols? Can the user  make all the members
private and it will still work? What about versioning should the user
version all the benchmark members with version(Benchmark) or something
like that? Should the user just create a benchmark module that would
just contain all these benchmark functions. What do you recommend to
the user? Should the doc contain these recommendations?

One last comment for now. Maybe we can make the prefix, 'benchmark_',
configurable by passing it to benchmarkModule(...).

I think that benchmarkModule would greatly benefit from annotation it
would be nice if the user could write:

 benchmark("list")
private void listInsert() {...}

Sep 26 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/26/11 9:04 AM, Jose Armando Garcia wrote:
 On Sun, Sep 25, 2011 at 6:08 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org>  wrote:
 Comments and suggestions are welcome.

 Very nice. Really like the functionality. Really like how you
 implemented "void benchmarkModule(string mod)(File target = stdout)".
 I have a few comments.

 My main concern on how this function works is that it looks at the
 module for methods starting with benchmark_. I like the idea in
 general but I am worry that it would pollute the symbol space. Does
 this work with private symbols? Can the user  make all the members
 private and it will still work?

Not sure whether the compiler allows it, but it shouldn't work if one 
module benchmarks another. That doesn't worry me - benchmark functions 
are meant to be exposed for anyone to play with and it doesn't seem a 
function starting with "benchmark_" is casually confusable.

 What about versioning should the user
 version all the benchmark members with version(Benchmark) or something
 like that?

As they wish. We will indeed define a version(StdBenchmark) for Phobos 
itself.

 Should the user just create a benchmark module that would
 just contain all these benchmark functions. What do you recommend to
 the user? Should the doc contain these recommendations?

Good idea. I will add some typical physical designs.

 One last comment for now. Maybe we can make the prefix, 'benchmark_',
 configurable by passing it to benchmarkModule(...).

Good idea, will do.

 I think that benchmarkModule would greatly benefit from annotation it
 would be nice if the user could write:

  benchmark("list")
 private void listInsert() {...}

I think the benefit would be only marginal.


Thanks,

Andrei

Sep 26 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday, September 25, 2011 20:08:45 Andrei Alexandrescu wrote:
 I've had a good time with std.benchmark today and made it ready for
 submission to Phobos. As a perk I've added the %h format specifier,
 which formats a number in human-readable form using SI prefixes.
 
 For those interested in discussing this prior to the formal review
 process, the code is at:
 
 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d
 
 and the dox at
 
 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html
 
 Comments and suggestions are welcome. If possible, it would be great if
 this submission were given a bump in the priority queue. This is because
 with robust benchmarking in Phobos we can ask new, performance-sensitive
 submissions, to add benchmarking.

You might want to check the compilability of your code. I merged your 
benchmark branch into a branch that I just branched from master, and it fails 
to compile with the latest compiler.

- Jonathan M Davis

Sep 26 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Monday, September 26, 2011 20:09:37 Jonathan M Davis wrote:
 On Sunday, September 25, 2011 20:08:45 Andrei Alexandrescu wrote:
 I've had a good time with std.benchmark today and made it ready for
 submission to Phobos. As a perk I've added the %h format specifier,
 which formats a number in human-readable form using SI prefixes.
 
 For those interested in discussing this prior to the formal review
 process, the code is at:
 
 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d
 
 and the dox at
 
 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html
 
 Comments and suggestions are welcome. If possible, it would be great if
 this submission were given a bump in the priority queue. This is because
 with robust benchmarking in Phobos we can ask new, performance-sensitive
 submissions, to add benchmarking.

 
 You might want to check the compilability of your code. I merged your
 benchmark branch into a branch that I just branched from master, and it
 fails to compile with the latest compiler.

That is, the unittest target fails to compile (due to a template error). The 
normal build succeeds (probably because it's not compiling the benchmark 
function - which is the templated function which is failing).

- Jonathan M Davis

Sep 26 2011

travert phare.normalesup.org (Christophe) writes:

Andrei Alexandrescu , dans le message (digitalmars.D:145370), a écrit :
 I've had a good time with std.benchmark today and made it ready for 
 submission to Phobos. As a perk I've added the %h format specifier, 
 which formats a number in human-readable form using SI prefixes.
 
 For those interested in discussing this prior to the formal review 
 process, the code is at:
 
 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d
 
 and the dox at
 
 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html
 
 Comments and suggestions are welcome. If possible, it would be great if 
 this submission were given a bump in the priority queue. This is because 
 with robust benchmarking in Phobos we can ask new, performance-sensitive 
 submissions, to add benchmarking.
 


Nice piece of work.

I quite don't like the use of a global (well, thread local) name for the 
benchmark clockWatch. What about giving the clockWatch instance as a 
parameter to the benchmarked functions if they need to pause/resume?

That would also avoid the issue of naming benchmarkPause and 
benchmarkResume (btw, I would have named them pauseBenchmark and 
resumeBenchmark).

-- 
Christophe

Sep 28 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/25/11 6:08 PM, Andrei Alexandrescu wrote:
 I've had a good time with std.benchmark today and made it ready for
 submission to Phobos.

[snip]

You destroyed, and I listened. I updated my benchmark branch to accept a 
configurable time budget for measurements, and a number of trials that 
take the minimum time.

Still need to do the thread affinity stuff.

Code:

https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
https://github.com/andralex/phobos/blob/benchmark/std/format.d

Dox:

http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
http://erdani.com/d/web/phobos-prerelease/std_format.html


Destroy once again.

Andrei

Sep 28 2011

Jens Mueller <jens.k.mueller gmx.de> writes:

Andrei Alexandrescu wrote:
 On 9/25/11 6:08 PM, Andrei Alexandrescu wrote:
I've had a good time with std.benchmark today and made it ready for
submission to Phobos.

 [snip]
 
 You destroyed, and I listened. I updated my benchmark branch to
 accept a configurable time budget for measurements, and a number of
 trials that take the minimum time.
 
 Still need to do the thread affinity stuff.
 
 Code:
 
 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d
 
 Dox:
 
 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html
 
 
 Destroy once again.
 
 Andrei

Overall I like this module very much.

1. I'm not that convinced that prepending benchmark_ to a function is
that nice even though it's simple. But following this idea I could
remove all my unittest blocks and add functions starting with test_ and
write a module std.test in same fashion. If this approach is considered
the way to go we can remove unittest altogether and solve testing
entirely in a library. After all it looks less elegant to me but I admit
that I see no other way and it is the easiest solution after all. I just
want to stress that when seeing the code I questioned myself whether
having unittest in the language is nice to have.

2. When benchmarking code you often investigate the algorithm's
properties regarding its scalability. That's why I think it's useful to
give an example in the documentation. Something like

  void benchmark_insert_array(size_t length)(uint n)
  {   
      benchmarkPause();
      Array!bool array;
      array.length = length;
      benchmarkResume();

      foreach(i; 0 .. n)
          array.insertAfter(array[0 .. uniform(0, length)], true);
  }
  
  void benchmark_insert_slist(size_t length)(uint n)
  {   
      benchmarkPause();
      SList!bool list;
      foreach (i; 0 .. length)
          list.insertFront(false);
      benchmarkResume();

      foreach(i; 0 .. n)
      {   
          auto r = take(list[], uniform(0, length));
          list.insertAfter(r, true);
      }
  }
  
  alias benchmark_insert_slist!(10) benchmark_insert_slist_10;
  alias benchmark_insert_array!(10) benchmark_insert_array_10;
  alias benchmark_insert_slist!(100) benchmark_insert_slist_100;
  alias benchmark_insert_array!(100) benchmark_insert_array_100;

This example needs more polishing.

Input size is the most important dimension that CPU time depends on. The
example should clarify how to benchmark depending on the input size.

3.
(r[i] / 10_000_000.).to!("nsecs", int)
should be written as
(r[i] / 10_000_000).nsecs

Same for msecs and similar.

4. Test results should be exposed to the caller.
Usually, it is enough to write the results to the console. But you may
want to post process the results. Being it either graphically or to
output XML for whatever reason.
I think it would be beneficial to add this flexibility and separate the
benchmarking logic from its generating the output.
It might be useful to add the used CPU using std.cpuid to the output.

5. In the first example I completely missed that it is not "relative
calls". I thought this was one term even though I didn't know what it
should mean. Later I figured that there is an empty column called
relative.

6. Wrong numbers in documentation
The examples in the documentation have wrong numbers. In the end one
should pay special attention that the documentation is consistent. I
think one cannot automate this.
There are also rounding errors. That means when I try to compute these
numbers I get different in the first digit after the decimal point. I
wonder why this is. Even though the measurement may be noisy the
computed numbers should be more precise I think.

7. Changing units in the output.
One benchmark report may use different units in its output. The one in
the example uses microseconds and nanoseconds. This makes comparison
more difficult and should be avoided (at least for groups). It is
error-prone when analyzing the numbers.

8. I prefer writing std.benchmark.pause() over benchmarkPause() and
std.benchmark.resume() over benchmarkResume().
or
import std.benchmark : benchmark;
try benchmark.pause()

Supporting benchmark.start() would improve the example. Since you just
need to specify one line that indicates where the benchmarking should
start. But this seems difficult to get to work with the current design
and I don't consider it important. It's a very minor thing.

10. The file reading/writing benchmark should be improved. Since the
file should be deleted. The file writing benchmark should be more like:
void benchmark_fileWrite()
{   
    benchmarkPause();
    auto f = File.tmpfile();
    benchmarkResume();

    f.writeln("hello, world!");
}

But this also measures the deletion of the file.
I have no idea how to fix the benchmark for reading a file. But I
believe it involves using mkstemp on POSIX and similar on Windows.
mkstemp should be added to std.file in the long run.

Jens

Oct 03 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/3/11 8:59 AM, Jens Mueller wrote:
[snip]

Thanks for this rich feedback. I'll act on most or all points. A few 
notes follow.

 1. I'm not that convinced that prepending benchmark_ to a function is
 that nice even though it's simple. But following this idea I could
 remove all my unittest blocks and add functions starting with test_ and
 write a module std.test in same fashion. If this approach is considered
 the way to go we can remove unittest altogether and solve testing
 entirely in a library. After all it looks less elegant to me but I admit
 that I see no other way and it is the easiest solution after all. I just
 want to stress that when seeing the code I questioned myself whether
 having unittest in the language is nice to have.

std.benchmark uses relatively new reflection features that weren't 
available until recently. Adding unittest now would be difficult to 
justify, but at the time unittest was a good feature.

 2. When benchmarking code you often investigate the algorithm's
 properties regarding its scalability.

Yah, that definitely needs to be looked at. I think I need to add some 
support beyond mere documentation.

 3.
 (r[i] / 10_000_000.).to!("nsecs", int)
 should be written as
 (r[i] / 10_000_000).nsecs

 Same for msecs and similar.

Noted.

 4. Test results should be exposed to the caller.
 Usually, it is enough to write the results to the console. But you may
 want to post process the results. Being it either graphically or to
 output XML for whatever reason.
 I think it would be beneficial to add this flexibility and separate the
 benchmarking logic from its generating the output.
 It might be useful to add the used CPU using std.cpuid to the output.

We should decouple collection from formatting. However, this complicates 
matters because we'd need to expose the benchmark results structure. 
I'll think of it.

 5. In the first example I completely missed that it is not "relative
 calls". I thought this was one term even though I didn't know what it
 should mean. Later I figured that there is an empty column called
 relative.

There are two spaces there, I should add three.

 6. Wrong numbers in documentation
 The examples in the documentation have wrong numbers. In the end one
 should pay special attention that the documentation is consistent. I
 think one cannot automate this.

Not sure how this can be addressed at all (or maybe I don't understand). 
The examples show real numbers collected from real runs on my laptop. I 
don't think one may expect to reproduce them with reasonable accuracy 
because they depend on a lot of imponderables.

 There are also rounding errors. That means when I try to compute these
 numbers I get different in the first digit after the decimal point. I
 wonder why this is. Even though the measurement may be noisy the
 computed numbers should be more precise I think.

Where are calculations mistaken? Indeed I see e.g. that append built-in 
has 83ns/call and 11.96M calls/s. Reverting the first with a 
"calculator" program yields 12.04 Mcalls/s, and reverting the second 
yields 83.61ns/call. I'll see what can be done.

 7. Changing units in the output.
 One benchmark report may use different units in its output. The one in
 the example uses microseconds and nanoseconds. This makes comparison
 more difficult and should be avoided (at least for groups). It is
 error-prone when analyzing the numbers.

Problem is space and wildly varying numbers. I initially had exponential 
notation for everything, but it was super ugly and actually more 
difficult to navigate. That's why I added the compact notation. I think 
we'll need to put up with it.

 8. I prefer writing std.benchmark.pause() over benchmarkPause() and
 std.benchmark.resume() over benchmarkResume().

OK.

 10. The file reading/writing benchmark should be improved. Since the
 file should be deleted. The file writing benchmark should be more like:
 void benchmark_fileWrite()
 {
      benchmarkPause();
      auto f = File.tmpfile();
      benchmarkResume();

      f.writeln("hello, world!");
 }

 But this also measures the deletion of the file.

If you don't want to measure std.file.write but instead only the time 
spent in writeln, his should work and measure only the time spent in 
writing to the file:

void benchmark_fileWrite()
{
      benchmarkPause();
      auto f = File.tmpfile();
      benchmarkResume();
      f.writeln("hello, world!");
      benchmarkPause();
}

I haven't yet tested the case when a functions leaves the benchmark paused.

 I have no idea how to fix the benchmark for reading a file. But I
 believe it involves using mkstemp on POSIX and similar on Windows.
 mkstemp should be added to std.file in the long run.

Why is there a need to fix it? The benchmark measures the performance of 
std.file.read soup to nuts.

In related news, I got word from a colleagues that I should also use 
times() (http://linux.die.net/man/2/times) to distinguish time spent in 
user space vs. system calls.  This is because some functions must issue 
certain system calls in any case and the user is interested in how much 
they can improve the situation by optimizing their end of the deal. This 
would apply e.g. to file I/O where the overhead of formatting etc. would 
become easier to distinguish.


Andrei

Oct 03 2011

Jens Mueller <jens.k.mueller gmx.de> writes:

Andrei Alexandrescu wrote:
 On 10/3/11 8:59 AM, Jens Mueller wrote:
 [snip]
 
 Thanks for this rich feedback. I'll act on most or all points. A few
 notes follow.
 
1. I'm not that convinced that prepending benchmark_ to a function is
that nice even though it's simple. But following this idea I could
remove all my unittest blocks and add functions starting with test_ and
write a module std.test in same fashion. If this approach is considered
the way to go we can remove unittest altogether and solve testing
entirely in a library. After all it looks less elegant to me but I admit
that I see no other way and it is the easiest solution after all. I just
want to stress that when seeing the code I questioned myself whether
having unittest in the language is nice to have.

 
 std.benchmark uses relatively new reflection features that weren't
 available until recently. Adding unittest now would be difficult to
 justify, but at the time unittest was a good feature.

I see. This is also my feeling.

6. Wrong numbers in documentation
The examples in the documentation have wrong numbers. In the end one
should pay special attention that the documentation is consistent. I
think one cannot automate this.

 
 Not sure how this can be addressed at all (or maybe I don't
 understand). The examples show real numbers collected from real runs
 on my laptop. I don't think one may expect to reproduce them with
 reasonable accuracy because they depend on a lot of imponderables.

I just mean that the relative speed ups printed in the table does not
match the ones given in the surrounding text.

There are also rounding errors. That means when I try to compute these
numbers I get different in the first digit after the decimal point. I
wonder why this is. Even though the measurement may be noisy the
computed numbers should be more precise I think.

 
 Where are calculations mistaken? Indeed I see e.g. that append
 built-in has 83ns/call and 11.96M calls/s. Reverting the first with
 a "calculator" program yields 12.04 Mcalls/s, and reverting the
 second yields 83.61ns/call. I'll see what can be done.

Yes. This is what I meant here.

7. Changing units in the output.
One benchmark report may use different units in its output. The one in
the example uses microseconds and nanoseconds. This makes comparison
more difficult and should be avoided (at least for groups). It is
error-prone when analyzing the numbers.

 
 Problem is space and wildly varying numbers. I initially had
 exponential notation for everything, but it was super ugly and
 actually more difficult to navigate. That's why I added the compact
 notation. I think we'll need to put up with it.

If formatting and collecting is separated it is fine I'll guess. It's
okay for the default output.

10. The file reading/writing benchmark should be improved. Since the
file should be deleted. The file writing benchmark should be more like:
void benchmark_fileWrite()
{
     benchmarkPause();
     auto f = File.tmpfile();
     benchmarkResume();

     f.writeln("hello, world!");
}

But this also measures the deletion of the file.

 
 If you don't want to measure std.file.write but instead only the
 time spent in writeln, his should work and measure only the time
 spent in writing to the file:
 
 void benchmark_fileWrite()
 {
      benchmarkPause();
      auto f = File.tmpfile();
      benchmarkResume();
      f.writeln("hello, world!");
      benchmarkPause();
 }
 
 I haven't yet tested the case when a functions leaves the benchmark paused.

This looks good.

I have no idea how to fix the benchmark for reading a file. But I
believe it involves using mkstemp on POSIX and similar on Windows.
mkstemp should be added to std.file in the long run.

 
 Why is there a need to fix it? The benchmark measures the
 performance of std.file.read soup to nuts.

Yes. But the benchmark should clean up after itself if possible. So it
should remove the file.

 In related news, I got word from a colleagues that I should also use
 times() (http://linux.die.net/man/2/times) to distinguish time spent
 in user space vs. system calls.  This is because some functions must
 issue certain system calls in any case and the user is interested in
 how much they can improve the situation by optimizing their end of
 the deal. This would apply e.g. to file I/O where the overhead of
 formatting etc. would become easier to distinguish.

Good hint from your colleague. The more data one can get when measuring
code the better. You could even be interested in cache misses when
benchmarking. But CPU time is usually what boils down to. I fear you may
loose simplicity.
I'm looking forward to your changes.

Jens

Oct 03 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

With this new module we might as well implement __MODULE__ or
something similar to avoid hardcoding the module name or using the
.stringof[7..$] trick.

Oct 03 2011

Jacob Carlborg <doob me.com> writes:

On 2011-10-03 17:44, Andrei Alexandrescu wrote:
 On 10/3/11 8:59 AM, Jens Mueller wrote:
 4. Test results should be exposed to the caller.
 Usually, it is enough to write the results to the console. But you may
 want to post process the results. Being it either graphically or to
 output XML for whatever reason.
 I think it would be beneficial to add this flexibility and separate the
 benchmarking logic from its generating the output.
 It might be useful to add the used CPU using std.cpuid to the output.

 We should decouple collection from formatting. However, this complicates
 matters because we'd need to expose the benchmark results structure.
 I'll think of it.

Not necessarily. The user could provide a delegate as a callback, 
something like this:

struct Result
{
     string func;
     double relative;
     double calls;
     double tCall;
     double callsPerSecond;
}

alias void delegate (Result result) Formatter;

Doesn't seem complicated to me. Then the module could provide a default 
formatter that does what the module currently does.

-- 
/Jacob Carlborg

Oct 03 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/3/11 12:49 PM, Jacob Carlborg wrote:
 On 2011-10-03 17:44, Andrei Alexandrescu wrote:
 On 10/3/11 8:59 AM, Jens Mueller wrote:
 4. Test results should be exposed to the caller.
 Usually, it is enough to write the results to the console. But you may
 want to post process the results. Being it either graphically or to
 output XML for whatever reason.
 I think it would be beneficial to add this flexibility and separate the
 benchmarking logic from its generating the output.
 It might be useful to add the used CPU using std.cpuid to the output.

 We should decouple collection from formatting. However, this complicates
 matters because we'd need to expose the benchmark results structure.
 I'll think of it.

 Not necessarily. The user could provide a delegate as a callback,
 something like this:

 struct Result
 {
 string func;
 double relative;
 double calls;
 double tCall;
 double callsPerSecond;
 }

 alias void delegate (Result result) Formatter;

That's what I meant when I said "we'd need to expose the benchmark 
results structure" :o).

There's probably no way out of this anyway, particularly if we provide 
richer information via times(), scalability, etc.


Andrei

Oct 03 2011

Jacob Carlborg <doob me.com> writes:

On 2011-10-03 20:04, Andrei Alexandrescu wrote:
 On 10/3/11 12:49 PM, Jacob Carlborg wrote:
 On 2011-10-03 17:44, Andrei Alexandrescu wrote:
 On 10/3/11 8:59 AM, Jens Mueller wrote:
 4. Test results should be exposed to the caller.
 Usually, it is enough to write the results to the console. But you may
 want to post process the results. Being it either graphically or to
 output XML for whatever reason.
 I think it would be beneficial to add this flexibility and separate the
 benchmarking logic from its generating the output.
 It might be useful to add the used CPU using std.cpuid to the output.

 We should decouple collection from formatting. However, this complicates
 matters because we'd need to expose the benchmark results structure.
 I'll think of it.

 Not necessarily. The user could provide a delegate as a callback,
 something like this:

 struct Result
 {
 string func;
 double relative;
 double calls;
 double tCall;
 double callsPerSecond;
 }

 alias void delegate (Result result) Formatter;

 That's what I meant when I said "we'd need to expose the benchmark
 results structure" :o).

Hehe, Ok. I don't think it gets THAT much more complicated, just call 
the delegate with the results instead of writef(ln).

 There's probably no way out of this anyway, particularly if we provide
 richer information via times(), scalability, etc.


 Andrei


-- 
/Jacob Carlborg

Oct 03 2011

D Programming

C/C++ Programming

Other

digitalmars.D - std.benchmark is in reviewable state