www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - std.benchmark is in reviewable state

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
I've had a good time with std.benchmark today and made it ready for 
submission to Phobos. As a perk I've added the %h format specifier, 
which formats a number in human-readable form using SI prefixes.

For those interested in discussing this prior to the formal review 
process, the code is at:

https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
https://github.com/andralex/phobos/blob/benchmark/std/format.d

and the dox at

http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
http://erdani.com/d/web/phobos-prerelease/std_format.html

Comments and suggestions are welcome. If possible, it would be great if 
this submission were given a bump in the priority queue. This is because 
with robust benchmarking in Phobos we can ask new, performance-sensitive 
submissions, to add benchmarking.


Thanks,

Andrei
Sep 25 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html

Is the micron symbol visible on Windows too? Bye, bearophile
Sep 25 2011
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
bearophile wrote:
 Is the micron symbol visible on Windows too?

Yes, of course. It's a matter of fonts though - if you don't have the character in your font (regardless of operating system) it might not show. The micro sign is very common though, so unlikely to have any font without it.
Sep 25 2011
next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
Vladimir Panteleev wrote:
 There are two problems with Unicode characters in console D programs:

I was thinking of the website documentation, not the program itself. The symbols works for the website, but not really for the program. (even most my linux terminals aren't UTF-8)
Sep 25 2011
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/26/11 8:07 AM, Andrej Mitrovic wrote:
 On 9/26/11, Regan Heath<regan netmail.co.nz>  wrote:
 What's wrong with "benchResume"?

I'm ok with that.

I see negative progress from "benchmarkResume" to "benchResume". I think I'll simply name them "pause" and "resume". Andrei
Sep 26 2011
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/25/11 8:41 PM, bearophile wrote:
 Andrei Alexandrescu:

 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html

Is the micron symbol visible on Windows too?

Ionno, haven't tried. Andrei
Sep 25 2011
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Mon, 26 Sep 2011 05:03:32 +0300, Adam D. Ruppe  
<destructionator gmail.com> wrote:

 bearophile wrote:
 Is the micron symbol visible on Windows too?

Yes, of course. It's a matter of fonts though - if you don't have the character in your font (regardless of operating system) it might not show. The micro sign is very common though, so unlikely to have any font without it.

Have you tried it? It's not a matter of installed fonts. There are two problems with Unicode characters in console D programs: 1. The default code page on Windows is practically never UTF-8. You could change the code page to UTF-8 using SetConsoleOutputCP, but weird things happen if you don't change it back before your program exits: if your program was called from a batch file, all batch files are instantly aborted. 2. The default font for the Windows console is not a Unicode font. -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Sep 25 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
I'm thinking we could shorten the function names. benchmarkResume is a
mouthful, and we already came to the conclusion that we shouldn't
repeat module names in function names. But I'm all out of ideas for
naming things, sorry. :)
Sep 25 2011
prev sibling next sibling parent so <so so.so> writes:
On Mon, 26 Sep 2011 06:14:59 +0300, Andrej Mitrovic  
<andrej.mitrovich gmail.com> wrote:

 I'm thinking we could shorten the function names. benchmarkResume is a
 mouthful, and we already came to the conclusion that we shouldn't
 repeat module names in function names. But I'm all out of ideas for
 naming things, sorry. :)

All the workarounds to this (that i know of) instead of solving the issue, makes the situation worse. I guess the C way was the best. benchmark::Resume, benchmark.Resume benchmarkResume. This is palatable to me, what i hate to see is things like array.Array.
Sep 25 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, September 25, 2011 21:51:49 Andrei Alexandrescu wrote:
 On 9/25/11 8:41 PM, bearophile wrote:
 Andrei Alexandrescu:
 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html

Is the micron symbol visible on Windows too?

Ionno, haven't tried.

core.time.Duration has alwayse used the micron symbol in its toString. Worst case, I would think that you'd get a box or other symbol which indicates that the symbol is unknown, and since the micron symbol is the only one used which isn't ASCII, it'll be fairly obvious what it was meant to be. - Jonathan M Davis
Sep 25 2011
prev sibling next sibling parent "Regan Heath" <regan netmail.co.nz> writes:
On Mon, 26 Sep 2011 04:14:59 +0100, Andrej Mitrovic  
<andrej.mitrovich gmail.com> wrote:

 I'm thinking we could shorten the function names. benchmarkResume is a
 mouthful, and we already came to the conclusion that we shouldn't
 repeat module names in function names. But I'm all out of ideas for
 naming things, sorry. :)

What's wrong with "benchResume"? -- Using Opera's revolutionary email client: http://www.opera.com/mail/
Sep 26 2011
prev sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 9/26/11, Regan Heath <regan netmail.co.nz> wrote:
 What's wrong with "benchResume"?

I'm ok with that.
Sep 26 2011
prev sibling next sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sun, 25 Sep 2011 21:08:45 -0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 I've had a good time with std.benchmark today and made it ready for
 submission to Phobos. As a perk I've added the %h format specifier,
 which formats a number in human-readable form using SI prefixes.

 For those interested in discussing this prior to the formal review
 process, the code is at:

 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d

 and the dox at

 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html

 Comments and suggestions are welcome. If possible, it would be great if
 this submission were given a bump in the priority queue. This is because
 with robust benchmarking in Phobos we can ask new, performance-sensitive
 submissions, to add benchmarking.


 Thanks,

 Andrei

Andrei, good job on the framework design, but you've left all of the design flaws of the current benchmark routine. To quote myself from last Tuesday:
 On Tue, 20 Sep 2011 14:01:05 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 Thank you for making this more meaningful! I assumed the standard
 library benchmark function would take care of those things. Should it?

a function, as for really short functions you do have to loop many timesto be able to get a reliable reading. However, actual benchmarking requiresa) tuning the benchmark() call time to about 10-20 ms and b) runningbenchmark() many times, taking the minimum. The idea is that on any givenrun you could hit a context switch, etc. so if you make multiple run, onewill get lucky and not be interrupted. Worse, if a core switch happensbetween StopWatch start and end, the number of ticks returned is random.Hence, the comment to manually limit execution to a single core. So, itmight be nice if benchmark() also took a second parameter, denoting thenumber of times to benchmark the function and had some better documentationon how to do good benchmarking.

Wikipedia also lists some of the challenges to modern benchmarking: http://en.wikipedia.org/wiki/Time_Stamp_Counter So, std.benchmark should [ ] Set the affinity to a single thread at start (i.e. use SetProcessAffinityMask, etc) [ ] Repeat the measurement N times, taking the min. [X] Adaptively increase the number of loop iterations to ensure a valid reading. [ ] Adaptively decrease the number of loop iterations to ensure minimal context switching.
Sep 25 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/25/11 9:02 PM, Robert Jacques wrote:
 On Sun, 25 Sep 2011 21:08:45 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 I've had a good time with std.benchmark today and made it ready for
 submission to Phobos. As a perk I've added the %h format specifier,
 which formats a number in human-readable form using SI prefixes.

 For those interested in discussing this prior to the formal review
 process, the code is at:

 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d

 and the dox at

 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html

 Comments and suggestions are welcome. If possible, it would be great if
 this submission were given a bump in the priority queue. This is because
 with robust benchmarking in Phobos we can ask new, performance-sensitive
 submissions, to add benchmarking.


 Thanks,

 Andrei

Andrei, good job on the framework design, but you've left all of the design flaws of the current benchmark routine. To quote myself from last Tuesday:
 On Tue, 20 Sep 2011 14:01:05 -0400, Timon Gehr <timon.gehr gmx.ch>
 wrote:

 Thank you for making this more meaningful! I assumed the standard
 library benchmark function would take care of those things. Should it?

a function, as for really short functions you do have to loop many timesto be able to get a reliable reading. However, actual benchmarking requiresa) tuning the benchmark() call time to about 10-20 ms and b) runningbenchmark() many times, taking the minimum. The idea is that on any givenrun you could hit a context switch, etc. so if you make multiple run, onewill get lucky and not be interrupted. Worse, if a core switch happensbetween StopWatch start and end, the number of ticks returned is random.Hence, the comment to manually limit execution to a single core. So, itmight be nice if benchmark() also took a second parameter, denoting thenumber of times to benchmark the function and had some better documentationon how to do good benchmarking.

Wikipedia also lists some of the challenges to modern benchmarking: http://en.wikipedia.org/wiki/Time_Stamp_Counter So, std.benchmark should [ ] Set the affinity to a single thread at start (i.e. use SetProcessAffinityMask, etc)

I think that won't influence most measurements measurably.
 [ ] Repeat the measurement N times, taking the min.

std.benchmark repeats the measurement in a loop, discounts the times spent in the iteration proper, and divides total time by the number of iterations to figure the time per iteration. This has the advantage that works with even very fast functions without letting the counter itself affect the timing.
 [X] Adaptively increase the number of loop iterations to ensure a valid
 reading.

The current system tries 10, 100, 1000 iterations until it gets to a total run time of at least 0.5 seconds. That's quite a generous margin, I plan to tighten it.
 [ ] Adaptively decrease the number of loop iterations to ensure minimal
 context switching.

OK. Andrei
Sep 25 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/25/11 11:56 PM, Robert Jacques wrote:
 On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 So, std.benchmark should
 [ ] Set the affinity to a single thread at start (i.e. use
 SetProcessAffinityMask, etc)

I think that won't influence most measurements measurably.

First, a core jump is guaranteed to, essentially, poison the cache. Which, at least in my eyes, invalidates that particular timing run.

Well it will be one of many runs. The questions are if (a) core jumps are likely to happen during a benchmark run, (b) many enough will happen to influence a 500ms time window.
 Second, timing generally relies on the CPUs Time Stamp Counter, which is
 not multi-thread safe; a core switch invalidates all previous TSC
 values, and hence, the time measurement itself. Furthermore, the TSC is
 not even guaranteed to have a fixed frequency on some CPUs. Now there
 are ways around the problems of the TSC, but even so:

 (From the Wikipedia)
 "Under Windows platforms, Microsoft strongly discourages using the TSC
 for high-resolution timing for exactly these reasons, providing instead
 the Windows APIs QueryPerformanceCounter and
 QueryPerformanceFrequency.[2] Even when using these functions, Microsoft
 recommends the code to be locked to a single CPU."

std.benchmark uses QueryPerformanceCounter on Windows and clock_gettime/gettimeofday on Unix.
 [ ] Repeat the measurement N times, taking the min.

std.benchmark repeats the measurement in a loop, discounts the times spent in the iteration proper, and divides total time by the number of iterations to figure the time per iteration. This has the advantage that works with even very fast functions without letting the counter itself affect the timing.

Which is necessity but not sufficient for proper benchmarking. The timers themselves are noisy, to say nothing of the effects of context switches, warm-up, etc on a run. During the ptr+length vs ptr+ptr discussion on Tuesday, the naive use benchmark lead to some very wrong conclusions simply because any single run had up to ~5% of error. The only way to get rid of this error, is to make multiple runs.

You may want to take a look at the code. This is a long argument ending with "...make multiple runs" as if std.benchmark is not doing them and you suggest it should. Again, std.benchmark runs the tested function in a loop in powers of 10 until the entire run lasts at least 500ms.
 [X] Adaptively increase the number of loop iterations to ensure a valid
 reading.

The current system tries 10, 100, 1000 iterations until it gets to a total run time of at least 0.5 seconds. That's quite a generous margin, I plan to tighten it.
 [ ] Adaptively decrease the number of loop iterations to ensure minimal
 context switching.

OK.


In fact upon further thinking the iterations should not be too few. Context switches and other activity is inevitable, but the effects are negligible over long periods (as half a second would be). To conclude, I'm unsure what steps you suggest to take to improve std.benchmark, and I'd be grateful if you could clarify them after perusing its code and documentation. Thanks, Andrei
Sep 26 2011
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Monday, September 26, 2011 11:43:52 Robert Jacques wrote:
 Great, but MS still recommends benchmarking be done on a single core. And if
 MS thinks that is how benchmarking should be done, I think that's how we
 should do it.

And how would you ensure that? Beyond having your application be single- threaded, I'm not aware of any way to have any control over whether a particular process is run on a single core or not. Generally, that stuff is completely managed by the OS and the hardware and is outside the control of the programmer. I'm not saying that there definitively isn't a way to ensure that a process only runs on one core, but I'm not aware of one. - Jonathan M Davis
Sep 26 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 09/26/2011 05:59 PM, Jonathan M Davis wrote:
 On Monday, September 26, 2011 11:43:52 Robert Jacques wrote:
 Great, but MS still recommends benchmarking be done on a single core. And if
 MS thinks that is how benchmarking should be done, I think that's how we
 should do it.

And how would you ensure that? Beyond having your application be single- threaded, I'm not aware of any way to have any control over whether a particular process is run on a single core or not. Generally, that stuff is completely managed by the OS and the hardware and is outside the control of the programmer. I'm not saying that there definitively isn't a way to ensure that a process only runs on one core, but I'm not aware of one. - Jonathan M Davis

There are API functions to do that on any reasonable OS.
Sep 26 2011
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/26/11 8:43 AM, Robert Jacques wrote:
 Great, but MS still recommends benchmarking be done on a single core.
 And if MS thinks that is how benchmarking should be done, I think that's
 how we should do it.

  From MSDN documentation on QueryPerformanceCounter:
 "On a multiprocessor computer, it should not matter which processor is
 called. However, you can get different results on different processors
 due to bugs in the basic input/output system (BIOS) or the hardware
 abstraction layer (HAL). To specify processor affinity for a thread, use
 the SetThreadAffinityMask function."

I guess this may have relevance only on multithreaded apps, but it's best to do things by the book. I'll look into it.
 *sigh* By multiple runs, I mean multiple runs of the benchmark function,
 _including_ the loops. From my post last Tuesday:

I now got it, sorry for being slow. How about taking the minimum of 10 consecutive bigloop runs while committing to the same overall timing budget? Thanks, Andrei
Sep 26 2011
prev sibling next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 26 Sep 2011 13:20:01 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 09/26/2011 05:59 PM, Jonathan M Davis wrote:
 On Monday, September 26, 2011 11:43:52 Robert Jacques wrote:
 Great, but MS still recommends benchmarking be done on a single core. And if
 MS thinks that is how benchmarking should be done, I think that's how we
 should do it.

And how would you ensure that? Beyond having your application be single- threaded, I'm not aware of any way to have any control over whether a particular process is run on a single core or not. Generally, that stuff is completely managed by the OS and the hardware and is outside the control of the programmer. I'm not saying that there definitively isn't a way to ensure that a process only runs on one core, but I'm not aware of one. - Jonathan M Davis

There are API functions to do that on any reasonable OS.

Such as SetThreadAffinityMask, which I mentioned in my original post.
Sep 27 2011
prev sibling next sibling parent reply travert phare.normalesup.org (Christophe) writes:
"Robert Jacques" , dans le message (digitalmars.D:145443), a écrit :
 *sigh* By multiple runs, I mean multiple runs of the benchmark function,
_including_ the loops. From my post last Tuesday:
 
      double min_f1 = int.max;
      double min_f2 = int.max;
      foreach(i;0..10_000) {
          auto b=benchmark!(f1,f2)(3);
          min_f1 = min(min_f1, b[0].to!("seconds",double));
          min_f2 = min(min_f2, b[1].to!("seconds",double));
      }
 
 All the looping that benchmark currently does is to lengthen the total 
 time a function takes. This makes it possible to avoid sampling errors 
 with regard to the performance counter itself. But you are still only 
 making a single, noisy measurement of a (meta)function. In order to 
 get a robust result, you need to make multiple measurements and take 
 the minimum.

Sorry to question you, but what makes you say that the minimum is more interesting than, let's say, the mean + the standard deviation. Is there any paper supporting this ? I am not a benchmark specialist, but an average statistician who knows that the minimum is not always very informative (a very low value can be a piece of luck). I am sure you may have good reasons to choose the minimum, but I just want to make sure we make the right choice by choosing to use the minimun of consecutive benchmarks. -- Christophe
Sep 28 2011
parent travert phare.normalesup.org (Christophe) writes:
I am reassured now.

Thanks a lot.

-- 
Christophe
Sep 28 2011
prev sibling next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 27.09.2011, 09:08 Uhr, schrieb Robert Jacques <sandford jhu.edu>:

 On Mon, 26 Sep 2011 13:20:01 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 09/26/2011 05:59 PM, Jonathan M Davis wrote:
 On Monday, September 26, 2011 11:43:52 Robert Jacques wrote:
 Great, but MS still recommends benchmarking be done on a single core.  
 And if
 MS thinks that is how benchmarking should be done, I think that's how  
 we
 should do it.

And how would you ensure that? Beyond having your application be single- threaded, I'm not aware of any way to have any control over whether a particular process is run on a single core or not. Generally, that stuff is completely managed by the OS and the hardware and is outside the control of the programmer. I'm not saying that there definitively isn't a way to ensure that a process only runs on one core, but I'm not aware of one. - Jonathan M Davis

There are API functions to do that on any reasonable OS.

Such as SetThreadAffinityMask, which I mentioned in my original post.

sched_setaffinity (per process / Linux) pthread_setaffinity_np (per thread / Posix)
Sep 28 2011
prev sibling parent Don <nospam nospam.com> writes:
On 26.09.2011 17:43, Robert Jacques wrote:
 Second, timing generally relies on the CPUs Time Stamp Counter, which is
 not multi-thread safe; a core switch invalidates all previous TSC
 values, and hence, the time measurement itself. Furthermore, the TSC is
 not even guaranteed to have a fixed frequency on some CPUs. Now there
 are ways around the problems of the TSC, but even so:

 (From the Wikipedia)
 "Under Windows platforms, Microsoft strongly discourages using the TSC
 for high-resolution timing for exactly these reasons, providing instead
 the Windows APIs QueryPerformanceCounter and
 QueryPerformanceFrequency.[2] Even when using these functions, Microsoft
 recommends the code to be locked to a single CPU."

std.benchmark uses QueryPerformanceCounter on Windows and clock_gettime/gettimeofday on Unix.

Great, but MS still recommends benchmarking be done on a single core. And if MS thinks that is how benchmarking should be done, I think that's how we should do it.

I think that's quite misleading. Microsoft has made some assumptions in there about what you are measuring. QueryPerformanceCounter gives you the wall clock time. But, with modern CPUs (esp, core i7 Sandy Bridge, but also as far back as Pentium M) the CPU frequency isn't constant. So, the wall clock time depends both on the number of clock cycles required to execute the code, AND on the temperature of the CPU! If you run the same benchmark code enough times, it'll eventually get slower as the CPU heats up! Personally I always use the TSC, and then discard any results where the TSC is inconsistent. Specifically I use the hardware performance counters since it gives you real information: code #1 executes N more instructions than code #2, it has B fewer branches, it has D more level 1 cache misses, etc. I've managed to get rock-solid data that way, but it takes a lot of work (eg, you have to make sure that your stack is aligned to 16 bytes). BUT this sort of thing is only relevant to really small sections of code (< 1 time slice). If you're timing something which involves hardware other than the CPU (eg, database access, network access) then you really want the wall clock time. And if the section of code is long enough, then maybe you do care how much it heats up the CPU! But OTOH once you're in that regime, I don't think you care about processor affinity. In real life, you WILL get core transitions. So I don't think it's anywhere near as simple as what Microsoft makes out. It's very important to be clear about what kind of stuff you intend to measure.
Sep 29 2011
prev sibling next sibling parent so <so so.so> writes:
On Mon, 26 Sep 2011 04:08:45 +0300, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 I've had a good time with std.benchmark today and made it ready for  
 submission to Phobos. As a perk I've added the %h format specifier,  
 which formats a number in human-readable form using SI prefixes.

 For those interested in discussing this prior to the formal review  
 process, the code is at:

 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d

 and the dox at

 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html

 Comments and suggestions are welcome. If possible, it would be great if  
 this submission were given a bump in the priority queue. This is because  
 with robust benchmarking in Phobos we can ask new, performance-sensitive  
 submissions, to add benchmarking.


 Thanks,

 Andrei

I know it does same thing under the hood. Yet, wouldn't this(AutoStart autostart) { if(autostart) start(); } be better off something explicit like: this(AutoStart autostart=AutoStart.no) { if(Autostart.yes == autostart) start(); } It overwrites default constructor but then again it is explicit. Maybe i am completely of the hook and there is a phobos/D consensus like "leave the def. ctor alone!"
Sep 25 2011
prev sibling next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 On 9/25/11 9:02 PM, Robert Jacques wrote:
 On Sun, 25 Sep 2011 21:08:45 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 I've had a good time with std.benchmark today and made it ready for
 submission to Phobos. As a perk I've added the %h format specifier,
 which formats a number in human-readable form using SI prefixes.

 For those interested in discussing this prior to the formal review
 process, the code is at:

 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d

 and the dox at

 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html

 Comments and suggestions are welcome. If possible, it would be great if
 this submission were given a bump in the priority queue. This is because
 with robust benchmarking in Phobos we can ask new, performance-sensitive
 submissions, to add benchmarking.


 Thanks,

 Andrei

Andrei, good job on the framework design, but you've left all of the design flaws of the current benchmark routine. To quote myself from last Tuesday:
 On Tue, 20 Sep 2011 14:01:05 -0400, Timon Gehr <timon.gehr gmx.ch>
 wrote:

 Thank you for making this more meaningful! I assumed the standard
 library benchmark function would take care of those things. Should it?

a function, as for really short functions you do have to loop many timesto be able to get a reliable reading. However, actual benchmarking requiresa) tuning the benchmark() call time to about 10-20 ms and b) runningbenchmark() many times, taking the minimum. The idea is that on any givenrun you could hit a context switch, etc. so if you make multiple run, onewill get lucky and not be interrupted. Worse, if a core switch happensbetween StopWatch start and end, the number of ticks returned is random.Hence, the comment to manually limit execution to a single core. So, itmight be nice if benchmark() also took a second parameter, denoting thenumber of times to benchmark the function and had some better documentationon how to do good benchmarking.

Wikipedia also lists some of the challenges to modern benchmarking: http://en.wikipedia.org/wiki/Time_Stamp_Counter So, std.benchmark should [ ] Set the affinity to a single thread at start (i.e. use SetProcessAffinityMask, etc)

I think that won't influence most measurements measurably.

First, a core jump is guaranteed to, essentially, poison the cache. Which, at least in my eyes, invalidates that particular timing run. Second, timing generally relies on the CPUs Time Stamp Counter, which is not multi-thread safe; a core switch invalidates all previous TSC values, and hence, the time measurement itself. Furthermore, the TSC is not even guaranteed to have a fixed frequency on some CPUs. Now there are ways around the problems of the TSC, but even so: (From the Wikipedia) "Under Windows platforms, Microsoft strongly discourages using the TSC for high-resolution timing for exactly these reasons, providing instead the Windows APIs QueryPerformanceCounter and QueryPerformanceFrequency.[2] Even when using these functions, Microsoft recommends the code to be locked to a single CPU."
 [ ] Repeat the measurement N times, taking the min.

std.benchmark repeats the measurement in a loop, discounts the times spent in the iteration proper, and divides total time by the number of iterations to figure the time per iteration. This has the advantage that works with even very fast functions without letting the counter itself affect the timing.

Which is necessity but not sufficient for proper benchmarking. The timers themselves are noisy, to say nothing of the effects of context switches, warm-up, etc on a run. During the ptr+length vs ptr+ptr discussion on Tuesday, the naive use benchmark lead to some very wrong conclusions simply because any single run had up to ~5% of error. The only way to get rid of this error, is to make multiple runs.
 [X] Adaptively increase the number of loop iterations to ensure a valid
 reading.

The current system tries 10, 100, 1000 iterations until it gets to a total run time of at least 0.5 seconds. That's quite a generous margin, I plan to tighten it.
 [ ] Adaptively decrease the number of loop iterations to ensure minimal
 context switching.

OK. Andrei

Sep 25 2011
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/25/2011 6:08 PM, Andrei Alexandrescu wrote:
 I've had a good time with std.benchmark today and made it ready for submission
 to Phobos. As a perk I've added the %h format specifier, which formats a number
 in human-readable form using SI prefixes.

We already have std.perf: http://www.d-programming-language.org/phobos/std_perf.html A review should compare the two.
Sep 25 2011
next sibling parent Jerry Quinn <jlquinn optonline.net> writes:
Walter Bright Wrote:

 We already have std.perf:
 
   http://www.d-programming-language.org/phobos/std_perf.html
 
 A review should compare the two.

std.perf doesn't show up in the nav bar to the left. I didn't know it existed until your post :-) Jerry
Sep 26 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Monday, September 26, 2011 11:30:58 Jerry Quinn wrote:
 Walter Bright Wrote:
 We already have std.perf:
   http://www.d-programming-language.org/phobos/std_perf.html
 
 A review should compare the two.

std.perf doesn't show up in the nav bar to the left. I didn't know it existed until your post :-)

I don't think that it ever showed up in the online docs. I believe that the current StopWatch and benchmarking functions in std.datetime are supposed to have replaced it as well, so it's been scheduled for deprecation for a while (and probably should have been deprecated by now). Glancing at it though, it does appear to have more functionality that the StopWatch and benchmarking functions in std.datetime currently provide. - Jonathan M Davis
Sep 26 2011
prev sibling next sibling parent Jens Mueller <jens.k.mueller gmx.de> writes:
Andrei Alexandrescu wrote:
 On 9/25/11 11:56 PM, Robert Jacques wrote:
On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
So, std.benchmark should
[ ] Set the affinity to a single thread at start (i.e. use
SetProcessAffinityMask, etc)

I think that won't influence most measurements measurably.

First, a core jump is guaranteed to, essentially, poison the cache. Which, at least in my eyes, invalidates that particular timing run.

Well it will be one of many runs. The questions are if (a) core jumps are likely to happen during a benchmark run, (b) many enough will happen to influence a 500ms time window.

500? To me it looks as if the time window is 10ms long (line 447)? Jens
Sep 26 2011
prev sibling next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 26 Sep 2011 09:56:27 -0400, Jens Mueller <jens.k.mueller gmx.de> wrote:

 Andrei Alexandrescu wrote:
 On 9/25/11 11:56 PM, Robert Jacques wrote:
On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
So, std.benchmark should
[ ] Set the affinity to a single thread at start (i.e. use
SetProcessAffinityMask, etc)

I think that won't influence most measurements measurably.

First, a core jump is guaranteed to, essentially, poison the cache. Which, at least in my eyes, invalidates that particular timing run.

Well it will be one of many runs. The questions are if (a) core jumps are likely to happen during a benchmark run, (b) many enough will happen to influence a 500ms time window.

500? To me it looks as if the time window is 10ms long (line 447)? Jens

The minimum, automatic time window is 10 ms. Given the scaling factor, the actual time window will be between 10 ms and 100 ms. IIRC A context switch occurs every 15 ms or less. And given that you'd be starting the time window at a random point inside the active time slice, at the minimum (i.e. 10 ms time window) you have a 66% chance of a switch happening.
Sep 26 2011
prev sibling next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 26 Sep 2011 09:39:28 -0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 On 9/25/11 11:56 PM, Robert Jacques wrote:
 On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 So, std.benchmark should
 [ ] Set the affinity to a single thread at start (i.e. use
 SetProcessAffinityMask, etc)

I think that won't influence most measurements measurably.

First, a core jump is guaranteed to, essentially, poison the cache. Which, at least in my eyes, invalidates that particular timing run.

Well it will be one of many runs. The questions are if (a) core jumps are likely to happen during a benchmark run, (b) many enough will happen to influence a 500ms time window.

I don't know exactly how likely core jump are, but context switches definitely happen. And yes, in my experience, they influence a 50 ms time window. Remember, context switches have two outcomes: your thread continues running or your thread is paused. But the timer doesn't pause, when your thread pauses. So all you need is one bad switch to see an effect.
 Second, timing generally relies on the CPUs Time Stamp Counter, which is
 not multi-thread safe; a core switch invalidates all previous TSC
 values, and hence, the time measurement itself. Furthermore, the TSC is
 not even guaranteed to have a fixed frequency on some CPUs. Now there
 are ways around the problems of the TSC, but even so:

 (From the Wikipedia)
 "Under Windows platforms, Microsoft strongly discourages using the TSC
 for high-resolution timing for exactly these reasons, providing instead
 the Windows APIs QueryPerformanceCounter and
 QueryPerformanceFrequency.[2] Even when using these functions, Microsoft
 recommends the code to be locked to a single CPU."

std.benchmark uses QueryPerformanceCounter on Windows and clock_gettime/gettimeofday on Unix.

Great, but MS still recommends benchmarking be done on a single core. And if MS thinks that is how benchmarking should be done, I think that's how we should do it. From MSDN documentation on QueryPerformanceCounter: "On a multiprocessor computer, it should not matter which processor is called. However, you can get different results on different processors due to bugs in the basic input/output system (BIOS) or the hardware abstraction layer (HAL). To specify processor affinity for a thread, use the SetThreadAffinityMask function."
 [ ] Repeat the measurement N times, taking the min.

std.benchmark repeats the measurement in a loop, discounts the times spent in the iteration proper, and divides total time by the number of iterations to figure the time per iteration. This has the advantage that works with even very fast functions without letting the counter itself affect the timing.

Which is necessity but not sufficient for proper benchmarking. The timers themselves are noisy, to say nothing of the effects of context switches, warm-up, etc on a run. During the ptr+length vs ptr+ptr discussion on Tuesday, the naive use benchmark lead to some very wrong conclusions simply because any single run had up to ~5% of error. The only way to get rid of this error, is to make multiple runs.

You may want to take a look at the code. This is a long argument ending with "...make multiple runs" as if std.benchmark is not doing them and you suggest it should. Again, std.benchmark runs the tested function in a loop in powers of 10 until the entire run lasts at least 500ms.

*sigh* By multiple runs, I mean multiple runs of the benchmark function, _including_ the loops. From my post last Tuesday: double min_f1 = int.max; double min_f2 = int.max; foreach(i;0..10_000) { auto b=benchmark!(f1,f2)(3); min_f1 = min(min_f1, b[0].to!("seconds",double)); min_f2 = min(min_f2, b[1].to!("seconds",double)); } All the looping that benchmark currently does is to lengthen the total time a function takes. This makes it possible to avoid sampling errors with regard to the performance counter itself. But you are still only making a single, noisy measurement of a (meta)function. In order to get a robust result, you need to make multiple measurements and take the minimum.
 [X] Adaptively increase the number of loop iterations to ensure a valid
 reading.

The current system tries 10, 100, 1000 iterations until it gets to a total run time of at least 0.5 seconds. That's quite a generous margin, I plan to tighten it.
 [ ] Adaptively decrease the number of loop iterations to ensure minimal
 context switching.

OK.


In fact upon further thinking the iterations should not be too few. Context switches and other activity is inevitable, but the effects are negligible over long periods (as half a second would be). To conclude, I'm unsure what steps you suggest to take to improve std.benchmark, and I'd be grateful if you could clarify them after perusing its code and documentation. Thanks, Andrei

Sep 26 2011
prev sibling next sibling parent reply Jose Armando Garcia <jsancio gmail.com> writes:
On Sun, Sep 25, 2011 at 6:08 PM, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 Comments and suggestions are welcome.

Very nice. Really like the functionality. Really like how you implemented "void benchmarkModule(string mod)(File target = stdout)". I have a few comments. My main concern on how this function works is that it looks at the module for methods starting with benchmark_. I like the idea in general but I am worry that it would pollute the symbol space. Does this work with private symbols? Can the user make all the members private and it will still work? What about versioning should the user version all the benchmark members with version(Benchmark) or something like that? Should the user just create a benchmark module that would just contain all these benchmark functions. What do you recommend to the user? Should the doc contain these recommendations? One last comment for now. Maybe we can make the prefix, 'benchmark_', configurable by passing it to benchmarkModule(...). I think that benchmarkModule would greatly benefit from annotation it would be nice if the user could write: benchmark("list") private void listInsert() {...}
Sep 26 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/26/11 9:04 AM, Jose Armando Garcia wrote:
 On Sun, Sep 25, 2011 at 6:08 PM, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org>  wrote:
 Comments and suggestions are welcome.

Very nice. Really like the functionality. Really like how you implemented "void benchmarkModule(string mod)(File target = stdout)". I have a few comments. My main concern on how this function works is that it looks at the module for methods starting with benchmark_. I like the idea in general but I am worry that it would pollute the symbol space. Does this work with private symbols? Can the user make all the members private and it will still work?

Not sure whether the compiler allows it, but it shouldn't work if one module benchmarks another. That doesn't worry me - benchmark functions are meant to be exposed for anyone to play with and it doesn't seem a function starting with "benchmark_" is casually confusable.
 What about versioning should the user
 version all the benchmark members with version(Benchmark) or something
 like that?

As they wish. We will indeed define a version(StdBenchmark) for Phobos itself.
 Should the user just create a benchmark module that would
 just contain all these benchmark functions. What do you recommend to
 the user? Should the doc contain these recommendations?

Good idea. I will add some typical physical designs.
 One last comment for now. Maybe we can make the prefix, 'benchmark_',
 configurable by passing it to benchmarkModule(...).

Good idea, will do.
 I think that benchmarkModule would greatly benefit from annotation it
 would be nice if the user could write:

  benchmark("list")
 private void listInsert() {...}

I think the benefit would be only marginal. Thanks, Andrei
Sep 26 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, September 25, 2011 20:08:45 Andrei Alexandrescu wrote:
 I've had a good time with std.benchmark today and made it ready for
 submission to Phobos. As a perk I've added the %h format specifier,
 which formats a number in human-readable form using SI prefixes.
 
 For those interested in discussing this prior to the formal review
 process, the code is at:
 
 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d
 
 and the dox at
 
 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html
 
 Comments and suggestions are welcome. If possible, it would be great if
 this submission were given a bump in the priority queue. This is because
 with robust benchmarking in Phobos we can ask new, performance-sensitive
 submissions, to add benchmarking.

You might want to check the compilability of your code. I merged your benchmark branch into a branch that I just branched from master, and it fails to compile with the latest compiler. - Jonathan M Davis
Sep 26 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Monday, September 26, 2011 20:09:37 Jonathan M Davis wrote:
 On Sunday, September 25, 2011 20:08:45 Andrei Alexandrescu wrote:
 I've had a good time with std.benchmark today and made it ready for
 submission to Phobos. As a perk I've added the %h format specifier,
 which formats a number in human-readable form using SI prefixes.
 
 For those interested in discussing this prior to the formal review
 process, the code is at:
 
 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d
 
 and the dox at
 
 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html
 
 Comments and suggestions are welcome. If possible, it would be great if
 this submission were given a bump in the priority queue. This is because
 with robust benchmarking in Phobos we can ask new, performance-sensitive
 submissions, to add benchmarking.

You might want to check the compilability of your code. I merged your benchmark branch into a branch that I just branched from master, and it fails to compile with the latest compiler.

That is, the unittest target fails to compile (due to a template error). The normal build succeeds (probably because it's not compiling the benchmark function - which is the templated function which is failing). - Jonathan M Davis
Sep 26 2011
prev sibling next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Mon, 26 Sep 2011 17:56:26 -0400, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 On 9/26/11 8:43 AM, Robert Jacques wrote:
 Great, but MS still recommends benchmarking be done on a single core.
 And if MS thinks that is how benchmarking should be done, I think that's
 how we should do it.

  From MSDN documentation on QueryPerformanceCounter:
 "On a multiprocessor computer, it should not matter which processor is
 called. However, you can get different results on different processors
 due to bugs in the basic input/output system (BIOS) or the hardware
 abstraction layer (HAL). To specify processor affinity for a thread, use
 the SetThreadAffinityMask function."

I guess this may have relevance only on multithreaded apps, but it's best to do things by the book. I'll look into it.

Well, this applies to any multi-core PC; whether the app is single-threaded or multi-threaded makes no difference.
 *sigh* By multiple runs, I mean multiple runs of the benchmark function,
 _including_ the loops. From my post last Tuesday:

I now got it, sorry for being slow. How about taking the minimum of 10 consecutive bigloop runs while committing to the same overall timing budget?

No problem. 10 is a good default and is generally what I use. Although, there should be at least one function which allows manual control out the number of bigloops. i.e. for the ptr+length discussion I ultimately used a lot more than 10 iterations, as this was a very precise comparison. And people wanting to test network or cold-start disk performance may only want to use a single bigloop.
Sep 27 2011
prev sibling next sibling parent travert phare.normalesup.org (Christophe) writes:
Andrei Alexandrescu , dans le message (digitalmars.D:145370), a écrit :
 I've had a good time with std.benchmark today and made it ready for 
 submission to Phobos. As a perk I've added the %h format specifier, 
 which formats a number in human-readable form using SI prefixes.
 
 For those interested in discussing this prior to the formal review 
 process, the code is at:
 
 https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
 https://github.com/andralex/phobos/blob/benchmark/std/format.d
 
 and the dox at
 
 http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
 http://erdani.com/d/web/phobos-prerelease/std_format.html
 
 Comments and suggestions are welcome. If possible, it would be great if 
 this submission were given a bump in the priority queue. This is because 
 with robust benchmarking in Phobos we can ask new, performance-sensitive 
 submissions, to add benchmarking.
 

Nice piece of work. I quite don't like the use of a global (well, thread local) name for the benchmark clockWatch. What about giving the clockWatch instance as a parameter to the benchmarked functions if they need to pause/resume? That would also avoid the issue of naming benchmarkPause and benchmarkResume (btw, I would have named them pauseBenchmark and resumeBenchmark). -- Christophe
Sep 28 2011
prev sibling next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Wed, 28 Sep 2011 08:19:16 -0400, Christophe <travert phare.normalesup.org>
wrote:
 "Robert Jacques" , dans le message (digitalmars.D:145443), a crit :
 *sigh* By multiple runs, I mean multiple runs of the benchmark function,
_including_ the loops. From my post last Tuesday:

      double min_f1 = int.max;
      double min_f2 = int.max;
      foreach(i;0..10_000) {
          auto b=benchmark!(f1,f2)(3);
          min_f1 = min(min_f1, b[0].to!("seconds",double));
          min_f2 = min(min_f2, b[1].to!("seconds",double));
      }

 All the looping that benchmark currently does is to lengthen the total
 time a function takes. This makes it possible to avoid sampling errors
 with regard to the performance counter itself. But you are still only
 making a single, noisy measurement of a (meta)function. In order to
 get a robust result, you need to make multiple measurements and take
 the minimum.

Sorry to question you, but what makes you say that the minimum is more interesting than, let's say, the mean + the standard deviation. Is there any paper supporting this ? I am not a benchmark specialist, but an average statistician who knows that the minimum is not always very informative (a very low value can be a piece of luck). I am sure you may have good reasons to choose the minimum, but I just want to make sure we make the right choice by choosing to use the minimun of consecutive benchmarks.

The reason that the mean + std is so pervasive in science is that it's perfect for a Gaussian noise model. In most physical measurements, you expect some readings to be both above and below the actual value and for most processes we assume the error of these measurements is normally distributed. Importantly, when we know the process not to produce normally distibuted error, we look to other statistics for meaning. Now, for computer timing measurements, you have access to the physical number of clock cycles taken. There is some noise in this, but it's on the order of 1-2 cycles (i.e. ns). As a benchmark is trying to measure the number of cycles something takes to complete, and it is physically impossible to complete that task in less than a fixed number of cycles, 'noise' in the measurement above 1-2 ns is by definition purely additive due to OS or hardware effects (i.e. a context switch occurred, a page fault happened, you put your computer to sleep, etc.). So the best way to get to the true number of cycles a task takes to complete is to take the minimum recorded value, on the premise that you did get lucky and those other issues which you are not trying to measure, had little to no impact. So that's the rational. As for academic support, I've published peer-reviewed papers using this methodology, but I can't remember where I originally got the advice. Also, there are certain types of benchmarks for which the mean + std is better (networking comes to mind), etc. P.S. A little Google-fu turned up: The science of computer benchmarking By Roger W. Hockney, which does recommend using the minimum, particularly to get around the problem of time-slicing (i.e. context switching) by the OS.
Sep 28 2011
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/25/11 6:08 PM, Andrei Alexandrescu wrote:
 I've had a good time with std.benchmark today and made it ready for
 submission to Phobos.

You destroyed, and I listened. I updated my benchmark branch to accept a configurable time budget for measurements, and a number of trials that take the minimum time. Still need to do the thread affinity stuff. Code: https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d https://github.com/andralex/phobos/blob/benchmark/std/format.d Dox: http://erdani.com/d/web/phobos-prerelease/std_benchmark.html http://erdani.com/d/web/phobos-prerelease/std_format.html Destroy once again. Andrei
Sep 28 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/3/11 8:59 AM, Jens Mueller wrote:
[snip]

Thanks for this rich feedback. I'll act on most or all points. A few 
notes follow.

 1. I'm not that convinced that prepending benchmark_ to a function is
 that nice even though it's simple. But following this idea I could
 remove all my unittest blocks and add functions starting with test_ and
 write a module std.test in same fashion. If this approach is considered
 the way to go we can remove unittest altogether and solve testing
 entirely in a library. After all it looks less elegant to me but I admit
 that I see no other way and it is the easiest solution after all. I just
 want to stress that when seeing the code I questioned myself whether
 having unittest in the language is nice to have.

std.benchmark uses relatively new reflection features that weren't available until recently. Adding unittest now would be difficult to justify, but at the time unittest was a good feature.
 2. When benchmarking code you often investigate the algorithm's
 properties regarding its scalability.

Yah, that definitely needs to be looked at. I think I need to add some support beyond mere documentation.
 3.
 (r[i] / 10_000_000.).to!("nsecs", int)
 should be written as
 (r[i] / 10_000_000).nsecs

 Same for msecs and similar.

Noted.
 4. Test results should be exposed to the caller.
 Usually, it is enough to write the results to the console. But you may
 want to post process the results. Being it either graphically or to
 output XML for whatever reason.
 I think it would be beneficial to add this flexibility and separate the
 benchmarking logic from its generating the output.
 It might be useful to add the used CPU using std.cpuid to the output.

We should decouple collection from formatting. However, this complicates matters because we'd need to expose the benchmark results structure. I'll think of it.
 5. In the first example I completely missed that it is not "relative
 calls". I thought this was one term even though I didn't know what it
 should mean. Later I figured that there is an empty column called
 relative.

There are two spaces there, I should add three.
 6. Wrong numbers in documentation
 The examples in the documentation have wrong numbers. In the end one
 should pay special attention that the documentation is consistent. I
 think one cannot automate this.

Not sure how this can be addressed at all (or maybe I don't understand). The examples show real numbers collected from real runs on my laptop. I don't think one may expect to reproduce them with reasonable accuracy because they depend on a lot of imponderables.
 There are also rounding errors. That means when I try to compute these
 numbers I get different in the first digit after the decimal point. I
 wonder why this is. Even though the measurement may be noisy the
 computed numbers should be more precise I think.

Where are calculations mistaken? Indeed I see e.g. that append built-in has 83ns/call and 11.96M calls/s. Reverting the first with a "calculator" program yields 12.04 Mcalls/s, and reverting the second yields 83.61ns/call. I'll see what can be done.
 7. Changing units in the output.
 One benchmark report may use different units in its output. The one in
 the example uses microseconds and nanoseconds. This makes comparison
 more difficult and should be avoided (at least for groups). It is
 error-prone when analyzing the numbers.

Problem is space and wildly varying numbers. I initially had exponential notation for everything, but it was super ugly and actually more difficult to navigate. That's why I added the compact notation. I think we'll need to put up with it.
 8. I prefer writing std.benchmark.pause() over benchmarkPause() and
 std.benchmark.resume() over benchmarkResume().

OK.
 10. The file reading/writing benchmark should be improved. Since the
 file should be deleted. The file writing benchmark should be more like:
 void benchmark_fileWrite()
 {
      benchmarkPause();
      auto f = File.tmpfile();
      benchmarkResume();

      f.writeln("hello, world!");
 }

 But this also measures the deletion of the file.

If you don't want to measure std.file.write but instead only the time spent in writeln, his should work and measure only the time spent in writing to the file: void benchmark_fileWrite() { benchmarkPause(); auto f = File.tmpfile(); benchmarkResume(); f.writeln("hello, world!"); benchmarkPause(); } I haven't yet tested the case when a functions leaves the benchmark paused.
 I have no idea how to fix the benchmark for reading a file. But I
 believe it involves using mkstemp on POSIX and similar on Windows.
 mkstemp should be added to std.file in the long run.

Why is there a need to fix it? The benchmark measures the performance of std.file.read soup to nuts. In related news, I got word from a colleagues that I should also use times() (http://linux.die.net/man/2/times) to distinguish time spent in user space vs. system calls. This is because some functions must issue certain system calls in any case and the user is interested in how much they can improve the situation by optimizing their end of the deal. This would apply e.g. to file I/O where the overhead of formatting etc. would become easier to distinguish. Andrei
Oct 03 2011
parent reply Jacob Carlborg <doob me.com> writes:
On 2011-10-03 17:44, Andrei Alexandrescu wrote:
 On 10/3/11 8:59 AM, Jens Mueller wrote:
 4. Test results should be exposed to the caller.
 Usually, it is enough to write the results to the console. But you may
 want to post process the results. Being it either graphically or to
 output XML for whatever reason.
 I think it would be beneficial to add this flexibility and separate the
 benchmarking logic from its generating the output.
 It might be useful to add the used CPU using std.cpuid to the output.

We should decouple collection from formatting. However, this complicates matters because we'd need to expose the benchmark results structure. I'll think of it.

Not necessarily. The user could provide a delegate as a callback, something like this: struct Result { string func; double relative; double calls; double tCall; double callsPerSecond; } alias void delegate (Result result) Formatter; Doesn't seem complicated to me. Then the module could provide a default formatter that does what the module currently does. -- /Jacob Carlborg
Oct 03 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/3/11 12:49 PM, Jacob Carlborg wrote:
 On 2011-10-03 17:44, Andrei Alexandrescu wrote:
 On 10/3/11 8:59 AM, Jens Mueller wrote:
 4. Test results should be exposed to the caller.
 Usually, it is enough to write the results to the console. But you may
 want to post process the results. Being it either graphically or to
 output XML for whatever reason.
 I think it would be beneficial to add this flexibility and separate the
 benchmarking logic from its generating the output.
 It might be useful to add the used CPU using std.cpuid to the output.

We should decouple collection from formatting. However, this complicates matters because we'd need to expose the benchmark results structure. I'll think of it.

Not necessarily. The user could provide a delegate as a callback, something like this: struct Result { string func; double relative; double calls; double tCall; double callsPerSecond; } alias void delegate (Result result) Formatter;

That's what I meant when I said "we'd need to expose the benchmark results structure" :o). There's probably no way out of this anyway, particularly if we provide richer information via times(), scalability, etc. Andrei
Oct 03 2011
parent Jacob Carlborg <doob me.com> writes:
On 2011-10-03 20:04, Andrei Alexandrescu wrote:
 On 10/3/11 12:49 PM, Jacob Carlborg wrote:
 On 2011-10-03 17:44, Andrei Alexandrescu wrote:
 On 10/3/11 8:59 AM, Jens Mueller wrote:
 4. Test results should be exposed to the caller.
 Usually, it is enough to write the results to the console. But you may
 want to post process the results. Being it either graphically or to
 output XML for whatever reason.
 I think it would be beneficial to add this flexibility and separate the
 benchmarking logic from its generating the output.
 It might be useful to add the used CPU using std.cpuid to the output.

We should decouple collection from formatting. However, this complicates matters because we'd need to expose the benchmark results structure. I'll think of it.

Not necessarily. The user could provide a delegate as a callback, something like this: struct Result { string func; double relative; double calls; double tCall; double callsPerSecond; } alias void delegate (Result result) Formatter;

That's what I meant when I said "we'd need to expose the benchmark results structure" :o).

Hehe, Ok. I don't think it gets THAT much more complicated, just call the delegate with the results instead of writef(ln).
 There's probably no way out of this anyway, particularly if we provide
 richer information via times(), scalability, etc.


 Andrei

-- /Jacob Carlborg
Oct 03 2011
prev sibling next sibling parent Jens Mueller <jens.k.mueller gmx.de> writes:
Andrei Alexandrescu wrote:
 On 10/3/11 8:59 AM, Jens Mueller wrote:
 [snip]
 
 Thanks for this rich feedback. I'll act on most or all points. A few
 notes follow.
 
1. I'm not that convinced that prepending benchmark_ to a function is
that nice even though it's simple. But following this idea I could
remove all my unittest blocks and add functions starting with test_ and
write a module std.test in same fashion. If this approach is considered
the way to go we can remove unittest altogether and solve testing
entirely in a library. After all it looks less elegant to me but I admit
that I see no other way and it is the easiest solution after all. I just
want to stress that when seeing the code I questioned myself whether
having unittest in the language is nice to have.

std.benchmark uses relatively new reflection features that weren't available until recently. Adding unittest now would be difficult to justify, but at the time unittest was a good feature.

I see. This is also my feeling.
6. Wrong numbers in documentation
The examples in the documentation have wrong numbers. In the end one
should pay special attention that the documentation is consistent. I
think one cannot automate this.

Not sure how this can be addressed at all (or maybe I don't understand). The examples show real numbers collected from real runs on my laptop. I don't think one may expect to reproduce them with reasonable accuracy because they depend on a lot of imponderables.

I just mean that the relative speed ups printed in the table does not match the ones given in the surrounding text.
There are also rounding errors. That means when I try to compute these
numbers I get different in the first digit after the decimal point. I
wonder why this is. Even though the measurement may be noisy the
computed numbers should be more precise I think.

Where are calculations mistaken? Indeed I see e.g. that append built-in has 83ns/call and 11.96M calls/s. Reverting the first with a "calculator" program yields 12.04 Mcalls/s, and reverting the second yields 83.61ns/call. I'll see what can be done.

Yes. This is what I meant here.
7. Changing units in the output.
One benchmark report may use different units in its output. The one in
the example uses microseconds and nanoseconds. This makes comparison
more difficult and should be avoided (at least for groups). It is
error-prone when analyzing the numbers.

Problem is space and wildly varying numbers. I initially had exponential notation for everything, but it was super ugly and actually more difficult to navigate. That's why I added the compact notation. I think we'll need to put up with it.

If formatting and collecting is separated it is fine I'll guess. It's okay for the default output.
10. The file reading/writing benchmark should be improved. Since the
file should be deleted. The file writing benchmark should be more like:
void benchmark_fileWrite()
{
     benchmarkPause();
     auto f = File.tmpfile();
     benchmarkResume();

     f.writeln("hello, world!");
}

But this also measures the deletion of the file.

If you don't want to measure std.file.write but instead only the time spent in writeln, his should work and measure only the time spent in writing to the file: void benchmark_fileWrite() { benchmarkPause(); auto f = File.tmpfile(); benchmarkResume(); f.writeln("hello, world!"); benchmarkPause(); } I haven't yet tested the case when a functions leaves the benchmark paused.

This looks good.
I have no idea how to fix the benchmark for reading a file. But I
believe it involves using mkstemp on POSIX and similar on Windows.
mkstemp should be added to std.file in the long run.

Why is there a need to fix it? The benchmark measures the performance of std.file.read soup to nuts.

Yes. But the benchmark should clean up after itself if possible. So it should remove the file.
 In related news, I got word from a colleagues that I should also use
 times() (http://linux.die.net/man/2/times) to distinguish time spent
 in user space vs. system calls.  This is because some functions must
 issue certain system calls in any case and the user is interested in
 how much they can improve the situation by optimizing their end of
 the deal. This would apply e.g. to file I/O where the overhead of
 formatting etc. would become easier to distinguish.

Good hint from your colleague. The more data one can get when measuring code the better. You could even be interested in cache misses when benchmarking. But CPU time is usually what boils down to. I fear you may loose simplicity. I'm looking forward to your changes. Jens
Oct 03 2011
prev sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
With this new module we might as well implement __MODULE__ or
something similar to avoid hardcoding the module name or using the
.stringof[7..$] trick.
Oct 03 2011
prev sibling parent Jens Mueller <jens.k.mueller gmx.de> writes:
Andrei Alexandrescu wrote:
 On 9/25/11 6:08 PM, Andrei Alexandrescu wrote:
I've had a good time with std.benchmark today and made it ready for
submission to Phobos.

You destroyed, and I listened. I updated my benchmark branch to accept a configurable time budget for measurements, and a number of trials that take the minimum time. Still need to do the thread affinity stuff. Code: https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d https://github.com/andralex/phobos/blob/benchmark/std/format.d Dox: http://erdani.com/d/web/phobos-prerelease/std_benchmark.html http://erdani.com/d/web/phobos-prerelease/std_format.html Destroy once again. Andrei

Overall I like this module very much. 1. I'm not that convinced that prepending benchmark_ to a function is that nice even though it's simple. But following this idea I could remove all my unittest blocks and add functions starting with test_ and write a module std.test in same fashion. If this approach is considered the way to go we can remove unittest altogether and solve testing entirely in a library. After all it looks less elegant to me but I admit that I see no other way and it is the easiest solution after all. I just want to stress that when seeing the code I questioned myself whether having unittest in the language is nice to have. 2. When benchmarking code you often investigate the algorithm's properties regarding its scalability. That's why I think it's useful to give an example in the documentation. Something like void benchmark_insert_array(size_t length)(uint n) { benchmarkPause(); Array!bool array; array.length = length; benchmarkResume(); foreach(i; 0 .. n) array.insertAfter(array[0 .. uniform(0, length)], true); } void benchmark_insert_slist(size_t length)(uint n) { benchmarkPause(); SList!bool list; foreach (i; 0 .. length) list.insertFront(false); benchmarkResume(); foreach(i; 0 .. n) { auto r = take(list[], uniform(0, length)); list.insertAfter(r, true); } } alias benchmark_insert_slist!(10) benchmark_insert_slist_10; alias benchmark_insert_array!(10) benchmark_insert_array_10; alias benchmark_insert_slist!(100) benchmark_insert_slist_100; alias benchmark_insert_array!(100) benchmark_insert_array_100; This example needs more polishing. Input size is the most important dimension that CPU time depends on. The example should clarify how to benchmark depending on the input size. 3. (r[i] / 10_000_000.).to!("nsecs", int) should be written as (r[i] / 10_000_000).nsecs Same for msecs and similar. 4. Test results should be exposed to the caller. Usually, it is enough to write the results to the console. But you may want to post process the results. Being it either graphically or to output XML for whatever reason. I think it would be beneficial to add this flexibility and separate the benchmarking logic from its generating the output. It might be useful to add the used CPU using std.cpuid to the output. 5. In the first example I completely missed that it is not "relative calls". I thought this was one term even though I didn't know what it should mean. Later I figured that there is an empty column called relative. 6. Wrong numbers in documentation The examples in the documentation have wrong numbers. In the end one should pay special attention that the documentation is consistent. I think one cannot automate this. There are also rounding errors. That means when I try to compute these numbers I get different in the first digit after the decimal point. I wonder why this is. Even though the measurement may be noisy the computed numbers should be more precise I think. 7. Changing units in the output. One benchmark report may use different units in its output. The one in the example uses microseconds and nanoseconds. This makes comparison more difficult and should be avoided (at least for groups). It is error-prone when analyzing the numbers. 8. I prefer writing std.benchmark.pause() over benchmarkPause() and std.benchmark.resume() over benchmarkResume(). or import std.benchmark : benchmark; try benchmark.pause() Supporting benchmark.start() would improve the example. Since you just need to specify one line that indicates where the benchmarking should start. But this seems difficult to get to work with the current design and I don't consider it important. It's a very minor thing. 10. The file reading/writing benchmark should be improved. Since the file should be deleted. The file writing benchmark should be more like: void benchmark_fileWrite() { benchmarkPause(); auto f = File.tmpfile(); benchmarkResume(); f.writeln("hello, world!"); } But this also measures the deletion of the file. I have no idea how to fix the benchmark for reading a file. But I believe it involves using mkstemp on POSIX and similar on Windows. mkstemp should be added to std.file in the long run. Jens
Oct 03 2011