www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Review of Andrei's std.benchmark

reply Jens Mueller <jens.k.mueller gmx.de> writes:
Hi,

it's my pleasure to announce the begin of the formal review of Andrei's
std.benchmark. The review will start today and end in two weeks, on 1st
of October. The review is followed by a week of voting which ends on 8th
of October.

Quoting Andrei from his request for formal review:
"I reworked the benchmarking framework for backward compatibility,
flexibility, and convenience.

There are a few enhancement possibilities (such as tracking
system/user time separately etc), but there is value in keeping things
simple and convenient. Right now it really takes only one line of code
and observing a simple naming convention to hook a module into the
benchmarking framework."

Code: https://github.com/D-Programming-Language/phobos/pull/794
Docs: http://dlang.org/phobos-prerelease/std_benchmark.html

If std.benchmark is accepted it will likely lead to a deprecation of
std.datetime's benchmark facilities.

The code is provided as a pull requested and being (as usual) integrated
by the auto tester for Mac OS X, FreeBSD, Linux and Windows (see
(http://d.puremagic.com/test-results/pull-history.ghtml?repoid=3&pullid=794).

In your comments you can/should address the
* design
* implementation
* documentation
* usefulness
of the library.

Provide information regarding the depth (ranging from very brief to
in-depth) of your review and conclude explicitly whether std.benchmark
should or shouldn't be included in Phobos.

Post all feedback to this thread. Constructive feedback is very much
appreciated.

To conclude in more Andrei like words: Happy destruction!

Jens
Sep 17 2012
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/17/12 5:13 PM, Jens Mueller wrote:
 If std.benchmark is accepted it will likely lead to a deprecation of
 std.datetime's benchmark facilities.

One note - I moved the benchmark-related stuff from std.datetime unmodified into std.benchmark and left public aliases in place, so no code breakage is imminent. We may deprecate the aliases themselves later.
 To conclude in more Andrei like words: Happy destruction!

Sounds about right! :o) Andrei
Sep 17 2012
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/18/12 5:07 PM, "Øivind" wrote:
 I think the std.benchmark is definitely a useful library addition, but
 in my mind it currently a bit too limited.

 * All tests are run 1000 times. Depending on the length of the test to
 benchmark, this can be too much. In some cases it would be good to be
 able to trade the number of runs against accuracy.

It would be a good idea to make that a configurable parameter.
 * For all tests, the best run is selected, but would it not be
 reasonable in some cases to get the average value? Maybe excluding the
 runs that are more than a couple std. deviations away from the mean value..

After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.
 * Is there a way of specifying a test name other than the function-name
 when using the 'mixin(scheduleForBenchmarking)' approach to register
 benchmarks?

Not currently. Probably a manual registration of an individual benchmark would make sense.
 * I would also like to be able (if possible) to register two mentioned
 things (number of runs and result strategy) with the mixin approach (or
 similar).

Makes sense.
 * It seems like the baseline for subtraction from subsequent test runs
 is taken from a call to the test function, passing 1 to it. Shouldn't 0
 be passed for this value?

I'll look into that. Thanks, Andrei
Sep 18 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-09-19 11:38, Peter Alexander wrote:

 The problem with slowest is that you end up with the occasional OS
 hiccup or GC collection which throws the entire benchmark off. I see
 your point, but unless you can prevent the OS from interrupting, the
 time would be meaningless.

That's way the average is good to have as well. -- /Jacob Carlborg
Sep 19 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/19/12 3:54 PM, Jacob Carlborg wrote:
 On 2012-09-19 11:38, Peter Alexander wrote:

 The problem with slowest is that you end up with the occasional OS
 hiccup or GC collection which throws the entire benchmark off. I see
 your point, but unless you can prevent the OS from interrupting, the
 time would be meaningless.

That's way the average is good to have as well.

The occasional hiccup is often orders of magnitude slower than the rest, which means it will ruin the average. You may have meant "median", which has more merit, but then I'd say why bother - just use the minimum. Andrei
Sep 21 2012
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/20/12 2:42 AM, Manu wrote:
 On 19 September 2012 12:38, Peter Alexander
 <peter.alexander.au gmail.com <mailto:peter.alexander.au gmail.com>> wrote:

         The fastest execution time is rarely useful to me, I'm almost
         always much
         more interested in the slowest execution time.
         In realtime software, the slowest time is often the only
         important factor,
         everything must be designed to tolerate this possibility.
         I can also imagine other situations where multiple workloads are
         competing
         for time, the average time may be more useful in that case.


     The problem with slowest is that you end up with the occasional OS
     hiccup or GC collection which throws the entire benchmark off. I see
     your point, but unless you can prevent the OS from interrupting, the
     time would be meaningless.


 So then we need to start getting tricky, and choose the slowest one that
 is not beyond an order of magnitude or so outside the average?

The "best way" according to some of the people who've advised my implementation of the framework at Facebook is to take the mode of the measurements distribution, i.e. the time at the maximum density. I implemented that (and it's not easy). It yielded numbers close to the minimum, but less stable and needing more iterations to become stable (when they do get indeed close to the minimum). Let's use the minimum. It is understood it's not what you'll see in production, but it is an excellent proxy for indicative and reproducible performance numbers. Andrei
Sep 20 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-09-20 14:36, Andrei Alexandrescu wrote:

 Let's use the minimum. It is understood it's not what you'll see in
 production, but it is an excellent proxy for indicative and reproducible
 performance numbers.

Why not min, max and average? -- /Jacob Carlborg
Sep 20 2012
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/20/12 1:37 PM, Jacob Carlborg wrote:
 On 2012-09-20 14:36, Andrei Alexandrescu wrote:

 Let's use the minimum. It is understood it's not what you'll see in
 production, but it is an excellent proxy for indicative and reproducible
 performance numbers.

Why not min, max and average?

Because max and average are misleading and uninformative, as I explained. Andrei
Sep 20 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/20/12 2:37 PM, Jacob Carlborg wrote:
 On 2012-09-20 14:36, Andrei Alexandrescu wrote:

 Let's use the minimum. It is understood it's not what you'll see in
 production, but it is an excellent proxy for indicative and reproducible
 performance numbers.

Why not min, max and average?

For a very simple reason: unless the algorithm under benchmark is very long-running, max is completely useless, and it ruins average as well. For virtually all benchmarks I've run, the distribution of timings is a half-Gaussian very concentrated around the minimum. Say you have a minimum of e.g. 73 us. Then there would be a lot of results close to that; the mode of the distribution would be very close, e.g. 75 us, and the more measurements you take, the closer the mode is to the minimum. Then you have a few timings up to e.g. 90 us. And finally you will inevitably have a few outliers at some milliseconds. Those are orders of magnitude larger than anything of interest and are caused by system interrupts that happened to fall in the middle of the measurement. Taking those into consideration and computing the average with those outliers simply brings useless noise into the measurement process. Andrei
Sep 20 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-09-21 06:23, Andrei Alexandrescu wrote:

 For a very simple reason: unless the algorithm under benchmark is very
 long-running, max is completely useless, and it ruins average as well.

I may have completely misunderstood this but aren't we talking about what do include in the output of the benchmark? In that case, if you don't like max and average just don't look at it.
 For virtually all benchmarks I've run, the distribution of timings is a
 half-Gaussian very concentrated around the minimum. Say you have a
 minimum of e.g. 73 us. Then there would be a lot of results close to
 that; the mode of the distribution would be very close, e.g. 75 us, and
 the more measurements you take, the closer the mode is to the minimum.
 Then you have a few timings up to e.g. 90 us. And finally you will
 inevitably have a few outliers at some milliseconds. Those are orders of
 magnitude larger than anything of interest and are caused by system
 interrupts that happened to fall in the middle of the measurement.

 Taking those into consideration and computing the average with those
 outliers simply brings useless noise into the measurement process.

After your replay to one of Manu's post, I think I misunderstood the std.benchmark module. I was thinking more of profiling. But are these quite similar tasks, couldn't std.benchmark work for both? -- /Jacob Carlborg
Sep 21 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/21/12 5:39 AM, Jacob Carlborg wrote:
 On 2012-09-21 06:23, Andrei Alexandrescu wrote:

 For a very simple reason: unless the algorithm under benchmark is very
 long-running, max is completely useless, and it ruins average as well.

I may have completely misunderstood this but aren't we talking about what do include in the output of the benchmark? In that case, if you don't like max and average just don't look at it.

I disagree. I won't include something in my design just so people don't look at it most of the time. Min and average are most of the time an awful thing to include, and will throw off people with bizarre results. If it's there, it's worth looking at. Note how all columns are directly comparable (I might add, unlike other approaches to benchmarking).
 For virtually all benchmarks I've run, the distribution of timings is a
 half-Gaussian very concentrated around the minimum. Say you have a
 minimum of e.g. 73 us. Then there would be a lot of results close to
 that; the mode of the distribution would be very close, e.g. 75 us, and
 the more measurements you take, the closer the mode is to the minimum.
 Then you have a few timings up to e.g. 90 us. And finally you will
 inevitably have a few outliers at some milliseconds. Those are orders of
 magnitude larger than anything of interest and are caused by system
 interrupts that happened to fall in the middle of the measurement.

 Taking those into consideration and computing the average with those
 outliers simply brings useless noise into the measurement process.

After your replay to one of Manu's post, I think I misunderstood the std.benchmark module. I was thinking more of profiling. But are these quite similar tasks, couldn't std.benchmark work for both?

This is an interesting idea. It would delay release quite a bit because I'd need to design and implement things like performance counters and such. Andrei
Sep 21 2012
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/21/12 11:14 AM, Manu wrote:
 On 21 September 2012 07:23, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org <mailto:SeeWebsiteForEmail erdani.org>>
 wrote:

     For a very simple reason: unless the algorithm under benchmark is
     very long-running, max is completely useless, and it ruins average
     as well.


 This is only true for systems with a comprehensive pre-emptive OS
 running on the same core. Most embedded systems will only be affected by
 cache misses and bus contention, in that situation, max is perfectly
 acceptable.

I think embedded systems that run e.g. Linux will be affected by task switching. Andrei
Sep 21 2012
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/20/12 10:05 AM, Manu wrote:
 On 20 September 2012 15:36, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org <mailto:SeeWebsiteForEmail erdani.org>>
 wrote:
     Let's use the minimum. It is understood it's not what you'll see in
     production, but it is an excellent proxy for indicative and
     reproducible performance numbers.


 If you do more than a single iteration, the minimum will virtually
 always be influenced by ideal cache pre-population, which is
 unrealistic.

To measure performance against cold cache, you could always clear the cache using one of the available methods, see http://stackoverflow.com/questions/1756825/cpu-cache-flush. Probably std.benchmark could include a routine that does that. But performance on cold would actually be most unrealistic and uninformative, as loading the memory into cache will dominate the work that the algorithm is doing, so essentially the benchmark would evaluate the memory bandwidth against the working set of the algorithm. That may be occasionally useful, but I'd argue that most often the interest in benchmarking is to measure repeated application of a function, not occasional use of it.
 Memory locality is often the biggest contributing
 performance hazard in many algorithms, and usually the most
 unpredictable. I want to know about that in my measurements.
 Reproducibility is not important to me as accuracy. And I'd rather be
 conservative(/pessimistic) with the error.

 What guideline would you apply to estimate 'real-world' time spent when
 always working with hyper-optimistic measurements?

The purpose of std.benchmark is not to estimate real-world time. (That is the purpose of profiling.) Instead, benchmarking measures and provides a good proxy of that time for purposes of optimizing the algorithm. If work is done on improving the minimum time given by the benchmark framework, it is reasonable to expect that performance in-situ will also improve. Andrei
Sep 20 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/21/12 10:58 AM, Manu wrote:
 What I'm typically more interested in is profiling. I do occasionally
 need to do some benchmarking by your definition, so I'll find this
 useful, but should there then be another module to provide a 'profiling'
 API? Also worked into this API?

That's a good angle. Profiling is currently done by the -profile switch, and there are a couple of library functions associated with it. To my surprise, that documentation page has not been ported to the dlang.org style: http://digitalmars.com/ctg/trace.html I haven't yet thought whether std.benchmark should add more profiling-related primitives. I'd opine for releasing it without such for the time being. Thanks, Andrei
Sep 21 2012
parent Jacob Carlborg <doob me.com> writes:
On 2012-09-21 18:21, Andrei Alexandrescu wrote:

 That's a good angle. Profiling is currently done by the -profile switch,
 and there are a couple of library functions associated with it. To my
 surprise, that documentation page has not been ported to the dlang.org
 style: http://digitalmars.com/ctg/trace.html

 I haven't yet thought whether std.benchmark should add more
 profiling-related primitives. I'd opine for releasing it without such
 for the time being.

If you have an API that is fairly open and provides more of the raw results then one can build a more profiling like solution on top of that. This can later be used to create a specific profiling module if we choose to do so. -- /Jacob Carlborg
Sep 21 2012
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/20/12 3:01 PM, foobar wrote:
 On Thursday, 20 September 2012 at 12:35:15 UTC, Andrei Alexandrescu wrote:
 Let's use the minimum. It is understood it's not what you'll see in
 production, but it is an excellent proxy for indicative and
 reproducible performance numbers.


 Andrei

From the responses on the thread clearly there isn't a "best way".

I don't quite agree. This is a domain in which intuition is having a hard time, and at least some of the responses come from an intuitive standpoint, as opposed from hard data. For example, there's this opinion that taking the min, max, and average is the "fair" thing to do and the most informative. However, all noise in measuring timing is additive. Unless you talk about performance of entire large systems with networking, I/O, and the such, algorithms running in memory are inevitably spending time doing work, to which various sources of noise (system interrupts, clock quantization, benchmarking framework) just _add_ some time. Clearly these components do affect the visible duration of the algorithm, but if you want to improve it you need to remove the noise.
 There are different use-cases with different tradeoffs so why not allow
 the user to choose the policy best suited for their use-case?
 I'd suggest to provide a few reasonable common choices to choose from,
 as well as a way to provide a user defined calculation (function
 pointer/delegate?)

Reasonable choices are great, but in this case it's a bit difficult to figure what's reasonable. Andrei
Sep 20 2012
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/20/12 11:03 PM, Nick Sabalausky wrote:
 On Tue, 18 Sep 2012 18:02:10 -0400
 Andrei Alexandrescu<SeeWebsiteForEmail erdani.org>  wrote:

 On 9/18/12 5:07 PM, "Øivind" wrote:
 * For all tests, the best run is selected, but would it not be
 reasonable in some cases to get the average value? Maybe excluding
 the runs that are more than a couple std. deviations away from the
 mean value..

After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.

*Ahem*: http://zedshaw.com/essays/programmer_stats.html

I'm not sure I figure how this applies to the discussion at hand.
 Your claim that the minimum time is sufficient is...ummm...extremely
 unorthodox, to say the least.

What would be the orthodoxy? If orthodoxy is what google finds, it's good we're not orthodox.
 As such, you're going to need a far more
 convincing argument than "It worked well for me."

Sure. I have just detailed the choices made by std.benchmark in a couple of posts. At Facebook we measure using the minimum, and it's working for us. We've tried other approaches (such as taking the mode of the distribution). Turns out the minimum is better every time. Take a look at the early return in estimateTime(): https://github.com/facebook/folly/blob/master/folly/Benchmark.cpp#L136
 I assume I don't need to preach that "Extraordinary claims require
 extraordinary evidence". But, condensing benchmarks and statistics down
 to "take the minimum" and saying that's sufficient is one heck of an
 extraordinary claim. If you feel that you can provide sufficiently
 extraordinary justification, then please do.

My claim is unremarkable. All I'm saying is the minimum running time of an algorithm on a given input is a stable and indicative proxy for the behavior of the algorithm in general. So it's a good target for optimization. There might be some confusion that std.benchmark does profiling. That's not part of its charter.
 Otherwise, I think we'll need richer results. At the very least there
 should be an easy way to get at the raw results programmatically
 so we can run whatever stats/plots/visualizations/output-formats we
 want. I didn't see anything like that browsing through the docs, but
 it's possible I may have missed it.

Currently std.benchmark does not expose raw results for the sake of simplicity. It's easy to expose such, but I'd need a bit more convincing about their utility.
 That brings up another question too: I like the idea of a
 one-stop-benchmarking-shop, much like we have for unittests, but maybe
 reporting shouldn't be so tightly integrated and left more open for
 integration with a proper statistics lib and more generalized
 output abilities? But of course, that doesn't preclude having a nice
 built-in, but optional, default report. (Again though, maybe I'm
 overlooking something already in the module?)

That's pretty much what's happening. There's an API for collecting timings, and then there's an API for printing those with a default format.
 One other nitpick: My initial impression is that the
 "benchmark_relative_file read" stuff seems a bit kludgey (and
 confusing to visually parse). Is there maybe a better way to handle
 that? For example, inspired by getopt:

 printBenchmarks!(
      "file write", { std.file.write("/tmp/deleteme", "hello, world!"); },
      BenchmarkOption.relative,
      "file read", { std.file.read("/tmp/deleteme"); },
      "array creation", { new char[32]; })
      ();

The issue here is automating the benchmark of a module, which would require some naming convention anyway. Andrei
Sep 20 2012
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/21/12 5:46 AM, Tove wrote:
 Considering min will converge towards a stable value quite quickly...
 would it not be a reasonable default to auto detect when the min is
 stable with some degree of statistical certainty...?

I think that's a great idea! Andrei
Sep 21 2012
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-09-21 19:45, Johannes Pfau wrote:

 A perfect use case for user defined attributes ;-)

  benchmark void foo(){}
  benchmark("File read test") void foo(){}

Yes, we need user defined attributes and AST macros ASAP :) -- /Jacob Carlborg
Sep 21 2012
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/21/12 11:12 AM, Manu wrote:
 On 21 September 2012 07:45, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org <mailto:SeeWebsiteForEmail erdani.org>>
 wrote:

         As such, you're going to need a far more
         convincing argument than "It worked well for me."


     Sure. I have just detailed the choices made by std.benchmark in a
     couple of posts.

     At Facebook we measure using the minimum, and it's working for us.


 Facebook isn't exactly 'realtime' software. Obviously, faster is always
 better, but it's not in a situation where if you slip a sync point by
 1ms in an off case, it's all over. You can lose 1ms here, and make it up
 at a later time, and the result is the same. But again, this feeds back
 to your distinction between benchmarking and profiling.

You'd be surprised at how much we care about e.g. 90 percentile time to interaction.
         Otherwise, I think we'll need richer results. At the very least
         there
         should be an easy way to get at the raw results programmatically
         so we can run whatever
         stats/plots/visualizations/__output-formats we
         want. I didn't see anything like that browsing through the docs, but
         it's possible I may have missed it.


     Currently std.benchmark does not expose raw results for the sake of
     simplicity. It's easy to expose such, but I'd need a bit more
     convincing about their utility.


 Custom visualisation, realtime charting/plotting, user supplied reduce
 function?

Hrm, that sounds like an entire new project. Andrei
Sep 21 2012
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/19/12 4:12 AM, Thiez wrote:
 On Tuesday, 18 September 2012 at 22:01:30 UTC, Andrei Alexandrescu wrote:
 After extensive tests with a variety of aggregate functions, I can say
 firmly that taking the minimum time is by far the best when it comes
 to assessing the speed of a function.

What if one tries to benchmark a nondeterministic function? In such a case one might well be interested in the best run, worst run, and the average.

I agree. Currently std.benchmark is not geared for measuring non-deterministic functions. Andrei
Sep 21 2012
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/19/12 3:59 PM, Graham Fawcett wrote:
 For comparison's sake, the Criterion benchmarking package for Haskell is
 worth a look:

 http://www.serpentine.com/blog/2009/09/29/criterion-a-new-benchmarking-library-for-haskell/


 Criterion accounts for clock-call costs, displays various central
 tendencies, reports outliers (and their significance --- whether the
 variance is significantly affected by the outliers), etc., etc. It's a
 very well conceived benchmarking system, and might well be worth
 stealing from.

Will look into it, thanks. Andrei
Sep 21 2012
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/20/12 3:42 AM, Manu wrote:
 On 19 September 2012 12:38, Peter Alexander
 <peter.alexander.au gmail.com <mailto:peter.alexander.au gmail.com>> wrote:

         The fastest execution time is rarely useful to me, I'm almost
         always much
         more interested in the slowest execution time.
         In realtime software, the slowest time is often the only
         important factor,
         everything must be designed to tolerate this possibility.
         I can also imagine other situations where multiple workloads are
         competing
         for time, the average time may be more useful in that case.


     The problem with slowest is that you end up with the occasional OS
     hiccup or GC collection which throws the entire benchmark off. I see
     your point, but unless you can prevent the OS from interrupting, the
     time would be meaningless.


 So then we need to start getting tricky, and choose the slowest one that
 is not beyond an order of magnitude or so outside the average?

That's exactly where it all starts getting unprincipled. Just use the minimum. Just. Use. The. Minimum. Andrei
Sep 21 2012
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 21-Sep-12 22:49, David Piepgrass wrote:
 After extensive tests with a variety of aggregate functions, I can say
 firmly that taking the minimum time is by far the best when it comes
 to assessing the speed of a function.


 As far as I know, D doesn't offer a sampling profiler, so one might
 indeed use a benchmarking library as a (poor) substitute. So I'd want to
 be able to set up some benchmarks that operate on realistic data, with
 perhaps different data in different runs in order to learn about how the
 speed varies with different inputs (if it varies a lot then I might
 create more benchmarks to investigate which inputs are processed
 quickly, and which slowly.)

Real good profilers are the ones served by CPU vendor. See AMD's CodeAnalyst or Intel's VTune. They could even count number of branch predictions, cache misses etc. It is certainly out of the charter of module or for that matter any standard library code.
 Some random comments about std.benchmark based on its documentation:

 - It is very strange that the documentation of printBenchmarks uses
 neither of the words "average" or "minimum", and doesn't say how many
 trials are done.... I suppose the obvious interpretation is that it only
 does one trial, but then we wouldn't be having this discussion about
 averages and minimums right?

See the algorithm in action here: https://github.com/D-Programming-Language/phobos/pull/794/files#L2R381 In other word a function is run 10^n times with n is picked so that total time is big enough to be a trustworthy measurement. Then run-time is time/10^n. Øivind says tests are run 1000 times... The above 1000 times, picking the minimum as the best. Obviously it'd be good to be configurable. but
 it needs to be configurable per-test (my idea: support a _x1000 suffix
 in function names, or _for1000ms to run the test for at least 1000
 milliseconds; and allow a multiplier when when running a group of
 benchmarks, e.g. a multiplier argument of 0.5 means to only run half as
 many trials as usual.) Also, it is not clear from the documentation what
 the single parameter to each benchmark is (define "iterations count".)

 - The "benchmark_relative_" feature looks quite useful. I'm also happy
 to see benchmarkSuspend() and benchmarkResume(), though
 benchmarkSuspend() seems redundant in most cases: I'd like to just call
 one function, say, benchmarkStart() to indicate "setup complete, please
 start measuring time now."

 - I'm glad that StopWatch can auto-start; but the documentation should
 be clearer: does reset() stop the timer or just reset the time to zero?
 does stop() followed by start() start from zero or does it keep the time
 on the clock? I also think there should be a method that returns the
 value of peek() and restarts the timer at the same time (perhaps stop()
 and reset() should just return peek()?)

It's the same as the usual stopwatch (as in the real hardware thingy). Thus: - reset just resets numbers to zeros - stop just stops counting - start just starts counting - peek imitates taking a look at numbers on a device ;)
 - After reading the documentation of comparingBenchmark and measureTime,
 I have almost no idea what they do.

I think that comparingBenchmark was present in std.datetime and is carried over as is. -- Dmitry Olshansky
Sep 21 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/21/12 2:49 PM, David Piepgrass wrote:
 After extensive tests with a variety of aggregate functions, I can say
 firmly that taking the minimum time is by far the best when it comes
 to assessing the speed of a function.

Like others, I must also disagree in princple. The minimum sounds like a useful metric for functions that (1) do the same amount of work in every test and (2) are microbenchmarks, i.e. they measure a small and simple task.

That is correct.
 If the benchmark being measured either (1) varies the amount of
 work each time (e.g. according to some approximation of real-world
 input, which obviously may vary)* or (2) measures a large system, then
 the average and standard deviation and even a histogram may be useful
 (or perhaps some indicator whether the runtimes are consistent with a
 normal distribution or not). If the running-time is long then the max
 might be useful (because things like task-switching overhead probably do
 not contribute that much to the total).

 * I anticipate that you might respond "so, only test a single input per
 benchmark", but if I've got 1000 inputs that I want to try, I really
 don't want to write 1000 functions nor do I want 1000 lines of output
 from the benchmark. An average, standard deviation, min and max may be
 all I need, and if I need more detail, then I might break it up into 10
 groups of 100 inputs. In any case, the minimum runtime is not the
 desired output when the input varies.

I understand. What we currently do at Facebook is support benchmark functions with two parameters (see https://github.com/facebook/folly/blob/master/folly/docs/Benchmark.md). One is the number of iterations, the second is "problem size", akin to what you're discussing. I chose to not support that in this version of std.benchmark because it can be tackled later easily, but I probably need to add it now, sigh.
 It's a little surprising to hear "The purpose of std.benchmark is not to
 estimate real-world time. (That is the purpose of profiling)"...
 Firstly, of COURSE I would want to estimate real-world time with some of
 my benchmarks. For some benchmarks I just want to know which of two or
 three approaches is faster, or to get a coarse ball-park sense of
 performance, but for others I really want to know the wall-clock time
 used for realistic inputs.

I would contend that a benchmark without a baseline is very often misguided. I've seen tons and tons and TONS of nonsensical benchmarks lacking a baseline. "I created one million smart pointers, it took me only one millisecond!" Well how long did it take you to create one million dumb pointers? Choosing good baselines and committing to good comparisons instead of un-based absolutes is what makes the difference between a professional and a well-intended dilettante.
 Secondly, what D profiler actually helps you answer the question "where
 does the time go in the real-world?"? The D -profile switch creates an
 instrumented executable, which in my experience (admittedly not
 experience with DMD) severely distorts running times. I usually prefer
 sampling-based profiling, where the executable is left unchanged and a
 sampling program interrupts the program at random and grabs the call
 stack, to avoid the distortion effect of instrumentation. Of course,
 instrumentation is useful to find out what functions are called the most
 and whether call frequencies are in line with expectations, but I
 wouldn't trust the time measurements that much.

 As far as I know, D doesn't offer a sampling profiler, so one might
 indeed use a benchmarking library as a (poor) substitute. So I'd want to
 be able to set up some benchmarks that operate on realistic data, with
 perhaps different data in different runs in order to learn about how the
 speed varies with different inputs (if it varies a lot then I might
 create more benchmarks to investigate which inputs are processed
 quickly, and which slowly.)

I understand there's a good case to be made for profiling. If this turns out to be an acceptance condition for std.benchmark (which I think it shouldn't), I'll define one.
 Some random comments about std.benchmark based on its documentation:

 - It is very strange that the documentation of printBenchmarks uses
 neither of the words "average" or "minimum", and doesn't say how many
 trials are done....

Because all of those are irrelevant and confusing. We had an older framework at Facebook that reported those numbers, and they were utterly and completely meaningless. Besides the trials column contained numbers that were not even comparable. Everybody was happy when I removed them with today's simple and elegant numbers.
 I suppose the obvious interpretation is that it only
 does one trial, but then we wouldn't be having this discussion about
 averages and minimums right? Øivind says tests are run 1000 times... but
 it needs to be configurable per-test (my idea: support a _x1000 suffix
 in function names, or _for1000ms to run the test for at least 1000
 milliseconds; and allow a multiplier when when running a group of
 benchmarks, e.g. a multiplier argument of 0.5 means to only run half as
 many trials as usual.)

I don't think that's a good idea.
 Also, it is not clear from the documentation what
 the single parameter to each benchmark is (define "iterations count".)

The documentation could include that, but I don't want to overspecify.
 - The "benchmark_relative_" feature looks quite useful. I'm also happy
 to see benchmarkSuspend() and benchmarkResume(), though
 benchmarkSuspend() seems redundant in most cases: I'd like to just call
 one function, say, benchmarkStart() to indicate "setup complete, please
 start measuring time now."

Good point. I think this is a minor encumbrance, so it's good to keep generality.
 - I'm glad that StopWatch can auto-start; but the documentation should
 be clearer: does reset() stop the timer or just reset the time to zero?
 does stop() followed by start() start from zero or does it keep the time
 on the clock? I also think there should be a method that returns the
 value of peek() and restarts the timer at the same time (perhaps stop()
 and reset() should just return peek()?)

 - After reading the documentation of comparingBenchmark and measureTime,
 I have almost no idea what they do.

Yah, these are moved over from std.datetime. I'll need to make a couple more passes through the dox. Andrei
Sep 21 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/21/12 5:36 PM, David Piepgrass wrote:
 Some random comments about std.benchmark based on its
 documentation:

 - It is very strange that the documentation of printBenchmarks
 uses neither of the words "average" or "minimum", and doesn't say
 how many trials are done....

Because all of those are irrelevant and confusing.

Huh? It's not nearly as confusing as reading the documentation and not having the faintest idea what it will do. The way the benchmarker works is somehow 'irrelevant'? The documentation doesn't even indicate that the functions are to be run more than once!!

I misunderstood. I agree that it's a good thing to specify how benchmarking proceeds.
 I don't think that's a good idea.

I have never seen you make such vague arguments, Andrei.

I had expanded my point elsewhere. Your suggestion was:
 - It is very strange that the documentation of printBenchmarks uses
 neither of the words "average" or "minimum", and doesn't say how many
 trials are done.... I suppose the obvious interpretation is that it
 only does one trial, but then we wouldn't be having this discussion
 about averages and minimums right? Øivind says tests are run 1000
 times... but it needs to be configurable per-test (my idea: support a
 _x1000 suffix in function names, or _for1000ms to run the test for at
 least 1000 milliseconds; and allow a multiplier when when running a
 group of benchmarks, e.g. a multiplier argument of 0.5 means to only
 run half as many trials as usual.) Also, it is not clear from the
 documentation what the single parameter to each benchmark is (define
 "iterations count".)

I don't think it's a good idea because the "for 1000 ms" doesn't say anything except how good the clock resolution was on the system. I'm as strongly convinced we shouldn't print useless information as I am we should print useful information. Andrei
Sep 21 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/19/12 4:11 PM, "Øivind" wrote:
 New question for you :)

 To register benchmarks, the 'scheduleForBenchmarking' mixin inserts a
 shared static initializer into the module. If I have a module A and a
 module B, that both depend on eachother, than this will probably not
 work..? The runtime will detect the init cycle and fail with the
 following error:

 "Cycle detected between modules with ctors/dtors"

 Or am I wrong now?

I think you have discovered a major issue. Ideas on how to attack this? Andrei
Sep 21 2012
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 21-Sep-12 23:59, Andrei Alexandrescu wrote:
 On 9/19/12 4:11 PM, "Øivind" wrote:
 New question for you :)

 To register benchmarks, the 'scheduleForBenchmarking' mixin inserts a
 shared static initializer into the module. If I have a module A and a
 module B, that both depend on eachother, than this will probably not
 work..? The runtime will detect the init cycle and fail with the
 following error:

 "Cycle detected between modules with ctors/dtors"

 Or am I wrong now?

I think you have discovered a major issue. Ideas on how to attack this?

Make scheduleForBenchmarking to mixin in something else but not code - say global templated struct with certain name. Then it should be possible to do: benchmarkModules!(module1, module2, ...); That would search for this specific anchor at the top scope of modules and collect all info. I'm not sure we can pass module names as alias parameters but I think our meta-programming tricksters certainly did something along the these lines. -- Dmitry Olshansky
Sep 21 2012
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-09-21 21:59, Andrei Alexandrescu wrote:

 I think you have discovered a major issue. Ideas on how to attack this?

The standard way to solve this would be to move the initialization code from a static constructor to a function what will be called and do the initialization lazy. But since this several layers of string mixins what will make it more complicated. Can you give a short example of how the mixed in code will look like? -- /Jacob Carlborg
Sep 22 2012
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/22/12 8:28 AM, "Øivind" wrote:
 Is there a way to solve the dependency issue without forbidding static
 constructors in modules with cyclic dependencies?

I think an idea just occurred to me. The rules for static ctors and dtors were invented before "import" was allowed inside a scope. We could have taken advantage of that. Say we restrict symbol visibility inside static cdtors to ONLY symbols within the current module. If some static cdtor needs a symbol from a different module, it must import it explicitly (even if the current module already imports it). In this setup it should be possible to compute, in a fine-grained manner, the dependencies of static cdtors. Unfortunately that would be a breaking change. Andrei
Sep 22 2012
prev sibling next sibling parent reply =?UTF-8?B?IsOYaXZpbmQi?= <oivind.loe gmail.com> writes:
I think the std.benchmark is definitely a useful library 
addition, but in my mind it currently a bit too limited.

* All tests are run 1000 times. Depending on the length of the 
test to benchmark, this can be too much. In some cases it would 
be good to be able to trade the number of runs against accuracy.

* For all tests, the best run is selected, but would it not be 
reasonable in some cases to get the average value? Maybe 
excluding the runs that are more than a couple std. deviations 
away from the mean value..

* Is there a way of specifying a test name other than the 
function-name when using the 'mixin(scheduleForBenchmarking)' 
approach to register benchmarks?

* I would also like to be able (if possible) to register two 
mentioned things (number of runs and result strategy) with the 
mixin approach (or similar).

* It seems like the baseline for subtraction from subsequent test 
runs is taken from a call to the test function, passing 1 to it. 
Shouldn't 0 be passed for this value?

If these can be addressed, I would like it added to the library!
Sep 18 2012
parent Jens Mueller <jens.k.mueller gmx.de> writes:
Andrei Alexandrescu wrote:
 On 9/21/12 5:39 AM, Jacob Carlborg wrote:
On 2012-09-21 06:23, Andrei Alexandrescu wrote:

For a very simple reason: unless the algorithm under benchmark is very
long-running, max is completely useless, and it ruins average as well.

I may have completely misunderstood this but aren't we talking about what do include in the output of the benchmark? In that case, if you don't like max and average just don't look at it.

I disagree. I won't include something in my design just so people don't look at it most of the time. Min and average are most of the time an awful thing to include, and will throw off people with bizarre results. If it's there, it's worth looking at. Note how all columns are directly comparable (I might add, unlike other approaches to benchmarking).
For virtually all benchmarks I've run, the distribution of timings is a
half-Gaussian very concentrated around the minimum. Say you have a
minimum of e.g. 73 us. Then there would be a lot of results close to
that; the mode of the distribution would be very close, e.g. 75 us, and
the more measurements you take, the closer the mode is to the minimum.
Then you have a few timings up to e.g. 90 us. And finally you will
inevitably have a few outliers at some milliseconds. Those are orders of
magnitude larger than anything of interest and are caused by system
interrupts that happened to fall in the middle of the measurement.

Taking those into consideration and computing the average with those
outliers simply brings useless noise into the measurement process.

After your replay to one of Manu's post, I think I misunderstood the std.benchmark module. I was thinking more of profiling. But are these quite similar tasks, couldn't std.benchmark work for both?

This is an interesting idea. It would delay release quite a bit because I'd need to design and implement things like performance counters and such.

You mean like extending StopWatch and allowing the user to provide the measuring code, i.e. counting the number of instructions. This would be very useful. Is it possible to make sure that these changes can be introduced later without breaking the API? Jens
Sep 21 2012
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-09-17 23:13, Jens Mueller wrote:

 Post all feedback to this thread. Constructive feedback is very much
 appreciated.

 To conclude in more Andrei like words: Happy destruction!

* Why is "scheduleForBenchmarking" a string? Can't it be a template mixin? * What's the most appropriate way of just timing a block of code? Something like this: auto time = benchmark!({ /* some code */ })(1); If that's the case then I suggest setting a default value of "1" for the "n" parameter. * If I want to format the printed result differently, say in HTML, how would I do that? Should I use the "benchmark" function and iterate the BenchmarkResult array? * BTW why doesn't benchmark return the BenchmarkResult array? * Is this module so important to keep it as a top level module? I'm thinking something like a utility package or a time/date package. How about std.util.benchmark? -- /Jacob Carlborg
Sep 19 2012
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-09-19 09:58, Jonathan M Davis wrote:

 util is one of the worst package names ever, because it means basically
 nothing. Any function could go in there.

Well, the "util" package in Phobos is called "std".
 As for a time/date package, we already have std.datetime (which will hopefully
 be split into the package std.datetime at some point, but we need something
 like DIP 15 or 16 before we can do that), and we're moving the benchmarking
 _out_ of there. If std.datetime were already a package, then maybe putting it
 in there would make some sense, but benchmarking is arguably fundamentally
 different from what the rest of std.datetime does. I really so no problem with
 benchmarking being its own thing, and std.benchmark works just fine for that.

I just think we have too many top level modules. -- /Jacob Carlborg
Sep 19 2012
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/19/12 3:13 AM, Jacob Carlborg wrote:
 On 2012-09-17 23:13, Jens Mueller wrote:

 Post all feedback to this thread. Constructive feedback is very much
 appreciated.

 To conclude in more Andrei like words: Happy destruction!

* Why is "scheduleForBenchmarking" a string? Can't it be a template mixin?

Good point, I'll look into it.
 * What's the most appropriate way of just timing a block of code?
 Something like this:

 auto time = benchmark!({ /* some code */ })(1);

 If that's the case then I suggest setting a default value of "1" for the
 "n" parameter.

A default value of n would depend on the speed of the function and the granularity of the system's timer. That overload of benchmark is imperfect (rather rigid) but we must keep it for backwards compatibility.
 * If I want to format the printed result differently, say in HTML, how
 would I do that? Should I use the "benchmark" function and iterate the
 BenchmarkResult array?

That is correct. printBenchmarks() does not use any magic - it just looks at that same array.
 * BTW why doesn't benchmark return the BenchmarkResult array?

Will look into it.
 * Is this module so important to keep it as a top level module? I'm
 thinking something like a utility package or a time/date package. How
 about std.util.benchmark?

Not sure. Andrei
Sep 22 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, September 19, 2012 09:13:40 Jacob Carlborg wrote:
 * Is this module so important to keep it as a top level module? I'm
 thinking something like a utility package or a time/date package. How
 about std.util.benchmark?

util is one of the worst package names ever, because it means basically nothing. Any function could go in there. As for a time/date package, we already have std.datetime (which will hopefully be split into the package std.datetime at some point, but we need something like DIP 15 or 16 before we can do that), and we're moving the benchmarking _out_ of there. If std.datetime were already a package, then maybe putting it in there would make some sense, but benchmarking is arguably fundamentally different from what the rest of std.datetime does. I really so no problem with benchmarking being its own thing, and std.benchmark works just fine for that. - Jonathan M Davis
Sep 19 2012
prev sibling next sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
I don't see why `benchmark` takes (almost) all of its parameters 
as template parameters. It looks quite odd, seems unnecessary, 
and (if I'm not mistaken) makes certain use cases quite difficult.

For example, suppose I want to benchmark a function several times 
with different parameters and names, how would I do that?

foreach (i; 0..10)
{
     printBenchmark!( format("Test %d", i), { someFunc(i); } )();
}

This won't work because i isn't known at compile time, and for 
some use cases it can't be known at compile time.

I wouldn't mind if there was some real benefit to taking these as 
template arguments, but there doesn't seem to be any value at all 
-- it just limits usage.
Sep 19 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/19/12 4:06 AM, Peter Alexander wrote:
 I don't see why `benchmark` takes (almost) all of its parameters as
 template parameters. It looks quite odd, seems unnecessary, and (if I'm
 not mistaken) makes certain use cases quite difficult.

That is intentional - indirect calls would add undue overhead to the measurements. Andrei
Sep 21 2012
prev sibling next sibling parent "Thiez" <thiezz gmail.com> writes:
On Tuesday, 18 September 2012 at 22:01:30 UTC, Andrei 
Alexandrescu wrote:
 After extensive tests with a variety of aggregate functions, I 
 can say firmly that taking the minimum time is by far the best 
 when it comes to assessing the speed of a function.

What if one tries to benchmark a nondeterministic function? In such a case one might well be interested in the best run, worst run, and the average.
Sep 19 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
--bcaec54fb81aff83d204ca09cdc6
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On 19 September 2012 01:02, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 On 9/18/12 5:07 PM, "=C3=98ivind" wrote:

 * For all tests, the best run is selected, but would it not be

reasonable in some cases to get the average value? Maybe excluding the
 runs that are more than a couple std. deviations away from the mean
 value..

After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.

The fastest execution time is rarely useful to me, I'm almost always much more interested in the slowest execution time. In realtime software, the slowest time is often the only important factor, everything must be designed to tolerate this possibility. I can also imagine other situations where multiple workloads are competing for time, the average time may be more useful in that case. Side question: Running a test over and over pre-populates the cache with all associated data after the first cycle... The cache needs to be randomised between each cycle to get realistic results. --bcaec54fb81aff83d204ca09cdc6 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div class=3D"gmail_quote">On 19 September 2012 01:02, Andrei Alexandrescu = <span dir=3D"ltr">&lt;<a href=3D"mailto:SeeWebsiteForEmail erdani.org" targ= et=3D"_blank">SeeWebsiteForEmail erdani.org</a>&gt;</span> wrote:<br><block= quote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc= solid;padding-left:1ex"> <div class=3D"im">On 9/18/12 5:07 PM, &quot;=C3=98ivind&quot; wrote:<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex">* For all tests, the best run is selected, b= ut would it not be</blockquote></div><div class=3D"im"><blockquote class=3D= "gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding= -left:1ex"> reasonable in some cases to get the average value? Maybe excluding the<br> runs that are more than a couple std. deviations away from the mean value..= <br> </blockquote> <br></div> After extensive tests with a variety of aggregate functions, I can say firm= ly that taking the minimum time is by far the best when it comes to assessi= ng the speed of a function.</blockquote><div><br></div><div>The fastest exe= cution time is rarely useful to me, I&#39;m almost always much more interes= ted in the slowest execution time.</div> <div>In realtime software, the slowest time is often the only important fac= tor, everything must be designed to tolerate this possibility.</div><div>I = can also imagine other situations where multiple workloads are competing fo= r time, the average time may be more useful in that case.</div> <div><br></div><div>Side question:</div><div>Running a test over and over p= re-populates the cache with all associated data after the first cycle... Th= e cache needs to be randomised between each cycle to get realistic results.= </div> </div> --bcaec54fb81aff83d204ca09cdc6--
Sep 19 2012
prev sibling next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
 The fastest execution time is rarely useful to me, I'm almost 
 always much
 more interested in the slowest execution time.
 In realtime software, the slowest time is often the only 
 important factor,
 everything must be designed to tolerate this possibility.
 I can also imagine other situations where multiple workloads 
 are competing
 for time, the average time may be more useful in that case.

The problem with slowest is that you end up with the occasional OS hiccup or GC collection which throws the entire benchmark off. I see your point, but unless you can prevent the OS from interrupting, the time would be meaningless.
Sep 19 2012
prev sibling next sibling parent "Graham Fawcett" <fawcett uwindsor.ca> writes:
On Wednesday, 19 September 2012 at 08:28:36 UTC, Manu wrote:
 On 19 September 2012 01:02, Andrei Alexandrescu <
 SeeWebsiteForEmail erdani.org> wrote:

 On 9/18/12 5:07 PM, "Øivind" wrote:

 * For all tests, the best run is selected, but would it not be

reasonable in some cases to get the average value? Maybe excluding the
 runs that are more than a couple std. deviations away from 
 the mean
 value..

After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function.

The fastest execution time is rarely useful to me, I'm almost always much more interested in the slowest execution time. In realtime software, the slowest time is often the only important factor, everything must be designed to tolerate this possibility. I can also imagine other situations where multiple workloads are competing for time, the average time may be more useful in that case.

For comparison's sake, the Criterion benchmarking package for Haskell is worth a look: http://www.serpentine.com/blog/2009/09/29/criterion-a-new-benchmarking-library-for-haskell/ Criterion accounts for clock-call costs, displays various central tendencies, reports outliers (and their significance --- whether the variance is significantly affected by the outliers), etc., etc. It's a very well conceived benchmarking system, and might well be worth stealing from. Best, Graham
Sep 19 2012
prev sibling next sibling parent =?UTF-8?B?IsOYaXZpbmQi?= <oivind.loe gmail.com> writes:
New question for you :)

To register benchmarks, the 'scheduleForBenchmarking' mixin 
inserts a shared static initializer into the module. If I have a 
module A and a module B, that both depend on eachother, than this 
will probably not work..? The runtime will detect the init cycle 
and fail with the following error:

"Cycle detected between modules with ctors/dtors"

Or am I wrong now?
Sep 19 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
--047d7bd75162de584c04ca1d44a0
Content-Type: text/plain; charset=UTF-8

On 19 September 2012 12:38, Peter Alexander <peter.alexander.au gmail.com>wrote:

 The fastest execution time is rarely useful to me, I'm almost always much
 more interested in the slowest execution time.
 In realtime software, the slowest time is often the only important factor,
 everything must be designed to tolerate this possibility.
 I can also imagine other situations where multiple workloads are competing
 for time, the average time may be more useful in that case.

The problem with slowest is that you end up with the occasional OS hiccup or GC collection which throws the entire benchmark off. I see your point, but unless you can prevent the OS from interrupting, the time would be meaningless.

So then we need to start getting tricky, and choose the slowest one that is not beyond an order of magnitude or so outside the average? --047d7bd75162de584c04ca1d44a0 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div class=3D"gmail_quote">On 19 September 2012 12:38, Peter Alexander <spa= n dir=3D"ltr">&lt;<a href=3D"mailto:peter.alexander.au gmail.com" target=3D= "_blank">peter.alexander.au gmail.com</a>&gt;</span> wrote:<br><blockquote = class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid= ;padding-left:1ex"> <div class=3D"im"><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .= 8ex;border-left:1px #ccc solid;padding-left:1ex"> The fastest execution time is rarely useful to me, I&#39;m almost always mu= ch<br> more interested in the slowest execution time.<br> In realtime software, the slowest time is often the only important factor,<= br> everything must be designed to tolerate this possibility.<br> I can also imagine other situations where multiple workloads are competing<= br> for time, the average time may be more useful in that case.<br> </blockquote> <br></div> The problem with slowest is that you end up with the occasional OS hiccup o= r GC collection which throws the entire benchmark off. I see your point, bu= t unless you can prevent the OS from interrupting, the time would be meanin= gless.<br> </blockquote></div><br><div>So then we need to start getting tricky, and ch= oose the slowest one that is not beyond an order of magnitude or so outside= the average?</div> --047d7bd75162de584c04ca1d44a0--
Sep 20 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
--002354471a3cba464d04ca229d76
Content-Type: text/plain; charset=UTF-8

On 20 September 2012 15:36, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 On 9/20/12 2:42 AM, Manu wrote:

 On 19 September 2012 12:38, Peter Alexander
 <peter.alexander.au gmail.com
<mailto:peter.alexander.au **gmail.com<peter.alexander.au gmail.com>>>
 wrote:

         The fastest execution time is rarely useful to me, I'm almost
         always much
         more interested in the slowest execution time.
         In realtime software, the slowest time is often the only
         important factor,
         everything must be designed to tolerate this possibility.
         I can also imagine other situations where multiple workloads are
         competing
         for time, the average time may be more useful in that case.


     The problem with slowest is that you end up with the occasional OS
     hiccup or GC collection which throws the entire benchmark off. I see
     your point, but unless you can prevent the OS from interrupting, the
     time would be meaningless.


 So then we need to start getting tricky, and choose the slowest one that
 is not beyond an order of magnitude or so outside the average?

The "best way" according to some of the people who've advised my implementation of the framework at Facebook is to take the mode of the measurements distribution, i.e. the time at the maximum density. I implemented that (and it's not easy). It yielded numbers close to the minimum, but less stable and needing more iterations to become stable (when they do get indeed close to the minimum). Let's use the minimum. It is understood it's not what you'll see in production, but it is an excellent proxy for indicative and reproducible performance numbers.

If you do more than a single iteration, the minimum will virtually always be influenced by ideal cache pre-population, which is unrealistic. Memory locality is often the biggest contributing performance hazard in many algorithms, and usually the most unpredictable. I want to know about that in my measurements. Reproducibility is not important to me as accuracy. And I'd rather be conservative(/pessimistic) with the error. What guideline would you apply to estimate 'real-world' time spent when always working with hyper-optimistic measurements? --002354471a3cba464d04ca229d76 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div class=3D"gmail_quote">On 20 September 2012 15:36, Andrei Alexandrescu = <span dir=3D"ltr">&lt;<a href=3D"mailto:SeeWebsiteForEmail erdani.org" targ= et=3D"_blank">SeeWebsiteForEmail erdani.org</a>&gt;</span> wrote:<br><block= quote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc= solid;padding-left:1ex"> <div class=3D"im">On 9/20/12 2:42 AM, Manu wrote:<br> </div><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-l= eft:1px #ccc solid;padding-left:1ex"><div class=3D"im"> On 19 September 2012 12:38, Peter Alexander<br></div><div><div class=3D"h5"=

.alexander.au gmail.com</a> &lt;mailto:<a href=3D"mailto:peter.alexander.au= gmail.com" target=3D"_blank">peter.alexander.au <u></u>gmail.com</a>&gt;&g= t; wrote:<br> <br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 The fastest execution time is rarely useful to = me, I&#39;m almost<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 always much<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 more interested in the slowest execution time.<= br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 In realtime software, the slowest time is often= the only<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 important factor,<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 everything must be designed to tolerate this po= ssibility.<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 I can also imagine other situations where multi= ple workloads are<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 competing<br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 for time, the average time may be more useful i= n that case.<br> <br> <br> =C2=A0 =C2=A0 The problem with slowest is that you end up with the occasion= al OS<br> =C2=A0 =C2=A0 hiccup or GC collection which throws the entire benchmark off= . I see<br> =C2=A0 =C2=A0 your point, but unless you can prevent the OS from interrupti= ng, the<br> =C2=A0 =C2=A0 time would be meaningless.<br> <br> <br></div></div><div class=3D"im"> So then we need to start getting tricky, and choose the slowest one that<br=

</div></blockquote> <br> The &quot;best way&quot; according to some of the people who&#39;ve advised= my implementation of the framework at Facebook is to take the mode of the = measurements distribution, i.e. the time at the maximum density.<br> <br> I implemented that (and it&#39;s not easy). It yielded numbers close to the= minimum, but less stable and needing more iterations to become stable (whe= n they do get indeed close to the minimum).<br> <br> Let&#39;s use the minimum. It is understood it&#39;s not what you&#39;ll se= e in production, but it is an excellent proxy for indicative and reproducib= le performance numbers.</blockquote><div><br></div><div>If you do more than= a single iteration, the minimum will virtually always be influenced by ide= al cache pre-population, which is unrealistic. Memory locality is often the= biggest contributing performance hazard in many algorithms, and usually th= e most unpredictable. I want to know about that in my measurements.</div> <div>Reproducibility is not important to me as accuracy. And I&#39;d rather= be conservative(/pessimistic) with the error.</div><div><br></div><div>Wha= t guideline would you apply to estimate &#39;real-world&#39; time spent whe= n always working with hyper-optimistic measurements?</div> </div> --002354471a3cba464d04ca229d76--
Sep 20 2012
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Thursday, 20 September 2012 at 12:35:15 UTC, Andrei 
Alexandrescu wrote:
 On 9/20/12 2:42 AM, Manu wrote:
 On 19 September 2012 12:38, Peter Alexander
 <peter.alexander.au gmail.com 
 <mailto:peter.alexander.au gmail.com>> wrote:

        The fastest execution time is rarely useful to me, I'm 
 almost
        always much
        more interested in the slowest execution time.
        In realtime software, the slowest time is often the only
        important factor,
        everything must be designed to tolerate this 
 possibility.
        I can also imagine other situations where multiple 
 workloads are
        competing
        for time, the average time may be more useful in that 
 case.


    The problem with slowest is that you end up with the 
 occasional OS
    hiccup or GC collection which throws the entire benchmark 
 off. I see
    your point, but unless you can prevent the OS from 
 interrupting, the
    time would be meaningless.


 So then we need to start getting tricky, and choose the 
 slowest one that
 is not beyond an order of magnitude or so outside the average?

The "best way" according to some of the people who've advised my implementation of the framework at Facebook is to take the mode of the measurements distribution, i.e. the time at the maximum density. I implemented that (and it's not easy). It yielded numbers close to the minimum, but less stable and needing more iterations to become stable (when they do get indeed close to the minimum). Let's use the minimum. It is understood it's not what you'll see in production, but it is an excellent proxy for indicative and reproducible performance numbers. Andrei

From the responses on the thread clearly there isn't a "best way". There are different use-cases with different tradeoffs so why not allow the user to choose the policy best suited for their use-case? I'd suggest to provide a few reasonable common choices to choose from, as well as a way to provide a user defined calculation (function pointer/delegate?)
Sep 20 2012
prev sibling next sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On Tue, 18 Sep 2012 18:02:10 -0400
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 On 9/18/12 5:07 PM, "=D8ivind" wrote:
 * For all tests, the best run is selected, but would it not be
 reasonable in some cases to get the average value? Maybe excluding
 the runs that are more than a couple std. deviations away from the
 mean value..

After extensive tests with a variety of aggregate functions, I can say firmly that taking the minimum time is by far the best when it comes to assessing the speed of a function. =20

*Ahem*: http://zedshaw.com/essays/programmer_stats.html Your claim that the minimum time is sufficient is...ummm...extremely unorthodox, to say the least. As such, you're going to need a far more convincing argument than "It worked well for me." I assume I don't need to preach that "Extraordinary claims require extraordinary evidence". But, condensing benchmarks and statistics down to "take the minimum" and saying that's sufficient is one heck of an extraordinary claim. If you feel that you can provide sufficiently extraordinary justification, then please do. Otherwise, I think we'll need richer results. At the very least there should be an easy way to get at the raw results programmatically so we can run whatever stats/plots/visualizations/output-formats we want. I didn't see anything like that browsing through the docs, but it's possible I may have missed it. That brings up another question too: I like the idea of a one-stop-benchmarking-shop, much like we have for unittests, but maybe reporting shouldn't be so tightly integrated and left more open for integration with a proper statistics lib and more generalized output abilities? But of course, that doesn't preclude having a nice built-in, but optional, default report. (Again though, maybe I'm overlooking something already in the module?) One other nitpick: My initial impression is that the "benchmark_relative_file read" stuff seems a bit kludgey (and confusing to visually parse). Is there maybe a better way to handle that? For example, inspired by getopt: printBenchmarks!( "file write", { std.file.write("/tmp/deleteme", "hello, world!"); }, BenchmarkOption.relative, "file read", { std.file.read("/tmp/deleteme"); }, "array creation", { new char[32]; }) ();
Sep 20 2012
prev sibling next sibling parent "Tove" <tove fransson.se> writes:
On Friday, 21 September 2012 at 04:44:58 UTC, Andrei Alexandrescu 
wrote:
 Andrei Alexandrescu<SeeWebsiteForEmail erdani.org>  wrote:

time of an algorithm on a given input is a stable and indicative proxy for the behavior of the algorithm in general. So it's a good target for optimization.

I reached the same conclusion and use the same method at work. Considering min will converge towards a stable value quite quickly... would it not be a reasonable default to auto detect when the min is stable with some degree of statistical certainty...?
Sep 21 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
--bcaec51a7fe89cd80204ca3777cb
Content-Type: text/plain; charset=UTF-8

On 21 September 2012 07:17, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 On 9/20/12 10:05 AM, Manu wrote:

 Memory locality is often the biggest contributing

 unpredictable. I want to know about that in my measurements.
 Reproducibility is not important to me as accuracy. And I'd rather be
 conservative(/pessimistic) with the error.

 What guideline would you apply to estimate 'real-world' time spent when
 always working with hyper-optimistic measurements?

The purpose of std.benchmark is not to estimate real-world time. (That is the purpose of profiling.) Instead, benchmarking measures and provides a good proxy of that time for purposes of optimizing the algorithm. If work is done on improving the minimum time given by the benchmark framework, it is reasonable to expect that performance in-situ will also improve.

Okay, I can buy this distinction in terminology. What I'm typically more interested in is profiling. I do occasionally need to do some benchmarking by your definition, so I'll find this useful, but should there then be another module to provide a 'profiling' API? Also worked into this API? --bcaec51a7fe89cd80204ca3777cb Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div class=3D"gmail_quote">On 21 September 2012 07:17, Andrei Alexandrescu = <span dir=3D"ltr">&lt;<a href=3D"mailto:SeeWebsiteForEmail erdani.org" targ= et=3D"_blank">SeeWebsiteForEmail erdani.org</a>&gt;</span> wrote:<br><block= quote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc= solid;padding-left:1ex"> <div class=3D"im">On 9/20/12 10:05 AM, Manu wrote:<br> </div><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-l= eft:1px #ccc solid;padding-left:1ex"><div class=3D"im">Memory locality is o= ften the biggest contributing</div></blockquote><div class=3D"im"><blockquo= te class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc so= lid;padding-left:1ex"> performance hazard in many algorithms, and usually the most<br> unpredictable. I want to know about that in my measurements.<br> Reproducibility is not important to me as accuracy. And I&#39;d rather be<b= r> conservative(/pessimistic) with the error.<br> </blockquote> &gt;<br> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> What guideline would you apply to estimate &#39;real-world&#39; time spent = when<br> always working with hyper-optimistic measurements?<br> </blockquote> <br></div> The purpose of std.benchmark is not to estimate real-world time. (That is t= he purpose of profiling.) Instead, benchmarking measures and provides a goo= d proxy of that time for purposes of optimizing the algorithm. If work is d= one on improving the minimum time given by the benchmark framework, it is r= easonable to expect that performance in-situ will also improve.</blockquote=

div>What I&#39;m typically more interested in is profiling. I do occasional= ly need to do some benchmarking by your definition, so I&#39;ll find this u= seful, but should there then be another module to provide a &#39;profiling&= #39; API? Also worked into this API?</div> </div> --bcaec51a7fe89cd80204ca3777cb--
Sep 21 2012
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Fri, 21 Sep 2012 00:45:44 -0400
schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:
 
 The issue here is automating the benchmark of a module, which would 
 require some naming convention anyway.

A perfect use case for user defined attributes ;-) benchmark void foo(){} benchmark("File read test") void foo(){}
Sep 21 2012
prev sibling next sibling parent "David Piepgrass" <qwertie256 gmail.com> writes:
 After extensive tests with a variety of aggregate functions, I 
 can say firmly that taking the minimum time is by far the best 
 when it comes to assessing the speed of a function.

Like others, I must also disagree in princple. The minimum sounds like a useful metric for functions that (1) do the same amount of work in every test and (2) are microbenchmarks, i.e. they measure a small and simple task. If the benchmark being measured either (1) varies the amount of work each time (e.g. according to some approximation of real-world input, which obviously may vary)* or (2) measures a large system, then the average and standard deviation and even a histogram may be useful (or perhaps some indicator whether the runtimes are consistent with a normal distribution or not). If the running-time is long then the max might be useful (because things like task-switching overhead probably do not contribute that much to the total). * I anticipate that you might respond "so, only test a single input per benchmark", but if I've got 1000 inputs that I want to try, I really don't want to write 1000 functions nor do I want 1000 lines of output from the benchmark. An average, standard deviation, min and max may be all I need, and if I need more detail, then I might break it up into 10 groups of 100 inputs. In any case, the minimum runtime is not the desired output when the input varies. It's a little surprising to hear "The purpose of std.benchmark is not to estimate real-world time. (That is the purpose of profiling)"... Firstly, of COURSE I would want to estimate real-world time with some of my benchmarks. For some benchmarks I just want to know which of two or three approaches is faster, or to get a coarse ball-park sense of performance, but for others I really want to know the wall-clock time used for realistic inputs. Secondly, what D profiler actually helps you answer the question "where does the time go in the real-world?"? The D -profile switch creates an instrumented executable, which in my experience (admittedly not experience with DMD) severely distorts running times. I usually prefer sampling-based profiling, where the executable is left unchanged and a sampling program interrupts the program at random and grabs the call stack, to avoid the distortion effect of instrumentation. Of course, instrumentation is useful to find out what functions are called the most and whether call frequencies are in line with expectations, but I wouldn't trust the time measurements that much. As far as I know, D doesn't offer a sampling profiler, so one might indeed use a benchmarking library as a (poor) substitute. So I'd want to be able to set up some benchmarks that operate on realistic data, with perhaps different data in different runs in order to learn about how the speed varies with different inputs (if it varies a lot then I might create more benchmarks to investigate which inputs are processed quickly, and which slowly.) Some random comments about std.benchmark based on its documentation: - It is very strange that the documentation of printBenchmarks uses neither of the words "average" or "minimum", and doesn't say how many trials are done.... I suppose the obvious interpretation is that it only does one trial, but then we wouldn't be having this discussion about averages and minimums right? Øivind says tests are run 1000 times... but it needs to be configurable per-test (my idea: support a _x1000 suffix in function names, or _for1000ms to run the test for at least 1000 milliseconds; and allow a multiplier when when running a group of benchmarks, e.g. a multiplier argument of 0.5 means to only run half as many trials as usual.) Also, it is not clear from the documentation what the single parameter to each benchmark is (define "iterations count".) - The "benchmark_relative_" feature looks quite useful. I'm also happy to see benchmarkSuspend() and benchmarkResume(), though benchmarkSuspend() seems redundant in most cases: I'd like to just call one function, say, benchmarkStart() to indicate "setup complete, please start measuring time now." - I'm glad that StopWatch can auto-start; but the documentation should be clearer: does reset() stop the timer or just reset the time to zero? does stop() followed by start() start from zero or does it keep the time on the clock? I also think there should be a method that returns the value of peek() and restarts the timer at the same time (perhaps stop() and reset() should just return peek()?) - After reading the documentation of comparingBenchmark and measureTime, I have almost no idea what they do.
Sep 21 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
--20cf3079b898770ba904ca37936d
Content-Type: text/plain; charset=UTF-8

On 21 September 2012 07:30, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 I don't quite agree. This is a domain in which intuition is having a hard
 time, and at least some of the responses come from an intuitive standpoint,
 as opposed from hard data.

 For example, there's this opinion that taking the min, max, and average is
 the "fair" thing to do and the most informative.

I don't think this is a 'fair' claim, the situation is that different people are looking for different statistical information, and you can distinguish it with whatever terminology you prefer. You are only addressing a single use case; 'benchmarking', by your definition. I'm more frequently interested in profiling than 'benchmark'ing, and I think both are useful to have. The thing is, the distinction between 'benchmarking' and 'profiling' is effectively implemented via nothing more than the sampling algorithm; min vs avg, so is it sensible to expose the distinction in the API in this way? --20cf3079b898770ba904ca37936d Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div class=3D"gmail_quote">On 21 September 2012 07:30, Andrei Alexandrescu = <span dir=3D"ltr">&lt;<a href=3D"mailto:SeeWebsiteForEmail erdani.org" targ= et=3D"_blank">SeeWebsiteForEmail erdani.org</a>&gt;</span> wrote:<br><block= quote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc= solid;padding-left:1ex"> <div class=3D"im">I don&#39;t quite agree. This is a domain in which intuit= ion is having a hard time, and at least some of the responses come from an = intuitive standpoint, as opposed from hard data.</div> <br> For example, there&#39;s this opinion that taking the min, max, and average= is the &quot;fair&quot; thing to do and the most informative.</blockquote>= <div><br></div><div>I don&#39;t think this is a &#39;fair&#39; claim, the s= ituation is that different people are looking for different statistical inf= ormation, and you can distinguish it with whatever terminology you prefer. = You are only addressing a single use case; &#39;benchmarking&#39;, by your = definition. I&#39;m more frequently interested in profiling than &#39;bench= mark&#39;ing, and I think both are useful to have.</div> <div><br></div><div>The thing is, the distinction between &#39;benchmarking= &#39; and &#39;profiling&#39; is effectively implemented via nothing more t= han the sampling algorithm; min vs avg, so is it sensible to expose the dis= tinction in the API in this way?</div> </div> --20cf3079b898770ba904ca37936d--
Sep 21 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
--047d7b33d366bc034704ca37ac39
Content-Type: text/plain; charset=UTF-8

On 21 September 2012 07:45, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 As such, you're going to need a far more
 convincing argument than "It worked well for me."

Sure. I have just detailed the choices made by std.benchmark in a couple of posts. At Facebook we measure using the minimum, and it's working for us.

Facebook isn't exactly 'realtime' software. Obviously, faster is always better, but it's not in a situation where if you slip a sync point by 1ms in an off case, it's all over. You can lose 1ms here, and make it up at a later time, and the result is the same. But again, this feeds back to your distinction between benchmarking and profiling. Otherwise, I think we'll need richer results. At the very least there
 should be an easy way to get at the raw results programmatically
 so we can run whatever stats/plots/visualizations/**output-formats we
 want. I didn't see anything like that browsing through the docs, but
 it's possible I may have missed it.

Currently std.benchmark does not expose raw results for the sake of simplicity. It's easy to expose such, but I'd need a bit more convincing about their utility.

Custom visualisation, realtime charting/plotting, user supplied reduce function? --047d7b33d366bc034704ca37ac39 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div class=3D"gmail_quote">On 21 September 2012 07:45, Andrei Alexandrescu = <span dir=3D"ltr">&lt;<a href=3D"mailto:SeeWebsiteForEmail erdani.org" targ= et=3D"_blank">SeeWebsiteForEmail erdani.org</a>&gt;</span> wrote:<br><block= quote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc= solid;padding-left:1ex"> <div class=3D"im"><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .= 8ex;border-left:1px #ccc solid;padding-left:1ex">As such, you&#39;re going = to need a far more<br> convincing argument than &quot;It worked well for me.&quot;<br> </blockquote> <br></div> Sure. I have just detailed the choices made by std.benchmark in a couple of= posts.<br> <br> At Facebook we measure using the minimum, and it&#39;s working for us.</blo= ckquote><div><br></div><div>Facebook isn&#39;t exactly &#39;realtime&#39; s= oftware. Obviously, faster is always better, but it&#39;s not in a situatio= n where if you slip a sync point by 1ms in an off case, it&#39;s all over. = You can lose 1ms here, and make it up at a later time, and the result is th= e same. But again, this feeds back to your distinction between benchmarking= and profiling.</div> <div><br></div><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex= ;border-left:1px #ccc solid;padding-left:1ex"><div class=3D"im"> <blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p= x #ccc solid;padding-left:1ex"> Otherwise, I think we&#39;ll need richer results. At the very least there<b= r> should be an easy way to get at the raw results programmatically<br> so we can run whatever stats/plots/visualizations/<u></u>output-formats we<= br> want. I didn&#39;t see anything like that browsing through the docs, but<br=

</blockquote> <br></div> Currently std.benchmark does not expose raw results for the sake of simplic= ity. It&#39;s easy to expose such, but I&#39;d need a bit more convincing a= bout their utility.</blockquote><div><br></div><div>Custom visualisation, r= ealtime charting/plotting, user supplied reduce function?</div> </div> --047d7b33d366bc034704ca37ac39--
Sep 21 2012
prev sibling next sibling parent "jerro" <a a.com> writes:
 As far as I know, D doesn't offer a sampling profiler,

It is possible to use a sampling profiler on D executables though. I usually use perf on Linux and AMD CodeAnalyst on Windows.
Sep 21 2012
prev sibling next sibling parent Manu <turkeyman gmail.com> writes:
--bcaec54fb414f3797604ca37b3f5
Content-Type: text/plain; charset=UTF-8

On 21 September 2012 07:23, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 For a very simple reason: unless the algorithm under benchmark is very
 long-running, max is completely useless, and it ruins average as well.

This is only true for systems with a comprehensive pre-emptive OS running on the same core. Most embedded systems will only be affected by cache misses and bus contention, in that situation, max is perfectly acceptable. --bcaec54fb414f3797604ca37b3f5 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div class=3D"gmail_quote">On 21 September 2012 07:23, Andrei Alexandrescu = <span dir=3D"ltr">&lt;<a href=3D"mailto:SeeWebsiteForEmail erdani.org" targ= et=3D"_blank">SeeWebsiteForEmail erdani.org</a>&gt;</span> wrote:<br><block= quote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc= solid;padding-left:1ex"> <div class=3D"im">For a very simple reason: unless the algorithm under benc= hmark is very long-running, max is completely useless, and it ruins average= as well.</div> </blockquote></div><br><div>This is only true for systems with a comprehens= ive pre-emptive OS running on the same core. Most embedded systems will onl= y be affected by cache misses and bus contention, in that situation, max is= perfectly acceptable.</div> --bcaec54fb414f3797604ca37b3f5--
Sep 21 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Friday, September 21, 2012 17:58:05 Manu wrote:
 Okay, I can buy this distinction in terminology.
 What I'm typically more interested in is profiling. I do occasionally need
 to do some benchmarking by your definition, so I'll find this useful, but
 should there then be another module to provide a 'profiling' API? Also
 worked into this API?

dmd has the -profile flag. - Jonathan M Davis
Sep 21 2012
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
I'd throw in a request to address the following.

Suppose we have a function F and a set of inputs S that are supposedly 
different scenarios we optimize for.
What is interesting is to benchmark all of F(S[i]) as |S| separate 
functions greatly saving on boilerplate (and helping readability).

One way would to allow passing in an input range of ArgumentTuples to F.

Say as prefix:

void benchmark_f(int a, double b, string s){ ... }

enum benchmark_data_f = [ tuple(1, 2.0, "hi"), tuple(2, 3.0, "bye") ];

Then in the results it'd look as:
f(1, 2.0, "hi")                <ns/iter>   <iter/s>
f(2, 3.0, "bye")               <ns/iter>   <iter/s>

Using any input range is interestingly flexible e.g. :

enum benchmark_data_x = cortesianProduct(iota(1, 3), iota(1, 3));
//we should probably have it in std.range somewhere

void benchmark_x(int a, int b){ ... }

That being said I don't really get the benefit of passing iteration 
count to the function being benched. To allow it to do initialization 
step once then do resumeBenchmark() and run some inner loop n times?

-- 
Dmitry Olshansky
Sep 21 2012
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Friday, September 21, 2012 15:59:31 Andrei Alexandrescu wrote:
 On 9/19/12 4:11 PM, "Øivind" wrote:
 New question for you :)
 
 To register benchmarks, the 'scheduleForBenchmarking' mixin inserts a
 shared static initializer into the module. If I have a module A and a
 module B, that both depend on eachother, than this will probably not
 work..? The runtime will detect the init cycle and fail with the
 following error:
 
 "Cycle detected between modules with ctors/dtors"
 
 Or am I wrong now?

I think you have discovered a major issue. Ideas on how to attack this?

Some of us have been asking for ages for the ability to mark a static constructor as not depending on anything so that the runtime _doesn't_ think that there's a circular dependency, but Walter has been against the idea when it's been brought up. That would _really_ help here. Without redesigning std.benchmark so that it doesn't use static constructors, I don't know how you can fix that. Normally, if you really need a static constructor, you go through the pain of creating a separate module which does the initialization for you (like std.stdio does). But that won't work in this case, because you're mixing it in. So, unless you can redesign it so that std.benchmark doesn't require static constructors, it may have to be a limitation of std.benchmark that it can't be used where it would create a circular dependency. Unfortunately, the circular dependency issue makes static constructors almost useless outside of isolated cases, even though they rarely actually have circular dependencies. It's one of the few places in D that I'd say that there's a major design flaw. - Jonathan M Davis
Sep 21 2012
prev sibling next sibling parent "David Piepgrass" <qwertie256 gmail.com> writes:
 Some random comments about std.benchmark based on its 
 documentation:

 - It is very strange that the documentation of printBenchmarks 
 uses neither of the words "average" or "minimum", and doesn't 
 say how many trials are done....

Because all of those are irrelevant and confusing.

Huh? It's not nearly as confusing as reading the documentation and not having the faintest idea what it will do. The way the benchmarker works is somehow 'irrelevant'? The documentation doesn't even indicate that the functions are to be run more than once!!
 I don't think that's a good idea.

I have never seen you make such vague arguments, Andrei.
Sep 21 2012
prev sibling next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Friday, 21 September 2012 at 19:54:12 UTC, Andrei Alexandrescu 
wrote:
 On 9/19/12 4:06 AM, Peter Alexander wrote:
 I don't see why `benchmark` takes (almost) all of its 
 parameters as
 template parameters. It looks quite odd, seems unnecessary, 
 and (if I'm
 not mistaken) makes certain use cases quite difficult.

That is intentional - indirect calls would add undue overhead to the measurements.

I accept that it adds undue overhead. I just think that the function would be more usable with non-template parameters (as per my example). I also think the overhead would be negligible.
Sep 21 2012
prev sibling next sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On Fri, 21 Sep 2012 17:00:29 -0400
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 On 9/21/12 11:12 AM, Manu wrote:
 On 21 September 2012 07:45, Andrei Alexandrescu

     Currently std.benchmark does not expose raw results for the
 sake of simplicity. It's easy to expose such, but I'd need a bit
 ***more convincing about their utility***.


(Emphasis added for proper context.)
 Custom visualisation, realtime charting/plotting, user supplied
 reduce function?

Hrm, that sounds like an entire new project.

That doesn't diminish their utility. Keep in mind, nobody's suggesting putting all of that into std.benchmark (certainly not initially anyway), but the idea is to at least have the door open for them.
Sep 21 2012
prev sibling next sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
Stepping back for a moment, I think we're facing two key issues here: 

The first key issue is that the docs for std.benchmark don't adequately
explain Andre's intended charter/scope for it, it's methodology or the
rationale for its methodology. So people see "benchmark" and they think
"oh, ok, for timing stuff", but it appears to be intended as being
for very specific use-cases. I think this entire discussion serves as
evidence that, at the very least, it needs to communicate that
scope/methodology/rationale better that it currently does. If all of
us are having trouble "getting it", then others certainly will too.

Aside from that, there's the second key issue: whether the
current intended scope is sufficient. Should it be more general in
scope and not so specialized? Personally, I would tend to think do, and
I think that seems to the the popular notion. But I don't know for sure.
If it should be more generalized, than does it need to be so for the
first iteration, or can it be done later after being added to phobos?
That, I have no idea.
Sep 21 2012
prev sibling next sibling parent "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On 2012-21-09 22:58:36, Andrei Alexandrescu wrote:

 On 9/21/12 5:39 AM, Jacob Carlborg wrote:

 After your replay to one of Manu's post, I think I misunderstood the
 std.benchmark module. I was thinking more of profiling. But are these
 quite similar tasks, couldn't std.benchmark work for both?

This is an interesting idea. It would delay release quite a bit because I'd need to design and implement things like performance counters and such.

I certainly think the two use cases are similar enough to warrant their inclusion in a common module. That does not preclude std.benchmark being included as is now, and extended with profiling features at some later point. -- Simen
Sep 22 2012
prev sibling next sibling parent =?UTF-8?B?IsOYaXZpbmQi?= <oivind.loe gmail.com> writes:
On Friday, 21 September 2012 at 20:43:13 UTC, Jonathan M Davis 
wrote:
 On Friday, September 21, 2012 15:59:31 Andrei Alexandrescu 
 wrote:
 On 9/19/12 4:11 PM, "Øivind" wrote:
 New question for you :)
 
 To register benchmarks, the 'scheduleForBenchmarking' mixin 
 inserts a
 shared static initializer into the module. If I have a 
 module A and a
 module B, that both depend on eachother, than this will 
 probably not
 work..? The runtime will detect the init cycle and fail with 
 the
 following error:
 
 "Cycle detected between modules with ctors/dtors"
 
 Or am I wrong now?

I think you have discovered a major issue. Ideas on how to attack this?

Some of us have been asking for ages for the ability to mark a static constructor as not depending on anything so that the runtime _doesn't_ think that there's a circular dependency, but Walter has been against the idea when it's been brought up. That would _really_ help here. Without redesigning std.benchmark so that it doesn't use static constructors, I don't know how you can fix that. Normally, if you really need a static constructor, you go through the pain of creating a separate module which does the initialization for you (like std.stdio does). But that won't work in this case, because you're mixing it in. So, unless you can redesign it so that std.benchmark doesn't require static constructors, it may have to be a limitation of std.benchmark that it can't be used where it would create a circular dependency. Unfortunately, the circular dependency issue makes static constructors almost useless outside of isolated cases, even though they rarely actually have circular dependencies. It's one of the few places in D that I'd say that there's a major design flaw. - Jonathan M Davis

Is there a way to solve the dependency issue without forbidding static constructors in modules with cyclic dependencies? I.e. for two modules with static init referencing eachother, an analysis of the static init code could be used to detect circular dependencies instead of the import statements?
Sep 22 2012
prev sibling next sibling parent =?UTF-8?B?IsOYaXZpbmQi?= <oivind.loe gmail.com> writes:
On Saturday, 22 September 2012 at 13:03:06 UTC, Andrei 
Alexandrescu wrote:
 On 9/22/12 8:28 AM, "Øivind" wrote:
 Is there a way to solve the dependency issue without 
 forbidding static
 constructors in modules with cyclic dependencies?

I think an idea just occurred to me. The rules for static ctors and dtors were invented before "import" was allowed inside a scope. We could have taken advantage of that. Say we restrict symbol visibility inside static cdtors to ONLY symbols within the current module. If some static cdtor needs a symbol from a different module, it must import it explicitly (even if the current module already imports it). In this setup it should be possible to compute, in a fine-grained manner, the dependencies of static cdtors. Unfortunately that would be a breaking change. Andrei

It gets a bit ugly maybe, but we could do a mix of the proposals that have come before and this one, e.g. add a nocycliccheck (or similar) to the static constructor, and in that case only allow access to current module and those imorted inside the ctor scope..
Sep 22 2012
prev sibling next sibling parent =?UTF-8?B?IsOYaXZpbmQi?= <oivind.loe gmail.com> writes:
On Saturday, 22 September 2012 at 13:25:47 UTC, Øivind wrote:
 On Saturday, 22 September 2012 at 13:03:06 UTC, Andrei 
 Alexandrescu wrote:
 On 9/22/12 8:28 AM, "Øivind" wrote:
 Is there a way to solve the dependency issue without 
 forbidding static
 constructors in modules with cyclic dependencies?

I think an idea just occurred to me. The rules for static ctors and dtors were invented before "import" was allowed inside a scope. We could have taken advantage of that. Say we restrict symbol visibility inside static cdtors to ONLY symbols within the current module. If some static cdtor needs a symbol from a different module, it must import it explicitly (even if the current module already imports it). In this setup it should be possible to compute, in a fine-grained manner, the dependencies of static cdtors. Unfortunately that would be a breaking change. Andrei

It gets a bit ugly maybe, but we could do a mix of the proposals that have come before and this one, e.g. add a nocycliccheck (or similar) to the static constructor, and in that case only allow access to current module and those imorted inside the ctor scope..

We would probably not call it ' nycycliccheck' since you propose to still do these checks, but only local imports :) Would need another name for it.
Sep 22 2012
prev sibling next sibling parent "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On 2012-09-22, 15:28, =C3=98ivind wrote:

 It gets a bit ugly maybe, but we could do a mix of the proposals that=


 have come before and this one, e.g. add a  nocycliccheck (or similar)=


 to the static constructor,  and in that case only allow access to  =


 current module and those imorted inside the ctor scope..

We would probably not call it ' nycycliccheck' since you propose to =

 still do these checks, but only local imports :) Would need another na=

 for it.

noglobalimports. :p -- = Simen
Sep 22 2012
prev sibling next sibling parent "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On 2012-09-22, 15:04, Andrei Alexandrescu wrote:

 On 9/22/12 8:28 AM, "=C3=98ivind" wrote:
 Is there a way to solve the dependency issue without forbidding stati=


 constructors in modules with cyclic dependencies?

I think an idea just occurred to me. The rules for static ctors and =

 dtors were invented before "import" was allowed inside a scope. We cou=

 have taken advantage of that.

 Say we restrict symbol visibility inside static cdtors to ONLY symbols=

 within the current module. If some static cdtor needs a symbol from a =

 different module, it must import it explicitly (even if the current  =

 module already imports it).

 In this setup it should be possible to compute, in a fine-grained  =

 manner, the dependencies of static cdtors.

 Unfortunately that would be a breaking change.

That *is* neat. I guess putting it on the deprecation path could work. This is a change we'd really, *really* like to see. -- = Simen
Sep 22 2012
prev sibling next sibling parent "David Piepgrass" <qwertie256 gmail.com> writes:
 - It is very strange that the documentation of printBenchmarks 
 uses
 neither of the words "average" or "minimum", and doesn't say 
 how many
 trials are done.... I suppose the obvious interpretation is 
 that it
 only does one trial, but then we wouldn't be having this 
 discussion
 about averages and minimums right? Øivind says tests are run 
 1000
 times... but it needs to be configurable per-test (my idea: 
 support a
 _x1000 suffix in function names, or _for1000ms to run the test 
 for at
 least 1000 milliseconds; and allow a multiplier when when 
 running a
 group of benchmarks, e.g. a multiplier argument of 0.5 means 
 to only
 run half as many trials as usual.) Also, it is not clear from 
 the
 documentation what the single parameter to each benchmark is 
 (define
 "iterations count".)

I don't think it's a good idea because the "for 1000 ms" doesn't say anything except how good the clock resolution was on the system. I'm as strongly convinced we shouldn't print useless information as I am we should print useful information.

I am puzzled about what you think my suggestion meant. I am suggesting allowing the user to configure how long benchmarking takes. Some users might want to run their benchmark for an hour to get stable and reliable numbers; others don't want to wait and want to see results ASAP. Perhaps the *same* user will want to run benchmarks quickly while developing them and then do a "final run" with more trials once their benchmark suite is complete. Also, some individual benchmark functions will take microseconds to complete; others may take seconds to complete. All I'm suggesting are simple ways to avoid wasting users' time, without making std.benchmark overly complicated.
Sep 22 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, September 22, 2012 09:04:09 Andrei Alexandrescu wrote:
 On 9/22/12 8:28 AM, "=C3=98ivind" wrote:
 Is there a way to solve the dependency issue without forbidding sta=


 constructors in modules with cyclic dependencies?

I think an idea just occurred to me. The rules for static ctors and dtors were invented before "import" was allowed inside a scope. We co=

 have taken advantage of that.
=20
 Say we restrict symbol visibility inside static cdtors to ONLY symbol=

 within the current module. If some static cdtor needs a symbol from a=

 different module, it must import it explicitly (even if the current
 module already imports it).
=20
 In this setup it should be possible to compute, in a fine-grained
 manner, the dependencies of static cdtors.
=20
 Unfortunately that would be a breaking change.

It's a nice thought, but it wouldn't work. If nothing else .di files co= mpletely=20 ruin it. 1. I don't think that it's actually required that static constructors b= e in a=20 .di file. So, the compiler couldn't know for sure whether the modules b= eing=20 imported had static constructors. 2. Even if all static constructors had to be in a .di file, local impor= ts ruin=20 it, because the function bodies (which can definitely be elided from .d= i files)=20 could contain local imports to modules which have static constructors a= nd=20 cause circular dependencies. - Jonathan M Davis
Sep 22 2012
prev sibling parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 22/09/12 07:10, Nick Sabalausky wrote:
 I think this entire discussion serves as evidence that, at the very
 least, it needs to communicate that scope/methodology/rationale better
 that it currently does. If all of us are having trouble "getting it",
 then others certainly will too.

My feeling is that even with a good explanation in the docs, you're _still_ going to have a regular stream of people showing up on the mailing lists going, "Hey, why can't I get my preferred metric with std.benchmark??!!" So, even if there's good reason to think their preferences are daft, it might be worth supporting what they want to do, just to avoid that continuous stream of requests.
 Aside from that, there's the second key issue: whether the
 current intended scope is sufficient. Should it be more general in
 scope and not so specialized? Personally, I would tend to think do, and
 I think that seems to the the popular notion. But I don't know for sure.
 If it should be more generalized, than does it need to be so for the
 first iteration, or can it be done later after being added to phobos?
 That, I have no idea.

This is what I was wondering, whether it's possible to take the current functionality but leave the door open to extending it with a different choice of metrics. Under the extended version, the default metric would be as it is currently, the docs would explain why this default makes sense and the caveats related to other metrics, but ultimately if the user wanted to use them, they'd be available. But I'd only like to see that happening if there is a "natural" path to the extended version rather than breaking changes or significant rewrites.
Sep 23 2012