www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Why is D slower than LuaJIT?

reply Andreas Mayer <spam bacon.eggs> writes:
To see what performance advantage D would give me over using a scripting
language, I made a small benchmark. It consists of this code:

    auto L = iota(0.0, 10000000.0);
    auto L2 = map!"a / 2"(L);
    auto L3 = map!"a + 2"(L2);
    auto V = reduce!"a + b"(L3);

It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do?
Dec 22 2010
next sibling parent reply Trass3r <un known.com> writes:
 Or is D, unlike I thought, not suitable for high performance computing?  
 What should I do?

LuaJIT seems to have a really good backend. You better compare with ldc or gdc.
Dec 22 2010
parent Andreas Mayer <spam bacon.eggs> writes:
Trass3r Wrote:

 LuaJIT seems to have a really good backend.
 You better compare with ldc or gdc.

Maybe someone could do that for me? I don't have ldc or gdc here. There are some Debian packages, but they are D version 1 only?
Dec 22 2010
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 22 Dec 2010 17:04:21 -0500, Andreas Mayer <spam bacon.eggs> wrote:

 To see what performance advantage D would give me over using a scripting  
 language, I made a small benchmark. It consists of this code:

    auto L = iota(0.0, 10000000.0);
    auto L2 = map!"a / 2"(L);
    auto L3 = map!"a + 2"(L2);
    auto V = reduce!"a + b"(L3);

It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do?

Without any imperical testing, I would guess this has something to do with the lack of inlining for algorithmic functions. This is due primarily to uses of enforce, which use lazy parameters, which are currently not inlinable (also, ensure you use -O -release -inline for the most optimized code). I hope that someday this is solved, because it doesn't look very good for D... -Steve
Dec 22 2010
next sibling parent BLS <windevguy hotmail.de> writes:
On 22/12/2010 23:06, Steven Schveighoffer wrote:
 (also, ensure you use -O -release -inline for the most optimized code).

quote... // D version, with std.algorithm // ~ 281 ms, using dmd 2.051 (dmd -O -release -inline) Sometimes having a look at the source helps :)
Dec 22 2010
prev sibling next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
Steven Schveighoffer wrote:
  I would guess this has something to do with
 the lack of inlining for algorithmic functions.

Yeah, this is almost certainly the problem. I rewrote the code using a traditional C style loop, no external functions, and I'm getting roughly equal performance.
Dec 22 2010
parent reply loser <donottalk to.me> writes:
Adam D. Ruppe Wrote:

 Steven Schveighoffer wrote:
  I would guess this has something to do with
 the lack of inlining for algorithmic functions.

Yeah, this is almost certainly the problem. I rewrote the code using a traditional C style loop, no external functions, and I'm getting roughly equal performance.

So is it justified enough to throw my W's incompetence card on the table at this point? How else it is possible that a simple scripting language with simple JIT optimization heuristics can outperform a performance oriented systems programming language. It seems most D design decisions are based on the perceived performance value (not as aggressively as in C++ groups). I'd like to see how this theory doesn't hold water now?
Dec 22 2010
parent bearophile <bearophileHUGS lycos.com> writes:
loser:

 So is it justified enough to throw my W's incompetence card on the table at
this point? How else it is possible that a simple scripting language with
simple JIT optimization heuristics can outperform a performance oriented
systems programming language.

It's not wise to prematurely improve the inlining a lot right now when there is no 64 bit version yet, and there are holes or missing parts in several corners of the language. Performance tuning has a lower priority. Designing a good language and performance-tuning its implementation ask for different skills. The very good author of Lua-JIT is probably not good at designing a C++-class language :-) What's needed now is to smooth the rough corners of the D language, not to squeeze out every bit of performance. Bye, bearophile
Dec 22 2010
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
Andrej Mitrovic wrote:
 I have just tried removing enforce usage from Phobos and recompiling
 the library, and compiling again with -O -release -inline. It doesn't
 appear to make a difference in the timing speed over multiple runs.

Try looking at the obj2asm dump of the inner loop.
Dec 22 2010
prev sibling next sibling parent reply Gary Whatmore <no spam.sp> writes:
Andreas Mayer Wrote:

 To see what performance advantage D would give me over using a scripting
language, I made a small benchmark. It consists of this code:
 
    auto L = iota(0.0, 10000000.0);
    auto L2 = map!"a / 2"(L);
    auto L3 = map!"a + 2"(L2);
    auto V = reduce!"a + b"(L3);


First note: this is a synthetic toy benchmark. Take it with a grain of salt. It represent in no way the true state of D.
 
 It runs in 281 ms on my computer.
 
 The same code in Lua (using LuaJIT) runs in 23 ms.

Your mp3 player or file system was doing stuff while executing the benchmark. You probably don't know how to run the test many times and use the average/minimum result for both languages. For example D does not have JIT startup cost so take the minimum result for D, JIT has varying startup speed so take the average or slowest result for Luajit. Compare these. More fair for native code D.
 That's about 10 times faster. I would have expected D to be faster. Did I do
something wrong?
 
 The first Lua version uses a simplified design. I thought maybe that is unfair
to ranges, which are more complicated. You could argue ranges have more
features and do more work. To make it fair, I made a second Lua version of the
above benchmark that emulates ranges. It is still 29 ms fast.
 
 The full D version is here: http://pastebin.com/R5AGHyPx
 The Lua version: http://pastebin.com/Sa7rp6uz
 Lua version that emulates ranges: http://pastebin.com/eAKMSWyr
 
 Could someone help me solving this mystery?

My guesses are: 1) you didn't even test this and didn't use optimizations. -> User error 2) whenever doing benchmarks you must compare the competing stuff against all D compilers, cut and paste the object code of different compilers and manually build the fastest executable. 3) you didn't use inline assembler or profiler for D 4) you were using unstable Phobos functions. There is no doubt the final Phobos 2.0 will beat Luajit. D *is* a compiler statical language, Luajit just a joke. 5) you were using old d runtime garbage collector. One fellow here made a precise state of the art GC which beats even Java's 20 year old GC and C#. Patch your dmd to use this instead. Not intending to start a religious war but if your native code runs slower than *JIT* code, you're doing something wrong. D will always beat JIT. Lua is also a joke language, D is for high performance servers and operating systems. In the worst case, disassemble the luajit program, steal its codes and write it using inline assembler in D. D must win these performance battles. It's technically superior.
 
 Or is D, unlike I thought, not suitable for high performance computing? What
should I do?

It is. D a) takes time to mature and b) you didn't fully utilize the compiler.
Dec 22 2010
next sibling parent BLS <windevguy hotmail.de> writes:
On 22/12/2010 23:31, Gary Whatmore wrote:
 Not intending to start a religious war but if your native code runs slower
than*JIT*  code, you're doing something wrong. D will always beat JIT.

You talk like a prayer, don't you ? No need to measure. I believe ... Anyway I don't care about LUA. The dynamic ASM generation on the other hand is something which should be seriously be considered for D templates. Why ? What comes immediately in mind..no more Bloated executables. my 2 euro cents
Dec 22 2010
prev sibling next sibling parent Ary Borenszweig <ary esperanto.org.ar> writes:
Lua is a proven, robust language

Lua has been used in many industrial applications (e.g., Adobe's
Photoshop Lightroom), with an emphasis on embedded systems (e.g., the
Ginga middleware for digital TV in Brazil) and games (e.g., World of
Warcraft). Lua is currently the leading scripting language in games.
Lua has a solid reference manual and there are several books about it.
Several versions of Lua have been released and used in real
applications since its creation in 1993. Lua featured in HOPL III, the
Third ACM SIGPLAN History of Programming Languages Conference, in June
2007.
Dec 22 2010
prev sibling next sibling parent Andreas Mayer <spam bacon.eggs> writes:
Gary Whatmore Wrote:

 Andreas Mayer Wrote:
 
 To see what performance advantage D would give me over using a scripting
language, I made a small benchmark. It consists of this code:
 
    auto L = iota(0.0, 10000000.0);
    auto L2 = map!"a / 2"(L);
    auto L3 = map!"a + 2"(L2);
    auto V = reduce!"a + b"(L3);


First note: this is a synthetic toy benchmark. Take it with a grain of salt. It represent in no way the true state of D.

True enough. Yet it doesn't make me very optimistic what the final performance tradeoff would be as long as you use high level abstractions. Sure, with D you can always go on the C or even assembly level.
 Your mp3 player or file system was doing stuff while executing the benchmark.
You probably don't know how to run the test many times and use the
average/minimum result for both languages. For example D does not have JIT
startup cost so take the minimum result for D, JIT has varying startup speed so
take the average or slowest result for Luajit. Compare these. More fair for
native code D.

Both benchmarks were run under the same conditions. Once the executables were inside the disk cache, the run times didn't vary much. Plus this benchmark already is unfair against LuaJIT: the startup time and the time needed for optimization and code generation are included in the times I gave. The D example on the other hand doesn't include the time needed for compilation. The D compiler needs 360 ms to compile this example. If the comparison were fair and included compilation time in the D timings, D would lose even more.
 My guesses are:
 
 1) you didn't even test this and didn't use optimizations. -> User error

I enabled all dmd optimizations I was aware of. Maybe I forgot some?
 2) whenever doing benchmarks you must compare the competing stuff against all
D compilers, cut and paste the object code of different compilers and manually
build the fastest executable.

That seems like an unreasonable task. Writing the code in assembler would be simpler. But I'm using a high level language because I want to use high level abstractions. Like map and reduce, instead of writing assembler.
 3) you didn't use inline assembler or profiler for D

See 2).
 4) you were using unstable Phobos functions. There is no doubt the final
Phobos 2.0 will beat Luajit. D *is* a compiler statical language, Luajit just a
joke.

I used the latest dmd release (and that is very new). As you can see, LuaJIT beats D by far. I wouldn't call it a joke. If a joke beats D, then what is D? This way of argumentation doesn't sound very advantageous for you.
 5) you were using old d runtime garbage collector. One fellow here made a
precise state of the art GC which beats even Java's 20 year old GC and C#.
Patch your dmd to use this instead.

There shouldn't be any GC activity. Ranges work lazily. They don't allocate arrays for the data they are working on. You can post a package with "bleeding edge" dmd and Phobos sources with updated GC and so on. Then I could try that.
 
 Not intending to start a religious war but if your native code runs slower
than *JIT* code, you're doing something wrong. D will always beat JIT. Lua is
also a joke language, D is for high performance servers and operating systems.
In the worst case, disassemble the luajit program, steal its codes and write it
using inline assembler in D. D must win these performance battles. It's
technically superior.

But D didn't win. Not here. And what was I doing wrong? Please point out. I posted this because I was surprised myself and I thought "that can't be".
Dec 22 2010
prev sibling parent Eric Poggel <dnewsgroup2 yage3d.net> writes:
On 12/22/2010 5:31 PM, Gary Whatmore wrote:
 5) you were using old d runtime garbage collector. One fellow here made a
precise state of the art GC which beats even Java's 20 year old GC and C#.
Patch your dmd to use this instead.

Could you point me to more information? This sounds interesting.
Dec 22 2010
prev sibling next sibling parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from Andreas Mayer (spam bacon.eggs)'s article
 To see what performance advantage D would give me over using a scripting

One corner case doesn't amount to an end-all be-all proof to anything.
    auto L = iota(0.0, 10000000.0);
    auto L2 = map!"a / 2"(L);
    auto L3 = map!"a + 2"(L2);
    auto V = reduce!"a + b"(L3);

The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do

 The first Lua version uses a simplified design. I thought maybe that is unfair

and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. As has been already echoed, the lack of inlining algorithmic functions may be one reason for the added cost to runtime. Another may be simply that there is a lot more going on behind the scenes than what you give credit for in D. Regards :~)
Dec 22 2010
parent Andreas Mayer <spam bacon.eggs> writes:
Iain Buclaw Wrote:

 Another may be simply that there is a lot
 more going on behind the scenes than what you give credit for in D.

What else does it do? I want to add it to the Lua version.
Dec 22 2010
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On Wed, 22 Dec 2010 17:04:21 -0500
Andreas Mayer <spam bacon.eggs> wrote:

 To see what performance advantage D would give me over using a scripting =

=20
    auto L =3D iota(0.0, 10000000.0);
    auto L2 =3D map!"a / 2"(L);
    auto L3 =3D map!"a + 2"(L2);
    auto V =3D reduce!"a + b"(L3);

It runs in 281 ms on my computer. =20 The same code in Lua (using LuaJIT) runs in 23 ms. =20 That's about 10 times faster. I would have expected D to be faster. Did I=

=20
 The first Lua version uses a simplified design. I thought maybe that is u=

re features and do more work. To make it fair, I made a second Lua version = of the above benchmark that emulates ranges. It is still 29 ms fast.
=20
 The full D version is here: http://pastebin.com/R5AGHyPx
 The Lua version: http://pastebin.com/Sa7rp6uz
 Lua version that emulates ranges: http://pastebin.com/eAKMSWyr
=20
 Could someone help me solving this mystery?
=20
 Or is D, unlike I thought, not suitable for high performance computing? W=

Dunno why D seems slow. But Lua is a very fast dynamic language, very simpl= e & rather low level (both on language & implementation sides). Benchmark t= rials in Lua often run much faster than python or ruby equivalents (often 1= 0 X). Depending on the domain, LuaJIT often adds a speed factor of an order= of magnitude. This alltogether brings comparable performance to some compi= led languages using high-level features such as virtual funcs, GC,... highe= r-order funcs, ranges ;-) denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Dec 22 2010
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Andreas Mayer wrote:
 Or is D, unlike I thought, not suitable for high performance computing? What
 should I do?

I notice you are using doubles in D. dmd currently uses the x87 to evaluate doubles, and on some processors the x87 is slow relative to using the XMM instructions. Also, dmd's back end doesn't align the doubles on 16 byte boundaries, which can also slow down the floating point on some processors. Both of these code gen issues with dmd are well known, and I'd like to solve them after we address higher priority issues. If it's not clear, I'd like to emphasize that these are compiler issues, not D language issues.
Dec 22 2010
next sibling parent reply Andreas Mayer <spam bacon.eggs> writes:
Walter Bright Wrote:

 I notice you are using doubles in D. dmd currently uses the x87 to evaluate 
 doubles, and on some processors the x87 is slow relative to using the XMM 
 instructions. Also, dmd's back end doesn't align the doubles on 16 byte 
 boundaries, which can also slow down the floating point on some processors.

Using long instead of double, it is still slower than LuaJIT (223 ms on my machine). Even with int it still takes 101 ms and is at least 3x slower than LuaJIT.
 Both of these code gen issues with dmd are well known, and I'd like to solve 
 them after we address higher priority issues.
 
 If it's not clear, I'd like to emphasize that these are compiler issues, not D 
 language issues.

I shouldn't use D now? How long until it is ready?
Dec 22 2010
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
Andreas Mayer wrote:
 I shouldn't use D now? How long until it is ready?

It depends on what you want to do. A lot of people are successfully using D.
Dec 22 2010
prev sibling parent reply Lutger Blijdestijn <lutger.blijdestijn gmail.com> writes:
Andreas Mayer wrote:

 Walter Bright Wrote:
 
 I notice you are using doubles in D. dmd currently uses the x87 to
 evaluate doubles, and on some processors the x87 is slow relative to
 using the XMM instructions. Also, dmd's back end doesn't align the
 doubles on 16 byte boundaries, which can also slow down the floating
 point on some processors.

Using long instead of double, it is still slower than LuaJIT (223 ms on my machine). Even with int it still takes 101 ms and is at least 3x slower than LuaJIT.
 Both of these code gen issues with dmd are well known, and I'd like to
 solve them after we address higher priority issues.
 
 If it's not clear, I'd like to emphasize that these are compiler issues,
 not D language issues.

I shouldn't use D now? How long until it is ready?

You may want to explore the great language shootout before drawing that conclusion: http://shootout.alioth.debian.org/ LuaJit ranks high there, but still a bit below the fastest compiled languages (and the fastest java). D is not included anymore, but it once was and these benchmarks can still be found: http://shootout.alioth.debian.org/debian/performance.php LuaJit performance is impressive, far above any 'scripting' language. Just look at some numbers in the shootout comparing it to ruby or python.
Dec 22 2010
parent Lutger Blijdestijn <lutger.blijdestijn gmail.com> writes:
I meant to link this, it includes all benchmarks and ranks gdc at 5th place 
and dmd at 8 (from 2008):

http://shootout.alioth.debian.org/debian/benchmark.php?test=all&lang=all
Dec 22 2010
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
Walter Bright wrote:
 Andreas Mayer wrote:
 Or is D, unlike I thought, not suitable for high performance 
 computing? What
 should I do?


I forgot to mention. In the D version, use integers as a loop counter, not doubles.
Dec 22 2010
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andreas Mayer:

 To see what performance advantage D would give me over using a scripting
language, I made a small benchmark. It consists of this code:

I have done (and I am doing) many benchmarks with D, and I too have seen similar results. I have discussed this topic two times in past, this was one time: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=110419 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=110420 For Floating-point-heavy code Lua-JIT is often faster than D compiled with DMD. I have found that on the SciMark2 benchmark too Lua-JIT is faster than D code compiled with DMD. On the other hand if I use LDC I am often able to beat LuaJIT 2.0.0-beta3 (we are now at beta5) (if the D code doesn't ask for too much inlining). The Lua-JIT is written by a very smart person, maybe a kind of genius that has recently given ideas to designers of V8 and Firefox JS Engine. The LuaJIT uses very well SSE registers and being a JIT it has more runtime information about the code, so it is able to optimize it better. It unrolls dynamically, inlines dynamic things, etc. DMD doesn't perform enough optimizations. Keep in mind that the main purpose of DMD is now to finish implementing D (and sometimes to find what to implement! Because there are some unfinished corners in D design). Performance tuning is mostly for later. ------------------------- Walter Bright:
If it's not clear, I'd like to emphasize that these are compiler issues, not D
language issues.<

Surely Lua looks like a far worse language regarding optimization opportunities. But people around here (like you) must start to realize that JIT compilation is not what it used to be. Today the JIT compilation done by the JavaVM is able to perform de-virtualization, dynamic loop unrolling, inlining across "compilation units", and some other optimizations that despite are "not language issues" are not done or not done enough by static compilers like LDC, GCC, DMD. The result is that SciMark2 benchmark is about as fast in Java and C, and for some sub-benchmarks it is faster :-) Bye, bearophile
Dec 22 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
bearophile wrote:
 Surely Lua looks like a far worse language regarding optimization
 opportunities. But people around here (like you) must start to realize that
 JIT compilation is not what it used to be. Today the JIT compilation done by
 the JavaVM is able to perform de-virtualization, dynamic loop unrolling,

The Java JIT did that 15 years ago. I think you forget that I wrote on a Java compiler way back in the day (the companion JIT was done by Steve Russell, yep, the Optlink guy).
 inlining across "compilation units",

dmd does cross-module inlining.
 and some other optimizations that
 despite are "not language issues" are not done or not done enough by static
 compilers like LDC, GCC, DMD. The result is that SciMark2 benchmark is about
 as fast in Java and C, and for some sub-benchmarks it is faster :-)

Inherent Java slowdowns are not in numerical code. The Java language isn't inherently worse at numerics than C, C++, D, etc. Where Java is inherently worse is in its excessive reliance on dynamic allocation (and that is rare in numeric code - you don't "new" a double).
Dec 22 2010
parent bearophile <bearophileHUGS lycos.com> writes:
 I think you forget that I wrote on a Java compiler way back in the day

I remember it :-)
 dmd does cross-module inlining.

I didn't know this, much... Bye, bearophile
Dec 22 2010
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 12/22/10, Steven Schveighoffer <schveiguy yahoo.com> wrote:
 Without any imperical testing, I would guess this has something to do with
 the lack of inlining for algorithmic functions.  This is due primarily to
 uses of enforce, which use lazy parameters, which are currently not
 inlinable (also, ensure you use -O -release -inline for the most optimized
 code).

I have just tried removing enforce usage from Phobos and recompiling the library, and compiling again with -O -release -inline. It doesn't appear to make a difference in the timing speed over multiple runs.
Dec 22 2010
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/22/10 4:04 PM, Andreas Mayer wrote:
 To see what performance advantage D would give me over using a scripting
language, I made a small benchmark. It consists of this code:

     auto L = iota(0.0, 10000000.0);
     auto L2 = map!"a / 2"(L);
     auto L3 = map!"a + 2"(L2);
     auto V = reduce!"a + b"(L3);

It runs in 281 ms on my computer.

Thanks for posting the numbers. That's a long time, particularly considering that the two map instances don't do anything. So the bulk of the computation is: auto L = iota(0.0, 10000000.0); auto V = reduce!"a + b"(L3); There is one inherent problem that affects the speed of iota: in iota, the value at position i is computed as 0.0 + i * step, where step is computed from the limits. That's one addition and a multiplication for each pass through iota. Given that the actual workload of the loop is only one addition, we are doing a lot more work. I suspect that that's the main issue there. The reason for which iota does that instead of the simpler increment is that iota must iterate the same values forward and backward. Using ++ may interact with floating-point vagaries, so the code is currently conservative. Another issue is the implementation of reduce. Reduce is fairly general which may mean that it generates mediocre code for that particular case. We can always optimize the general case and perhaps specialize for select cases. Once we figure where the problem is, there are numerous possibilities to improve the code: 1. Have iota check in the constructor whether the limits allow ++ to be precise. If so, use that. Of course, that means an extra runtime test... 3. Give up on iota being a random access or bidirectional range. If it's a forward range, we don't need to worry about going backwards. 4. Improve reduce as described above.
 The same code in Lua (using LuaJIT) runs in 23 ms.

 That's about 10 times faster. I would have expected D to be faster. Did I do
something wrong?

 The first Lua version uses a simplified design. I thought maybe that is unfair
to ranges, which are more complicated. You could argue ranges have more
features and do more work. To make it fair, I made a second Lua version of the
above benchmark that emulates ranges. It is still 29 ms fast.

 The full D version is here: http://pastebin.com/R5AGHyPx
 The Lua version: http://pastebin.com/Sa7rp6uz
 Lua version that emulates ranges: http://pastebin.com/eAKMSWyr

 Could someone help me solving this mystery?

 Or is D, unlike I thought, not suitable for high performance computing? What
should I do?

Thanks very much for taking the time to measure and post results, this is very helpful. As this test essentially measures the performance of iota and reduce, it would be hasty to generalize the assessment. Nevertheless, we need to look into improving this particular microbenchmark. Please don't forget to print the result of the computation in both languages, as there's always the possibility of some oversight. Andrei
Dec 22 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrei:

 As this test essentially measures the performance of 
 iota and reduce, it would be hasty to generalize the assessment. 

From other tests I have seen that often FP-heavy code is faster with Lua-JIT than with D-DMD. But on average the speed difference is much less than 10 times, generally no more than 2 times. One benchmark, Lua and D code (both OOP and C-style included, plus several manually optimized D versions): http://tinyurl.com/yeo2g8j Bye, bearophile
Dec 22 2010
prev sibling next sibling parent reply "Simen kjaeraas" <simen.kjaras gmail.com> writes:
spir <denis.spir gmail.com> wrote:

 There is a point I don't understand here: Iota is a range-struct  
 template, with
     void popFront()
     {
         current += step;
     }
 So, how does the computation of an arbitrary element at a given index  
 affect looping speed? For mappings (and any kind of traversal, indeed),  
 there should be an addition per element. Else, why define a range  
 interface at all? What do I miss?

With floating-point numbers, the above solution does not always work. If step == 1, increasing current by step amount will stop working at some point, at which the range will then grind to a halt. If instead one multiplies step by the current number of steps taken, and adds to the origin, this problem disappears. As an example of when this problem shows up, try this code: float f = 16_777_216; auto f2 = f + 1; assert( f == f2 ); The assert passes. -- Simen
Dec 23 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Simen kjaeraas:

 With floating-point numbers, the above solution does not always work.

The type is known at compile time, so you can split the algorithm in two with a "static if", and do something else if it's an integral type. Bye, bearophile
Dec 23 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/23/10 6:57 AM, bearophile wrote:
 Simen kjaeraas:

 With floating-point numbers, the above solution does not always work.

The type is known at compile time, so you can split the algorithm in two with a "static if", and do something else if it's an integral type.

That's what the code currently does. Andrei
Dec 23 2010
prev sibling next sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
bearophile <bearophileHUGS lycos.com> wrote:

 Simen kjaeraas:

 With floating-point numbers, the above solution does not always work.

The type is known at compile time, so you can split the algorithm in two with a "static if", and do something else if it's an integral type.

Absolutely. However, in this example doubles were used. Also, though it may be doable for integers, other types may also want that optimization (BigInt comes to mind). A behavesAsIntegral!T might fix that. -- Simen
Dec 23 2010
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/22/10 8:16 PM, Andrei Alexandrescu wrote:
 On 12/22/10 4:04 PM, Andreas Mayer wrote:
 To see what performance advantage D would give me over using a
 scripting language, I made a small benchmark. It consists of this code:

 auto L = iota(0.0, 10000000.0);
 auto L2 = map!"a / 2"(L);
 auto L3 = map!"a + 2"(L2);
 auto V = reduce!"a + b"(L3);

It runs in 281 ms on my computer.

Thanks for posting the numbers. That's a long time, particularly considering that the two map instances don't do anything. So the bulk of the computation is: auto L = iota(0.0, 10000000.0); auto V = reduce!"a + b"(L3);

Oops, I was wrong. The two instances of map do something, I thought they're all applied to L when in fact they are chained. So my estimates are incorrect. At any rate, clearly iota incurs a 2x cost, which probably composes with other similar costs incurred by map. Andrei
Dec 23 2010
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/23/10 6:22 AM, spir wrote:
 On Wed, 22 Dec 2010 20:16:45 -0600
 Andrei Alexandrescu<SeeWebsiteForEmail erdani.org>  wrote:

 Thanks for posting the numbers. That's a long time, particularly
 considering that the two map instances don't do anything. So the bulk of
 the computation is:

 auto L = iota(0.0, 10000000.0);
 auto V = reduce!"a + b"(L3);

 There is one inherent problem that affects the speed of iota: in iota,
 the value at position i is computed as 0.0 + i * step, where step is
 computed from the limits. That's one addition and a multiplication for
 each pass through iota. Given that the actual workload of the loop is
 only one addition, we are doing a lot more work. I suspect that that's
 the main issue there.

 The reason for which iota does that instead of the simpler increment is
 that iota must iterate the same values forward and backward. Using ++
 may interact with floating-point vagaries, so the code is currently
 conservative.

There is a point I don't understand here: Iota is a range-struct template, with void popFront() { current += step; }

You need to look at this specialization: http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L3800 and keep in mind Simen's explanation. Andrei
Dec 23 2010
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/22/10 4:04 PM, Andreas Mayer wrote:
 To see what performance advantage D would give me over using a scripting
language, I made a small benchmark. It consists of this code:

     auto L = iota(0.0, 10000000.0);
     auto L2 = map!"a / 2"(L);
     auto L3 = map!"a + 2"(L2);
     auto V = reduce!"a + b"(L3);

It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do?

I reproduced the problem with a test program as shown below. On my machine the D iota runs in 108ms, whereas a baseline using a handwritten loop runs in 43 ms. I then replaced iota's implementation with a simpler one that's a forward range. Then the performance became exactly the same as for the simple loop. Andreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks! Andrei // D version, with std.algorithm // ~ 281 ms, using dmd 2.051 (dmd -O -release -inline) import std.algorithm; import std.stdio; import std.range; import std.traits; struct Iota2(N, S) if (isFloatingPoint!N && isNumeric!S) { private N start, end, current; private S step; this(N start, N end, S step) { this.start = start; this.end = end; this.step = step; current = start; } /// Range primitives property bool empty() const { return current >= end; } /// Ditto property N front() { return current; } /// Ditto alias front moveFront; /// Ditto void popFront() { assert(!empty); current += step; } property Iota2 save() { return this; } } auto iota2(B, E, S)(B begin, E end, S step) if (is(typeof((E.init - B.init) + 1 * S.init))) { return Iota2!(CommonType!(Unqual!B, Unqual!E), S)(begin, end, step); } void main(string args[]) { double result; auto limit = 10_000_000.0; if (args.length > 1) { writeln("iota"); auto L = iota2(0.0, limit, 1.0); result = reduce!"a + b"(L); } else { writeln("baseline"); result = 0.0; for (double i = 0; i != limit; ++i) { result += i; } } writefln("%f", result); }
Dec 22 2010
next sibling parent reply Andreas Mayer <spam bacon.eggs > writes:
Andrei Alexandrescu Wrote:

 Andreas, any chance you could run this on your machine and compare it 
 with Lua? (I don't have Lua installed.) Thanks!

Your version: 40 ms (iota and baseline give the same timings) LuaJIT with map calls removed: 21 ms Interesting results.
Dec 22 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/22/10 11:06 PM, Andreas Mayer wrote:
 Andrei Alexandrescu Wrote:

 Andreas, any chance you could run this on your machine and compare it
 with Lua? (I don't have Lua installed.) Thanks!

Your version: 40 ms (iota and baseline give the same timings) LuaJIT with map calls removed: 21 ms Interesting results.

Cool, thanks. I also tested against this C++ baseline: #include <stdio.h> int main() { const double limit = 10000000.0; double result = 0.0; for (double i = 0; i != limit; ++i) { result += i; } printf("%f\n", result); } The baseline (compiled with -O3) runs in 21 ms on my machine, which means (if my and Andreas' machines are similar in performance) that Lua has essentially native performance for this loop and D has an issue in code generation that makes it 2x slower. I think this could be filed as a performance bug for dmd. I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas? Andrei
Dec 22 2010
next sibling parent reply Brad Roberts <braddr slice-2.puremagic.com> writes:
On Wed, 22 Dec 2010, Andrei Alexandrescu wrote:

 On 12/22/10 11:06 PM, Andreas Mayer wrote:
 Andrei Alexandrescu Wrote:
 
 Andreas, any chance you could run this on your machine and compare it
 with Lua? (I don't have Lua installed.) Thanks!

Your version: 40 ms (iota and baseline give the same timings) LuaJIT with map calls removed: 21 ms Interesting results.

Cool, thanks. I also tested against this C++ baseline: #include <stdio.h> int main() { const double limit = 10000000.0; double result = 0.0; for (double i = 0; i != limit; ++i) { result += i; } printf("%f\n", result); } The baseline (compiled with -O3) runs in 21 ms on my machine, which means (if my and Andreas' machines are similar in performance) that Lua has essentially native performance for this loop and D has an issue in code generation that makes it 2x slower. I think this could be filed as a performance bug for dmd. I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas? Andrei

Since the timing code isn't here, I'm assuming you guys are doing the testing around the whole app. While that might be interesting, it's hiding an awfully large and important difference, application startup time. C has very little, D quite a bit more, and I don't know what Lua looks like there. If the goal is to test this math code, you'll need to separate the two. At this point, I highly suspect you're really measuring the runtime costs.
Dec 22 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/22/10 11:40 PM, Brad Roberts wrote:
 Since the timing code isn't here, I'm assuming you guys are doing the
 testing around the whole app.  While that might be interesting, it's
 hiding an awfully large and important difference, application startup
 time.

 C has very little, D quite a bit more, and I don't know what Lua looks
 like there.  If the goal is to test this math code, you'll need to
 separate the two.

 At this point, I highly suspect you're really measuring the runtime costs.

One thing I didn't mention is that I also measured with 10x the counter limit. That brings the run time to seconds, and the relative difference persists. So application startup time is negligible in this case. Andrei
Dec 23 2010
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei:

 I'm thinking what to do about iota, which has good features but exacts 
 too much cost on tight loop performance. One solution would be to define 
 iota to be the simple, forward range that I defined as Iota2 in my 
 previous post. Then, we need a different name for the full-fledged iota 
 (random-access, has known length, iterates through the same numbers 
 forward and backward etc). Ideas?

Is improving the compiler instead an option? Bye, bearophile
Dec 23 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/23/10 2:52 AM, bearophile wrote:
 Andrei:

 I'm thinking what to do about iota, which has good features but exacts
 too much cost on tight loop performance. One solution would be to define
 iota to be the simple, forward range that I defined as Iota2 in my
 previous post. Then, we need a different name for the full-fledged iota
 (random-access, has known length, iterates through the same numbers
 forward and backward etc). Ideas?

Is improving the compiler instead an option?

It's more of a separate matter than an option. Iota currently does a fair amount of work for floating-point types, and a contemporary optimizer cannot be reasonably expected to simplify that code. Andrei
Dec 23 2010
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday 23 December 2010 05:22:55 spir wrote:
 On Wed, 22 Dec 2010 23:22:56 -0600
 
 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 I'm thinking what to do about iota, which has good features but exacts
 too much cost on tight loop performance. One solution would be to define
 iota to be the simple, forward range that I defined as Iota2 in my
 previous post. Then, we need a different name for the full-fledged iota
 (random-access, has known length, iterates through the same numbers
 forward and backward etc). Ideas?

I would keep length and add an opIn: if (e in interval) {...}. (I'm unsure whether it's worth allowing different types for bounds and/or for step; I'd rather make things simple.) Then, you could call it Interval, what do you think? Note: The result would be very similar to python (x)ranges. D has a notation for a slightly narrower notion: '..'. Thus, what about: Interval!int interval = 1..9; or else: auto interval = Interval!int(1..9); ? What kind of thingie does "i..j" actually construct as of now?

I believe that the only place that .. works is within []. If an object overrides an opSlice() which takes parameters, then that syntax can be used. I don't believe that it works on its own at all. - Jonathan M Davis
Dec 23 2010
prev sibling parent spir <denis.spir gmail.com> writes:
On Thu, 23 Dec 2010 05:29:32 -0800
Jonathan M Davis <jmdavisProg gmx.com> wrote:

 On Thursday 23 December 2010 05:22:55 spir wrote:
 On Wed, 22 Dec 2010 23:22:56 -0600
=20
 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 I'm thinking what to do about iota, which has good features but exacts
 too much cost on tight loop performance. One solution would be to def=



 iota to be the simple, forward range that I defined as Iota2 in my
 previous post. Then, we need a different name for the full-fledged io=



 (random-access, has known length, iterates through the same numbers
 forward and backward etc). Ideas?

I would keep length and add an opIn: if (e in interval) {...}. (I'm uns=


 whether it's worth allowing different types for bounds and/or for step;
 I'd rather make things simple.) Then, you could call it Interval, what =


 you think?
=20
 Note: The result would be very similar to python (x)ranges. D has a
 notation for a slightly narrower notion: '..'. Thus, what about:
 Interval!int interval =3D 1..9;
 or else:
 	auto interval =3D Interval!int(1..9);
 ?
=20
 What kind of thingie does "i..j" actually construct as of now?

I believe that the only place that .. works is within []. If an object ov=

 an opSlice() which takes parameters, then that syntax can be used.

;-) There's also foreach(n ; i..j) {...} Precisely, that's what I was thinking at when stating that D has a notation= for a very close (but narrower) notion. Slicing is related, but much farth= er since it does not necessarily resuire iteration (bit it's result does al= low it). Note: Iota is right-side exclusive like i..j . (I've just been caught by th= is trap ;-)
  I don't believe that it works on its own at all.

Certainly not. This would be a syntactic addition. The reason why I asked w= hat "i..j" currently yield --if it yield anything (could just rewrite). denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Dec 23 2010
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On Wed, 22 Dec 2010 23:22:56 -0600
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 I'm thinking what to do about iota, which has good features but exacts=20
 too much cost on tight loop performance. One solution would be to define=

 iota to be the simple, forward range that I defined as Iota2 in my=20
 previous post. Then, we need a different name for the full-fledged iota=20
 (random-access, has known length, iterates through the same numbers=20
 forward and backward etc). Ideas?

I would keep length and add an opIn: if (e in interval) {...}. (I'm unsure = whether it's worth allowing different types for bounds and/or for step; I'd= rather make things simple.) Then, you could call it Interval, what do you = think? Note: The result would be very similar to python (x)ranges. D has a notatio= n for a slightly narrower notion: '..'. Thus, what about: Interval!int interval =3D 1..9; or else: auto interval =3D Interval!int(1..9); ? What kind of thingie does "i..j" actually construct as of now? Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Dec 23 2010
prev sibling next sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
spir <denis.spir gmail.com> wrote:

 On Wed, 22 Dec 2010 23:22:56 -0600
 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 I'm thinking what to do about iota, which has good features but exacts
 too much cost on tight loop performance. One solution would be to define
 iota to be the simple, forward range that I defined as Iota2 in my
 previous post. Then, we need a different name for the full-fledged iota
 (random-access, has known length, iterates through the same numbers
 forward and backward etc). Ideas?

I would keep length and add an opIn: if (e in interval) {...}. (I'm unsure whether it's worth allowing different types for bounds and/or for step; I'd rather make things simple.) Then, you could call it Interval, what do you think? Note: The result would be very similar to python (x)ranges. D has a notation for a slightly narrower notion: '..'. Thus, what about: Interval!int interval = 1..9; or else: auto interval = Interval!int(1..9); ? What kind of thingie does "i..j" actually construct as of now?

Nothing. The syntax only works in foreach and opSlice. However, this works: final abstract class Intervals { struct Interval( T ) { T start, end; } static Interval!T opSlice( T )( T start, T end ) { return Interval!T( start, end ); } } auto intInterval = Intervals[1..2]; auto stringInterval = Intervals["foo".."bar"]; -- Simen
Dec 23 2010
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On Thu, 23 Dec 2010 14:40:13 +0100
"Simen kjaeraas" <simen.kjaras gmail.com> wrote:

 What kind of thingie does "i..j" actually construct as of now? =20

Nothing. The syntax only works in foreach and opSlice. =20 However, this works: =20 final abstract class Intervals { struct Interval( T ) { T start, end; } static Interval!T opSlice( T )( T start, T end ) { return Interval!T( start, end ); } } =20 auto intInterval =3D Intervals[1..2]; auto stringInterval =3D Intervals["foo".."bar"];

Nice! (even impressive :-) Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Dec 23 2010
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/23/10 7:04 AM, spir wrote:
 On Wed, 22 Dec 2010 22:14:34 -0600
 Andrei Alexandrescu<SeeWebsiteForEmail erdani.org>  wrote:

 I then replaced iota's implementation with a simpler one that's a
 forward range. Then the performance became exactly the same as for the
 simple loop.

After having watched Iota's very general implementation, I tried the same change, precisely. Actually, with an even simpler range requiring a single element type for (first,last,step). For any reason, this alternative is slightly slower by me than using Iota (don't cry watching absolute times, my computer is old and slow ;-). Sample code below, typical results are: 1.1 3.3 5.5 7.7 Interval time: 1149 Iota time: 1066 Note: adding an assert to ensure front or popfront is not wrongly called past the end adds ~ 20% time.

I cut my losses reading here :o). No performance test is meaningful without all optimizations turned on. Andrei
Dec 23 2010
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
That's odd, I'm getting opposite results:

iota = 78ms
baseline = 187ms

Andreas' old code gives:
421ms

This is over multiple runs so I'm getting the average out of about 20 runs.

On 12/23/10, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 On 12/22/10 4:04 PM, Andreas Mayer wrote:
 To see what performance advantage D would give me over using a scripting
 language, I made a small benchmark. It consists of this code:

     auto L = iota(0.0, 10000000.0);
     auto L2 = map!"a / 2"(L);
     auto L3 = map!"a + 2"(L2);
     auto V = reduce!"a + b"(L3);

It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do?

I reproduced the problem with a test program as shown below. On my machine the D iota runs in 108ms, whereas a baseline using a handwritten loop runs in 43 ms. I then replaced iota's implementation with a simpler one that's a forward range. Then the performance became exactly the same as for the simple loop. Andreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks! Andrei // D version, with std.algorithm // ~ 281 ms, using dmd 2.051 (dmd -O -release -inline) import std.algorithm; import std.stdio; import std.range; import std.traits; struct Iota2(N, S) if (isFloatingPoint!N && isNumeric!S) { private N start, end, current; private S step; this(N start, N end, S step) { this.start = start; this.end = end; this.step = step; current = start; } /// Range primitives property bool empty() const { return current >= end; } /// Ditto property N front() { return current; } /// Ditto alias front moveFront; /// Ditto void popFront() { assert(!empty); current += step; } property Iota2 save() { return this; } } auto iota2(B, E, S)(B begin, E end, S step) if (is(typeof((E.init - B.init) + 1 * S.init))) { return Iota2!(CommonType!(Unqual!B, Unqual!E), S)(begin, end, step); } void main(string args[]) { double result; auto limit = 10_000_000.0; if (args.length > 1) { writeln("iota"); auto L = iota2(0.0, limit, 1.0); result = reduce!"a + b"(L); } else { writeln("baseline"); result = 0.0; for (double i = 0; i != limit; ++i) { result += i; } } writefln("%f", result); }

Dec 22 2010
prev sibling next sibling parent =?ISO-8859-1?Q?Pelle_M=E5nsson?= <pelle.mansson gmail.com> writes:
On 12/22/2010 11:04 PM, Andreas Mayer wrote:
 To see what performance advantage D would give me over using a scripting
language, I made a small benchmark. It consists of this code:

     auto L = iota(0.0, 10000000.0);
     auto L2 = map!"a / 2"(L);
     auto L3 = map!"a + 2"(L2);
     auto V = reduce!"a + b"(L3);

It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do?

I changed the code to this: auto L = iota(0, 10000000); auto L2 = map!"a / 2.0"(L); auto L3 = map!"a + 2"(L2); auto V = reduce!"a + b"(L3); and ripped the caching out of std.algorithm.map. :-) This made it go from about 1.4 seconds to about 0.4 seconds on my machine. Note that I did no rigorous or scientific testing. Also, if you really really need the performance you can change it all to lower level code, should you want to.
Dec 23 2010
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On Wed, 22 Dec 2010 20:16:45 -0600
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 Thanks for posting the numbers. That's a long time, particularly=20
 considering that the two map instances don't do anything. So the bulk of=

 the computation is:
=20
 auto L =3D iota(0.0, 10000000.0);
 auto V =3D reduce!"a + b"(L3);
=20
 There is one inherent problem that affects the speed of iota: in iota,=20
 the value at position i is computed as 0.0 + i * step, where step is=20
 computed from the limits. That's one addition and a multiplication for=20
 each pass through iota. Given that the actual workload of the loop is=20
 only one addition, we are doing a lot more work. I suspect that that's=20
 the main issue there.
=20
 The reason for which iota does that instead of the simpler increment is=20
 that iota must iterate the same values forward and backward. Using ++=20
 may interact with floating-point vagaries, so the code is currently=20
 conservative.

There is a point I don't understand here: Iota is a range-struct template, = with void popFront() { current +=3D step; } So, how does the computation of an arbitrary element at a given index affec= t looping speed? For mappings (and any kind of traversal, indeed), there sh= ould be an addition per element. Else, why define a range interface at all?= What do I miss? Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Dec 23 2010
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On Wed, 22 Dec 2010 22:14:34 -0600
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 I then replaced iota's implementation with a simpler one that's a=20
 forward range. Then the performance became exactly the same as for the=20
 simple loop.

After having watched Iota's very general implementation, I tried the same c= hange, precisely. Actually, with an even simpler range requiring a single e= lement type for (first,last,step). For any reason, this alternative is slig= htly slower by me than using Iota (don't cry watching absolute times, my co= mputer is old and slow ;-). Sample code below, typical results are: 1.1 3.3 5.5 7.7=20 Interval time: 1149 Iota time: 1066 Note: adding an assert to ensure front or popfront is not wrongly called pa= st the end adds ~ 20% time. Note: I think this demonstates that using Iota does not perform undue compu= tations (multiplication to get Nth element with multiplication + addition),= or do I misinterpret? Anyway, what is wrong in my code? What doesn't it perform better? import std.algorithm : map, filter, reduce; import std.range : iota; struct Interval (T) { alias T Element; Element first, last, step; private Element element; this (Element first, Element last, Element step=3D1) { this.first =3D first; this.last =3D last; this.step =3D step; this.element =3D first; } property void popFront () { this.element +=3D this.step; } property bool empty () { return (this.element > this.last); } property Element front () { return this.element; } } void main () { auto nums =3D Interval!float(1.1,8.8, 2.2); foreach(n ; nums) writef("%s ", n); writeln(); auto t1 =3D time(); auto nums1 =3D Interval!int(0, 10_000_000); auto halves1 =3D map!"a/2"(nums1); auto incs1 =3D map!"a+2"(halves1); auto result1 =3D reduce!"a+b"(incs1); writefln("Interval time: %s", time() - t1); =20 auto t2 =3D time(); auto nums2 =3D iota(0, 10_000_000); auto halves2 =3D map!"a/2"(nums2); auto incs2 =3D map!"a+2"(halves2); auto result2 =3D reduce!"a+b"(incs2); writefln("Iota time: %s", time() - t2); } Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Dec 23 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/22/10 4:04 PM, Andreas Mayer wrote:
 To see what performance advantage D would give me over using a scripting
language, I made a small benchmark. It consists of this code:

     auto L = iota(0.0, 10000000.0);
     auto L2 = map!"a / 2"(L);
     auto L3 = map!"a + 2"(L2);
     auto V = reduce!"a + b"(L3);

It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do?

I wrote a new test bench and got 41 ms for the baseline and 220 ms for the code based on map and iota. (Surprisingly, the extra work didn't affect the run time, which suggests the loop is dominated by the counter increment and test.) Then I took out the cache in map and got 136 ms. Finally, I replaced the use of iota with iota2 and got performance equal to that of handwritten code. Code below. I decided to check in the map cache removal. We discussed it a fair amount among Phobos devs. I have no doubts caching might help in certain cases, but it does lead to surprising performance loss for simple cases like the one tested here. See http://www.dsource.org/projects/phobos/changeset/2231 If the other Phobos folks approve, I'll also specialize iota for floating point numbers to be a forward range and defer the decision on defining a "randomAccessIota" for floating point numbers to later. That would complete the library improvements pointed to by this test, leaving further optimization to compiler improvements. Thanks Andreas for starting this. Andrei import std.algorithm; import std.stdio; import std.range; import std.traits; struct Iota2(N, S) if (isFloatingPoint!N && isNumeric!S) { private N start, end, current; private S step; this(N start, N end, S step) { this.start = start; this.end = end; this.step = step; current = start; } /// Range primitives property bool empty() const { return current >= end; } /// Ditto property N front() { return current; } /// Ditto alias front moveFront; /// Ditto void popFront() { assert(!empty); current += step; } property Iota2 save() { return this; } } auto iota2(B, E, S)(B begin, E end, S step) if (is(typeof((E.init - B.init) + 1 * S.init))) { return Iota2!(CommonType!(Unqual!B, Unqual!E), S)(begin, end, step); } void main(string args[]) { double result; auto limit = 10_000_000.0; if (args.length > 1) { writeln("iota"); auto L = iota2(0.0, limit, 1.0); auto L2 = map!"a / 2"(L); auto L3 = map!"a + 2"(L2); result = reduce!"a + b"(L3); } else { writeln("baseline"); result = 0.0; for (double i = 0; i != limit; ++i) { result += (i / 2) + 2; } } writefln("%f", result); }
Dec 23 2010
next sibling parent reply "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 I decided to check in the map cache removal. We discussed it a fair  
 amount among Phobos devs. I have no doubts caching might help in certain  
 cases, but it does lead to surprising performance loss for simple cases  
 like the one tested here. See  
 http://www.dsource.org/projects/phobos/changeset/2231

It seems to me that having a Cached range might be a better, more general solution in any case. -- Simen
Dec 23 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/23/10 10:09 AM, Simen kjaeraas wrote:
 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 I decided to check in the map cache removal. We discussed it a fair
 amount among Phobos devs. I have no doubts caching might help in
 certain cases, but it does lead to surprising performance loss for
 simple cases like the one tested here. See
 http://www.dsource.org/projects/phobos/changeset/2231

It seems to me that having a Cached range might be a better, more general solution in any case.

Agreed. Andrei
Dec 23 2010
prev sibling next sibling parent reply "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 http://www.dsource.org/projects/phobos/changeset/2231

BTW, shouldn't range constructors call .save for forward ranges? This one certainly doesn't. -- Simen
Dec 23 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/23/10 10:14 AM, Simen kjaeraas wrote:
 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 http://www.dsource.org/projects/phobos/changeset/2231

BTW, shouldn't range constructors call .save for forward ranges? This one certainly doesn't.

Currently higher-order ranges assume that the range passed-in is good to take ownership of. A range or algorithm should call save only in case extra copies need to be created. Andrei
Dec 23 2010
prev sibling parent Jimmy Cao <jcao219 gmail.com> writes:
--0023547c8c330e5eec049822503e
Content-Type: text/plain; charset=ISO-8859-1

I hope that in the future more implementations in D can be compared for
performance against their equivalent Lua translations.
It seems that LuaJIT is a super speedy dynamic language, and it is
specifically designed to break into the performance ranges of optimized
static languages, which makes it a formidable competitor.

--0023547c8c330e5eec049822503e
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

I hope that in the future more implementations in D can be compared for per=
formance against their equivalent Lua translations.<div>It seems that LuaJI=
T is a super speedy dynamic language, and it is specifically designed to br=
eak into the performance ranges of optimized static languages, which makes =
it a formidable competitor.</div>


--0023547c8c330e5eec049822503e--
Dec 23 2010