
digitalmars.D - From a C++/JS benchmark

reply bearophile <bearophileHUGS lycos.com> writes:
The benchmark info:
http://chadaustin.me/2011/01/digging-into-javascript-performance/

The code, in C++, JS, Java, C#:
https://github.com/chadaustin/Web-Benchmarks/
The C++/JS/Java code runs on a single core.

D2 version translated from the C# version (the C++ version uses struct
inheritance!):
http://ideone.com/kf1tz

Bye,
bearophile
Aug 03 2011
next sibling parent reply Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
03.08.2011 18:20, bearophile:
 The benchmark info:
 http://chadaustin.me/2011/01/digging-into-javascript-performance/

 The code, in C++, JS, Java, C#:
 https://github.com/chadaustin/Web-Benchmarks/
 The C++/JS/Java code runs on a single core.

 D2 version translated from the C# version (the C++ version uses struct
inheritance!):
 http://ideone.com/kf1tz

 Bye,
 bearophile
Compilers:
C++:  cl /O2 /Oi /Ot /Oy /GT /GL and link /STACK:10240000
Java: Oracle Java 1.6 with hm... Oracle default settings
C#:   Csc /optimize+
D2:   dmd -O -noboundscheck -inline -release

Type column: working scalar type
Other columns: vertices per second (inaccuracy is about 1%) by language (tests from bearophile's message, C++ test is "skinning_test_no_simd.cpp").

System: Windows XP, Core 2 Duo E6850

-----------------------------------------------------------
  Type  |    C++     |    Java    |     C#     |     D2
-----------------------------------------------------------
float   | 31_400_000 | 17_000_000 | 14_700_000 |    168_000
double  | 32_300_000 | 16_000_000 | 14_100_000 |    166_000
real    | 32_300_000 |   no real  |   no real  |    203_000
int     | 29_100_000 | 14_600_000 | 14_100_000 | 16_500_000
long    | 29_100_000 |  6_600_000 |  4_400_000 |  5_800_000
-----------------------------------------------------------

The JavaScript vs C++ speed is at the first link of bearophile's original post, and JS is about 10-20 times slower than C++. Looks like a spiteful joke... In other words: WTF?! JavaScript is about 10 times faster than D in floating point calculations!? Please, tell me that I'm mistaken.
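Editor's sketch: the "about 10 times faster" figure comes from combining the float row of the table with the linked article's estimate that JS is 10-20 times slower than C++. A minimal worked calculation in C++, taking that 10-20x range as given; the constant and function names are hypothetical and exist only for this illustration.

```cpp
// Throughputs from the float row of the table (vertices per second).
constexpr double kCppFloat = 31400000.0;  // C++ (cl), float
constexpr double kD2Float  = 168000.0;    // D2 (dmd), float

// Estimated JS throughput if JS is `slowdown` times slower than C++.
constexpr double estimated_js(double slowdown) {
    return kCppFloat / slowdown;
}

// How many times faster than the dmd build the estimated JS figure is.
constexpr double js_over_d2(double slowdown) {
    return estimated_js(slowdown) / kD2Float;
}
```

With a slowdown of 20 the ratio is a little over 9, and with a slowdown of 10 it is a little under 19, so "about 10 times" sits at the conservative end of the estimate.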
Aug 03 2011
next sibling parent reply Ziad Hatahet <hatahet gmail.com> writes:
I believe that "long" in this case is 32 bits in C++, and 64-bits in the
remaining languages, hence the same result for int and long in C++. Try with
"long long" maybe? :)
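Editor's sketch of the point above, in C++: the width of "long" depends on the platform's data model, so a benchmark that wants a 64-bit scalar comparable to Java/C#/D "long" can use a fixed-width type instead. The helper `bits_of` and the alias `Scalar64` are hypothetical names, not part of the benchmark.

```cpp
#include <climits>
#include <cstdint>

// Width in bits of an integer type on the platform being compiled for.
template <typename T>
constexpr int bits_of() {
    return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// Fixed-width alias (hypothetical name): exactly 64 bits wherever it exists,
// regardless of how wide plain `long` happens to be.
using Scalar64 = std::int64_t;
```

On a compiler where `long` is 32 bits, `bits_of<long>()` yields 32 while `bits_of<Scalar64>()` still yields 64; `long long` is guaranteed to be at least 64 bits.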


--
Ziad


Aug 03 2011
parent reply Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
03.08.2011 22:15, Ziad Hatahet:
 I believe that "long" in this case is 32 bits in C++, and 64-bits in the
 remaining languages, hence the same result for int and long in C++. Try
 with "long long" maybe? :)


 --
 Ziad


Good! This is my first blunder (it's so easy to completely forget a language design that is illogical to me). So, the corrected last row:

  Type  |    C++     |    Java    |     C#     |     D2
-------------------------------------------------------------
long    |  5_500_000 |  6_600_000 |  4_400_000 |  5_800_000

Java is the fastest "long" language :)
Aug 03 2011
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
 System: Windows XP, Core 2 Duo E6850
Is this Windows XP 32 bit or 64 bit? That will probably make a difference on the longs I'd expect.
Aug 03 2011
next sibling parent reply David Nadlinger <see klickverbot.at> writes:
On 8/3/11 9:48 PM, Adam D. Ruppe wrote:
 System: Windows XP, Core 2 Duo E6850
Is this Windows XP 32 bit or 64 bit? That will probably make a difference on the longs I'd expect.
It doesn't; long is 32 bits wide on Windows x86_64 too (LLP64).

David
Aug 03 2011
parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
On 03.08.2011 at 21:52, David Nadlinger <see klickverbot.at> wrote:

 On 8/3/11 9:48 PM, Adam D. Ruppe wrote:
 System: Windows XP, Core 2 Duo E6850
Is this Windows XP 32 bit or 64 bit? That will probably make a difference on the longs I'd expect.
It doesn't, long is 32-bit wide on Windows x86_64 too (LLP64). David
I thought he was referring to the processor being able to handle 64-bit ints more efficiently in 64-bit operation mode on a 64-bit OS with 64-bit executables.
Aug 03 2011
parent Adam Ruppe <destructionator gmail.com> writes:
Marco Leise wrote:
 I thought he was referring to the processor being able to handle
 64-bit ints more efficiently in 64-bit operation mode on a 64-bit OS
 with 64-bit executables.
I was thinking a little of both but this is the main thing. My suspicion was that Java might have been using a 64 bit JVM and everything else was compiled in 32 bit, causing it to win in that place. But with a 32 bit OS that means 32 bit programs all around.
Aug 04 2011
prev sibling parent Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
03.08.2011 22:48, Adam D. Ruppe wrote:
 System: Windows XP, Core 2 Duo E6850
Is this Windows XP 32 bit or 64 bit? That will probably make a difference on the longs I'd expect.
I meant Windows XP 32 bit (5.1 (Build 2600: Service Pack 3)) (according to what is "Windows XP" in wikipedia)
Aug 03 2011
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Denis Shelomovskij:

 (tests from bearophile's message, C++ test is "skinning_test_no_simd.cpp").
For a more realistic test I suggest you also time the C++ version that uses the intrinsics (for float only).
 Looks like a spiteful joke... In other words: WTF?! JavaScript is about 
 10 times faster than D in floating point calculations!? Please, tell me 
 that I'm mistaken.
Languages aren't slow or fast; their implementations produce assembly that's more or less efficient.

A D1 version fit for LDC V1 with Tango:
http://codepad.org/ewDy31UH

Vertices (millions), Linux 32 bit:
  C++ no simd:  29.5
  D:            27.6

LDC based on DMD v1.057 and llvm 2.6, ldc -O3 -release -inline
G++ V4.3.3, -s -O3 -mfpmath=sse -ffast-math -msse3

It's a bit slower than the C++ version, but for most people that's an acceptable difference (and maybe porting the C++ code to D instead of the C# one and using a more modern LLVM you reduce that loss a bit).

Bye,
bearophile
Aug 03 2011
prev sibling parent reply Trass3r <un known.com> writes:
 Looks like a spiteful joke... In other words: WTF?! JavaScript is about  
 10 times faster than D in floating point calculations!? Please, tell me  
 that I'm mistaken.
I'm afraid not. dmd's backend isn't good at floating point calculations.
Aug 03 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Trass3r:

 I'm afraid not. dmd's backend isn't good at floating point calculations.
Studying the asm a bit, it's not hard to find the cause, because this benchmark is quite pure (synthetic, though I think it comes from real-world code).

This is what G++ generates from the C++ code without intrinsics (the version that uses SIMD intrinsics has a similar look but it's shorter):

  movl    (%eax), %edx
  movss   4(%eax), %xmm0
  movl    8(%eax), %ecx
  leal    (%edx,%edx,2), %edx
  sall    $4, %edx
  addl    %ebx, %edx
  testl   %ecx, %ecx
  movss   12(%edx), %xmm1
  movss   20(%edx), %xmm7
  movss   (%edx), %xmm5
  mulss   %xmm0, %xmm1
  mulss   %xmm0, %xmm7
  movss   4(%edx), %xmm6
  movss   8(%edx), %xmm4
  movss   %xmm1, (%esp)
  mulss   %xmm0, %xmm5
  movss   28(%edx), %xmm1
  movss   %xmm7, 4(%esp)
  mulss   %xmm0, %xmm6
  movss   32(%edx), %xmm7
  mulss   %xmm0, %xmm1
  movss   16(%edx), %xmm3
  mulss   %xmm0, %xmm7
  movss   24(%edx), %xmm2
  movss   %xmm1, 16(%esp)
  mulss   %xmm0, %xmm4
  movss   36(%edx), %xmm1
  movss   %xmm7, 8(%esp)
  mulss   %xmm0, %xmm3
  movss   40(%edx), %xmm7
  mulss   %xmm0, %xmm2
  mulss   %xmm0, %xmm1
  mulss   %xmm0, %xmm7
  mulss   44(%edx), %xmm0
  leal    12(%eax), %edx
  movss   %xmm7, 12(%esp)
  movss   %xmm0, 20(%esp)

This is what DMD generates for the same (or quite similar) piece of code:

  movsd
  mov     EAX,068h[ESP]
  imul    EDX,EAX,030h
  add     EDX,018h[ESP]
  fld     float ptr [EDX]
  fmul    float ptr 06Ch[ESP]
  fstp    float ptr 038h[ESP]
  fld     float ptr 4[EDX]
  fmul    float ptr 06Ch[ESP]
  fstp    float ptr 03Ch[ESP]
  fld     float ptr 8[EDX]
  fmul    float ptr 06Ch[ESP]
  fstp    float ptr 040h[ESP]
  fld     float ptr 0Ch[EDX]
  fmul    float ptr 06Ch[ESP]
  fstp    float ptr 044h[ESP]
  fld     float ptr 010h[EDX]
  fmul    float ptr 06Ch[ESP]
  fstp    float ptr 048h[ESP]
  fld     float ptr 014h[EDX]
  fmul    float ptr 06Ch[ESP]
  fstp    float ptr 04Ch[ESP]
  fld     float ptr 018h[EDX]
  fmul    float ptr 06Ch[ESP]
  fstp    float ptr 050h[ESP]
  fld     float ptr 01Ch[EDX]
  mov     CL,070h[ESP]
  xor     CL,1
  fmul    float ptr 06Ch[ESP]
  fstp    float ptr 054h[ESP]
  fld     float ptr 020h[EDX]
  fmul    float ptr 06Ch[ESP]
  fstp    float ptr 058h[ESP]
  fld     float ptr 024h[EDX]
  fmul    float ptr 06Ch[ESP]
  fstp    float ptr 05Ch[ESP]
  fld     float ptr 028h[EDX]
  fmul    float ptr 06Ch[ESP]
  fstp    float ptr 060h[ESP]
  fld     float ptr 02Ch[EDX]
  fmul    float ptr 06Ch[ESP]
  fstp    float ptr 064h[ESP]

I think the DMD back-end already contains logic to use xmm registers as true registers (not as a floating point stack or temporary holes where to push and pull FP values), so I suspect it doesn't take too much work to modify it to emit FP asm with a single optimization: just keep the values inside registers. In my uninformed opinion all other FP optimizations are almost insignificant compared to this one :-)

Bye,
bearophile
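Editor's sketch, in C++, of what both listings appear to compute: each load/multiply/store group scales one of twelve consecutive floats (offsets 0 to 44, with an index stride of 0x30 = 48 bytes) by a single weight. The type and function names below are hypothetical stand-ins, not taken from the benchmark source, and the three-rows-of-four layout is an assumption consistent with those offsets.

```cpp
// Hypothetical stand-ins for the benchmark's types: three rows of four floats,
// 48 bytes in total, matching the offsets 0..44 seen in the listings.
struct Row { float x, y, z, w; };
struct Transform { Row rx, ry, rz; };

// Scale one row by the weight: four independent multiplies.
constexpr Row scaled(const Row& r, float s) {
    return Row{r.x * s, r.y * s, r.z * s, r.w * s};
}

// Scale the whole transform: twelve multiplies by the same weight,
// which is the work the mulss / fmul sequences above perform.
constexpr Transform scaled(const Transform& t, float s) {
    return Transform{scaled(t.rx, s), scaled(t.ry, s), scaled(t.rz, s)};
}
```

The source code is the same either way; whether the twelve products stay in xmm registers or make a round trip through the x87 stack and memory is decided by the compiler's back-end.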
Aug 03 2011
prev sibling parent reply Trass3r <un known.com> writes:
C++:
Skinned vertices per second: 48660000

C++ no SIMD:
Skinned vertices per second: 42420000


D dmd:
Skinned vertices per second: 159046

D gdc:
Skinned vertices per second: 23450000



Compilers:

gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4)
g++ -s -O3 -mfpmath=sse -ffast-math -march=native

DMD64 D Compiler v2.054
dmd -O -noboundscheck -inline -release dver.d

gcc version 4.6.1 20110627 (gdc 0.30, using dmd 2.054) (GCC)
gdc -s -O3 -mfpmath=sse -ffast-math -march=native dver.d


Ubuntu 11.04 x64
Core2 Duo E6300
Aug 03 2011
next sibling parent reply Trass3r <un known.com> writes:
 C++:
 Skinned vertices per second: 48660000

 C++ no SIMD:
 Skinned vertices per second: 42420000


 D dmd:
 Skinned vertices per second: 159046

 D gdc:
 Skinned vertices per second: 23450000
D ldc:
Skinned vertices per second: 37910000

ldc2 -O3 -release -enable-inlining dver.d
Aug 03 2011
parent reply Trass3r <un known.com> writes:
On 04.08.2011 at 04:07, Trass3r <un known.com> wrote:

 C++:
 Skinned vertices per second: 48660000

 C++ no SIMD:
 Skinned vertices per second: 42420000


 D dmd:
 Skinned vertices per second: 159046

 D gdc:
 Skinned vertices per second: 23450000
 D ldc:
 Skinned vertices per second: 37910000

 ldc2 -O3 -release -enable-inlining dver.d
D gdc with added -frelease -fno-bounds-check:
Skinned vertices per second: 37710000
Aug 05 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Trass3r:

 C++ no SIMD:
 Skinned vertices per second: 42420000
 ...
 D gdc with added -frelease -fno-bounds-check:
 Skinned vertices per second: 37710000
I'd like to know why the GCC back-end is able to produce a more efficient binary from the C++ code (compared to the D code), but now the problem is no longer as large as before. It seems I've found a benchmark coming from real-world code that's a worst case for DMD (GDC here produces code about 237 times faster than DMD).

Bye,
bearophile
Aug 05 2011
parent Trass3r <un known.com> writes:
 I'd like to know why the GCC back-end is able to produce a more  
 efficient binary from the C++ code (compared to the D code), but now the  
 problem is not large, as before.
I attached both asm versions ;)
Aug 05 2011
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Trass3r:

 C++ no SIMD:
 Skinned vertices per second: 42420000
 
...
 D gdc:
 Skinned vertices per second: 23450000
Are you able and willing to show me the asm produced by gdc? There's a problem there. Bye, bearophile
Aug 03 2011
next sibling parent reply Trass3r <un known.com> writes:
 Are you able and willing to show me the asm produced by gdc? There's a
 problem there.
Aug 04 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
 Trass3r:
 are you able and willing to show me the asm produced by gdc? There's a
 problem there.
[attach bla.rar]
The bla.rar attachment contains the unstripped Linux binary, so to read the asm I used the objdump disassembler. But are you willing and able to show me the asm before it gets assembled? (With gcc you do it with the -S switch.) (I also suggest using only the C standard library, with time() and printf(), to produce a smaller asm output: http://codepad.org/12EUo16J ).

Using objdump I see it uses 16 xmm registers; this is the main routine. But what's the purpose of those callq? They seem to call the successive asm instruction. The x86 asm of this routine contains jumps only and no "call". The asm of this routine is also very long, I don't know why yet. I see too many instructions like "movss 0x80(%rsp), %xmm7"; this looks like a problem.

_calculateVerticesAndNormals:
  push %r15
  push %r14
  push %r13
  push %r12
  push %rbp
  push %rbx
  sub $0x268, %rsp
  mov 0x2a0(%rsp), %rax
  mov %rdi, 0xe8(%rsp)
  mov %rsi, 0xe0(%rsp)
  mov %rcx, 0x128(%rsp)
  mov %r8, 0x138(%rsp)
  mov %rax, 0xf0(%rsp)
  mov 0x2a8(%rsp), %rax
  mov %rdi, 0x180(%rsp)
  mov %rsi, 0x188(%rsp)
  mov %rcx, 0x170(%rsp)
  mov %rax, 0xf8(%rsp)
  mov 0x2b0(%rsp), %rax
  mov %r8, 0x178(%rsp)
  mov %rax, 0x130(%rsp)
  mov 0x2b8(%rsp), %rax
  mov %rax, 0x140(%rsp)
  mov %rcx, %rax
  add %rax, %rax
  cmp 0x130(%rsp), %rax
  je 74d <_calculateVerticesAndNormals+0xcd>
  mov $0x57, %edx
  mov $0x6, %edi
  mov $0x0, %esi
  movq $0x6, 0x190(%rsp)
  movq $0x0, 0x198(%rsp)
  callq 74d <_calculateVerticesAndNormals+0xcd>
  cmpq $0x0, 0x128(%rsp)
  je 1317 <_calculateVerticesAndNormals+0xc97>
  movq $0x1, 0x120(%rsp)
  xor %r15d, %r15d
  movq $0x0, 0x100(%rsp)
  movslq %r15d, %r12
  cmp %r12, 0xf0(%rsp)
  movq $0x0, 0x108(%rsp)
  jbe f1d <_calculateVerticesAndNormals+0x89d>
  nopl 0x0(%rax)
  lea (%r12, %r12, 2), %rax
  shl $0x2, %rax
  mov %rax, 0x148(%rsp)
  mov 0xf8(%rsp), %rax
  add 0x148(%rsp), %rax
  movss 0x4(%rax), %xmm9
  movzbl 0x8(%rax), %r13d
  movslq (%rax), %rax
  cmp 0xe8(%rsp), %rax
  jae f50 <_calculateVerticesAndNormals+0x8d0>
  lea (%rax, %rax, 2), %rax
  shl $0x4, %rax
  mov %rax, 0x110(%rsp)
  mov 0xe0(%rsp), %rbx
  add 0x110(%rsp), %rbx
  je 12af <_calculateVerticesAndNormals+0xc2f>
  movss (%rbx), %xmm7
  test %r13b, %r13b
  movss 0x4(%rbx), %xmm8
  movss 0x8(%rbx), %xmm6
  mulss %xmm9, %xmm7
  movss 0xc(%rbx), %xmm11
  mulss %xmm9, %xmm8
  movss 0x10(%rbx), %xmm4
  mulss %xmm9, %xmm6
  movss 0x14(%rbx), %xmm5
  mulss %xmm9, %xmm11
  movss 0x18(%rbx), %xmm3
  mulss %xmm9, %xmm4
  movss 0x1c(%rbx), %xmm10
  mulss %xmm9, %xmm5
  movss 0x20(%rbx), %xmm1
  mulss %xmm9, %xmm3
  movss 0x24(%rbx), %xmm2
  mulss %xmm9, %xmm10
  movss 0x28(%rbx), %xmm0
  mulss %xmm9, %xmm1
  mulss %xmm9, %xmm2
  mulss %xmm9, %xmm0
  mulss 0x2c(%rbx), %xmm9
  jne cdb <_calculateVerticesAndNormals+0x65b>
  add $0x1, %r12
  mov %r14, %rax
  lea (%r12, %r12, 2), %r13
  shl $0x2, %r13
  jmpq 99e <_calculateVerticesAndNormals+0x31e>
  nopl (%rax)
  mov %r13, %rax
  mov 0xf8(%rsp), %rdx
  add %rax, %rdx
  movss 0x4(%rdx), %xmm12
  movzbl 0x8(%rdx), %r14d
  movslq (%rdx), %rdx
  cmp %rdx, 0xe8(%rsp)
  jbe aa0 <_calculateVerticesAndNormals+0x420>
  mov 0xe0(%rsp), %rbx
  lea (%rdx, %rdx, 2), %rbp
  shl $0x4, %rbp
  add %rbp, %rbx
  je baf <_calculateVerticesAndNormals+0x52f>
  movss (%rbx), %xmm13
  add $0x1, %r12
  add $0xc, %r13
  test %r14b, %r14b
  mulss %xmm12, %xmm13
  addss %xmm13, %xmm7
  movss 0x4(%rbx), %xmm13
  mulss %xmm12, %xmm13
  addss %xmm13, %xmm8
  movss 0x8(%rbx), %xmm13
  mulss %xmm12, %xmm13
  addss %xmm13, %xmm6
  movss 0xc(%rbx), %xmm13
  mulss %xmm12, %xmm13
  addss %xmm13, %xmm11
  movss 0x10(%rbx), %xmm13
  mulss %xmm12, %xmm13
  addss %xmm13, %xmm4
  movss 0x14(%rbx), %xmm13
  mulss %xmm12, %xmm13
  addss %xmm13, %xmm5
  movss 0x18(%rbx), %xmm13
  mulss %xmm12, %xmm13
  addss %xmm13, %xmm3
  movss 0x1c(%rbx), %xmm13
  mulss %xmm12, %xmm13
  addss %xmm13, %xmm10
  movss 0x20(%rbx), %xmm13
  mulss %xmm12, %xmm13
  addss %xmm13, %xmm1
  movss 0x24(%rbx), %xmm13
  mulss %xmm12, %xmm13
  addss %xmm13, %xmm2
  movss 0x28(%rbx), %xmm13
  mulss %xmm12, %xmm13
  mulss 0x2c(%rbx), %xmm12
  addss %xmm13, %xmm0
  addss %xmm12, %xmm9
  jne cd8 <_calculateVerticesAndNormals+0x658>
  add $0x1, %r15d
  cmp %r12, 0xf0(%rsp)
  ja 890 <_calculateVerticesAndNormals+0x210>
  mov $0x63, %edx
  mov $0x6, %edi
  mov $0x0, %esi
  mov %rax, 0xc8(%rsp)
  movss %xmm0, (%rsp)
  movss %xmm1, 0x20(%rsp)
  movss %xmm2, 0x10(%rsp)
  movss %xmm3, 0x30(%rsp)
  movss %xmm4, 0x50(%rsp)
  movss %xmm5, 0x40(%rsp)
  movss %xmm6, 0x60(%rsp)
  movss %xmm7, 0x80(%rsp)
  movss %xmm8, 0x70(%rsp)
  movss %xmm9, 0x90(%rsp)
  movss %xmm10, 0xa0(%rsp)
  movss %xmm11, 0xb0(%rsp)
  movq $0x6, 0x1c0(%rsp)
  movq $0x0, 0x1c8(%rsp)
  callq a3b <_calculateVerticesAndNormals+0x3bb>
  mov 0xc8(%rsp), %rax
  movss (%rsp), %xmm0
  movss 0x20(%rsp), %xmm1
  movss 0x10(%rsp), %xmm2
  movss 0x30(%rsp), %xmm3
  movss 0x50(%rsp), %xmm4
  movss 0x40(%rsp), %xmm5
  movss 0x60(%rsp), %xmm6
  movss 0x80(%rsp), %xmm7
  movss 0x70(%rsp), %xmm8
  movss 0x90(%rsp), %xmm9
  movss 0xa0(%rsp), %xmm10
  movss 0xb0(%rsp), %xmm11
  jmpq 893 <_calculateVerticesAndNormals+0x213>
  nop
  mov $0x65, %edx
  mov $0x6, %edi
  mov $0x0, %esi
  mov %rax, 0xc8(%rsp)
  movss %xmm0, (%rsp)
  movss %xmm1, 0x20(%rsp)
  movss %xmm2, 0x10(%rsp)
  movss %xmm3, 0x30(%rsp)
  movss %xmm4, 0x50(%rsp)
  movss %xmm5, 0x40(%rsp)
  movss %xmm6, 0x60(%rsp)
  movss %xmm7, 0x80(%rsp)
  movss %xmm8, 0x70(%rsp)
  movss %xmm9, 0x90(%rsp)
  movss %xmm10, 0xa0(%rsp)
  movss %xmm11, 0xb0(%rsp)
  movss %xmm12, 0xd0(%rsp)
  movq $0x6, 0x1d0(%rsp)
  movq $0x0, 0x1d8(%rsp)
  callq b35 <_calculateVerticesAndNormals+0x4b5>
  mov 0xe0(%rsp), %rbx
  movss 0xd0(%rsp), %xmm12
  movss 0xb0(%rsp), %xmm11
  movss 0xa0(%rsp), %xmm10
  add %rbp, %rbx
  movss 0x70(%rsp), %xmm8
  movss 0x90(%rsp), %xmm9
  movss 0x80(%rsp), %xmm7
  movss 0x60(%rsp), %xmm6
  movss 0x40(%rsp), %xmm5
  movss 0x50(%rsp), %xmm4
  movss 0x30(%rsp), %xmm3
  movss 0x10(%rsp), %xmm2
  movss 0x20(%rsp), %xmm1
  movss (%rsp), %xmm0
  mov 0xc8(%rsp), %rax
  jne 8d3 <_calculateVerticesAndNormals+0x253>
  mov $0x23, %r8d
  mov $0x6, %edx
  mov $0x0, %ecx
  mov $0x9, %edi
  mov $0x0, %esi
  movss %xmm0, (%rsp)
  mov %rax, 0xc8(%rsp)
  movss %xmm1, 0x20(%rsp)
  movss %xmm2, 0x10(%rsp)
  movss %xmm3, 0x30(%rsp)
  movss %xmm4, 0x50(%rsp)
  movss %xmm5, 0x40(%rsp)
  movss %xmm6, 0x60(%rsp)
  movss %xmm7, 0x80(%rsp)
  movss %xmm8, 0x70(%rsp)
  movss %xmm9, 0x90(%rsp)
  movss %xmm10, 0xa0(%rsp)
  movss %xmm11, 0xb0(%rsp)
  movss %xmm12, 0xd0(%rsp)
  movq $0x6, 0x240(%rsp)
  movq $0x0, 0x248(%rsp)
  movq $0x9, 0x250(%rsp)
  movq $0x0, 0x258(%rsp)
  callq c67 <_calculateVerticesAndNormals+0x5e7>
  movss 0x70(%rsp), %xmm8
  movss 0xd0(%rsp), %xmm12
  movss 0xb0(%rsp), %xmm11
  movss 0xa0(%rsp), %xmm10
  movss 0x90(%rsp), %xmm9
  movss 0x80(%rsp), %xmm7
  movss 0x60(%rsp), %xmm6
  movss 0x40(%rsp), %xmm5
  movss 0x50(%rsp), %xmm4
  movss 0x30(%rsp), %xmm3
  movss 0x10(%rsp), %xmm2
  movss 0x20(%rsp), %xmm1
  movss (%rsp), %xmm0
  mov 0xc8(%rsp), %rax
  jmpq 8d3 <_calculateVerticesAndNormals+0x253>
  nopl (%rax)
  mov %rax, %r14
  mov 0x108(%rsp), %rax
  cmp %rax, 0x128(%rsp)
  jbe 11d0 <_calculateVerticesAndNormals+0xb50>
  shl $0x5, %rax
  mov %rax, 0x150(%rsp)
  mov 0x100(%rsp), %rax
  mov 0x138(%rsp), %rbx
  add %rax, %rax
  add 0x150(%rsp), %rbx
  cmp %rax, 0x130(%rsp)
  jbe 10e8 <_calculateVerticesAndNormals+0xa68>
  mov 0x100(%rsp), %rax
  shl $0x5, %rax
  mov %rax, 0x158(%rsp)
  movss 0x8(%rbx), %xmm12
  movaps %xmm8, %xmm15
  movss (%rbx), %xmm14
  movss %xmm12, 0x11c(%rsp)
  movss 0x4(%rbx), %xmm13
  movaps %xmm7, %xmm12
  mulss %xmm14, %xmm12
  mov 0x140(%rsp), %rax
  mulss %xmm13, %xmm15
  add 0x158(%rsp), %rax
  addss %xmm15, %xmm12
  addss %xmm11, %xmm12
  movl $0x0, 0xc(%rax)
  movss 0x11c(%rsp), %xmm11
  mulss %xmm6, %xmm11
  addss %xmm11, %xmm12
  movaps %xmm4, %xmm11
  mulss %xmm14, %xmm11
  mulss %xmm1, %xmm14
  movss %xmm12, (%rax)
  movaps %xmm5, %xmm12
  mulss %xmm13, %xmm12
  mulss %xmm2, %xmm13
  addss %xmm12, %xmm11
  addss %xmm13, %xmm14
  addss %xmm10, %xmm11
  movss 0x11c(%rsp), %xmm10
  addss %xmm9, %xmm14
  movss 0x11c(%rsp), %xmm9
  mulss %xmm3, %xmm10
  mulss %xmm0, %xmm9
  addss %xmm10, %xmm11
  addss %xmm9, %xmm14
  movss %xmm11, 0x4(%rax)
  movss %xmm14, 0x8(%rax)
  mov 0x108(%rsp), %rax
  cmp %rax, 0x128(%rsp)
  jbe 1040 <_calculateVerticesAndNormals+0x9c0>
  shl $0x5, %rax
  mov %rax, 0x160(%rsp)
  mov 0x138(%rsp), %rbx
  mov 0x120(%rsp), %rax
  add 0x160(%rsp), %rbx
  cmp %rax, 0x130(%rsp)
  jbe f98 <_calculateVerticesAndNormals+0x918>
  shl $0x4, %rax
  mov %rax, 0x168(%rsp)
  movss 0x10(%rbx), %xmm10
  add $0x1, %r15d
  movss 0x14(%rbx), %xmm11
  mulss %xmm10, %xmm7
  mov 0x140(%rsp), %rax
  mulss %xmm11, %xmm8
  movss 0x18(%rbx), %xmm9
  mulss %xmm11, %xmm5
  mulss %xmm10, %xmm4
  mulss %xmm11, %xmm2
  add 0x168(%rsp), %rax
  addq $0x1, 0x100(%rsp)
  addss %xmm7, %xmm8
  addq $0x2, 0x120(%rsp)
  addss %xmm4, %xmm5
  mulss %xmm10, %xmm1
  mulss %xmm9, %xmm6
  movl $0x0, 0xc(%rax)
  mulss %xmm9, %xmm3
  mulss %xmm9, %xmm0
  addss %xmm1, %xmm2
  addss %xmm6, %xmm8
  addss %xmm3, %xmm5
  addss %xmm0, %xmm2
  movss %xmm8, (%rax)
  movss %xmm5, 0x4(%rax)
  movss %xmm2, 0x8(%rax)
  mov 0x100(%rsp), %rax
  cmp %rax, 0x128(%rsp)
  je 1317 <_calculateVerticesAndNormals+0xc97>
  movslq %r15d, %r12
  mov %rax, 0x108(%rsp)
  cmp %r12, 0xf0(%rsp)
  ja 798 <_calculateVerticesAndNormals+0x118>
  mov $0x5d, %edx
  mov $0x6, %edi
  mov $0x0, %esi
  movq $0x6, 0x1a0(%rsp)
  movq $0x0, 0x1a8(%rsp)
  callq f49 <_calculateVerticesAndNormals+0x8c9>
  jmpq 7a8 <_calculateVerticesAndNormals+0x128>
  xchg %ax, %ax
  mov $0x5f, %edx
  mov $0x6, %edi
  mov $0x0, %esi
  movss %xmm9, 0x90(%rsp)
  movq $0x6, 0x1b0(%rsp)
  movq $0x0, 0x1b8(%rsp)
  callq f86 <_calculateVerticesAndNormals+0x906>
  movss 0x90(%rsp), %xmm9
  jmpq 7e4 <_calculateVerticesAndNormals+0x164>
  nopl (%rax)
  mov $0x69, %edx
  mov $0x6, %edi
  mov $0x0, %esi
  movss %xmm0, (%rsp)
  movss %xmm1, 0x20(%rsp)
  movss %xmm2, 0x10(%rsp)
  movss %xmm3, 0x30(%rsp)
  movss %xmm4, 0x50(%rsp)
  movss %xmm5, 0x40(%rsp)
  movss %xmm6, 0x60(%rsp)
  movss %xmm7, 0x80(%rsp)
  movss %xmm8, 0x70(%rsp)
  movq $0x6, 0x210(%rsp)
  movq $0x0, 0x218(%rsp)
  callq ffd <_calculateVerticesAndNormals+0x97d>
  movss 0x70(%rsp), %xmm8
  movss 0x80(%rsp), %xmm7
  movss 0x60(%rsp), %xmm6
  movss 0x40(%rsp), %xmm5
  movss 0x50(%rsp), %xmm4
  movss 0x30(%rsp), %xmm3
  movss 0x10(%rsp), %xmm2
  movss 0x20(%rsp), %xmm1
  movss (%rsp), %xmm0
  jmpq e59 <_calculateVerticesAndNormals+0x7d9>
  nopl 0x0(%rax, %rax, 1)
  mov $0x69, %edx
  mov $0x6, %edi
  mov $0x0, %esi
  movss %xmm0, (%rsp)
  movss %xmm1, 0x20(%rsp)
  movss %xmm2, 0x10(%rsp)
  movss %xmm3, 0x30(%rsp)
  movss %xmm4, 0x50(%rsp)
  movss %xmm5, 0x40(%rsp)
  movss %xmm6, 0x60(%rsp)
  movss %xmm7, 0x80(%rsp)
  movss %xmm8, 0x70(%rsp)
  movq $0x6, 0x200(%rsp)
  movq $0x0, 0x208(%rsp)
  callq 10a5 <_calculateVerticesAndNormals+0xa25>
  movss 0x70(%rsp), %xmm8
  movss 0x80(%rsp), %xmm7
  movss 0x60(%rsp), %xmm6
  movss 0x40(%rsp), %xmm5
  movss 0x50(%rsp), %xmm4
  movss 0x30(%rsp), %xmm3
  movss 0x10(%rsp), %xmm2
  movss 0x20(%rsp), %xmm1
  movss (%rsp), %xmm0
  jmpq e27 <_calculateVerticesAndNormals+0x7a7>
  nopl 0x0(%rax, %rax, 1)
  mov $0x68, %edx
  mov $0x6, %edi
  mov $0x0, %esi
  movss %xmm0, (%rsp)
  movss %xmm1, 0x20(%rsp)
  movss %xmm2, 0x10(%rsp)
  movss %xmm3, 0x30(%rsp)
  movss %xmm4, 0x50(%rsp)
  movss %xmm5, 0x40(%rsp)
  movss %xmm6, 0x60(%rsp)
  movss %xmm7, 0x80(%rsp)
  movss %xmm8, 0x70(%rsp)
  movss %xmm9, 0x90(%rsp)
  movss %xmm10, 0xa0(%rsp)
  movss %xmm11, 0xb0(%rsp)
  movq $0x6, 0x1f0(%rsp)
  movq $0x0, 0x1f8(%rsp)
  callq 116b <_calculateVerticesAndNormals+0xaeb>
  movss 0x70(%rsp), %xmm8
  movss 0xb0(%rsp), %xmm11
  movss 0xa0(%rsp), %xmm10
  movss 0x90(%rsp), %xmm9
  movss 0x80(%rsp), %xmm7
  movss 0x60(%rsp), %xmm6
  movss 0x40(%rsp), %xmm5
  movss 0x50(%rsp), %xmm4
  movss 0x30(%rsp), %xmm3
  movss 0x10(%rsp), %xmm2
  movss 0x20(%rsp), %xmm1
  movss (%rsp), %xmm0
  jmpq d3a <_calculateVerticesAndNormals+0x6ba>
  nopw 0x0(%rax, %rax, 1)
  mov $0x68, %edx
  mov $0x6, %edi
  mov $0x0, %esi
  movss %xmm0, (%rsp)
  movss %xmm1, 0x20(%rsp)
  movss %xmm2, 0x10(%rsp)
  movss %xmm3, 0x30(%rsp)
  movss %xmm4, 0x50(%rsp)
  movss %xmm5, 0x40(%rsp)
  movss %xmm6, 0x60(%rsp)
  movss %xmm7, 0x80(%rsp)
  movss %xmm8, 0x70(%rsp)
  movss %xmm9, 0x90(%rsp)
  movss %xmm10, 0xa0(%rsp)
  movss %xmm11, 0xb0(%rsp)
  movq $0x6, 0x1e0(%rsp)
  movq $0x0, 0x1e8(%rsp)
  callq 1253 <_calculateVerticesAndNormals+0xbd3>
  movss 0x70(%rsp), %xmm8
  movss 0xb0(%rsp), %xmm11
  movss 0xa0(%rsp), %xmm10
  movss 0x90(%rsp), %xmm9
  movss 0x80(%rsp), %xmm7
  movss 0x60(%rsp), %xmm6
  movss 0x40(%rsp), %xmm5
  movss 0x50(%rsp), %xmm4
  movss 0x30(%rsp), %xmm3
  movss 0x10(%rsp), %xmm2
  movss 0x20(%rsp), %xmm1
  movss (%rsp), %xmm0
  jmpq cfd <_calculateVerticesAndNormals+0x67d>
  mov $0x12, %r8d
  mov $0x6, %edx
  mov $0x0, %ecx
  mov $0x9, %edi
  mov $0x0, %esi
  movss %xmm9, 0x90(%rsp)
  movq $0x6, 0x220(%rsp)
  movq $0x0, 0x228(%rsp)
  movq $0x9, 0x230(%rsp)
  movq $0x0, 0x238(%rsp)
  callq 1308 <_calculateVerticesAndNormals+0xc88>
  movss 0x90(%rsp), %xmm9
  jmpq 7fa <_calculateVerticesAndNormals+0x17a>
  add $0x268, %rsp
  pop %rbx
  pop %rbp
  pop %r12
  pop %r13
  pop %r14
  pop %r15
  retq
  nopl 0x0(%rax)

Bye,
bearophile
Aug 04 2011
next sibling parent reply Adam Ruppe <destructionator gmail.com> writes:
 But what's the purpose of those callq? They seem to call the
 successive asm instruct
I find AT&T syntax to be almost impossible to read, but it looks like they are comparing the instruction pointer for some reason. call works by pushing the instruction pointer on the stack, then jumping to the new address. By calling the next thing, you can then pop the instruction pointer off the stack and continue on where you left off. I don't know why they want this though. That AT&T syntax really messes with my brain...
Aug 04 2011
parent Don <nospam nospam.com> writes:
Adam Ruppe wrote:
 But what's the purpose of those callq? They seem to call the
 successive asm instruct
I find AT&T syntax to be almost impossible to read, but it looks like they are comparing the instruction pointer for some reason. call works by pushing the instruction pointer on the stack, then jumping to the new address. By calling the next thing, you can then pop the instruction pointer off the stack and continue on where you left off.
They do that to implement Position Independent Code: you need to know the instruction pointer, to be able to access your data. Actually it has a terrible effect on performance, because it destroys the processor's return prediction mechanism (it guarantees multiple mispredictions). But it seems to be unavoidable -- I don't think it's possible to generate decent code for PIC on x86-32. But there should never be more than one call in a function.
 I don't know why they want this though. That AT&T syntax really
 messes with my brain...
Aug 05 2011
prev sibling parent reply Trass3r <un known.com> writes:
 are you willing and able to show me the asm before it gets assembled?  
 (with gcc you do it with the -S switch). (I also suggest to use only the  
 C standard library, with time() and printf() to produce a smaller asm  
 output: http://codepad.org/12EUo16J ).
Aug 05 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Trass3r:

 are you willing and able to show me the asm before it gets assembled?  
 (with gcc you do it with the -S switch). (I also suggest to use only the  
 C standard library, with time() and printf() to produce a smaller asm  
 output: http://codepad.org/12EUo16J ).
You are a person of few words :-) Thank you for the asm. Apparently the program was not compiled in release mode (or with nobounds; with DMD it's the same thing, maybe with gdc it's not the same thing). It contains the calls, but they aren't to the next line; they are for the array bounds checks and asserts:

  call _d_assert
  call _d_array_bounds
  call _d_array_bounds
  call _d_assert_msg
  call _d_array_bounds
  call _d_array_bounds
  call _d_array_bounds
  call _d_array_bounds
  call _d_array_bounds
  call _d_array_bounds
  call _d_assert_msg

But I think this doesn't fully explain the low performance; I have seen too many instructions like:

  movss DWORD PTR [rsp+32], xmm1
  movss DWORD PTR [rsp+16], xmm2
  movss DWORD PTR [rsp+48], xmm3

If you want to go on with this exploration, then I suggest you find a way to disable the bounds tests.

Bye,
bearophile
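Editor's sketch of why the bounds tests matter, written in C++ as an analogy rather than as the D runtime's actual mechanism: a checked access adds a compare and a branch to a failure path on every index operation, much like the calls to _d_array_bounds listed above, while an unchecked access is a single load. The helper names and the sample array are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>

// Sample data for the illustration only.
constexpr float kSample[4] = {1.0f, 2.0f, 3.0f, 4.0f};

// Unchecked access: just the load, comparable to a build with bounds checks disabled.
constexpr float unchecked_get(const float* data, std::size_t i) {
    return data[i];
}

// Checked access: compare the index with the length and take a failure path
// when it is out of range, as a bounds-checked build does on every access.
constexpr float checked_get(const float* data, std::size_t length, std::size_t i) {
    return i < length ? data[i]
                      : throw std::out_of_range("array index out of bounds");
}
```

Both functions return the same value for a valid index; the difference is the extra compare-and-branch per access, which adds up in a loop that touches a dozen floats per influence.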
Aug 05 2011
parent Trass3r <un known.com> writes:
 If you want to go on with this exploration, then I suggest you to find a  
 way to disable bound tests.
Ok, now I get up to 32930000 skinned vertices per second. Still a bit worse than LDC.
Aug 05 2011
prev sibling parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 Trass3r:
 C++ no SIMD:
 Skinned vertices per second: 42420000
...
 D gdc:
 Skinned vertices per second: 23450000
Are you able and willing to show me the asm produced by gdc? There's a problem
there.
 Bye,
 bearophile
Notes from me:
- Options -fno-bounds-check and -frelease can be just as important in GDC as they are in DMD under certain instances.
- You can output asm in intel dialect using -masm=intel if at&t is that difficult for you to read. 8-)

I will look into this later from my workstation.
Aug 06 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Iain Buclaw:

 I will look into this later from my workstation.
The remaining thing to look at is just the small performance difference between the D-GDC version and the C++-G++ version. Bye, bearophile
Aug 06 2011
parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 Iain Buclaw:
 I will look into this later from my workstation.
The remaining thing to look at is just the small performance difference between
the D-GDC version and the C++-G++ version.
 Bye,
 bearophile
Three things that helped improve performance in a minor way for me:
1) using pointers over dynamic arrays. (5% speedup)
2) removing the calls to CalVector4's constructor (5.7% speedup)
3) using core.stdc.time over std.datetime. (1.6% speedup)

Point one is a pretty well known issue in D as far as I'm aware.
Point two is not an issue with inlining (all methods are marked 'inline'), but it did help remove quite a few movss instructions being emitted.
Point three is interesting: it seems that "sw.peek().msecs" slows down the number of iterations in the while loop.

With those changes, the D implementation is still 21% slower than the C++ implementation without SIMD.

http://ideone.com/4PP2D
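As an illustration of point three, a timing loop built only on core.stdc.time could look roughly like the sketch below. The function `processBatch`, the buffer size, and the one-second budget are assumptions made up for this example; the real benchmark loop is in the linked code. Note that `clock()` reports processor time rather than wall-clock time.

```d
import core.stdc.stdio : printf;
import core.stdc.time : clock, clock_t, CLOCKS_PER_SEC;

// Hypothetical stand-in for one batch of skinning work; returns the
// number of items it processed.
int processBatch(float[] buffer)
{
    foreach (ref value; buffer)
        value = value * 0.5f + 1.0f;
    return cast(int) buffer.length;
}

void main()
{
    auto buffer = new float[10_000];
    buffer[] = 0.0f; // float arrays start as NaN in D, so set a real value

    long processed = 0;
    const clock_t start = clock();

    // Poll the C clock once per batch, for about one second of CPU time.
    while (clock() - start < CLOCKS_PER_SEC)
        processed += processBatch(buffer);

    printf("Items per second: %lld\n", processed);
}
```

Polling the clock once per batch rather than once per item keeps the cost of the timing call small compared with the work being measured.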
Aug 06 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Iain Buclaw:

Are you using GDC2-64 bit on Linux?

 Three things that helped improve performance in a minor way for me:
 1) using pointers over dynamic arrays. (5% speedup)
 2) removing the calls to CalVector4's constructor (5.7% speedup)
 3) using core.stdc.time over std.datetime. (1.6% speedup)
 
 Point one is pretty well known issue in D as far as I'm aware.
Really? I don't remember discussions about it. What is its cause?
 Point two is not an issue with inlining (all methods are marked 'inline'), but it
 did help remove quite a few movss instructions being emitted.
This too is something worth fixing. Is this issue in Bugzilla already?
 Point three is interesting, it seems that "sw.peek().msecs" slows down the number
 of iterations in the while loop.
This needs to be fixed.
 With those changes, D implementation is still 21% slower than C++ implementation
 without SIMD.
 http://ideone.com/4PP2D
This is a lot still. Thank you for your work. I think all three issues are worth fixing, eventually.

Bye,
bearophile
Aug 06 2011
parent Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 Iain Buclaw:
 Are you using GDC2-64 bit on Linux?
GDC2-32 bit on Linux.
 Three things that helped improve performance in a minor way for me:
 1) using pointers over dynamic arrays. (5% speedup)
 2) removing the calls to CalVector4's constructor (5.7% speedup)
 3) using core.stdc.time over std.datetime. (1.6% speedup)

 Point one is pretty well known issue in D as far as I'm aware.
Really? I don't remember discussions about it. What is its cause?
I can't remember the exact discussion, but it was something about a benchmark of passing by value vs passing by ref vs passing by pointer.
 Point two is not an issue with inlining (all methods are marked 'inline'), but it
 did help remove quite a few movss instructions being emitted.
This too is something worth fixing. Is this issue in Bugzilla already?
I don't think it's an issue really. But of course, there is a difference between what you say and what you mean with regards to the code here (that being, with the first version, lots of temp vars get created and moved around the place).

Regards
Iain
Aug 06 2011
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Iain Buclaw:

 1) using pointers over dynamic arrays. (5% speedup)
 2) removing the calls to CalVector4's constructor (5.7% speedup)
With DMD I have seen 180k -> 190k vertices/sec replacing this:

    struct CalVector4 {
        float X, Y, Z, W;

        this(float x, float y, float z, float w = 0.0f) {
            X = x;
            Y = y;
            Z = z;
            W = w;
        }
    }

With:

    struct CalVector4 {
        float X, Y, Z, W=0.0f;
    }

I'd like the D compiler to optimize better there.
 http://ideone.com/4PP2D
This line of code is not good:

    auto vertices = cast(Vertex *) new Vertex[N];

This is much better, it's less bug-prone, simpler and shorter:

    auto vertices = (new Vertex[N]).ptr;

But in practice in this program it is enough to allocate dynamic arrays normally, and then perform the call like this (with DMD it gives the same performance):

    calculateVerticesAndNormals(boneTransforms.ptr, N, vertices.ptr, influences.ptr, output.ptr);

I don't know why passing pointers gives some more performance here, compared to passing dynamic arrays (but I have seen the same behaviour in other D programs of mine).

Bye,
bearophile
Aug 06 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/6/2011 3:19 PM, bearophile wrote:
 I don't know why passing pointers gives some more performance here, compared
 to passing dynamic arrays (but I have seen the same behaviour in other D
 programs of mine).
A dynamic array is two values being passed, a pointer is one.
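A small sketch of that difference, with two hypothetical functions (`firstViaSlice` and `firstViaPointer`) that exist only to contrast the two parameter shapes. The static asserts check at compile time that a slice occupies two machine words (length and pointer) while a raw pointer occupies one.

```d
import core.stdc.stdio : printf;

// Hypothetical callees, only here to contrast the two parameter shapes.
int firstViaSlice(int[] a)   { return a[0]; } // receives a length and a pointer
int firstViaPointer(int* p)  { return p[0]; } // receives the pointer only

// A slice is two machine words, a raw pointer is one.
static assert((int[]).sizeof == 2 * size_t.sizeof);
static assert((int*).sizeof == size_t.sizeof);

void main()
{
    auto data = new int[4];
    data[0] = 7;
    printf("%d %d\n", firstViaSlice(data), firstViaPointer(data.ptr));
}
```

With several array parameters, as in calculateVerticesAndNormals, the slice version passes roughly twice as many argument words, which is one plausible source of the small gap reported above.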
Aug 06 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Walter:

 A dynamic array is two values being passed, a pointer is one.
I know, but I think there are many optimization opportunities. An example:

    private void foo(int[] a2) {}

    void main() {
        int[100] a1;
        foo(a1);
    }

In code like that I think a D compiler is free to compile like this, because foo is private, so it's free to perform optimizations based on just the code inside the module:

    private void foo(ref int[100] a2) {}

    void main() {
        int[100] a1;
        foo(a1);
    }

I think there are several cases where a D compiler is free to replace the two values with just a pointer.

Another example, to optimize code like this:

    private void foo(int[] a1, int[] a2) {}

    void main() {
        int n = 100; // run-time value
        auto a3 = new int[n];
        auto a4 = new int[n];
        foo(a3, a4);
    }

Into something like this:

    private void foo(int* a1, int* a2, size_t a1a2len) {}

    void main() {
        int n = 100;
        auto a3 = new int[n];
        auto a4 = new int[n];
        foo(a3.ptr, a4.ptr, n);
    }

Bye,
bearophile
Aug 06 2011
prev sibling parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 Iain Buclaw:
 1) using pointers over dynamic arrays. (5% speedup)
 2) removing the calls to CalVector4's constructor (5.7% speedup)
 With DMD I have seen 180k -> 190k vertices/sec replacing this:

     struct CalVector4 {
         float X, Y, Z, W;

         this(float x, float y, float z, float w = 0.0f) {
             X = x;
             Y = y;
             Z = z;
             W = w;
         }
     }

 With:

     struct CalVector4 {
         float X, Y, Z, W=0.0f;
     }

 I'd like the D compiler to optimize better there.
 http://ideone.com/4PP2D
 This line of code is not good:

     auto vertices = cast(Vertex *) new Vertex[N];

 This is much better, it's less bug-prone, simpler and shorter:

     auto vertices = (new Vertex[N]).ptr;

 But in practice in this program it is enough to allocate dynamic arrays
 normally, and then perform the call like this (with DMD it gives the same performance):

     calculateVerticesAndNormals(boneTransforms.ptr, N, vertices.ptr, influences.ptr, output.ptr);

I was playing about with heap vs stack. Must've forgot to remove that, sorry. :)

Anyways, I've tweaked the GDC codegen, and program speed meets that of C++ now (on my system).

Implementation: http://ideone.com/0j0L1

Command-line:
gdc -O3 -mfpmath=sse -ffast-math -march=native -frelease
g++ bench.cc -O3 -mfpmath=sse -ffast-math -march=native

Best times:
G++-32bit:  11400000 vps
GDC-32bit:  11350000 vps

Regards
Iain
Aug 06 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Iain Buclaw:

 Anyways, I've tweaked the GDC codegen, and program speed meets that of C++ now
(on
 my system).
Are you willing to explain your changes (and maybe give a link to the changes)? Maybe Walter is interested for DMD too.
 Command-line:
 gdc -O3 -mfpmath=sse -ffast-math -march=native -frelease
 g++ bench.cc -O3 -mfpmath=sse -ffast-math -march=native
In newer versions of GCC -Ofast means -ffast-math too.

Walter is not a lover of that -ffast-math switch. But I now think that the combination of D strongly pure functions with unsafe FP optimizations offers optimization opportunities that maybe not even GCC is able to use now when it compiles C/C++ code (do you see why?). Not using this opportunity is a waste, in my opinion.

Bye,
bearophile
Aug 06 2011
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/6/2011 4:46 PM, bearophile wrote:
 Walter is not a lover of that -ffast-math switch.
No, I am not. Few understand the subtleties of IEEE arithmetic, and breaking IEEE conformance is something very, very few should even consider.
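One concrete example of such a subtlety is that floating point addition is not associative, so an optimizer that is allowed to regroup sums (as unsafe math options permit) can change a result. The sketch below uses two hypothetical helpers, `sumLeft` and `sumRight`, and values chosen so that the small term is lost entirely in one grouping.

```d
import core.stdc.stdio : printf;

// Hypothetical helpers: the same three terms, grouped two different ways.
double sumLeft(double a, double b, double c)  { return (a + b) + c; }
double sumRight(double a, double b, double c) { return a + (b + c); }

void main()
{
    double a = 1e30, b = -1e30, c = 1.0;

    // Under strict IEEE evaluation the two groupings disagree:
    // (a + b) is exactly 0, so the left sum keeps c, while (b + c)
    // rounds back to b, so the right sum cancels to 0.
    printf("%g %g\n", sumLeft(a, b, c), sumRight(a, b, c));
}
```

A compiler that treats the two expressions as interchangeable is free to pick either answer, which is the kind of conformance break being discussed.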
Aug 06 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 On 8/6/2011 4:46 PM, bearophile wrote:
 Walter is not a lover of that -ffast-math switch.
No, I am not. Few understand the subtleties of IEEE arithmetic, and breaking IEEE conformance is something very, very few should even consider.
I have read several papers about FP arithmetic, but I am not an expert yet on them. Both GDC and LDC have compilation switches to perform those unsafe FP optimizations, so even if you don't like them, most D compilers today have them optional, and I don't think those switches will be removed.

If you want to simulate a flock of boids (http://en.wikipedia.org/wiki/Boids ) on the screen using D, and you use floating point values to represent their speed vector, introducing unsafe FP optimizations will not harm so much. Video games are a significant purpose for D language, and in them FP errors are often benign (maybe some parts of the game are able to tolerate them and some other part of the game needs to be compiled with strict FP semantics).

Bye,
bearophile
Aug 06 2011
parent reply "Eric Poggel (JoeCoder)" <dnewsgroup2 yage3d.net> writes:
On 8/6/2011 8:34 PM, bearophile wrote:
 Walter:

 On 8/6/2011 4:46 PM, bearophile wrote:
 Walter is not a lover of that -ffast-math switch.
No, I am not. Few understand the subtleties of IEEE arithmetic, and breaking IEEE conformance is something very, very few should even consider.
I have read several papers about FP arithmetic, but I am not an expert yet on them. Both GDC and LDC have compilation switches to perform those unsafe FP optimizations, so even if you don't like them, most D compilers today have them optional, and I don't think those switches will be removed. If you want to simulate a flock of boids (http://en.wikipedia.org/wiki/Boids ) on the screen using D, and you use floating point values to represent their speed vector, introducing unsafe FP optimizations will not harm so much. Video games are a significant purpose for D language, and in them FP errors are often benign (maybe some parts of the game are able to tolerate them and some other part of the game needs to be compiled with strict FP semantics). Bye, bearophile
Floating point determinism can be very important when it comes to reducing network traffic. If you can achieve it, then you can make sure all players have the same game state and then only send user input commands over the network. Glenn Fiedler has an interesting writeup on it, but I haven't had a chance to read all of it yet: http://gafferongames.com/networking-for-game-programmers/floating-point-determinism/
Aug 07 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Eric Poggel (JoeCoder):

 determinism can be very important when it comes to 
 reducing network traffic.  If you can achieve it, then you can make sure 
 all players have the same game state and then only send user input 
 commands over the network.
It seems a hard thing to obtain, but I agree that it gets useful. For me having some FP determinism is useful for debugging: to avoid results from changing randomly if I perform a tiny change in the source code that triggers a change in what optimizations the compiler does. But there are several situations (if I am writing a ray tracer?) where FP determinism is not required in my release build.

I was not arguing about removing FP rules from the D compiler, just that there are situations where relaxing those FP rules, on request, doesn't seem to harm. I am not an expert about the risks Walter was talking about, so maybe I'm just walking on thin ice (but no one will get hurt if my little raytracer produces some errors in its images).

You don't come often in this newsgroup, thank you for the link :-)

Bye,
bearophile
Aug 08 2011
parent "Eric Poggel (JoeCoder)" <dnewsgroup2 yage3d.net> writes:
On 8/8/2011 3:02 PM, bearophile wrote:
 Eric Poggel (JoeCoder):

 determinism can be very important when it comes to
 reducing network traffic.  If you can achieve it, then you can make sure
 all players have the same game state and then only send user input
 commands over the network.
It seems a hard thing to obtain, but I agree that it gets useful. For me having some FP determinism is useful for debugging: to avoid results from changing randomly if I perform a tiny change in the source code that triggers a change in what optimizations the compiler does. But there are several situations (if I am writing a ray tracer?) where FP determinism is not required in my release build. I was not arguing about removing FP rules from the D compiler, just that there are situations where relaxing those FP rules, on request, doesn't seem to harm. I am not an expert about the risks Walter was talking about, so maybe I'm just walking on thin ice (but no one will get hurt if my little raytracer produces some errors in its images). You don't come often in this newsgroup, thank you for the link :-) Bye, bearophile
You'd be surprised how much I lurk here. I agree there are some interesting areas where fast floating point may indeed be worth it, but I also don't know enough. I've also wondered about creating a Fixed!(long, 8) struct that would let me work with longs and 8 bits of precision after the decimal point as a way of having equal precision anywhere in a large universe and achieving determinism at the same time. But I don't know how performance would compare vs floats or doubles.
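A rough sketch of what such a struct might look like, purely as an assumption about the idea and not an existing library type. The member names (`raw`, `one`, `fromInt`, `fromDouble`, `toDouble`) are made up for the example, only +, - and * are covered, and overflow is ignored.

```d
import core.stdc.stdio : printf;

// Hypothetical sketch of the Fixed!(long, 8) idea: T is the backing
// integer type, fracBits the number of bits after the binary point.
struct Fixed(T, int fracBits)
{
    T raw; // stored value, scaled by 2^fracBits

    enum T one = cast(T) 1 << fracBits;

    static Fixed fromInt(T value)
    {
        return Fixed(cast(T)(value << fracBits));
    }

    static Fixed fromDouble(double value)
    {
        return Fixed(cast(T)(value * one)); // truncates toward zero
    }

    double toDouble() const
    {
        return cast(double) raw / one;
    }

    Fixed opBinary(string op)(Fixed rhs) const
        if (op == "+" || op == "-" || op == "*")
    {
        static if (op == "*")
            // the product carries 2*fracBits fractional bits; shift back down
            return Fixed(cast(T)((raw * rhs.raw) >> fracBits));
        else
            return Fixed(cast(T) mixin("raw " ~ op ~ " rhs.raw"));
    }
}

void main()
{
    alias Fix = Fixed!(long, 8);
    auto a = Fix.fromDouble(1.5);
    auto b = Fix.fromInt(2);
    printf("%f %f\n", (a + b).toDouble(), (a * b).toDouble());
}
```

Because every operation is plain integer arithmetic, results should be bit-identical across machines, which is the determinism property being asked about; how it compares in speed with floats or doubles would still need measuring.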
Aug 08 2011
prev sibling parent Trass3r <un known.com> writes:
 Anyways, I've tweaked the GDC codegen, and program speed meets that of  
 C++ now (on my system).

 Implementation: http://ideone.com/0j0L1

 Command-line:
 gdc -O3 -mfpmath=sse -ffast-math -march=native -frelease
 g++ bench.cc -O3 -mfpmath=sse -ffast-math -march=native

 Best times:
 G++-32bit:  11400000 vps
 GDC-32bit:  11350000 vps


 Regards
 Iain
64Bit:

C++:
45010000 44270000 42740000 43900000 44680000 43490000 42390000

GDC:
42900000 44010000 44000000 44010000 44010000 44000000

GDC with -fno-bounds-check:
43280000 44440000 44420000 44340000 44440000 44450000
Aug 07 2011