
digitalmars.D - Speed kills

reply ixid <nuaccount gmail.com> writes:
Every time there is a D thread on reddit it feels like the new 
user is expecting mind-blowing speed from D.

https://www.reddit.com/r/programming/comments/45v03g/porterstemmerd_an_implementation_of_the_porter/

This is the most recent one, where John Colvin provided some 
pointers to speed it up significantly. Walter has done some good 
work taking the low-hanging fruit to speed up DMD code, and there 
is a lot of effort going on with reference-counting machinery, 
but I wondered: can some of the common errors people make that 
slow down D code be addressed?

Literals used to be a hidden speed bump, but I think that was 
improved. Now the append operator (~=) is one of the most common 
culprits; can it not be enhanced behind the scenes to work more 
like std.array.Appender? Do others notice common pitfalls between 
the article code and what the D community then suggests, where we 
can bridge the gap so naive users get faster code?
Feb 15 2016
next sibling parent reply Guillaume Piolat <contact gam3sfrommars.fr> writes:
On Monday, 15 February 2016 at 13:51:38 UTC, ixid wrote:
 This is the most recent one where John Colvin provided some 
 pointers to speed it up significantly. Walter has done some 
 good work taking the low-hanging fruit to speed up DMD code and 
 there is a lot of effort going on with reference counting 
 machinery but I wondered if some of the common errors people 
 make that slow down D code can be addressed?
Something that annoyed me a bit is floating-point comparisons: DMD does not seem to be able to handle them from SSE registers; it will convert to the FPU and do the comparison there, IIRC.
Feb 15 2016
next sibling parent Wyatt <wyatt.epp gmail.com> writes:
On Monday, 15 February 2016 at 14:16:02 UTC, Guillaume Piolat 
wrote:
 Something that annoyed me a bit is floating-point comparisons: 
 DMD does not seem to be able to handle them from SSE registers; 
 it will convert to the FPU and do the comparison there, IIRC.
I feel like this point comes up often, and a lot of people have argued that x87 FP should simply not happen anymore. -Wyatt
Feb 15 2016
prev sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Monday, 15 February 2016 at 14:16:02 UTC, Guillaume Piolat 
wrote:
 On Monday, 15 February 2016 at 13:51:38 UTC, ixid wrote:
 This is the most recent one where John Colvin provided some 
 pointers to speed it up significantly. Walter has done some 
 good work taking the low-hanging fruit to speed up DMD code 
 and there is a lot of effort going on with reference counting 
 machinery but I wondered if some of the common errors people 
 make that slow down D code can be addressed?
Something that annoyed me a bit is floating-point comparisons: DMD does not seem to be able to handle them from SSE registers; it will convert to the FPU and do the comparison there, IIRC.
Same for std.math.lround: it uses the FPU way, while for float and double it's only one SSE instruction. Typically, with 6 functions similar to this one:

int round(float value)
{
    asm
    {
        naked;
        cvtss2si EAX, XMM0;
        ret;
    }
}

we could get ceil/trunc/round/floor, and almost as easily fmod and hypot. Classic, but I don't get why they're not in std.math. Goddamnit, we're in 2016.
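[For readers following along: a hedged sketch of how a truncating sibling of the round() above could look. CVTTSS2SI (note the extra "t") ignores the MXCSR rounding mode and always truncates toward zero; this assumes the x86_64 calling convention where the first float argument arrives in XMM0, and is untested here.]

```d
// Sketch only, x86_64-specific: truncating float-to-int conversion,
// mirroring the round() example but using the truncating instruction.
int trunc(float value)
{
    asm
    {
        naked;
        cvttss2si EAX, XMM0; // extra 't' = truncate toward zero
        ret;
    }
}
```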
Feb 15 2016
next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 15 February 2016 at 22:29:00 UTC, Basile B. wrote:
 we could get ceil/trunc/round/floor, and almost as easily 
 fmod and hypot.
 Classic, but I don't get why they're not in std.math.
Seems like you know a lot about the subject, and I know you've contributed to Phobos before, so how about making a PR for this :)
Feb 15 2016
parent reply Basile B. <b2.temp gmx.com> writes:
On Monday, 15 February 2016 at 23:13:13 UTC, Jack Stouffer wrote:
 On Monday, 15 February 2016 at 22:29:00 UTC, Basile B. wrote:
 we could get ceil/trunc/round/floor, and almost as easily 
 fmod and hypot.
 Classic, but I don't get why they're not in std.math.
Seems like you know a lot about the subject, and I know you've contributed to Phobos before, so how about making a PR for this :)
In the meantime: https://github.com/BBasile/iz/blob/master/import/iz/math.d

Actually, when I participated in this conversation I didn't remember that it was not good on X86. Using SSE rounding is really only good on AMD64; otherwise loading the input parameter "sucks" a lot (even for a 32-bit float, since it's not directly in EAX or XMM0).

Anyway, not good for Phobos. Why? When looking for documentation yesterday night I landed on a post by Walter explaining that the library for a systems programming language shouldn't be specific to an architecture.
Feb 17 2016
parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Wednesday, 17 February 2016 at 18:26:47 UTC, Basile B. wrote:
 Anyway, not good for phobos, why? When looking for 
 documentation yesterday night I've landed on a post by Walter 
 who explained that the library for a system programming 
 language shouldn't be specific to an architecture.
While I don't know about the post you're talking about, I don't think what Walter says applies to internal version blocks in a function. You could make it so that on AMD64 lround and friends are much faster by using those ASM routines. Also, std.math is already chock-full of architecture-specific code.
Feb 17 2016
parent reply Basile B. <b2.temp gmx.com> writes:
On Wednesday, 17 February 2016 at 18:50:45 UTC, Jack Stouffer 
wrote:
 On Wednesday, 17 February 2016 at 18:26:47 UTC, Basile B. wrote:
 Anyway, not good for phobos, why? When looking for 
 documentation yesterday night I've landed on a post by Walter 
 who explained that the library for a system programming 
 language shouldn't be specific to an architecture.
While I don't know about the post you're talking about, I don't think what Walter says applies to internal version blocks in a function. You could make it so that on AMD64 lround and friends are much faster by using those ASM routines. Also, std.math is already chock-full of architecture-specific code.
That's more subtle than that.

The oldest 64-bit processors (AMD64) always support SSE. So when we do "cast(int) 0.1;" on X86_64, the backend always generates SSE instructions.

The oldest 32-bit processors (X86) don't support SSE, maybe MMX (not sure). So when we do "cast(int) 0.1;" on X86, the backend always generates FPU instructions.

This is how I understand the post I landed on. My current work always uses SSE, so it doesn't conform to the "at least available" feature set.
Feb 17 2016
next sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Wednesday, 17 February 2016 at 19:01:38 UTC, Basile B. wrote:
 That's more subtle than that.

 The oldest 64-bit processors (AMD64) always support SSE. So 
 when we do "cast(int) 0.1;" on X86_64, the backend always 
 generates SSE instructions.

 The oldest 32-bit processors (X86) don't support SSE, maybe 
 MMX (not sure). So when we do "cast(int) 0.1;" on X86, the 
 backend always generates FPU instructions.

 This is how I understand the post I landed on.
 My current work always uses SSE, so it doesn't conform to the 
 "at least available" feature set.
Also, I forgot to say: a uniform API is needed to set the rounding mode, whether SSE is used or the FPU...
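[Editorial aside: std.math.FloatingPointControl already points in that direction. A sketch follows; whether it reaches both the x87 control word and MXCSR depends on the platform and compiler, so treat that as an assumption to verify.]

```d
import std.math : FloatingPointControl, rint;

void main()
{
    FloatingPointControl fpctrl;
    // Switch to round-toward-negative-infinity for this scope.
    fpctrl.rounding = FloatingPointControl.roundDown;
    assert(rint(1.7) == 1.0); // would be 2.0 under round-to-nearest
    // The previous rounding mode is restored when fpctrl goes out of scope.
}
```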
Feb 17 2016
next sibling parent w0rp <devw0rp gmail.com> writes:
I think it's important that DMD gets more of the easier 
optimisations. Most new users won't bother trying GDC or LDC, and 
if DMD doesn't generate fast enough code, they might leave before 
they try the compilers with better optimisations.
Feb 21 2016
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 17 Feb 2016 19:55:08 +0000
schrieb Basile B. <b2.temp gmx.com>:

 Also, I forgot to say: a uniform API is needed to set the 
 rounding mode, whether SSE is used or the FPU...
At least GCC has a codegen switch for that. A solution would have to either set both rounding modes at once, or the compilers would need to expose version identifiers like MathFPU/MathSSE. -- Marco
Mar 07 2016
prev sibling parent Luc J. Bourhis <luc_j_bourhis mac.com> writes:
On Wednesday, 17 February 2016 at 19:01:38 UTC, Basile B. wrote:
 The oldest 32 bit processor (X86) doesn't support SSE, maybe 
 MMX (not sure). So when we do "cast(int) 0.1;" on X86, the 
 backend always generate FPU instructions.
SSE goes back to the Pentium III, doesn't it? And the Pentium 4 supported SSE3, didn't it? Does D actively specify support for running on, e.g., a Pentium II?
Feb 21 2016
prev sibling parent reply Guillaume Piolat <name.lastname gmail.com> writes:
On Monday, 15 February 2016 at 22:29:00 UTC, Basile B. wrote:
 Same for std.math.lround

 they use the FP way while for float and double it's only one 
 sse instruction. Typically with 6 functions similar to this one:


 int round(float value)
 {
     asm
     {
         naked;
         cvtss2si EAX, XMM0;
         ret;
     }
 }

 we could get ceil/trunc/round/floor, and almost as easily 
 fmod and hypot.
 Classic, but I don't get why they're not in std.math.

 Goddamnit, we're in 2016.
lround and friends have been a big performance problem at times. Every time you can use cast(int) instead, it's way faster.
Feb 15 2016
parent reply Basile B. <b2.temp gmx.com> writes:
On Monday, 15 February 2016 at 23:19:44 UTC, Guillaume Piolat 
wrote:
 lround and friends have been a big performance problem at times.
 Every time you can use cast(int) instead, it's way faster.
I didn't know this trick. It generates almost the same SSE instruction (it truncates) and has the advantage of being inlineable. Is it documented somewhere? If not, it should be.
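[For readers finding this later: the two are not interchangeable, since the cast truncates while lround rounds. A minimal sketch of the semantic difference:]

```d
import std.math : lround;

void main()
{
    // cast(int) truncates toward zero...
    assert(cast(int) 3.7 == 3);
    assert(cast(int) -3.7 == -3);
    // ...while lround rounds to nearest, with ties away from zero.
    assert(lround(3.7) == 4);
    assert(lround(-3.7) == -4);
    assert(lround(2.5) == 3);
}
```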
Feb 15 2016
parent Guillaume Piolat <contact gam3sfrommars.fr> writes:
On Monday, 15 February 2016 at 23:35:54 UTC, Basile B. wrote:
 On Monday, 15 February 2016 at 23:19:44 UTC, Guillaume Piolat 
 wrote:
 lround and friends have been a big performance problem at 
 times.
 Every time you can use cast(int) instead, it's way faster.
I didn't know this trick. It generates almost the same SSE instruction (it truncates) and has the advantage of being inlineable. Is it documented somewhere? If not, it should be.
In SSE3 you also get an instruction that does this without messing with the x87 control word: FISTTP.
Feb 16 2016
prev sibling next sibling parent rsw0x <anonymous anonymous.com> writes:
On Monday, 15 February 2016 at 13:51:38 UTC, ixid wrote:
 Every time there is a D thread on reddit it feels like the new 
 user is expecting mind-blowing speed from D.

 [...]
If you want better codegen, don't use dmd. Use ldc; it's usually only about a version behind dmd.
Feb 15 2016
prev sibling parent reply ixid <nuaccount gmail.com> writes:
On Monday, 15 February 2016 at 13:51:38 UTC, ixid wrote:
 Every time there is a D thread on reddit it feels like the new 
 user is expecting mind-blowing speed from D.

 https://www.reddit.com/r/programming/comments/45v03g/porterstemmerd_an_implementation_of_the_porter/

 This is the most recent one where John Colvin provided some 
 pointers to speed it up significantly. Walter has done some 
 good work taking the low-hanging fruit to speed up DMD code and 
 there is a lot of effort going on with reference counting 
 machinery but I wondered if some of the common errors people 
 make that slow down D code can be addressed?

 Literals used to be a hidden speed bump but I think that was 
 improved, now the append operator is one of the most common 
 culprits, can this not be enhanced behind the scenes to work 
 more like append? Do others notice common pitfalls between the 
 article code and what the D community then suggests where we 
 can bridge the gap so naive users get faster code?
Since I posted this thread I've learned that std.algorithm.sum is 4 times slower than a naive loop sum. Even if this is for reasons of accuracy, this is exactly what I am talking about: this is a hidden iceberg of terrible performance that will reflect poorly on D. That's so slow the function needs a health warning.
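[To make the comparison concrete, this is the kind of naive loop being benchmarked against std.algorithm.sum; on exactly representable values the results agree, and the speed difference is the point of the complaint:]

```d
import std.algorithm : sum;

double naiveSum(const(double)[] a)
{
    double s = 0.0;
    foreach (x; a)  // simple accumulation: one add per element
        s += x;
    return s;
}

void main()
{
    auto data = [1.5, 2.5, 3.5];
    assert(naiveSum(data) == 7.5);
    assert(data.sum == 7.5); // same value here, but computed pairwise
}
```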
Mar 08 2016
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 8 March 2016 at 14:14:25 UTC, ixid wrote:
 Since I posted this thread I've learned std.algorithm.sum is 4 
 times slower than a naive loop sum. Even if this is for reasons 
 of accuracy this is exactly what I am talking about- this is a 
 hidden iceberg of terrible performance that will reflect poorly 
 on D. That's so slow the function needs a health warning.
There was a longer discussion here https://forum.dlang.org/post/vkiwojmfjrwhigbkenaa forum.dlang.org
Mar 08 2016
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 03/08/2016 09:14 AM, ixid wrote:
 Since I posted this thread I've learned std.algorithm.sum is 4 times
 slower than a naive loop sum. Even if this is for reasons of accuracy
 this is exactly what I am talking about- this is a hidden iceberg of
 terrible performance that will reflect poorly on D. That's so slow the
 function needs a health warning.
Whoa. What's happening there? Do we have anyone on it? -- Andrei
Mar 09 2016
next sibling parent Daniel Kozak via Digitalmars-d <digitalmars-d puremagic.com> writes:
Dne 9.3.2016 v 14:26 Andrei Alexandrescu via Digitalmars-d napsal(a):
 On 03/08/2016 09:14 AM, ixid wrote:
 Since I posted this thread I've learned std.algorithm.sum is 4 times
 slower than a naive loop sum. Even if this is for reasons of accuracy
 this is exactly what I am talking about- this is a hidden iceberg of
 terrible performance that will reflect poorly on D. That's so slow the
 function needs a health warning.
Whoa. What's happening there? Do we have anyone on it? -- Andrei
I guess he's speaking about this one: http://forum.dlang.org/post/mailman.4748.1456070484.22025.digitalmars-d-learn puremagic.com
Mar 09 2016
prev sibling next sibling parent reply cym13 <cpicard openmailbox.org> writes:
On Wednesday, 9 March 2016 at 13:26:45 UTC, Andrei Alexandrescu 
wrote:
 On 03/08/2016 09:14 AM, ixid wrote:
 Since I posted this thread I've learned std.algorithm.sum is 4 
 times
 slower than a naive loop sum. Even if this is for reasons of 
 accuracy
 this is exactly what I am talking about- this is a hidden 
 iceberg of
 terrible performance that will reflect poorly on D. That's so 
 slow the
 function needs a health warning.
Whoa. What's happening there? Do we have anyone on it? -- Andrei
They just don't do the same thing: sum() uses pairwise summation, which is safer (more accurate), as I understand it. Corresponding issue: https://issues.dlang.org/show_bug.cgi?id=15722
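[For readers unfamiliar with it: pairwise summation recursively splits the input and sums the halves, which keeps floating-point error growth logarithmic rather than linear in the element count. A simplified sketch; std.algorithm's actual implementation differs in details such as the base-case cutoff:]

```d
double pairwiseSum(const(double)[] a)
{
    // Base case: short slices are summed directly.
    if (a.length <= 2)
    {
        double s = 0.0;
        foreach (x; a)
            s += x;
        return s;
    }
    // Recursive case: sum each half separately, then combine,
    // so no single accumulator absorbs the whole error.
    return pairwiseSum(a[0 .. $ / 2]) + pairwiseSum(a[$ / 2 .. $]);
}

void main()
{
    assert(pairwiseSum([1.0, 2.0, 3.0, 4.0]) == 10.0);
    assert(pairwiseSum([]) == 0.0);
}
```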
Mar 09 2016
parent jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 9 March 2016 at 13:42:40 UTC, cym13 wrote:
 They just don't do the same thing, sum() uses pairwise 
 summation which is safer as I understand it. Corresponding 
 issue: https://issues.dlang.org/show_bug.cgi?id=15722
That third comment, about how it's not obvious which algorithm sum is using, is a good one.
Mar 09 2016
prev sibling parent reply John Colvin <john.loughran.colvin gmail.com> writes:
On Wednesday, 9 March 2016 at 13:26:45 UTC, Andrei Alexandrescu 
wrote:
 On 03/08/2016 09:14 AM, ixid wrote:
 Since I posted this thread I've learned std.algorithm.sum is 4 
 times
 slower than a naive loop sum. Even if this is for reasons of 
 accuracy
 this is exactly what I am talking about- this is a hidden 
 iceberg of
 terrible performance that will reflect poorly on D. That's so 
 slow the
 function needs a health warning.
Whoa. What's happening there? Do we have anyone on it? -- Andrei
Ilya has long-term plans for this, but I have a short-term fix that will buy us a very large performance improvement here (if my old benchmarks were correct). Give me a few mins to prep the pull request :)
Mar 09 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/16 9:03 AM, John Colvin wrote:
 On Wednesday, 9 March 2016 at 13:26:45 UTC, Andrei Alexandrescu wrote:
 On 03/08/2016 09:14 AM, ixid wrote:
 Since I posted this thread I've learned std.algorithm.sum is 4 times
 slower than a naive loop sum. Even if this is for reasons of accuracy
 this is exactly what I am talking about- this is a hidden iceberg of
 terrible performance that will reflect poorly on D. That's so slow the
 function needs a health warning.
Whoa. What's happening there? Do we have anyone on it? -- Andrei
Ilya has long term plans for this, but I have a short-term fix that will buy us a very large performance improvement here (if my old benchmarks were correct). Give me a few mins to prep the pull request :)
Thanks much! -- Andrei
Mar 09 2016
parent John Colvin <john.loughran.colvin gmail.com> writes:
On Wednesday, 9 March 2016 at 14:04:40 UTC, Andrei Alexandrescu 
wrote:
 On 3/9/16 9:03 AM, John Colvin wrote:
 On Wednesday, 9 March 2016 at 13:26:45 UTC, Andrei 
 Alexandrescu wrote:
 On 03/08/2016 09:14 AM, ixid wrote:
 [...]
Whoa. What's happening there? Do we have anyone on it? -- Andrei
Ilya has long term plans for this, but I have a short-term fix that will buy us a very large performance improvement here (if my old benchmarks were correct). Give me a few mins to prep the pull request :)
Thanks much! -- Andrei
https://github.com/D-Programming-Language/phobos/pull/4069
Mar 09 2016
prev sibling parent reply Jon D <jond noreply.com> writes:
On Tuesday, 8 March 2016 at 14:14:25 UTC, ixid wrote:
 Since I posted this thread I've learned std.algorithm.sum is 4 
 times slower than a naive loop sum. Even if this is for reasons 
 of accuracy this is exactly what I am talking about- this is a 
 hidden iceberg of terrible performance that will reflect poorly 
 on D. That's so slow the function needs a health warning.
I've seen a few cases while exploring D. Not all fully researched (apologies for that), but since there appears to be interest in identification I'll list them.

* Lower-casing strings (likely upper-casing too), and some character type checks.

Routines like toLower and asLowerCase call functions that work for all Unicode characters. I was able to create much faster versions by checking if the character was ASCII, then calling either the ASCII version or the more general version. The same is true for a few routines like isNumber. Some have the ASCII-check optimization built in, but not all.

If this optimization is added, it might also be useful to add a few common combinations (or a generic solution, if that's feasible). For example, to check if a character is alphanumeric, one currently ORs two tests from the standard library, isAlpha and isNumber. Putting in an ASCII optimization check requires putting it before doing the OR, rather than inside the tests being ORed.

* Large associative arrays.

When associative arrays get beyond about 10 million entries, performance starts to decline. I believe this is due to resizing the arrays. It's worse with strings as keys than with integers as keys. Having a way to reserve capacity may help under some circumstances.

* Associative arrays - converting keys to immutable versions for lookup.

Associative arrays want immutable values as keys. As far as I can tell, immutable values are also required when performing a lookup, even if a new entry won't be stored. A couple of apps I've written walk through large lists of text values, naturally available as char[] because they are read from input streams. To test presence in an associative array, it's necessary to copy them to immutable strings first. I haven't fully researched this one, but my experience is that copying from char[] to string becomes a meaningful cost. On the surface, this appears to be an optimization opportunity: create the immutable strings only when actually storing a new value.

--Jon
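[A sketch of the ASCII fast path described above; fastToLower is an illustrative name, not a proposed API.]

```d
import std.ascii : asciiLower = toLower;
import std.uni : uniLower = toLower;

dchar fastToLower(dchar c)
{
    // Cheap range check first: in many workloads most characters are
    // ASCII, so the full Unicode case mapping is consulted only when
    // actually needed.
    return c <= 0x7F ? asciiLower(c) : uniLower(c);
}

void main()
{
    assert(fastToLower('A') == 'a');
    assert(fastToLower('a') == 'a'); // already lowercase, unchanged
    assert(fastToLower('Δ') == 'δ'); // non-ASCII goes through std.uni
}
```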
Mar 09 2016
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Wed, Mar 09, 2016 at 08:30:10PM +0000, Jon D via Digitalmars-d wrote:
 On Tuesday, 8 March 2016 at 14:14:25 UTC, ixid wrote:
Since I posted this thread I've learned std.algorithm.sum is 4 times
slower than a naive loop sum. Even if this is for reasons of accuracy
this is exactly what I am talking about- this is a hidden iceberg of
terrible performance that will reflect poorly on D. That's so slow
the function needs a health warning.
In the case of std.algorithm.sum, the focus is on accuracy rather than performance. It does some extra work to ensure maximum accuracy in the result, so it shouldn't be expected to have top performance. Granted, though, the docs could be clearer about this accuracy vs. performance tradeoff. Please file a bug on this (or better yet, submit a PR for it).

In any case, 4 times slower sounds a bit excessive... it would be good to investigate why this is happening and fix it.
 I seen a few cases while exploring D. Not all fully researched
 (apologies for that), but since there appears to be interest in
 identification I'll list them.
One of the big performance killers that people are likely to run into is autodecoding for strings in all range-based algorithms. Walter has repeatedly complained about this, but so far Andrei (and some others) have resisted getting rid of autodecoding. Fortunately, you can (mostly) suppress this by wrapping your strings in std.encoding.codeUnits before operating on them. Nevertheless, it's one of those hidden gotchas that could result in poorer performance than what you could get otherwise.

Another big performance killer that I repeatedly run into is the overly-eager GC collection schedule. Some may argue against the use of the GC, period, but I have found that even with the GC, it's possible to get 30-40% speedups just by calling GC.disable() and manually triggering GC.collect() at a lower frequency than the default. This is especially true when your program performs many allocations of small objects, like (small) strings. I have observed this in at least 2-3 different memory-intensive programs of mine, including a recent experiment in writing a faster CSV parser than std.csv.

Arguably, we should fix this in druntime itself so that collection cycles are run less often, though I'm not sure what implications this may have on low-memory systems. Sometimes, you don't even need to do this for the entire program; you can get big performance boosts just by wrapping GC.disable() / GC.enable() around strategic sections of allocation-intensive code.
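[The GC pattern described above, sketched; the loop bounds and collection frequency are arbitrary placeholders.]

```d
import core.memory : GC;

void allocationHeavyWork()
{
    GC.disable();            // suspend automatic collection cycles
    scope(exit) GC.enable(); // restore normal behavior on any exit path

    foreach (i; 0 .. 100_000)
    {
        auto s = new char[](16); // many small allocations
        s[] = 'x';
        if (i % 10_000 == 0)
            GC.collect();    // collect at our own, lower frequency
    }
}

void main()
{
    allocationHeavyWork();
}
```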
 * Lower-casing strings (likely upper-casing), and some character type
 checks.
 
 Routines like to toLower and asLowerCase call functions that work for
 all unicode characters. I was able to create much faster versions by
 checking if the character was ascii, then calling either the ascii
 version or the more general version. Same is true for a few routines
 like isNumber. Some have the ascii check optimization built in, but
 not all.
 
 If this optimization is added, it might also be useful to add a few
 common combinations (or a generic solution, if that's feasible). For
 example, to check if a character is alpha-numeric, one currently ORs
 two tests from the standard library, isAlpha and isNumber. Putting in
 an ascii optimization check requires putting it before doing the OR,
 rather than inside the tests being ORed.
These sound like valuable additions to Phobos. Submit a PR, please? [...]
 * Associative arrays - Converting keys to immutable versions for lookup
 
 Associative arrays want immutable values as keys. Far as I can tell,
 immutable values are also required when performing a lookup, even if a
 new entry won't be stored. A couple apps I've written walk through
 large lists of text values, naturally available as char[] because they
 are read from input streams. To test presence in an associative array,
 it's necessary to copy them to immutable strings first. I haven't
 fully researched this one, but my experience is that copying from
 char[] to string becomes a meaningful cost.
[...] This is arguably a bug. If you're only looking up a key, you should be able to pass in const(char)[] rather than immutable(char)[] (aka string).

The GC performance problem I mentioned above is especially noticeable when there are large numbers of small allocations, as would be the case if you constantly have to call .idup just because AA lookups demand immutable keys.

Please file an issue for this, if there isn't one already filed.

T

-- Windows 95 was a joke, and Windows 98 was the punchline.
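[A sketch of the cost under discussion; whether a plain `buf in aa` lookup compiles without a copy depends on the compiler version, so the .idup line below shows the allocation being complained about.]

```d
void main()
{
    bool[string] seen = ["hello": true];

    char[] buf = "hello".dup; // mutable data, e.g. read from a stream
    // The lookup itself stores nothing, yet building an immutable key
    // forces an allocation on every probe:
    assert((buf.idup in seen) !is null);
    assert(("world".idup in seen) is null);
}
```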
Mar 09 2016
parent John Colvin <john.loughran.colvin gmail.com> writes:
On Wednesday, 9 March 2016 at 21:01:13 UTC, H. S. Teoh wrote:
 On Wed, Mar 09, 2016 at 08:30:10PM +0000, Jon D via 
 Digitalmars-d wrote:
 On Tuesday, 8 March 2016 at 14:14:25 UTC, ixid wrote:
[...]
In the case of std.algorithm.sum, the focus is on accuracy rather than performance. It does some extra work to ensure maximum accuracy in the result, so it shouldn't be expected to have top performance. Granted, though, the docs could be clearer about this accuracy vs. performance tradeoff. Please file a bug on this (or better yet, submit a PR for it). In any case, 4 times slower sounds a bit excessive... it would be good to investigate why this is happening and fix it.
I think you'd be good at reviewing this: https://github.com/D-Programming-Language/phobos/pull/4069
Mar 09 2016
prev sibling parent Jon D <jond noreply.com> writes:
On Wednesday, 9 March 2016 at 20:30:10 UTC, Jon D wrote:
 I've seen a few cases while exploring D.
Turns out there are issues filed for each of the performance issues I mentioned:

* Lower-casing strings: https://issues.dlang.org/show_bug.cgi?id=11229

* Large associative arrays: https://issues.dlang.org/show_bug.cgi?id=2504

* Associative arrays - checking membership with mutable values (char arrays) rather than strings (immutable): https://issues.dlang.org/show_bug.cgi?id=15038
Mar 09 2016