
digitalmars.D - Re: memset and related things

reply bearophile <bearophileHUGS lycos.com> writes:
Don:

 It'll be interesting to see what the priorities are now -- 
 maybe this stuff is of more interest now.

Probably removing bugs is more important still :-) For example your work has changed a little how compile-time functions can be used in D.
 BTW the AMD manual for K7 (or might be K6 optimisation manual? don't 
 exactly remember) goes into great detail about both memcpy() and 
 memset(). Turns out there's about five different cases.

In the meantime Deewiant has told me that on 64 bit glibc's memset is better, and on more modern CPUs the timings are different (and on 64 bit my first version may not work; maybe the second one is better. I have not tested it on 64 bit LDC yet). I'm just a newbie at this stuff, while the people who write the memset of 64-bit glibc are experts. Bye and thank you, bearophile
Sep 20 2009
next sibling parent reply language_fan <foo bar.com.invalid> writes:
Sun, 20 Sep 2009 16:09:50 -0400, bearophile thusly wrote:

 I'm just a newbie on this stuff, while
 people that write the memset of 64bit glibc are expert.

I have been wondering why people here often complain that Java is slower than D. Every time I see your benchmarks, Java is actually doing just fine; sometimes it's even faster than D. Are your benchmarks somehow flawed every time Java beats D in performance?
Sep 20 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
language_fan:

 Every time I see your benchmarks, Java is actually doing just 
 fine, sometimes it's even faster than D. Are your benchmarks somehow 
 flawed every time Java wins D in performance?

I don't fully understand your question. The main purpose of some of the benchmarks I've done in the last few months is to find performance bugs (or sometimes just bugs) in LDC or LLVM (I have stopped doing something similar for DMD because people seem less interested in improving its performance). Now I often use code originally written in C++ and Java because I've seen that in most situations LLVM is able to produce good binaries from very C-like code (with few exceptions).

My benchmarks aren't chosen randomly; I naturally focus on things that are slower in D, so sometimes you can see Java "win". I usually discard the code where Java turns out slower :-) Installing and using a debug build of the JavaVM to see the assembly its JIT produces is not handy, but I sometimes do it. Once in a while you can find some surprises there.

Recently I have found a nice single-file C++ program that heavily uses templates and partial specialization, but I am having problems translating it to good D to test templates in LDC because I don't know enough C++ yet. It finds the optimal solution to the 15 puzzle, and it's quite efficient: http://codepad.org/96ATkuhx Help/suggestions are welcome :-) Bye, bearophile
Sep 20 2009
next sibling parent reply downs <default_357-line yahoo.de> writes:
language_fan wrote:
 Sun, 20 Sep 2009 18:16:37 -0400, bearophile thusly wrote:
 
 My benchmarks aren't chosen randomly, I naturally focus on things that
 are slower in D, so sometimes you can see Java to "win". I usually
 discard the code where Java results slower :-)

I have seen people many times mention that Java is in general orders of magnitude slower than D, no matter what kind of algorithms you run on both environments. This is because of the VM - nothing on a VM can run faster than native code, they say.

Wow. I actually haven't seen that argument in a long time - it seems pretty much debunked nowadays. (Personally, for me it comes down to: Java will always have a certain delay on startup, and its thoroughly object-oriented design forces heap access that is often unnecessary for solving the problem; plus the single-paradigm model is woefully constraining in comparison to more flexible languages.)
 I personally use a 
 lot of heap memory allocation in my work, and so far Java has not only 
 been safer (and provides decent stack traces and no compiler bugs), but 
 also faster - each time.

Yes, heap allocation is faster in Java. There's so much of it they pretty much had no choice but to tune it to hell and back :)
 If you decide to hide the bad results
 (for D), it will only reinforce the misinformation.

Um, he said he hides the bad results _for Java_.
Sep 21 2009
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
downs:

and its thoroughly object-oriented design forces heap access that is often
unnecessary for solving the problem;<

In D/C# you can use an array of structs, which sometimes allows for better performance; in Java, to avoid the decrease in performance caused by using an array of references, you may need to use several parallel arrays of the single fields of the struct. But the cache behaviour of such parallel arrays is different: sometimes it's better and sometimes it's worse than the array of structs. If you have to access several fields at once of each struct of the array, then using the array of structs is generally faster than using the parallel arrays. If you have to scan only one or a few fields, then using parallel arrays is faster because less data comes through the caches.

Some time ago someone created a "parallel array" struct here to allow for such layout choices at compile time. A VirtualMachine in theory can dynamically change such struct/array layout at runtime according to the access patterns to the data :-) I think similar dynamism in data structures may be done in the next generation of VM-based languages (you can implement such smart data structures in D too, but the data structure has to be mostly opaque and you have to forbid taking addresses of any data, because it can be moved and shuffled around).
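The two layouts contrasted above can be sketched in D; the names here are illustrative, not from any library, and the comments describe the memory-traffic trade-off rather than guaranteed numbers:

```d
import std.stdio;

struct Point { double x, y, z; }

// Array of structs (AoS): each element's fields sit together in memory.
double sumXAoS(const Point[] pts)
{
    double s = 0;
    foreach (ref p; pts)
        s += p.x; // each iteration pulls a whole 24-byte Point through the cache
    return s;
}

// Parallel arrays (SoA): one plain array per field.
double sumXSoA(const double[] xs)
{
    double s = 0;
    foreach (x; xs)
        s += x; // streams only the x data: roughly a third of the memory traffic
    return s;
}

void main()
{
    Point[] aos = [Point(1, 0, 0), Point(2, 0, 0)];
    double[] xs = [1, 2];
    writeln(sumXAoS(aos), " ", sumXSoA(xs)); // both compute the same sum of x
}
```

When several fields of each element are needed at once (say, computing distances), the AoS loop touches memory it was going to need anyway; when only one field is scanned, the SoA loop moves far less data, which is exactly the trade-off described above.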
Um, he said he hides the bad results _for Java_.<

Right. I generally don't like to hide things, and I show all the data I have collected. Sometimes I don't show benchmarks where Java looks slow because they can be boring and because I am not currently interested in improving the JavaVM. By focusing on where LDC/LLVM do badly, I can try to improve them. Bye, bearophile
Sep 21 2009
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
downs wrote:
 language_fan wrote:
 Sun, 20 Sep 2009 18:16:37 -0400, bearophile thusly wrote:

 My benchmarks aren't chosen randomly, I naturally focus on things that
 are slower in D, so sometimes you can see Java to "win". I usually
 discard the code where Java results slower :-)

 I have seen people many times mention that Java is in general orders of magnitude slower than D, no matter what kind of algorithms you run on both environments. This is because of the VM - nothing on a VM can run faster than native code, they say.

Wow. I actually haven't seen that argument in a long time - it seems pretty much debunked nowadays. (Personally, for me it comes down to: Java will always have a certain delay on startup, and its thoroughly object-oriented design forces heap access that is often unnecessary for solving the problem; plus the single-paradigm model is woefully constraining in comparison to more flexible languages.)
 I personally use a 
 lot of heap memory allocation in my work, and so far Java has not only 
 been safer (and provides decent stack traces and no compiler bugs), but 
 also faster - each time.

Yes, heap allocation is faster in Java. There's so much of it they pretty much had no choice but to tune it to hell and back :)
 If you decide to hide the bad results
 (for D), it will only reinforce the misinformation.

Um, he said he hides the bad results _for Java_.

I guess that reinforces some other misinformation :o). Andrei
Sep 21 2009
prev sibling next sibling parent reply Don <nospam nospam.com> writes:
language_fan wrote:
 Sun, 20 Sep 2009 18:16:37 -0400, bearophile thusly wrote:
 
 My benchmarks aren't chosen randomly, I naturally focus on things that
 are slower in D, so sometimes you can see Java to "win". I usually
 discard the code where Java results slower :-)

I have seen people many times mention that Java is in general orders of magnitude slower than D, no matter what kind of algorithms you run on both environments.

I think you're creating a straw man. I don't recall anyone saying that. Certainly not "orders of magnitude" in general. In general, when D is faster, it'll be because more coding techniques are available. For example, because you're not forced to use OOP all the time.
 This is because of the VM - nothing on a VM can run 
 faster than native code, they say.  If you decide to hide the bad results
 (for D), it will only reinforce the misinformation. I personally use a 
 lot of heap memory allocation in my work, and so far Java has not only 
 been safer (and provides decent stack traces and no compiler bugs), but 
 also faster - each time.

On this ng, it seems to be universally acknowledged that Java's GC is much better than D's, and that Java's tool chain is orders of magnitude more mature than D's. But in the cases where Java beats D, it will almost always be because of the GC; LDC probably only gets beaten on the GC.

Having looked at the DMD optimiser, I'm a bit surprised that it's competitive at all (especially in floating point). There is so much low-hanging fruit, it's practically an orchard <g>. It's hardly been touched since the mid-90's, while Java's had 15 years of massive commercial development during that time.
Sep 21 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 Having looked at the DMD optimiser, I'm a bit surprised that it's 
 competitive at all (especially in floating point). There is so much 
 low-hanging fruit, it's practically an orchard <g>.

I believe that's because it has reached the point of diminishing returns.
Sep 23 2009
parent reply Don <nospam nospam.com> writes:
Walter Bright wrote:
 Don wrote:
 Having looked at the DMD optimiser, I'm a bit surprised that it's 
 competitive at all (especially in floating point). There is so much 
 low-hanging fruit, it's practically an orchard <g>.

I believe that's because it has reached the point of diminishing returns.

I presume you're talking about integer optimisation, because that's definitely not the case for DMD's floating point. Eg, there's a comment in the code somewhere saying that CSE (common subexpression) optimisations are not done for floating point because the code generator can't cope with it. It does loops particularly poorly: it never seems to keep a variable on the FP stack, so it keeps loading and saving them. The potential speedups are enormous.
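A minimal loop of the kind described above, where the quality of the code generator shows up directly:

```d
// If the code generator keeps `sum` on the FP stack (or in an XMM
// register), the loop body is one load of the array element and one
// add per iteration. A weak optimiser instead spills `sum` to memory
// and reloads it on every pass, greatly increasing memory traffic.
double sumArray(const double[] a)
{
    double sum = 0.0;
    foreach (x; a)
        sum += x;
    return sum;
}
```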
Sep 23 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 I presume you're talking about integer optimisation, because that's 
 definitely not the case for DMD's floating point. Eg, there's a comment 
 in the code somewhere saying that CSE (common subexpression) 
 optimisations are not done for floating point because the code generator 
 can't cope with it. It does loops particularly poorly: it never seems to 
 keep a variable on the FP stack, so it keeps loading and saving them. The 
 potential speedups are enormous.

I'll agree with you on that assessment.
Sep 23 2009
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
language_fan wrote:
 I have seen people many times mention that Java is in general orders of 
 magnitude slower than D, no matter what kind of algorithms you run on 
 both environments. This is because of the VM - nothing on a VM can run 
 faster than native code, they say. If you decide to hide the bad results 
 (for D), it will only reinforce the misinformation. I personally use a 
 lot of heap memory allocation in my work, and so far Java has not only 
 been safer (and provides decent stack traces and no compiler bugs), but 
 also faster - each time.

If you simply time a heap allocation in D and Java, Java will probably be faster - not because of anything inherently faster about Java, but because Sun has poured billions of dollars into trying to make their heap allocator faster. However, a typical D program does far fewer allocations than the Java equivalent, for the simple reason that D supports stack allocated and embedded value aggregates while Java requires them to be on the heap. It is the much reduced need for the heap to be fast that is an advantage for D.

Java does do some escape analysis to try and allocate heap objects on the stack instead, but I don't know how effective this is, and even that won't help if you try to embed a value aggregate into a class:

struct S { int x, y, z; }

class C
{
     S v;   // in Java this would require a second heap allocation
}

In other words, one of the most effective techniques for speeding up heap allocation costs is to design the data structures so they require fewer allocations!
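The difference in allocation counts can be made concrete in D; `CIndirect` below is a hypothetical Java-style layout written for contrast, not something any library requires:

```d
struct S { int x, y, z; }

// D-style: S is embedded by value, so `new CEmbedded` is a single
// allocation holding both the class header and S's three ints.
class CEmbedded { S v; }

// Java-style layout (hypothetical): only a reference can be stored,
// so populating the field costs a second, separate heap allocation
// plus a pointer chase on every access.
class CIndirect { S* p; }

void main()
{
    auto a = new CEmbedded;   // one allocation
    a.v.x = 1;                // direct access into the same memory block

    auto b = new CIndirect;   // one allocation...
    b.p = new S(2, 3, 4);     // ...plus a second one
    assert(a.v.x == 1 && b.p.y == 3);
}
```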
Sep 23 2009
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

 Java does do some escape analysis to try and allocate heap objects on 
 the stack instead, but I don't know how effective this is, and even that 
 won't help if you try to embed a value aggregate into a class:
 
 struct S { int x, y, z; }
 
 class C
 {
      S v;   // in Java this would require a second heap allocation
 }

I have discussed with LDC devs an improvement of that idea (it was discussed in this newsgroup too):

class C1 { int x, y, z; }

class C2 { scope C1 c; }

This idea raises a few problems, but I think they can be solved. You can special-case the management of such objects scope-allocated inside other objects. Or you can just add an invisible field to that C2 class, the reference to c. Bye, bearophile
Sep 23 2009
parent reply Jeremie Pelletier <jeremiep gmail.com> writes:
bearophile wrote:
 Walter Bright:
 
 Java does do some escape analysis to try and allocate heap objects on 
 the stack instead, but I don't know how effective this is, and even that 
 won't help if you try to embed a value aggregate into a class:

 struct S { int x, y, z; }

 class C
 {
      S v;   // in Java this would require a second heap allocation
 }

I have discussed with LDC devs an improvement of that idea (it was discussed in this newsgroup too):

class C1 { int x, y, z; }

class C2 { scope C1 c; }

This idea raises a few problems, but I think they can be solved. You can special-case the management of such objects scope-allocated inside other objects. Or you can just add an invisible field to that C2 class, the reference to c. Bye, bearophile

I would love to see such a feature added in D; it would make object composition much more convenient and faster.

You'd just need the compiler to force a call into C1's constructor within C2's and generate a call to C1's destructor in the epilog of C2's. I can't see how implementing that idea could be very hard, other than the parser-related support.

No need to store a reference within the object; that would still require another heap allocation, killing the need for scope in the first place.
Sep 23 2009
next sibling parent Jeremie Pelletier <jeremiep gmail.com> writes:
Jarrett Billingsley wrote:
 On Wed, Sep 23, 2009 at 10:52 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:
 I would love to see such a feature added in D, would make object composition
 much more convenient and faster.

 You'd just need the compiler to force a call into C1's constructor within
 C2's and generate a call to C1's destructor in the epilog of C2's. I can't
 see how implementing idea that could be any harder, other than the parser
 related support.

You are really hung up on parsing, aren't you ;) Parsing is a tiny, tiny, tiny fraction of the complexity of implementing a language, and most language features do not require any change to it. Really.

Everything begins at parsing! You'd need to parse the scope lexeme in that context, then you'd need to adjust the appropriate parse node to contain the scope flag, which would require additional semantic analysis and finally a few changes in the IR generation. That's what I implied by parser "related" support. But yeah, I love parsing :)
Sep 23 2009
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Jeremie Pelletier:

 No need to store a reference within the object, that would still require 
 another heap allocation killing the need for scope in the first place.

Probably you have not understood what I meant. I surely didn't mean that it needs another heap allocation (the purpose of scope is to avoid that). Such an extra invisible reference field among that object's attributes is not necessary, but it makes things simpler. If you add such a reference, then inside the outer object you have the inner object followed by the reference to the inner object, both inlined (so no other heap allocations are necessary). So you can keep all the usual semantics of D objects and avoid most special-casing. For example you can assign a different object to such a reference:

class C1 { int x, y, z; }

class C2 { scope C1 c1; }

void main() {
    auto c2 = new C2;
    auto c = new C1;
    c2.c1 = c; // <--- see here
}

Now the memory of c1 inside c2 is not reachable anymore, so the destructor of that inner object can be called by the GC, but of course its actual memory will not be freed until c2 is deallocated. Bye, bearophile
Sep 23 2009
prev sibling next sibling parent language_fan <foo bar.com.invalid> writes:
Sun, 20 Sep 2009 18:16:37 -0400, bearophile thusly wrote:

 My benchmarks aren't chosen randomly, I naturally focus on things that
 are slower in D, so sometimes you can see Java to "win". I usually
 discard the code where Java results slower :-)

I have seen people many times mention that Java is in general orders of magnitude slower than D, no matter what kind of algorithms you run on both environments. This is because of the VM - nothing on a VM can run faster than native code, they say. If you decide to hide the bad results (for D), it will only reinforce the misinformation. I personally use a lot of heap memory allocation in my work, and so far Java has not only been safer (and provides decent stack traces and no compiler bugs), but also faster - each time.
Sep 21 2009
prev sibling parent Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Wed, Sep 23, 2009 at 10:52 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:
 I would love to see such a feature added in D, would make object composition
 much more convenient and faster.

 You'd just need the compiler to force a call into C1's constructor within
 C2's and generate a call to C1's destructor in the epilog of C2's. I can't
 see how implementing idea that could be any harder, other than the parser
 related support.

You are really hung up on parsing, aren't you ;) Parsing is a tiny, tiny, tiny fraction of the complexity of implementing a language, and most language features do not require any change to it. Really.
Sep 23 2009
prev sibling parent reply Don <nospam nospam.com> writes:
bearophile wrote:
 Don:
 
 It'll be interesting to see what the priorities are now -- 
 maybe this stuff is of more interest now.

Probably removing bugs is more important still :-) For example your work has changed a little how compile-time functions can be used in D.
 BTW the AMD manual for K7 (or might be K6 optimisation manual? don't 
 exactly remember) goes into great detail about both memcpy() and 
 memset(). Turns out there's about five different cases.

In the meantime Deewiant has told me that on 64 bit glibc's memset is better, and on more modern CPUs the timings are different (and on 64 bit my first version may not work; maybe the second one is better. I have not tested it on 64 bit LDC yet). I'm just a newbie at this stuff, while the people who write the memset of 64-bit glibc are experts.

Really, memset() _should_ be optimal in all cases. On almost all compilers, it's not optimal, and on many (such as DMD) there's a _lot_ of room for improvement. So I consider this to be a C standard library implementation issue, rather than a language weakness.
Sep 21 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Don wrote:
 bearophile wrote:
 Don:

 It'll be interesting to see what the priorities are now -- maybe this 
 stuff is of more interest now.

Probably removing bugs is more important still :-) For example your work has changed a little how compile-time functions can be used in D.
 BTW the AMD manual for K7 (or might be K6 optimisation manual? don't 
 exactly remember) goes into great detail about both memcpy() and 
 memset(). Turns out there's about five different cases.

In the meantime Deewiant has told me that on 64 bit glibc's memset is better, and on more modern CPUs the timings are different (and on 64 bit my first version may not work; maybe the second one is better. I have not tested it on 64 bit LDC yet). I'm just a newbie at this stuff, while the people who write the memset of 64-bit glibc are experts.

Really, memset() _should_ be optimal in all cases. On almost all compilers, it's not optimal, and on many (such as DMD) there's a _lot_ of room for improvement. So I consider this to be a C standard library implementation issue, rather than a language weakness.

Don, I suggest the following. std.algorithm has a routine called fill(range, value) which semantically subsumes memset. I suggest you specialize fill() for contiguous memory ranges of primitive types (which shouldn't be hard with std.traits), and then optimize the heck out of it. You could do the same with copy(), also in std.algorithm, to implement a super-duper memcpy() routine.

If you go this route, people can uniformly use high-level algorithms that specialize themselves whenever applicable. Andrei
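A rough sketch of the kind of specialization suggested here. This is not Phobos code: the name `fillFast`, the template constraint, and the zero-fill fast path are all illustrative assumptions.

```d
import std.traits : isIntegral, isScalarType;

// Hypothetical specialized fill for contiguous arrays of scalars.
void fillFast(T)(T[] range, T value)
    if (isScalarType!T)
{
    static if (isIntegral!T)
    {
        if (value == 0)
        {
            // All-zero integral fills can defer to memset, which the
            // C library is free to optimize per CPU. Floating point
            // is excluded: memset would turn a requested -0.0 into +0.0.
            import core.stdc.string : memset;
            memset(range.ptr, 0, range.length * T.sizeof);
            return;
        }
    }
    // Generic fallback: element-by-element assignment.
    foreach (ref e; range)
        e = value;
}
```

Usage would look like `auto a = new int[256]; fillFast(a, 0);`, which takes the memset path, while `fillFast(a, 7)` falls back to the loop. Callers keep one uniform high-level name and the specialization happens underneath, which is the point of Andrei's suggestion.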
Sep 21 2009