
digitalmars.D - memset and related things

reply bearophile <bearophileHUGS lycos.com> writes:
In a program I've found that clearing an array in the inner loop was taking too
much time. To solve the problem I've run many experiments, and I've also
written the following test program.

The short summary is, to set an array of 4-byte integers to a given constant
the best ways are:
- if len <~ 20, just use an inlined loop.
- if 20 < len < 200_000, use a loop unrolled 4 times with the movaps
instruction (8 times unrolled is a little worse).
- if len > 200_000, a loop with the movntps instruction is better.

Generally such solutions are better than memset() (only when len is around
150_000 is memset a bit better than the four movaps).


See also:
http://www.gamedev.net/community/forums/topic.asp?topic_id=532112

http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc197.htm


This is a possible function that can be used if a.length > 20, otherwise an
inlined loop is faster:


void memset4(T)(T[] a, T value) {
    // to be used for len(a) >~ 20
    static assert(T.sizeof == 4);
    if (!a.length)
        return;
    auto a_ptr = a.ptr;
    auto a_end = a_ptr + a.length;

    // align the pointer to 64 bytes, so the 64-bytes-per-iteration asm
    // loops below can never overshoot the trimmed end (the head loop is
    // also bounded by a_end, so short arrays are safe)
    size_t aux = ((cast(size_t)a_ptr + 63) & ~cast(size_t)63);
    auto n = cast(T*)aux;
    while (a_ptr < a_end && a_ptr < n)
        *a_ptr++ = value;
    // trim the end down to a 64-byte boundary
    n = cast(T*)((cast(size_t)a_end) & ~cast(size_t)63);

    if (a_ptr < n && (a_end - a_ptr) >= 16) {
        // Aligned case
        if ((a_end - a_ptr) >= 200_000) {
            asm {
                mov ESI, a_ptr;
                mov EDI, n;

                movss XMM0, value; // XMM0 = value,value,value,value
                shufps XMM0, XMM0, 0;

                align 8;
                LOOP1:
                    add ESI, 64;
                    movntps [ESI+ 0-64], XMM0;
                    movntps [ESI+16-64], XMM0;
                    movntps [ESI+32-64], XMM0;
                    movntps [ESI+48-64], XMM0;
                    cmp ESI, EDI;
                jb LOOP1;

                mov a_ptr, ESI;
            }
        } else {
            asm {
                mov ESI, a_ptr;
                mov EDI, n;

                movss XMM0, value; // XMM0 = value,value,value,value
                shufps XMM0, XMM0, 0;

                align 8;
                LOOP2:
                    add ESI, 64;
                    movaps [ESI+ 0-64], XMM0;
                    movaps [ESI+16-64], XMM0;
                    movaps [ESI+32-64], XMM0;
                    movaps [ESI+48-64], XMM0;
                    cmp ESI, EDI;
                jb LOOP2;

                mov a_ptr, ESI;
            }
        }
    }

    // trailing ones
    while (a_ptr < a_end)
        *a_ptr++ = value;
}



So the D front end can implement:
int[] a = ...;
a[] = k;


as:

int[] a = ...;
if (a.length < 20)
  for (int i; i < a.length; i++)
    a[i] = k;
else
  memset4(a, k);

Arrays of other element types need slightly different code. Today most CPUs
support SSE, but in the uncommon situation where it's not available, C memset()
can be used instead of memset4 (the inlined loop for short arrays is useful
even when you use C memset).

I'm still ignorant of SSE asm, so the code I have written may contain bugs,
naive things, etc.

If such code can be debugged, and my timings are meaningful, then the code can
be put inside the D front-end, so LDC too can use it (another good option is
for LLVM to improve its memset intrinsic, so the LDC front-end doesn't need to
perform this transformation itself).
    
----------------------------------------

// benchmark code

version (Tango) {
    import tango.stdc.stdio: printf;
    import tango.stdc.time: clock, CLOCKS_PER_SEC;
    import tango.math.Math: sqrt;
    import tango.stdc.string: memset;
} else {
    import std.c.stdio: printf;
    import std.c.time: clock, CLOCKS_PER_SEC;
    import std.math: sqrt;
    import std.c.string: memset;
}


double myclock() {
    return cast(double)clock() / CLOCKS_PER_SEC;
}


void memset4_movaps(T)(T[] a, T value) {
    // to be used if a.length >~ 20, otherwise an inlined loop is faster
    
    static assert(T.sizeof == 4);
    if (!a.length)
        return;
    auto a_ptr = a.ptr;
    auto a_end = a_ptr + a.length;

    // align the pointer to 64 bytes, so the 64-bytes-per-iteration asm
    // loop below can never overshoot the trimmed end (the head loop is
    // also bounded by a_end, so short arrays are safe)
    size_t aux = ((cast(size_t)a_ptr + 63) & ~cast(size_t)63);
    auto n = cast(T*)aux;
    while (a_ptr < a_end && a_ptr < n)
        *a_ptr++ = value;
    // trim the end down to a 64-byte boundary
    n = cast(T*)((cast(size_t)a_end) & ~cast(size_t)63);

    if (a_ptr < n && (a_end - a_ptr) >= 16) {
        // Aligned case
        asm {
            mov ESI, a_ptr;
            mov EDI, n;

            movss XMM0, value; // XMM0 = value,value,value,value
            shufps XMM0, XMM0, 0;

            align 8;
            LOOP2:
                add ESI, 64;
                movaps [ESI+ 0-64], XMM0;
                movaps [ESI+16-64], XMM0;
                movaps [ESI+32-64], XMM0;
                movaps [ESI+48-64], XMM0;
                cmp ESI, EDI;
            jb LOOP2;

            mov a_ptr, ESI;
        }
    }

    // trailing ones
    while (a_ptr < a_end)
        *a_ptr++ = value;
}


void memset4_movntps(T)(T[] a, T value) {
    // to be used if a.length >~ 25, otherwise an inlined loop is faster
    
    static assert(T.sizeof == 4);
    if (!a.length)
        return;
    auto a_ptr = a.ptr;
    auto a_end = a_ptr + a.length;

    // align the pointer to 64 bytes, so the 64-bytes-per-iteration asm
    // loop below can never overshoot the trimmed end (the head loop is
    // also bounded by a_end, so short arrays are safe)
    size_t aux = ((cast(size_t)a_ptr + 63) & ~cast(size_t)63);
    auto n = cast(T*)aux;
    while (a_ptr < a_end && a_ptr < n)
        *a_ptr++ = value;
    // trim the end down to a 64-byte boundary
    n = cast(T*)((cast(size_t)a_end) & ~cast(size_t)63);

    if (a_ptr < n && (a_end - a_ptr) >= 16) {
        // Aligned case
        asm {
            mov ESI, a_ptr;
            mov EDI, n;

            movss XMM0, value; // XMM0 = value,value,value,value
            shufps XMM0, XMM0, 0;

            align 8;
            LOOP2:
                add ESI, 64;
                movntps [ESI+ 0-64], XMM0;
                movntps [ESI+16-64], XMM0;
                movntps [ESI+32-64], XMM0;
                movntps [ESI+48-64], XMM0;
                cmp ESI, EDI;
            jb LOOP2;

            mov a_ptr, ESI;
        }
    }

    // trailing ones
    while (a_ptr < a_end)
        *a_ptr++ = value;
}


T test(T, int ntest)(int len, int nloops) {
    auto a = new T[len];

    for (int i; i < nloops; i++) {
        static if (ntest == 0)
            for (int j; j < a.length; j++)
                a[j] = T.init;
        else static if (ntest == 1)
            memset(a.ptr, 0, a.length * T.sizeof);
        else static if (ntest == 2)
            memset4_movaps(a, 0);
        else static if (ntest == 3)
            memset4_movntps(a, 0);

        a[i % len] = i;
    }

    return a[0];
}


void show(float[] a) {
    printf("[");

    if (a.length > 1)
        foreach (el; a[0 .. $-1])
            printf("%.1f, ", el);
    if (a.length)
        printf("%.1f", a[$-1]);
    printf("]\n");
}


void main() {
    // small test
    auto a = new float[20];
    foreach (i, ref el; a)
        el = i;

    // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
    show(a);

    memset4_movaps(a[2 .. 2], 0.5f);

    // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
    show(a);

    memset4_movaps(a[2 .. 9], 0.5f);

    // [0, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
    show(a);
    printf("---------------------\n\n");

    auto lens = [2, 4, 5, 8, 15, 16, 20, 25, 32, 50, 64, 100, 250, 256, 1000,
                 2048, 2500, 1 << 12, 1 << 15, 1 << 16, 1 << 17, 210_000,
                 1 << 18, 1 << 19, 1 << 20];
    alias int T;
    
    printf("len, nloops, time with loop, time with memset, time with movaps, time with movntps:\n");
    foreach (len; lens) {
        int nloops;
        if (len >= 8000)
            nloops = cast(int)(cast(double)4_000_000 / sqrt(cast(double)len));
        else
            nloops = cast(int)(cast(double)60_000_000 / sqrt(cast(double)len));
            
        auto t0 = myclock();
        //test!(T, 0)(len, nloops);
        auto a2 = new T[len];
        for (int i; i < nloops; i++) {
                for (int j; j < a2.length; j++)
                    a2[j] = T.init;
            a2[i % len] = i;
        }
        auto t1 = myclock();

        auto t2 = myclock();
        test!(T, 1)(len, nloops);
        auto t3 = myclock();

        auto t4 = myclock();
        test!(T, 2)(len, nloops);
        auto t5 = myclock();

        auto t6 = myclock();
        test!(T, 3)(len, nloops);
        auto t7 = myclock();

        printf("%8d  %9d  %.3f  %.3f  %.3f  %.3f\n", len, nloops, t1-t0, t3-t2,
               t5-t4, t7-t6);
    }
}

----------------------------------------

Timings taken on a Celeron at 2.13 GHz (other CPUs will give different
results; I'll take timings on a Core 2 too):

       2   42426406  0.850  1.510  1.020  1.740
       4   30000000  0.570  1.180  0.850  0.860
       5   26832815  0.480  0.940  0.980  1.000
       8   21213203  0.380  0.720  0.860  0.850
      15   15491933  0.340  0.580  0.960  0.940
      16   15000000  0.350  0.560  0.400  4.420
      20   13416407  0.430  0.570  0.370  3.940
      25   12000000  0.480  0.540  0.320  3.550
      32   10606601  0.420  0.530  0.280  3.140
      50    8485281  0.500  0.530  0.360  2.450
      64    7500000  0.560  0.510  0.320  2.220
     100    6000000  0.690  0.490  0.290  2.020
     250    3794733  1.020  0.470  0.280  2.140
     256    3750000  0.990  0.460  0.260  2.080
    1000    1897366  2.070  0.640  0.340  3.460
    2048    1325825  2.890  0.880  0.470  4.750
    2500    1200000  3.240  0.920  0.500  5.300
    4096     937500  4.070  1.150  0.630  6.660
   32768      22097  0.570  0.320  0.250  1.300
   65536      15625  0.850  0.450  0.380  1.760
  131072      11048  1.630  1.150  1.240  2.500
  210000       8728  3.740  2.630  3.150  3.160
  262144       7812  5.480  4.050  5.040  3.550
  524288       5524  11.120  9.200  11.020  5.110
 1048576       3906  15.850  13.190  15.820  7.210


Graph in PNG:
http://i37.tinypic.com/v836e1.png

Bye,
bearophile
Sep 19 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
I think this version is a bit better:

void memset4(T)(T[] a, T value=T.init) {
    static assert (T.sizeof == 4);
    static assert (size_t.sizeof == (T*).sizeof);
    if (!a.length)
        return;
    auto a_ptr = a.ptr;
    auto a_end = a_ptr + a.length;

    // align pointer to 64 bytes, processing leading unaligned items
    // (the asm loops below write 64 bytes per iteration, so both the
    // start and the trimmed end must be 64-byte aligned; the loop is
    // also bounded by a_end, so short arrays are safe)
    size_t a_end_trimmed = (cast(size_t)a_ptr + 63) & ~cast(size_t)63;
    while (a_ptr < a_end && cast(size_t)a_ptr < a_end_trimmed)
        *a_ptr++ = value;

    // ending pointer minus the last % 64 bytes
    a_end_trimmed = cast(size_t)a_end & (~cast(size_t)63);

    //printf("%d %d %d %u\n", a_ptr, a_end, a_end_trimmed);
    //int counter1, counter2;

    if (a_end_trimmed - cast(size_t)a_ptr > (200_000 * T.sizeof))
        asm {
            mov ESI, a_ptr;
            mov EDI, a_end_trimmed;

            //pxor XMM0, XMM0; // XMMO = value, value, value, value
            // XMM0 = value,value,value,value
            movss XMM0, value;
            shufps XMM0, XMM0, 0;

            align 8;
            LOOP1: // writes 4 * 4 * 4 bytes each loop
                //inc counter1;
                add ESI, 64;
                movntps [ESI+ 0-64], XMM0;
                movntps [ESI+16-64], XMM0;
                movntps [ESI+32-64], XMM0;
                movntps [ESI+48-64], XMM0;
                cmp ESI, EDI;
            jb LOOP1;

            mov a_ptr, ESI;
        }
    else if (a_end_trimmed - cast(size_t)a_ptr > 16)
        asm {
            mov ESI, a_ptr;
            mov EDI, a_end_trimmed;

            //pxor XMM0, XMM0; // XMMO = value, value, value, value
            // XMM0 = value,value,value,value
            movss XMM0, value;
            shufps XMM0, XMM0, 0;

            align 8;
            LOOP2: // writes 4 * 4 * 4 bytes each loop
                //inc counter2;
                add ESI, 64;
                movaps [ESI+ 0-64], XMM0;
                movaps [ESI+16-64], XMM0;
                movaps [ESI+32-64], XMM0;
                movaps [ESI+48-64], XMM0;
                cmp ESI, EDI;
            jb LOOP2;

            mov a_ptr, ESI;
        }

    //printf("counter1, counter2: %d %d\n", counter1, counter2);

    // the last % 16 items
    while (a_ptr < a_end)
        *a_ptr++ = value;
}


Bye,
bearophile
Sep 20 2009
parent Jeremie Pelletier <jeremiep gmail.com> writes:
bearophile wrote:
 I think this version is a bit better:
 
 [snip]
 
 
 Bye,
 bearophile
This is quite interesting; it made me rethink my own uses of memset. By the way, you should first verify whether SSE is present through the cpuid instruction.
Sep 20 2009
prev sibling next sibling parent Don <nospam nospam.com> writes:
bearophile wrote:
 In a program I've seen that in the inner loop an array cleaning was taking too
much time. To solve the problem I've done many experiments, and I've also
produced the following testing program.
 
 The short summary is, to set array of 4 byte integers to a certain constant
the best was are:
 - if len <~ 20, then just use an inlined loop.
 - if 20 < len < 200_000 it's better to use a loop unrolled 4 times with the
movaps instruction (8 times unrolled is a little worse).
 - if n > 200_000 a loop with the movntps instruction is better.
 
 Generally such solutions are better than the memset() (only when len is about
150_000 memset is a bit better than four movaps).
Yeah, DMD's memset() and memcpy() are far from optimal. IIRC memcpy() is even worse. I had done a bit of work on it, as well, but when I posted preliminary stuff, there wasn't much interest. The general feedback seemed to be that it'd be more useful to fix the compiler ICE bugs. So I did that <g>. It'll be interesting to see what the priorities are now -- maybe this stuff is of more interest now. BTW the AMD manual for K7 (or might be K6 optimisation manual? don't exactly remember) goes into great detail about both memcpy() and memset(). Turns out there's about five different cases.
Sep 20 2009
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Don:

 It'll be interesting to see what the priorities are now -- 
 maybe this stuff is of more interest now.
Probably removing bugs is more important still :-) For example your work has changed a little how compile-time functions can be used in D.
 BTW the AMD manual for K7 (or might be K6 optimisation manual? don't 
 exactly remember) goes into great detail about both memcpy() and 
 memset(). Turns out there's about five different cases.
In the meantime Deewiant has told me that on 64 bit glibc memset is better and on more modern CPUs the timings are different (and on 64 bit my first version may not work, maybe the second one is better. I have not tested it on 64 bit LDC yet). I'm just a newbie on this stuff, while people that write the memset of 64bit glibc are expert. Bye and thank you, bearophile
Sep 20 2009
next sibling parent reply language_fan <foo bar.com.invalid> writes:
Sun, 20 Sep 2009 16:09:50 -0400, bearophile thusly wrote:

 I'm just a newbie on this stuff, while
 people that write the memset of 64bit glibc are expert.
I have been wondering why people here often complain that Java is slower than D. Every time I see your benchmarks, Java is actually doing just fine; sometimes it's even faster than D. Are your benchmarks somehow flawed every time Java beats D in performance?
Sep 20 2009
parent reply bearophile <bearophileHUGS lycos.com> writes:
language_fan:

 Every time I see your benchmarks, Java is actually doing just 
 fine, sometimes it's even faster than D. Are your benchmarks somehow 
 flawed every time Java wins D in performance?
I don't fully understand your question. The main purpose of some of the benchmarks I've done in the last few months is to find performance bugs (or sometimes just bugs) in LDC or LLVM (I have stopped doing something similar for DMD because people seem less interested in improving its performance). Now I often use code originally written in C++ and Java because I've seen that in most situations LLVM is able to produce good binaries from very C-like code (with few exceptions). My benchmarks aren't chosen randomly, I naturally focus on things that are slower in D, so sometimes you can see Java to "win". I usually discard the code where Java results slower :-) Installing and using a debug build of the JavaVM to see the assembly its JIT produces is not handy, but I sometimes do it. Once in a while you can find some surprises there. Recently I have found a nice single-file C++ program that heavily uses templates and partial specialization, but I am having problems translating it to good D to test templates in LDC because I don't know enough C++ yet. It finds the optimal solution to the 15 problems, and it's quite efficient: http://codepad.org/96ATkuhx Help/suggestions are welcome :-) Bye, bearophile
Sep 20 2009
parent reply language_fan <foo bar.com.invalid> writes:
Sun, 20 Sep 2009 18:16:37 -0400, bearophile thusly wrote:

 My benchmarks aren't chosen randomly, I naturally focus on things that
 are slower in D, so sometimes you can see Java to "win". I usually
 discard the code where Java results slower :-)
I have seen people many times mention that Java is in general orders of magnitude slower than D, no matter what kind of algorithms you run on both environments. This is because of the VM - nothing on a VM can run faster than native code, they say. If you decide to hide the bad results (for D), it will only reinforce the misinformation. I personally use a lot of heap memory allocation in my work, and so far Java has not only been safer (and provides decent stack traces and no compiler bugs), but also faster - each time.
Sep 21 2009
next sibling parent reply downs <default_357-line yahoo.de> writes:
language_fan wrote:
 Sun, 20 Sep 2009 18:16:37 -0400, bearophile thusly wrote:
 
 My benchmarks aren't chosen randomly, I naturally focus on things that
 are slower in D, so sometimes you can see Java to "win". I usually
 discard the code where Java results slower :-)
I have seen people many times mention that Java is in general orders of magnitude slower than D, no matter what kind of algorithms you run on both environments. This is because of the VM - nothing on a VM can run faster than native code, they say.
Wow. I actually haven't seen that argument in a relatively long time - it seems to me relatively debunked nowadays. (Personally, for me it comes down to "Java will always have a certain delay on startup, and its thoroughly object-oriented design forces heap access that is often unnecessary for solving the problem; plus the single-paradigm model is woefully constraining in comparison to more flexible languages.")
 I personally use a 
 lot of heap memory allocation in my work, and so far Java has not only 
 been safer (and provides decent stack traces and no compiler bugs), but 
 also faster - each time.
Yes, heap allocation is faster in Java. There's so much of it they pretty much had no choice but to tune it to hell and back :)
 If you decide to hide the bad results
 (for D), it will only reinforce the misinformation.
Um, he said he hides the bad results _for Java_.
Sep 21 2009
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
downs:

and its thoroughly object-oriented design forces heap access that is often
unnecessary for solving the problem;<
In D/C# you can use an array of structs, which sometimes allows better performance; in Java, to avoid the performance loss caused by an array of references, you may need several parallel arrays holding the individual fields of the struct. But the cache behavior of such parallel arrays is different: sometimes it's better and sometimes worse than the array of structs. If you have to access several fields of each struct at once, the array of structs is generally faster than the parallel arrays. If you have to scan only one or a few fields, the parallel arrays are faster because less data comes through the caches. Some time ago someone created a "parallel array" struct here to allow such layout choices at compile time. A virtual machine can in theory change such struct/array layout dynamically at runtime according to the access patterns to the data :-) I think similar dynamism in data structures may appear in the next generation of VM-based languages (you can implement such smart data structures in D too, but the data structure has to be mostly opaque and you have to forbid taking addresses of any data, because it can be moved and shuffled around).
Um, he said he hides the bad results _for Java_.<
Right. I generally don't like to hide things, and I show all data I have collected. Sometimes I don't show benchmarks where Java looks slow because they can be boring and because I am not currently interested in improving the JavaVM. Focusing on where LDC/LLVM do badly I can try to improve them. Bye, bearophile
Sep 21 2009
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
downs wrote:
 language_fan wrote:
 Sun, 20 Sep 2009 18:16:37 -0400, bearophile thusly wrote:

 My benchmarks aren't chosen randomly, I naturally focus on things that
 are slower in D, so sometimes you can see Java to "win". I usually
 discard the code where Java results slower :-)
I have seen people many times mention that Java is in general orders of magnitude slower than D, no matter what kind of algorithms you run on both environments. This is because of the VM - nothing on a VM can run faster than native code, they say.
Wow. I actually haven't seen that argument in a relatively long time - it seems to me relatively debunked nowadays. (Personally, for me it comes down to "Java will always have a certain delay on startup, and its thoroughly object-oriented design forces heap access that is often unnecessary for solving the problem; plus the single-paradigm model is woefully constraining in comparison to more flexible languages.")
 I personally use a 
 lot of heap memory allocation in my work, and so far Java has not only 
 been safer (and provides decent stack traces and no compiler bugs), but 
 also faster - each time.
Yes, heap allocation is faster in Java. There's so much of it they pretty much had no choice but to tune it to hell and back :)
 If you decide to hide the bad results
 (for D), it will only reinforce the misinformation.
Um, he said he hides the bad results _for Java_.
I guess that reinforces some other misinformation :o). Andrei
Sep 21 2009
prev sibling next sibling parent reply Don <nospam nospam.com> writes:
language_fan wrote:
 Sun, 20 Sep 2009 18:16:37 -0400, bearophile thusly wrote:
 
 My benchmarks aren't chosen randomly, I naturally focus on things that
 are slower in D, so sometimes you can see Java to "win". I usually
 discard the code where Java results slower :-)
I have seen people many times mention that Java is in general orders of magnitude slower than D, no matter what kind of algorithms you run on both environments.
I think you're creating a straw man. I don't recall anyone saying that. Certainly not "orders of magnitude" in general. In general, when D is faster, it'll be because more coding techniques are available. For example, because you're not forced to used OOP all the time.
 This is because of the VM - nothing on a VM can run 
 faster than native code, they say.  If you decide to hide the bad results
 (for D), it will only reinforce the misinformation. I personally use a 
 lot of heap memory allocation in my work, and so far Java has not only 
 been safer (and provides decent stack traces and no compiler bugs), but 
 also faster - each time.
On this ng, it seems to be universally acknowledged that Java's GC is much better than D's; and that Java's tool chain is orders of magnitude more mature than D. But, in the cases where Java beats D, it will almost always be because of the GC. LDC probably only gets beaten on the GC. Having looked at the DMD optimiser, I'm a bit surprised that it's competitive at all (especially in floating point). There is so much low-hanging fruit, it's practically an orchard <g>. It's hardly been touched since the mid-90's, while Java's had 15 years of massive commercial development during that time.
Sep 21 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 Having looked at the DMD optimiser, I'm a bit surprised that it's 
 competitive at all (especially in floating point). There is so much 
 low-hanging fruit, it's practically an orchard <g>.
I believe that's because it has reached the point of diminishing returns.
Sep 23 2009
parent reply Don <nospam nospam.com> writes:
Walter Bright wrote:
 Don wrote:
 Having looked at the DMD optimiser, I'm a bit surprised that it's 
 competitive at all (especially in floating point). There is so much 
 low-hanging fruit, it's practically an orchard <g>.
I believe that's because it has reached the point of diminishing returns.
I presume you're talking about integer optimisation, because that's definitely not the case for DMD's floating point. Eg, there's a comment in the code somewhere saying that CSE (common subexpression elimination) optimisations are not done for floating point because the code generator can't cope with it. It does loops particularly poorly: it never seems to keep a variable on the FP stack, so it keeps loading and saving them. The potential speedups are enormous.
Sep 23 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 I presume you're talking about integer optimisation, because that's 
 definitely not the case for DMD's floating point. Eg, there's a comment 
 in the code somewhere saying that CSE (common subexpression elimination) 
 optimisations are not done for floating point because the code generator 
 can't cope with it. It does loops particularly poorly, it never seems to 
 keep a variable on the FP stack, so it keeps loading and saving them. The 
 potential speedups are enormous.
I'll agree with you on that assessment.
Sep 23 2009
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
language_fan wrote:
 I have seen people many times mention that Java is in general orders of 
 magnitude slower than D, no matter what kind of algorithms you run on 
 both environments. This is because of the VM - nothing on a VM can run 
 faster than native code, they say. If you decide to hide the bad results 
 (for D), it will only reinforce the misinformation. I personally use a 
 lot of heap memory allocation in my work, and so far Java has not only 
 been safer (and provides decent stack traces and no compiler bugs), but 
 also faster - each time.
If you simply time a heap allocation in D and Java, Java will probably be faster not because of anything inherently faster about Java, but because Sun has poured billions of dollars into trying to make their heap allocator faster. However, a typical D program does far fewer allocations than the Java equivalent, for the simple reason that D supports stack allocated and embedded value aggregates while Java requires them to be on the heap. It is the much reduced need for the heap to be fast that is an advantage for D. Java does do some escape analysis to try and allocate heap objects on the stack instead, but I don't know how effective this is, and even that won't help if you try to embed a value aggregate into a class: struct S { int x, y, z; } class C { S v; // in Java this would require a second heap allocation } In other words, one of the most effective techniques for speeding up heap allocation costs is to design the data structures so they require fewer allocations!
Sep 23 2009
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

 Java does do some escape analysis to try and allocate heap objects on 
 the stack instead, but I don't know how effective this is, and even that 
 won't help if you try to embed a value aggregate into a class:
 
 struct S { int x, y, z; }
 
 class C
 {
      S v;   // in Java this would require a second heap allocation
 }
I have discussed with LDC devs an improvement of that idea (it was discussed in this newsgroup too): class C1 { int x, y, z; } class C2 { scope C1 c; } This idea raises a few problems, but I think they can be solved. You can special-case the management of such objects scope-allocated inside other objects. Or you can just add an invisible field to that C2 class, the reference to c. Bye, bearophile
Sep 23 2009
parent reply Jeremie Pelletier <jeremiep gmail.com> writes:
bearophile wrote:
 Walter Bright:
 
 Java does do some escape analysis to try and allocate heap objects on 
 the stack instead, but I don't know how effective this is, and even that 
 won't help if you try to embed a value aggregate into a class:

 struct S { int x, y, z; }

 class C
 {
      S v;   // in Java this would require a second heap allocation
 }
I have discussed with LDC devs an improvement of that idea (it was discussed in this newsgroup too): class C1 { int x, y, z; } class C2 { scope C1 c; } This idea raises a few problems, but I think they can be solved. You can special-case the management of such objects scope-allocated inside other objects. Or you can just add an invisible field to that C2 class, the reference to c. Bye, bearophile
I would love to see such a feature added in D; it would make object composition much more convenient and faster. You'd just need the compiler to force a call into C1's constructor within C2's and generate a call to C1's destructor in the epilog of C2's. I can't see how implementing that idea could be any harder, other than the parser-related support. No need to store a reference within the object; that would still require another heap allocation, killing the need for scope in the first place.
Sep 23 2009
next sibling parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Wed, Sep 23, 2009 at 10:52 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:
 I would love to see such a feature added in D, would make object composition
 much more convenient and faster.

 You'd just need the compiler to force a call into C1's constructor within
 C2's and generate a call to C1's destructor in the epilog of C2's. I can't
 see how implementing idea that could be any harder, other than the parser
 related support.
You are really hung up on parsing, aren't you ;) Parsing is a tiny, tiny, tiny fraction of the complexity of implementing a language, and most language features do not require any change to it. Really.
Sep 23 2009
parent Jeremie Pelletier <jeremiep gmail.com> writes:
Jarrett Billingsley wrote:
 On Wed, Sep 23, 2009 at 10:52 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:
 I would love to see such a feature added in D, would make object composition
 much more convenient and faster.

 You'd just need the compiler to force a call into C1's constructor within
 C2's and generate a call to C1's destructor in the epilog of C2's. I can't
 see how implementing idea that could be any harder, other than the parser
 related support.
You are really hung up on parsing, aren't you ;) Parsing is a tiny, tiny, tiny fraction of the complexity of implementing a language, and most language features do not require any change to it. Really.
Everything begins at parsing! You'd need to parse the scope lexeme in that context, then you'd need to adjust the appropriate parse node to contain the scope flag, which would require additional semantic analysis and finally a few changes in the IR generation. That's what I implied by parser "related" support. But yeah, I love parsing :)
Sep 23 2009
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Jeremie Pelletier:

 No need to store a reference within the object, that would still require 
 another heap allocation killing the need for scope in the first place.
Probably you have not understood what I meant. I surely didn't mean that it needs another heap allocation (the purpose of scope is to avoid that). Such an extra invisible reference field among that object's attributes is not necessary, but it makes things simpler. If you add such a reference, then inside the outer object you have the inner object followed by the reference to the inner object, both inlined (so no other heap allocations are necessary). So you can keep all the usual semantics of D objects and avoid most special-casing. For example, you can assign a different object to such a reference: class C1 { int x, y, z; } class C2 { scope C1 c1; } void main() { auto c2 = new C2; auto c = new C1; c2.c1 = c; // <--- see here } Now the memory of c1 inside c2 is not reachable anymore, so the destructor of that inner object can be called by the GC, but of course its actual memory will not be freed until c2 is deallocated. Bye, bearophile
Sep 23 2009
prev sibling parent reply Don <nospam nospam.com> writes:
bearophile wrote:
 Don:
 
 It'll be interesting to see what the priorities are now -- 
 maybe this stuff is of more interest now.
Probably removing bugs is more important still :-) For example your work has changed a little how compile-time functions can be used in D.
 BTW the AMD manual for K7 (or might be K6 optimisation manual? don't 
 exactly remember) goes into great detail about both memcpy() and 
 memset(). Turns out there's about five different cases.
In the meantime Deewiant has told me that on 64 bit glibc memset is better and on more modern CPUs the timings are different (and on 64 bit my first version may not work, maybe the second one is better. I have not tested it on 64 bit LDC yet). I'm just a newbie on this stuff, while people that write the memset of 64bit glibc are expert.
Really, memset() _should_ be optimal in all cases. On almost all compilers, it's not optimal, and on many (such as DMD) there's a _lot_ of room for improvement. So I consider this to be a C standard library implementation issue, rather than a language weakness.
Sep 21 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Don wrote:
 bearophile wrote:
 Don:

 It'll be interesting to see what the priorities are now -- maybe this 
 stuff is of more interest now.
Probably removing bugs is more important still :-) For example your work has changed a little how compile-time functions can be used in D.
 BTW the AMD manual for K7 (or might be K6 optimisation manual? don't 
 exactly remember) goes into great detail about both memcpy() and 
 memset(). Turns out there's about five different cases.
In the meantime Deewiant has told me that on 64 bit glibc memset is better and on more modern CPUs the timings are different (and on 64 bit my first version may not work, maybe the second one is better. I have not tested it on 64 bit LDC yet). I'm just a newbie on this stuff, while people that write the memset of 64bit glibc are expert.
Really, memset() _should_ be optimal in all cases. On almost all compilers, it's not optimal, and on many (such as DMD) there's a _lot_ of room for improvement. So I consider this to be a C standard library implementation issue, rather than a language weakness.
Don, I suggest the following. std.algorithm has a routine called fill(range, value) which semantically subsumes memset. I suggest you specialize fill() for contiguous memory ranges of primitive types (which shouldn't be hard with std.traits), and then optimize the heck out of it. You could do the same with copy(), also in std.algorithm, to implement a super-duper memcpy() routine. If you go this route people can uniformly use high-level algorithms that specialize themselves whenever applicable. Andrei
Sep 21 2009