digitalmars.D - WordCount performance

bearophile (61/61) Mar 26 2008 The following little program comes from a progressive stripping down of ...

Frits van Bommel (7/9) Mar 26 2008 The first thing that comes to mind: "For the Nth time, someone thinks

bearophile (4/6) Mar 26 2008 Yes, I am probably talking about problems in the DMD optimizer too, here...
Jason House (9/24) Mar 26 2008 I have a D program that's an AI engine. It competes against programs

Saaa (1/4) Mar 26 2008 Exactly the same here.

Derek Parnell (12/16) Mar 26 2008 Regardless of the application in question, if run time speed is the prim...

Saaa (5/12) Mar 26 2008 What if you can't do b nor have the time to learn b and see that another...

Silv3r (2/3) Mar 26 2008 Exactly, I wouldn't be able to put it any clearer myself. ;)
Walter Bright (21/23) Mar 27 2008 Because the insight gained by figuring out why one is faster than the

Saaa (11/32) Mar 27 2008 The first steps taken on this newsgroup regarding benchmarks(program spe...

Walter Bright (3/6) Mar 27 2008 D does match the optimization level of gcc: the GDC (gnu D compiler)

Don Clugston (7/14) Mar 27 2008 Is DMD ever likely to come close? Given the size TODO list for the D fro...

Walter Bright (4/14) Mar 27 2008 I think it already is close. My back end long ago reached the point

Dave (7/18) Mar 27 2008 Except for floating point optimizations, even if some loss of accuracy i...

Saaa (4/10) Mar 27 2008 How does DMD's optimization compare?

Walter Bright (5/8) Mar 27 2008 The backend is not only similar, it is exactly the same.

Dave (5/13) Mar 27 2008 Any ETA on when that will be worked into the optimizer?

Walter Bright (2/8) Mar 27 2008 Too many other things right now!

Dave (4/15) Mar 27 2008 The shootout benchmarks seem to indicate that in the general case DMD is...

bearophile (34/36) Mar 27 2008 Micro-benchmarks have the huge advantage that they allow you to find whe...

bearophile (5/6) Mar 26 2008 Relax, in many real situations D (compiled with DMD) is much faster. Tha...

Daniel (4/75) Mar 26 2008 As an exercise for the viewer, if you use SSE2, prefetch, and non-branch...

Saaa (1/5) Mar 26 2008 Please teach me!

TomD (24/24) Mar 27 2008 bearophile Wrote:

Walter Bright (3/5) Mar 27 2008 To get the fastest code out of dmd:

Walter Bright (28/28) Mar 27 2008 I did a little checking.

Saaa (1/6) Mar 27 2008 lol

Walter Bright (5/12) Mar 27 2008 The hilarity is that full library source was included with the compiler

Georg Wrede (5/18) Mar 27 2008 Hmm. What kind of situations would need multiple threads simultaneously

Walter Bright (5/9) Mar 27 2008 If it was unsynchronized, if you had two threads reading lines from the

Frits van Bommel (7/15) Mar 27 2008 I believe his point was "What kind of program would ever want to do

Georg Wrede (3/23) Mar 27 2008 Heck, I just spent ages formulating the same thing. You said it in a

Walter Bright (6/6) Mar 28 2008 You guys do have a point that perhaps syncing is not needed on stdin.

Sean Kelly (8/14) Mar 28 2008 Another option might be to check the thread count is greater than

Walter Bright (3/9) Mar 28 2008 You have to be very careful with such schemes because they can run afoul...

Frits van Bommel (51/61) Mar 28 2008 In this case though, if the thread count was 1 when checked it's safe to...

Frits van Bommel (2/3) Mar 28 2008 Bugzilla'd: http://d.puremagic.com/issues/show_bug.cgi?id=1957

Walter Bright (1/1) Mar 28 2008 Thanks for putting it in bugzilla.

Sean Kelly (7/16) Mar 28 2008 With the way this call is implemented in Tango, it is completely safe to...
Benji Smith (4/14) Mar 28 2008 I thought the double-checked locking idiom was only buggy in the Java

Walter Bright (2/4) Mar 29 2008 It's a common bug in C/C++ code.
Sean Kelly (5/18) Mar 29 2008 It's broken in C++ as well. It actually works in D however, provided on...

Kevin Bealer (26/47) Mar 29 2008 Couldn't you two mutexes to make this work even in C++ (but I'm using D ...

Sean Kelly (6/52) Mar 29 2008 second mutex doesn't actually protect anything, but it should be a porta...

janderson (6/16) Mar 29 2008 Perhaps a method of disabling locking would be useful. Either by simply...

Georg Wrede (30/42) Mar 27 2008 Yes. Now, my problem was, it should be pretty uncommon to need to have

bearophile (10/14) Mar 27 2008 Thank you taking a look at this benchmark. I understand that D is safer ...

bearophile <bearophileHUGS lycos.com> writes:

The following little program comes from a progressive stripping down of a
program I was creating. This C and D code give the approximate count of the
words in a file:

D version:

import std.c.stdio: printf, getchar, EOF;
import std.ctype: isspace;

void main() {
    int count, c;

  //OUTER:
    while (1) {
        while (1) {
            c = getchar();
            if (c == EOF)
                //break OUTER;
                goto END;
            if (!isspace(c))
                break;
        }

        count++;

        while (1) {
            c = getchar();
            if (c == EOF)
                //break OUTER;
                goto END;
            if (isspace(c))
                break;
        }
    }

  END:
    printf("%d\n", count);
}


C version:

#include <stdio.h>
#include <ctype.h>

int main() {
    int count = 0, c;

    while (1) {
        while (1) {
            c = getchar();
            if (c == EOF)
                goto END;
            if (!isspace(c))
                break;
        }

        count++;

        while (1) {
            c = getchar();
            if (c == EOF)
                goto END;
            if (isspace(c))
                break;
        }
    }

    END:
    printf("%d\n", count);
    return 0;
}


To test it, I have used a 7.5 MB file of real text. The C version (compiled
with MinGW 4.2.1) is ~7.8 times faster (0.43 s instead of 3.35 s) than that
very simpler code compiled with DMD (1.028). If I use a named break in the D
code (that OUTER), to avoid the goto, the running speed is essentially the same.
On a 50 MB file of text the timings are 2.43 s and 20.74 s (C version 8.5+
times faster).
Disabling the GC doesn't change running speed of the D version.
A 7-8 times difference on such simple program is big enough to make me curious,
do you know what the problem can be? (Maybe the getchar() as a function instead
of macro?)

Bye,
bearophile

Mar 26 2008

Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:

bearophile wrote:
 To test it, I have used a 7.5 MB file of real text. The C version (compiled
with MinGW 4.2.1) is ~7.8 times faster (0.43 s instead of 3.35 s) than that
very simpler code compiled with DMD (1.028). If I use a named break in the D
code (that OUTER), to avoid the goto, the running speed is essentially the same.
 On a 50 MB file of text the timings are 2.43 s and 20.74 s (C version 8.5+
times faster).

The first thing that comes to mind: "For the Nth time, someone thinks 
language X is 'faster' than language Y because of a comparison between 
compilers with completely different backends (and thus optimizers), even 
though there are pairs of compilers that use the same backend for either 
language."

Try comparing DMC vs. DMD, or GCC vs. GDC...

Mar 26 2008

bearophile <bearophileHUGS lycos.com> writes:

Frits van Bommel:
 because of a comparison between compilers with
 completely different backends (and thus optimizers),

Yes, I am probably talking about problems in the DMD optimizer too, here, that
may need to be addressed (or explained).

Bye,
bearophile

Mar 26 2008

Jason House <jason.james.house gmail.com> writes:

Frits van Bommel wrote:

 bearophile wrote:
 To test it, I have used a 7.5 MB file of real text. The C version
 (compiled with MinGW 4.2.1) is ~7.8 times faster (0.43 s instead of 3.35
 s) than that very simpler code compiled with DMD (1.028). If I use a
 named break in the D code (that OUTER), to avoid the goto, the running
 speed is essentially the same. On a 50 MB file of text the timings are
 2.43 s and 20.74 s (C version 8.5+ times faster).

 
 The first thing that comes to mind: "For the Nth time, someone thinks
 language X is 'faster' than language Y because of a comparison between
 compilers with completely different backends (and thus optimizers), even
 though there are pairs of compilers that use the same backend for either
 language."
 
 Try comparing DMC vs. DMD, or GCC vs. GDC...

I have a D program that's an AI engine.  It competes against programs
written in other languages and with different compilers.  Speed matters to
me, regardless of which compiler my competition uses.

One of my attractions to D (besides all the cool features) was that it
compiled to native assembly like C++.  I implicitly assumed I'd be getting
C++ (speed) without the warts.  When performance gaps are an order of
magitude, the D language seems like a horrible choice.  I've thought on
more than one occasion about going back to C++ or Java for the speed.

Mar 26 2008

"Saaa" <empty needmail.com> writes:

 I have a D program that's an AI engine.  It competes against programs
 written in other languages and with different compilers.  Speed matters to
 me, regardless of which compiler my competition uses.

Exactly the same here.

Mar 26 2008

Derek Parnell <derek nomail.afraid.org> writes:

On Thu, 27 Mar 2008 02:37:28 +0100, Saaa wrote:

 I have a D program that's an AI engine.  It competes against programs
 written in other languages and with different compilers.  Speed matters to
 me, regardless of which compiler my competition uses.

 Exactly the same here.

Regardless of the application in question, if run time speed is the primary
driver then ...

(a) optimize the algorithms
(b) code it in assembler
(c) profile it
(d) jmp (a)

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
27/03/2008 12:42:39 PM

Mar 26 2008

"Saaa" <empty needmail.com> writes:

 Regardless of the application in question, if run time speed is the 
 primary
 driver then ...

 (a) optimize the algorithms
 (b) code it in assembler
 (c) profile it
 (d) jmp (a)

What if you can't do b nor have the time to learn b and see that another 
compiler(language) runs faster without the need of b?
I love all the features D gives me and the prospect that somebody who knows 
assembler could optimize parts of the code later on.
But, I'd just wish for D to be as fast as C without the use of b.

Mar 26 2008

Silv3r <nomaily 127.0.0.1> writes:

Saaa Wrote:

 But, I'd just wish for D to be as fast as C without the use of b. 

Exactly, I wouldn't be able to put it any clearer myself. ;)

Mar 26 2008

Walter Bright <newshound1 digitalmars.com> writes:

Saaa wrote:
 What if you can't do b nor have the time to learn b and see that another 
 compiler(language) runs faster without the need of b?

Because the insight gained by figuring out why one is faster than the 
other will be very valuable to you in future projects. I know a lot of 
programmers who made mistaken assumptions about why one program was 
faster than another, assumptions that eventually wound up costing them 
dearly.

A sampling:

1) used the wrong compiler switches

2) the 'faster' program was faster because it had a disastrous bug in it

3) thought that their benchmark was measuring loop performance, when it 
was actually measuring the time spent in printf or malloc, rendering all 
their conclusions about how to write fast loops false.

4) thought they'd discovered a unique fast algorithm, when what was 
actually happening was the compiler optimizer had figured out how to 
eliminate a redundant DIV instruction

5) the benchmark proving that language A runs faster than language B 
turned out to be a result peculiar to that particular benchmark, and on 
the real program the reverse was true

So, essentially, if you don't know *why* the benchmark produced the 
results it did, you don't have enough information to draw any 
conclusions about the language.

Mar 27 2008

"Saaa" <empty needmail.com> writes:

 What if you can't do b nor have the time to learn b and see that another
 compiler(language) runs faster without the need of b?


 Because the insight gained by figuring out why one is faster than the 
 other will be very valuable to you in future projects. I know a lot of 
 programmers who made mistaken assumptions about why one program was faster 
 than another, assumptions that eventually wound up costing them dearly.

I don't really see how this is an answer to my question.
I suspect this is a more general post about language benchmarks :)

 A sampling:

 1) used the wrong compiler switches

 2) the 'faster' program was faster because it had a disastrous bug in it

 3) thought that their benchmark was measuring loop performance, when it 
 was actually measuring the time spent in printf or malloc, rendering all 
 their conclusions about how to write fast loops false.

 4) thought they'd discovered a unique fast algorithm, when what was 
 actually happening was the compiler optimizer had figured out how to 
 eliminate a redundant DIV instruction

 5) the benchmark proving that language A runs faster than language B 
 turned out to be a result peculiar to that particular benchmark, and on 
 the real program the reverse was true

 So, essentially, if you don't know *why* the benchmark produced the 
 results it did, you don't have enough information to draw any conclusions 
 about the language.

The first steps taken on this newsgroup regarding benchmarks(program speed)
are generally (as should be) 1, 2 and 3.
4 is sort of what I was talking about: I would like the D compiler to match 
the
optimization level of (the gcc or even intel:) C compiler.

I am not sure whether you meant to say the opposite of this but:
You don't need to understand asm to rule out 1,2 and 3; and know that the 
speed
difference lies within 4 (compiler optimizations).

Mar 27 2008

Walter Bright <newshound1 digitalmars.com> writes:

Saaa wrote:
 4 is sort of what I was talking about: I would like the D compiler to match 
 the
 optimization level of (the gcc or even intel:) C compiler.

D does match the optimization level of gcc: the GDC (gnu D compiler) 
uses gcc's optimizer and code generator.

Mar 27 2008

Don Clugston <dac nospam.com.au> writes:

Walter Bright wrote:
 Saaa wrote:
 4 is sort of what I was talking about: I would like the D compiler to 
 match the
 optimization level of (the gcc or even intel:) C compiler.

 
 D does match the optimization level of gcc: the GDC (gnu D compiler) 
 uses gcc's optimizer and code generator.

Is DMD ever likely to come close? Given the size TODO list for the D front-end, 
I'm having trouble imagining you'll get much time to work on the optimiser 
before 64 bit CPUs become completely dominant (forcing you to rewrite most of 
the code generator).

Unless there are significant features of the DMC optimiser which DMD still 
doesn't make use of?

Mar 27 2008

Walter Bright <newshound1 digitalmars.com> writes:

Don Clugston wrote:
 Walter Bright wrote:
 D does match the optimization level of gcc: the GDC (gnu D compiler) 
 uses gcc's optimizer and code generator.

 
 Is DMD ever likely to come close? Given the size TODO list for the D 
 front-end, I'm having trouble imagining you'll get much time to work on 
 the optimiser before 64 bit CPUs become completely dominant (forcing you 
 to rewrite most of the code generator).

I think it already is close. My back end long ago reached the point 
where it takes a lot more effort to get only tiny gains.


 Unless there are significant features of the DMC optimiser which DMD 
 still doesn't make use of?

No.

Mar 27 2008

"Dave" <Dave_member pathlink.com> writes:

"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:fsgqvr$vtd$1 digitalmars.com...
 Don Clugston wrote:
 Walter Bright wrote:
 D does match the optimization level of gcc: the GDC (gnu D compiler) 
 uses gcc's optimizer and code generator.

 Is DMD ever likely to come close? Given the size TODO list for the D 
 front-end, I'm having trouble imagining you'll get much time to work on 
 the optimiser before 64 bit CPUs become completely dominant (forcing you 
 to rewrite most of the code generator).

 I think it already is close. My back end long ago reached the point where 
 it takes a lot more effort to get only tiny gains.

Except for floating point optimizations, even if some loss of accuracy is 
encountered (because temporaries are kept in FP registers for example). I 
wish the DMD optimizer would do as good a job with FP as it does w/ int, 
unless I tell it not to for cases where accuracy is paramount. In that case, 
it would just have to fall back to what it's doing now.

Mar 27 2008

"Saaa" <empty needmail.com> writes:

 Saaa wrote:
 4 is sort of what I was talking about: I would like the D compiler to 
 match the
 optimization level of (the gcc or even intel:) C compiler.

 D does match the optimization level of gcc: the GDC (gnu D compiler) uses 
 gcc's optimizer and code generator.

How does DMD's optimization compare?

And maybe a silly question but how does optimizing D compare to optimizing 
C++.
Or is this exactly the same as the backend is that similar?

Mar 27 2008

Walter Bright <newshound1 digitalmars.com> writes:

Saaa wrote:
 And maybe a silly question but how does optimizing D compare to optimizing 
 C++.
 Or is this exactly the same as the backend is that similar? 

The backend is not only similar, it is exactly the same.

Where C++ and D differ is in various opportunities present in the 
language semantics. For example, D has invariants so it can do some more 
aggressive optimizations that C++ cannot.

Mar 27 2008

"Dave" <Dave_member pathlink.com> writes:

"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:fsgrbs$12lf$2 digitalmars.com...
 Saaa wrote:
 And maybe a silly question but how does optimizing D compare to 
 optimizing C++.
 Or is this exactly the same as the backend is that similar?

 The backend is not only similar, it is exactly the same.

 Where C++ and D differ is in various opportunities present in the language 
 semantics. For example, D has invariants so it can do some more aggressive 
 optimizations that C++ cannot.

Any ETA on when that will be worked into the optimizer?

Thanks,

- Dave

Mar 27 2008

Walter Bright <newshound1 digitalmars.com> writes:

Dave wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 Where C++ and D differ is in various opportunities present in the 
 language semantics. For example, D has invariants so it can do some 
 more aggressive optimizations that C++ cannot.

 
 Any ETA on when that will be worked into the optimizer?

Too many other things right now!

Mar 27 2008

"Dave" <Dave_member pathlink.com> writes:

"Saaa" <empty needmail.com> wrote in message 
news:fsglqt$5n2$1 digitalmars.com...
 Saaa wrote:
 4 is sort of what I was talking about: I would like the D compiler to 
 match the
 optimization level of (the gcc or even intel:) C compiler.

 D does match the optimization level of gcc: the GDC (gnu D compiler) uses 
 gcc's optimizer and code generator.

 How does DMD's optimization compare?

 And maybe a silly question but how does optimizing D compare to optimizing 
 C++.
 Or is this exactly the same as the backend is that similar?

The shootout benchmarks seem to indicate that in the general case DMD is 
pretty close to GCC except for floating point (again, in general).

Mar 27 2008

bearophile <bearophileHUGS lycos.com> writes:

Dave Wrote:
 The shootout benchmarks seem to indicate that in the general case DMD is 
 pretty close to GCC except for floating point (again, in general).

Micro-benchmarks have the huge advantage that they allow you to find where a
(micro) performance problem may be, the following is a stripped down benchmark
from the Shootout:

D version:

import std.stdio, std.conv;
double fibo(double n) {
    if (n < 2) return 1.0;
    return fibo(n-2) + fibo(n-1);
}
void main(string[] args) {
    int n = args.length == 2 ? toInt(args[1]) : 1;
    printf("%.1f\n", fibo(n));
}

C version:

#include <stdio.h>
double fibo(double n) {
    if (n < 2) return 1.0;
    return fibo(n-2) + fibo(n-1);
}
int main(int argc, char ** argv) {
    int n = argc == 2 ? atoi(argv[1]) : 1;
    printf("%.1f\n", fibo(n));
    return 0;
}

The difference (DMD Vs GCC) is visible.
In the case of GCC/DMC you can add the:
-fomit-frame
compilation flag to increase performance, and you can even replace this line:
if (n < 2.0) return 1.0;
With:
if (__builtin_expect(n < 2.0, 0)) return 1.0;
In such situation the runtime timings are (n = 37 on my very old PC) 2.36s Vs
3.11s DMD.
(I don't know how much difficult/work is to add an expect to D).

Bye,
bearophile

Mar 27 2008

bearophile <bearophileHUGS lycos.com> writes:

Jason House:
 When performance gaps are an order of magitude,

Relax, in many real situations D (compiled with DMD) is much faster. That's a
very artificial example.
And I'd like to know if my timings are right, because they may be just wrong
after all. Trusting me blindly isn't good.

Bye,
bearophile

Mar 26 2008

Daniel <murpsoft hotmail.com> writes:

bearophile Wrote:

 The following little program comes from a progressive stripping down of a
program I was creating. This C and D code give the approximate count of the
words in a file:
 
 D version:
 
 import std.c.stdio: printf, getchar, EOF;
 import std.ctype: isspace;
 
 void main() {
     int count, c;
 
   //OUTER:
     while (1) {
         while (1) {
             c = getchar();
             if (c == EOF)
                 //break OUTER;
                 goto END;
             if (!isspace(c))
                 break;
         }
 
         count++;
 
         while (1) {
             c = getchar();
             if (c == EOF)
                 //break OUTER;
                 goto END;
             if (isspace(c))
                 break;
         }
     }
 
   END:
     printf("%d\n", count);
 }
 
 
 C version:
 
 #include <stdio.h>
 #include <ctype.h>
 
 int main() {
     int count = 0, c;
 
     while (1) {
         while (1) {
             c = getchar();
             if (c == EOF)
                 goto END;
             if (!isspace(c))
                 break;
         }
 
         count++;
 
         while (1) {
             c = getchar();
             if (c == EOF)
                 goto END;
             if (isspace(c))
                 break;
         }
     }
 
     END:
     printf("%d\n", count);
     return 0;
 }
 [snip]

As an exercise for the viewer, if you use SSE2, prefetch, and non-branching
bitmath you can perform this roughly 32 times as fast.

Regards,
Daniel

Mar 26 2008

"Saaa" <empty needmail.com> writes:

 As an exercise for the viewer, if you use SSE2, prefetch, and 
 non-branching bitmath you can perform this roughly 32 times as fast.

 Regards,
 Daniel

Please teach me!

Mar 26 2008

TomD <t_demmer nospam.web.de> writes:

bearophile Wrote:
[...]

Seems to be a backend problem:
dmd -O wcd.d
dmc -6  wcc.c
gcc -O3 -mno-cygwin -o wccm wcc.d

On my 7.5MB iTunes XML file:
$ time ./wcd < it.xml
303412

real    0m1.078s
user    0m0.015s
sys     0m0.015s

$ time ./wcc < it.xml
303412

real    0m1.063s
user    0m0.031s
sys     0m0.000s

$ time ./wccm < it.xml
303412

real    0m0.172s
user    0m0.015s
sys     0m0.015s

Even if the usage of opt-switches is argueable (sp??) here, 
it is rather a backe-end than language thing.

Mar 27 2008

Walter Bright <newshound1 digitalmars.com> writes:

TomD wrote:
 Seems to be a backend problem:
 dmd -O wcd.d

To get the fastest code out of dmd:

     dmd -release -inline -O wcd.d

Mar 27 2008

Walter Bright <newshound1 digitalmars.com> writes:

I did a little checking.

D's and DMC's isspace(c) checks to see if the input character c is 
ascii. gcc's does not. So there's a little slowdown there, buying 
robustness.

What the real difference is, however, is that D's and DMC's getchar() is 
synchronized for multiple threads. This means that there's a lock/unlock 
going on for *every* single character read.

gcc may not synchronize stdin for multiple threads, or may have found a 
way to avoid the locking if the start-new-thread function in the library 
is never called.

For the Digital Mars stdio library, the way to do fast I/O from the 
synchronized streams is to do it in chunks, not character-by-character. 
Several functions in std.stdio are available to do this - I bet you'll 
get a dramatic speed boost by using one of them.

In other words, from examining the generated asm for dmd and gcc it's 
pretty clear that the observed speed difference has essentially 
*nothing* to do with either the language or the back end optimizer, and 
everything to do with the I/O strategy.

This is why it is crucial to figure out why one is getting certain 
results from a benchmark, instead of just assuming.

P.S. In general, it's just a bad idea to do compute performance 
benchmarks using I/O, because the performance will be dominated by the 
relatively slow I/O.

P.P.S. Back in the 80's, my compiler would regularly beat the tar out of 
other compilers on magazine I/O speed benchmarks. It's because my 
library used a large disk buffer, which made a HUGE difference when 
reading files from a floppy disk. My competitors never did figure that 
out until about 10 years later <g>.

Mar 27 2008

"Saaa" <empty needmail.com> writes:

 P.P.S. Back in the 80's, my compiler would regularly beat the tar out of 
 other compilers on magazine I/O speed benchmarks. It's because my library 
 used a large disk buffer, which made a HUGE difference when reading files 
 from a floppy disk. My competitors never did figure that out until about 
 10 years later <g>.

lol

Mar 27 2008

Walter Bright <newshound1 digitalmars.com> writes:

Saaa wrote:
 P.P.S. Back in the 80's, my compiler would regularly beat the tar out of 
 other compilers on magazine I/O speed benchmarks. It's because my library 
 used a large disk buffer, which made a HUGE difference when reading files 
 from a floppy disk. My competitors never did figure that out until about 
 10 years later <g>.

 
 lol

The hilarity is that full library source was included with the compiler 
- what I was doing was in plain sight!

Understanding how I/O works is still key to making high performance 
programs, despite there being very little discussion about it.

Mar 27 2008

Georg Wrede <georg nospam.org> writes:

Walter Bright wrote:
 I did a little checking.
 
 D's and DMC's isspace(c) checks to see if the input character c is 
 ascii. gcc's does not. So there's a little slowdown there, buying 
 robustness.
 
 What the real difference is, however, is that D's and DMC's getchar() is 
 synchronized for multiple threads. This means that there's a lock/unlock 
 going on for *every* single character read.
 
 gcc may not synchronize stdin for multiple threads, or may have found a 
 way to avoid the locking if the start-new-thread function in the library 
 is never called.

Hmm. What kind of situations would need multiple threads simultaneously 
reading from stdin? And if there aren't any, wouldn't locking be like 
wearing a life vest on land?

And isspace, is that really the right place to check for ascii?

Mar 27 2008

Walter Bright <newshound1 digitalmars.com> writes:

Georg Wrede wrote:
 Hmm. What kind of situations would need multiple threads simultaneously 
 reading from stdin? And if there aren't any, wouldn't locking be like 
 wearing a life vest on land?

If it was unsynchronized, if you had two threads reading lines from the 
input, the contents of the lines would be interleaved. With syncing, 
each thread will get complete lines.

 And isspace, is that really the right place to check for ascii?

It has to be, because its argument is an int, not a char.

Mar 27 2008

Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:

Walter Bright wrote:
 Georg Wrede wrote:
 Hmm. What kind of situations would need multiple threads 
 simultaneously reading from stdin? And if there aren't any, wouldn't 
 locking be like wearing a life vest on land?

 
 If it was unsynchronized, if you had two threads reading lines from the 
 input, the contents of the lines would be interleaved. With syncing, 
 each thread will get complete lines.

I believe his point was "What kind of program would ever want to do 
character-level stdin I/O from multiple threads?"
My follow-up would be: "And if there *is* some exotic program that needs 
to do that, why can't you just require it to provide its own damn 
locking, and leave the rest of us with faster I/O?" (Of course, I use 
Linux, so AFAIK I'm using glibc anyway)

Mar 27 2008

Georg Wrede <georg nospam.org> writes:

Frits van Bommel wrote:
 Walter Bright wrote:
 
 Georg Wrede wrote:

 Hmm. What kind of situations would need multiple threads 
 simultaneously reading from stdin? And if there aren't any, wouldn't 
 locking be like wearing a life vest on land?


 If it was unsynchronized, if you had two threads reading lines from 
 the input, the contents of the lines would be interleaved. With 
 syncing, each thread will get complete lines.

 
 
 I believe his point was "What kind of program would ever want to do 
 character-level stdin I/O from multiple threads?"
 My follow-up would be: "And if there *is* some exotic program that needs 
 to do that, why can't you just require it to provide its own damn 
 locking, and leave the rest of us with faster I/O?" (Of course, I use 
 Linux, so AFAIK I'm using glibc anyway)

Heck, I just spent ages formulating the same thing. You said it in a 
couple of sentences.

Mar 27 2008

Walter Bright <newshound1 digitalmars.com> writes:

You guys do have a point that perhaps syncing is not needed on stdin. 
But I'm a little leery of removing it, as it is needed for output, and 
I'm not comfortable that this is thought through thoroughly.

But my other point is that this is an issue with the Digital Mars C 
runtime library, and has nothing whatsoever to do with D or the 
optimizer/backend of D.

Mar 28 2008

Sean Kelly <sean invisibleduck.org> writes:

== Quote from Walter Bright (newshound1 digitalmars.com)'s article
 You guys do have a point that perhaps syncing is not needed on stdin.
 But I'm a little leery of removing it, as it is needed for output, and
 I'm not comfortable that this is thought through thoroughly.
 But my other point is that this is an issue with the Digital Mars C
 runtime library, and has nothing whatsoever to do with D or the
 optimizer/backend of D.

Another option might be to check the thread count is greater than
1 and only lock if it is.  Tango has a routine called thread_needLock
for this purpose, though it goes a bit farther and is true once a
thread has been created through program termination.  This avoids
problems with the situation where you have multiple threads running
and all but one terminate but memory is not yet synchronized.


Sean

Mar 28 2008

Walter Bright <newshound1 digitalmars.com> writes:

Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

You have to be very careful with such schemes because they can run afoul 
of the notoriously difficult to comprehend "double checked locking" bug.

Mar 28 2008

Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:

Walter Bright wrote:
 Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

 
 You have to be very careful with such schemes because they can run afoul 
 of the notoriously difficult to comprehend "double checked locking" bug.

In this case though, if the thread count was 1 when checked it's safe to 
proceed without locking as long as you don't spawn any new threads. If 
it wasn't when checking, but became 1 afterwards it doesn't really 
matter; you may not have needed to lock it but it'll work fine if you 
do. (Just make sure the code path that locks also (unconditionally) 
unlocks, and of course that the code path that doesn't lock doesn't try 
to unlock either[1])

Oh, and the Phobos gc also has a thread_needLock() (in 
phobos/internal/gc/gcx.d).


[1]
---
if (!thread_needLock()) {
     foo();
} else synchronized(X) {
     foo();
}
---
does that nicely, and is what the GC uses. (Both in Phobos and Tango)



Hmm, reading the relevant parts of the GC code just now, something 
occurred to me. Some background first:
If allocation triggers an actual collection it checks thread_needLock() 
again, and locks for the duration of the collection if there's only one 
thread. The comment explains this is done because finalizers may start 
new threads.

However, the lock is then released *before* using the newly-collected 
heap to perform the actual allocation. That makes me wonder what happens 
if the first thing such a new thread does is allocating some memory...

In other words:
1) The only thread starts an allocation, determining the lock is not needed.
2) There's no space to allocate, and the GC prepares for a collection.
3) The GC notices the number of threads is 1, and acquires the lock, and 
starts performing the collection.
4) A finalizer starts a new thread, which attempts to allocate and 
blocks on the GC lock held by the collector.
5) The original thread finishes the collection and *releases the lock*.
6) It then determines what memory location to return from 'new'.
7) Thread switch, the second thread acquires the GC lock (which is no 
longer held by the first thread) and starts its own allocation 
activities. Since the original thread didn't yet mark its chosen memory 
as allocated, the second thread picks the same memory and marks it as 
allocated.
8) Thread switch, the first thread finishes its allocation by marking 
the memory as allocated. (even though it already was marked by the 
second thread)
9) Both threads start using the same piece of memory as if they have the 
only reference to it (which should be true from either perspective, 
since it was just returned from 'new').
10) *KABOOM*

(I may have just proven your point about these things being very hard to 
get right...)

Mar 28 2008

Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:

Frits van Bommel wrote:
 10) *KABOOM*

Bugzilla'd: http://d.puremagic.com/issues/show_bug.cgi?id=1957

Mar 28 2008

Walter Bright <newshound1 digitalmars.com> writes:

Thanks for putting it in bugzilla.

Mar 28 2008

Sean Kelly <sean invisibleduck.org> writes:

== Quote from Walter Bright (newshound1 digitalmars.com)'s article
 Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

 You have to be very careful with such schemes because they can run afoul
 of the notoriously difficult to comprehend "double checked locking" bug.

With the way this call is implemented in Tango, it is completely safe to omit
a mutex lock on a false return so long as the ostensibly locked code does
not call into code that may create a thread.  For something like an IO
operation, this seems pretty unlikely so it's a reasonable optimization, if
perhaps progressively less useful as multithreading becomes more common.


Sean

Mar 28 2008

Benji Smith <benji benjismith.net> writes:

Walter Bright wrote:
 Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

 
 You have to be very careful with such schemes because they can run afoul 
 of the notoriously difficult to comprehend "double checked locking" bug.

I thought the double-checked locking idiom was only buggy in the Java 
memory model (prior to being fixed in the 1.6 JDK).

--benji smith

Mar 28 2008

Walter Bright <newshound1 digitalmars.com> writes:

Benji Smith wrote:
 I thought the double-checked locking idiom was only buggy in the Java 
 memory model (prior to being fixed in the 1.6 JDK).

It's a common bug in C/C++ code.

Mar 29 2008

Sean Kelly <sean invisibleduck.org> writes:

== Quote from Benji Smith (benji benjismith.net)'s article
 Walter Bright wrote:
 Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

 You have to be very careful with such schemes because they can run afoul
 of the notoriously difficult to comprehend "double checked locking" bug.

 I thought the double-checked locking idiom was only buggy in the Java
 memory model (prior to being fixed in the 1.6 JDK).

It's broken in C++ as well.  It actually works in D however, provided one uses
'volatile' in the right places.  I posted a demo impl a few years back either in
this forum or in D.announce.


Sean

Mar 29 2008

Kevin Bealer <kevinbealer gmail.com> writes:

Sean Kelly Wrote:

 == Quote from Benji Smith (benji benjismith.net)'s article
 Walter Bright wrote:
 Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

 You have to be very careful with such schemes because they can run afoul
 of the notoriously difficult to comprehend "double checked locking" bug.

 I thought the double-checked locking idiom was only buggy in the Java
 memory model (prior to being fixed in the 1.6 JDK).

 
 It's broken in C++ as well.  It actually works in D however, provided one uses
 'volatile' in the right places.  I posted a demo impl a few years back either
in
 this forum or in D.announce.
 
 
 Sean

Couldn't you two mutexes to make this work even in C++ (but I'm using D syntax
for simplicity)?  The second mutex doesn't actually protect anything, but it
should be a portable way to include the memory barriers.  Actually "portable"
is a weasel word since all C++ locking and multithreadedness is done in
platform depended libraries.

class Foo {
private:
    Object data;
    Mutex a, b;

public:
    Object getData()
    {
        if (data is null) {
            a.lock();
            if (data is null) {
                Object o = null;
                {
                    b.lock();
                    o = new Object;
                    b.unlock();
                }
                data = o;
            }
            a.unlock();
        }
        return data;
    }
};

Kevin

Mar 29 2008

Sean Kelly <sean invisibleduck.org> writes:

== Quote from Kevin Bealer (kevinbealer gmail.com)'s article
 Sean Kelly Wrote:
 == Quote from Benji Smith (benji benjismith.net)'s article
 Walter Bright wrote:
 Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

 You have to be very careful with such schemes because they can run afoul
 of the notoriously difficult to comprehend "double checked locking" bug.

 I thought the double-checked locking idiom was only buggy in the Java
 memory model (prior to being fixed in the 1.6 JDK).

 It's broken in C++ as well.  It actually works in D however, provided one uses
 'volatile' in the right places.  I posted a demo impl a few years back either
in
 this forum or in D.announce.


 Sean

 Couldn't you two mutexes to make this work even in C++ (but I'm using D syntax
for simplicity)?  The

second mutex doesn't actually protect anything, but it should be a portable way
to include the memory
barriers.  Actually "portable" is a weasel word since all C++ locking and
multithreadedness is done in
platform depended libraries.
 class Foo {
 private:
     Object data;
     Mutex a, b;
 public:
     Object getData()
     {
         if (data is null) {
             a.lock();
             if (data is null) {
                 Object o = null;
                 {
                     b.lock();
                     o = new Object;
                     b.unlock();
                 }
                 data = o;
             }
             a.unlock();
         }
         return data;
     }
 };

That seems like it might work.


Sean

Mar 29 2008

janderson <askme me.com> writes:

Walter Bright wrote:
 Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

 
 You have to be very careful with such schemes because they can run afoul 
 of the notoriously difficult to comprehend "double checked locking" bug.

Perhaps a method of disabling locking would be useful.  Either by simply 
providing a lock free version of the function or by being able to turn 
locking on/off.   Either way the user would be the one who would 
explicitly disable it.

-Joel

Mar 29 2008

Georg Wrede <georg nospam.org> writes:

Walter Bright wrote:
 Georg Wrede wrote:
 
 Hmm. What kind of situations would need multiple threads 
 simultaneously reading from stdin? And if there aren't any, wouldn't 
 locking be like wearing a life vest on land?

 
 If it was unsynchronized, if you had two threads reading lines from the 
 input, the contents of the lines would be interleaved. With syncing, 
 each thread will get complete lines.

Yes. Now, my problem was, it should be pretty uncommon to need to have 
two (or of course more) threads reading the same input within the same 
program (and hugely more uncommon for stdin, since the _entry_ to stdin 
is implemented in a _single_ process anyway in the operating system, 
even if several simultaneous processes were to feed it data). Therefore 
there is no way a process can aquire data at a rate faster than that of 
a single processor. So, it would only be relevant to an application that

  - is disturbed by having the input mistakenly be split into smaller 
than one-line pieces

  - is happy with a line at a time (i.e. lines are self-sufficient) vs. 
needing to split it into logical entities potentially larger than a line 
(like a parser), any of which may be given to one of the threads.

A programmer even dreaming of implementing this, most likely will 
implement the input itself as a single thread, dispatching the data 
between multiple threads. The assumptions being

  - reading stdin is conceptually ill suited to multithreading

  - consuming the data is slower than reading [and dispatching] it

----

Anyhow, I've never heard even rumors of multiple threads reading _stdin_.

No programmer contemplating such would be newbie enough to actually try 
it. Besides, he'd never _expect_ reading stdio to be thread_safe, 
therefore he'd not even check the documentation.

All in all, I still think it's wearing a life vest on land. And crapping 
up throughput.

 And isspace, is that really the right place to check for ascii?

 
 It has to be, because its argument is an int, not a char.

Ah, ok.

(Not to be a perennial PITA,) but the original context of your statement 
was in a comparison between D and C++. Now, this might have given the 
casual reader the assumption that C++ doesn't have this int thing here, 
making C++ faster. You may want to elaborate on that for the regular reader.

Mar 27 2008

bearophile <bearophileHUGS lycos.com> writes:

Walter Bright:
 gcc may not synchronize stdin for multiple threads, or may have found a 
 way to avoid the locking if the start-new-thread function in the library 
 is never called.

Thank you taking a look at this benchmark. I understand that D is safer
regarding non-ASCII, but 8 times slower seems a lot.
I think GCC has functions like fgets_unlocked, getc_unlocked, etc, (that I
think DMD std lib misses)  for single thread programs:
http://linux.die.net/man/3/fgets_unlocked
http://www.opengroup.org/onlinepubs/009695399/functions/getc_unlocked.html


For the Digital Mars stdio library, the way to do fast I/O from the
synchronized streams is to do it in chunks, not character-by-character. Several
functions in std.stdio are available to do this - I bet you'll get a dramatic
speed boost by using one of them.<

At the moment DMD is two times slower than Clean in number reading benchmark,
so there's space for improvements there too:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all
http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all

Bye, and thank you,
bearophile

Mar 27 2008

D Programming

C/C++ Programming

Other

digitalmars.D - WordCount performance