
digitalmars.D - WordCount performance

reply bearophile <bearophileHUGS lycos.com> writes:
The following little program comes from a progressive stripping down of a
program I was creating. This C and D code give the approximate count of the
words in a file:

D version:

import std.c.stdio: printf, getchar, EOF;
import std.ctype: isspace;

void main() {
    int count, c;

  //OUTER:
    while (1) {
        while (1) {
            c = getchar();
            if (c == EOF)
                //break OUTER;
                goto END;
            if (!isspace(c))
                break;
        }

        count++;

        while (1) {
            c = getchar();
            if (c == EOF)
                //break OUTER;
                goto END;
            if (isspace(c))
                break;
        }
    }

  END:
    printf("%d\n", count);
}


C version:

#include <stdio.h>
#include <ctype.h>

int main() {
    int count = 0, c;

    while (1) {
        while (1) {
            c = getchar();
            if (c == EOF)
                goto END;
            if (!isspace(c))
                break;
        }

        count++;

        while (1) {
            c = getchar();
            if (c == EOF)
                goto END;
            if (isspace(c))
                break;
        }
    }

    END:
    printf("%d\n", count);
    return 0;
}


To test it, I have used a 7.5 MB file of real text. The C version (compiled
with MinGW 4.2.1) is ~7.8 times faster (0.43 s instead of 3.35 s) than this
very simple code compiled with DMD (1.028). If I use a labeled break in the D
code (that OUTER) to avoid the goto, the running speed is essentially the same.
On a 50 MB file of text the timings are 2.43 s and 20.74 s (C version 8.5+
times faster).
Disabling the GC doesn't change the running speed of the D version.
A 7-8 times difference on such a simple program is big enough to make me
curious; do you know what the problem can be? (Maybe getchar() being a
function instead of a macro?)

Bye,
bearophile
Mar 26 2008
next sibling parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
bearophile wrote:
 To test it, I have used a 7.5 MB file of real text. The C version (compiled
with MinGW 4.2.1) is ~7.8 times faster (0.43 s instead of 3.35 s) than that
very simpler code compiled with DMD (1.028). If I use a named break in the D
code (that OUTER), to avoid the goto, the running speed is essentially the same.
 On a 50 MB file of text the timings are 2.43 s and 20.74 s (C version 8.5+
times faster).

The first thing that comes to mind: "For the Nth time, someone thinks language X is 'faster' than language Y because of a comparison between compilers with completely different backends (and thus optimizers), even though there are pairs of compilers that use the same backend for either language." Try comparing DMC vs. DMD, or GCC vs. GDC...
Mar 26 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Frits van Bommel:
 because of a comparison between compilers with
 completely different backends (and thus optimizers),

Yes, I am probably talking about problems in the DMD optimizer too, here, that may need to be addressed (or explained). Bye, bearophile
Mar 26 2008
prev sibling parent reply Jason House <jason.james.house gmail.com> writes:
Frits van Bommel wrote:

 bearophile wrote:
 To test it, I have used a 7.5 MB file of real text. The C version
 (compiled with MinGW 4.2.1) is ~7.8 times faster (0.43 s instead of 3.35
 s) than that very simpler code compiled with DMD (1.028). If I use a
 named break in the D code (that OUTER), to avoid the goto, the running
 speed is essentially the same. On a 50 MB file of text the timings are
 2.43 s and 20.74 s (C version 8.5+ times faster).

The first thing that comes to mind: "For the Nth time, someone thinks language X is 'faster' than language Y because of a comparison between compilers with completely different backends (and thus optimizers), even though there are pairs of compilers that use the same backend for either language." Try comparing DMC vs. DMD, or GCC vs. GDC...

I have a D program that's an AI engine. It competes against programs written in other languages and with different compilers. Speed matters to me, regardless of which compiler my competition uses. One of my attractions to D (besides all the cool features) was that it compiled to native assembly like C++. I implicitly assumed I'd be getting C++ (speed) without the warts. When performance gaps are an order of magnitude, the D language seems like a horrible choice. I've thought on more than one occasion about going back to C++ or Java for the speed.
Mar 26 2008
next sibling parent reply "Saaa" <empty needmail.com> writes:
 I have a D program that's an AI engine.  It competes against programs
 written in other languages and with different compilers.  Speed matters to
 me, regardless of which compiler my competition uses.

Mar 26 2008
parent reply Derek Parnell <derek nomail.afraid.org> writes:
On Thu, 27 Mar 2008 02:37:28 +0100, Saaa wrote:

 I have a D program that's an AI engine.  It competes against programs
 written in other languages and with different compilers.  Speed matters to
 me, regardless of which compiler my competition uses.


Regardless of the application in question, if run time speed is the primary
driver then ...

(a) optimize the algorithms
(b) code it in assembler
(c) profile it
(d) jmp (a)

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
27/03/2008 12:42:39 PM
Mar 26 2008
parent reply "Saaa" <empty needmail.com> writes:
 Regardless of the application in question, if run time speed is the 
 primary
 driver then ...

 (a) optimize the algorithms
 (b) code it in assembler
 (c) profile it
 (d) jmp (a)

What if you can't do (b), nor have the time to learn it, and you see that another
compiler (language) runs faster without the need of (b)?

I love all the features D gives me, and the prospect that somebody who knows
assembler could optimize parts of the code later on. But I'd just wish for D
to be as fast as C without the use of (b).
Mar 26 2008
next sibling parent Silv3r <nomaily 127.0.0.1> writes:
Saaa Wrote:

 But, I'd just wish for D to be as fast as C without the use of b. 

Exactly, I wouldn't be able to put it any clearer myself. ;)
Mar 26 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Saaa wrote:
 What if you can't do b nor have the time to learn b and see that another 
 compiler(language) runs faster without the need of b?

Because the insight gained by figuring out why one is faster than the other
will be very valuable to you in future projects. I know a lot of programmers
who made mistaken assumptions about why one program was faster than another,
assumptions that eventually wound up costing them dearly. A sampling:

1) used the wrong compiler switches

2) the 'faster' program was faster because it had a disastrous bug in it

3) thought that their benchmark was measuring loop performance, when it was
actually measuring the time spent in printf or malloc, rendering all their
conclusions about how to write fast loops false.

4) thought they'd discovered a unique fast algorithm, when what was actually
happening was the compiler optimizer had figured out how to eliminate a
redundant DIV instruction

5) the benchmark proving that language A runs faster than language B turned
out to be a result peculiar to that particular benchmark, and on the real
program the reverse was true

So, essentially, if you don't know *why* the benchmark produced the results
it did, you don't have enough information to draw any conclusions about the
language.
Mar 27 2008
parent reply "Saaa" <empty needmail.com> writes:
 What if you can't do b nor have the time to learn b and see that another
 compiler(language) runs faster without the need of b?


 Because the insight gained by figuring out why one is faster than the 
 other will be very valuable to you in future projects. I know a lot of 
 programmers who made mistaken assumptions about why one program was faster 
 than another, assumptions that eventually wound up costing them dearly.

I don't really see how this is an answer to my question. I suspect this is a more general post about language benchmarks :)
 A sampling:

 1) used the wrong compiler switches

 2) the 'faster' program was faster because it had a disastrous bug in it

 3) thought that their benchmark was measuring loop performance, when it 
 was actually measuring the time spent in printf or malloc, rendering all 
 their conclusions about how to write fast loops false.

 4) thought they'd discovered a unique fast algorithm, when what was 
 actually happening was the compiler optimizer had figured out how to 
 eliminate a redundant DIV instruction

 5) the benchmark proving that language A runs faster than language B 
 turned out to be a result peculiar to that particular benchmark, and on 
 the real program the reverse was true

 So, essentially, if you don't know *why* the benchmark produced the 
 results it did, you don't have enough information to draw any conclusions 
 about the language.

The first steps taken on this newsgroup regarding benchmarks (program speed)
are generally (as they should be) 1, 2 and 3.

4 is sort of what I was talking about: I would like the D compiler to match
the optimization level of the gcc (or even Intel) C compiler.

I am not sure whether you meant to say the opposite of this, but: you don't
need to understand asm to rule out 1, 2 and 3, and to know that the speed
difference lies within 4 (compiler optimizations).
Mar 27 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Saaa wrote:
 4 is sort of what I was talking about: I would like the D compiler to match 
 the
 optimization level of (the gcc or even intel:) C compiler.

D does match the optimization level of gcc: the GDC (gnu D compiler) uses gcc's optimizer and code generator.
Mar 27 2008
next sibling parent reply Don Clugston <dac nospam.com.au> writes:
Walter Bright wrote:
 Saaa wrote:
 4 is sort of what I was talking about: I would like the D compiler to 
 match the
 optimization level of (the gcc or even intel:) C compiler.

D does match the optimization level of gcc: the GDC (gnu D compiler) uses gcc's optimizer and code generator.

Is DMD ever likely to come close? Given the size of the TODO list for the D front-end, I'm having trouble imagining you'll get much time to work on the optimiser before 64-bit CPUs become completely dominant (forcing you to rewrite most of the code generator). Unless there are significant features of the DMC optimiser which DMD still doesn't make use of?
Mar 27 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Don Clugston wrote:
 Walter Bright wrote:
 D does match the optimization level of gcc: the GDC (gnu D compiler) 
 uses gcc's optimizer and code generator.

Is DMD ever likely to come close? Given the size TODO list for the D front-end, I'm having trouble imagining you'll get much time to work on the optimiser before 64 bit CPUs become completely dominant (forcing you to rewrite most of the code generator).

I think it already is close. My back end long ago reached the point where it takes a lot more effort to get only tiny gains.
 Unless there are significant features of the DMC optimiser which DMD 
 still doesn't make use of?

No.
Mar 27 2008
parent "Dave" <Dave_member pathlink.com> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:fsgqvr$vtd$1 digitalmars.com...
 Don Clugston wrote:
 Walter Bright wrote:
 D does match the optimization level of gcc: the GDC (gnu D compiler) 
 uses gcc's optimizer and code generator.

Is DMD ever likely to come close? Given the size TODO list for the D front-end, I'm having trouble imagining you'll get much time to work on the optimiser before 64 bit CPUs become completely dominant (forcing you to rewrite most of the code generator).

I think it already is close. My back end long ago reached the point where it takes a lot more effort to get only tiny gains.

Except for floating point optimizations, even if some loss of accuracy is encountered (because temporaries are kept in FP registers for example). I wish the DMD optimizer would do as good a job with FP as it does w/ int, unless I tell it not to for cases where accuracy is paramount. In that case, it would just have to fall back to what it's doing now.
Mar 27 2008
prev sibling parent reply "Saaa" <empty needmail.com> writes:
 Saaa wrote:
 4 is sort of what I was talking about: I would like the D compiler to 
 match the
 optimization level of (the gcc or even intel:) C compiler.

D does match the optimization level of gcc: the GDC (gnu D compiler) uses gcc's optimizer and code generator.

How does DMD's optimization compare?

And maybe a silly question, but how does optimizing D compare to optimizing C++? Or is it exactly the same, given that the backend is that similar?
Mar 27 2008
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Saaa wrote:
 And maybe a silly question but how does optimizing D compare to optimizing 
 C++.
 Or is this exactly the same as the backend is that similar? 

The backend is not only similar, it is exactly the same. Where C++ and D differ is in various opportunities present in the language semantics. For example, D has invariants so it can do some more aggressive optimizations that C++ cannot.
Mar 27 2008
parent reply "Dave" <Dave_member pathlink.com> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:fsgrbs$12lf$2 digitalmars.com...
 Saaa wrote:
 And maybe a silly question but how does optimizing D compare to 
 optimizing C++.
 Or is this exactly the same as the backend is that similar?

The backend is not only similar, it is exactly the same. Where C++ and D differ is in various opportunities present in the language semantics. For example, D has invariants so it can do some more aggressive optimizations that C++ cannot.

Any ETA on when that will be worked into the optimizer? Thanks, - Dave
Mar 27 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
Dave wrote:
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 Where C++ and D differ is in various opportunities present in the 
 language semantics. For example, D has invariants so it can do some 
 more aggressive optimizations that C++ cannot.

Any ETA on when that will be worked into the optimizer?

Too many other things right now!
Mar 27 2008
prev sibling parent reply "Dave" <Dave_member pathlink.com> writes:
"Saaa" <empty needmail.com> wrote in message 
news:fsglqt$5n2$1 digitalmars.com...
 Saaa wrote:
 4 is sort of what I was talking about: I would like the D compiler to 
 match the
 optimization level of (the gcc or even intel:) C compiler.

D does match the optimization level of gcc: the GDC (gnu D compiler) uses gcc's optimizer and code generator.

How does DMD's optimization compare? And maybe a silly question but how does optimizing D compare to optimizing C++. Or is this exactly the same as the backend is that similar?

The shootout benchmarks seem to indicate that in the general case DMD is pretty close to GCC except for floating point (again, in general).
Mar 27 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Dave Wrote:
 The shootout benchmarks seem to indicate that in the general case DMD is 
 pretty close to GCC except for floating point (again, in general).

Micro-benchmarks have the huge advantage that they allow you to find where a (micro) performance problem may be. The following is a stripped down benchmark from the Shootout:

D version:

import std.stdio, std.conv;

double fibo(double n) {
    if (n < 2)
        return 1.0;
    return fibo(n-2) + fibo(n-1);
}

void main(string[] args) {
    int n = args.length == 2 ? toInt(args[1]) : 1;
    printf("%.1f\n", fibo(n));
}

C version:

#include <stdio.h>
#include <stdlib.h>

double fibo(double n) {
    if (n < 2)
        return 1.0;
    return fibo(n-2) + fibo(n-1);
}

int main(int argc, char **argv) {
    int n = argc == 2 ? atoi(argv[1]) : 1;
    printf("%.1f\n", fibo(n));
    return 0;
}

The difference (DMD vs GCC) is visible. In the case of GCC/DMC you can add the -fomit-frame-pointer compilation flag to increase performance, and you can even replace this line:

if (n < 2) return 1.0;

With:

if (__builtin_expect(n < 2, 0)) return 1.0;

In such a situation the runtime timings are (n = 37 on my very old PC) 2.36 s vs 3.11 s DMD. (I don't know how much work it would be to add an expect to D.)

Bye,
bearophile
Mar 27 2008
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Jason House:
 When performance gaps are an order of magitude,

Relax, in many real situations D (compiled with DMD) is much faster. That's a very artificial example. And I'd like to know if my timings are right, because they may be just wrong after all. Trusting me blindly isn't good. Bye, bearophile
Mar 26 2008
prev sibling next sibling parent reply Daniel <murpsoft hotmail.com> writes:
bearophile Wrote:

 The following little program comes from a progressive stripping down of a
program I was creating. This C and D code give the approximate count of the
words in a file:
 
 D version:
 
 import std.c.stdio: printf, getchar, EOF;
 import std.ctype: isspace;
 
 void main() {
     int count, c;
 
   //OUTER:
     while (1) {
         while (1) {
             c = getchar();
             if (c == EOF)
                 //break OUTER;
                 goto END;
             if (!isspace(c))
                 break;
         }
 
         count++;
 
         while (1) {
             c = getchar();
             if (c == EOF)
                 //break OUTER;
                 goto END;
             if (isspace(c))
                 break;
         }
     }
 
   END:
     printf("%d\n", count);
 }
 
 
 C version:
 
 #include <stdio.h>
 #include <ctype.h>
 
 int main() {
     int count = 0, c;
 
     while (1) {
         while (1) {
             c = getchar();
             if (c == EOF)
                 goto END;
             if (!isspace(c))
                 break;
         }
 
         count++;
 
         while (1) {
             c = getchar();
             if (c == EOF)
                 goto END;
             if (isspace(c))
                 break;
         }
     }
 
     END:
     printf("%d\n", count);
     return 0;
 }
 [snip]

As an exercise for the viewer, if you use SSE2, prefetch, and non-branching bitmath you can perform this roughly 32 times as fast. Regards, Daniel
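
The "non-branching bitmath" part can be sketched even without SSE2 or prefetch. The following is a hypothetical scalar helper (count_words_branchless and space01 are invented names, and only ASCII whitespace is handled): a word starts wherever a non-space byte follows a space byte, so the count can be updated with pure arithmetic instead of data-dependent branches.

```c
#include <stddef.h>

/* 1 for the six ASCII whitespace characters, 0 otherwise,
 * computed without short-circuit branches. */
static int space01(unsigned char c) {
    return (c == ' ') | ((c >= '\t') & (c <= '\r'));
}

/* Count words in a buffer: each space -> non-space transition
 * starts a word.  The loop body contains no if statements. */
long count_words_branchless(const unsigned char *buf, size_t n) {
    long count = 0;
    int prev = 1;                   /* pretend a space precedes the input */
    for (size_t i = 0; i < n; i++) {
        int cur = space01(buf[i]);
        count += prev & (cur ^ 1);  /* 1 only on space -> non-space */
        prev = cur;
    }
    return count;
}
```

A vectorized version would apply the same transition-counting idea to 16 bytes at a time.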
Mar 26 2008
parent "Saaa" <empty needmail.com> writes:
 As an exercise for the viewer, if you use SSE2, prefetch, and 
 non-branching bitmath you can perform this roughly 32 times as fast.

 Regards,
 Daniel

Please teach me!
Mar 26 2008
prev sibling next sibling parent reply TomD <t_demmer nospam.web.de> writes:
bearophile Wrote:
[...]

Seems to be a backend problem:
dmd -O wcd.d
dmc -6  wcc.c
gcc -O3 -mno-cygwin -o wccm wcc.c

On my 7.5MB iTunes XML file:
$ time ./wcd < it.xml
303412

real    0m1.078s
user    0m0.015s
sys     0m0.015s

$ time ./wcc < it.xml
303412

real    0m1.063s
user    0m0.031s
sys     0m0.000s

$ time ./wccm < it.xml
303412

real    0m0.172s
user    0m0.015s
sys     0m0.015s

Even if the usage of opt-switches is arguable here,
it is a back-end rather than a language thing.
Mar 27 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
TomD wrote:
 Seems to be a backend problem:
 dmd -O wcd.d

To get the fastest code out of dmd: dmd -release -inline -O wcd.d
Mar 27 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
I did a little checking.

D's and DMC's isspace(c) checks to see if the input character c is 
ascii. gcc's does not. So there's a little slowdown there, buying 
robustness.

What the real difference is, however, is that D's and DMC's getchar() is 
synchronized for multiple threads. This means that there's a lock/unlock 
going on for *every* single character read.

gcc may not synchronize stdin for multiple threads, or may have found a 
way to avoid the locking if the start-new-thread function in the library 
is never called.

For the Digital Mars stdio library, the way to do fast I/O from the 
synchronized streams is to do it in chunks, not character-by-character. 
Several functions in std.stdio are available to do this - I bet you'll 
get a dramatic speed boost by using one of them.
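
Such chunked reading might look like the following in portable C (a sketch, not Phobos code; count_words_buffered is an invented name). Any per-call locking in the stdio library is then paid once per 64 KB block instead of once per character.

```c
#include <stdio.h>
#include <ctype.h>

/* Count words by reading the stream in large blocks with fread()
 * rather than one getchar() call per character. */
long count_words_buffered(FILE *fp) {
    char buf[65536];
    long count = 0;
    int in_word = 0;   /* are we currently inside a word? */
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (isspace((unsigned char)buf[i]))
                in_word = 0;
            else if (!in_word) {   /* start of a new word */
                in_word = 1;
                count++;
            }
        }
    }
    return count;
}
```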

In other words, from examining the generated asm for dmd and gcc it's 
pretty clear that the observed speed difference has essentially 
*nothing* to do with either the language or the back end optimizer, and 
everything to do with the I/O strategy.

This is why it is crucial to figure out why one is getting certain 
results from a benchmark, instead of just assuming.

P.S. In general, it's just a bad idea to do compute performance 
benchmarks using I/O, because the performance will be dominated by the 
relatively slow I/O.

P.P.S. Back in the 80's, my compiler would regularly beat the tar out of 
other compilers on magazine I/O speed benchmarks. It's because my 
library used a large disk buffer, which made a HUGE difference when 
reading files from a floppy disk. My competitors never did figure that 
out until about 10 years later <g>.
Mar 27 2008
next sibling parent reply "Saaa" <empty needmail.com> writes:
 P.P.S. Back in the 80's, my compiler would regularly beat the tar out of 
 other compilers on magazine I/O speed benchmarks. It's because my library 
 used a large disk buffer, which made a HUGE difference when reading files 
 from a floppy disk. My competitors never did figure that out until about 
 10 years later <g>.

lol
Mar 27 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
Saaa wrote:
 P.P.S. Back in the 80's, my compiler would regularly beat the tar out of 
 other compilers on magazine I/O speed benchmarks. It's because my library 
 used a large disk buffer, which made a HUGE difference when reading files 
 from a floppy disk. My competitors never did figure that out until about 
 10 years later <g>.

lol

The hilarity is that full library source was included with the compiler - what I was doing was in plain sight! Understanding how I/O works is still key to making high performance programs, despite there being very little discussion about it.
Mar 27 2008
prev sibling next sibling parent reply Georg Wrede <georg nospam.org> writes:
Walter Bright wrote:
 I did a little checking.
 
 D's and DMC's isspace(c) checks to see if the input character c is 
 ascii. gcc's does not. So there's a little slowdown there, buying 
 robustness.
 
 What the real difference is, however, is that D's and DMC's getchar() is 
 synchronized for multiple threads. This means that there's a lock/unlock 
 going on for *every* single character read.
 
 gcc may not synchronize stdin for multiple threads, or may have found a 
 way to avoid the locking if the start-new-thread function in the library 
 is never called.

Hmm. What kind of situations would need multiple threads simultaneously reading from stdin? And if there aren't any, wouldn't locking be like wearing a life vest on land? And isspace, is that really the right place to check for ascii?
Mar 27 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Georg Wrede wrote:
 Hmm. What kind of situations would need multiple threads simultaneously 
 reading from stdin? And if there aren't any, wouldn't locking be like 
 wearing a life vest on land?

If it was unsynchronized, if you had two threads reading lines from the input, the contents of the lines would be interleaved. With syncing, each thread will get complete lines.
 And isspace, is that really the right place to check for ascii?

It has to be, because its argument is an int, not a char.
Mar 27 2008
next sibling parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Walter Bright wrote:
 Georg Wrede wrote:
 Hmm. What kind of situations would need multiple threads 
 simultaneously reading from stdin? And if there aren't any, wouldn't 
 locking be like wearing a life vest on land?

If it was unsynchronized, if you had two threads reading lines from the input, the contents of the lines would be interleaved. With syncing, each thread will get complete lines.

I believe his point was "What kind of program would ever want to do character-level stdin I/O from multiple threads?" My follow-up would be: "And if there *is* some exotic program that needs to do that, why can't you just require it to provide its own damn locking, and leave the rest of us with faster I/O?" (Of course, I use Linux, so AFAIK I'm using glibc anyway)
Mar 27 2008
parent reply Georg Wrede <georg nospam.org> writes:
Frits van Bommel wrote:
 Walter Bright wrote:
 
 Georg Wrede wrote:

 Hmm. What kind of situations would need multiple threads 
 simultaneously reading from stdin? And if there aren't any, wouldn't 
 locking be like wearing a life vest on land?

If it was unsynchronized, if you had two threads reading lines from the input, the contents of the lines would be interleaved. With syncing, each thread will get complete lines.

I believe his point was "What kind of program would ever want to do character-level stdin I/O from multiple threads?" My follow-up would be: "And if there *is* some exotic program that needs to do that, why can't you just require it to provide its own damn locking, and leave the rest of us with faster I/O?" (Of course, I use Linux, so AFAIK I'm using glibc anyway)

Heck, I just spent ages formulating the same thing. You said it in a couple of sentences.
Mar 27 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
You guys do have a point that perhaps syncing is not needed on stdin. 
But I'm a little leery of removing it, as it is needed for output, and 
I'm not comfortable that this is thought through thoroughly.

But my other point is that this is an issue with the Digital Mars C 
runtime library, and has nothing whatsoever to do with D or the 
optimizer/backend of D.
Mar 28 2008
parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from Walter Bright (newshound1 digitalmars.com)'s article
 You guys do have a point that perhaps syncing is not needed on stdin.
 But I'm a little leery of removing it, as it is needed for output, and
 I'm not comfortable that this is thought through thoroughly.
 But my other point is that this is an issue with the Digital Mars C
 runtime library, and has nothing whatsoever to do with D or the
 optimizer/backend of D.

Another option might be to check whether the thread count is greater than 1 and only lock if it is. Tango has a routine called thread_needLock for this purpose, though it goes a bit farther: once a thread has been created it stays true through program termination. This avoids problems with the situation where you have multiple threads running and all but one terminate, but memory is not yet synchronized.

Sean
Mar 28 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.
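
For reference, the pattern (and the fix) can be sketched in C11; Config and get_config are invented names for illustration, not from any library. The bug in the naive version is that a plain store publishes the pointer before the object's fields are guaranteed visible to other cores; acquire/release atomics order the construction before the publication.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct { int value; } Config;

static _Atomic(Config *) instance;   /* zero-initialized to NULL */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

Config *get_config(void) {
    Config *p = atomic_load_explicit(&instance, memory_order_acquire);
    if (p == NULL) {                       /* first check: no lock taken   */
        pthread_mutex_lock(&lock);
        p = atomic_load_explicit(&instance, memory_order_relaxed);
        if (p == NULL) {                   /* second check: under the lock */
            p = malloc(sizeof *p);
            p->value = 42;                 /* construct fully first...     */
            atomic_store_explicit(&instance, p,
                                  memory_order_release); /* ...then publish */
        }
        pthread_mutex_unlock(&lock);
    }
    return p;
}
```

With a plain (non-atomic) pointer and no memory ordering, the first unlocked check is exactly the notorious bug.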
Mar 28 2008
next sibling parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Walter Bright wrote:
 Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.

In this case though, if the thread count was 1 when checked it's safe to proceed without locking as long as you don't spawn any new threads. If it wasn't 1 when checking, but became 1 afterwards, it doesn't really matter; you may not have needed to lock, but it'll work fine if you do. (Just make sure the code path that locks also (unconditionally) unlocks, and of course that the code path that doesn't lock doesn't try to unlock either[1])

Oh, and the Phobos gc also has a thread_needLock() (in phobos/internal/gc/gcx.d).

[1]
---
if (!thread_needLock()) {
    foo();
} else synchronized(X) {
    foo();
}
---
does that nicely, and is what the GC uses. (Both in Phobos and Tango)

Hmm, reading the relevant parts of the GC code just now, something occurred to me. Some background first: if allocation triggers an actual collection it checks thread_needLock() again, and locks for the duration of the collection if there's only one thread. The comment explains this is done because finalizers may start new threads. However, the lock is then released *before* using the newly-collected heap to perform the actual allocation. That makes me wonder what happens if the first thing such a new thread does is allocate some memory...

In other words:
1) The only thread starts an allocation, determining the lock is not needed.
2) There's no space to allocate, and the GC prepares for a collection.
3) The GC notices the number of threads is 1, acquires the lock, and starts performing the collection.
4) A finalizer starts a new thread, which attempts to allocate and blocks on the GC lock held by the collector.
5) The original thread finishes the collection and *releases the lock*.
6) It then determines what memory location to return from 'new'.
7) Thread switch: the second thread acquires the GC lock (which is no longer held by the first thread) and starts its own allocation activities. Since the original thread didn't yet mark its chosen memory as allocated, the second thread picks the same memory and marks it as allocated.
8) Thread switch: the first thread finishes its allocation by marking the memory as allocated (even though it was already marked by the second thread).
9) Both threads start using the same piece of memory as if they have the only reference to it (which should be true from either perspective, since it was just returned from 'new').
10) *KABOOM*

(I may have just proven your point about these things being very hard to get right...)
Mar 28 2008
parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Frits van Bommel wrote:
 10) *KABOOM*

Bugzilla'd: http://d.puremagic.com/issues/show_bug.cgi?id=1957
Mar 28 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
Thanks for putting it in bugzilla.
Mar 28 2008
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Walter Bright (newshound1 digitalmars.com)'s article
 Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

of the notoriously difficult to comprehend "double checked locking" bug.

With the way this call is implemented in Tango, it is completely safe to omit a mutex lock on a false return so long as the ostensibly locked code does not call into code that may create a thread. For something like an IO operation, this seems pretty unlikely so it's a reasonable optimization, if perhaps progressively less useful as multithreading becomes more common. Sean
Mar 28 2008
prev sibling next sibling parent reply Benji Smith <benji benjismith.net> writes:
Walter Bright wrote:
 Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.

I thought the double-checked locking idiom was only buggy in the Java memory model (prior to being fixed in the 1.6 JDK). --benji smith
Mar 28 2008
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Benji Smith wrote:
 I thought the double-checked locking idiom was only buggy in the Java 
 memory model (prior to being fixed in the 1.6 JDK).

It's a common bug in C/C++ code.
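For concreteness, here is the shape of the bug in C/C++ terms, alongside one repaired sketch using C++11 atomics (which postdate this thread); `Widget` is an invented payload type:

```cpp
#include <atomic>
#include <mutex>

// The classic broken form, in outline:
//
//   Widget* instance;                  // plain pointer
//   Widget* get() {
//       if (!instance) {               // first check, unsynchronized
//           lock();
//           if (!instance)
//               instance = new Widget; // the store of the pointer may become
//           unlock();                  // visible before the Widget's fields,
//       }                              // so another thread can observe a
//       return instance;               // non-null pointer to a half-built
//   }                                  // object
//
// A sketch of the repaired idiom with acquire/release atomics:
struct Widget { int value = 42; };

std::atomic<Widget*> instance{nullptr};
std::mutex instance_mu;

Widget* get_widget() {
    Widget* p = instance.load(std::memory_order_acquire);
    if (!p) {
        std::lock_guard<std::mutex> lock(instance_mu);
        p = instance.load(std::memory_order_relaxed);  // re-check under lock
        if (!p) {
            p = new Widget;
            // release store: publishes the fully constructed Widget
            instance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```

The acquire load pairs with the release store, which is exactly the ordering guarantee the naive version silently assumes.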
Mar 29 2008
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from Benji Smith (benji benjismith.net)'s article
 Walter Bright wrote:
 Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.

 I thought the double-checked locking idiom was only buggy in the Java 
 memory model (prior to being fixed in the 1.6 JDK).

It's broken in C++ as well. It actually works in D however, provided one uses 'volatile' in the right places. I posted a demo impl a few years back either in this forum or in D.announce. Sean
Mar 29 2008
parent reply Kevin Bealer <kevinbealer gmail.com> writes:
Sean Kelly Wrote:

 == Quote from Benji Smith (benji benjismith.net)'s article
 Walter Bright wrote:
 Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.

 I thought the double-checked locking idiom was only buggy in the Java 
 memory model (prior to being fixed in the 1.6 JDK).

It's broken in C++ as well. It actually works in D however, provided one uses 'volatile' in the right places. I posted a demo impl a few years back either in this forum or in D.announce. Sean

Couldn't you use two mutexes to make this work even in C++ (but I'm using D syntax for simplicity)? The second mutex doesn't actually protect anything, but it should be a portable way to include the memory barriers. Actually "portable" is a weasel word, since all C++ locking and multithreading is done in platform-dependent libraries.

---
class Foo {
private:
    Object data;
    Mutex a, b;
public:
    Object getData()
    {
        if (data is null) {
            a.lock();
            if (data is null) {
                Object o = null;
                {
                    b.lock();
                    o = new Object;
                    b.unlock();
                }
                data = o;
            }
            a.unlock();
        }
        return data;
    }
};
---

Kevin
Mar 29 2008
parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Kevin Bealer (kevinbealer gmail.com)'s article
 Sean Kelly Wrote:
 == Quote from Benji Smith (benji benjismith.net)'s article
 Walter Bright wrote:
 Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.

 I thought the double-checked locking idiom was only buggy in the Java 
 memory model (prior to being fixed in the 1.6 JDK).

It's broken in C++ as well. It actually works in D however, provided one uses 'volatile' in the right places. I posted a demo impl a few years back either in this forum or in D.announce. Sean


 Couldn't you use two mutexes to make this work even in C++ (but I'm using 
 D syntax for simplicity)? The second mutex doesn't actually protect 
 anything, but it should be a portable way to include the memory barriers. 
 Actually "portable" is a weasel word, since all C++ locking and 
 multithreading is done in platform-dependent libraries.
 class Foo {
 private:
     Object data;
     Mutex a, b;
 public:
     Object getData()
     {
         if (data is null) {
             a.lock();
             if (data is null) {
                 Object o = null;
                 {
                     b.lock();
                     o = new Object;
                     b.unlock();
                 }
                 data = o;
             }
             a.unlock();
         }
         return data;
     }
 };

That seems like it might work. Sean
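For comparison, a sketch that doesn't lean on the second mutex for ordering at all: std::call_once (C++11, so well after this thread) both serializes the initialization and publishes the result with the right memory ordering. `Object` here is a stand-in type.

```cpp
#include <mutex>

// Hypothetical payload; the interesting part is the once_flag.
struct Object { int id = 1; };

Object* g_data = nullptr;
std::once_flag g_once;

Object* getData() {
    // call_once runs the lambda exactly once across all threads, and every
    // caller synchronizes-with that initialization before reading g_data.
    std::call_once(g_once, [] { g_data = new Object; });
    return g_data;
}
```

Callers can then just use getData() freely from any thread without a visible lock on the fast path.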
Mar 29 2008
prev sibling parent janderson <askme me.com> writes:
Walter Bright wrote:
 Sean Kelly wrote:
 Another option might be to check the thread count is greater than
 1 and only lock if it is.  Tango has a routine called thread_needLock
 for this purpose, though it goes a bit farther and is true once a
 thread has been created through program termination.  This avoids
 problems with the situation where you have multiple threads running
 and all but one terminate but memory is not yet synchronized.

You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.

Perhaps a method of disabling locking would be useful, either by simply providing a lock-free version of the function or by being able to turn locking on/off. Either way, the user would be the one who explicitly disables it. -Joel
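One sketch of that idea in C++ terms, letting the caller pick the locking policy explicitly (the names here are invented for illustration):

```cpp
#include <mutex>

// A do-nothing lock the user can opt into for single-threaded use.
struct NoLock {
    void lock() {}
    void unlock() {}
};

// The locking strategy is a template parameter: std::mutex by default,
// NoLock when the user explicitly disables synchronization.
template <class LockPolicy = std::mutex>
class Counter {
public:
    void add(int n) {
        std::lock_guard<LockPolicy> guard(policy_);  // no-op for NoLock
        total_ += n;
    }
    int total() const { return total_; }

private:
    LockPolicy policy_;
    int total_ = 0;
};
```

Usage: `Counter<>` for shared state, `Counter<NoLock>` when the user knows only one thread touches it; the unsynchronized version compiles down to a plain increment.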
Mar 29 2008
prev sibling parent Georg Wrede <georg nospam.org> writes:
Walter Bright wrote:
 Georg Wrede wrote:
 
 Hmm. What kind of situations would need multiple threads 
 simultaneously reading from stdin? And if there aren't any, wouldn't 
 locking be like wearing a life vest on land?

If it was unsynchronized, if you had two threads reading lines from the input, the contents of the lines would be interleaved. With syncing, each thread will get complete lines.

Yes. Now, my problem was: it should be pretty uncommon to need two (or of course more) threads reading the same input within the same program (and hugely more uncommon for stdin, since the _entry_ to stdin is implemented in a _single_ process anyway in the operating system, even if several simultaneous processes were to feed it data). Therefore there is no way a process can acquire data at a rate faster than that of a single processor.

So, it would only be relevant to an application that

- is disturbed by having the input mistakenly be split into smaller than one-line pieces
- is happy with a line at a time (i.e. lines are self-sufficient), vs. needing to split it into logical entities potentially larger than a line (like a parser), any of which may be given to one of the threads.

A programmer even dreaming of implementing this most likely will implement the input itself as a single thread, dispatching the data between multiple threads. The assumptions being:

- reading stdin is conceptually ill suited to multithreading
- consuming the data is slower than reading [and dispatching] it

----

Anyhow, I've never heard even rumors of multiple threads reading _stdin_. No programmer contemplating such would be newbie enough to actually try it. Besides, he'd never _expect_ reading stdin to be thread-safe, therefore he'd not even check the documentation. All in all, I still think it's wearing a life vest on land. And crapping up throughput.
 And isspace, is that really the right place to check for ascii?

It has to be, because its argument is an int, not a char.

Ah, ok. Not to be a perennial PITA, but the original context of your statement was a comparison between D and C++. This might have given the casual reader the impression that C++ doesn't have this int thing, making C++ faster. You may want to elaborate on that for the regular reader.
Mar 27 2008
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 gcc may not synchronize stdin for multiple threads, or may have found a 
 way to avoid the locking if the start-new-thread function in the library 
 is never called.

Thank you for taking a look at this benchmark. I understand that D is safer regarding non-ASCII, but 8 times slower seems a lot. I think GCC has functions like fgets_unlocked, getc_unlocked, etc. (which I think the DMD std lib misses) for single-threaded programs: http://linux.die.net/man/3/fgets_unlocked http://www.opengroup.org/onlinepubs/009695399/functions/getc_unlocked.html
For the Digital Mars stdio library, the way to do fast I/O from the
synchronized streams is to do it in chunks, not character-by-character. Several
functions in std.stdio are available to do this - I bet you'll get a dramatic
speed boost by using one of them.<
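Walter's chunked-I/O suggestion, sketched in C-style C++ (the function name and 64 KB buffer size are arbitrary): instead of one locked getchar() call per byte, read the file in large blocks with fread() and scan the buffer with a small word-boundary state machine.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdio>

// Count words in a stream, reading in chunks rather than byte-by-byte.
long count_words(std::FILE* f) {
    char buf[1 << 16];      // 64 KB chunk; size is an arbitrary choice
    long count = 0;
    bool in_word = false;   // are we currently inside a word?
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) {
        for (std::size_t i = 0; i < n; ++i) {
            bool space = std::isspace(static_cast<unsigned char>(buf[i])) != 0;
            if (!space && !in_word)
                ++count;    // transition space -> non-space starts a word
            in_word = !space;
        }
    }
    return count;
}
```

This shape makes one (possibly locked) library call per 64 KB instead of one per byte, which is presumably where most of the 7-8x gap goes.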

At the moment DMD is two times slower than Clean in the number-reading benchmark, so there's space for improvements there too: http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all

Bye, and thank you,
bearophile
Mar 27 2008