digitalmars.D - Read text file fast, how?

reply Johan Holmberg via Digitalmars-d <digitalmars-d puremagic.com> writes:
Hi!

I am trying to port a program I have written earlier to D. My previous
versions are in C++ and Python. I was hoping that a D version would be
similar in speed to the C++ version, rather than similar to the Python
version. But currently it isn't.

Part of the problem may be that I haven't learned the idiomatic way to do
things in D. One such thing is perhaps: how do I read large text files in
an efficient manner in D?

Currently I have created a little test-program that does the same job as
the UNIX-command "wc -lc", i.e. counting the number of lines and characters
in a file. The timings I get in different languages are:

D:        15s
C++:      1.1s
Python:   3.7s
Perl:     2.9s

The central loop in my D program looks like:

        foreach (line; f.byLine) {
            nlines += 1;
            nchars += line.length + 1;
        }

I have also tried another variant with this inner loop:

        char[] line;
        while(f.readln(line)) {
            nlines += 1;
            nchars += line.length;
        }

but in both cases this D program is much slower than any of the others in
C++/Python/Perl. I don't understand what can cause this dramatic difference
to C++, and a factor 4 to Python. My D programs are built with DMD 2.067.1
on MacOS Yosemite, using the flags "-O -release".

Is there something I can do to make the program run faster, and still be
"idiomatic D"?

(I append the whole program for reference)

Regards,
/Johan Holmberg

=======================================
import std.stdio;
import std.file;

void main(string[] argv) {
    foreach (fname; argv[1..$]) {
        auto f = File(fname);
        int nlines = 0;
        int nchars = 0;
        foreach (line; f.byLine) {
            nlines += 1;
            nchars += line.length + 1;
        }
        writeln(nlines, "\t", nchars, "\t", fname);
    }
}
=======================================
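
For comparison, here is a minimal sketch of a chunk-based variant. It is not taken from the thread: the helper name `countFile` and the 64 KiB chunk size are assumptions made for illustration. It reads the file with `File.byChunk` and counts newline bytes with `std.algorithm.count`, so no per-line slice has to be produced. Unlike the program above, it reports the actual byte count rather than adding 1 per line.

```d
import std.algorithm : count;
import std.stdio : File, writeln;

/// Counts newline bytes and total bytes by reading fixed-size chunks.
/// `countFile` and the 64 KiB chunk size are illustrative choices.
void countFile(string fname, out ulong nlines, out ulong nbytes)
{
    auto f = File(fname, "rb");
    // byChunk hands out a reused buffer, so each chunk is only valid
    // until the next iteration.
    foreach (ubyte[] chunk; f.byChunk(64 * 1024))
    {
        nlines += chunk.count('\n');
        nbytes += chunk.length;
    }
}

void main(string[] argv)
{
    foreach (fname; argv[1 .. $])
    {
        ulong nlines, nbytes;
        countFile(fname, nlines, nbytes);
        writeln(nlines, "\t", nbytes, "\t", fname);
    }
}
```

Whether this is faster than `byLine` on a given platform would have to be measured; it only sidesteps line splitting, not the underlying IO.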
Jul 25 2015
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/25/15 8:19 AM, Johan Holmberg via Digitalmars-d wrote:
 [...]
I think this harkens back to the problem discussed here:

http://stackoverflow.com/questions/28922323/improving-line-wise-i-o-operations-in-d/29153508

As I discuss there, the performance bug has been fixed for 2.068. With your code:

$ time wc -l <(repeat 1000000 echo hello)
 1000000 /dev/fd/11
wc -l <(repeat 1000000 echo hello)  0.11s user 2.35s system 54% cpu 4.529 total
$ time ./test.d <(repeat 1000000 echo hello)
1000000	6000000	/dev/fd/11
./test.d <(repeat 1000000 echo hello)  0.73s user 1.76s system 64% cpu 3.870 total

The compilation was flag free (no -O -inline -release etc).

Andrei
Jul 25 2015
parent reply Johan Holmberg via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Sat, Jul 25, 2015 at 7:14 PM, Andrei Alexandrescu via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 [...]
Thanks, my question seems like a carbon copy of the Stack Overflow article :) Somehow I had missed it when googling.

I downloaded a dmd 2.068 beta, and re-tried with my input file: now the D program takes 1.6s (a 10x improvement).

/johan
Jul 25 2015
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/25/15 1:53 PM, Johan Holmberg via Digitalmars-d wrote:
 Thanks, my question seems like a carbon copy of the Stack Overflow
 article :) Somehow I had missed it when googling.

 I download a dmd 2.068 beta, and re-tried with my input file: now the D
 program takes 1.6s (a 10x improvement).
Great, though it still seems to be behind the C++ version, which is a bummer. -- Andrei
Jul 25 2015
next sibling parent reply "Brandon Ragland" <brags callmemaybe.com> writes:
On Saturday, 25 July 2015 at 20:12:26 UTC, Andrei Alexandrescu 
wrote:
 On 7/25/15 1:53 PM, Johan Holmberg via Digitalmars-d wrote:
 Thanks, my question seems like a carbon copy of the Stack 
 Overflow
 article :) Somehow I had missed it when googling.

 I download a dmd 2.068 beta, and re-tried with my input file: 
 now the D
 program takes 1.6s (a 10x improvement).
 Great, though it still seems to be behind the C++ version, which is a
 bummer. -- Andrei

Do you happen to have a link to the source where you fixed it? I feel like contributing some reading effort today.
Jul 25 2015
parent "sigod" <sigod.mail gmail.com> writes:
On Saturday, 25 July 2015 at 22:40:55 UTC, Brandon Ragland wrote:
 [...]
https://github.com/D-Programming-Language/phobos/pull/3089
Jul 25 2015
prev sibling parent reply Johan Holmberg via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Sat, Jul 25, 2015 at 10:12 PM, Andrei Alexandrescu via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 [...]
My C++ program was actually doing C-style IO via <stdio.h>. I didn't think about the distinction C/C++ when reporting the earlier numbers.

If I switch to full C++ style: <fstream> + <string> + C++ version of getline(), then the C++-solution is even slower than Python: 5.2s. I think it is the C++ libraries of Clang on MacOS Yosemite that are slow.

This prompted me to re-run the tests on a Linux machine (Ubuntu 14.04), still with the same input file, a text file with 7M lines and total size of 466MB:

C++ with <stdio.h> style IO:   0.40s
C++ with <fstream> style IO:   0.31s
D 2.067:                       1.75s
D 2.068 beta 2:                0.69s
Perl:                          1.49s
Python:                        1.86s

So on Ubuntu, the C++ <fstream> version was clearly best. And the improvement in DMD 2.068 beta was "only" a factor of 2.5 from 2.067.

/johan
Jul 26 2015
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/26/15 10:35 AM, Johan Holmberg via Digitalmars-d wrote:
 [...]
I think we should investigate this and bring performance to par. Anyone interested? -- Andrei
Jul 26 2015
next sibling parent "Brandon Ragland" <brags callmemaybe.com> writes:
On Sunday, 26 July 2015 at 15:36:29 UTC, Andrei Alexandrescu 
wrote:
 On 7/26/15 10:35 AM, Johan Holmberg via Digitalmars-d wrote:
 [...]
 I think we should investigate this and bring performance to par.
 Anyone interested? -- Andrei

Here's the link to the fstream source in libstdc++ for GNU/Linux (Ubuntu / Debian):

https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.0/fstream-source.html

Not too sure who is familiar with it, but it uses basic_streambuf underneath.
Jul 26 2015
prev sibling parent reply Johan Holmberg via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Sun, Jul 26, 2015 at 5:36 PM, Andrei Alexandrescu via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 [...]
Back on MacOS again, I thought I should try to run "Instruments" on my program. I'm not familiar with the DMD source code, but I did the following:

- downloaded the DMD source from Github + built it
- rebuilt my program with this dmd
- used Instruments (the MacOS profiler) on my program

Two things showed up in Instruments that seemed suspicious, both in "stdio.d":

1) calls to "__tls_get_addr" inside "readlnImpl" (taking 0.25s out of the total 1.69s according to Instruments). I added "__gshared" to the static variables "lineptr" and "n" to see if it had any effect (see below for results).

2) calls to "std.algorithm.endsWith" inside File.ByLine.Impl.popFront (taking 0.10s according to Instruments). I replaced it with a simpler test using inline code.

The timings running my program normally (not using Instruments now), became as follows with the different versions of dmd:

dmd unmodified:        1.59s
dmd with change 1):    1.33s
dmd with change 1+2):  1.22s
C++ using <stdio.h>:   1.13s    (for comparison)

My changes to dmd are of course not correct, but my program still works as before at least. If 1) and 2) could be changed "the right way" the difference to the C++ program would be much smaller on MacOS (I haven't looked further into the Linux results).

Does this help getting forward?

/johan
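
As background for point 1): module-level and static variables in D are thread-local by default, which is why reading them can go through a TLS lookup such as "__tls_get_addr", while "__gshared" gives a single process-wide variable. The sketch below only illustrates that language rule; it is not the Phobos code, and the names `tlsCounter`, `sharedCounter` and `bump` are made up for the example.

```d
import core.thread : Thread;
import std.stdio : writeln;

int tlsCounter;               // thread-local by default: one copy per thread
__gshared int sharedCounter;  // one copy for the whole process, no TLS lookup

void bump()
{
    ++tlsCounter;
    ++sharedCounter;
}

void main()
{
    bump();                       // increments the main thread's copies
    auto t = new Thread(&bump);   // the new thread starts with its own tlsCounter == 0
    t.start();
    t.join();
    writeln(tlsCounter);          // prints 1: the other thread bumped its own copy
    writeln(sharedCounter);       // prints 2: both threads bumped the same variable
}
```

A `__gshared` variable is not protected against concurrent access, which is presumably part of why the experimental change is described as "not correct".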
Jul 27 2015
next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 27 July 2015 at 12:03:40 UTC, Johan Holmberg wrote:
 On Sun, Jul 26, 2015 at 5:36 PM, Andrei Alexandrescu via 
 Digitalmars-d < digitalmars-d puremagic.com> wrote:

[...]
Back on MacOS again, I thought I should try to run "Instruments" on my program. I'm not familiar with the DMD source code, but I did the following: [...]
IIRC D's TLS is particularly slow on OS X.
Jul 27 2015
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2015-07-27 14:03, Johan Holmberg via Digitalmars-d wrote:

 [...]
I recommend you also try using LDC. It has a better optimizer and is using native TLS on OS X.

--
/Jacob Carlborg
Jul 29 2015
parent reply Johan Holmberg via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Wed, Jul 29, 2015 at 11:47 AM, Jacob Carlborg via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 On 2015-07-27 14:03, Johan Holmberg via Digitalmars-d wrote:

 The timings running my program normally (not using Instruments now),
 became as follows with the different versions of dmd:

 dmd unmodified: 1.59s
 dmd with change 1): 1.33s
 dmd with change 1+2): 1.22s
 C++ using <stdio.h>: 1.13s    (for comparison)
 I recommend you also try using LDC. It has a better optimizer and is
 using native TLS on OS X.

 /Jacob Carlborg

Is there an LDC that incorporates the changes coming in DMD 2.068 that made my code run 10x faster compared with 2.067? (The ones Andrei talked about in the Stack Overflow link given earlier in this thread: https://github.com/D-Programming-Language/phobos/pull/3089)

I have tried "ldc2-0.15.2-beta1-osx-x86_64" and also built LDC from the Git-archive sources. In both cases I get times around 13s. This is close to my original "bad" numbers from DMD 2.067 (15s). I assume I have to wait until there is an LDC using the same Phobos version as DMD 2.068 uses.

/johan
Jul 29 2015
parent Jacob Carlborg <doob me.com> writes:
On 2015-07-29 19:02, Johan Holmberg via Digitalmars-d wrote:

 Is there a LDC that incorporates the changes coming in DMD 2.068 that
 made my code run 10x faster compared with 2.067?
I would guess that there isn't. -- /Jacob Carlborg
Jul 29 2015
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/27/15 8:03 AM, Johan Holmberg via Digitalmars-d wrote:
 [...]
Thanks, yes, this is a great start. Would anyone want to refine these insights into a pull request?

Andrei
Jul 30 2015
prev sibling next sibling parent "Bigsandwich" <bigsandwich gmail.com> writes:
On Sunday, 26 July 2015 at 14:36:09 UTC, Johan Holmberg wrote:
 On Sat, Jul 25, 2015 at 10:12 PM, Andrei Alexandrescu via 
 Digitalmars-d < digitalmars-d puremagic.com> wrote:

[...]
My C++ program was actually doing C-style IO via <stdio.h>. I didn't think about the distinction C/C++ when reporting the earlier numbers. [...]
It would be interesting to see numbers for the stdio.h code in D, since it should be easy to translate and would rule out issues with compiler vs. library.
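
A possible shape for such a translation, as a rough sketch only: the thread does not show the C-style program, so the use of `fgets`, the helper name `countWithFgets` and the 64 KiB buffer are assumptions. It calls the C library directly through `core.stdc.stdio`, bypassing `std.stdio`'s `byLine` and `readln`.

```d
import core.stdc.stdio : FILE, fclose, fgets, fopen;
import core.stdc.string : strlen;
import std.stdio : writeln;
import std.string : toStringz;

/// Counts lines and characters with C's fgets, called from D.
/// Returns false if the file cannot be opened.
bool countWithFgets(string fname, out ulong nlines, out ulong nchars)
{
    FILE* fp = fopen(fname.toStringz, "r");
    if (fp is null)
        return false;
    scope (exit) fclose(fp);

    char[64 * 1024] buf = void;
    while (fgets(buf.ptr, cast(int) buf.length, fp) !is null)
    {
        immutable len = strlen(buf.ptr);
        nchars += len;
        // A line longer than the buffer arrives in pieces; count it once,
        // when the piece ending in '\n' is seen.
        if (len > 0 && buf[len - 1] == '\n')
            ++nlines;
    }
    return true;
}

void main(string[] argv)
{
    foreach (fname; argv[1 .. $])
    {
        ulong nlines, nchars;
        if (countWithFgets(fname, nlines, nchars))
            writeln(nlines, "\t", nchars, "\t", fname);
    }
}
```

Like "wc -l", this counts only newline-terminated lines, and because it relies on `strlen` it would miscount input containing NUL bytes.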
Jul 26 2015
prev sibling parent reply "Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:
On Sunday, 26 July 2015 at 14:36:09 UTC, Johan Holmberg wrote:
 C++ with <stdio.h> style IO:    0.40s
 C++ with <fstream> style IO:   0.31s
 D 2.067                                    1.75s
 D 2.068 beta 2:                        0.69s
 Perl:                                         1.49s
 Python:                                    1.86s

 So on Ubuntu, the C++ <fstream> version was clearly best. And 
 the improvement in DMD 2.068 beta "only" a factor of 2.5 from 
 2.067.

 /johan
It would be better to compare with LDC or GDC to match the same backend as C++. That is a little harder since they don't have 2.068 yet.
Jul 26 2015
parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 07/26/2015 09:04 PM, Jesse Phillips wrote:
 
 It would be better to compare with LDC or GDC to match the same backend
 as C++. That is a little harder since they don't have 2.068 yet.
Reading a file is IO and memcpy limited, has nothing to do with compiler optimizations. Clearly we must be doing some unnecessary copying or allocating.
Jul 27 2015
next sibling parent Tobias Müller <troplin bluewin.ch> writes:
Martin Nowak <code+news.digitalmars dawg.eu> wrote:
 On 07/26/2015 09:04 PM, Jesse Phillips wrote:
 
 It would be better to compare with LDC or GDC to match the same backend
 as C++. That is a little harder since they don't have 2.068 yet.
Reading a file is IO and memcpy limited, has nothing to do with compiler optimizations. Clearly we must be doing some unnecessary copying or allocating.
Or too many syscalls because of non-optimal buffering?

Tobi
Jul 27 2015
prev sibling parent "Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:
On Monday, 27 July 2015 at 08:52:07 UTC, Martin Nowak wrote:
 On 07/26/2015 09:04 PM, Jesse Phillips wrote:
 
 It would be better to compare with LDC or GDC to match the 
 same backend as C++. That is a little harder since they don't 
 have 2.068 yet.
Reading a file is IO and memcpy limited, has nothing to do with compiler optimizations. Clearly we must be doing some unnecessary copying or allocating.
Unless the only code being exercised is a system call to read and a call to memcpy, I'll stick with the notion that the backends may have something to do with it, at least until it is tested with the same backend.
Jul 27 2015
prev sibling parent reply "Marc Schütz" <schuetzm gmx.net> writes:
Are you including program startup and exit in the timing? For 
comparison, can you include the timings of an empty do-nothing 
program in all the languages?
Jul 27 2015
parent Johan Holmberg via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Mon, Jul 27, 2015 at 11:03 AM, via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 Are you including program startup and exit in the timing? For comparison,
 can you include the timings of an empty do-nothing program in all the
 languages?
Yes, I measure the whole program. But these startup/exit times are really small. Reading /dev/null takes 0.003s in both C++ and D, and 0.007s in Perl. "Nothing" compared to the other times.

/johan
Jul 27 2015