
digitalmars.D - std.csv Performance Review

reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
Author here:

The discussion[1] and articles[2] around "Faster Command Line 
Tools" had me trying out std.csv for the task.

Now I know std.csv isn't fast and it allocates. When I wrote my 
CSV parser, I'd also left around a parser which used slicing 
instead of allocation[3].

I compared these two: LDC -O3 -release

std.csv: over 10 seconds
csv slicing: under 5 seconds

Over 50% improvement isn't bad, but this still wasn't competing 
with the other implementations. Now I didn't profile std.csv's 
implementation but I did take a look at the one with slicing.

The majority of the time was spent in std.algorithm.startsWith, 
which is being called by countUntil. The calls made to empty() 
also add up from their use in countUntil and startsWith. These 
functions are by no means slow; startsWith averaged 1 millisecond 
execution time while countUntil was up to 5 milliseconds. The 
thing is, startsWith was called a whopping 384,806,160 times.
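For readers unfamiliar with where those calls come from, here is a 
minimal sketch (not std.csv's actual code) of how a slicing tokenizer 
finds field boundaries with countUntil, which walks the range one 
element at a time and probes with startsWith at each step:

```d
import std.algorithm.searching : countUntil;

void main()
{
    // Hypothetical slicing tokenizer: find the separator, slice the field.
    // countUntil advances one element at a time, which is why its call
    // count (and startsWith's, used internally) explodes on a
    // 10-million-row file.
    string line = "a,b,c";
    auto idx = line.countUntil(',');
    assert(idx == 1);
    auto field = line[0 .. idx];   // a slice of the input, no allocation
    assert(field == "a");
}
```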

Keep in mind that the file itself has 10,512,769 rows of data 
with four columns. Now, I've talked about std.csv's performance 
in the past, probably with the author of the fast command line 
tools. Essentially it came down to this: std.csv is restricted to 
parsing with only the input range API, and you can't correctly 
parse CSV without allocation. But now I'm working outside those 
restrictions, so I offer an additional point.

Both of these do something none of the other implementations do: 
they validate that the CSV is well formed. If they find that the 
file no longer conforms to the correct CSV layout they make a 
choice, either throwing an exception or guessing and continuing 
on (based on what the user requested). While the Nim 
implementation does handle escaped quotes (and newlines, unlike 
fast csv), its parsing assumes the file is well formed, and 
std.csv was quite prompt to point out that this file is indeed 
not well formed.

Even though the issue can be ignored, the overhead of parsing to 
identify issues still remains. I haven't attempted to write the 
algorithm assuming well-formed data, so I don't know what the 
performance would look like, but I suspect the validation cost 
isn't negligible. There is also likely some overhead from 
providing the tokens through range interfaces.

On another note, this slicing version of the CSV parser can and 
should be included in std.csv as a specialization. But it is by 
no means ready. The feature requirements need to be spelled out 
better (hasSlicing!Range fails for strings, yet strings are the 
primary use-case for the optimization), and escaped quotes remain 
in the returned data (like I said, proper parsing requires 
allocation).
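A sketch of the kind of constraint the specialization might use, since 
hasSlicing!Range is false for narrow strings even though index-slicing 
them is exactly what the optimization wants. The trait name here is 
hypothetical:

```d
import std.range.primitives : hasSlicing;
import std.traits : isSomeString;

// Hypothetical constraint for a slicing specialization: accept anything
// with random-access slicing, plus narrow strings, which fail hasSlicing
// (they are ranges of dchar) but are the primary use-case.
enum bool canSliceFields(Range) = isSomeString!Range || hasSlicing!Range;

void main()
{
    static assert(!hasSlicing!string);      // the problem described above
    static assert(canSliceFields!string);   // the workaround
    static assert(canSliceFields!(int[]));  // plain sliceable ranges still pass
}
```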

1. 
http://forum.dlang.org/post/chvukhbscgamxecvpwlw forum.dlang.org
2. 
https://www.euantorano.co.uk/posts/faster-command-line-tools-in-nim/
3. https://github.com/JesseKPhillips/JPDLibs/tree/csvoptimize
Jun 02
next sibling parent Johan Engelen <j j.nl> writes:
On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
 I compared these two: LDC -O3 -release
Quick note: Keep in mind that LDC does not do cross-module inlining (non-template functions) by default yet. It's good to check whether you see big differences with `-enable-cross-module-inlining`. -Johan
Jun 03
prev sibling next sibling parent reply bachmeier <no spam.net> writes:
On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
 Author here:

 The discussion[1] and articles[2] around "Faster Command Line 
 Tools" had me trying out std.csv for the task.

 Now I know std.csv isn't fast and it allocates. When I wrote my 
 CSV parser, I'd also left around a parser which used slicing 
 instead of allocation[3].
Do you know what happened with fastcsv [0]? Original thread: [1]. [0] https://github.com/quickfur/fastcsv [1] http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn puremagic.com
Jun 03
parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
 Do you know what happened with fastcsv [0], original thread [1].

 [0] https://github.com/quickfur/fastcsv
 [1] 
 http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn puremagic.com
I do not. Rereading that in light of this new article, I'm a little skeptical of the 51 times faster, since I'm seeing only 10x against these other implementations. In other news, I'm not sure the validation std.csv is doing is worth it.
Jun 03
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sun, Jun 04, 2017 at 05:41:10AM +0000, Jesse Phillips via Digitalmars-d
wrote:
 On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
 Do you know what happened with fastcsv [0], original thread [1].
 
 [0] https://github.com/quickfur/fastcsv
 [1] http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn puremagic.com
I do not. Rereading that in light of this new article I'm a little skeptical of the 51 times faster, since I'm seeing only 10x against these other implementations.
[...] You don't have to be skeptical, neither do you have to believe what I claimed. I posted the entire code I used in the original thread, as well as the URLs of the exact data files I used for testing. You can just run it yourself and see the results for yourself.

And yes, fastcsv has its limitations (the file has to fit in memory, no validation is done, etc.), which are also documented up-front in the README file. I wrote the code targeting a specific use case mentioned by the OP of the original thread, so I do not expect or claim you will see the same kind of performance for other use cases. If you want validation, then it's a given that you won't get maximum performance, simply because there's just more work to do.

For data sets that don't fit into memory, I already have some ideas about how to extend my algorithm to work with them, so some of the performance may still be retained. But obviously it's not going to be as fast as if you can just read the entire file into memory first.

(Note that this is much less of a limitation than it seems; for example you could use std.mmfile to memory-map the file into your address space so that it doesn't actually have to fit into memory, and you can still take slices of it. The OS will manage the paging from/to disk for you. Of course, it will be slower when something has to be paged in from disk, but IME this is often much faster than if you read the data into memory yourself. Again, you don't have to believe me: the fastcsv code is there, just import std.mmfile, mmap the largest CSV file you can find, call fastcsv on it, and measure the performance yourself. If your code performs better, great, tell us all about it. I'm interested to learn how you did it.)

Note that besides slicing, another major part of the performance boost in fastcsv is in minimizing GC allocations.
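The std.mmfile suggestion above can be sketched in a few lines (the file 
name is a placeholder and error handling is omitted):

```d
import std.mmfile : MmFile;
import std.stdio : writeln;

void main()
{
    // Memory-map the file: the OS pages it in on demand, and slices of
    // `data` point straight into the mapping, so nothing is copied into
    // user-space buffers up front.
    auto mmf = new MmFile("data.csv");
    auto data = cast(const(char)[]) mmf[];
    writeln("bytes mapped: ", data.length);
    // `data` can now be handed to a slicing parser such as fastcsv.
}
```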
If you allocate a string for each field in a row, it will be much slower than if you either sliced the original string, or allocated a large buffer for holding the data and just took slices for each field.

Furthermore, if you allocate a new array per row to hold the list of fields, it will be much slower than if you allocate a large array for holding all the fields of all the rows, and merely slice this array to get your rows. Of course, you cannot know ahead of time exactly how many rows there will be, so the next best thing is to allocate a series of large arrays, each capable of holding the field slices of k rows, for sufficiently large k. Once the current array runs out of space, copy the (partial) slices of the last row to the beginning of a new large array, and continue from there. This way, you will be making n/k allocations, where n is the number of rows and k is the number of rows that fit into each buffer, as opposed to n allocations. For large values of k, this greatly reduces the GC load and significantly speeds things up.

Again, don't take my word for it. Run a profiler on the fastcsv code and see for yourself.

T -- People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
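The chunked-buffer idea reads roughly like this in D (a sketch with 
assumed names, not fastcsv's actual code; it assumes a single row never 
has more fields than the buffer capacity):

```d
// Sketch of the n/k allocation strategy: field slices for many rows share
// one large buffer, and each row is just a slice of that buffer.
struct RowBuffer
{
    const(char)[][] buf;       // holds field slices for ~k rows
    size_t used, rowStart;

    this(size_t capacity) { buf = new const(char)[][](capacity); }

    void addField(const(char)[] f)
    {
        if (used == buf.length)
        {
            // Buffer full: allocate a fresh large buffer and carry over
            // the partial slices of the current row, as described above.
            auto next = new const(char)[][](buf.length);
            next[0 .. used - rowStart] = buf[rowStart .. used];
            used -= rowStart;
            rowStart = 0;
            buf = next;
        }
        buf[used++] = f;
    }

    // Finish the current row; the row is a slice of the big buffer, so no
    // per-row array allocation happens here.
    const(char)[][] endRow()
    {
        auto row = buf[rowStart .. used];
        rowStart = used;
        return row;
    }
}

void main()
{
    auto rb = RowBuffer(8);
    rb.addField("a");
    rb.addField("b");
    assert(rb.endRow() == ["a", "b"]);
}
```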
Jun 03
next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
 On Sun, Jun 04, 2017 at 05:41:10AM +0000, Jesse Phillips via 
 (Note that this is much less of a limitation than it seems; for 
 example you could use std.mmfile to memory-map the file into 
 your address space so that it doesn't actually have to fit into 
 memory, and you can still take slices of it. The OS will manage 
 the paging from/to disk for you. Of course, it will be slower 
 when something has to be paged from disk, but IME this is often 
 much faster than if you read the data into memory yourself.
If the file is in the kernel's file cache, memory mapping does not need to reload the file, as it is already in memory. In fact, calling mmap() generally changes only the sharing of the pages. That's where most of the performance win from memory mapping comes from. This Stack Overflow discussion [1] links to a realworldtech discussion with Linus Torvalds explaining it in detail. On Windows and Solaris the mechanism is the same. [1] https://stackoverflow.com/questions/5902629/mmap-msync-and-linux-process-termination/6219962#6219962
Jun 03
parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Sunday, 4 June 2017 at 06:54:46 UTC, Patrick Schluter wrote:
 On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
 On Sun, Jun 04, 2017 at 05:41:10AM +0000, Jesse Phillips via 
 (Note that this is much less of a limitation than it seems; 
 for example you could use std.mmfile to memory-map the file 
 into your address space so that it doesn't actually have to 
 fit into memory, and you can still take slices of it. The OS 
 will manage the paging from/to disk for you. Of course, it 
 will be slower when something has to be paged from disk, but 
 IME this is often much faster than if you read the data into 
 memory yourself.
If the file is in the kernel's file cache, memory mapping does not need to reload the file, as it is already in memory. In fact, calling mmap() generally changes only the sharing of the pages. That's where most of the performance win from memory mapping comes from.
To be precise, it's the copying of data that is spared by mmap. If the file is in the file cache, the open/read/write/close syscalls will also be fed from the memory-mapped cache entry, but this requires that the data be copied from kernel memory space to the process's buffer space, so each call to read() has to do this copying. The gain from mmap thus comes from avoiding the memory copies and avoiding the read/write/seek syscalls. The loading of the physical file into memory is the same in both cases.
 This stackoverflow [1] discussion links to a realworldtech 
 discussion with Linus Torvalds explaining it in detail. On 
 windows and Solaris the mechanism is the same.

 [1] 
 https://stackoverflow.com/questions/5902629/mmap-msync-and-linux-process-termination/6219962#6219962
Jun 04
prev sibling parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
 On Sun, Jun 04, 2017 at 05:41:10AM +0000, Jesse Phillips via 
 Digitalmars-d wrote:
 On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
 Do you know what happened with fastcsv [0], original thread 
 [1].
 
 [0] https://github.com/quickfur/fastcsv
 [1] 
 http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn puremagic.com
I do not. Rereading that in light of this new article I'm a little skeptical of the 51 times faster, since I'm seeing only 10x against these other implementations.
[...] You don't have to be skeptical, neither do you have to believe what I claimed. I posted the entire code I used in the original thread, as well as the URLs of the exact data files I used for testing. You can just run it yourself and see the results for yourself.
Ok, I took you up on that, and I'm still skeptical:

LDC2 -O3 -release -enable-cross-module-inlining

std.csv: 12487 msecs
fastcsv (no gc): 1376 msecs
csvslicing: 3039 msecs

That looks like about 10 times faster to me. Running the slicing version failed because of \r\n line endings (I guess multi-part separators are broken), so I changed the data file so I could get the execution time.

Anyway, I'm not trying to claim fastcsv isn't good at what it does; all I'm trying to point out is that std.csv is doing more work than these faster csv parsers. And I don't even want to claim that std.csv is better because of that work; it actually appears that it was a mistake to do validation.
Jun 04
next sibling parent Seb <seb wilzba.ch> writes:
On Sunday, 4 June 2017 at 15:59:03 UTC, Jesse Phillips wrote:
 On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
 [...]
 Ok, I took you up on that, I'm still skeptical:
 
 LDC2 -O3 -release -enable-cross-module-inlining
 
 std.csv: 12487 msecs
 fastcsv (no gc): 1376 msecs
 csvslicing: 3039 msecs
 
 That looks like about 10 times faster to me. Using the slicing 
 version failed because of \r\n line endings (guess multi-part 
 separators is broken) I changed the data file so I could get the 
 execution time.
 
 Anyway, I'm not trying to claim fastcsv isn't good at what it 
 does, all I'm trying to point out is std.csv is doing more work 
 than these faster csv parsers. And I don't even want to claim 
 that std.csv is better because of that work, it actually appears 
 that it was a mistake to do validation.
In case you have time, it would be very interesting to compare it with other state of the art tools like paratext: http://www.wise.io/tech/paratext
Jun 04
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sun, Jun 04, 2017 at 03:59:03PM +0000, Jesse Phillips via Digitalmars-d
wrote:
[...]
 Ok, I took you up on that, I'm still skeptical:
 
 LDC2 -O3 -release -enable-cross-module-inlining
 
 std.csv: 12487 msecs
 fastcsv (no gc): 1376 msecs
 csvslicing: 3039 msecs
 
 That looks like about 10 times faster to me. Using the slicing version
 failed because of \r\n line endings (guess multi-part separators is
 broken) I changed the data file so I could get the execution time.
Thank you for testing it yourself. I also tried to run the tests again on my machine, and I can't seem to reproduce the 102136 msecs reading again. It seems that different compilers give somewhat different readings, and we are also using different compile flags. In any case, in the spirit of full disclosure, here's my test with the 3 compilers, run just now to be sure I'm not copying old, bad measurements:

$ dmd -O -inline benchmark.d fastcsv.d
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 33119 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 2596 msecs

$ gdc -O3 -finline benchmark.d fastcsv.d -o benchmark
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 23103 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 1909 msecs

$ ldc2 -O3 benchmark.d fastcsv.d
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 20776 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 1813 msecs

So it looks like your 10x figure is more or less on target. I've no idea where the original 102136 msecs reading came from. Perhaps that was an unfortunate coincidence with a heavy load on my machine, or something like that. Or maybe just a bungled copy-n-paste.
 Anyway, I'm not trying to claim fastcsv isn't good at what it does,
 all I'm trying to point out is std.csv is doing more work than these
 faster csv parsers. And I don't even want to claim that std.csv is
 better because of that work, it actually appears that it was a mistake
 to do validation.
I never intended for fastcsv to become a point of contention or some kind of competition with std.csv, and I apologize if I ever came across that way. I fully understand that std.csv does more work than fastcsv; certainly, being able to assume an in-memory input and free slicing gives a big advantage over being restricted to just input range primitives.

I had hoped to actually work fastcsv into a suitable form to merge into std.csv -- to dispel wrong perceptions of D being "slow", you see -- but it turned out to be more work than I had time for, so I didn't get very far beyond the initial promising results. My original hope was that the fastcsv code would be taken as a source of ideas that we could adopt for speeding up std.csv, rather than be taken in the way of "haha I wrote faster code than std.csv, so std.csv sux". The latter was not my intention at all.

Anyway, I'm glad that you're looking into using slicing in std.csv. We need Phobos modules to be highly performant so that newcomers don't get the wrong impression about the language being slow. I'd also recommend investigating reducing GC load, as I described in my previous post, as another angle for improving the performance of std.csv.

As for whether to validate or not: if you were to ask me, I'd leave it in, with a caveat in the docs that it would be less performant. As the standard library, Phobos should give the user options, including the option to validate input files that could potentially be malformed. But where the user knows the input is always well-formed, we can (and should) take advantage of that to achieve better performance.

T -- Why waste time reinventing the wheel, when you could be reinventing the engine? -- Damian Conway
Jun 05
prev sibling next sibling parent Anonymouse <asdf asdf.net> writes:
On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
 Even though the issue can be ignored, the overhead of parsing 
 to identify issues still remains. I haven't attempted write the 
 algorithm assuming proper data structure so I don't know what 
 the performance would look like, but I suspect it isn't 
 negligible. There is also likely some overhead for providing 
 the tokens through range interfaces.
My immediate idea is to have the cake and eat it too with parseXML!(Yes.validate)(...), but that's more work.
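The compile-time toggle could look something like this (hypothetical 
names and signature; std.csv's actual API takes no such flag today):

```d
import std.typecons : Flag, Yes, No;

alias Validate = Flag!"validate";

// Hypothetical parser entry point: validation is compiled in or out per
// instantiation, so the fast path pays nothing for the checks.
size_t countFields(Validate validate = Yes.validate)(string row)
{
    size_t n = 1;
    foreach (c; row)
    {
        static if (validate)
        {
            // Only the validating instantiation contains this branch.
            if (c == '"')
                throw new Exception("quoted fields not supported in sketch");
        }
        if (c == ',') ++n;
    }
    return n;
}

void main()
{
    assert(countFields!(Yes.validate)("a,b,c") == 3);
    assert(countFields!(No.validate)(`a,"b",c`) == 3); // no checks done
}
```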
Jun 04
prev sibling parent Jon Degenhardt <jond noreply.com> writes:
On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
 Author here:

 The discussion[1] and articles[2] around "Faster Command Line 
 Tools" had me trying out std.csv for the task.

 [snip]

 Keep in mind that the file itself has 10,512,769 rows of data 
 with four columns. Now I've talked to std.csv's performance in 
 the past, probably with the author of the fast command line 
 tools.
 
 [snip]
I'm the author of the TSV tools. I'd be happy to provide insights I've gained from this exercise to help improve std.csv. I did examine std.csv when I first started writing the TSV tools, but decided it wasn't appropriate for what I was trying to do. I don't remember the specifics at this point though. CSV imposes many requirements on an implementation that TSV does not, which makes it challenging to produce an implementation optimal for both. --Jon
Jun 05