
digitalmars.D - std.csv Performance Review

reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
Author here:

The discussion[1] and articles[2] around "Faster Command Line 
Tools" had me trying out std.csv for the task.

Now I know std.csv isn't fast and it allocates. When I wrote my 
CSV parser, I'd also left around a parser which used slicing 
instead of allocation[3].

I compared these two: LDC -O3 -release

std.csv: over 10 seconds
csv slicing: under 5 seconds

Over 50% improvement isn't bad, but this still wasn't competing 
with the other implementations. Now I didn't profile std.csv's 
implementation but I did take a look at the one with slicing.

The majority of the time was spent in std.algorithm.startsWith, 
which is being called by countUntil. The calls made to empty() 
also add up from their use in countUntil and startsWith. These 
functions are by no means slow; startsWith averaged 1 millisecond 
execution time while countUntil was up to 5 milliseconds. The 
thing is, startsWith was called a whopping 384,806,160 times.
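For readers unfamiliar with where those calls come from, here is a 
minimal sketch (not std.csv's actual code) of how a slicing tokenizer 
finds field boundaries with countUntil, which walks the range one 
element at a time and probes with startsWith at each step:

```d
import std.algorithm.searching : countUntil;

void main()
{
    // Hypothetical slicing tokenizer: find the separator, slice the field.
    // countUntil advances one element at a time, which is why its call
    // count (and startsWith's, used internally) explodes on a
    // 10-million-row file.
    string line = "a,b,c";
    auto idx = line.countUntil(',');
    assert(idx == 1);
    auto field = line[0 .. idx];   // a slice of the input, no allocation
    assert(field == "a");
}
```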

Keep in mind that the file itself has 10,512,769 rows of data 
with four columns. Now, I've talked about std.csv's performance 
in the past, probably with the author of the fast command line 
tools. Essentially it came down to this: std.csv is restricted to 
parsing with only the input range API, and you can't correctly 
parse CSV without allocation. But now I'm working outside those 
restrictions, so I offer an additional point.

Both of these do something none of the other implementations do: 
they validate that the CSV is well formed. If they find that the 
file no longer conforms to the correct CSV layout they make a 
choice, either throwing an exception or guessing and continuing 
on (based on what the user requested). While the Nim 
implementation does handle escaped quotes (and newlines, unlike 
fast csv), its parsing assumes the file is well formed, and 
std.csv was quite prompt to point out that this file is indeed 
not well formed.

Even though the issue can be ignored, the overhead of parsing to 
identify issues still remains. I haven't attempted to write the 
algorithm assuming well-formed data, so I don't know what the 
performance would look like, but I suspect the validation cost 
isn't negligible. There is also likely some overhead from 
providing the tokens through range interfaces.

On another note, this slicing version of the CSV parser can and 
should be included in std.csv as a specialization. But it is by 
no means ready. The feature requirements need to be spelled out 
better (hasSlicing!Range fails for strings, yet strings are the 
primary use-case for the optimization), and escaped quotes remain 
in the returned data (like I said, proper parsing requires 
allocation).
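A sketch of the kind of constraint the specialization might use, since 
hasSlicing!Range is false for narrow strings even though index-slicing 
them is exactly what the optimization wants. The trait name here is 
hypothetical:

```d
import std.range.primitives : hasSlicing;
import std.traits : isSomeString;

// Hypothetical constraint for a slicing specialization: accept anything
// with random-access slicing, plus narrow strings, which fail hasSlicing
// (they are ranges of dchar) but are the primary use-case.
enum bool canSliceFields(Range) = isSomeString!Range || hasSlicing!Range;

void main()
{
    static assert(!hasSlicing!string);      // the problem described above
    static assert(canSliceFields!string);   // the workaround
    static assert(canSliceFields!(int[]));  // plain sliceable ranges still pass
}
```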

1. 
http://forum.dlang.org/post/chvukhbscgamxecvpwlw forum.dlang.org
2. 
https://www.euantorano.co.uk/posts/faster-command-line-tools-in-nim/
3. https://github.com/JesseKPhillips/JPDLibs/tree/csvoptimize
Jun 02
next sibling parent Johan Engelen <j j.nl> writes:
On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
 I compared these two: LDC -O3 -release
Quick note: Keep in mind that LDC does not do cross-module inlining (non-template functions) by default yet. It's good to check whether you see big differences with `-enable-cross-module-inlining`. -Johan
Jun 03
prev sibling next sibling parent reply bachmeier <no spam.net> writes:
On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
 Author here:

 The discussion[1] and articles[2] around "Faster Command Line 
 Tools" had me trying out std.csv for the task.

 Now I know std.csv isn't fast and it allocates. When I wrote my 
 CSV parser, I'd also left around a parser which used slicing 
 instead of allocation[3].
Do you know what happened with fastcsv [0]? Original thread: [1]. [0] https://github.com/quickfur/fastcsv [1] http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn puremagic.com
Jun 03
parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
 Do you know what happened with fastcsv [0], original thread [1].

 [0] https://github.com/quickfur/fastcsv
 [1] 
 http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn puremagic.com
I do not. Rereading that in light of this new article, I'm a little skeptical of the 51 times faster, since I'm seeing only 10x against these other implementations. In other news, I'm not sure the validation std.csv is doing is worth it.
Jun 03
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sun, Jun 04, 2017 at 05:41:10AM +0000, Jesse Phillips via Digitalmars-d
wrote:
 On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
 Do you know what happened with fastcsv [0], original thread [1].
 
 [0] https://github.com/quickfur/fastcsv
 [1] http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn puremagic.com
I do not. Rereading that in light of this new article I'm a little skeptical of the 51 times faster, since I'm seeing only 10x against these other implementations.
[...] You don't have to be skeptical, neither do you have to believe what I claimed. I posted the entire code I used in the original thread, as well as the URLs of the exact data files I used for testing. You can just run it yourself and see the results for yourself.

And yes, fastcsv has its limitations (the file has to fit in memory, no validation is done, etc.), which are also documented up-front in the README file. I wrote the code targeting a specific use case mentioned by the OP of the original thread, so I do not expect or claim you will see the same kind of performance for other use cases. If you want validation, then it's a given that you won't get maximum performance, simply because there's just more work to do.

For data sets that don't fit into memory, I already have some ideas about how to extend my algorithm to work with them, so some of the performance may still be retained. But obviously it's not going to be as fast as if you can just read the entire file into memory first.

(Note that this is much less of a limitation than it seems; for example you could use std.mmfile to memory-map the file into your address space so that it doesn't actually have to fit into memory, and you can still take slices of it. The OS will manage the paging from/to disk for you. Of course, it will be slower when something has to be paged in from disk, but IME this is often much faster than if you read the data into memory yourself. Again, you don't have to believe me: the fastcsv code is there, just import std.mmfile, mmap the largest CSV file you can find, call fastcsv on it, and measure the performance yourself. If your code performs better, great, tell us all about it. I'm interested to learn how you did it.)

Note that besides slicing, another major part of the performance boost in fastcsv is in minimizing GC allocations.
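The std.mmfile suggestion above can be sketched in a few lines (the file 
name is a placeholder and error handling is omitted):

```d
import std.mmfile : MmFile;
import std.stdio : writeln;

void main()
{
    // Memory-map the file: the OS pages it in on demand, and slices of
    // `data` point straight into the mapping, so nothing is copied into
    // user-space buffers up front.
    auto mmf = new MmFile("data.csv");
    auto data = cast(const(char)[]) mmf[];
    writeln("bytes mapped: ", data.length);
    // `data` can now be handed to a slicing parser such as fastcsv.
}
```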
If you allocate a string for each field in a row, it will be much slower than if you either sliced the original string, or allocated a large buffer for holding the data and just took slices for each field.

Furthermore, if you allocate a new array per row to hold the list of fields, it will be much slower than if you allocate a large array for holding all the fields of all the rows, and merely slice this array to get your rows. Of course, you cannot know ahead of time exactly how many rows there will be, so the next best thing is to allocate a series of large arrays, each capable of holding the field slices of k rows, for sufficiently large k. Once the current array runs out of space, copy the (partial) slices of the last row to the beginning of a new large array, and continue from there. This way, you will be making n/k allocations, where n is the number of rows and k is the number of rows that fit into each buffer, as opposed to n allocations. For large values of k, this greatly reduces the GC load and significantly speeds things up.

Again, don't take my word for it. Run a profiler on the fastcsv code and see for yourself.

T -- People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
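The chunked-buffer idea reads roughly like this in D (a sketch with 
assumed names, not fastcsv's actual code; it assumes a single row never 
has more fields than the buffer capacity):

```d
// Sketch of the n/k allocation strategy: field slices for many rows share
// one large buffer, and each row is just a slice of that buffer.
struct RowBuffer
{
    const(char)[][] buf;       // holds field slices for ~k rows
    size_t used, rowStart;

    this(size_t capacity) { buf = new const(char)[][](capacity); }

    void addField(const(char)[] f)
    {
        if (used == buf.length)
        {
            // Buffer full: allocate a fresh large buffer and carry over
            // the partial slices of the current row, as described above.
            auto next = new const(char)[][](buf.length);
            next[0 .. used - rowStart] = buf[rowStart .. used];
            used -= rowStart;
            rowStart = 0;
            buf = next;
        }
        buf[used++] = f;
    }

    // Finish the current row; the row is a slice of the big buffer, so no
    // per-row array allocation happens here.
    const(char)[][] endRow()
    {
        auto row = buf[rowStart .. used];
        rowStart = used;
        return row;
    }
}

void main()
{
    auto rb = RowBuffer(8);
    rb.addField("a");
    rb.addField("b");
    assert(rb.endRow() == ["a", "b"]);
}
```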
Jun 03
next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
 On Sun, Jun 04, 2017 at 05:41:10AM +0000, Jesse Phillips via 
 (Note that this is much less of a limitation than it seems; for 
 example you could use std.mmfile to memory-map the file into 
 your address space so that it doesn't actually have to fit into 
 memory, and you can still take slices of it. The OS will manage 
 the paging from/to disk for you. Of course, it will be slower 
 when something has to be paged from disk, but IME this is often 
 much faster than if you read the data into memory yourself.
If the file is in the kernel's file cache, memory mapping does not need to reload the file, as it is already in memory. In fact, calling mmap() generally changes only the sharing of the pages. That's where most of the performance win from memory mapping comes from. This Stack Overflow discussion [1] links to a realworldtech discussion with Linus Torvalds explaining it in detail. On Windows and Solaris the mechanism is the same. [1] https://stackoverflow.com/questions/5902629/mmap-msync-and-linux-process-termination/6219962#6219962
Jun 03
parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Sunday, 4 June 2017 at 06:54:46 UTC, Patrick Schluter wrote:
 On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
 On Sun, Jun 04, 2017 at 05:41:10AM +0000, Jesse Phillips via 
 (Note that this is much less of a limitation than it seems; 
 for example you could use std.mmfile to memory-map the file 
 into your address space so that it doesn't actually have to 
 fit into memory, and you can still take slices of it. The OS 
 will manage the paging from/to disk for you. Of course, it 
 will be slower when something has to be paged from disk, but 
 IME this is often much faster than if you read the data into 
 memory yourself.
If the file is in the kernel's file cache, memory mapping does not need to reload the file, as it is already in memory. In fact, calling mmap() generally changes only the sharing of the pages. That's where most of the performance win from memory mapping comes from.
To be precise, it's the copying of data that is spared by mmap. If the file is in the file cache, the open/read/write/close syscalls will also be fed from the memory-mapped cache entry, but this requires that the data be copied from kernel memory space to the process's buffer space, so each call to read() has to do this copying. The gain from mmap thus comes from avoiding the memory copies and avoiding the read/write/seek syscalls. The loading of the physical file into memory is the same in both cases.
 This stackoverflow [1] discussion links to a realworldtech 
 discussion with Linus Torvalds explaining it in detail. On 
 windows and Solaris the mechanism is the same.

 [1] 
 https://stackoverflow.com/questions/5902629/mmap-msync-and-linux-process-termination/6219962#6219962
Jun 04
prev sibling parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
 On Sun, Jun 04, 2017 at 05:41:10AM +0000, Jesse Phillips via 
 Digitalmars-d wrote:
 On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
 Do you know what happened with fastcsv [0], original thread 
 [1].
 
 [0] https://github.com/quickfur/fastcsv
 [1] 
 http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn puremagic.com
I do not. Rereading that in light of this new article I'm a little skeptical of the 51 times faster, since I'm seeing only 10x against these other implementations.
[...] You don't have to be skeptical, neither do you have to believe what I claimed. I posted the entire code I used in the original thread, as well as the URLs of the exact data files I used for testing. You can just run it yourself and see the results for yourself.
Ok, I took you up on that, and I'm still skeptical:

LDC2 -O3 -release -enable-cross-module-inlining

std.csv: 12487 msecs
fastcsv (no gc): 1376 msecs
csvslicing: 3039 msecs

That looks like about 10 times faster to me. Running the slicing version failed because of \r\n line endings (I guess multi-part separators are broken), so I changed the data file so I could get the execution time.

Anyway, I'm not trying to claim fastcsv isn't good at what it does; all I'm trying to point out is that std.csv is doing more work than these faster csv parsers. And I don't even want to claim that std.csv is better because of that work; it actually appears that it was a mistake to do validation.
Jun 04
next sibling parent Seb <seb wilzba.ch> writes:
On Sunday, 4 June 2017 at 15:59:03 UTC, Jesse Phillips wrote:
 On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
 [...]
 Ok, I took you up on that, I'm still skeptical:
 
 LDC2 -O3 -release -enable-cross-module-inlining
 
 std.csv: 12487 msecs
 fastcsv (no gc): 1376 msecs
 csvslicing: 3039 msecs
 
 That looks like about 10 times faster to me. Using the slicing 
 version failed because of \r\n line endings (guess multi-part 
 separators is broken) I changed the data file so I could get the 
 execution time.
 
 Anyway, I'm not trying to claim fastcsv isn't good at what it 
 does, all I'm trying to point out is std.csv is doing more work 
 than these faster csv parsers. And I don't even want to claim 
 that std.csv is better because of that work, it actually appears 
 that it was a mistake to do validation.
In case you have time, it would be very interesting to compare it with other state of the art tools like paratext: http://www.wise.io/tech/paratext
Jun 04
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sun, Jun 04, 2017 at 03:59:03PM +0000, Jesse Phillips via Digitalmars-d
wrote:
[...]
 Ok, I took you up on that, I'm still skeptical:
 
 LDC2 -O3 -release -enable-cross-module-inlining
 
 std.csv: 12487 msecs
 fastcsv (no gc): 1376 msecs
 csvslicing: 3039 msecs
 
 That looks like about 10 times faster to me. Using the slicing version
 failed because of \r\n line endings (guess multi-part separators is
 broken) I changed the data file so I could get the execution time.
Thank you for testing it yourself. I also tried to run the tests again on my machine, and I can't seem to reproduce the 102136 msecs reading again. It seems that different compilers give somewhat different readings, and we are also using different compile flags. In any case, in the spirit of full disclosure, here's my test with the 3 compilers, run just now to be sure I'm not copying old, bad measurements:

$ dmd -O -inline benchmark.d fastcsv.d
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 33119 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 2596 msecs

$ gdc -O3 -finline benchmark.d fastcsv.d -o benchmark
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 23103 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 1909 msecs

$ ldc2 -O3 benchmark.d fastcsv.d
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 20776 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 1813 msecs

So it looks like your 10x figure is more or less on target. I've no idea where the original 102136 msecs reading came from. Perhaps that was an unfortunate coincidence with a heavy load on my machine, or something like that. Or maybe just a bungled copy-n-paste.
 Anyway, I'm not trying to claim fastcsv isn't good at what it does,
 all I'm trying to point out is std.csv is doing more work than these
 faster csv parsers. And I don't even want to claim that std.csv is
 better because of that work, it actually appears that it was a mistake
 to do validation.
I never intended for fastcsv to become a point of contention or some kind of competition with std.csv, and I apologize if I ever came across that way. I fully understand that std.csv does more work than fastcsv; certainly, being able to assume an in-memory input and free slicing gives a big advantage over being restricted to just input range primitives.

I had hoped to actually work fastcsv into a suitable form to merge into std.csv -- to dispel wrong perceptions of D being "slow", you see -- but it turned out to be more work than I had time for, so I didn't get very far beyond the initial promising results. My original hope was that the fastcsv code would be taken as a source of ideas that we could adopt for speeding up std.csv, rather than be taken in the way of "haha I wrote faster code than std.csv, so std.csv sux". The latter was not my intention at all.

Anyway, I'm glad that you're looking into using slicing in std.csv. We need Phobos modules to be highly performant so that newcomers don't get the wrong impression about the language being slow. I'd also recommend investigating reducing GC load, as I described in my previous post, as another angle for improving the performance of std.csv.

As for whether to validate or not: if you were to ask me, I'd leave it in, with a caveat in the docs that it would be less performant. As the standard library, Phobos should give the user options, including the option to validate input files that could potentially be malformed. But where the user knows the input is always well-formed, we can (and should) take advantage of that to achieve better performance.

T -- Why waste time reinventing the wheel, when you could be reinventing the engine? -- Damian Conway
Jun 05
prev sibling next sibling parent Anonymouse <asdf asdf.net> writes:
On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
 Even though the issue can be ignored, the overhead of parsing 
 to identify issues still remains. I haven't attempted write the 
 algorithm assuming proper data structure so I don't know what 
 the performance would look like, but I suspect it isn't 
 negligible. There is also likely some overhead for providing 
 the tokens through range interfaces.
My immediate idea is to have the cake and eat it too with parseXML!(Yes.validate)(...), but that's more work.
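The compile-time toggle could look something like this (hypothetical 
names and signature; std.csv's actual API takes no such flag today):

```d
import std.typecons : Flag, Yes, No;

alias Validate = Flag!"validate";

// Hypothetical parser entry point: validation is compiled in or out per
// instantiation, so the fast path pays nothing for the checks.
size_t countFields(Validate validate = Yes.validate)(string row)
{
    size_t n = 1;
    foreach (c; row)
    {
        static if (validate)
        {
            // Only the validating instantiation contains this branch.
            if (c == '"')
                throw new Exception("quoted fields not supported in sketch");
        }
        if (c == ',') ++n;
    }
    return n;
}

void main()
{
    assert(countFields!(Yes.validate)("a,b,c") == 3);
    assert(countFields!(No.validate)(`a,"b",c`) == 3); // no checks done
}
```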
Jun 04
prev sibling parent Jon Degenhardt <jond noreply.com> writes:
On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
 Author here:

 The discussion[1] and articles[2] around "Faster Command Line 
 Tools" had me trying out std.csv for the task.

 [snip]

 Keep in mind that the file itself has 10,512,769 rows of data 
 with four columns. Now I've talked to std.csv's performance in 
 the past, probably with the author of the fast command line 
 tools.
 
 [snip]
I'm the author of the TSV tools. I'd be happy to provide insights I've gained from this exercise to help improve std.csv. I did examine std.csv when I first started writing the TSV tools, but decided it wasn't appropriate for what I was trying to do. I don't remember the specifics at this point though. CSV imposes many requirements on an implementation that TSV does not, which makes it challenging to produce an implementation optimal for both. --Jon
Jun 05