digitalmars.D.learn - What are the best available D (not C) File input/output options?
- confuzzled (18/18) Nov 02 2023 I've ported a small script from C to D. The original C version takes
- Julian Fondren (11/32) Nov 02 2023 First, strace your program. The slowest thing about I/O is the
- Steven Schveighoffer (10/12) Nov 02 2023 Just specifically to answer this, this is so you understand this
- confuzzled (2/7) Nov 05 2023 Thanks Steve, I will try that.
- Sergey (8/11) Nov 02 2023 In my experience I/O in D is quite slow.
- confuzzled (27/42) Nov 05 2023 Good morning,
I've ported a small script from C to D. The original C version takes roughly 6.5 minutes to parse a 12G file while the port originally took about 48 minutes. My naïve attempt to improve the situation pushed it over an hour and 15 minutes. However, replacing std.stdio:File with core.stdc.stdio:FILE* and changing my output code in this latest version from: outputFile.writefln("%c\t%u\t%u\t%d.%09u\t%c", ...) to fprintf(outputFile, "%c,%u,%u,%llu.%09llu,%c\n", ...) reduced the processing time to roughly 7.5 minutes. Why is File.writefln() so appallingly slow? Is there a better D alternative? I tried std.io but write() only outputs ubyte[] while I'm trying to output text so I abandoned idea early. Now that I've got the program execution time within an acceptable range, I tried replacing core.stdc.fread() with std.io.read() but that increased the time to 24 minutes. Now I'm starting to think there is something seriously wrong with my understanding of how to use D correctly because there's no way D's input/output capabilities can suck so bad in comparison to C's.
Nov 02 2023
On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote:I've ported a small script from C to D. The original C version takes roughly 6.5 minutes to parse a 12G file while the port originally took about 48 minutes. My naïve attempt to improve the situation pushed it over an hour and 15 minutes. However, replacing std.stdio:File with core.stdc.stdio:FILE* and changing my output code in this latest version from: outputFile.writefln("%c\t%u\t%u\t%d.%09u\t%c", ...) to fprintf(outputFile, "%c,%u,%u,%llu.%09llu,%c\n", ...) reduced the processing time to roughly 7.5 minutes. Why is File.writefln() so appallingly slow? Is there a better D alternative?First, strace your program. The slowest thing about I/O is the syscall itself. If the D program does more syscalls, it's going to be slower almost no matter what else is going on. Both D and C are using libc to buffer I/O to reduce syscalls, but you might be defeating that by constantly flushing the buffer.I tried std.io but write() only outputs ubyte[] while I'm trying to output text so I abandoned idea early.string -> immutable(ubyte)[]: alias with std.string.representation(st) 'alias' meaning, this doesn't allocate. If gives you a byte slice of the same memory the string is using. You'd still need to do the formatting, before writing.Now that I've got the program execution time within an acceptable range, I tried replacing core.stdc.fread() with std.io.read() but that increased the time to 24 minutes. Now I'm starting to think there is something seriously wrong with my understanding of how to use D correctly because there's no way D's input/output capabilities can suck so bad in comparison to C's.
Nov 02 2023
On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote:I tried std.io but write() only outputs ubyte[] while I'm trying to output text so I abandoned idea early.Just specifically to answer this, this is so you understand this is what is going into the file -- bytes. You should use a buffering library like iopipe to write properly here (it handles the encoding of text for you). And I really don't have a good formatting library, you can rely on formattedWrite maybe. A lot of things need to be better for this solution to be smooth, it's one of the things I have to work on. -Steve
Nov 02 2023
On 11/3/23 2:30 AM, Steven Schveighoffer wrote:On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote: You should use a buffering library like iopipe to write properly here (it handles the encoding of text for you).Thanks Steve, I will try that.
Nov 05 2023
On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote:I've ported a small script from C to D. The original C version takes roughly 6.5 minutes to parse a 12G file while the port originally took about 48 minutes.In my experience I/O in D is quite slow. But you can try to improve it: Try to use std.outbuffer instead of writeln. And flush the result only in the end. Also check this article. It is showing how manual buffers in D could speed up the processing of files significantly: https://tech.nextroll.com/blog/data/2014/11/17/d-is-for-data-science.html
Nov 02 2023
Good morning, First, thanks to you, Steve, and Julian for responding to my inquiry. On 11/3/23 4:59 AM, Sergey wrote:On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote:Unless I did it incorrectly, this did nothing for me. My understanding is that I should first prepare an OutBuffer to which I write all my output. Once complete, I then write the OutBuffer to file; which still requires the use of writeln, albeit not as often. First I tried buffering the entire thing, but that turned out to be a big mistake. Next I tried writing and clearing the buffer every 100_000 records (about 3000 writeln calls). Not as bad as the first attempt but significantly worse than what I obtained with the fopen/fprintf combo. I even tried writing the buffer to disk with fprintf but jumped ship because it took far longer than fopen/fprintf. Can't say how much longer because I terminated execution at 14 minutes.I've ported a small script from C to D. The original C version takes roughly 6.5 minutes to parse a 12G file while the port originally took about 48 minutes.In my experience I/O in D is quite slow. But you can try to improve it: Try to use std.outbuffer instead of writeln. And flush the result only in the end.Also check this article. It is showing how manual buffers in D could speed up the processing of files significantly: https://tech.nextroll.com/blog/data/2014/11/17/d-is-for-data-science.htmlThe link above was quite helpful. Thanks. I am a bit slow on the uptake so it took a while to figure out how to apply the idea to my own use case. However, once I figured it out, the result was 2 minutes faster than the original C implementation and 3 minutes faster than the fopen/printf port. Whether it did anything for the writeln implementation or not, I don't know. Wasn't will to wait 45+ minutes for something that can feasibly be done in 6 minutes. I gave up at 12. Haven't played with std.string.representation as suggested by Julian as yet but I plan to. Thank again. --Confuzzled
Nov 05 2023