digitalmars.D.learn - What are the best available D (not C) File input/output options?

confuzzled <con fuzzled.com> writes:
I've ported a small script from C to D. The original C version takes 
roughly 6.5 minutes to parse a 12G file while the port originally took 
about 48 minutes. My naïve attempt to improve the situation pushed it 
over an hour and 15 minutes. However, replacing std.stdio:File with 
core.stdc.stdio:FILE* and changing my output code in this latest version 
from:

	outputFile.writefln("%c\t%u\t%u\t%d.%09u\t%c", ...)

to
	fprintf(outputFile, "%c,%u,%u,%llu.%09llu,%c\n", ...)

reduced the processing time to roughly 7.5 minutes. Why is 
File.writefln() so appallingly slow? Is there a better D alternative?

I tried std.io but write() only outputs ubyte[] while I'm trying to 
output text, so I abandoned the idea early. Now that I've got the program 
execution time within an acceptable range, I tried replacing 
core.stdc.fread() with std.io.read() but that increased the time to 24 
minutes. Now I'm starting to think there is something seriously wrong 
with my understanding of how to use D correctly because there's no way 
D's input/output capabilities can suck so bad in comparison to C's.
Nov 02 2023
Julian Fondren <julian.fondren gmail.com> writes:
On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote:
 I've ported a small script from C to D. The original C version 
 takes roughly 6.5 minutes to parse a 12G file while the port 
 originally took about 48 minutes. My naïve attempt to improve 
 the situation pushed it over an hour and 15 minutes. However, 
 replacing std.stdio:File with core.stdc.stdio:FILE* and 
 changing my output code in this latest version from:

 	outputFile.writefln("%c\t%u\t%u\t%d.%09u\t%c", ...)

 to
 	fprintf(outputFile, "%c,%u,%u,%llu.%09llu,%c\n", ...)

 reduced the processing time to roughly 7.5 minutes. Why is 
 File.writefln() so appallingly slow? Is there a better D 
 alternative?
First, strace your program. The slowest thing about I/O is the syscall itself. If the D program does more syscalls, it's going to be slower almost no matter what else is going on. Both D and C are using libc to buffer I/O to reduce syscalls, but you might be defeating that by constantly flushing the buffer.
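
A minimal sketch of what "not defeating the buffer" could look like with std.stdio, assuming the output is a simple tab-separated record per line. The function name `writeRecords`, the 1 MiB buffer size, and the record layout are made up for illustration; only `File.setvbuf`, `File.lockingTextWriter`, and `formattedWrite` come from Phobos.

```d
import std.format : formattedWrite;
import std.stdio : File;

/// Writes `count` tab-separated records to `path` through one large stdio
/// buffer. `writeRecords` and the record layout are hypothetical.
void writeRecords(string path, uint count)
{
    auto outputFile = File(path, "w");
    // Ask libc for a 1 MiB fully buffered stream so write syscalls are rare.
    outputFile.setvbuf(1 << 20);
    // Take the stream lock once for the whole loop instead of once per call.
    auto writer = outputFile.lockingTextWriter();
    foreach (i; 0 .. count)
    {
        // No flush() in the loop: libc flushes when the buffer fills
        // and again when the file is closed.
        writer.formattedWrite("%s\t%s\n", i, i * 2);
    }
}
```

Running the program under strace before and after a change like this should show whether the number of write syscalls actually dropped.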
 I tried std.io but write() only outputs ubyte[] while I'm 
 trying to output text so I abandoned idea early.
string -> immutable(ubyte)[]: alias with std.string.representation(st). 'Alias' meaning that this doesn't allocate: it gives you a byte slice of the same memory the string is using. You'd still need to do the formatting before writing.
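
As a small illustration of the point about representation, here is a sketch that formats a record and then views it as bytes. The helper name `recordBytes` and its fields are hypothetical; the formatting step allocates, the conversion to bytes does not.

```d
import std.format : format;
import std.string : representation;

/// Formats one record as text and returns a byte view of it.
/// `recordBytes` and its parameters are hypothetical names.
immutable(ubyte)[] recordBytes(char side, uint id, uint qty)
{
    // The formatting step allocates a new string.
    string line = format("%c\t%u\t%u\n", side, id, qty);
    // representation reinterprets the same memory as bytes; no copy is made.
    return line.representation;
}
```

The returned slice can be handed to any API that takes ubyte[], such as the write() mentioned in the question.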
 Now that I've got the program execution time within an 
 acceptable range, I tried replacing core.stdc.fread() with 
 std.io.read() but that increased the time to 24 minutes. Now 
 I'm starting to think there is something seriously wrong with 
 my understanding of how to use D correctly because there's no 
 way D's input/output capabilities can suck so bad in comparison 
 to C's.
Nov 02 2023
Steven Schveighoffer <schveiguy gmail.com> writes:
On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote:

 I tried std.io but write() only outputs ubyte[] while I'm 
 trying to output text so I abandoned idea early.
Just to answer this specifically: the API takes ubyte[] so that you understand what is actually going into the file -- bytes. You should use a buffering library like iopipe to write properly here (it handles the encoding of text for you). I don't really have a good formatting library; you could perhaps rely on formattedWrite. A lot of things need to be better for this solution to be smooth; it's one of the things I have to work on. -Steve
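
A rough sketch of the formattedWrite part only, assuming the text is collected in a reusable std.array.Appender before being written out as bytes. This does not use iopipe; `appendRecord` and the record fields are hypothetical names.

```d
import std.array : Appender, appender;
import std.format : formattedWrite;

/// Appends one formatted record to `buf`.
/// `appendRecord` and its parameters are hypothetical.
void appendRecord(ref Appender!(char[]) buf, char side, uint id,
                  ulong secs, ulong nanos)
{
    // formattedWrite formats straight into the appender's memory,
    // so no temporary string is created per record.
    buf.formattedWrite("%c\t%u\t%d.%09d\n", side, id, secs, nanos);
}
```

After a batch of records, buf.data holds the text as char[], which can be passed to a byte-oriented writer, and buf.clear() resets the length while keeping the allocation for the next batch.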
Nov 02 2023
confuzzled <con fuzzled.com> writes:
On 11/3/23 2:30 AM, Steven Schveighoffer wrote:
 On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote:
 
 You should use a buffering library like iopipe to write properly here 
 (it handles the encoding of text for you).
 
Thanks Steve, I will try that.
Nov 05 2023
Sergey <kornburn yandex.ru> writes:
On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote:
 I've ported a small script from C to D. The original C version 
 takes roughly 6.5 minutes to parse a 12G file while the port 
 originally took about 48 minutes.
In my experience I/O in D is quite slow, but you can try to improve it: use std.outbuffer instead of writeln, and flush the result only at the end. Also check this article, which shows how manual buffers in D can speed up file processing significantly: https://tech.nextroll.com/blog/data/2014/11/17/d-is-for-data-science.html
Nov 02 2023
confuzzled <con fuzzled.com> writes:
Good morning,

First, thanks to you, Steve, and Julian for responding to my inquiry.

On 11/3/23 4:59 AM, Sergey wrote:
 On Thursday, 2 November 2023 at 15:46:23 UTC, confuzzled wrote:
 I've ported a small script from C to D. The original C version takes 
 roughly 6.5 minutes to parse a 12G file while the port originally took 
 about 48 minutes.
 In my experience I/O in D is quite slow. But you can try to improve it: Try to use std.outbuffer instead of writeln. And flush the result only in the end.
Unless I did it incorrectly, this did nothing for me. My understanding is that I should first prepare an OutBuffer to which I write all my output. Once complete, I then write the OutBuffer to file, which still requires the use of writeln, albeit not as often. First I tried buffering the entire thing, but that turned out to be a big mistake. Next I tried writing and clearing the buffer every 100_000 records (about 3000 writeln calls). That was not as bad as the first attempt but significantly worse than what I obtained with the fopen/fprintf combo. I even tried writing the buffer to disk with fprintf but jumped ship because it took far longer than fopen/fprintf. I can't say how much longer because I terminated execution at 14 minutes.
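
For comparison, a sketch of the batching idea described above in which the OutBuffer contents go to disk with File.rawWrite instead of writeln or fprintf, so the already-formatted bytes are not pushed through a second formatting call. Whether this helps in the program discussed here is untested; `writeBatched`, the record layout, and the batch size are hypothetical.

```d
import std.outbuffer : OutBuffer;
import std.stdio : File;

/// Writes `count` records, emptying the OutBuffer to disk every `batch`
/// records. All names are hypothetical; `batch` must be greater than zero.
void writeBatched(string path, uint count, uint batch)
{
    auto outputFile = File(path, "wb");
    auto buf = new OutBuffer();
    foreach (i; 0 .. count)
    {
        buf.writefln("%u,%u", i, i + 1);
        if ((i + 1) % batch == 0)
        {
            // Hand the bytes to the file as-is.
            outputFile.rawWrite(buf.toBytes());
            // Reset the buffer so the same allocation is reused.
            buf.clear();
        }
    }
    // Write whatever is left from the final partial batch.
    auto rest = buf.toBytes();
    if (rest.length)
        outputFile.rawWrite(rest);
}
```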
 Also check this article. It is showing how manual buffers in D could 
 speed up the processing of files significantly: 
 https://tech.nextroll.com/blog/data/2014/11/17/d-is-for-data-science.html
 
 
The link above was quite helpful. Thanks. I am a bit slow on the uptake so it took a while to figure out how to apply the idea to my own use case. However, once I figured it out, the result was 2 minutes faster than the original C implementation and 3 minutes faster than the fopen/fprintf port. Whether it did anything for the writeln implementation or not, I don't know. I wasn't willing to wait 45+ minutes for something that can feasibly be done in 6 minutes, so I gave up at 12. I haven't played with std.string.representation as suggested by Julian yet, but I plan to. Thanks again. --Confuzzled
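
For readers who land on this thread: one possible shape of a manual read buffer in D, not necessarily what the article or the program above does. It reads the file in fixed-size chunks with File.byChunk, which reuses a single buffer, and scans the bytes directly. `countLines` and the 1 MiB chunk size are hypothetical choices.

```d
import std.stdio : File;

/// Counts newline bytes in `path`, reading 1 MiB at a time into a buffer
/// that byChunk reuses between iterations. `countLines` is hypothetical.
size_t countLines(string path)
{
    auto inputFile = File(path, "rb");
    size_t lines = 0;
    foreach (ubyte[] chunk; inputFile.byChunk(1 << 20))
    {
        // Scan the raw bytes; no per-line string is allocated.
        foreach (b; chunk)
        {
            if (b == '\n')
                ++lines;
        }
    }
    return lines;
}
```

A real parser would also have to handle records that straddle a chunk boundary, which this sketch sidesteps by only counting newlines.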
Nov 05 2023