digitalmars.D - A file reading benchmark

bearophile (50/50) Feb 17 2012 A tiny little file lines reading benchmark I've just found on Reddit:

bearophile (11/12) Feb 17 2012 Sorry, I have forgotten the version of the D code with GC disabling:
Walter Bright (3/4) Feb 17 2012 If you're going to benchmark reading a file line by line, that's what yo...

bearophile (8/12) Feb 17 2012 I have used the benchmark presented on Reddit.

Walter Bright (6/7) Feb 17 2012 I have no issue with that, just with your conclusion that it's file read...
Robert Jacques (3/5) Feb 17 2012 Whether or not deque vs vector was the primary performance problem, post...

Robert Jacques (3/11) Feb 17 2012 Bearophile, when comparing a deque to a classic vector, of course the de...
Andrei Alexandrescu (30/40) Feb 23 2012 [snip]

Martin Nowak (5/13) Feb 23 2012 Nice, I just got shocked yesterday by seeing that we call fgetc for ever...

Andrei Alexandrescu (5/19) Feb 23 2012 Yah, feel free to work opportunistically on this if you find the time.

Martin Nowak (4/23) Feb 23 2012 Could we get rid of libc file streams or do we need to share the locks

H. S. Teoh (9/13) Feb 23 2012 [...]

bearophile <bearophileHUGS lycos.com> writes:

A tiny little file lines reading benchmark I've just found on Reddit:
http://www.reddit.com/r/programming/comments/pub98/a_benchmark_for_reading_flat_files_into_memory/

http://steve.80cols.com/reading_flat_files_into_memory_benchmark.html

The Ruby code that generates slowly the test data:
https://raw.github.com/lorca/flat_file_benchmark/master/gen_data.rb
But for my timings I have used only about a 40% of that file, the first
1_965_800 lines, because I have less memory.

My Python-Psyco version runs in 2.46 seconds, the D version in 4.65 seconds
(the D version runs in 13.20 seconds if I don't disable the GC).

From many other benchmarks I've seen that file reading line-by-line is slow in
D.

-------------------------
My D code:

import std.stdio, std.string, std.array;

void main(in string[] args) {
    Appender!(string[][]) rows;
    foreach (line; File(args[1]).byLine())
        rows.put(line.idup.split("\t"));
    writeln(rows.data[1].join(","));
}

-------------------------
My Python 2.6 code:

from sys import argv
from collections import deque
import gc
import psyco

def main():
    gc.disable()
    rows = deque()
    for line in open(argv[1]):
        rows.append(line[:-1].split("\t"))
    print ",".join(rows[1])

psyco.full()
main()

-------------------------
The test data generator in Ruby:

user_id=1
for user_id in (1..10000)
 payments = (rand * 1000).to_i
 for user_payment_id in (1..payments)
   payment_id = user_id.to_s + user_payment_id.to_s
   payment_amount = "%.2f" % (rand * 30);
   is_card_present = "N"
   created_at = (rand * 10000000).to_i
   if payment_id.to_i % 3 == 0
    is_card_present = "Y"
   end
   puts [user_id, payment_id, payment_amount, is_card_present,
created_at].join("\t")
 end
end

-------------------------

Bye,
bearophile

Feb 17 2012

bearophile <bearophileHUGS lycos.com> writes:

 (the D version runs in 13.20 seconds if I don't disable the GC).

Sorry, I have forgotten the version of the D code with GC disabling:


import std.stdio, std.string, std.array, core.memory;

void main(in string[] args) {
    GC.disable();
    Appender!(string[][]) rows;
    foreach (line; File(args[1]).byLine())
        rows.put(line.idup.split("\t"));
    writeln(rows.data[1].join(","));
}

Bye,
bearophile

Feb 17 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 2/17/2012 5:44 PM, bearophile wrote:
 A tiny little file lines reading benchmark I've just found on Reddit:

If you're going to benchmark reading a file line by line, that's what your code 
should do. But that isn't what it does, it conflates it with appending to an
array.

Feb 17 2012

bearophile <bearophileHUGS lycos.com> writes:

Walter Bright:

 If you're going to benchmark reading a file line by line, that's what your
code 
 should do. But that isn't what it does, it conflates it with appending to an
array.

I have used the benchmark presented on Reddit.

I understand that a read-only benchmark is more useful for D developers.
Removing the storing of lines the code is slower than equivalent Python code
still.

----------------

Bearophile, when comparing a deque to a classic vector, of course the deque is
going to win. This has nothing to do with D, and everything to do with writing
a good algorithm.<

In this specific case you are wrong: in Python if I replace the deque with a
list (that is a dynamic array), the runtime grows only by less than 0.2
seconds. I know this benchmark is about Phobos not about D.


 Also, Appender has known performance problems with large appends (Issue 5813
http://d.puremagic.com/issues/show_bug.cgi?id=5813 , I'm currently folding in
Vladimir's suggestions and generating a pull request)<

OK. We'll see. But the File.byLine() is slower than file by line reading,
regardless appending to deques of arrays or appenders.

Bye,
bearophile

Feb 17 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 2/17/2012 7:25 PM, bearophile wrote:
 I have used the benchmark presented on Reddit.

I have no issue with that, just with your conclusion that it's file reading 
that's slow. It's not supported by your data.

Benchmarking is not so easy, I've very often seen benchmarks purporting to 
measure X when they were actually measuring Y.

The scientific jargon for this is "isolating the variables".

Feb 17 2012

"Robert Jacques" <sandford jhu.edu> writes:

On Fri, 17 Feb 2012 21:25:05 -0600, bearophile <bearophileHUGS lycos.com> wrote:
[snip]
 Bearophile, when comparing a deque to a classic vector, of course the deque is
going to win. This has nothing to do with D, and everything to do with writing
a good algorithm.<

 In this specific case you are wrong: in Python if I replace the deque with a
list (that is a dynamic array), the runtime grows only by less than 0.2
seconds. I know this benchmark is about Phobos not about D.

Whether or not deque vs vector was the primary performance problem, posting an
'apples to oranges' performance benchmark causes people to a) dismiss the
thread and b) lower their opinion of the poster.

Feb 17 2012

"Robert Jacques" <sandford jhu.edu> writes:

On Fri, 17 Feb 2012 19:44:04 -0600, bearophile <bearophileHUGS lycos.com> wrote:
 A tiny little file lines reading benchmark I've just found on Reddit:
 http://www.reddit.com/r/programming/comments/pub98/a_benchmark_for_reading_flat_files_into_memory/

 http://steve.80cols.com/reading_flat_files_into_memory_benchmark.html

 The Ruby code that generates slowly the test data:
 https://raw.github.com/lorca/flat_file_benchmark/master/gen_data.rb
 But for my timings I have used only about a 40% of that file, the first
1_965_800 lines, because I have less memory.

 My Python-Psyco version runs in 2.46 seconds, the D version in 4.65 seconds
(the D version runs in 13.20 seconds if I don't disable the GC).

 From many other benchmarks I've seen that file reading line-by-line is slow in
D.

Bearophile, when comparing a deque to a classic vector, of course the deque is
going to win. This has nothing to do with D, and everything to do with writing
a good algorithm.

Also, Appender has known performance problems with large appends (Issue 5813
http://d.puremagic.com/issues/show_bug.cgi?id=5813 , I'm currently folding in
Vladimir's suggestions and generating a pull request)

Feb 17 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/17/12 7:44 PM, bearophile wrote:
 A tiny little file lines reading benchmark I've just found on Reddit:
 http://www.reddit.com/r/programming/comments/pub98/a_benchmark_for_reading_flat_files_into_memory/

 http://steve.80cols.com/reading_flat_files_into_memory_benchmark.html

 The Ruby code that generates slowly the test data:
 https://raw.github.com/lorca/flat_file_benchmark/master/gen_data.rb
 But for my timings I have used only about a 40% of that file, the first
1_965_800 lines, because I have less memory.

 My Python-Psyco version runs in 2.46 seconds, the D version in 4.65 seconds
(the D version runs in 13.20 seconds if I don't disable the GC).

  From many other benchmarks I've seen that file reading line-by-line is slow
in D.

 -------------------------
 My D code:

[snip]

The thread in D.announce prompted me to go back to this, and I've run a 
simple test that isolates file reads from everything else. After 
generating the CSV data as described above, I ran this Python code:

import sys

rows = []
f = open(sys.argv[1])
for line in f:
     if len(line) > 10000: rows.append(line[:-1].split("\t"))

and this D code:

import std.stdio, std.string, std.array;

void main(in string[] args) {
     Appender!(string[][]) rows;
     auto f = File(args[1]);
     foreach (line; f.byLine()) {
         if (line.length > 10000) rows.put(line.idup.split("\t"));
     }
}

Both programs end up appending nothing because 10000 is larger than any 
line length.

On my machine (Mac OSX Lion), the Python code clocks around 1.2 seconds 
and the D code at a whopping 9.3 seconds. I looked around where the 
problem lies and sure enough the issue was with a slow loop in the 
generic I/O implementation of readln. The commit 
https://github.com/D-Programming-Language/phobos/commit/94b21d38d16e075d7c44
53015eb1113854424d0 
brings the speed of the test to 2.1 seconds. We could and should reduce 
that further with taking buffering in our own hands, but for now this is 
a good low-hanging fruit to pick.


Andrei

Feb 23 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

 On my machine (Mac OSX Lion), the Python code clocks around 1.2 seconds  
 and the D code at a whopping 9.3 seconds. I looked around where the  
 problem lies and sure enough the issue was with a slow loop in the  
 generic I/O implementation of readln. The commit  
 https://github.com/D-Programming-Language/phobos/commit/94b21d38d16e075d7c44
53015eb1113854424d0  
 brings the speed of the test to 2.1 seconds. We could and should reduce  
 that further with taking buffering in our own hands, but for now this is  
 a good low-hanging fruit to pick.

Nice, I just got shocked yesterday by seeing that we call fgetc for every  
char,
those are usually macros and as we already maintain the per system  
*_unlocked
functions we might probably use the macro expansions.

Feb 23 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/23/12 3:40 PM, Martin Nowak wrote:
 On my machine (Mac OSX Lion), the Python code clocks around 1.2
 seconds and the D code at a whopping 9.3 seconds. I looked around
 where the problem lies and sure enough the issue was with a slow loop
 in the generic I/O implementation of readln. The commit
 https://github.com/D-Programming-Language/phobos/commit/94b21d38d16e075d7c44b53015eb1113854424d0
 brings the speed of the test to 2.1 seconds. We could and should
 reduce that further with taking buffering in our own hands, but for
 now this is a good low-hanging fruit to pick.

 Nice, I just got shocked yesterday by seeing that we call fgetc for
 every char,
 those are usually macros and as we already maintain the per system
 *_unlocked
 functions we might probably use the macro expansions.

Yah, feel free to work opportunistically on this if you find the time. 
However, I think long-term we want to give byLine() the freedom of doing 
its own buffering.

Andrei

Feb 23 2012

"Martin Nowak" <dawg dawgfoto.de> writes:

On Fri, 24 Feb 2012 00:24:16 +0100, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 2/23/12 3:40 PM, Martin Nowak wrote:
 On my machine (Mac OSX Lion), the Python code clocks around 1.2
 seconds and the D code at a whopping 9.3 seconds. I looked around
 where the problem lies and sure enough the issue was with a slow loop
 in the generic I/O implementation of readln. The commit
 https://github.com/D-Programming-Language/phobos/commit/94b21d38d16e075d7c44b53015eb1113854424d0
 brings the speed of the test to 2.1 seconds. We could and should
 reduce that further with taking buffering in our own hands, but for
 now this is a good low-hanging fruit to pick.

 Nice, I just got shocked yesterday by seeing that we call fgetc for
 every char,
 those are usually macros and as we already maintain the per system
 *_unlocked
 functions we might probably use the macro expansions.

 Yah, feel free to work opportunistically on this if you find the time.  
 However, I think long-term we want to give byLine() the freedom of doing  
 its own buffering.

 Andrei

Could we get rid of libc file streams or do we need to share the locks  
with C?

Feb 23 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

 On Fri, 24 Feb 2012 00:24:16 +0100, Andrei Alexandrescu

[...]
Yah, feel free to work opportunistically on this if you find the
time. However, I think long-term we want to give byLine() the freedom
of doing its own buffering.


[...]

And while we're working with byLine(), it would be nice if this issue
was also looked at:

http://d.puremagic.com/issues/show_bug.cgi?id=7374

:-)


T

-- 
What do you mean the Internet isn't filled with subliminal messages? What about
all those buttons marked "submit"??

Feb 23 2012

D Programming

C/C++ Programming

Other

digitalmars.D - A file reading benchmark