digitalmars.D.bugs - [Issue 11810] New: std.stdio.byLine/readln performance is very bad

d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11810

           Summary: std.stdio.byLine/readln performance is very bad
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: peter.alexander.au gmail.com



04:34:29 PST ---
std.stdio.readln (and hence byLine) uses repeated calls to fgetc() to find the
newline characters. This is a very inefficient way to read files (lots of
per-byte overhead).

I have a version of byLine that reads the file in 4 KB chunks and then searches
each chunk for newlines. It is 6 times faster than byLine on my machine on a
10 MB file (OS X 10.8.5, x64 2 GHz MacBook).
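
For readers who want to see the idea, here is a minimal sketch of chunked line
reading. It is not the version measured above and not Phobos code: the name
eachLineChunked, the delegate-based interface and the default chunk size are
assumptions made for illustration. It splits on '\n' only (no "\r\n" handling)
and, like byLine, passes a slice that is only valid during the callback.

```d
import std.stdio : File;

/// Hypothetical helper (not part of Phobos): reads `f` in fixed-size chunks
/// and calls `sink` once per line, with the '\n' terminator stripped.
void eachLineChunked(File f, scope void delegate(const(char)[] line) sink,
                     size_t chunkSize = 4096)
{
    auto buf = new char[chunkSize];
    char[] pending;                  // partial line carried across chunks

    while (true)
    {
        auto chunk = f.rawRead(buf); // slice of `buf` that was actually filled
        if (chunk.length == 0)
            break;                   // end of file

        size_t start = 0;
        foreach (i, c; chunk)
        {
            if (c != '\n')
                continue;
            if (pending.length)
            {
                // The line started in an earlier chunk: finish it.
                pending ~= chunk[start .. i];
                sink(pending);
                pending.length = 0;
            }
            else
                sink(chunk[start .. i]); // line lies entirely inside this chunk
            start = i + 1;
        }
        pending ~= chunk[start .. $];    // keep the unterminated tail
    }
    if (pending.length)
        sink(pending);               // last line without a trailing newline
}

unittest
{
    auto f = File.tmpfile();
    f.write("ab\n\ncd");
    f.rewind();

    string[] lines;
    // A chunk size of 2 forces lines to straddle chunk boundaries.
    eachLineChunked(f, (line) { lines ~= line.idup; }, 2);
    assert(lines == ["ab", "", "cd"]);
}
```

The only extra cost compared with the fgetc() approach is the chunk buffer plus
the carry-over buffer for lines that cross a chunk boundary.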

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 24 2013
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11810


Dejan Lekic <dejan.lekic gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dejan.lekic gmail.com



---
It has been discussed on IRC hundreds of times and we all agreed that if a
developer wants performance (s)he would read page-size chunks. That is why we
have byChunk(size_t) in std.stdio, I believe. :)
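
As a rough illustration of the byChunk approach, the sketch below scans a file
in page-sized chunks. The helper name countNewlines and the 4096-byte default
are assumptions chosen for the example; only File.byChunk itself comes from
std.stdio.

```d
import std.stdio : File;

/// Hypothetical helper: counts '\n' bytes by scanning fixed-size chunks.
size_t countNewlines(File f, size_t chunkSize = 4096)
{
    size_t n;
    // byChunk reuses one internal buffer, so each `chunk` is only valid
    // for the current iteration.
    foreach (ubyte[] chunk; f.byChunk(chunkSize))
        foreach (b; chunk)
            if (b == '\n')
                ++n;
    return n;
}

unittest
{
    auto f = File.tmpfile();
    f.write("a\nb\nc");
    f.rewind();
    assert(countNewlines(f, 2) == 2);
}
```

Note that this only counts line terminators. Handing out whole lines requires
stitching together pieces that straddle chunk boundaries, which byChunk does
not do for you.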

Dec 24 2013
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11810




05:27:14 PST ---

> It has been discussed on IRC hundreds of times and we all agreed that if a
> developer wants performance (s)he would read page-size chunks. That is why we
> have byChunk(size_t) in std.stdio, I believe. :)

OK, but:

1. It's non-trivial to implement byLine on top of byChunk.
2. Why would you want byLine to be slow?

I'm not seeing the advantage of keeping byLine as it is. Fixing it doesn't
change the API and has no downsides other than requiring a bit of extra memory
for the buffer.
Dec 24 2013
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11810


bearophile_hugs eml.cc changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bearophile_hugs eml.cc




> It has been discussed on IRC hundreds of times and we all agreed that if a
> developer wants performance (s)he would read page-size chunks. That is why we
> have byChunk(size_t) in std.stdio, I believe. :)

This is not acceptable. byLine is a very commonly used function (far more than
byChunk in script-like D programs) and it should be sufficiently fast.
Dec 24 2013
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11810


Artem Tarasov <lomereiter gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lomereiter gmail.com



PST ---
+1

There's also this implementation:
http://permalink.gmane.org/gmane.comp.lang.d.general/117750

Dec 24 2013
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11810




Created an attachment (id=1305)
byLineFast with small changes

Dec 24 2013
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11810




The whole point of byLine is to be a convenient API for user code to read lines
from a file. It should not be constrained to using fgetc() just because we
can't predict line length in advance. It should be built on top of a buffering
mechanism (maybe byChunk) so that it offers good performance for a very
commonly used operation. I strongly encourage Peter Alexander to submit the
improved byLine implementation to Phobos.

Dec 26 2013
d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11810


Nick Treleaven <ntrel-public yahoo.co.uk> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ntrel-public yahoo.co.uk



08:55:17 PST ---

> Created an attachment (id=1305)
> byLineFast with small changes

BTW, I'm working on porting this to std.stdio.byLine; I'll submit a PR when
finished.
Feb 28 2014