digitalmars.D.bugs - [Issue 4474] New: Safer stdin.byLine()

d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4474

           Summary: Safer stdin.byLine()
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: bearophile_hugs eml.cc



This relates to pages 16-17 of The D Programming Language, which explain
stdin.byLine() and the 'rather hard to find' bugs that can result from not
duplicating the input data.

When I write 20-line scripts in D, I really don't want to have to remember to
dup everything (in D1 code I sometimes end up dupping too much, to be on the
safe side). So I suggest a different API for line reading:

- stdin.byLineMutable() (or a similar name, longer than "byLine", that makes
it clear it doesn't copy): the current behaviour, which avoids a memory
allocation for each line read. This is faster but less safe.

- stdin.byLine(): allocates a new string for each line. This is safer, as in
Python (Python also uses heuristics to speed up this method as much as
possible, because this is often a very common and performance-critical
operation in scripts).

D's general design policy is that unsafe but faster things must be asked for
explicitly, and that the defaults must be less bug-prone. If I write a small D
script I can use byLine(), hoping to avoid some bugs. If profiling later shows
that it's too slow, I can replace byLine() with the other method and
optimize the code carefully, removing some heap allocations.

(An alternative design strategy is to keep just the byLine() method, but give
it an optional argument, like stdin.byLine(bool copy=true) or
stdin.byLine(bool COPY=true)(), that by default copies the line with a new
memory allocation.)
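
A minimal sketch of the compile-time flag variant described above, written as a
free function on top of the existing byLine(). The name lines and its copy
parameter are hypothetical and only illustrate the proposal; they are not part
of Phobos.

```d
import std.algorithm : map;
import std.stdio;

// Hypothetical wrapper (not a Phobos API): copying is the default,
// and the buffer-reusing behaviour has to be requested explicitly.
auto lines(bool copy = true)(File f) {
    static if (copy)
        return f.byLine().map!(l => l.idup); // fresh immutable string per line
    else
        return f.byLine();                   // char[] slice of a reused buffer
}

void main() {
    string[] kept;
    foreach (line; lines(stdin)) // safe default: each line is a string
        kept ~= line;            // storing it needs no further copy
    writeln(kept.length, " lines read");
}
```

Under this sketch, lines!false(stdin) would give the current allocation-free
behaviour once profiling shows the copies matter.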

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 16 2010
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4474


Andrei Alexandrescu <andrei metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrei metalanguage.com



08:00:52 PDT ---
byLine is safe.

Jul 17 2010
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4474




OK, I changed the title to "Better" instead of "Safer".

Jul 17 2010
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4474




This is a small test program (dmd v2.047):

import std.string, std.stdio;
void main() {
    int[string] aa;
    foreach (line; stdin.byLine())
        foreach (word; line.split())
            aa[word]++;
    foreach (word, freq; aa)
        writeln(freq, " ", word);
}


Running with itself as input data:
test.exe < test.d


Prints:
1 eln(fr
1 q, " ", wo
1     writeln
1 }
1 "
1 }
1 }
1 writeln
2     wri
1    wri
1  ", word);
))
1 , w
1 q, " ", word);

1 eln(fr
1 q, "
1 freq,
1 ",
1 eln(freq, "
1  writeln(fr
1 word);
1 writeln(freq,
1 fre
1 e


This shows that byLine() is bug-prone (unsafe).


While this program:

import std.string, std.stdio;
void main() {
    int[string] aa;
    foreach (line; stdin.byLine())
        foreach (word; line.split())
            aa[word.dup]++;
    foreach (word, freq; aa)
        writeln(freq, " ", word);
}


Prints a more correct output:

1 (word,
1 std.stdio;
1 int[string]
1 }
1 "
1 void
1 import
3 foreach
1 main()
1 aa)
1 line.split())
1 stdin.byLine())
1 (line;
1 freq;
1 (word;
1 ",
1 std.string,
1 word);
1 writeln(freq,
1 aa[word.dup]++;
1 aa;
1 {


It's easy to forget dupping/idupping.
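
A sketch of the same word count with the copy made explicit through idup, so
that each key is an immutable string that no longer aliases the line buffer.
The helper name countWords is hypothetical and exists only to keep the
counting separate from the I/O.

```d
import std.array : split;
import std.stdio;

// Hypothetical helper: counts whitespace-separated words in a range of lines.
int[string] countWords(Lines)(Lines lines) {
    int[string] freq;
    foreach (line; lines)
        foreach (word; line.split())
            freq[word.idup]++; // copy: the key must outlive the reused buffer
    return freq;
}

void main() {
    foreach (word, n; countWords(stdin.byLine()))
        writeln(n, " ", word);
}
```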

Jul 17 2010
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4474




11:06:02 PDT ---
That example is a manifestation of another bug:

http://d.puremagic.com/issues/show_bug.cgi?id=2954

Jul 17 2010
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4474




If you think this bug report is invalid and byLine() is safe (because the type
system is enough, since it can tell char[] and string apart), then you can
close this bug report.
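
A small sketch of the type-system argument mentioned above: byLine() yields
char[], and char[] does not convert implicitly to string, so keeping a line as
a string needs a visible idup. The variable names are illustrative only.

```d
import std.stdio;

// char[] is mutable, string is immutable(char)[]: there is no implicit
// conversion from the first to the second.
static assert(!is(char[] : string));

void main() {
    string[] kept;
    foreach (char[] line; stdin.byLine()) {
        // kept ~= line;   // rejected at compile time: char[] is not string
        kept ~= line.idup; // explicit copy makes the intent visible
    }
    writeln(kept.length, " lines kept");
}
```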

Jul 17 2010
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4474


bearophile_hugs eml.cc changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID



Bug closed because Andrei says byLine() is safe :-)

Jul 24 2010