
digitalmars.D - [RFC] CSV parser

reply Jesse Phillips <jessekphillips+d gmail.com> writes:
I have implemented an input range based CSV parser that works on text 
input[1]. I combined my original implementation with some details of 
David's implementation[2]. It is not ready for formal review as I need to 
update and polish documentation and probably consolidate unit tests.

It provides a very simple interface which can either iterate over all 
elements individually or each record can be stored in a struct. The unit 
tests and examples[3] do a good job showing the interface, but here is 
just one (taken from unit test) using a struct and header:

    string str = "a,b,c\nHello,65,63.63\nWorld,123,3673.562";
    struct Layout
    {
        int value;
        double other;
        string name;
    }

    auto records = csvText!Layout(str, ["b","c","a"]);

    Layout[2] ans;
    ans[0].name = "Hello";
    ans[0].value = 65;
    ans[0].other = 63.63;
    ans[1].name = "World";
    ans[1].value = 123;
    ans[1].other = 3673.562;

    int count;
    foreach (record; records)
    {
        assert(ans[count].name == record.name);
        assert(ans[count].value == record.value);
        assert(ans[count].other == record.other);
        count++;
    }
    assert(count == 2);

The main implementation is in the function csvNextToken. I'm thinking it 
might be useful to make this function public, as it would allow writing a 
custom parser for, or recovering from, malformed data.

To be memory efficient, an appender is reused for each iteration. 
However, the default behavior makes a copy of the data. To prevent the 
copy from being made, just provide the type as char[]:

    string str = `one,two,"three ""quoted""","",` ~
        "\"five\nnew line\n\"\nsix";
    auto records = csvText!(char[])(str);
    
    foreach(record; records)
    {
        foreach(cell; record)
        {
            writeln(cell);
        }
    }

If your structure stores char[] instead of string, you will also observe 
the overwriting behavior; should this be fixed?
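
The copy-vs-alias distinction above can be sketched with just Phobos's 
Appender, independent of the csv module (the variable names here are 
illustrative only):

```d
import std.array : appender;

void main()
{
    auto buf = appender!(char[])();
    char[][] slices;  // hold slices of the reused buffer
    string[] copies;  // hold immutable copies

    foreach (word; ["one", "two"])
    {
        buf.clear();             // reuse the same storage each pass
        buf.put(word);
        slices ~= buf.data;      // aliases the appender's storage
        copies ~= buf.data.idup; // independent snapshot
    }

    // The copies are stable; the aliased slices may all reflect
    // whatever was written last, since they share one buffer.
    assert(copies == ["one", "two"]);
}
```

Storing string forces the idup-style snapshot; storing char[] gives you 
the aliased slices, which is faster but surprising.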

So feel free to suggest names, implementation corrections, or 
documentation improvements. Or give a thumbs up. The more interest there 
is, the sooner I'll get this done :)

1. https://github.com/he-the-great/JPDLibs/blob/csv/csv/csv.d
2. http://lists.puremagic.com/pipermail/phobos/2011-January/004300.html
3. https://github.com/he-the-great/JPDLibs/tree/csv/examples
Apr 04 2011
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Tue, 05 Apr 2011 01:44:34 -0400, Jesse Phillips  
<jessekphillips+d gmail.com> wrote:
 I have implemented an input range based CSV parser that works on text
 input[1]. I combined my original implementation with some details of
 David's implementation[2]. It is not ready for formal review as I need to
 update and polish documentation and probably consolidate unit tests.

* You should support input ranges. It's fine to detect slicing and 
optimize for it, but you should support simple input ranges as well.

* I'd think being able to retrieve the headings from the csv would be a 
good [optional] feature.

* Exposing the tokenizer would be useful.

* Regarding buffering, it's okay for the tokenizer to expose buffering in 
its API (and users should be able to supply their own buffers), but I 
don't think an unbuffered version of csvText itself is correct; 
csvByRecord or csvText!(T).byRecord would be more appropriate. And 
anyway, since you're only using strings, why is there any buffering going 
on at all? string values should simply be sliced, not buffered. Buffering 
should only come into play with input ranges.

* There should be a way to specify other separators; I've started using 
tab-separated files, as ','s show up in a lot of data.

* Any thought of parsing a file into a tuple of arrays? Writing csv?
Apr 05 2011
next sibling parent Jesse Phillips <jessekphillips+D gmail.com> writes:
Robert Jacques Wrote:

 * You should support input ranges. It's fine to detect slicing and 
 optimize for it, but you should support simple input ranges as well.

This implementation only operates with input ranges of text. I have 
another implementation which works on a slice-able forward range of 
anything. I stopped development on it because it didn't support input 
ranges, and Phobos must have that.

https://github.com/he-the-great/JPDLibs/tree/csvoptimize
 * I'd think being able to retrieve the headings from the csv would be a  
 good [optional] feature.

Do you think it would be good for the interface to use an associative 
array? Maybe the heading could be the first thing to get from a 
RecordList:

    auto records = csvText(str);
    auto heading = records.heading;

or should it be passed in as an empty array:

    string[] heading;
    auto records = csvText(str, heading);
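
Independent of which signature wins out, the underlying idea of reading 
the heading row once and mapping column names to indices can be sketched 
with plain Phobos (the naive split here is only a stand-in for the real 
parser):

```d
import std.array : split;

void main()
{
    string csv = "a,b,c\nHello,65,63.63";
    auto lines = csv.split("\n");

    // read the heading row once, then map column name -> index
    auto heading = lines[0].split(",");   // ["a", "b", "c"]
    size_t[string] col;
    foreach (i, name; heading)
        col[name] = i;

    auto row = lines[1].split(",");
    assert(row[col["b"]] == "65");
}
```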
 * Exposing the tokenizer would be useful.

Good. I'm thinking I will use an output range of type char[] instead of Appender.
 * Regarding buffering, it's okay for the tokenizer to expose buffering in  
 it's API (and users should be able to supply their own buffers), but I  
 don't think an unbuffered version of csvText itself is correct; 

As I'm thinking of using an output range, I believe it has the job of 
allowing a user-specified buffer or not. I'm not sure whether the 
interface for csvText or RecordList should allow for custom buffers.
 csvByRecord or csvText!(T).byRecord would be more appropriate. 

You always iterate CSV by record; what other option is there? If you 
desire an array for the record:

    auto records = csvText(str);
    foreach(record; records)
        auto arr = array(record); // ...
 And  
 anyways, since you're only using strings, why is there any buffering going  
 on at all? string values should simply be sliced, not buffered. Buffering  
 should only come into play with input ranges.

Input ranges do not provide slicing. On top of that, you _must_ modify 
the returned data at times. So you can do buffering and slicing together, 
which my other implementation will do, but as I said, it won't support 
input ranges.
 * There should be a way to specify other separators; I've started using  
 tab separated files as ','s show up in a lot of data.

You can use a custom "comma" and "quote"; the record break is not 
modifiable because it doesn't fit into a single character. I can make 
this available through csvText and am not exactly sure why I didn't.
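
For illustration, here is what a tab-separated variant amounts to, using 
a naive split in place of the real tokenizer (which must still honor 
quoting):

```d
import std.array : split;

void main()
{
    // the same record/field structure, with '\t' as the "comma"
    string tsv = "name\tvalue\nHello\t65\nWorld\t123";
    foreach (line; tsv.split("\n"))
    {
        auto fields = line.split("\t");
        // fields[0] is the name column, fields[1] the value column
    }
    assert(tsv.split("\n")[1].split("\t") == ["Hello", "65"]);
}
```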
 * Any thought of parsing a file into a tuple of arrays? Writing csv?

A static CSV parser? I would hope CTFE would allow for using this 
parser, at some point.

For writing I made a quick and dirty function to write structures to a 
file, but I have not considered adding such functionality to the 
library. Thank you for the comment.

    void writeStruct(string file, Layout data) {
        string[] elements;

        foreach(i, U; FieldTypeTuple!Layout) {
            elements ~= to!string(data.tupleof[i]);
        }

        string record;
        foreach(e; elements) {
            if(find(e, `"`, ",", "\n", "\r")[1] == 0)
                record ~= e ~ ",";
            else
                record ~= `"` ~ replace(e, `"`, `""`) ~ `",`;
        }

        std.file.append(file, record[0..$-1] ~ "\n");
    }

https://gist.github.com/805392
Apr 05 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 04/05/2011 05:32 PM, Robert Jacques wrote:
 * There should be a way to specify other separators; I've started using tab
 separated files as ','s show up in a lot of data.

There are formats using control codes as separators, as well.

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
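
For instance, the ASCII record separator (0x1E) and unit separator 
(0x1F) control codes were designed for exactly this; a naive sketch:

```d
import std.array : split;

void main()
{
    // ASCII control codes: 0x1E (record separator), 0x1F (unit separator)
    string data = "a\x1fb\x1fc\x1e" ~ "1\x1f2\x1f3";
    auto records = data.split("\x1e");
    assert(records.length == 2);
    assert(records[0].split("\x1f") == ["a", "b", "c"]);
}
```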
Apr 05 2011
prev sibling next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Tue, 05 Apr 2011 12:45:59 -0400, Jesse Phillips  
<jessekphillips+D gmail.com> wrote:

 Robert Jacques Wrote:

 * You should support input ranges. It's fine to detect slicing and 
 optimize for it, but you should support simple input ranges as well.

 This implementation only operates with input ranges of text. I have 
 another implementation which works on a slice-able forward range of 
 anything. I stopped development on it because it didn't support input 
 ranges, and Phobos must have that.

 https://github.com/he-the-great/JPDLibs/tree/csvoptimize

The library should work with both and be efficient with both, i.e. 
detect at compile time whether you have a string or an input range, and 
act accordingly.
 * I'd think being able to retrieve the headings from the csv would be a
 good [optional] feature.

 Do you think it would be good for the interface to use an associative 
 array? Maybe the heading could be the first thing to get from a 
 RecordList:

     auto records = csvText(str);
     auto heading = records.heading;

 or should it be passed in as an empty array:

     string[] heading;
     auto records = csvText(str, heading);

Hmm... The latter makes it explicit that there should be headers, while 
with the former, you'd need to call heading before parsing the records, 
which could be a source of bugs. Then again, records.heading is more 
natural. Alternative syntax ideas:

    auto records = csvText(str, null);
    auto heading = records.heading;

or

    auto records = csvText!(csvOptions.headers)(str);
    auto heading = records.heading;
 * Exposing the tokenizer would be useful.

Good. I'm thinking I will use an output range of type char[] instead of Appender.

Output range? You mean as a buffer? I'd think Appender is the better choice for that internally, but you could allow the user to pass in a char[] buffer to use as the basis of the appender.
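
Phobos's appender can indeed be seeded with an existing array, which is 
one way to let the caller supply the storage; a minimal sketch:

```d
import std.array : appender;

void main()
{
    // seed the appender with a caller-supplied array; further puts
    // append to it, so the caller controls the initial allocation
    char[] seed;
    seed.reserve(64);
    auto app = appender(seed);
    app.put("hello");
    assert(app.data == "hello");
}
```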
 * Regarding buffering, it's okay for the tokenizer to expose buffering  
 in
 it's API (and users should be able to supply their own buffers), but I
 don't think an unbuffered version of csvText itself is correct;

 As I'm thinking of using an output range, I believe it has the job of 
 allowing a user-specified buffer or not. I'm not sure whether the 
 interface for csvText or RecordList should allow for custom buffers.

An output range is not a buffer. A buffer requires a clear method and a data retrieval method. All output ranges provide is put.
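
A minimal sketch of the distinction: a hypothetical CharBuffer type that 
satisfies the output-range interface while also offering clear and data 
retrieval (the type and its names are illustrative, not part of any 
proposal):

```d
import std.range : isOutputRange;

// An output range only guarantees put; a buffer additionally needs
// clear and data retrieval.
struct CharBuffer
{
    char[] store;
    void put(const(char)[] s) { store ~= s; }   // the output-range part
    void put(char c)          { store ~= c; }
    const(char)[] data()      { return store; } // retrieval
    void clear()              { store.length = 0; }
}

static assert(isOutputRange!(CharBuffer, char));

void main()
{
    CharBuffer buf;
    buf.put("cell");
    assert(buf.data == "cell");
    buf.clear();
    assert(buf.data.length == 0);
}
```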
 csvByRecord or csvText!(T).byRecord would be more appropriate.

 You always iterate CSV by record; what other option is there? If you 
 desire an array for the record:

     auto records = csvText(str);
     foreach(record; records)
         auto arr = array(record); // ...

Well, some of your example code included 
csvText!MyStruct(str,["my","headers"]); or possibly just 
csvText!MyStruct(str), which would provide either a forward range of 
MyStructs or an array of MyStructs. In either case, duplication of 
strings should occur where appropriate. Also, remember that records 
might not be manually processed; the user might forward them to map, 
filter, etc., and therefore they should be safe by default. A user 
should have to manually state that they want a lazy buffered version.
 And
 anyways, since you're only using strings, why is there any buffering  
 going
 on at all? string values should simply be sliced, not buffered.  
 Buffering
 should only come into play with input ranges.

 Input ranges do not provide slicing. On top of that, you _must_ modify 
 the returned data at times. So you can do buffering and slicing 
 together, which my other implementation will do, but as I said, it 
 won't support input ranges.

Why do you need to do any modification? Part of the advantage of csv is that it's just text: you don't have to deal with escape characters, etc.
 * There should be a way to specify other separators; I've started using
 tab separated files as ','s show up in a lot of data.

 You can use a custom "comma" and "quote"; the record break is not 
 modifiable because it doesn't fit into a single character. I can make 
 this available through csvText and am not exactly sure why I didn't.
 * Any thought of parsing a file into a tuple of arrays? Writing csv?

A static CSV parser? I would hope CTFE would allow for using this parser, at some point.

By tuple I meant the tuple(5,7) kind, not the TypeTuple!(int,char) kind. 
Basically,

    auto records = csvText!(string,real)(str);
    string[] names  = records._0;
    real[]   grades = records._1;

vs

    auto records = csvText!(Tuple!(string,"name",real,"grade"))(str);
    foreach(record; records) {
        writeln(record.name," got ",record.grade);
    }
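
The two shapes can be illustrated without the hypothetical csvText 
overloads, using std.typecons.Tuple directly (the data here is made up):

```d
import std.typecons : Tuple;

void main()
{
    // range/array of tuples: one Record per row
    alias Record = Tuple!(string, "name", real, "grade");
    Record[] records = [Record("Alice", 90.0L), Record("Bob", 85.5L)];
    assert(records[1].name == "Bob");

    // tuple of arrays: one column per field
    string[] names;
    real[] grades;
    foreach (r; records)
    {
        names  ~= r.name;
        grades ~= r.grade;
    }
    assert(names == ["Alice", "Bob"]);
}
```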
 For writing I made a quick and dirty one to write structures to a file.  
 But I have not considered adding such functionality to the library.  
 Thank you for the comment.

 void writeStruct(string file, Layout data) {
     string[] elements;

     foreach(i, U; FieldTypeTuple!Layout) {
         elements ~= to!string(data.tupleof[i]);
     }

     string record;
     foreach(e; elements) {
         if(find(e, `"`, ",", "\n", "\r")[1] == 0)
             record ~= e ~ ",";
         else
             record ~= `"` ~ replace(e, `"`, `""`) ~ `",`;
     }

     std.file.append(file, record[0..$-1] ~ "\n");
 }

 https://gist.github.com/805392

Apr 05 2011
prev sibling parent Jesse Phillips <jessekphillips+d gmail.com> writes:
On Wed, 06 Apr 2011 01:53:19 -0400, Robert Jacques wrote:

 The library should work with both and be efficient with both. i.e.
 detect at compile-time whether you have a string or an input range, and
 act accordingly.

I'm actually considering trying to benchmark my two implementations. However I really don't know a good setup for testing memory usage. At this time it is not a priority as specialization can be added without changing the interface or capabilities.
 auto records  = csvText(str);
 auto heading = records.heading;

 or shouldn't it be passed in as an empty array:

 string[] heading;
 auto records = csvText(str, heading);

 Hmm... The latter makes it explicit that there should be headers, while 
 with the former, you'd need to call heading before parsing the records, 
 which could be a source of bugs. Then again, records.heading is more 
 natural. Alternative syntax ideas:

     auto records = csvText(str, null);
     auto heading = records.heading;

 or

     auto records = csvText!(csvOptions.headers)(str);
     auto heading = records.heading;

I'm not fond of the csvOptions approach, but the combined explicit and ability to retrieve seems reasonable.
 * Regarding buffering, it's okay for the tokenizer to expose buffering
 in
 it's API (and users should be able to supply their own buffers), but I
 don't think an unbuffered version of csvText itself is correct;

 As I'm thinking of using an output range, I believe it has the job of 
 allowing a user-specified buffer or not. I'm not sure whether the 
 interface for csvText or RecordList should allow for custom buffers.

 An output range is not a buffer. A buffer requires a clear method and a 
 data retrieval method. All output ranges provide is put.

Except for the exceptions thrown, the tokenizer only needs an output 
range of char. csvText is buffered by using Appender internally; however, 
the default is to make a copy of the data in the buffer, which is safer.
 csvByRecord or csvText!(T).byRecord would be more appropriate.

 You always iterate CSV by record; what other option is there? If you 
 desire an array for the record:

     auto records = csvText(str);
     foreach(record; records)
         auto arr = array(record); // ...

 Well, some of your example code included 
 csvText!MyStruct(str,["my","headers"]); or possibly just 
 csvText!MyStruct(str), which would provide either a forward range of 
 MyStructs or an array of MyStructs. In either case, duplication of 
 strings should occur where appropriate. Also, remember that records 
 might not be manually processed; the user might forward them to map, 
 filter, etc., and therefore they should be safe by default. A user 
 should have to manually state that they want a lazy buffered version.

The behavior when using string is to do a duplication, but char[] will 
not. I believe this provides both safe and memory-efficient iteration, 
but it is confusing and probably doesn't need to be provided.
 Why do you need to do any modification? Part of the advantage of csv is
 that it's just text: you don't have to deal with escape characters, etc.

This is wrong. CSV does have escape characters, or more precisely an escaped quote.
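
Concretely, RFC 4180 represents a quote inside a quoted field as a 
doubled quote, so recovering the field's value requires rewriting the 
text rather than merely slicing the input. A minimal, illustrative 
unquoting helper (unquoteField is not part of the proposed module):

```d
import std.array : replace;

// A quoted CSV field stores an embedded quote as "", so producing the
// actual value cannot be done by returning a slice of the input alone.
string unquoteField(string field)
{
    if (field.length >= 2 && field[0] == '"' && field[$ - 1] == '"')
        return field[1 .. $ - 1].replace(`""`, `"`);
    return field;
}

void main()
{
    assert(unquoteField(`"three ""quoted"""`) == `three "quoted"`);
    assert(unquoteField("plain") == "plain");
}
```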
 By tuple I meant the tuple(5,7) kind, not the TypeTuple!(int,char) kind.
 Basically,
 
 auto records = csvText!(string,real)(str); string[] names   =
 records._0;
 real[]   grades  = records._1;
 
 vs
 
 auto records = csvText!(Tuple!(string,"name",real,"grade"))(str);
 foreach(record; records) {
 	writeln(record.name," got ",record.grade);
 }

I'm not sure the second example would work with the current 
implementation. But thinking on this, it might be better to wait for the 
DB API to be worked out; then this module could be used to provide such 
database queries.
Apr 06 2011