www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Review of Jesse Phillips's CSV Parser

reply dsimcha <dsimcha yahoo.com> writes:
Formal review of Jesse Phillips's CSV parser module begins today and 
runs through November .  Following that, a vote will take place for one 
week.  Please post any comments about this library in this thread.

Docs:
http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html

Code:
https://github.com/he-the-great/phobos/tree/csv
Oct 28 2011
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Fri, 28 Oct 2011 16:18:27 +0300, dsimcha <dsimcha yahoo.com> wrote:

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
Documentation should have examples of basic usage at the top. It should be possible to figure out how to use a CSV parser within 10 seconds of looking at the docs. It looks like this is only a CSV parser, but not a CSV formatter? That's pretty weird. It'd be like std.xml or std.json only having reading functionality. -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Oct 28 2011
parent reply Jesse Phillips <jessekphillips+D gmail.com> writes:
Vladimir Panteleev Wrote:

 On Fri, 28 Oct 2011 16:18:27 +0300, dsimcha <dsimcha yahoo.com> wrote:
 
 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
Documentation should have examples of basic usage at the top. It should be possible to figure out how to use a CSV parser within 10 seconds of looking at the docs.
Sounds good. What kind of example should be first? Maybe something that just gives you an itterable: auto records = csvText(text);
 It looks like this is only a CSV parser, but not a CSV formatter? That's  
 pretty weird. It'd be like std.xml or std.json only having reading  
 functionality.
I have done quite a bit of CSV output and none of them have been the same. The only similarity is surrounding with quotes and escaping quotes. "\"" ~ text.replace("\"", "\"\"") ~ "\"" The problem is that how your data is stored and where it needs to go is different for each app. I don't see a reason to provide a output function where one must convert their data into a range of struct and specify an output range, when that conversion is the hard part. I would consider a function that does escaping/quoting only if the element needs it. But such a function couldn't work on an Input Range.
Oct 28 2011
parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Fri, 28 Oct 2011 19:41:35 +0300, Jesse Phillips  
<jessekphillips+D gmail.com> wrote:

 Vladimir Panteleev Wrote:

 On Fri, 28 Oct 2011 16:18:27 +0300, dsimcha <dsimcha yahoo.com> wrote:

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
Documentation should have examples of basic usage at the top. It should be possible to figure out how to use a CSV parser within 10 seconds of looking at the docs.
Sounds good. What kind of example should be first? Maybe something that just gives you an itterable: auto records = csvText(text);
That's not a very good example, since it doesn't show anything about how to use the result of the csvText call. If it's an iterable, the example should use it in a foreach loop. IMO the example should be a very simple realistic use case, for example printing the data in a user-friendly way: writefln("%s works as a %s and earns %d per year", ...);
 It looks like this is only a CSV parser, but not a CSV formatter? That's
 pretty weird. It'd be like std.xml or std.json only having reading
 functionality.
I have done quite a bit of CSV output and none of them have been the same. The only similarity is surrounding with quotes and escaping quotes. "\"" ~ text.replace("\"", "\"\"") ~ "\"" The problem is that how your data is stored and where it needs to go is different for each app. I don't see a reason to provide a output function where one must convert their data into a range of struct and specify an output range, when that conversion is the hard part. I would consider a function that does escaping/quoting only if the element needs it. But such a function couldn't work on an Input Range.
I think you're overcomplicating this. You already have a way of specifying the specific format of the CSV file (delimiters, quote characters) for the parser. Use the same idiom for the formatter. The formatter should accept data in the same format (datatypes) as what's emitted by the parser. If you use functional idioms (ranges), it ought to be possible to pipe it all together neatly. The specifics don't matter very much, but it should be possible to put some data (array of string arrays, array of structs or whatnot) into a .csv file in a single line of code. Surely this isn't hard? -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Oct 28 2011
prev sibling next sibling parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
dsimcha wrote:
 Formal review of Jesse Phillips's CSV parser module begins today and
 runs through November . Following that, a vote will take place for one
 week. Please post any comments about this library in this thread.

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html

 Code:
 https://github.com/he-the-great/phobos/tree/csv
Could csvText!Struct(...) be generalized to also support Tuples and (eventually) classes?
Oct 28 2011
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2011-10-28 15:25, Piotr Szturmaj wrote:
 dsimcha wrote:
 Formal review of Jesse Phillips's CSV parser module begins today and
 runs through November . Following that, a vote will take place for one
 week. Please post any comments about this library in this thread.

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html

 Code:
 https://github.com/he-the-great/phobos/tree/csv
Could csvText!Struct(...) be generalized to also support Tuples and (eventually) classes?
Sounds more like a serialization library with a CSV archive type to me. -- /Jacob Carlborg
Oct 28 2011
parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Jacob Carlborg wrote:
 On 2011-10-28 15:25, Piotr Szturmaj wrote:
 dsimcha wrote:
 Formal review of Jesse Phillips's CSV parser module begins today and
 runs through November . Following that, a vote will take place for one
 week. Please post any comments about this library in this thread.

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html

 Code:
 https://github.com/he-the-great/phobos/tree/csv
Could csvText!Struct(...) be generalized to also support Tuples and (eventually) classes?
Sounds more like a serialization library with a CSV archive type to me.
I suppose you misunderstood my intentions. Here's an example from the docs: string str = "Hello,65,63.63\nWorld,123,3673.562"; struct Layout { string name; int value; double other; } auto records = csvText!Layout(str); foreach(record; records) { writeln(record.name); writeln(record.value); writeln(record.other); } I want to occasionally write: alias Tuple!(string, int, double) Layout; // instead of struct It should be easy to extend, but it would be nice if we have had general construct in Phobos. So instead of is(Contents == struct) https://github.com/he-the-great/phobos/blob/csv/std/csv.d#L433 there should be isComposite!Contents and FieldTypeTuple should be generalized (if it isn't) to structs, classes and tuples. I don't know if Tuple as a struct doesn't return valid fields from FieldTypeTuple now.
Oct 28 2011
parent deadalnix <deadalnix gmail.com> writes:
Le 28/10/2011 16:09, Piotr Szturmaj a écrit :
 Jacob Carlborg wrote:
 On 2011-10-28 15:25, Piotr Szturmaj wrote:
 dsimcha wrote:
 Formal review of Jesse Phillips's CSV parser module begins today and
 runs through November . Following that, a vote will take place for one
 week. Please post any comments about this library in this thread.

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html

 Code:
 https://github.com/he-the-great/phobos/tree/csv
Could csvText!Struct(...) be generalized to also support Tuples and (eventually) classes?
Sounds more like a serialization library with a CSV archive type to me.
I suppose you misunderstood my intentions. Here's an example from the docs: string str = "Hello,65,63.63\nWorld,123,3673.562"; struct Layout { string name; int value; double other; } auto records = csvText!Layout(str); foreach(record; records) { writeln(record.name); writeln(record.value); writeln(record.other); } I want to occasionally write: alias Tuple!(string, int, double) Layout; // instead of struct It should be easy to extend, but it would be nice if we have had general construct in Phobos. So instead of is(Contents == struct) https://github.com/he-the-great/phobos/blob/csv/std/csv.d#L433 there should be isComposite!Contents and FieldTypeTuple should be generalized (if it isn't) to structs, classes and tuples. I don't know if Tuple as a struct doesn't return valid fields from FieldTypeTuple now.
That would be great !
Oct 28 2011
prev sibling parent Jesse Phillips <jessekphillips+D gmail.com> writes:
Piotr Szturmaj Wrote:

 dsimcha wrote:
 Formal review of Jesse Phillips's CSV parser module begins today and
 runs through November . Following that, a vote will take place for one
 week. Please post any comments about this library in this thread.

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html

 Code:
 https://github.com/he-the-great/phobos/tree/csv
Could csvText!Struct(...) be generalized to also support Tuples and (eventually) classes?
I restricted to struct since a class had to be new'd. I and then realized most likely it won't be the final destination. It is better just letting the user move the items over themselves. Tuples should just work since they are just struct. Will double check.
Oct 28 2011
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
On 10/28/2011 9:18 AM, dsimcha wrote:
 Formal review of Jesse Phillips's CSV parser module begins today and
 runs through November . Following that, a vote will take place for one
 week. Please post any comments about this library in this thread.

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html

 Code:
 https://github.com/he-the-great/phobos/tree/csv
I'll kick this off with my review of the documentation (I'll review the implementation later): "Header may be provided as first line in file" -> "A header may be provided as first line in file"? csvText: Shouldn't heading be a generic range of strings instead of a string[]? Why should the order of the heading provided to csvText matter? csvText should simply rearrange its results. In csvNextToken: "The expected use of this would be to create a parser. And may also be useful when handling errors within a CSV file." Please re-word this sentence so it's grammatically correct.
Oct 28 2011
next sibling parent Jesse Phillips <jessekphillips+D gmail.com> writes:
dsimcha Wrote:

 I'll kick this off with my review of the documentation (I'll review the 
 implementation later):
 
 "Header may be provided as first line in file" -> "A header may be 
 provided as first line in file"?
 
 csvText:  Shouldn't heading be a generic range of strings instead of a 
 string[]?
I should revisit that. It will still need to be random access and possible some others, but doesn't need to be an array.
 Why should the order of the heading provided to csvText matter?  csvText 
 should simply rearrange its results.
 
 In csvNextToken:  "The expected use of this would be to create a parser. 
 And may also be useful when handling errors within a CSV file."  Please 
 re-word this sentence so it's grammatically correct.
Oct 28 2011
prev sibling parent Jesse Phillips <jessekphillips+D gmail.com> writes:
Forgot to answer one.

dsimcha Wrote:

 Why should the order of the heading provided to csvText matter?  csvText 
 should simply rearrange its results.
Because I do not store anymore information then requested. I must keep pumping out values as I receive them, so order can not be rearranged. With a struct, you must match the header to the order of fields. This is because I can not match it by type and requiring that your field names match the file heading limits what can be in a heading (spaces, punctuation). Providing specialization for associative arrays also looks like a good option to resolve this.
Oct 28 2011
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2011-10-28 15:18, dsimcha wrote:
 Formal review of Jesse Phillips's CSV parser module begins today and
 runs through November . Following that, a vote will take place for one
 week. Please post any comments about this library in this thread.

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html

 Code:
 https://github.com/he-the-great/phobos/tree/csv
If a header is specified, is it possible to iterate over the content as an associative array, something like this: string str = "a,b,c\nHello,65,63.63\nWorld,123,3673.562"; auto records = csvText(str, ["b"]); foreach (header, value ; records) {} Or accessing a value by header: foreach (record ; records) { auto value = record["a"]; auto val = record.b; // perhaps using opDispatch as well } -- /Jacob Carlborg
Oct 28 2011
parent reply Jesse Phillips <jessekphillips+D gmail.com> writes:
Jacob Carlborg Wrote:

 If a header is specified, is it possible to iterate over the content as 
 an associative array, something like this:
 
 string str = "a,b,c\nHello,65,63.63\nWorld,123,3673.562";
 auto records = csvText(str, ["b"]);
 
 foreach (header, value ; records) {}
 
 Or accessing a value by header:
 
 foreach (record ; records)
 {
      auto value = record["a"];
      auto val = record.b; // perhaps using opDispatch as well
 }
 
 -- 
 /Jacob Carlborg
If I implemented it this way then the entire record would be read in when a header exists. auto records = csvText!(string[string])(str, ["b"]); Would that be acceptable?
Oct 28 2011
parent Jacob Carlborg <doob me.com> writes:
On 2011-10-28 19:07, Jesse Phillips wrote:
 Jacob Carlborg Wrote:

 If a header is specified, is it possible to iterate over the content as
 an associative array, something like this:

 string str = "a,b,c\nHello,65,63.63\nWorld,123,3673.562";
 auto records = csvText(str, ["b"]);

 foreach (header, value ; records) {}

 Or accessing a value by header:

 foreach (record ; records)
 {
       auto value = record["a"];
       auto val = record.b; // perhaps using opDispatch as well
 }

 --
 /Jacob Carlborg
If I implemented it this way then the entire record would be read in when a header exists. auto records = csvText!(string[string])(str, ["b"]); Would that be acceptable?
That would be acceptable to me, don't know what how other would see it. -- /Jacob Carlborg
Oct 28 2011
prev sibling next sibling parent reply Ary Manzana <ary esperanto.org.ar> writes:
On 10/28/11 10:18 AM, dsimcha wrote:
 Formal review of Jesse Phillips's CSV parser module begins today and
 runs through November . Following that, a vote will take place for one
 week. Please post any comments about this library in this thread.

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html

 Code:
 https://github.com/he-the-great/phobos/tree/csv
I like the idea of reading a CSV into a struct, or treating all fields as ints. But, in my experience, CSVs are never perfect... because humans fill them, and humans aren't perfect or don't know "Oh, of course I must fill only numbers in this column, otherwise programs will explode". So although it's a nice functionality, I think just returning everything as strings is more than enough.
Oct 28 2011
parent reply Jesse Phillips <jessekphillips+D gmail.com> writes:
Ary Manzana Wrote:

 I like the idea of reading a CSV into a struct, or treating all fields 
 as ints.
 
 But, in my experience, CSVs are never perfect... because humans fill 
 them, and humans aren't perfect or don't know "Oh, of course I must fill 
 only numbers in this column, otherwise programs will explode".
 
 So although it's a nice functionality, I think just returning everything 
 as strings is more than enough.
If you are dealing with incorrect data you must deal with it. If you expect everything to be an int, and it isn't what are you going to do? I should check how easy it is to recover from an error.
Oct 28 2011
parent reply Ary Manzana <ary esperanto.org.ar> writes:
On 10/28/11 1:14 PM, Jesse Phillips wrote:
 Ary Manzana Wrote:

 I like the idea of reading a CSV into a struct, or treating all fields
 as ints.

 But, in my experience, CSVs are never perfect... because humans fill
 them, and humans aren't perfect or don't know "Oh, of course I must fill
 only numbers in this column, otherwise programs will explode".

 So although it's a nice functionality, I think just returning everything
 as strings is more than enough.
If you are dealing with incorrect data you must deal with it. If you expect everything to be an int, and it isn't what are you going to do? I should check how easy it is to recover from an error.
You are right. I see that you throw ConvException, because when some to!... fails it throws that. But you don't catch it. So you'd get something like "Can't convert 'hello' to an int", but you loose the information of which row and column caused the problem. So maybe catching it and throwing a more detailed exception... Also, sometimes people fill out numbers with some comments, like "1 apple", so maybe a less strict parsing for numbers might be good (as an option). Just some comments :)
Oct 28 2011
parent Jesse Phillips <jessekphillips+D gmail.com> writes:
Ary Manzana Wrote:

 You are right.
 
 I see that you throw ConvException, because when some to!... fails it 
 throws that. But you don't catch it. So you'd get something like "Can't 
 convert 'hello' to an int", but you loose the information of which row 
 and column caused the problem. So maybe catching it and throwing a more 
 detailed exception...
I'll look into that.
 Also, sometimes people fill out numbers with some comments, like "1 
 apple", so maybe a less strict parsing for numbers might be good (as an 
 option).
Loose rules are hard to setup to make everyone happy. I think this use case should be left to the user. I just want to get the information out quickly. Adding a row count shouldn't be too bad.
Oct 28 2011
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/28/2011 6:18 AM, dsimcha wrote:
 Formal review of Jesse Phillips's CSV parser module begins today and runs
 through November . Following that, a vote will take place for one week. Please
 post any comments about this library in this thread.

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html

 Code:
 https://github.com/he-the-great/phobos/tree/csv
First off, mucho thanks to Jesse for writing this. It's an important module for Phobos. This is a review of the documentation: Include this link: http://en.wikipedia.org/wiki/Comma-separated_values so the programmer can learn more about CSV. "This parser will loosely follow the RFC-4180" How exactly will it differ from RFC-4180? "No exceptions are thrown due to incorrect CSV." What happens, then, when the CSV is incorrect? "csvText" Be more specific about what csvText() returns. I'm confused about the heading parameter to csvText(). "the Content of the input" What is that? There's no parameter named "Content". "RecordList" "Range which provides access to CSV Records and Fields." What is the Range? Do you mean RecordList is a Range? And what does "provides access" mean? If RecordList is a Range, why is it labeled a "List"? "Array of the heading contained in the file." Now I'm really confused as to what a heading is. I thought it was the first line? Now there's an array of them? No examples given for constructors. No documentation for front() or empty(). "Record" "Returned by a RecordList when Contents is a non-struct." RecordList is a struct, it does not return anything. No examples. "csvNextToken" Example?
Oct 28 2011
next sibling parent reply bls <bizprac orange.fr> writes:
On 10/28/2011 12:18 PM, Walter Bright wrote:
 Formal review of Jesse Phillips's CSV parser module
Walter, Asking for high precision documentation is fine. And since you face it ... What about having such a high precision for D2 language definition ? std.csv is pretty cool. No doubt, But shouldn't std.csv be based on std.lexer ? A+
Oct 29 2011
next sibling parent reply Jesse Phillips <jessekphillips+d gmail.com> writes:
On Sat, 29 Oct 2011 15:27:22 -0700, bls wrote:

 On 10/28/2011 12:18 PM, Walter Bright wrote:
 Formal review of Jesse Phillips's CSV parser module
Walter, Asking for high precision documentation is fine. And since you face it ... What about having such a high precision for D2 language definition ?
You have to start somewhere.
 std.csv is pretty cool. No doubt, But shouldn't std.csv be based on
 std.lexer ?
 
 A+
Well, std.lexer doesn't exist. And if it is reasonable to use one, then it could be done without changing the API. I think the database API stuff will have much more influence on what needs to be done with std.csv, but I think what is currently implemented must remain as is.
Oct 29 2011
parent reply bls <bizprac orange.fr> writes:
On 10/29/2011 03:56 PM, Jesse Phillips wrote:
 std.csv is pretty cool. No doubt, But shouldn't std.csv be based on
  std.lexer ?

  A+
Well, std.lexer doesn't exist. And if it is reasonable to use one, then it could be done without changing the API. I think the database API stuff will have much more influence on what needs to be done with std.csv, but I think what is currently implemented must remain as is.
There is somebody working on std.lexer (a generic lexer) (as far as I remember the project started as std.range stress test ) However, I have no problem to imagine that f.i. the Tokenizer will be part of the std.lexer module. :) Regarding std.database.. Please elaborate/ ATM I just can imagine that std.csv will be used for Import/Export.
Oct 29 2011
parent Jesse Phillips <jessekphillips+d gmail.com> writes:
On Sat, 29 Oct 2011 17:05:01 -0700, bls wrote:

 Regarding std.database.. Please elaborate/  ATM I just can imagine that
 std.csv will be used for Import/Export.
CSV is tabular data, so it would make sense that you may have a similar interface as you would a database. csv.select("name").where!"a==4"("years") Or whatever the interface look like.
Oct 29 2011
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 10/29/2011 3:27 PM, bls wrote:
 What about having such a high precision for D2 language definition ?
Specific suggestions are welcome. Patches are even more welcome.
 std.csv is pretty cool. No doubt, But shouldn't std.csv be based on std.lexer ?
I don't see a similarity.
Oct 29 2011
prev sibling parent reply Jesse Phillips <jessekphillips+d gmail.com> writes:
Thank you all.

I've made my first round of updates to the documentation, will be taking 
another pass and looking at adding to functionality. I'd like some help 
with Walters comments as described below.

On Fri, 28 Oct 2011 12:18:47 -0700, Walter Bright wrote:

 On 10/28/2011 6:18 AM, dsimcha wrote:
 Formal review of Jesse Phillips's CSV parser module begins today and
 runs through November . Following that, a vote will take place for one
 week. Please post any comments about this library in this thread.

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html

 Code:
 https://github.com/he-the-great/phobos/tree/csv
First off, mucho thanks to Jesse for writing this. It's an important module for Phobos.
Thank you. Fun fact, many popular languages don't include a CSV parser,
 This is a review of the documentation:
 
 Include this link: http://en.wikipedia.org/wiki/Comma-separated_values
 so the programmer can learn more about CSV.
Done, added See Also section.
 "This parser will loosely follow the RFC-4180"
 
 How exactly will it differ from RFC-4180?
I do not know how to address this. After making this statement I list the criteria. I don't see the need to list where it differs: * Records terminate with LF, CR and not just CRLF. * Using Double Quotes or Comma can be customized. I feel that I summarized well enough that listing yet another would be redundant.
 "No exceptions are thrown due to incorrect CSV."
 
 What happens, then, when the CSV is incorrect?
Added list.
 "csvText"
 
 Be more specific about what csvText() returns.
You are right. csvText() return a RecordList, but I explain it as though it were the Range. I'll think about what detail here, any suggestions are welcome.
 I'm confused about the heading parameter to csvText().
Well I hope some reorganizing and an added example will help.
 "the Content of the input"

 What is that? There's no parameter named "Content".
I've replace all instances of Content with Contents. It is a template parameter so hopefully this resolves the confusion.
 "RecordList"
 
 "Range which provides access to CSV Records and Fields."
 
 What is the Range? Do you mean RecordList is a Range? And what does
 "provides access" mean? If RecordList is a Range, why is it labeled a
 "List"?
Considering it is the first documentation line of RecordList, I don't see where the confusion on "what the range is." I've re-written the line, it's shorter. Ok, be prepared for confusion. RecordList is a range of records, it is not named RecordRange because Record is also a range, a range of fields. If you have any suggestions for a better name, such as Document or TabularData Please let me know what you'd prefer.
 "Array of the heading contained in the file."
 
 Now I'm really confused as to what a heading is. I thought it was the
 first line? Now there's an array of them?
The line did not say "headings," but I have change it to: "Heading from the input in array form."
 No examples given for constructors.
I provide an example of using RecordList. Any example for the constructors would be the same or similar to the 7 other examples. Is an example needed for both constructors, and maybe an example for front too?
 No documentation for front() or empty().
These are range primitives, those programming in D already know "returns the current element of the range." I find it more appropriate to describe what is returned by front() in the description of the range (many ranges have become private and have no documentation at all, not even their existence). I will of course add documentation, but recommendations on non-pointless documentation would be welcome. Or how to handle the different types which can be assigned to Contents.
 "Record"
 
 "Returned by a RecordList when Contents is a non-struct."
 
 RecordList is a struct, it does not return anything.
It is a range, are we not allowed to talk of them as though they were high-level concepts? Do I leave off where you'd obtain a Record (it is private). Should I say "Returned by an instance of RecordList when calling front?"
 No examples.
It is private, removing doc comment for constructor.
 "csvNextToken"
 
 Example?
Added. Hopefully I addressed most of the documentation. Keep them coming.
Oct 29 2011
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, October 29, 2011 22:33:30 Jesse Phillips wrote:
 On 10/28/2011 6:18 AM, dsimcha wrote:
 First off, mucho thanks to Jesse for writing this. It's an important
 module for Phobos.
Thank you. Fun fact, many popular languages don't include a CSV parser,
CSV parser libraries exist for Java (IIRC Apache has one), but there definitely isn't one in Java's standard library. CSV parsers are not the sort of thing that's generally found in a standard library from what I've seen. So, having one will make us abnormal (albeit in a good way). - Jonathan M Davis
Oct 29 2011
parent reply Jesse Phillips <jessekphillips+d gmail.com> writes:
On Sat, 29 Oct 2011 17:19:44 -0700, Jonathan M Davis wrote:

 On Saturday, October 29, 2011 22:33:30 Jesse Phillips wrote:
 On 10/28/2011 6:18 AM, dsimcha wrote:
 First off, mucho thanks to Jesse for writing this. It's an important
 module for Phobos.
Thank you. Fun fact, many popular languages don't include a CSV parser,
CSV parser libraries exist for Java (IIRC Apache has one), but there definitely isn't one in Java's standard library. CSV parsers are not the sort of thing that's generally found in a standard library from what I've seen. So, having one will make us abnormal (albeit in a good way). - Jonathan M Davis
That is all I meant by it. A CSV parser does exist for every language, ini so people would use XML. It really is sad that instead of something solid they go the way of "discouraging."
Oct 29 2011
parent Somedude <lovelydear mailmetrash.com> writes:
Le 30/10/2011 03:09, Jesse Phillips a écrit :
 That is all I meant by it. A CSV parser does exist for every language, 

 ini so people would use XML. It really is sad that instead of something 
 solid they go the way of "discouraging."
Yes, especially since CSV files are still widely used and prove in many cases far more usable than XML or even JSON. Good luck importing JSON in Excel. Dude
Nov 05 2011
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/29/2011 3:33 PM, Jesse Phillips wrote:
 "This parser will loosely follow the RFC-4180"

 How exactly will it differ from RFC-4180?
I do not know how to address this. After making this statement I list the criteria. I don't see the need to list where it differs: * Records terminate with LF, CR and not just CRLF. * Using Double Quotes or Comma can be customized. I feel that I summarized well enough that listing yet another would be redundant.
I saw the criteria listed, but I didn't see how it was connected with it being different from RFC-4180. Are those criteria the differences? If so, please clarify that it is.
 "RecordList"

 "Range which provides access to CSV Records and Fields."

 What is the Range? Do you mean RecordList is a Range? And what does
 "provides access" mean? If RecordList is a Range, why is it labeled a
 "List"?
Considering it is the first documentation line of RecordList, I don't see where the confusion on "what the range is." I've re-written the line, it's shorter. Ok, be prepared for confusion. RecordList is a range of records, it is not named RecordRange because Record is also a range, a range of fields. If you have any suggestions for a better name, such as Document or TabularData Please let me know what you'd prefer.
When it says "List" I think of a specific data structure, i.e. linked list. Since it is not, it shouldn't be labeled as one. How about simply "Records" ?
 "Array of the heading contained in the file."

 Now I'm really confused as to what a heading is. I thought it was the
 first line? Now there's an array of them?
The line did not say "headings," but I have change it to: "Heading from the input in array form."
An example would make this much, much clearer.
 Is an example needed for both constructors, and maybe an example for
 front too?
I know that when one is intimately familiar with the code, more examples seem redundant. They aren't to the new user!
 No documentation for front() or empty().
These are range primitives, those programming in D already know "returns the current element of the range." I find it more appropriate to describe what is returned by front() in the description of the range (many ranges have become private and have no documentation at all, not even their existence). I will of course add documentation, but recommendations on non-pointless documentation would be welcome. Or how to handle the different types which can be assigned to Contents.
I suggest at a minimum that the docs for these say that they are part of the range interface. I want to press Andrei to write up a doc on "What's a Range" and then you can just provide a link to the right spot in that.
 "Record"

 "Returned by a RecordList when Contents is a non-struct."

 RecordList is a struct, it does not return anything.
It is a range, are we not allowed to talk of them as though they were high-level concepts? Do I leave off where you'd obtain a Record (it is private). Should I say "Returned by an instance of RecordList when calling front?"
I think it is very important to be precise in the use of terms in technical documentation. A Range doesn't "return" anything either, though there is a method of Range which does. A range is a set of elements that can be iterated through.
Oct 29 2011
next sibling parent reply Jesse Phillips <jessekphillips+d gmail.com> writes:
Thank you for the response, only one more follow-up question.

Records
Record

Don't these look easy to confuse? And funny if we didn't have type 
inference.

Records records = csvText(str);

foreach(Record record; records) ...

Anyone want to complain about this suggestion? Record is private and can 
only be obtained from Records (is that appropriate phrasing?).
Oct 30 2011
next sibling parent Steve Teale <steve.teale britseyeview.com> writes:
On Sun, 30 Oct 2011 17:02:26 +0000, Jesse Phillips wrote:

 Thank you for the response, only one more follow-up question.
 
 Records
 Record
If it's any consolation, I'm suffering from similar questions to myself in my naming of structs and methods for my database interface experiments. Since it is rather a short step from a generic 'CSV' interface to a database interface, you may be determining the terminology that I should use. I should really divert for a day and look at what you have done. It would be interesting to see if I could write a similar interface over your stuff. But there's so much to do! Steve
Oct 30 2011
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 10/30/2011 10:02 AM, Jesse Phillips wrote:
 Thank you for the response, only one more follow-up question.

 Records
 Record

 Don't these look easy to confuse? And funny if we didn't have type
 inference.

 Records records = csvText(str);

 foreach(Record record; records) ...

 Anyone want to complain about this suggestion? Record is private and can
 only be obtained from Records (is that appropriate phrasing?).
I started using that convention in dmd, i.e. Objects is an array of Object, and after I got used to it I like it fine.
Oct 30 2011
prev sibling parent "Eric Poggel (JoeCoder)" <dnewsgroup2 yage3d.net> writes:
On 10/29/2011 10:03 PM, Walter Bright wrote:
 I want to press Andrei to write up a doc on "What's a Range"
As a D1 user planning to migrate, this would be very appreciated.
Oct 30 2011
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Fri, 28 Oct 2011 16:18:27 +0300, dsimcha <dsimcha yahoo.com> wrote:

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
Checked the new docs today. If I'm reading them right, the top example prints: "Fred works as a Fly and earns $4 per year" Is this a pop culture reference I'm not catching? An idea I had the other day was to include convenience presets for the more common flavours of CSV formats, e.g.: csvText!CSVFormat_Excel(...) -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Oct 31 2011
parent reply Jesse Phillips <jessekphillips+d gmail.com> writes:
On Mon, 31 Oct 2011 19:21:02 +0200, Vladimir Panteleev wrote:

 On Fri, 28 Oct 2011 16:18:27 +0300, dsimcha <dsimcha yahoo.com> wrote:
 
 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
Checked the new docs today. If I'm reading them right, the top example prints: "Fred works as a Fly and earns $4 per year" Is this a pop culture reference I'm not catching?
No, not that I know of. Should it be? Should I go with a more professional like example?
 An idea I had the other day was to include convenience presets for the
 more common flavours of CSV formats, e.g.: csvText!CSVFormat_Excel(...)
Well if Excel actually conforms to what it claims: http://office.microsoft.com/en-us/excel-help/excel-formatting-and- features-that-are-not-transferred-to-other-file-formats-HP010014105.aspx? CTT=5&origin=HP010099725#BM4 Which I doubt it does, then my parser already defaults to being able to read such formats. In my experience Excel sucks at CSV and follows no rules. The implementation I choose as default is the most common and any other style is just a butchafication and likely to be unreliable, usually stems from "my data has commas, so I'll use colon." Which is great until the data also has colon. But maybe that is just my experience.
Oct 31 2011
next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, November 01, 2011 01:28:07 Jesse Phillips wrote:
 On Mon, 31 Oct 2011 19:21:02 +0200, Vladimir Panteleev wrote:
 On Fri, 28 Oct 2011 16:18:27 +0300, dsimcha <dsimcha yahoo.com> wrote:
 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
Checked the new docs today. If I'm reading them right, the top example prints: "Fred works as a Fly and earns $4 per year" Is this a pop culture reference I'm not catching?
No, not that I know of. Should it be? Should I go with a more professional like example?
 An idea I had the other day was to include convenience presets for the
 more common flavours of CSV formats, e.g.: csvText!CSVFormat_Excel(...)
Well if Excel actually conforms to what it claims: http://office.microsoft.com/en-us/excel-help/excel-formatting-and- features-that-are-not-transferred-to-other-file-formats-HP010014105.aspx? CTT=5&origin=HP010099725#BM4 Which I doubt it does, then my parser already defaults to being able to read such formats. In my experience Excel sucks at CSV and follows no rules.
As I recall, part of the problem is that Excel's behavior with regards to CSV is locale-dependent. - Jonathan M Davis
Oct 31 2011
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2011-11-01 02:28, Jesse Phillips wrote:
 On Mon, 31 Oct 2011 19:21:02 +0200, Vladimir Panteleev wrote:

 On Fri, 28 Oct 2011 16:18:27 +0300, dsimcha<dsimcha yahoo.com>  wrote:

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
Checked the new docs today. If I'm reading them right, the top example prints: "Fred works as a Fly and earns $4 per year" Is this a pop culture reference I'm not catching?
No, not that I know of. Should it be? Should I go with a more professional like example?
 An idea I had the other day was to include convenience presets for the
 more common flavours of CSV formats, e.g.: csvText!CSVFormat_Excel(...)
Well if Excel actually conforms to what it claims: http://office.microsoft.com/en-us/excel-help/excel-formatting-and- features-that-are-not-transferred-to-other-file-formats-HP010014105.aspx? CTT=5&origin=HP010099725#BM4 Which I doubt it does, then my parser already defaults to being able to read such formats. In my experience Excel sucks at CSV and follows no rules. The implementation I choose as default is the most common and any other style is just a butchafication and likely to be unreliable, usually stems from "my data has commas, so I'll use colon." Which is great until the data also has colon. But maybe that is just my experience.
Excel + CSV == Pain in the ass Excel uses different value delimiters as the default setting depending on the locale of the Excel application. -- /Jacob Carlborg
Nov 02 2011
parent reply Don <nospam nospam.com> writes:
On 02.11.2011 09:02, Jacob Carlborg wrote:
 On 2011-11-01 02:28, Jesse Phillips wrote:
 On Mon, 31 Oct 2011 19:21:02 +0200, Vladimir Panteleev wrote:

 On Fri, 28 Oct 2011 16:18:27 +0300, dsimcha<dsimcha yahoo.com> wrote:

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
Checked the new docs today. If I'm reading them right, the top example prints: "Fred works as a Fly and earns $4 per year" Is this a pop culture reference I'm not catching?
No, not that I know of. Should it be? Should I go with a more professional like example?
 An idea I had the other day was to include convenience presets for the
 more common flavours of CSV formats, e.g.: csvText!CSVFormat_Excel(...)
Well if Excel actually conforms to what it claims: http://office.microsoft.com/en-us/excel-help/excel-formatting-and- features-that-are-not-transferred-to-other-file-formats-HP010014105.aspx? CTT=5&origin=HP010099725#BM4 Which I doubt it does, then my parser already defaults to being able to read such formats. In my experience Excel sucks at CSV and follows no rules. The implementation I choose as default is the most common and any other style is just a butchafication and likely to be unreliable, usually stems from "my data has commas, so I'll use colon." Which is great until the data also has colon. But maybe that is just my experience.
Excel + CSV == Pain in the ass Excel uses different value delimiters as the default setting depending on the locale of the Excel application.
Far more horrible than that -- in an old Excel from around 2000 the user interface used US settings for reading short dates, if the day of the month was 12 or less. Otherwise it uses the locale. But it always uses the locale for displaying them. So 11/1, 12/1, 13/1 gets changed to 1 Nov, 1 Dec, 13 Jan. I wrote a macro called "November 9" (9/11) for swapping them back if they were in the buggy range.
Nov 11 2011
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2011-11-11 12:50, Don wrote:
 Far more horrible than that -- in an old Excel from around 2000 the user
 interface used US settings for reading short dates, if the day of the
 month was 12 or less. Otherwise it uses the locale. But it always uses
 the locale for displaying them.

 So 11/1, 12/1, 13/1 gets changed to 1 Nov, 1 Dec, 13 Jan.

 I wrote a macro called "November 9" (9/11) for swapping them back if
 they were in the buggy range.
Haha that's just insane. -- /Jacob Carlborg
Nov 11 2011
prev sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Fri, 11 Nov 2011 13:50:52 +0200, Don <nospam nospam.com> wrote:

 Far more horrible than that -- in an old Excel from around 2000 the user  
 interface used US settings for reading short dates, if the day of the  
 month was 12 or less. Otherwise it uses the locale. But it always uses  
 the locale for displaying them.

 So 11/1, 12/1, 13/1 gets changed to 1 Nov, 1 Dec, 13 Jan.

 I wrote a macro called "November 9" (9/11) for swapping them back if  
 they were in the buggy range.
Wow. Reminds me of this: http://thedailywtf.com/Articles/TransAtlantic-Time-Trap.aspx -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Nov 11 2011
prev sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Tue, 01 Nov 2011 03:28:07 +0200, Jesse Phillips  
<jessekphillips+d gmail.com> wrote:

 On Mon, 31 Oct 2011 19:21:02 +0200, Vladimir Panteleev wrote:

 On Fri, 28 Oct 2011 16:18:27 +0300, dsimcha <dsimcha yahoo.com> wrote:

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
Checked the new docs today. If I'm reading them right, the top example prints: "Fred works as a Fly and earns $4 per year" Is this a pop culture reference I'm not catching?
No, not that I know of. Should it be? Should I go with a more professional like example?
I guess such things are really subjective, but personally to me it does seem a bit silly, and out-of-place from the documentation of a language that aims to be the successor of C++. Have you decided about whether you'll be adding a formatter? Regardless of that, IMO the naming of "csvText" is not ideal. What does the "text" part imply - isn't CSV inherently a text format? How about "csvReader", which would also disambiguate it from a CSV formatter? -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Nov 02 2011
parent Jesse Phillips <jessekphillips+d gmail.com> writes:
On Wed, 02 Nov 2011 15:11:24 +0200, Vladimir Panteleev wrote:

 I guess such things are really subjective, but personally to me it does
 seem a bit silly, and out-of-place from the documentation of a language
 that aims to be the successor of C++.
I'll make it a pop culture reference... kidding.
 Regardless
 of that, IMO the naming of "csvText" is not ideal. What does the "text"
 part imply - isn't CSV inherently a text format? How about "csvReader",
 which would also disambiguate it from a CSV formatter?
Sounds reasonable to me. CSV could be applied to other things, not that it ever is. I started making the implementation very generic and had to cut back especially since I couldn't test it all. I had even played with the idea of being able to just return slices of a range of things https://github.com/he-the-great/JPDLibs/blob/separator/csv/csv.d but stopped its development in favor of something that work with input ranges and could get into Phobos.
 Have you decided about whether you'll be adding a formatter? 
Yes I'll give this a try, but I'm going to concentrate on documentation of what is there.
Nov 02 2011
prev sibling next sibling parent reply Jesse Phillips <jessekphillips+d gmail.com> writes:
On Fri, 28 Oct 2011 09:18:27 -0400, dsimcha wrote:

 Formal review of Jesse Phillips's CSV parser module begins today and
 runs through November .  Following that, a vote will take place for one
 week.  Please post any comments about this library in this thread.
 
 Docs: http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
 
 Code:
 https://github.com/he-the-great/phobos/tree/csv
I've done another update this covers: - Documentation more complete - A provided heading can be any InputRange of string - csvText => csvReader - RecordList => Records - csvReader no longer takes a Malformed enum, and instead provides customization to comma and quote. Tomorrow I will try and complete one or more of the following with this priority: - Have row/column information for thrown exceptions. - Support [string][string] Contents. - Write a CSV formatter. Support for range of struct, range of [string] [string], and range of string[].
Nov 05 2011
parent reply Jesse Phillips <jessekphillips+d gmail.com> writes:
And only one pending request, csvFormatter. I am still considering how I 
want to approach this. I don't want to rush it in so I don't plan on 
having it complete before voting starts.

It has been fun and I look forward to the verdict.

On Sat, 05 Nov 2011 23:37:43 +0000, Jesse Phillips wrote:

 On Fri, 28 Oct 2011 09:18:27 -0400, dsimcha wrote:
 
 Formal review of Jesse Phillips's CSV parser module begins today and
 runs through November .  Following that, a vote will take place for one
 week.  Please post any comments about this library in this thread.
 
 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
 
 Code:
 https://github.com/he-the-great/phobos/tree/csv
I've done another update this covers: - Documentation more complete - A provided heading can be any InputRange of string - csvText => csvReader - RecordList => Records - csvReader no longer takes a Malformed enum, and instead provides customization to comma and quote. Tomorrow I will try and complete one or more of the following with this priority: - Have row/column information for thrown exceptions. - Support [string][string] Contents. - Write a CSV formatter. Support for range of struct, range of [string] [string], and range of string[].
Nov 06 2011
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sun, 06 Nov 2011 19:00:42 -0500, Jesse Phillips <jessekphillips+d gmail.com>
wrote:
 And only one pending request, csvFormatter. I am still considering how I
 want to approach this. I don't want to rush it in so I don't plan on
 having it complete before voting starts.
My primary use of csv files is as a method of saving data to later graph / view / format in Excel. i.e. csvWrite("filename",["x","y","z"],x,y,z); where x, y, z are ranges with possibly different lengths and/or types. I also have some API design concerns. Primarily, most of the type names are overly broad and unclear. For example: Record, Records, Malformed could just as easily be part of a database API. And because of this, not only does one reading the code not know what the object is, but if namespace conflicts can occur, then code reviewers will probably make the wrong assumption half the time as to what Record is and not bother looking it up. Furthermore, Records and Record, are essentially internal ranges/structs returned by csvReader. In the tradition of std.algorithms these should be defined inside the csvReader function (preferably) or as CsvReader / CsvReader.Record.
Nov 07 2011
parent reply Jesse Phillips <jessekphillips+d gmail.com> writes:
On Mon, 07 Nov 2011 23:28:20 -0500, Robert Jacques wrote:

 On Sun, 06 Nov 2011 19:00:42 -0500, Jesse Phillips
 <jessekphillips+d gmail.com> wrote:
 And only one pending request, csvFormatter. I am still considering how
 I want to approach this. I don't want to rush it in so I don't plan on
 having it complete before voting starts.
My primary use of csv files is as a method of saving data to later graph / view / format in Excel. i.e. csvWrite("filename",["x","y","z"],x,y,z); where x, y, z are ranges with possibly different lengths and/or types.
I don't want to do anything with files, I'm thinking an output buffer makes the most sense (should be able to have such a buffer write to a file). The x,y,z seems interesting but what about the poor sole who has a range of struct { auto x,y,z; } If I'm going to make a csvFormatter than making sure something like csvFormatter(csvReader(txt)); works is number one priority. But you have added a style to consider when building the interface.
 I also have some API design concerns. Primarily, most of the type names
 are overly broad and unclear. For example: Record, Records, Malformed
 could just as easily be part of a database API. And because of this, not
 only does one reading the code not know what the object is, but if
 namespace conflicts can occur, then code reviewers will probably make
 the wrong assumption half the time as to what Record is and not bother
 looking it up.
The reason it is name like what you would find in a database, is because they are the same thing. It is a row of data. In past discussion about existing name clashes it was concluded that was what modules, static, and named imports wore for. Malformed, I just don't have any clue what to actually call it. No CSVMalformed doesn't help. But then you go on to the point below, that you shouldn't even see them.
 Furthermore, Records and Record, are essentially internal ranges/structs
 returned by csvReader. In the tradition of std.algorithms these should
 be defined inside the csvReader function (preferably) or as CsvReader /
 CsvReader.Record.
I was going to explain how wrong you are and how I tried having all the options part of csvReader but it just muddies up the clean API when customizing something. But actually I already found the solution with my previous updates and incorrectly chose to move Malformed customization out. Long story short the problems I see: * The documentation still needs to exist and will be a little harder to do without these * Records contains the heading field, this is abnormal for an InputRange and means it won't show as an auto link if it becomes a hidden range. * Currently you can find exactly which function call result in a specific error, if I hide the ranges you don't. I'm expecting that introducing CSVFormatter will require another review, at that time I can try making two forms of documentation to see which is preferred. Thank you! Very useful food for thought.
Nov 10 2011
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, November 11, 2011 03:17:10 Jesse Phillips wrote:
 I also have some API design concerns. Primarily, most of the type names
 are overly broad and unclear. For example: Record, Records, Malformed
 could just as easily be part of a database API. And because of this, not
 only does one reading the code not know what the object is, but if
 namespace conflicts can occur, then code reviewers will probably make
 the wrong assumption half the time as to what Record is and not bother
 looking it up.
The reason it is name like what you would find in a database, is because they are the same thing. It is a row of data. In past discussion about existing name clashes it was concluded that was what modules, static, and named imports wore for.
Actually, I'd argue that Row would be better than Record, since it _is_ a row in a table. Personally, I'd find it to be more immediately clear that way. With Record, I have to figure out what the heck it is a record of, whereas Row is immediately obvious in this context. - Jonathan M Davis
Nov 11 2011
parent reply Jesse Phillips <jessekphillips+d gmail.com> writes:
On Fri, 11 Nov 2011 02:17:40 -0800, Jonathan M Davis wrote:

 Actually, I'd argue that Row would be better than Record, since it _is_
 a row in a table. Personally, I'd find it to be more immediately clear
 that way. With Record, I have to figure out what the heck it is a record
 of, whereas Row is immediately obvious in this context.
 
 - Jonathan M Davis
It _is_ a record of data, just as much as it is a row in a table. The RFC I reference continuously refers to records and never even says "row."
Nov 11 2011
next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, November 11, 2011 15:25:29 Jesse Phillips wrote:
 On Fri, 11 Nov 2011 02:17:40 -0800, Jonathan M Davis wrote:
 Actually, I'd argue that Row would be better than Record, since it _is_
 a row in a table. Personally, I'd find it to be more immediately clear
 that way. With Record, I have to figure out what the heck it is a record
 of, whereas Row is immediately obvious in this context.
 
 - Jonathan M Davis
It _is_ a record of data, just as much as it is a row in a table. The RFC I reference continuously refers to records and never even says "row."
Well, then you have a good argument for keeping it as Record. But spreadsheets are tables of rows and columns, and a CSV file is a spreadsheet. Personally, I find the term record overly vague and wouldn't use it for much of anything - not to mention that it makes me think of the analog music device rather than anything else. So, I don't particularly like the name Record. I wouldn't use it in reference to a DB either. I'd use the term row there as well. But others may not agree with me, and if the RFC uses the term record, then that's a definite argument for using Record instead of Row. - Jonathan M Davis
Nov 11 2011
prev sibling parent Manu <turkeyman gmail.com> writes:
On 11 November 2011 19:07, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 On Friday, November 11, 2011 15:25:29 Jesse Phillips wrote:
 On Fri, 11 Nov 2011 02:17:40 -0800, Jonathan M Davis wrote:
 Actually, I'd argue that Row would be better than Record, since it _is_
 a row in a table. Personally, I'd find it to be more immediately clear
 that way. With Record, I have to figure out what the heck it is a
record
 of, whereas Row is immediately obvious in this context.

 - Jonathan M Davis
It _is_ a record of data, just as much as it is a row in a table. The RFC I reference continuously refers to records and never even says "row."
Well, then you have a good argument for keeping it as Record. But spreadsheets are tables of rows and columns, and a CSV file is a spreadsheet. Personally, I find the term record overly vague and wouldn't use it for much of anything - not to mention that it makes me think of the analog music device rather than anything else. So, I don't particularly like the name Record. I wouldn't use it in reference to a DB either. I'd use the term row there as well. But others may not agree with me, and if the RFC uses the term record, then that's a definite argument for using Record instead of Row.
+1 for row... although I appreciate the RFC says record all over the place. That said though, upon reading it, I think that might be the worst RFC ever written ;)
Nov 11 2011
prev sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Thu, 10 Nov 2011 22:17:10 -0500, Jesse Phillips <jessekphillips+d gmail.com>
wrote:
 On Mon, 07 Nov 2011 23:28:20 -0500, Robert Jacques wrote:
 On Sun, 06 Nov 2011 19:00:42 -0500, Jesse Phillips
 <jessekphillips+d gmail.com> wrote:
[snip]
 I also have some API design concerns. Primarily, most of the type names
 are overly broad and unclear. For example: Record, Records, Malformed
 could just as easily be part of a database API. And because of this, not
 only does one reading the code not know what the object is, but if
 namespace conflicts can occur, then code reviewers will probably make
 the wrong assumption half the time as to what Record is and not bother
 looking it up.
The reason it is name like what you would find in a database, is because they are the same thing. It is a row of data. In past discussion about existing name clashes it was concluded that was what modules, static, and named imports wore for. Malformed, I just don't have any clue what to actually call it. No CSVMalformed doesn't help. But then you go on to the point below, that you shouldn't even see them.
 Furthermore, Records and Record, are essentially internal ranges/structs
 returned by csvReader. In the tradition of std.algorithms these should
 be defined inside the csvReader function (preferably) or as CsvReader /
 CsvReader.Record.
I was going to explain how wrong you are and how I tried having all the options part of csvReader but it just muddies up the clean API when customizing something. But actually I already found the solution with my previous updates and incorrectly chose to move Malformed customization out. Long story short the problems I see: * The documentation still needs to exist and will be a little harder to do without these * Records contains the heading field, this is abnormal for an InputRange and means it won't show as an auto link if it becomes a hidden range. * Currently you can find exactly which function call result in a specific error, if I hide the ranges you don't. I'm expecting that introducing CSVFormatter will require another review, at that time I can try making two forms of documentation to see which is preferred. Thank you! Very useful food for thought.
Modules, static and named imports are extremely useful tools for making disparate third party libraries work together. But Phobos is one (big) library, not a bunch of separate ones. We shouldn't ever need to use these tools to resolve Phobos name conflicts. In fact one of the (many) goals of D is to scale down to quick and dirty scripts. rdmd helps with this a lot, but the other piece of this puzzle is 'import std.all;' which doesn't work today. And while I don't think we should break existing code just to fix this problem, going forward, as we update and add libraries to Phobos, we need to strive to prevent name collisions. And, like std.file and std.stdio, I'd expect std.csv and std.databases to be used together often. Furthermore, your primary case for exposing these structs is for documentation purposes. As the user isn't expected to actually instantiate these structs (or at least not often), naming them CsvReader and CsvReader.Row is perfectly acceptable. Lastly, I'd like to draw your attention to std.algorithm.findSplit. findSplit has to return three ranges, which it does so using a Tuple. Records is simply a manual tuple of the headings and the row ranges, so why not leverage Phobos for this instead?
Nov 11 2011
parent reply Jesse Phillips <jessekphillips+d gmail.com> writes:
On Sat, 12 Nov 2011 00:21:55 -0500, Robert Jacques wrote:

 On Thu, 10 Nov 2011 22:17:10 -0500, Jesse Phillips
 <jessekphillips+d gmail.com> wrote:
 On Mon, 07 Nov 2011 23:28:20 -0500, Robert Jacques wrote:
 On Sun, 06 Nov 2011 19:00:42 -0500, Jesse Phillips
 <jessekphillips+d gmail.com> wrote:
[snip]
 I also have some API design concerns. Primarily, most of the type
 names are overly broad and unclear. For example: Record, Records,
 Malformed could just as easily be part of a database API. And because
 of this, not only does one reading the code not know what the object
 is, but if namespace conflicts can occur, then code reviewers will
 probably make the wrong assumption half the time as to what Record is
 and not bother looking it up.
The reason it is name like what you would find in a database, is because they are the same thing. It is a row of data. In past discussion about existing name clashes it was concluded that was what modules, static, and named imports wore for. Malformed, I just don't have any clue what to actually call it. No CSVMalformed doesn't help. But then you go on to the point below, that you shouldn't even see them.
 Furthermore, Records and Record, are essentially internal
 ranges/structs returned by csvReader. In the tradition of
 std.algorithms these should be defined inside the csvReader function
 (preferably) or as CsvReader / CsvReader.Record.
I was going to explain how wrong you are and how I tried having all the options part of csvReader but it just muddies up the clean API when customizing something. But actually I already found the solution with my previous updates and incorrectly chose to move Malformed customization out. Long story short the problems I see: * The documentation still needs to exist and will be a little harder to do without these * Records contains the heading field, this is abnormal for an InputRange and means it won't show as an auto link if it becomes a hidden range. * Currently you can find exactly which function call result in a specific error, if I hide the ranges you don't. I'm expecting that introducing CSVFormatter will require another review, at that time I can try making two forms of documentation to see which is preferred. Thank you! Very useful food for thought.
Modules, static and named imports are extremely useful tools for making disparate third party libraries work together. But Phobos is one (big) library, not a bunch of separate ones. We shouldn't ever need to use these tools to resolve Phobos name conflicts. In fact one of the (many) goals of D is to scale down to quick and dirty scripts. rdmd helps with this a lot, but the other piece of this puzzle is 'import std.all;' which doesn't work today. And while I don't think we should break existing code just to fix this problem, going forward, as we update and add libraries to Phobos, we need to strive to prevent name collisions. And, like std.file and std.stdio, I'd expect std.csv and std.databases to be used together often. Furthermore, your primary case for exposing these structs is for documentation purposes. As the user isn't expected to actually instantiate these structs (or at least not often), naming them CsvReader and CsvReader.Row is perfectly acceptable.
I like the point on std.all, but isn't std.csv.Rows and std.csv.Row about the same as your example names? And if dmd could do partial matches (not exactly requesting such a feature) csv.Rows, csv.Row.
 Lastly, I'd
 like to draw your attention to std.algorithm.findSplit. findSplit has to
 return three ranges, which it does so using a Tuple. Records is simply a
 manual tuple of the headings and the row ranges, so why not leverage
 Phobos for this instead?
Records requires access to the header for building the associative array. So while I still could return both, the range would still be holding the header. If you think it would make an improvement then you'd probably want to update your vote to no. Namely what I would see coming from that: auto records = csvReader(txt); records.header ...; foreach(row; records.rows) { ... }
Nov 15 2011
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Tue, 15 Nov 2011 22:30:51 -0500, Jesse Phillips <jessekphillips+d gmail.com>
wrote:
 On Sat, 12 Nov 2011 00:21:55 -0500, Robert Jacques wrote:
 On Thu, 10 Nov 2011 22:17:10 -0500, Jesse Phillips
 <jessekphillips+d gmail.com> wrote:
 On Mon, 07 Nov 2011 23:28:20 -0500, Robert Jacques wrote:
 On Sun, 06 Nov 2011 19:00:42 -0500, Jesse Phillips
 <jessekphillips+d gmail.com> wrote:
[snip]
 Modules, static and named imports are extremely useful tools for making
 disparate third party libraries work together. But Phobos is one (big)
 library, not a bunch of separate ones. We shouldn't ever need to use
 these tools to resolve Phobos name conflicts. In fact one of the (many)
 goals of D is to scale down to quick and dirty scripts. rdmd helps with
 this a lot, but the other piece of this puzzle is 'import std.all;'
 which doesn't work today. And while I don't think we should break
 existing code just to fix this problem, going forward, as we update and
 add libraries to Phobos, we need to strive to prevent name collisions.
 And, like std.file and std.stdio, I'd expect std.csv and std.databases
 to be used together often. Furthermore, your primary case for exposing
 these structs is for documentation purposes. As the user isn't expected
 to actually instantiate these structs (or at least not often), naming
 them CsvReader and CsvReader.Row is perfectly acceptable.
I like the point on std.all, but isn't std.csv.Rows and std.csv.Row about the same as your example names?
Yes and no. Yes, from the point of view of reading 'std.csv.Rows' in existing code. No, from the point of view writing code. You're much more likely to write just 'Rows' and then have to go through a quick-compile-fix-compile loop. Worse, consider the people using the colliding std.databases.Row struct, which now have to use the fully qualified name. I don't feel that overly succinct names are important to std.csv, as most users will be using 'auto records = csvReader(txt);' and will never have to type out std.csv.Rows or csvRows or whatever.
 And if dmd could do partial matches (not
 exactly requesting such a feature) csv.Rows, csv.Row.
 Lastly, I'd
 like to draw your attention to std.algorithm.findSplit. findSplit has to
 return three ranges, which it does so using a Tuple. Records is simply a
 manual tuple of the headings and the row ranges, so why not leverage
 Phobos for this instead?
Records requires access to the header for building the associative array. So while I still could return both, the range would still be holding the header. If you think it would make an improvement then you'd probably want to update your vote to no. Namely what I would see coming from that: auto records = csvReader(txt); records.header ...; foreach(row; records.rows) { ... }
Thank you for explaining. I wasn't seeing the implementation efficiency that comes from the custom struct approach.
Nov 15 2011
parent Jesse Phillips <jessekphillips+d gmail.com> writes:
On Tue, 15 Nov 2011 23:12:57 -0500, Robert Jacques wrote:

 Thank you for explaining. I wasn't seeing the implementation efficiency
 that comes from the custom struct approach.
No, thank you. It is good to have someone fighting a different opinion. It is because of you we can have this bike-shedding and not even have shed :D. Had the hiding of both structs not worked so well I would have changed it (as I would have thought on it more, realized it was two against zero, and wouldn't be distracted with the idea of simplifying the docs)
Nov 15 2011
prev sibling next sibling parent dsimcha <dsimcha yahoo.com> writes:
On 10/28/2011 9:18 AM, dsimcha wrote:
 Formal review of Jesse Phillips's CSV parser module begins today and
 runs through November . Following that, a vote will take place for one
 week. Please post any comments about this library in this thread.

 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html

 Code:
 https://github.com/he-the-great/phobos/tree/csv
I apologize. I just noticed that I forgot to put the end date in this message. The review of the CSV parser runs the standard two weeks, until November 11.
Nov 06 2011
prev sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, October 28, 2011 09:18:27 dsimcha wrote:
 Formal review of Jesse Phillips's CSV parser module begins today and
 runs through November .  Following that, a vote will take place for one
 week.  Please post any comments about this library in this thread.
 
 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
 
 Code:
 https://github.com/he-the-great/phobos/tree/csv
- Nitpick: You should probably proofread the module documentation again. It has several mispellings and grammar errors. - The functions and types mentioned in the documentation should use LREF so that they're links to their documentation. - Shouldn't the term "header" be used rather than "heading?" So, you'd have HeaderMismatchException instead of HeadingMismatchException. Usually, the term heading is only used for stuff like the heading of an article. Usually, the term header is used for stuff like files. RFC 4180 uses the term header but never heading. So, unless I'm missing something, I really think that all of the instances of heading (both in the documentation and in the API) should be changed to header. - I'd argue against Separator in csvReader being char by default. Individual characters should generally be dchar, not char. Using char by itself is almost always bad practice IMHO. - I'd say that csvReader should say that it returns a Records struct which is an input range rather than saying that it provides std.range.isInputRange. We just don't talk that way about ranges normally. We say that something is a particular range type, not that it provides the API for a particular range type. If you want to make "is an input range" a link to std.range.isInputRange, that's fine, but what's currently there is not how ranges are typically referred to. I'd also suggest that you on the range functions where you say that they're part of the std.range.isInputRange interface, you should use the term API instead of interface, since interface has a specific meaning in the language which has nothing to do with this context. - Why does csvNextToken take an Appender!(char[])? That seems overly restrictive. I'd argue that it should take an output range of dchar (which Appender!(char[]) is). Then it's not restricted to Appender, and it's not restricted to char[]. - Also, the fact that you have isSomeChar!(ElementType!Range) concerns me somewhat. Normally, I'd expect something more like is(ElementType!Range == dchar). As it is, you could have a range of char or wchar, which could do funny things. Phobos doesn't normally support that, and I'd be concerned about what would happen if someone tried. ElementType!Range will be dchar for all string types, so they'll work fine either way, but programmers who are overly adventurous or ignorant may try and use a char or wchar range, and I'm willing to be that there would be problems with that. Just all around, I'd advise checking that the ranges are ranges of dchar and using dchar for the separators rather than char. - You should have unit tests testing ranges _other_ than string. The fact that neither wstring nor dstring is tested is worrisome, and ideally, you'd do stuff like call filter!"true" on a string so that you have an input range which isn't a string. Without at least some of that, I'd be very worried that your code can't actually handle generic ranges of dchar and really only works with string. - You should have some unit tests with unicode values - preferably some which are multiple code units for both char and wchar. For instance, std.string uses \U00010143 in some of its tests, because it's 4 code units in UTF-8 and 2 in UTF-16. The complete lack of testing with unicode is worrisome in the same way that only testing with string is worrisome. It's too easy to screw something up such that it happens to work with string and ASCII but not in general. So, please add tests which involve unicode - not only in the data but in the separators. - I find it somewhat odd that you would use a pointer to the range in Record rather than using a slice, but I'd have to examine it in more detail to see whether that's actually necessary or not. But in general, I'd argue that a slice should be used rather than a pointer if it can be. - Jonathan M Davis P.S. Just as a note, since you seem to like not using braces if you don't need to for if statements and the like, I though that I'd point out that try and catch don't require braces in D if htey contain only one statement. You have several one line catches which use braces when they don't need to. I'm not arguing that you should or shouldn't change it, but I thought that I would point out that the braces aren't necessary in case you didn't know (since they _are_ necessary in other languages).
Nov 10 2011
parent Jesse Phillips <jessekphillips+d gmail.com> writes:
Thank you, lots of stuff to ponder and comment.

On Thu, 10 Nov 2011 02:10:28 -0800, Jonathan M Davis wrote:

 On Friday, October 28, 2011 09:18:27 dsimcha wrote:
 Formal review of Jesse Phillips's CSV parser module begins today and
 runs through November .  Following that, a vote will take place for one
 week.  Please post any comments about this library in this thread.
 
 Docs:
 http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
 
 Code:
 https://github.com/he-the-great/phobos/tree/csv
- Nitpick: You should probably proofread the module documentation again. It has several mispellings and grammar errors.
Err, I'll try, use some of the great tips you gave me :)
 - The functions and types mentioned in the documentation should use LREF
 so that they're links to their documentation.
I'll see if I can find them all while proofing.
 - Shouldn't the term "header" be used rather than "heading?" So, you'd
 have HeaderMismatchException instead of HeadingMismatchException.
 Usually, the term heading is only used for stuff like the heading of an
 article. Usually, the term header is used for stuff like files. RFC 4180
 uses the term header but never heading. So, unless I'm missing
 something, I really think that all of the instances of heading (both in
 the documentation and in the API) should be changed to header.
Used both in development, then chose one, guess I got it wrong. Changing.
 - I'd argue against Separator in csvReader being char by default.
 Individual characters should generally be dchar, not char. Using char by
 itself is almost always bad practice IMHO.
The only reason I put defaults on it was because it wouldn't choose the value for it from my default values. Otherwise it works with what you give it, be it char, wchar, dchar.
 - I'd say that csvReader should say that it returns a Records struct
 which is an input range rather than saying that it provides
 std.range.isInputRange. We just don't talk that way about ranges
 normally. We say that something is a particular range type, not that it
 provides the API for a particular range type. If you want to make "is an
 input range" a link to std.range.isInputRange, that's fine, but what's
 currently there is not how ranges are typically referred to. I'd also
 suggest that you on the range functions where you say that they're part
 of the std.range.isInputRange interface, you should use the term API
 instead of interface, since interface has a specific meaning in the
 language which has nothing to do with this context.
Thanks.
 - Why does csvNextToken take an Appender!(char[])? That seems overly
 restrictive. I'd argue that it should take an output range of dchar
 (which Appender!(char[]) is). Then it's not restricted to Appender, and
 it's not restricted to char[].
Appender is used so that the exception thrown can contain the consumed data. Thinking again on this, I suppose I can catch the exception add the data and rethrow. Sounds pretty good actually.
 - Also, the fact that you have isSomeChar!(ElementType!Range) concerns
 me somewhat. Normally, I'd expect something more like
 is(ElementType!Range == dchar). As it is, you could have a range of char
 or wchar, which could do funny things. Phobos doesn't normally support
 that, and I'd be concerned about what would happen if someone tried.
 ElementType!Range will be dchar for all string types, so they'll work
 fine either way, but programmers who are overly adventurous or ignorant
 may try and use a char or wchar range, and I'm willing to be that there
 would be problems with that. Just all around, I'd advise checking that
 the ranges are ranges of dchar and using dchar for the separators rather
 than char.
I was of a different mindset of the time. You are probably right I should prevent someone from eating their leg off.
 - You should have unit tests testing ranges _other_ than string. The
 fact that neither wstring nor dstring is tested is worrisome, and
 ideally, you'd do stuff like call filter!"true" on a string so that you
 have an input range which isn't a string. Without at least some of that,
 I'd be very worried that your code can't actually handle generic ranges
 of dchar and really only works with string.
Did add a wchar input range test, but now it should just be an dchar input range test since wchar ranges won't be allowed.
 - You should have some unit tests with unicode values - preferably some
 which are multiple code units for both char and wchar. For instance,
 std.string uses \U00010143 in some of its tests, because it's 4 code
 units in UTF-8 and 2 in UTF-16. The complete lack of testing with
 unicode is worrisome in the same way that only testing with string is
 worrisome. It's too easy to screw something up such that it happens to
 work with string and ASCII but not in general. So, please add tests
 which involve unicode - not only in the data but in the separators.
Didn't create any new tests, just added unicode in to input and as separators into old tests.
 - I find it somewhat odd that you would use a pointer to the range in
 Record rather than using a slice, but I'd have to examine it in more
 detail to see whether that's actually necessary or not. But in general,
 I'd argue that a slice should be used rather than a pointer if it can
 be.
Record requires modifying the range, for a true input range this is not a problem, but strings and some other struct based ranges are implicit forward ranges and save upon passing. Slicing doesn't exist on input ranges, though I'd almost argue [] should take the place of save() in a forward range.
 - Jonathan M Davis
 
 
 P.S.  Just as a note, since you seem to like not using braces if you
 don't need to for if statements and the like, I though that I'd point
 out that try and catch don't require braces in D if htey contain only
 one statement. You have several one line catches which use braces when
 they don't need to. I'm not arguing that you should or shouldn't change
 it, but I thought that I would point out that the braces aren't
 necessary in case you didn't know (since they _are_ necessary in other
 languages).
Thanks, hopefully I'm consistent in that regard and I'll just leave it. I knew of try, but not catch.
Nov 10 2011