digitalmars.D.learn - Reading files using delimiters/terminators

Rekel <paultjeadriaanse gmail.com> writes:
I'm trying to read a file with entries separated by '\n\n' (empty 
line), with entries containing '\n'. I thought the 
File.readLine(KeepTerminator, Terminator) might work, as it seems 
to accept strings as terminators, since there seems to have been 
a thread regarding '\r\n' separators.

I don't know if there's some underlying reason, but when I try to 
use "\n\n" as a terminator, I end up getting the entire file into 
1 char[], so it's not delimited.

Should this work or is there a reason one cannot use byLine like 
this?

For context, I'm trying this with the puzzle input of day 6 of 
this year's advent of code. (https://adventofcode.com/)
Dec 26 2020
Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Sunday, 27 December 2020 at 00:13:30 UTC, Rekel wrote:
 I'm trying to read a file with entries separated by '\n\n' 
 (empty line), with entries containing '\n'. I thought the 
 File.readLine(KeepTerminator, Terminator) might work, as it 
 seems to accept strings as terminators, since there seems to 
 have been a thread regarding '\r\n' separators.

 I don't know if there's some underlying reason, but when I try 
 to use "\n\n" as a terminator, I end up getting the entire file 
 into 1 char[], so it's not delimited.

 Should this work or is there a reason one cannot use byLine 
 like this?

 For context, I'm trying this with the puzzle input of day 6 of 
 this year's advent of code. (https://adventofcode.com/)
Unfortunately std.csv is character based and not string.
https://dlang.org/phobos/std_csv.html#.csvReader

But your use case sounds like splitter is more aligned with your needs.
https://dlang.org/phobos/std_algorithm_iteration.html#.splitter
Dec 26 2020
Rekel <paultjeadriaanse gmail.com> writes:
On Sunday, 27 December 2020 at 02:41:12 UTC, Jesse Phillips wrote:
 Unfortunately std.csv is character based and not string. 
 https://dlang.org/phobos/std_csv.html#.csvReader

 But your use case sounds like splitter is more aligned with 
 your needs.

 https://dlang.org/phobos/std_algorithm_iteration.html#.splitter
But I'm not using csv, right?
Additionally, shouldn't byLine also work with "\r\n"?
Dec 27 2020
Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Sunday, 27 December 2020 at 13:21:44 UTC, Rekel wrote:
 On Sunday, 27 December 2020 at 02:41:12 UTC, Jesse Phillips 
 wrote:
 Unfortunately std.csv is character based and not string. 
 https://dlang.org/phobos/std_csv.html#.csvReader

 But your use case sounds like splitter is more aligned with 
 your needs.

 https://dlang.org/phobos/std_algorithm_iteration.html#.splitter
 But I'm not using csv, right? Additionally, shouldn't byLine also work with "\r\n"?
Right, you weren't using csv. I'm not familiar enough with the file terminator to know why it didn't work. byLine would allow \r\n as well as \n.
Dec 27 2020
Ali Çehreli <acehreli yahoo.com> writes:
On 12/26/20 4:13 PM, Rekel wrote:
 I'm trying to read a file with entries separated by '\n\n' (empty line), 
 with entries containing '\n'. I thought the 
 File.readLine(KeepTerminator, Terminator) might work, as it seems to 
 accept strings as terminators, since there seems to have been a thread 
 regarding '\r\n' separators.
 
 I don't know if there's some underlying reason, but when I try to use 
 "\n\n" as a terminator, I end up getting the entire file into 1 char[], 
 so it's not delimited.
 
 Should this work or is there a reason one cannot use byLine like this?
 
 For context, I'm trying this with the puzzle input of day 6 of this 
 year's advent of code. (https://adventofcode.com/)
byLine should work:

    import std.stdio;

    void main() {
        auto f = File("deneme.d");

        // Warning: byLine reuses an internal buffer. Call byLineCopy
        // if strings sliced out of the line need to persist.
        foreach (line; f.byLine) {
            if (line.length == 0) {
                writeln("EMPTY LINE");
            } else {
                writeln(line);
            }
        }
    }

Ali
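
The empty-line check above can be extended to collect whole entries. The following is only a sketch under assumptions not in the thread: the helper name groupByBlankLines is hypothetical, the sample lines are made up, and the input is assumed to use plain '\n' line endings (with '\r\n' endings a "blank" line would contain a '\r' and not be empty).

```d
import std.stdio : File, writeln;

/// Hypothetical helper: collects consecutive non-empty lines into groups.
/// An empty line closes the current group.
string[][] groupByBlankLines(Lines)(Lines lines)
{
    string[][] groups;
    string[] current;

    foreach (line; lines)
    {
        if (line.length == 0)
        {
            if (current.length)
                groups ~= current;
            current = null; // start a fresh group
        }
        else
        {
            // byLine reuses its buffer, so copy each line that must persist.
            current ~= line.idup;
        }
    }

    // The last group has no trailing empty line to close it.
    if (current.length)
        groups ~= current;
    return groups;
}

void main()
{
    // In-memory stand-in for File("input").byLine; the data is made up.
    auto lines = ["abc", "", "a", "b", "c", "", "ab", "ac"];
    foreach (group; groupByBlankLines(lines))
        writeln(group);
}
```

Passing File("input").byLine instead of the array should work the same way, since the helper only needs something that foreach can iterate line by line.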
Dec 26 2020
oddp <oddp posteo.de> writes:
On 27.12.20 01:13, Rekel via Digitalmars-d-learn wrote:
 For context, I'm trying this with the puzzle input of day 6 of this year's
 advent of code. (https://adventofcode.com/)
For that specific puzzle I simply did:

    foreach (group; readText("input").splitter("\n\n")) { ... }

Since the input is never that big, I prefer reading in the whole thing and then doing the processing.

Also, on other days, when the input is more uniform, there's always https://dlang.org/library/std/file/slurp.html which makes reading it in even easier, e.g. day02:

    alias Record = Tuple!(int, "low", int, "high", char, "needle", string, "hay");
    auto input = slurp!Record("input", "%d-%d %s: %s");

P.S.: would've loved to have had multiwayIntersection in the stdlib for day06 part2, especially when there's already multiwayUnion in setops. fold!setIntersection felt a bit clunky.
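
For readers who want the whole-file approach spelled out, here is a minimal sketch. The function name parseGroups and the sample text are assumptions for illustration; in real use the text would come from readText("input"), and the input is assumed to use '\n' line endings.

```d
import std.algorithm : map, splitter;
import std.array : array;
import std.stdio : writeln;
import std.string : splitLines, strip;

/// Hypothetical helper: splits text into blank-line-separated groups,
/// each group being the list of its lines.
string[][] parseGroups(string text)
{
    return text.strip              // drop the trailing newline of the file
        .splitter("\n\n")          // lazily split on empty lines
        .map!(group => group.splitLines)
        .array;                    // materialize as string[][]
}

void main()
{
    // Stand-in for readText("input"); the data is made up.
    auto text = "abc\n\na\nb\nc\n\nab\nac\n";
    foreach (group; parseGroups(text))
        writeln(group);
}
```

No copying of the entries takes place here: each group and line is a slice of the original text.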
Dec 27 2020
Rekel <paultjeadriaanse gmail.com> writes:
On Sunday, 27 December 2020 at 13:27:49 UTC, oddp wrote:
 foreach (group; readText("input").splitter("\n\n")) { ... }
 Also, on other days, when the input is more uniform, there's 
 always https://dlang.org/library/std/file/slurp.html which 
 makes reading it in even easier, e.g. day02:

 alias Record = Tuple!(int, "low", int, "high", char, "needle", 
 string, "hay");
 auto input = slurp!Record("input", "%d-%d %s: %s");

 P.S.: would've loved to have had multiwayIntersection in the 
 stdlib for day06 part2, especially when there's already 
 multiwayUnion in setops. fold!setIntersection felt a bit clunky.
Oh my, all these things are new to me, haha, thanks a lot! I'll be looking into those (slurp & tuple).

By the way, is there a reason to use either 'splitter' or 'split'? I'm not sure I see why the difference would matter in the end.

Sidetangent, don't mean to bash the learning tour, as it's been really useful for getting started, but I'm surprised stuff like tuples and files aren't mentioned there. Especially since the documentation tends to trip me up, with stuff like 'isSomeString' mentioning 'built in string types', while I haven't been able to find that concept elsewhere, let alone the functionality one can expect in this case (like .length and the like), and stuff like 'countUntil' not being called 'indexOf', although that also exists and does basically the same thing.

Also, assumeUnique seems to be a thing?
Dec 27 2020
Rekel <paultjeadriaanse gmail.com> writes:
On Sunday, 27 December 2020 at 23:12:46 UTC, Rekel wrote:
 Sidetangent, don't mean to bash the learning tour, as it's been 
 really useful for getting started, but I'm surprised stuff like 
 tuples and files aren't mentioned there.
Update; Any clue why there's both "std.file" and "std.io.File"? I was mostly unaware of the former.
Dec 27 2020
Mike Parker <aldacron gmail.com> writes:
On Sunday, 27 December 2020 at 23:18:37 UTC, Rekel wrote:

 Update;
 Any clue why there's both "std.file" and "std.io.File"?
 I was mostly unaware of the former.
The very first paragraph at the top of the `std.file` documentation explains it:

"Functions in this module handle files as a unit, e.g., read or write one file at a time. For opening files and manipulating them via handles refer to module std.stdio."

https://dlang.org/phobos/std_file.html
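
The distinction quoted above can be illustrated with a small sketch. The file name and the helper countLinesViaHandle are hypothetical and exist only for this demonstration; the scratch file is created and removed by the program itself.

```d
import std.file : readText, remove, write;
import std.stdio : File, writeln;

/// Hypothetical helper: counts lines by streaming through a
/// std.stdio.File handle.
size_t countLinesViaHandle(string path)
{
    size_t count;
    foreach (line; File(path).byLine)
        ++count;
    return count;
}

void main()
{
    // Scratch file used only for this demonstration.
    enum path = "std_file_vs_stdio_demo.txt";
    write(path, "abc\n\na\nb\n");
    scope (exit) remove(path);

    // std.file treats the file as a unit: one call returns all contents.
    string whole = readText(path);
    writeln("characters read at once: ", whole.length);

    // std.stdio works through a handle, here one line at a time.
    writeln("lines read via handle: ", countLinesViaHandle(path));
}
```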
Dec 27 2020
Ali Çehreli <acehreli yahoo.com> writes:
On 12/27/20 3:12 PM, Rekel wrote:

 is there a reason to use
 either 'splitter' or 'split'? I'm not sure I see why the difference
 would matter in the end.
splitter() is a lazy range algorithm. split() is a range algorithm as well, but it is eager; it will put the results in an array that it grows. The string elements would not be copies of the original range; they will still be just the pair of .ptr and .length, but it can be expensive if there are a lot of parts. Further, if you want to process just a small number of the initial parts, then being eager would be wasteful.

Like all lazy range algorithms, splitter() is just an iteration object waiting to be used. It does not allocate any array but serves the parts one by one. You can filter the parts as you iterate over them, or you can stop at any point. For example, the following would take the first 3 non-empty lines:

    import std.stdio;
    import std.range;
    import std.algorithm;

    void main() {
        auto s = "hello\n\nworld\n\n\nand\nmoon";
        writefln!"%(%s, %)"(s.splitter('\n').filter!(part => !part.empty).take(3));
    }

 Sidetangent, don't mean to bash the learning tour, as it's been really
 useful for getting started, but I'm surprised stuff like tuples and
 files aren't mentioned there.
Alternative place to search: :) http://ddili.org/ders/d.en/ix.html
 Especially since the documentation tends to trip me up, with stuff like
 'isSomeString' mentioning 'built in string types', while I haven't been
 able to find that concept elsewhere,
Built in strings are just arrays of character types: char[], wchar[], and dchar[]. Commonly used by their respective immutable aliases: string, wstring, and dstring.
 'countUntil' not being called 'indexOf'
countUntil() is more general because it works with any range while indexOf requires a string.
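
A small sketch of that difference, using made-up sample values: countUntil is called on an int array, which indexOf would not accept, while indexOf is called on a string.

```d
import std.algorithm : countUntil;
import std.stdio : writeln;
import std.string : indexOf;

void main()
{
    // countUntil accepts any input range, for example an int array.
    int[] numbers = [10, 20, 30];
    writeln(numbers.countUntil(30));

    // indexOf is specific to strings.
    writeln("hello".indexOf('l'));

    // Both report -1 when nothing is found.
    writeln(numbers.countUntil(99), " ", "hello".indexOf('z'));
}
```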
 assumeUnique seems to be a thing?
That appears in the index I posted above as well. ;)

Ali
Dec 27 2020
oddp <oddp posteo.de> writes:
On 28.12.20 00:12, Rekel via Digitalmars-d-learn wrote:
 is there a reason to use either 'splitter' or 'split'?
split gives you a newly allocated array with the results; splitter is the lazy equivalent and doesn't allocate. Feel free to use either, it doesn't matter much with these small puzzle inputs.
 Sidetangent, don't mean to bash the learning tour, as it's been really useful
 for getting started, but I'm surprised stuff like tuples and files aren't
 mentioned there. Especially since the documentation tends to trip me up, with
 stuff like 'isSomeString' mentioning 'built in string types', while I haven't
 been able to find that concept elsewhere, let alone functionality one can
 expect in this case (like .length and the like), and stuff like 'countUntil'
 not being called 'indexOf', although it also exists and does basically the
 same thing. Also assumeUnique seems to be a thing?
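
To make the eager/lazy contrast concrete, here is a minimal sketch with a made-up input string. It uses split from std.array and splitter from std.algorithm.

```d
import std.algorithm : splitter;
import std.array : split;
import std.range : take;
import std.stdio : writeln;

void main()
{
    auto text = "a\n\nb\n\nc";

    // split is eager: all parts are found up front and returned as a string[].
    string[] eager = text.split("\n\n");
    writeln(eager.length);

    // splitter is lazy: parts are produced only as the loop asks for them,
    // so stopping after two parts never looks at the rest of the text.
    foreach (part; text.splitter("\n\n").take(2))
        writeln(part);
}
```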
Might be worth discussing that in a new topic. The stdlib is vast and has tons of useful utilities, not all of which can be explained in detail in a series of overview posts. Ali's "Programming in D" [1], which has a free online version, functions as an excellent in-depth introduction to the language, going over all the important topics.

Regarding function names and docs: Yes, some might seem slightly off coming from other languages (e.g. find vs. dropWhile, until vs. takeWhile, cumulativeFold vs. scan/accumulate, etc.), but it's all in there somewhere, implemented with the utmost care to not waste precious cycles. That might make it harder to grok when going over the implementation or docs for the very first time, but it gets easier after a while. Furthermore, alternative names are often mentioned in the docs, so a quick Google search should bring you to the right place.

[1] http://ddili.org/ders/d.en/index.html
Dec 27 2020
Rekel <paultjeadriaanse gmail.com> writes:
 http://ddili.org/ders/d.en/index.html
This seems very promising :) I doubt I'd still be considering D if it weren't for this awesome learning forum, thanks for all the help!
Dec 28 2020
Steven Schveighoffer <schveiguy gmail.com> writes:
On 12/26/20 7:13 PM, Rekel wrote:
 I'm trying to read a file with entries separated by '\n\n' (empty line), 
 with entries containing '\n'. I thought the 
 File.readLine(KeepTerminator, Terminator) might work, as it seems to 
 accept strings as terminators, since there seems to have been a thread 
 regarding '\r\n' separators.
 
 I don't know if there's some underlying reason, but when I try to use 
 "\n\n" as a terminator, I end up getting the entire file into 1 char[], 
 so it's not delimited.
 
 Should this work or is there a reason one cannot use byLine like this?
 
 For context, I'm trying this with the puzzle input of day 6 of this 
 year's advent of code. (https://adventofcode.com/)
Are you on Windows? If so, your double newlines might be \r\n\r\n, depending on what editor you used to create the input. Use a hexdump program to see what the newlines are in your input file.

Now, you would think that the underlying C stream would do this for you. I'm not sure how it works exactly, as I don't use Windows.

-Steve
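
As an alternative to an external hexdump tool, the check can be done from D itself. This is only a sketch: the helper names hasCarriageReturns and normalizeNewlines are hypothetical, the sample text is made up, and it applies to the read-the-whole-file approach, not to byLine with a custom terminator.

```d
import std.algorithm : canFind, splitter;
import std.array : array, replace;
import std.stdio : writeln;

/// Hypothetical helper: reports whether the text contains any '\r'.
bool hasCarriageReturns(string text)
{
    return text.canFind('\r');
}

/// Hypothetical helper: converts "\r\n" to "\n" so that "\n\n"
/// matches blank lines regardless of how the file was saved.
string normalizeNewlines(string text)
{
    return text.replace("\r\n", "\n");
}

void main()
{
    // Made-up sample standing in for a file saved with CRLF line endings.
    auto crlfText = "abc\r\n\r\na\r\nb\r\n";
    writeln("contains \\r: ", hasCarriageReturns(crlfText));

    auto groups = normalizeNewlines(crlfText).splitter("\n\n").array;
    writeln("groups found: ", groups.length);
}
```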
Dec 29 2020
Rekel <paultjeadriaanse gmail.com> writes:
On Tuesday, 29 December 2020 at 14:50:41 UTC, Steven 
Schveighoffer wrote:
 Are you on Windows? If so, your double newlines might be 
 \r\n\r\n, depending on what editor you used to create the 
 input. Use a hexdump program to see what the newlines are in 
 your input file.
I've tried \r\n\r\n as well, which sadly also did not work. Using vscode I have also switched between CRLF and LF, which did not do the trick either. I'm getting the sense the implementation might have a specific workaround for \r\n / CRLF line endings, though I haven't checked the source code yet.

Note that this is not really a problem for me specifically; I've long since used a different approach. However, it seemed like a design issue. I'll try replicating this in isolation later; maybe something was wrong last time I tried.
Dec 30 2020