digitalmars.D - persistent byLine

Nick Treleaven (25/25) Jul 22 2013 Hi,

Brad Anderson (7/35) Jul 22 2013 I remember seeing Walter's component programming example and
Jonathan M Davis (15/43) Jul 22 2013 I agree with monarch in that we really shouldn't try and mess with byLin...

monarch_dodra (9/45) Jul 22 2013 It is also inefficient, if you pipe it to an algorithm that calls

bearophile (5/7) Jul 23 2013 See also (three years old discussion):

Nick Treleaven (24/77) Jul 23 2013 I think I'll close that PR then. I reiterate that the

Nick Treleaven (7/51) Jul 24 2013 Although, the caller might not want all the file contents to hang around...

Jonathan M Davis (8/10) Jul 24 2013 It is unknown when std.io will be ready for review, and the person worki...

Nick Treleaven (2/11) Jul 26 2013 Thanks for the info.

H. S. Teoh (28/39) Jul 24 2013 [...]

Brad Anderson (9/70) Jul 24 2013 https://github.com/schveiguy/phobos/blob/new-io/std/io.d

Nick Treleaven (3/4) Jul 27 2013 Thanks, now added to the wiki:

Jakob Ovrum (11/19) Jul 24 2013 std.regex.splitter can handle newlines more flexibly, e.g.:

Peter Alexander (2/3) Jul 24 2013 That will swallow empty lines.

Jakob Ovrum (5/8) Jul 24 2013 Yeah, it's just an example. The specific pattern obviously

Nick Treleaven (6/14) Jul 26 2013 Nice idea, I wasn't aware of regex splitter. I'll try using:

Nick Treleaven <ntrel-public yahoo.co.uk> writes:

Hi,
I remember an example in some slides by Walter which had this snippet 
(slightly simplified):

stdin.byLine.map!(l => l.idup).array

Someone commented in a reddit post that the idup part was not intuitive 
(can't find the link now, sorry).

I made a pull request to re-enable using byLine!(char, immutable char). 
(Note this compiled in the current release, but didn't work properly 
AFAICT. It did work by commit 97cec33^).

https://github.com/D-Programming-Language/phobos/pull/1418

Using that allows us to drop the map!(l => l.idup) part from the above 
snippet. The new syntax isn't much better, but it can also be more 
efficient (as it caches front). I have an idea how to improve the 
syntax, but I'll omit it for this post.

I've since thought that if most or all lines in a file need to be 
persistent, it may be more efficient to use 
readText(filename).splitLines, because that doesn't need to allocate for 
each line.

There are two enhancements for that approach:
1. readText should accept a File, not just a filename, so we can use stdin.
2. splitLines makes an array. It would be more flexible to have an input 
range created from a function e.g. lineSplitter.

With these enhancements, we could change the byLine docs to recommend 
using File.readText.lineSplitter if most lines need to be persistent.

Thoughts?

Jul 22 2013

"Brad Anderson" <eco gnuk.net> writes:

On Monday, 22 July 2013 at 12:08:06 UTC, Nick Treleaven wrote:
 Hi,
 I remember an example in some slides by Walter which had this 
 snippet (slightly simplified):

 stdin.byLine.map!(l => l.idup).array

 Someone commented in a reddit post that the idup part was not 
 intuitive (can't find the link now, sorry).

 I made a pull request to re-enable using byLine!(char, 
 immutable char). (Note this compiled in the current release, 
 but didn't work properly AFAICT. It did work by commit 
 97cec33^).

 https://github.com/D-Programming-Language/phobos/pull/1418

 Using that allows us to drop the map!(l => l.idup) part from 
 the above snippet. The new syntax isn't much better, but it can 
 also be more efficient (as it caches front). I have an idea how 
 to improve the syntax, but I'll omit it for this post.

 I've since thought that if most or all lines in a file need to 
 be persistent, it may be more efficient to use 
 readText(filename).splitLines, because that doesn't need to 
 allocate for each line.

 There are two enhancements for that approach:
 1. readText should accept a File, not just a filename, so we 
 can use stdin.
 2. splitLines makes an array. It would be more flexible to have 
 an input range created from a function e.g. lineSplitter.

 With these enhancements, we could change the byLine docs to 
 recommend using File.readText.lineSplitter if most lines need 
 to be persistent.

 Thoughts?

I remember seeing Walter's component programming example and 
feeling that the idup stuck out like a sore thumb.

I like the idea. Makes simple programs even simpler without 
sacrificing performance by changing byLines.  Ideally readText 
would take generic streams rather than Files but that will have 
to wait until std.io gets finished.

Jul 22 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Monday, July 22, 2013 13:08:05 Nick Treleaven wrote:
 Hi,
 I remember an example in some slides by Walter which had this snippet
 (slightly simplified):
 
 stdin.byLine.map!(l => l.idup).array
 
 Someone commented in a reddit post that the idup part was not intuitive
 (can't find the link now, sorry).
 
 I made a pull request to re-enable using byLine!(char, immutable char).
 (Note this compiled in the current release, but didn't work properly
 AFAICT. It did work by commit 97cec33^).
 
 https://github.com/D-Programming-Language/phobos/pull/1418
 
 Using that allows us to drop the map!(l => l.idup) part from the above
 snippet. The new syntax isn't much better, but it can also be more
 efficient (as it caches front). I have an idea how to improve the
 syntax, but I'll omit it for this post.

I agree with monarch in that we really shouldn't try and mess with byLine like 
this. It would just be cleaner to come up with a new function for this, though 
I confess that part of me thinks that it's better to just use map!(a => 
a.idup)(), because it avoids duplicating functionality. It is arguably a bit 
ugly though.

 I've since thought that if most or all lines in a file need to be
 persistent, it may be more efficient to use
 readText(filename).splitLines, because that doesn't need to allocate for
 each line.
 
 There are two enhancements for that approach:
 1. readText should accept a File, not just a filename, so we can use stdin.

I'm opposed to this. I don't think that std.file should be using std.stdio.File 
at all. What we really need is for std.io to be finished, which will revamp 
std.stdio and give us streams and the like. And std.file really is not designed 
around using stdin - and shouldn't be IMHO. It's for operating on actual files.

 2. splitLines makes an array. It would be more flexible to have an input
 range created from a function e.g. lineSplitter.

splitter will do the job just fine as long as you don't care about /r/n - 
though we should arguably come up with a solution that works with /r/n 
(assuming that something in std.range or std.algorithm doesn't already do it, 
and I'm just not thinking of it at the moment).

- Jonathan M Davis

Jul 22 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Monday, 22 July 2013 at 23:28:46 UTC, Jonathan M Davis wrote:
 On Monday, July 22, 2013 13:08:05 Nick Treleaven wrote:
 Hi,
 I remember an example in some slides by Walter which had this 
 snippet
 (slightly simplified):
 
 stdin.byLine.map!(l => l.idup).array
 
 Someone commented in a reddit post that the idup part was not 
 intuitive
 (can't find the link now, sorry).
 
 I made a pull request to re-enable using byLine!(char, 
 immutable char).
 (Note this compiled in the current release, but didn't work 
 properly
 AFAICT. It did work by commit 97cec33^).
 
 https://github.com/D-Programming-Language/phobos/pull/1418
 
 Using that allows us to drop the map!(l => l.idup) part from 
 the above
 snippet. The new syntax isn't much better, but it can also be 
 more
 efficient (as it caches front). I have an idea how to improve 
 the
 syntax, but I'll omit it for this post.

 I agree with monarch in that we really shouldn't try and mess 
 with byLine like
 this. It would just be cleaner to come up with a new function 
 for this, though
 I confess that part of me thinks that it's better to just use 
 map!(a =>
 a.idup)(), because it avoids duplicating functionality. It is 
 arguably a bit
 ugly though.

It is also inefficient, if you pipe it to an algorithm that calls 
the same front more than once, you'll end up duping more than 
once. It could be mitigated with my proposed "cache" adaptor, but 
then, we'd have:
file.byLine().map!(a => a.idup)().cache();

It works, but it is horrible to look at.

I'm rather in favor of a named "byDupLine", which is self 
documenting, easy to use, and new/casual programmer friendly.

Jul 22 2013

"bearophile" <bearophileHUGS lycos.com> writes:

monarch_dodra:

 I'm rather in favor of a named "byDupLine", which is self 
 documenting, easy to use, and new/casual programmer friendly.

See also (three years old discussion):
http://d.puremagic.com/issues/show_bug.cgi?id=4474

Bye,
bearophile

Jul 23 2013

"Nick Treleaven" <ntrel-public yahoo.co.uk> writes:

(resending from the forum, original didn't arrive for some reason)

On 23/07/2013 00:28, Jonathan M Davis wrote:
 On Monday, July 22, 2013 13:08:05 Nick Treleaven wrote:
 I made a pull request to re-enable using byLine!(char, 
 immutable char).
 (Note this compiled in the current release, but didn't work 
 properly
 AFAICT. It did work by commit 97cec33^).

 https://github.com/D-Programming-Language/phobos/pull/1418

 Using that allows us to drop the map!(l => l.idup) part from 
 the above
 snippet. The new syntax isn't much better, but it can also be 
 more
 efficient (as it caches front). I have an idea how to improve 
 the
 syntax, but I'll omit it for this post.

 I agree with monarch in that we really shouldn't try and mess 
 with byLine like
 this. It would just be cleaner to come up with a new function 
 for this, though
 I confess that part of me thinks that it's better to just use 
 map!(a =>
 a.idup)(), because it avoids duplicating functionality. It is 
 arguably a bit
 ugly though.

I think I'll close that PR then. I reiterate that the 
readText.splitter approach is perhaps usually more efficient than 
either byLine/map/idup or byLine!(char, immutable char). Unless 
e.g. byLineDup was implemented so it allocated more than one line 
at once.

 I've since thought that if most or all lines in a file need to 
 be
 persistent, it may be more efficient to use
 readText(filename).splitLines, because that doesn't need to 
 allocate for
 each line.

 There are two enhancements for that approach:
 1. readText should accept a File, not just a filename, so we 
 can use stdin.

 I'm opposed to this. I don't think that std.file should be 
 using std.stdio.File
 at all. What we really need is for std.io to be finished, which 
 will revamp
 std.stdio and give us streams and the like. And std.file really 
 is not designed
 around using stdin - and shouldn't be IMHO. It's for operating 
 on actual files.

Yes, I meant add std.stdio.File.readText. Would that be OK to add 
now, or is std.io likely to be added relatively soon?

 2. splitLines makes an array. It would be more flexible to 
 have an input
 range created from a function e.g. lineSplitter.

 splitter will do the job just fine as long as you don't care 
 about /r/n -

I currently use Windows ;-)

 though we should arguably come up with a solution that works 
 with /r/n
 (assuming that something in std.range or std.algorithm doesn't 
 already do it,
 and I'm just not thinking of it at the moment).

splitter does work:
"splitting\r\nlines\r\nworks\r\n!".splitter("\r\n").writeln();

I can live with that. However...

The nice thing about splitLines is that it doesn't care what kind 
of line endings are in the file, i.e. you don't need to tell it 
in advance. You don't have to pass it std.ascii.newline as a 
separator in order to get portability.

Portability should be the default. That's what I intend for 
lineSplitter, which IMO would be better than the readln/byLine 
specific terminator approach. It would handle all text files 
portably, even ones that don't have the official system line 
ending chars, by default.

lineSplitter would be useful in other ways, e.g. for counting 
lines in a string.

Jul 23 2013

Nick Treleaven <ntrel-public yahoo.co.uk> writes:

On 23/07/2013 17:13, Nick Treleaven wrote:
 On 23/07/2013 00:28, Jonathan M Davis wrote:
 On Monday, July 22, 2013 13:08:05 Nick Treleaven wrote:
 I made a pull request to re-enable using byLine!(char, immutable char).
 (Note this compiled in the current release, but didn't work properly
 AFAICT. It did work by commit 97cec33^).

 https://github.com/D-Programming-Language/phobos/pull/1418

 Using that allows us to drop the map!(l => l.idup) part from the above
 snippet. The new syntax isn't much better, but it can also be more
 efficient (as it caches front). I have an idea how to improve the
 syntax, but I'll omit it for this post.

 I agree with monarch in that we really shouldn't try and mess with
 byLine like
 this. It would just be cleaner to come up with a new function for
 this, though
 I confess that part of me thinks that it's better to just use map!(a =>
 a.idup)(), because it avoids duplicating functionality. It is arguably
 a bit
 ugly though.

 I think I'll close that PR then. I reiterate that the readText.splitter
 approach is perhaps usually more efficient than either byLine/map/idup
 or byLine!(char, immutable char). Unless e.g. byLineDup was implemented
 so it allocated more than one line at once.

Although, the caller might not want all the file contents to hang around 
just because one line was needed, so I think byLineDup probably is 
needed too. It's good to have a range of options.

I might make a PR for byLineDup sometime, unless anyone beats me to it.

 I've since thought that if most or all lines in a file need to be
 persistent, it may be more efficient to use
 readText(filename).splitLines, because that doesn't need to allocate for
 each line.

 There are two enhancements for that approach:
 1. readText should accept a File, not just a filename, so we can use
 stdin.

 I'm opposed to this. I don't think that std.file should be using
 std.stdio.File
 at all. What we really need is for std.io to be finished, which will
 revamp
 std.stdio and give us streams and the like. And std.file really is not
 designed
 around using stdin - and shouldn't be IMHO. It's for operating on
 actual files.

 Yes, I meant add std.stdio.File.readText. Would that be OK to add now,
 or is std.io likely to be added relatively soon?

Also is there any info on std.io? I assume it's still worthwhile to 
update std.stdio docs and fix issues?

Jul 24 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, July 24, 2013 17:45:53 Nick Treleaven wrote:
 Also is there any info on std.io? I assume it's still worthwhile to
 update std.stdio docs and fix issues?

It is unknown when std.io will be ready for review, and the person working on 
it is quite busy. We certainly want to continue to fix issues in std.stdio. 
What is more questionable is adding a lot of functionality to it. But bug fixes 
and the like should be fine, and I'm sure that some new functionality would be 
fine as well. We definitely don't want to make major changes to it though,
since 
we intend to replace it.

- Jonathan M Davis

Jul 24 2013

Nick Treleaven <ntrel-public yahoo.co.uk> writes:

On 25/07/2013 01:17, Jonathan M Davis wrote:
 On Wednesday, July 24, 2013 17:45:53 Nick Treleaven wrote:
 Also is there any info on std.io? I assume it's still worthwhile to
 update std.stdio docs and fix issues?

 It is unknown when std.io will be ready for review, and the person working on
 it is quite busy. We certainly want to continue to fix issues in std.stdio.
 What is more questionable is adding a lot of functionality to it. But bug fixes
 and the like should be fine, and I'm sure that some new functionality would be
 fine as well. We definitely don't want to make major changes to it though,
since
 we intend to replace it.

Thanks for the info.

Jul 26 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Jul 24, 2013 at 08:17:08PM -0400, Jonathan M Davis wrote:
 On Wednesday, July 24, 2013 17:45:53 Nick Treleaven wrote:
 Also is there any info on std.io? I assume it's still worthwhile to
 update std.stdio docs and fix issues?

 
 It is unknown when std.io will be ready for review, and the person
 working on it is quite busy. We certainly want to continue to fix
 issues in std.stdio.  What is more questionable is adding a lot of
 functionality to it. But bug fixes and the like should be fine, and
 I'm sure that some new functionality would be fine as well. We
 definitely don't want to make major changes to it though, since 
 we intend to replace it.

[...]

This makes me wonder if we should adopt a convention of using a public
fork/branch on GitHub when working on a new feature. That way, if the
person who started it can't work on it for whatever reason (too busy
IRL, working on other things, etc.), somebody else can step up and
continue the work, instead of the work being stalled without any
definite target completion time.

With GitHub, something like this is not only practical, but probably
also beneficial. For example, recently, I started working on a D port of
gdc's gdmd wrapper. After I got the initial functionality working, I got
busy with other stuff, but thanks to the code being available on github,
somebody else could step in and contribute pull requests. So development
continues in spite of me being unable to work on it at the moment.  And
sometimes, receiving pull requests on your branch is a big motivation
for you to get cracking on the code again. :)

For something as crucial as std.io (or other key Phobos components, in
general), I think this kind of approach is probably necessary. It
doesn't help our public image at all if people take a look at our
situation and say, "what's up with these D people, why is there only one
person working on key feature X and no progress is made because that
person is too busy". To improve our professionalism, this situation
ought to be improved.

Not to mention, like this also helps improve our bus factor when it
comes to key pieces of new functionality.


T

-- 
Spaghetti code may be tangly, but lasagna code is just cheesy.

Jul 24 2013

"Brad Anderson" <eco gnuk.net> writes:

On Thursday, 25 July 2013 at 01:57:45 UTC, H. S. Teoh wrote:
 On Wed, Jul 24, 2013 at 08:17:08PM -0400, Jonathan M Davis 
 wrote:
 On Wednesday, July 24, 2013 17:45:53 Nick Treleaven wrote:
 Also is there any info on std.io? I assume it's still 
 worthwhile to
 update std.stdio docs and fix issues?

 
 It is unknown when std.io will be ready for review, and the 
 person
 working on it is quite busy. We certainly want to continue to 
 fix
 issues in std.stdio.  What is more questionable is adding a 
 lot of
 functionality to it. But bug fixes and the like should be 
 fine, and
 I'm sure that some new functionality would be fine as well. We
 definitely don't want to make major changes to it though, 
 since we intend to replace it.

 [...]

 This makes me wonder if we should adopt a convention of using a 
 public
 fork/branch on GitHub when working on a new feature. That way, 
 if the
 person who started it can't work on it for whatever reason (too 
 busy
 IRL, working on other things, etc.), somebody else can step up 
 and
 continue the work, instead of the work being stalled without any
 definite target completion time.

 With GitHub, something like this is not only practical, but 
 probably
 also beneficial. For example, recently, I started working on a 
 D port of
 gdc's gdmd wrapper. After I got the initial functionality 
 working, I got
 busy with other stuff, but thanks to the code being available 
 on github,
 somebody else could step in and contribute pull requests. So 
 development
 continues in spite of me being unable to work on it at the 
 moment.  And
 sometimes, receiving pull requests on your branch is a big 
 motivation
 for you to get cracking on the code again. :)

 For something as crucial as std.io (or other key Phobos 
 components, in
 general), I think this kind of approach is probably necessary. 
 It
 doesn't help our public image at all if people take a look at 
 our
 situation and say, "what's up with these D people, why is there 
 only one
 person working on key feature X and no progress is made because 
 that
 person is too busy". To improve our professionalism, this 
 situation
 ought to be improved.

 Not to mention, like this also helps improve our bus factor 
 when it
 comes to key pieces of new functionality.


 T

https://github.com/schveiguy/phobos/blob/new-io/std/io.d

(which I amusingly found the URL of by Googling "std.io github 
site:dlang.org" and found a reply I made to you [1] with this URL 
back in February).

I wonder if Steve could be convinced to write up a TODO for 
std.io of where he left off.

1. 
http://forum.dlang.org/post/sbmntkiwhnmpcrfypvpw forum.dlang.org

Jul 24 2013

Nick Treleaven <ntrel-public yahoo.co.uk> writes:

On 25/07/2013 03:22, Brad Anderson wrote:
 https://github.com/schveiguy/phobos/blob/new-io/std/io.d

Thanks, now added to the wiki:
http://wiki.dlang.org/Review_Queue#Current_Review_Queue

Jul 27 2013

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Monday, 22 July 2013 at 23:28:46 UTC, Jonathan M Davis wrote:
 splitter will do the job just fine as long as you don't care 
 about /r/n -
 though we should arguably come up with a solution that works 
 with /r/n
 (assuming that something in std.range or std.algorithm doesn't 
 already do it,
 and I'm just not thinking of it at the moment).

 - Jonathan M Davis

std.regex.splitter can handle newlines more flexibly, e.g.:

void main()
{
	import std.algorithm : equal;
	import std.regex : ctRegex, splitter;

	auto text = "one\ntwo\r\nthree";

	auto newlinePattern = ctRegex!"[\r\n]+";

	assert(text.splitter(newlinePattern).equal(["one", "two", 
"three"]));
}

Jul 24 2013

"Peter Alexander" <peter.alexander.au gmail.com> writes:

On Wednesday, 24 July 2013 at 17:13:10 UTC, Jakob Ovrum wrote:
 	auto newlinePattern = ctRegex!"[\r\n]+";

That will swallow empty lines.

Jul 24 2013

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Wednesday, 24 July 2013 at 17:26:58 UTC, Peter Alexander wrote:
 On Wednesday, 24 July 2013 at 17:13:10 UTC, Jakob Ovrum wrote:
 	auto newlinePattern = ctRegex!"[\r\n]+";

 That will swallow empty lines.

Yeah, it's just an example. The specific pattern obviously 
depends on the exact behaviour you want, but I think any desired 
behaviour regarding newlines can be trivially expressed with 
regex.

Jul 24 2013

Nick Treleaven <ntrel-public yahoo.co.uk> writes:

On 24/07/2013 18:53, Jakob Ovrum wrote:
 On Wednesday, 24 July 2013 at 17:26:58 UTC, Peter Alexander wrote:
 On Wednesday, 24 July 2013 at 17:13:10 UTC, Jakob Ovrum wrote:
     auto newlinePattern = ctRegex!"[\r\n]+";

 That will swallow empty lines.

 Yeah, it's just an example. The specific pattern obviously depends on
 the exact behaviour you want, but I think any desired behaviour
 regarding newlines can be trivially expressed with regex.

Nice idea, I wasn't aware of regex splitter. I'll try using:

auto newlinePattern = ctRegex!"\r\n|\r|\n";

In fact, if we add unicode line & paragraph separators to the pattern 
(like std.string.splitLines), that literal might be worthy of adding to 
std.regex.

Jul 26 2013

D Programming

C/C++ Programming

Other

digitalmars.D - persistent byLine