digitalmars.D - Minor std.stdio.File.ByLine rant

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

I'm writing a CLI program that uses File.ByLine to read input commands,
with optional prompting (if run in interactive mode). One would imagine
that this should be a natural use for ByLine (perhaps not as common
nowadays with the rampant GUI fanboyism, but it still happens in some
niches), but it is fraught with peril.

First of all, the way ByLine works is kinda tricky, even in the previous
releases. The root cause is that, at least on Posix, the underlying
C feof() call doesn't actually tell you whether you're really at EOF
until you try to read something from the file descriptor. I know there
are good reasons for this, but this special case percolates up the standard
library code and causes a problem with D's input range primitives, where
.empty must tell the caller, right now, whether data is available,
*before* .front ever returns anything.
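
For reference, here is a minimal sketch of the consumer side of that protocol. The function name sum is made up for illustration; the point is only the order of calls: .empty is asked first, and .front and popFront are touched only after it returns false.

```d
import std.range; // brings in empty/front/popFront for built-in arrays

/// Consumes any input range of ints in the order the range protocol
/// requires: check .empty, then read .front, then call popFront.
int sum(R)(R r)
    if (isInputRange!R)
{
    int total = 0;
    for (; !r.empty; r.popFront())
        total += r.front; // only reached after .empty returned false
    return total;
}
```

For a range backed by a blocking stream, the !r.empty check is the first point at which the consumer can be made to wait.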

At one time, this problem was worked around by issuing a single fgetc on
the underlying file descriptor in ByLine's .empty method to determine
its EOF state, and then doing an ungetc to put the char back into the
stream.  However, this code is a rather ugly hack, and causes the
problem that when the interactive program needs to output a prompt
before blocking on input, it has to do so *before* it calls ByLine.empty
(since otherwise .empty blocks and the prompt doesn't get printed until
after the user has hit Enter -- clearly unacceptable for an interactive
shell program). If the stream turns out empty after all, then the prompt
is already output, and there's no way to take it back, so an extraneous
prompt is always written.

Understandably, the ungetc hack was subsequently removed from Phobos
and replaced by caching the next line the first time .empty was called,
which eliminated the ugliness of ungetc, and allowed existing code to
continue working as before.

Then recently, and also understandably, caching things in .empty was
frowned upon, so the caching was removed from .empty altogether and
pushed into the ByLine ctor. From the standpoint of Phobos code, this is
perhaps the ideal solution: the ctor reads the stream to get the first
line and simultaneously determine the EOF status of the stream, and
there is no need for ugly boolean state flags, ungetc ugliness, and
generally unpleasant code.

However, what happens is that now, ByLine will block on input *upon
construction*. This is rather unpleasant when your program needs to do
something like this:

	void main() {
		string prompt;
		...
		ByLine!char input;
		if (useStandardInput) {
			input = stdin.byLine();
		} else if (useScriptFile) {
			input = File(filename).byLine();
		}
		...
		if (mode == ProgramMode.modeA) { // mode is an enum
			runModeA(input);
		} else {
			runModeB(input);
		}
	}

	void runModeA(ByLine!char input) {
		write("modeA> ");	// display prompt
		while (!input.empty) {
			...
		}
	}

	void runModeB(ByLine!char input) {
		write("modeB> ");	// display prompt
		while (!input.empty) {
			...
		}
	}

The problem is, when input is initialized, we don't know what prompt to
use yet, but ByLine's ctor will already block when it tries to read from
stdin!

The current workaround I implemented is to use a wrapper around ByLine
that lazily constructs it when .empty is called.
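
The wrapper itself is not shown in the post, so the following is only a rough sketch of what such a wrapper might look like, not the actual code. The name LazyRange and the delegate-based factory are assumptions made for illustration. The underlying range is not built (and so cannot block) until one of the range primitives is first called.

```d
import std.range; // isInputRange, plus empty/front/popFront for arrays

/// Hypothetical wrapper: defers building the wrapped range until the
/// first call to a range primitive (normally .empty).
struct LazyRange(R)
    if (isInputRange!R)
{
    private R delegate() make; // factory supplied by the caller
    private R inner;           // meaningful only once `built` is true
    private bool built;

    this(R delegate() make) { this.make = make; }

    private void ensureBuilt()
    {
        if (!built) { inner = make(); built = true; }
    }

    @property bool empty() { ensureBuilt(); return inner.empty; }
    @property auto front() { ensureBuilt(); return inner.front; }
    void popFront() { ensureBuilt(); inner.popFront(); }
}
```

With something like LazyRange!(typeof(stdin.byLine()))(() => stdin.byLine()), the prompt could be written before the first .empty call triggers the blocking read. One caveat of this sketch: copying the wrapper before it is built gives each copy its own construction, so it is best passed around by ref or built only once.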

Who knew something so simple as an interactive prompting program that
reads input lines could turn into such a nightmare when ByLine is used?

:-(


T

-- 
What is Matter, what is Mind? Never Mind, it doesn't Matter.

Feb 26 2014

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Wednesday, 26 February 2014 at 23:45:48 UTC, H. S. Teoh wrote:
 The problem is, when input is initialized, we don't know what 
 prompt to
 use yet, but ByLine's ctor will already block when it tries to 
 read from
 stdin!

Ouch, I think I saw this coming... [1]

[1] https://github.com/D-Programming-Language/phobos/pull/1883

Feb 26 2014

"bearophile" <bearophileHUGS lycos.com> writes:

H. S. Teoh:

 I'm writing a CLI program that uses File.ByLine to read input 
 commands,

Isn't using readln() better for that? File.byLine is to read 
lines of files on disk.

Bye,
bearophile

Feb 26 2014

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Wednesday, 26 February 2014 at 23:59:09 UTC, bearophile wrote:
 H. S. Teoh:

 I'm writing a CLI program that uses File.ByLine to read input 
 commands,

 Isn't using readln() better for that? File.byLine is to read 
 lines of files on disk.

 Bye,
 bearophile

Says who? The type system and documentation only assert that it 
works on files, with no reservations about what kind of file. The 
standard input file is as fine a file as any.

Feb 26 2014

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Feb 26, 2014 at 11:59:07PM +0000, bearophile wrote:
 H. S. Teoh:
 
I'm writing a CLI program that uses File.ByLine to read input
commands,

 
 Isn't using readln() better for that? File.byLine is to read lines
 of files on disk.

[...]

Perhaps, but readln() isn't a range. The whole point was to use a
range-based API for the interpreter so that there's no need to write two
separate interfaces for the interpreter, one for stdin, one for a script
file stored on disk.


T

-- 
Today's society is one of specialization: as you grow, you learn more
and more about less and less. Eventually, you know everything about
nothing.

Feb 26 2014

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Thursday, 27 February 2014 at 00:07:47 UTC, H. S. Teoh wrote:
 On Wed, Feb 26, 2014 at 11:59:07PM +0000, bearophile wrote:
 H. S. Teoh:
 
I'm writing a CLI program that uses File.ByLine to read input
commands,

 
 Isn't using readln() better for that? File.byLine is to read 
 lines
 of files on disk.

 [...]

 Perhaps, but readln() isn't a range. The whole point was to use 
 a
 range-based API for the interpreter so that there's no need to 
 write two
 separate interfaces for the interpreter, one for stdin, one for 
 a script
 file stored on disk.


 T

Just write a function that accepts a std.stdio.File parameter?

Feb 26 2014

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Feb 27, 2014 at 12:16:21AM +0000, Jakob Ovrum wrote:
 On Thursday, 27 February 2014 at 00:07:47 UTC, H. S. Teoh wrote:
On Wed, Feb 26, 2014 at 11:59:07PM +0000, bearophile wrote:
H. S. Teoh:

I'm writing a CLI program that uses File.ByLine to read input
commands,

Isn't using readln() better for that? File.byLine is to read lines
of files on disk.

[...]

Perhaps, but readln() isn't a range. The whole point was to use a
range-based API for the interpreter so that there's no need to write
two separate interfaces for the interpreter, one for stdin, one for a
script file stored on disk.


T

 
 Just write a function that accepts a std.stdio.File parameter?

Unfortunately, I use string arrays in my unittests (to avoid having to
create separate unittest input files). So passing in File wouldn't work.
Besides, File isn't a range, so that kinda defeats the purpose (my
current hack of lazily constructing ByLine does work). I just find it
unfortunate that such hacks are necessary to get off the ground.
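
To illustrate the point about string arrays, here is a minimal sketch of an interpreter entry point that is templated on the range type. The name runCommands and its trivial body are hypothetical; the real interpreter is not shown in this thread.

```d
import std.range; // isInputRange, plus range primitives for arrays

/// Hypothetical interpreter loop: accepts any input range of lines, so
/// the same code can be fed stdin.byLine, a script file's byLine, or a
/// plain string[] in a unittest.
string[] runCommands(R)(R lines)
    if (isInputRange!R)
{
    string[] seen;
    foreach (line; lines)
        seen ~= line.idup; // copy: byLine reuses its buffer between lines
    return seen;
}
```

A unittest can then call runCommands(["help", "quit"]) without creating any input file.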


T

-- 
It won't be covered in the book. The source code has to be useful for
something, after all. -- Larry Wall

Feb 26 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Wed, 26 Feb 2014 18:44:10 -0500, H. S. Teoh <hsteoh quickfur.ath.cx>  
wrote:

 First of all, the way ByLine works is kinda tricky, even in the previous
 releases. The underlying cause is that at least on Posix, the underlying
 C feof() call doesn't actually tell you whether you're really at EOF
 until you try to read something from the file descriptor.

This is not a posix problem, it's a general stream problem.

A stream is not at EOF until the write end is closed. Until then, you  
cannot know whether it's empty until you read and don't get anything back.  
Even if a primitive existed that allowed you to tell whether the write end  
was closed, you can race this against the other process closing its write  
end.

I think the correct solution is to block on the first front call. We may  
be able to do this without storing an additional variable.

-Steve

Feb 27 2014

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Feb 27, 2014 at 07:55:59AM -0500, Steven Schveighoffer wrote:
 On Wed, 26 Feb 2014 18:44:10 -0500, H. S. Teoh
 <hsteoh quickfur.ath.cx> wrote:
 
First of all, the way ByLine works is kinda tricky, even in the
previous releases. The underlying cause is that at least on Posix,
the underlying C feof() call doesn't actually tell you whether you're
really at EOF until you try to read something from the file
descriptor.

 
 This is not a posix problem, it's a general stream problem.
 
 A stream is not at EOF until the write end is closed. Until then,
 you cannot know whether it's empty until you read and don't get
 anything back. Even if a primitive existed that allowed you to tell
 whether the write end was closed, you can race this against the
 other process closing its write end.
 
 I think the correct solution is to block on the first front call. We
 may be able to do this without storing an additional variable.

[...]

Unfortunately, you can't. Since Phobos can't know whether the file
(which may be a network socket, say) is at EOF without first blocking on
read, it won't be able to return the correct value from .empty, and
according to the range API, it's invalid to access .front unless .empty
returns false. So this solution doesn't work. :-(


T

-- 
All men are mortal. Socrates is mortal. Therefore all men are Socrates.

Feb 27 2014

"Sean Kelly" <sean invisibleduck.org> writes:

Are the peek routines standard?  I'm on my phone so I can't 
easily check right now. Barring that, there's an ioctl call that 
can tell whether data is available, though I'm not sure offhand 
what the result would be for a file if you haven't read anything 
yet.

Feb 27 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 27 Feb 2014 11:22:45 -0500, Sean Kelly <sean invisibleduck.org>  
wrote:

 Are the peek routines standard?  I'm on my phone so I can't easily check  
 right now. Barring that, there's an ioctl call that can tell whether  
 data is available, though I'm not sure offhand what the result would be  
 for a file if you haven't read anything yet.

Peek doesn't help. You can't, in a non-blocking way, tell if input will be  
forthcoming without actually receiving the input.

-Steve

Feb 27 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 27 Feb 2014 10:04:47 -0500, H. S. Teoh <hsteoh quickfur.ath.cx>  
wrote:

 On Thu, Feb 27, 2014 at 07:55:59AM -0500, Steven Schveighoffer wrote:
 On Wed, 26 Feb 2014 18:44:10 -0500, H. S. Teoh
 <hsteoh quickfur.ath.cx> wrote:

First of all, the way ByLine works is kinda tricky, even in the
previous releases. The underlying cause is that at least on Posix,
the underlying C feof() call doesn't actually tell you whether you're
really at EOF until you try to read something from the file
descriptor.

 This is not a posix problem, it's a general stream problem.

 A stream is not at EOF until the write end is closed. Until then,
 you cannot know whether it's empty until you read and don't get
 anything back. Even if a primitive existed that allowed you to tell
 whether the write end was closed, you can race this against the
 other process closing its write end.

 I think the correct solution is to block on the first front call. We
 may be able to do this without storing an additional variable.

 [...]

 Unfortunately, you can't. Since Phobos can't know whether the file
 (which may be a network socket, say) is at EOF without first blocking on
 read, it won't be able to return the correct value from .empty, and
 according to the range API, it's invalid to access .front unless .empty
 returns false. So this solution doesn't work. :-(

Yes, you are right!

Thinking about it, the only correct solution is to do what it already does  
-- establish the first line on construction. empty cannot depend on front,  
and doing something different on the first empty vs. every other one makes  
the range bloated and confusing.

The issue really is, to treat the construction and popFront as blocking.  
Streams are a tricky business indeed. I think your solution is the only  
valid one. Unfortunate that you have to do this.

An interesting general solution is to use a delegate to generate the  
range, giving an easy one-line construction without having to make a  
wrapper range that lazily constructs on empty, but just using a delegate  
name does not call it. I did come up with this:

import std.stdio;
import std.range;

void foo(R)(R r)
{
     static if(isInputRange!R)
     {
         alias _r = r;
     }
     else // if is a no-arg delegate that returns an input range (too lazy to figure this out :)
     {
         auto _r(){return r();}
     }

     foreach(x; _r)
     {
         writeln(x);
     }
}
void main()
{
     foo(() => stdin.byLine);
     foo([1,2,3]);
}

The static if at the beginning is awkward, but just allows the rest of the  
code to be identical whether you call with a delegate or a range.

-Steve

Feb 27 2014

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Feb 27, 2014 at 11:26:42AM -0500, Steven Schveighoffer wrote:
 On Thu, 27 Feb 2014 10:04:47 -0500, H. S. Teoh
 <hsteoh quickfur.ath.cx> wrote:
 
On Thu, Feb 27, 2014 at 07:55:59AM -0500, Steven Schveighoffer wrote:
On Wed, 26 Feb 2014 18:44:10 -0500, H. S. Teoh
<hsteoh quickfur.ath.cx> wrote:

First of all, the way ByLine works is kinda tricky, even in the
previous releases. The underlying cause is that at least on Posix,
the underlying C feof() call doesn't actually tell you whether
you're really at EOF until you try to read something from the file
descriptor.

This is not a posix problem, it's a general stream problem.

A stream is not at EOF until the write end is closed. Until then,
you cannot know whether it's empty until you read and don't get
anything back. Even if a primitive existed that allowed you to tell
whether the write end was closed, you can race this against the
other process closing its write end.

I think the correct solution is to block on the first front call. We
may be able to do this without storing an additional variable.

[...]

Unfortunately, you can't. Since Phobos can't know whether the file
(which may be a network socket, say) is at EOF without first blocking
on read, it won't be able to return the correct value from .empty,
and according to the range API, it's invalid to access .front unless
.empty returns false. So this solution doesn't work. :-(

 
 Yes, you are right!
 
 Thinking about it, the only correct solution is to do what it
 already does -- establish the first line on construction. empty
 cannot depend on front, and doing something different on the first
 empty vs. every other one makes the range bloated and confusing.
 
 The issue really is, to treat the construction and popFront as
 blocking. Streams are a tricky business indeed. I think your
 solution is the only valid one. Unfortunate that you have to do
 this.
 
 An interesting general solution is to use a delegate to generate the
 range, giving an easy one-line construction without having to make a
 wrapper range that lazily constructs on empty, but just using a
 delegate name does not call it. I did come up with this:

Actually, now that I think about it, can't we just make ByLine lazily
constructed? It's already a wrapper around ByLineImpl anyway (since it's
being refcounted), so why not just make the wrapper create ByLineImpl
only when you actually attempt to use it? That would solve the problem:
you can call ByLine but it won't block until ByLineImpl is actually
created, which is the first time you call ByLine.empty.


T

-- 
Don't drink and derive. Alcohol and algebra don't mix.

Feb 27 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 27 Feb 2014 12:32:44 -0500, H. S. Teoh <hsteoh quickfur.ath.cx>  
wrote:

 Actually, now that I think about it, can't we just make ByLine lazily
 constructed? It's already a wrapper around ByLineImpl anyway (since it's
 being refcounted), so why not just make the wrapper create ByLineImpl
 only when you actually attempt to use it? That would solve the problem:
 you can call ByLine but it won't block until ByLineImpl is actually
 created, which is the first time you call ByLine.empty.

I think this isn't any different than making ByLine.empty cache the first  
line.

My solution is basically this:

struct LazyConstructedRange(R)
{
   R r;
   bool isConstructed = false;
   R delegate() _ctor;

   this(R delegate() ctor) {_ctor = ctor;}

   ref R get() {
     if(!isConstructed) { r = _ctor(); isConstructed = true;}
     return r;
   }

   alias get this;
}

Basically, we're not constructing on first call to empty, but first call  
to *anything*. Actually, this kind of a solution would be better than what  
I came up with, because the object itself is a range instead of a delegate  
(satisfies, for instance, isInputRange and isIterable, whereas the  
delegate does not), and you don't need the static if like I wrote. Any  
additional usage of the delegate in my original solution creates a copy of  
the range, but the above would only construct it once.

-Steve

Feb 27 2014

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Feb 27, 2014 at 01:47:49PM -0500, Steven Schveighoffer wrote:
 On Thu, 27 Feb 2014 12:32:44 -0500, H. S. Teoh
 <hsteoh quickfur.ath.cx> wrote:
 
Actually, now that I think about it, can't we just make ByLine lazily
constructed? It's already a wrapper around ByLineImpl anyway (since
it's being refcounted), so why not just make the wrapper create
ByLineImpl only when you actually attempt to use it? That would solve
the problem: you can call ByLine but it won't block until ByLineImpl
is actually created, which is the first time you call ByLine.empty.

 
 I think this isn't any different than making ByLine.empty cache the
 first line.
 
 My solution is basically this:
 
 struct LazyConstructedRange(R)
 {
   R r;
   bool isConstructed = false;
   R delegate() _ctor;
 
   this(R delegate() ctor) {_ctor = ctor;}
 
   ref R get() {
     if(!isConstructed) { r = _ctor(); isConstructed = true;}
     return r;
   }
 
   alias get this;
 }
 
 Basically, we're not constructing on first call to empty, but first
 call to *anything*.

[...]

According to a strict interpretation of the range API, it is invalid to
call any range method before you call .empty, because if the range turns
out to be empty, calling .front or .popFront is undefined. So it is
sufficient to implement lazy construction for .empty alone. All other
cases *should* break anyway. :)

Once you have that, then what you're proposing is no different from
mine, in essence.


T

-- 
By understanding a machine-oriented language, the programmer will tend
to use a much more efficient method; it is much closer to reality. -- D.
Knuth

Feb 28 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 28 Feb 2014 16:12:26 -0500, H. S. Teoh <hsteoh quickfur.ath.cx>  
wrote:

 According to a strict interpretation of the range API, it is invalid to
 call any range method before you call .empty, because if the range turns
 out to be empty, calling .front or .popFront is undefined. So it is
 sufficient to implement lazy construction for .empty alone. All other
 cases *should* break anyway. :)

 Once you have that, then what you're proposing is no different from
 mine, in essence.

Yes, this is true. So, one can specifically define empty() and let the  
rest go to alias this.

I think such a range would be a good phobos addition.
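
A rough sketch of that combination, under the assumption that callers always check .empty first. The names LazyOnEmpty and Countdown are made up for illustration, and nothing like this is claimed to exist in Phobos.

```d
/// Hypothetical sketch of "define empty(), let the rest go to alias this".
struct LazyOnEmpty(R)
{
    private R delegate() make; // factory; not invoked until first .empty
    private bool built;
    R inner;                   // stays R.init until built

    this(R delegate() make) { this.make = make; }

    @property bool empty()
    {
        if (!built) { inner = make(); built = true; }
        return inner.empty;
    }

    alias inner this; // front, popFront and everything else go to `inner`
}

/// Tiny member-function range used only to exercise the sketch.
struct Countdown
{
    int n;
    @property bool empty() const { return n == 0; }
    @property int front() const { return n; }
    void popFront() { --n; }
}
```

Calling .front or popFront before .empty would reach the unbuilt inner range here, which is consistent with the strict reading of the range API discussed above.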

-Steve

Feb 28 2014
