
digitalmars.D - Minor std.stdio.File.ByLine rant

reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
I'm writing a CLI program that uses File.ByLine to read input commands,
with optional prompting (if run in interactive mode). One would imagine
that this should be a natural use for ByLine (perhaps not as common
nowadays with the rampant GUI fanboyism, but it still happens in some
niches), but it is fraught with peril.

First of all, the way ByLine works is kinda tricky, even in the previous
releases. The underlying cause is that at least on Posix, the underlying
C feof() call doesn't actually tell you whether you're really at EOF
until you try to read something from the file descriptor. I know there
are good reasons for this, but this quirk percolates up the standard
library code and causes a problem with D's input range primitives, where
.empty must tell the caller, right now, whether data is available,
*before* .front ever returns anything.
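
For readers who want to see the feof() behavior in isolation, here is a minimal sketch in D using the C bindings in core.stdc.stdio. The helper name eofBeforeAndAfterRead is made up for illustration, and the sketch assumes tmpfile() succeeds on the platform at hand.

```d
import core.stdc.stdio;

/// Hypothetical helper: reports feof() before and after the first read
/// attempt on a stream that contains no data at all.
bool[2] eofBeforeAndAfterRead()
{
    FILE* fp = tmpfile();                 // empty temporary stream
    assert(fp !is null, "tmpfile() failed");
    scope (exit) fclose(fp);

    bool before = feof(fp) != 0;          // indicator is still clear here
    int c = fgetc(fp);                    // the read attempt hits end of file
    assert(c == EOF);
    bool after = feof(fp) != 0;           // only now is the indicator set
    return [before, after];
}

void main()
{
    auto r = eofBeforeAndAfterRead();
    assert(!r[0]);   // feof() said "not at EOF" on an empty stream
    assert(r[1]);    // it changes its answer only after a failed read
}
```

An input range's .empty has no such luxury: it has to give the final answer before anything is consumed.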

At one time, this problem was worked around by issuing a single fgetc on
the underlying file descriptor in ByLine's .empty method to determine
its EOF state, and then doing an ungetc to put the char back into the
stream.  However, this code is a rather ugly hack, and causes the
problem that when the interactive program needs to output a prompt
before blocking on input, it has to do so *before* it calls ByLine.empty
(since otherwise .empty blocks and the prompt doesn't get printed until
after the user has hit Enter -- clearly unacceptable for an interactive
shell program). If the stream turns out empty after all, then the prompt
is already output, and there's no way to take it back, so an extraneous
prompt is always written.
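
The peek-and-put-back trick described above can be sketched roughly as follows. This is not the code Phobos used; atEof is a hypothetical helper written against the C functions fgetc and ungetc as exposed by core.stdc.stdio.

```d
import core.stdc.stdio;

/// Hypothetical helper: peek one character to find out whether the stream
/// is at EOF. On an interactive stream the fgetc call is where it blocks.
bool atEof(FILE* fp)
{
    int c = fgetc(fp);
    if (c == EOF)
        return true;
    ungetc(c, fp);       // push the character back for the next reader
    return false;
}

void main()
{
    FILE* fp = tmpfile();                 // assumes tmpfile() succeeds
    assert(fp !is null, "tmpfile() failed");
    scope (exit) fclose(fp);

    fputs("x", fp);
    rewind(fp);

    assert(!atEof(fp));                   // data is available
    assert(fgetc(fp) == 'x');             // the peeked char was put back
    assert(atEof(fp));                    // nothing left
}
```

Because the answer comes from an actual read, any prompt has to be written before the call, which is exactly the ordering problem described above.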

Understandably, the ungetc hack was subsequently removed from Phobos,
by caching the subsequent line the first time .empty was called, which
eliminated the ugliness of ungetc, and allowed current code to continue
working as before.

Then recently, and also understandably, caching things in .empty was
frowned upon, so the caching was removed from .empty altogether and
pushed into the ByLine ctor. From the standpoint of Phobos code, this is
perhaps the ideal solution: the ctor reads the stream to get the first
line and simultaneously determine the EOF status of the stream, and
there is no need for ugly boolean state flags, ungetc ugliness, and
generally unpleasant code.

However, what happens is that now, ByLine will block on input *upon
construction*. This is rather unpleasant when your program needs to do
something like this:

	void main() {
		string prompt;
		...
		ByLine!char input;
		if (useStandardInput) {
			input = stdin.byLine();
		} else if (useScriptFile) {
			input = File(filename).byLine();
		}
		...
		if (mode == ProgramMode.modeA) { // mode is an enum
			runModeA(input);
		} else {
			runModeB(input);
		}
	}

	void runModeA(ByLine!char input) {
		write("modeA> ");	// display prompt
		while (!input.empty) {
			...
		}
	}

	void runModeB(ByLine!char input) {
		write("modeB> ");	// display prompt
		while (!input.empty) {
			...
		}
	}

The problem is, when input is initialized, we don't know what prompt to
use yet, but ByLine's ctor will already block when it tries to read from
stdin!

The current workaround I implemented is to use a wrapper around ByLine
that lazily constructs it when .empty is called.
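
A minimal sketch of such a wrapper might look like the following. The name LazyRange and its members are hypothetical, not the actual workaround code, and the sketch forwards all three range primitives explicitly so that the only place that can block is the first call to .empty.

```d
import std.range.primitives;

/// Hypothetical wrapper: builds the underlying range on the first call
/// to empty instead of at construction time.
struct LazyRange(R)
    if (isInputRange!R)
{
    private R delegate() make;   // builds the real range on demand
    private R inner;
    private bool constructed;

    this(R delegate() make) { this.make = make; }

    @property bool empty()
    {
        if (!constructed)
        {
            inner = make();      // the only place that may block
            constructed = true;
        }
        return inner.empty;
    }

    @property auto front()
    {
        assert(constructed, "call empty before front");
        return inner.front;
    }

    void popFront()
    {
        assert(constructed, "call empty before popFront");
        inner.popFront();
    }
}

void main()
{
    bool built;
    auto r = LazyRange!(int[])(() { built = true; return [1, 2, 3]; });
    assert(!built);              // nothing has been constructed yet

    int sum;
    foreach (x; r)
        sum += x;
    assert(built && sum == 6);   // construction happened on first empty
}
```

With a delegate that returns stdin.byLine() in place of the array, the prompt can presumably be written after the wrapper is created and before the first .empty call, which is the ordering an interactive shell needs.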

Who knew something so simple as an interactive prompting program that
reads input lines could turn into such a nightmare when ByLine is used?

:-(


T

-- 
What is Matter, what is Mind? Never Mind, it doesn't Matter.
Feb 26 2014
next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 26 February 2014 at 23:45:48 UTC, H. S. Teoh wrote:
 The problem is, when input is initialized, we don't know what 
 prompt to
 use yet, but ByLine's ctor will already block when it tries to 
 read from
 stdin!
Ouch, I think I saw this coming... [1]

[1] https://github.com/D-Programming-Language/phobos/pull/1883
Feb 26 2014
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
H. S. Teoh:

 I'm writing a CLI program that uses File.ByLine to read input 
 commands,
Isn't using readln() better for that? File.byLine is to read lines of files on disk.

Bye,
bearophile
Feb 26 2014
next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 26 February 2014 at 23:59:09 UTC, bearophile wrote:
 H. S. Teoh:

 I'm writing a CLI program that uses File.ByLine to read input 
 commands,
Isn't using readln() better for that? File.byLine is to read lines of files on disk. Bye, bearophile
Says who? The type system and documentation only assert that it works on files, with no reservations about what kind of file. The standard input file is as fine a file as any.
Feb 26 2014
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Feb 26, 2014 at 11:59:07PM +0000, bearophile wrote:
 H. S. Teoh:
 
I'm writing a CLI program that uses File.ByLine to read input
commands,
Isn't using readln() better for that? File.byLine is to read lines of files on disk.
[...]

Perhaps, but readln() isn't a range. The whole point was to use a range-based API for the interpreter so that there's no need to write two separate interfaces for the interpreter, one for stdin, one for a script file stored on disk.


T

-- 
Today's society is one of specialization: as you grow, you learn more and more about less and less. Eventually, you know everything about nothing.
Feb 26 2014
parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Thursday, 27 February 2014 at 00:07:47 UTC, H. S. Teoh wrote:
 On Wed, Feb 26, 2014 at 11:59:07PM +0000, bearophile wrote:
 H. S. Teoh:
 
I'm writing a CLI program that uses File.ByLine to read input
commands,
Isn't using readln() better for that? File.byLine is to read lines of files on disk.
[...] Perhaps, but readln() isn't a range. The whole point was to use a range-based API for the interpreter so that there's no need to write two separate interfaces for the interpreter, one for stdin, one for a script file stored on disk. T
Just write a function that accepts a std.stdio.File parameter?
Feb 26 2014
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 27, 2014 at 12:16:21AM +0000, Jakob Ovrum wrote:
 On Thursday, 27 February 2014 at 00:07:47 UTC, H. S. Teoh wrote:
On Wed, Feb 26, 2014 at 11:59:07PM +0000, bearophile wrote:
H. S. Teoh:

I'm writing a CLI program that uses File.ByLine to read input
commands,
Isn't using readln() better for that? File.byLine is to read lines of files on disk.
[...] Perhaps, but readln() isn't a range. The whole point was to use a range-based API for the interpreter so that there's no need to write two separate interfaces for the interpreter, one for stdin, one for a script file stored on disk. T
Just write a function that accepts a std.stdio.File parameter?
Unfortunately, I use string arrays in my unittests (to avoid having to create separate unittest input files). So passing in File wouldn't work. Besides, File isn't a range, so that kinda defeats the purpose (my current hack of lazily constructing ByLine does work).

I just find it unfortunate that such hacks are necessary to get off the ground.


T

-- 
It won't be covered in the book. The source code has to be useful for something, after all. -- Larry Wall
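
To illustrate the point about range-based interfaces and unittests, here is a rough sketch of an interpreter entry point that accepts any input range of lines. The function countCommands is a made-up stand-in for the real interpreter loop; it only counts non-blank lines.

```d
import std.range.primitives : isInputRange;
import std.stdio;

/// Hypothetical stand-in for an interpreter loop: takes any input range
/// of lines and returns how many non-blank commands it saw.
size_t countCommands(R)(R lines)
    if (isInputRange!R)
{
    size_t n;
    foreach (line; lines)
    {
        if (line.length != 0)    // skip blank lines
            ++n;
    }
    return n;
}

void main()
{
    // A string array stands in for a script in tests; no input file needed.
    assert(countCommands(["help", "", "quit"]) == 2);

    // The same template also accepts stdin.byLine(); this is only checked
    // at compile time here so that the example does not block on input.
    static assert(__traits(compiles, countCommands(stdin.byLine())));
}
```

A function taking a std.stdio.File parameter could not be exercised with the string array in the first assertion.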
Feb 26 2014
prev sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 26 Feb 2014 18:44:10 -0500, H. S. Teoh <hsteoh quickfur.ath.cx>  
wrote:

 First of all, the way ByLine works is kinda tricky, even in the previous
 releases. The underlying cause is that at least on Posix, the underlying
 C feof() call doesn't actually tell you whether you're really at EOF
 until you try to read something from the file descriptor.
This is not a posix problem, it's a general stream problem. A stream is not at EOF until the write end is closed. Until then, you cannot know whether it's empty until you read and don't get anything back. Even if a primitive existed that allowed you to tell whether the write end was closed, you can race this against the other process closing its write end.

I think the correct solution is to block on the first front call. We may be able to do this without storing an additional variable.

-Steve
Feb 27 2014
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 27, 2014 at 07:55:59AM -0500, Steven Schveighoffer wrote:
 On Wed, 26 Feb 2014 18:44:10 -0500, H. S. Teoh
 <hsteoh quickfur.ath.cx> wrote:
 
First of all, the way ByLine works is kinda tricky, even in the
previous releases. The underlying cause is that at least on Posix,
the underlying C feof() call doesn't actually tell you whether you're
really at EOF until you try to read something from the file
descriptor.
This is not a posix problem, it's a general stream problem. A stream is not at EOF until the write end is closed. Until then, you cannot know whether it's empty until you read and don't get anything back. Even if a primitive existed that allowed you to tell whether the write end was closed, you can race this against the other process closing it's write end. I think the correct solution is to block on the first front call. We may be able to do this without storing an additional variable.
[...]

Unfortunately, you can't. Since Phobos can't know whether the file (which may be a network socket, say) is at EOF without first blocking on read, it won't be able to return the correct value from .empty, and according to the range API, it's invalid to access .front unless .empty returns false. So this solution doesn't work. :-(


T

-- 
All men are mortal. Socrates is mortal. Therefore all men are Socrates.
Feb 27 2014
next sibling parent reply "Sean Kelly" <sean invisibleduck.org> writes:
Are the peek routines standard?  I'm on my phone so I can't 
easily check right now. Barring that, there's an ioctl call that 
can tell whether data is available, though I'm not sure offhand 
what the result would be for a file if you haven't read anything 
yet.
Feb 27 2014
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 27 Feb 2014 11:22:45 -0500, Sean Kelly <sean invisibleduck.org>  
wrote:

 Are the peek routines standard?  I'm on my phone so I can't easily check  
 right now. Barring that, there's an ioctl call that can tell whether  
 data is available, though I'm not sure offhand what the result would be  
 for a file if you haven't read anything yet.
Peek doesn't help. You can't, in a non-blocking way, tell if input will be forthcoming without actually receiving the input.

-Steve
Feb 27 2014
prev sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 27 Feb 2014 10:04:47 -0500, H. S. Teoh <hsteoh quickfur.ath.cx>  
wrote:

 On Thu, Feb 27, 2014 at 07:55:59AM -0500, Steven Schveighoffer wrote:
 On Wed, 26 Feb 2014 18:44:10 -0500, H. S. Teoh
 <hsteoh quickfur.ath.cx> wrote:

First of all, the way ByLine works is kinda tricky, even in the
previous releases. The underlying cause is that at least on Posix,
the underlying C feof() call doesn't actually tell you whether you're
really at EOF until you try to read something from the file
descriptor.
This is not a posix problem, it's a general stream problem. A stream is not at EOF until the write end is closed. Until then, you cannot know whether it's empty until you read and don't get anything back. Even if a primitive existed that allowed you to tell whether the write end was closed, you can race this against the other process closing it's write end. I think the correct solution is to block on the first front call. We may be able to do this without storing an additional variable.
[...] Unfortunately, you can't. Since Phobos can't know whether the file (which may be a network socket, say) is at EOF without first blocking on read, it won't be able to return the correct value from .empty, and according to the range API, it's invalid to access .front unless .empty returns false. So this solution doesn't work. :-(
Yes, you are right! Thinking about it, the only correct solution is to do what it already does -- establish the first line on construction. empty cannot depend on front, and doing something different on the first empty vs. every other one makes the range bloated and confusing.

The issue really is, to treat the construction and popFront as blocking. Streams are a tricky business indeed.

I think your solution is the only valid one. Unfortunate that you have to do this.

An interesting general solution is to use a delegate to generate the range, giving an easy one-line construction without having to make a wrapper range that lazily constructs on empty, but just using a delegate name does not call it. I did come up with this:

	import std.stdio;
	import std.range;

	void foo(R)(R r)
	{
		static if(isInputRange!R)
		{
			alias _r = r;
		}
		else // if is no-arg delegate and returns input range (too lazy to figure this out :)
		{
			auto _r(){return r();}
		}

		foreach(x; _r)
		{
			writeln(x);
		}
	}

	void main()
	{
		foo(() => stdin.byLine);
		foo([1,2,3]);
	}

The static if at the beginning is awkward, but just allows the rest of the code to be identical whether you call with a delegate or a range.

-Steve
Feb 27 2014
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 27, 2014 at 11:26:42AM -0500, Steven Schveighoffer wrote:
 On Thu, 27 Feb 2014 10:04:47 -0500, H. S. Teoh
 <hsteoh quickfur.ath.cx> wrote:
 
On Thu, Feb 27, 2014 at 07:55:59AM -0500, Steven Schveighoffer wrote:
On Wed, 26 Feb 2014 18:44:10 -0500, H. S. Teoh
<hsteoh quickfur.ath.cx> wrote:

First of all, the way ByLine works is kinda tricky, even in the
previous releases. The underlying cause is that at least on Posix,
the underlying C feof() call doesn't actually tell you whether
you're really at EOF until you try to read something from the file
descriptor.
This is not a posix problem, it's a general stream problem. A stream is not at EOF until the write end is closed. Until then, you cannot know whether it's empty until you read and don't get anything back. Even if a primitive existed that allowed you to tell whether the write end was closed, you can race this against the other process closing it's write end. I think the correct solution is to block on the first front call. We may be able to do this without storing an additional variable.
[...] Unfortunately, you can't. Since Phobos can't know whether the file (which may be a network socket, say) is at EOF without first blocking on read, it won't be able to return the correct value from .empty, and according to the range API, it's invalid to access .front unless .empty returns false. So this solution doesn't work. :-(
Yes, you are right! Thinking about it, the only correct solution is to do what it already does -- establish the first line on construction. empty cannot depend on front, and doing something different on the first empty vs. every other one makes the range bloated and confusing. The issue really is, to treat the construction and popFront as blocking. Streams are a tricky business indeed. I think your solution is the only valid one. Unfortunate that you have to do this. An interesting general solution is to use a delegate to generate the range, giving an easy one-line construction without having to make a wrapper range that lazily constructs on empty, but just using a delegate name does not call it. I did come up with this:
Actually, now that I think about it, can't we just make ByLine lazily constructed? It's already a wrapper around ByLineImpl anyway (since it's being refcounted), so why not just make the wrapper create ByLineImpl only when you actually attempt to use it? That would solve the problem: you can call ByLine but it won't block until ByLineImpl is actually created, which is the first time you call ByLine.empty.


T

-- 
Don't drink and derive. Alcohol and algebra don't mix.
Feb 27 2014
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 27 Feb 2014 12:32:44 -0500, H. S. Teoh <hsteoh quickfur.ath.cx>  
wrote:

 Actually, now that I think about it, can't we just make ByLine lazily
 constructed? It's already a wrapper around ByLineImpl anyway (since it's
 being refcounted), so why not just make the wrapper create ByLineImpl
 only when you actually attempt to use it? That would solve the problem:
 you can call ByLine but it won't block until ByLineImpl is actually
 created, which is the first time you call ByLine.empty.
I think this isn't any different than making ByLine.empty cache the first line.

My solution is basically this:

	struct LazyConstructedRange(R)
	{
		R r;
		bool isConstructed = false;
		R delegate() _ctor;

		this(R delegate() ctor) {_ctor = ctor;}

		ref R get()
		{
			if(!isConstructed) { r = _ctor(); isConstructed = true;}
			return r;
		}

		alias get this;
	}

Basically, we're not constructing on first call to empty, but first call to *anything*.

Actually, this kind of a solution would be better than what I came up with, because the object itself is a range instead of a delegate (satisfies, for instance, isInputRange and isIterable, whereas the delegate does not), and you don't need the static if like I wrote. Any additional usage of the delegate in my original solution creates a copy of the range, but the above would only construct it once.

-Steve
Feb 27 2014
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 27, 2014 at 01:47:49PM -0500, Steven Schveighoffer wrote:
 On Thu, 27 Feb 2014 12:32:44 -0500, H. S. Teoh
 <hsteoh quickfur.ath.cx> wrote:
 
Actually, now that I think about it, can't we just make ByLine lazily
constructed? It's already a wrapper around ByLineImpl anyway (since
it's being refcounted), so why not just make the wrapper create
ByLineImpl only when you actually attempt to use it? That would solve
the problem: you can call ByLine but it won't block until ByLineImpl
is actually created, which is the first time you call ByLine.empty.
I think this isn't any different than making ByLine.empty cache the first line.

My solution is basically this:

	struct LazyConstructedRange(R)
	{
		R r;
		bool isConstructed = false;
		R delegate() _ctor;

		this(R delegate() ctor) {_ctor = ctor;}

		ref R get()
		{
			if(!isConstructed) { r = _ctor(); isConstructed = true;}
			return r;
		}

		alias get this;
	}

Basically, we're not constructing on first call to empty, but first call to *anything*.
[...]

According to a strict interpretation of the range API, it is invalid to call any range method before you call .empty, because if the range turns out to be empty, calling .front or .popFront is undefined. So it is sufficient to implement lazy construction for .empty alone. All other cases *should* break anyway. :)

Once you have that, then what you're proposing is no different from mine, in essence.


T

-- 
By understanding a machine-oriented language, the programmer will tend to use a much more efficient method; it is much closer to reality. -- D. Knuth
Feb 28 2014
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 28 Feb 2014 16:12:26 -0500, H. S. Teoh <hsteoh quickfur.ath.cx>  
wrote:

 According to a strict interpretation of the range API, it is invalid to
 call any range method before you call .empty, because if the range turns
 out to be empty, calling .front or .popFront is undefined. So it is
 sufficient to implement lazy construction for .empty alone. All other
 cases *should* break anyway. :)

 Once you have that, then what you're proposing is no different from
 mine, in essence.
Yes, this is true. So, one can specifically define empty() and let the rest go to alias this.

I think such a range would be a good phobos addition.

-Steve
Feb 28 2014