
digitalmars.D - Why does readln include the line terminator?

reply Georg Wrede <georg.wrede iki.fi> writes:
Readln returns a string which contains the line terminator.

Is there a grand reason for this?


Currently there are a few drawbacks with this. The naive user doesn't 
expect it, and the seasoned user has to keep stripping it. And then he 
has to search the docs (or get hold of other OSs) to determine what 
terminator to expect on other systems.

And it can't really be a speed optimization either, because to do 
anything useful with a string, you have to strip the terminator anyway 
at some point.
Apr 13 2009
next sibling parent reply Daniel Keep <daniel.keep.lists gmail.com> writes:
Georg Wrede wrote:
 Readln returns a string which contains the line terminator.
 
 Is there a grand reason for this?
 
 
 Currently there are a few drawbacks with this. The naive user doesn't
 expect it, and the seasoned user has to keep stripping it. And then he
 has to search the docs (or get hold of other OSs) to determine what
 terminator to expect on other systems.
 
 And it can't really be a speed optimization either, because to do
 anything useful with a string, you have to strip the terminator anyway
 at some point.

Because if it stripped it, there's no way to know what it was. If you want to do per-line processing but don't want to clobber the line endings, readln has to return the line terminator. Besides which, it's a single function call to strip it off irrespective of OS. -- Daniel
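A minimal sketch of the per-line processing described above, written in Python rather than D so it runs standalone. The function name `process_preserving_endings` and the sample text are hypothetical; the point is only that a reader which keeps the terminator lets the caller put the same terminator back.

```python
def process_preserving_endings(text, transform):
    """Apply transform to each line's content, keeping its original terminator."""
    out = []
    # keepends=True leaves each line's terminator attached, much as readln does.
    for line in text.splitlines(keepends=True):
        body = line.splitlines()[0]      # the same line without its terminator
        terminator = line[len(body):]    # "\n", "\r\n", "\r", ... or "" at end of input
        out.append(transform(body) + terminator)
    return "".join(out)


mixed = "alpha\r\nbeta\ngamma"
print(repr(process_preserving_endings(mixed, str.upper)))
```

Had the reader stripped the terminators, the mixed CRLF/LF endings in `mixed` could not be reproduced on output.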
Apr 14 2009
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Daniel Keep wrote:
 Because if it stripped it, there's no way to know what it was.  If you
 want to do per-line processing but don't want to clobber the line
 endings, readln has to return the line terminator.

That's right; there are currently at least 6 different line terminators:

CR
LF
CRLF
FF
PS
LS
Apr 14 2009
parent reply Georg Wrede <georg.wrede iki.fi> writes:
Walter Bright wrote:
 Daniel Keep wrote:
 Because if it stripped it, there's no way to know what it was.  If you
 want to do per-line processing but don't want to clobber the line
 endings, readln has to return the line terminator.


Who wants to receive a line with varying line endings anyway???
 That's right; there are currently at least 6 different line terminators:
 
 CR
 LF
 CRLF
 FF
 PS
 LS

So the programmer who wants to write portable code has to implement awareness of all of these cases in each of his programs?

This seems a bit laborious. Replacing stuff at the end of the string forces him to check, for *each* line, the length of the terminator, and then use ...$-1 and at other times ...$-2, etc. in his code.

In 25 years of computing, I have yet to see a file where variation of line terminators in the file contained some /deliberate/ information.

And the only purpose for keeping the line endings would be to edit files while preserving the particular line terminator for each line. Which raises the question: how do you decide which terminator to use if you've inserted a line?

So the whole point is absurd. A reasonable default behavior for a file mongering program would be to output line terminators according to the operating system default. The case where one *wants* to preserve them should be considered the exception.

I'm simply asking for the default to be to strip the terminator, thus relieving the programmer from, imho, gratuitous labor. You can still preserve the current functionality as an option.
Apr 14 2009
next sibling parent reply BCS <ao pathlink.com> writes:
Reply to Georg,

 So the whole point is absurd. A reasonable default behavior for a file
 mongering program would be to output line terminators according to the
 operating system default. The case where one *wants* to preserve them,
 should be considered the exception.
 

Only if you consider wanting to maintain merge-ability/diff-ability as the exception. Some, if not most, source control/diff/merge tools consider changes in line endings as changes.
Apr 14 2009
parent reply Georg Wrede <georg.wrede iki.fi> writes:
BCS wrote:
 Reply to Georg,
 
 So the whole point is absurd. A reasonable default behavior for a file
 mongering program would be to output line terminators according to the
 operating system default. The case where one *wants* to preserve them,
 should be considered the exception.

Only if you consider wanting to maintain merge-ability/diff-ability as the exception. Some, if not most, source control/diff/merge tools consider changes in line endings as changes.

Doesn't this kind of prove my point? Changing a line ending /should not/ be a "difference". Not by default. They should have a switch to explicitly turn it on. A good diff is complex enough that it should not stumble on form when it is supposed to examine content.
Apr 14 2009
parent BCS <ao pathlink.com> writes:
Reply to Georg,

 BCS wrote:
 
 Only if you consider wanting to maintain
 merge-ability/diff-ability as the exception. Some, if not most,
 source control/diff/merge tools consider changes in line endings as
 changes.
 

Doesn't this kind of prove my point? Changing a line ending /should not/ be a "difference". Not by default. They should have a switch to explicitly turn it on. A good diff is complex enough that it should not stumble on form when it is supposed to examine content.

I make no assertions about what should be, only what is.
Apr 14 2009
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Georg Wrede:
 This seems a bit laborious. Replacing stuff at the end of the string 
 forces him to check, for *each* line, the length of the terminator, and 
 then use ...$-1 and at other times ...$-2, etc. in his code.

You use a string function or string method that removes the trailing newline, if present, whatever kind it is. There is one in std.string too. Its main problem (besides working with char[] only in D1) is that its name is too similar to that of another string function. I complained about this some time ago.

Regarding the newline at the end of lines, in Python:

 for line in file("somefilename.txt"):
     print line

line contains the ending newline too.

Bye,
bearophile
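For reference, a minimal Python 3 sketch of the same behaviour. The helper name `lines_without_terminators` is hypothetical, and `io.StringIO` merely stands in for an open text file. Each line arrives with its terminator attached, so printing it unstripped would emit a blank line after every line.

```python
import io


def lines_without_terminators(stream):
    """Return the stream's lines with any trailing CR/LF characters removed."""
    return [line.rstrip("\r\n") for line in stream]


# io.StringIO stands in for an open text file here.
sample = io.StringIO("first\nsecond\n")
for text in lines_without_terminators(sample):
    print(text)   # print supplies the single newline itself
```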
Apr 14 2009
parent Georg Wrede <georg.wrede iki.fi> writes:
bearophile wrote:
 Georg Wrede:
 This seems a bit laborious. Replacing stuff at the end of the string 
 forces him to check, for *each* line, the length of the terminator, and 
 then use ...$-1 and at other times ...$-2, etc. in his code.

You use a string function or string method that removes the trailing newline, if present, whatever kind it is. There is one in std.string too. Its main problem (besides working with char[] only in D1) is that its name is too similar to that of another string function. I complained about this some time ago.

Regarding the newline at the end of lines, in Python:

 for line in file("somefilename.txt"):
     print line

line contains the ending newline too.

Your code ends up printing the output on every other line.
Apr 14 2009
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Georg Wrede wrote:
 In 25 years of computing, I have yet to see a file where variation of 
 line terminators in the file contained some /deliberate/ information.

25 years and no networking code? Andrei
Apr 14 2009
next sibling parent Sean Kelly <sean invisibleduck.org> writes:
Steven Schveighoffer wrote:
 On Tue, 14 Apr 2009 13:19:49 -0400, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:
 
 Georg Wrede wrote:
 In 25 years of computing, I have yet to see a file where variation of 
 line terminators in the file contained some /deliberate/ information.

25 years and no networking code?

Been writing code for about 12 years, lots and lots of networking code. Still have never seen this. Don't see your point either.

With HTTP, for example, lines are terminated with \r\n. The lines themselves (in the header, at least) have constraints on the character range they allow, so one might want to error on solo \n but break on a \r\n, etc. Still, I don't know why anyone would use readln() for processing a network protocol, so perhaps the issue is moot.
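A small sketch of that kind of strictness, in Python: break only on CRLF and treat a lone CR or LF as an error. The function name `split_header_lines` and the sample bytes are made up for illustration; this is not a full HTTP parser and makes no claim about how any real library behaves.

```python
def split_header_lines(data: bytes) -> list:
    """Split header bytes on CRLF; reject any CR or LF not part of a CRLF pair."""
    lines = data.split(b"\r\n")
    if lines and lines[-1] == b"":
        lines.pop()                      # input ended with CRLF: drop the empty tail
    for line in lines:
        if b"\r" in line or b"\n" in line:
            raise ValueError("bare CR or LF in header line: %r" % line)
    return lines


raw = b"Host: example.org\r\nAccept: */*\r\n"
print(split_header_lines(raw))
```

A reader that silently accepted any terminator could not make this distinction.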
Apr 14 2009
prev sibling parent reply Georg Wrede <georg.wrede iki.fi> writes:
Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 In 25 years of computing, I have yet to see a file where variation of 
 line terminators in the file contained some /deliberate/ information.

25 years and no networking code?

I can see having to use one or another line ending in the whole output file, but not a situation where some lines and not some other need this or that kind of line ending.
Apr 14 2009
parent reply "Nick Sabalausky" <a a.a> writes:
"Georg Wrede" <georg.wrede iki.fi> wrote in message 
news:gs2o15$233h$2 digitalmars.com...
 I can see having to use one or another line ending in the whole output 
 file, but not a situation where some lines and not some other need this or 
 that kind of line ending.

Source code with unescaped nl's/cr's embedded in a string literal? Though I admit that may not be a particularly compelling case for at least a couple of different reasons. (I do agree with your original point though.)
Apr 14 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Nick Sabalausky wrote:
 "Georg Wrede" <georg.wrede iki.fi> wrote in message 
 news:gs2o15$233h$2 digitalmars.com...
 I can see having to use one or another line ending in the whole output 
 file, but not a situation where some lines and not some other need this or 
 that kind of line ending.

Source code with unescaped nl's/cr's embedded in a string literal? Though I admit that may not be a particularly compelling case for at least a couple of different reasons. (I do agree with your original point though.)

I think there are a few concerns when designing an API for reading separated lines.

1. Reasonably complex separators should be allowed, e.g. regexes. For streams that have lookahead = 1, only regexes without backtracking (i.e., classic regular expressions) can be allowed.

2. Alternate separators should be allowed, and information should be passed as to which one, if any, matched:

 readln(stream, '\n', '\r', "Brought to you by Carl's Jr.\n");

You should be able to somehow extract which one of these matched, or whether the stream ended without having seen any. The match process is similar to regexes, but the information returned would be difficult to extract from a regex match.

3. Given (1) and (2), the process of eliminating the matched separator can become rather involved. So there should be an option to just eliminate the separator.

4. However, the separator should be made available to the caller. That makes for programs that preserve the separator, whatever it was.

I plan to implement a little API around these considerations, but haven't gotten around to it. Particularly the regex thing is rather thorny because std.regex does not distinguish classic regular expressions from those needing backtracking, and does not have an implementation that works with limited-lookahead streams. I suspect that that would be a major effort.

Right now readln preserves the separator. The newer File.byLine eliminates it by default and offers to keep it by calling File.byLine(KeepTerminator.yes). The allowed terminators are one character or a string. See

http://erdani.dreamhosters.com/d/web/phobos/std_stdio.html#byLine

I consider such an API adequate but insufficient; we need to add to it.

Andrei
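Concerns 2 to 4 can be sketched roughly as follows, in Python and for fixed-string separators only (no regexes). The name `read_until` and its return convention are assumptions made for this illustration, not the planned Phobos API. It reads one character at a time, so it needs no lookahead, and returns both the text and whichever separator matched.

```python
import io


def read_until(stream, *separators):
    """Read from stream until one of separators is seen.

    Returns (text, separator); separator is "" if the stream ended first.
    """
    if not separators or any(sep == "" for sep in separators):
        raise ValueError("at least one non-empty separator is required")
    # Try longer separators first so "...Jr.\n" wins over a plain "\n".
    ordered = sorted(separators, key=len, reverse=True)
    buf = ""
    while True:
        ch = stream.read(1)              # one character at a time: no lookahead
        if ch == "":
            return buf, ""               # end of stream, nothing matched
        buf += ch
        for sep in ordered:
            if buf.endswith(sep):
                return buf[:-len(sep)], sep


stream = io.StringIO("one\ntwo\rthree")
print(read_until(stream, "\n", "\r"))
print(read_until(stream, "\n", "\r"))
print(read_until(stream, "\n", "\r"))
```

The caller can discard the second element to get stripped lines, or write it back out to preserve the original separator.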
Apr 14 2009
next sibling parent reply Daniel Keep <daniel.keep.lists gmail.com> writes:
Andrei Alexandrescu wrote:
 ...
 
 Right now readln preserves the separator. The newer File.byLine
 eliminates it by default and offers to keep it by calling
 File.byLine(KeepTerminator.yes). The allowed terminators are one
 character or a string. See
 
 http://erdani.dreamhosters.com/d/web/phobos/std_stdio.html#byLine
 
 I consider such an API adequate but insufficient; we need to add to it.
 
 
 Andrei

Why not:

 char[] line, sep;
 line = File.byLine(); // discard sep
 line = File.byLine(sep); // pass sep out

The separator is likely to be more useful once extracted.

-- Daniel
Apr 14 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Daniel Keep wrote:
 
 Andrei Alexandrescu wrote:
 ...

 Right now readln preserves the separator. The newer File.byLine
 eliminates it by default and offers to keep it by calling
 File.byLine(KeepTerminator.yes). The allowed terminators are one
 character or a string. See

 http://erdani.dreamhosters.com/d/web/phobos/std_stdio.html#byLine

 I consider such an API adequate but insufficient; we need to add to it.


 Andrei

Why not:

 char[] line, sep;
 line = File.byLine(); // discard sep
 line = File.byLine(sep); // pass sep out

The separator is likely to be more useful once extracted.

And how about when sep is elaborate (e.g. regex)? Andrei
Apr 14 2009
parent reply Daniel Keep <daniel.keep.lists gmail.com> writes:
Andrei Alexandrescu wrote:
 Daniel Keep wrote:
 Andrei Alexandrescu wrote:
 ...

 Right now readln preserves the separator. The newer File.byLine
 eliminates it by default and offers to keep it by calling
 File.byLine(KeepTerminator.yes). The allowed terminators are one
 character or a string. See

 http://erdani.dreamhosters.com/d/web/phobos/std_stdio.html#byLine

 I consider such an API adequate but insufficient; we need to add to it.


 Andrei

Why not:

 char[] line, sep;
 line = File.byLine(); // discard sep
 line = File.byLine(sep); // pass sep out

The separator is likely to be more useful once extracted.

And how about when sep is elaborate (e.g. regex)? Andrei

Whatever was matched. If we have a file containing: "A.B,C" And we split lines using /[.,]/, then this:
 char[] line, sep;
 line = File.byLine(sep);
 while( line != "" )
 {
     writefln(`line = "%s", sep = "%s"`, line, sep);
     line = File.byLine(sep);
 }

Would output this:
 line = "A", sep = "."
 line = "B", sep = ","
 line = "C", sep = ""

-- Daniel
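The behaviour described above can be mimicked in Python with a regex, as a rough sketch. `split_with_separators` is a hypothetical helper that works on an in-memory string rather than a stream, so it sidesteps the limited-lookahead problem entirely.

```python
import re


def split_with_separators(text, pattern):
    """Yield (line, sep) pairs; sep is "" for the final piece."""
    pos = 0
    for match in re.finditer(pattern, text):
        yield text[pos:match.start()], match.group(0)
        pos = match.end()
    yield text[pos:], ""                 # whatever follows the last separator


for line, sep in split_with_separators("A.B,C", r"[.,]"):
    print('line = "%s", sep = "%s"' % (line, sep))
```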
Apr 14 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Daniel Keep wrote:
 
 Andrei Alexandrescu wrote:
 Daniel Keep wrote:
 Andrei Alexandrescu wrote:
 ...

 Right now readln preserves the separator. The newer File.byLine
 eliminates it by default and offers to keep it by calling
 File.byLine(KeepTerminator.yes). The allowed terminators are one
 character or a string. See

 http://erdani.dreamhosters.com/d/web/phobos/std_stdio.html#byLine

 I consider such an API adequate but insufficient; we need to add to it.


 Andrei

char[] line, sep; line = File.byLine(); // discard sep line = File.byLine(sep); // pass sep out The separator is likely to be more useful once extracted.

Andrei

Whatever was matched. If we have a file containing: "A.B,C" And we split lines using /[.,]/, then this:
 char[] line, sep;
 line = File.byLine(sep);
 while( line != "" )
 {
     writefln(`line = "%s", sep = "%s"`, line, sep);
     line = File.byLine(sep);
 }

Would output this:
 line = "A", sep = "."
 line = "B", sep = ","
 line = "C", sep = ""

-- Daniel

Where did you specify the separator in the call to byLine? Andrei
Apr 14 2009
parent reply Christopher Wright <dhasenan gmail.com> writes:
Steven Schveighoffer wrote:
 auto reader = file.byLine!("/[.,]/")();

Why specify anything at compile time when a user could reasonably generate the value at runtime?

 auto reader = file.byLine(readConfig().separator);
Apr 15 2009
parent reply Robert Fraser <fraserofthenight gmail.com> writes:
Christopher Wright wrote:
 Steven Schveighoffer wrote:
 auto reader = file.byLine!("/[.,]/")();

Why specify anything at compile time when a user could reasonably generate the value at runtime?

 auto reader = file.byLine(readConfig().separator);

Yes, and for maximum abstraction, the config file should be stored as XML in a TEXT field of a database on another server.
Apr 15 2009
parent Christopher Wright <dhasenan gmail.com> writes:
Robert Fraser wrote:
 Christopher Wright wrote:
 Steven Schveighoffer wrote:
 auto reader = file.byLine!("/[.,]/")();

Why specify anything at compile time when a user could reasonably generate the value at runtime?

 auto reader = file.byLine(readConfig().separator);

Yes, and for maximum abstraction, the config file should be stored as XML in a TEXT field of a database on another server.

I just really hate to see templates when a regular function would suffice and be so close to the same efficiency as makes no difference for most reasonable situations. If there's a significant performance increase, I want to see both options.
Apr 15 2009
prev sibling parent Georg Wrede <georg.wrede iki.fi> writes:
Andrei Alexandrescu wrote:
 I plan to implement a little API around these considerations, but 
 haven't gotten around to it. Particularly the regex thing is rather 
 thorny because std.regex does not distinguish classic regular 
 expressions from those needing backtracking, and does not have an 
 implementation that works with limited-lookahead streams. I suspect that 
 that would be a major effort.
 
 Right now readln preserves the separator. The newer File.byLine 
 eliminates it by default and offers to keep it by calling 

Excellent!!
 File.byLine(KeepTerminator.yes). The allowed terminators are one 
 character or a string. See
 
 http://erdani.dreamhosters.com/d/web/phobos/std_stdio.html#byLine

 I consider such an API adequate but insufficient; we need to add to it.

Apr 15 2009
prev sibling next sibling parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Daniel Keep wrote:
 Georg Wrede wrote:
 Readln returns a string which contains the line terminator.


 Because if it stripped it, there's no way to know what it was.  If you
 want to do per-line processing but don't want to clobber the line
 endings, readln has to return the line terminator.

But readln only stops on '\n' (or whatever character you tell it to otherwise), so will miss Mac "\r" endings altogether. As such, it's useless for this purpose.

The big question, however, is why std.stream.InputStream doesn't have readln. It has readLine, which has different semantics - it understands all three line break styles and strips them. This is absurd since you're more likely to care about what line ending is used when reading in a text file than when reading from stdin.

Take these four cases:
(a) you want to process only files with a specific line ending style
(b) you want to know what line endings are used
(c) you don't care about what line endings are used, but still want to know whether or not the file ends with one
(d) you just want to read the file line by line, without caring about the line endings or the presence or absence of one at the end

At the moment, readln is good only for (a). readLine is good only for (d). If you want (b) or (c), you'll have to come up with an alternative means.

Stewart.
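Cases (b) and (c) might be handled along these lines; a Python sketch, with the hypothetical helper `line_ending_report` operating on raw bytes and recognising only the CR, LF and CRLF styles.

```python
from collections import Counter


def line_ending_report(data: bytes):
    """Count the line-ending styles in data and check for a final terminator."""
    counts = Counter()
    # bytes.splitlines breaks on CR, LF and CRLF; keepends=True retains them.
    for line in data.splitlines(keepends=True):
        if line.endswith(b"\r\n"):
            counts["CRLF"] += 1
        elif line.endswith(b"\n"):
            counts["LF"] += 1
        elif line.endswith(b"\r"):
            counts["CR"] += 1
    ends_with_terminator = data.endswith((b"\n", b"\r"))
    return dict(counts), ends_with_terminator


print(line_ending_report(b"a\r\nb\nc\rd"))
```

Both answers depend on the reader handing back the terminator; a reader that strips it can serve only case (d).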
Apr 14 2009
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 14 Apr 2009 13:19:49 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 Georg Wrede wrote:
 In 25 years of computing, I have yet to see a file where variation of  
 line terminators in the file contained some /deliberate/ information.

25 years and no networking code?

Been writing code for about 12 years, lots and lots of networking code. Still have never seen this. Don't see your point either. -Steve
Apr 14 2009
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 15 Apr 2009 00:21:48 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 Daniel Keep wrote:
  Andrei Alexandrescu wrote:
 Daniel Keep wrote:
 Andrei Alexandrescu wrote:
 ...

 Right now readln preserves the separator. The newer File.byLine
 eliminates it by default and offers to keep it by calling
 File.byLine(KeepTerminator.yes). The allowed terminators are one
 character or a string. See

 http://erdani.dreamhosters.com/d/web/phobos/std_stdio.html#byLine

 I consider such an API adequate but insufficient; we need to add to  
 it.


 Andrei

char[] line, sep; line = File.byLine(); // discard sep line = File.byLine(sep); // pass sep out The separator is likely to be more useful once extracted.

Andrei

"A.B,C" And we split lines using /[.,]/, then this:
 char[] line, sep;
 line = File.byLine(sep);
 while( line != "" )
 {
     writefln(`line = "%s", sep = "%s"`, line, sep);
     line = File.byLine(sep);
 }

 line = "A", sep = "."
 line = "B", sep = ","
 line = "C", sep = ""


Where did you specify the separator in the call to byLine?

I think he's not read the docs. Consider this usage instead:

 auto reader = file.byLine!("/[.,]/")();

 // normal usage, doesn't return separators
 foreach(line; reader)
 {
     ...
 }

 // alternate usage, returns separators as well
 while(!reader.empty)
 {
     char[] sep;
     char[] line = reader.front(sep); // can't remember if this is what you decided on.
     ...
     reader.popFront(); // ditto
 }

 // Note that if foreach on ranges was extended to allow multiple
 // parameters per pass, you could do:
 foreach(sep, line; reader)
 {
     ...
 }

-Steve
Apr 14 2009
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 15 Apr 2009 22:54:50 -0400, Christopher Wright  
<dhasenan gmail.com> wrote:

 Robert Fraser wrote:
 Christopher Wright wrote:
 Steven Schveighoffer wrote:
 auto reader = file.byLine!("/[.,]/")();

Why specify anything at compile time when a user could reasonably generate the value at runtime?

 auto reader = file.byLine(readConfig().separator);

Yes, and for maximum abstraction, the config file should be stored as XML in a TEXT field of a database on another server.

I just really hate to see templates when a regular function would suffice and be so close to the same efficiency as makes no difference for most reasonable situations. If there's a significant performance increase, I want to see both options.

It's just a demonstration of what the OP was talking about but wasn't explaining properly. I have no intention of writing or supporting this code. I think it's fine if Andrei decides to write this code and uses a function parameter instead of a template parameter; that I used a template parameter instead of a function parameter is not a hidden suggestion.

-Steve
Apr 15 2009
prev sibling next sibling parent reply Christopher Wright <dhasenan gmail.com> writes:
Georg Wrede wrote:
 Readln returns a string which contains the line terminator.
 
 Is there a grand reason for this?
 
 
 Currently there are a few drawbacks with this. The naive user doesn't 
 expect it, and the seasoned user has to keep stripping it. And then he 
 has to search the docs (or get hold of other OSs) to determine what 
 terminator to expect on other systems.
 
 And it can't really be a speed optimization either, because to do 
 anything useful with a string, you have to strip the terminator anyway 
 at some point.

By default, tango does not exhibit this behavior. If you wish, you can include newlines:

 auto str = Cin.copyln; // no newline in str
 auto str2 = Cin.copyln(true); // has system-dependent newline
Apr 14 2009
parent Georg Wrede <georg.wrede iki.fi> writes:
Christopher Wright wrote:
 Georg Wrede wrote:
 Readln returns a string which contains the line terminator.

 Is there a grand reason for this?


 Currently there are a few drawbacks with this. The naive user doesn't 
 expect it, and the seasoned user has to keep stripping it. And then he 
 has to search the docs (or get hold of other OSs) to determine what 
 terminator to expect on other systems.

 And it can't really be a speed optimization either, because to do 
 anything useful with a string, you have to strip the terminator anyway 
 at some point.

By default, tango does not exhibit this behavior. If you wish, you can include newlines:

 auto str = Cin.copyln; // no newline in str
 auto str2 = Cin.copyln(true); // has system-dependent newline

Now this is more like it. The default should really be (in Phobos too) to not return the newline.

(Hint to Walter: Tango is for users, by users, and if they have no newline as the default, it should be considered a serious hint as to what the programmer prefers.)

If one is really interested in doing some file manipulation which might *preserve* varying line terminators in files that might have been edited in both linux and dos, then he should use "the non-default" line reading, like the Cin.copyln(true) above. Not that I'd see the point.

I'm certain that the overwhelming majority of cases where one reads lines (_especially_ from the console, but from text files, too), one just wants the contents of the string.
Apr 14 2009
prev sibling next sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Georg Wrede wrote:

 because to do anything useful with a string, you have to strip the
 terminator

This is false in the case of simple copying. And I doubt that, for more complex operations, splitting `readln' into `readlnBody' and `readlnEOL' and calling them alternately would be of any benefit.

-manfred
Apr 14 2009
parent reply Georg Wrede <georg.wrede iki.fi> writes:
Manfred Nowak wrote:
 Georg Wrede wrote:
 
 because to do anything useful with a string, you have to strip the
 terminator

This is false in case of simple copying. And I doubt, that for more complex operations splitting `readln' into `readlnBody' and `readlnEOL' and calling them intermittent would be of any benefit.

For copying there is the operating system command, copy. Additionally, simple copy is hardly the most used thing when readln is invoked. So, either, there should be two functions, one of which preserves the terminator, or (like in Tango) there should be a parameter to turn them on.
Apr 14 2009
parent Manfred Nowak <svv1999 hotmail.com> writes:
Georg Wrede wrote:

 So, either, there should be [...]

Agreed. -manfred
Apr 14 2009
prev sibling parent "Denis Koroskin" <2korden gmail.com> writes:
On Tue, 14 Apr 2009 18:01:52 +0400, Manfred Nowak <svv1999 hotmail.com> wrote:

 Georg Wrede wrote:

 because to do anything useful with a string, you have to strip the
 terminator

This is false in case of simple copying. And I doubt, that for more complex operations splitting `readln' into `readlnBody' and `readlnEOL' and calling them intermittent would be of any benefit. -manfred

Tango does the best by having an optional parameter that denotes whether a line ending needs to be retained.
Apr 14 2009