digitalmars.D - [your code here]

H. S. Teoh (23/23) Feb 17 2012 // Outputs a randomly selected line from standard input with equal

Alf P. Steinbach (9/29) Feb 17 2012 Assuming that by "any" you mean "any particular", you would have to read...

Timon Gehr (6/39) Feb 17 2012 If the .idup is omitted the code does not compile. It does not silently
H. S. Teoh (30/64) Feb 17 2012 But therein lies the trick. The algorithm self-adapts to the number of

Timon Gehr (6/68) Feb 17 2012 You don't need to know how it is implemented. Everything you need to

a (14/18) Feb 17 2012 I'm not sure if I understood your post correctly, but it seems to

bearophile (24/27) Feb 17 2012 An alternative File API that to me looks nice. This is a first part, it'...

Timon Gehr (4/31) Feb 17 2012 Note that what we have now is clear, handy, quite short and safe and

bearophile (12/15) Feb 17 2012 What we have now:

H. S. Teoh (21/55) Feb 17 2012 I don't think it's a good idea to have File() automatically iterate by

Brad Anderson (5/77) Feb 17 2012 Steve Schveighoffer's work in progress on std.io:

H. S. Teoh (9/16) Feb 17 2012 [...]

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

// Outputs a randomly selected line from standard input with equal
// likelihood.
import std.random;
import std.stdio;

void main() {
	auto n = 0;
	string choice;
	foreach (line; stdin.byLine()) {
		n++;
		if (uniform(0,n) == 0)
			choice = line.idup;
	}
	writeln(choice);
}


P.S. Proof that the probability any line is selected is exactly 1/n
(where n is the total number of lines read) is left as an exercise for
the reader. ;-)

P.S.S. The .idup is a bit ugly, but necessary, since apparently byLine()
calls readln() with a static buffer, so choice will be silently
overwritten if the .idup is omitted.


T

-- 
Ruby is essentially Perl minus Wall.

Feb 17 2012

"Alf P. Steinbach" <alf.p.steinbach+usenet gmail.com> writes:

On 18.02.2012 02:39, H. S. Teoh wrote:
 // Outputs a randomly selected line from standard input with equal
 // likelihood.
 import std.random;
 import std.stdio;

 void main() {
 	auto n = 0;
 	string choice;
 	foreach (line; stdin.byLine()) {
 		n++;
 		if (uniform(0,n) == 0)
 			choice = line.idup;
 	}
 	writeln(choice);
 }


 P.S. Proof that the probability any line is selected is exactly 1/n
 (where n is the total number of lines read) is left as an exercise for
 the reader. ;-)

Assuming that by "any" you mean "any particular", you would have to read 
all the lines first. Otherwise, if the code selects the first line with 
probability 1/K, then I can just input some other number of lines.


 P.S.S. The .idup is a bit ugly, but necessary, since apparently byLine()
 calls readln() with a static buffer, so choice will be silently
 overwritten if the .idup is omitted.

That sounds ominous. One should never have to be aware of low level 
details in order to do simple string assignment or initialization, when 
the source already is a string. Does one really have to do that in D?


Cheers & hth.,

- Alf

Feb 17 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 02/18/2012 02:52 AM, Alf P. Steinbach wrote:
 On 18.02.2012 02:39, H. S. Teoh wrote:
 // Outputs a randomly selected line from standard input with equal
 // likelihood.
 import std.random;
 import std.stdio;

 void main() {
     auto n = 0;
     string choice;
     foreach (line; stdin.byLine()) {
         n++;
         if (uniform(0,n) == 0)
             choice = line.idup;
     }
     writeln(choice);
 }


 P.S. Proof that the probability any line is selected is exactly 1/n
 (where n is the total number of lines read) is left as an exercise for
 the reader. ;-)

 Assuming that by "any" you mean "any particular", you would have to read
 all the lines first. Otherwise, if the code selects the first line with
 probability 1/K, then I can just input some other number of lines.


 P.S.S. The .idup is a bit ugly, but necessary, since apparently byLine()
 calls readln() with a static buffer, so choice will be silently
 overwritten if the .idup is omitted.


If the .idup is omitted the code does not compile. It does not silently 
misbehave. That is why we have a const system.

 That sounds ominous. One should never have to be aware of low level
 details in order to do simple string assignment or initialization, when
 the source already is a string.

The source is not a string, it is a char[].

 Does one really have to do that in D?


 Cheers & hth.,

 - Alf

File.byLine re-uses the buffer in order to be more efficient. This 
appears in the documentation, and the buffer is typed appropriately.

Feb 17 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Feb 18, 2012 at 02:52:15AM +0100, Alf P. Steinbach wrote:
 On 18.02.2012 02:39, H. S. Teoh wrote:
// Outputs a randomly selected line from standard input with equal
// likelihood.
import std.random;
import std.stdio;

void main() {
	auto n = 0;
	string choice;
	foreach (line; stdin.byLine()) {
		n++;
		if (uniform(0,n) == 0)
			choice = line.idup;
	}
	writeln(choice);
}


P.S. Proof that the probability any line is selected is exactly 1/n
(where n is the total number of lines read) is left as an exercise
for the reader. ;-)

 
 Assuming that by "any" you mean "any particular", you would have to
 read all the lines first. Otherwise, if the code selects the first
 line with probability 1/K, then I can just input some other number of
 lines.

But therein lies the trick. The algorithm self-adapts to the number of
lines it reads. It does not know in advance how many lines there are,
but guarantees that by the end of the input, the probability that any
particular line is selected is exactly 1/n, where n is the number of
lines read. The key lies in the way uniform() is called. :)

(This has been tested by running the program repeatedly on the same
input and analysing the number of occurrences of each line. The proof of
the algorithm is left as an exercise for the reader. ;-) )


P.S.S. The .idup is a bit ugly, but necessary, since apparently byLine()
calls readln() with a static buffer, so choice will be silently
overwritten if the .idup is omitted.

 
 That sounds ominous. One should never have to be aware of low level
 details in order to do simple string assignment or initialization,
 when the source already is a string. Does one really have to do that
 in D?

[...]

Well, there are two issues here. One is that stdin.byLine() returns
char[], which cannot be assigned to string directly. Two is that the
first version of the code didn't have the .idup (I used char[] instead
of string for choice), and the result was garbage in the output due to
said overwriting of buffer.

Now I agree that one shouldn't need to know how byLine() is implemented
in order to be able to use it, but this is the way the current Phobos
implementation is, unfortunately.

However, the one upside to all this is that the type system, in a sense,
forces you to do the right thing: byLine() returns char[] which cannot
be assigned to string, so naturally you have to write .idup, which
automatically also avoids the buffer overwriting issue. Using string
instead of char[] is a logical choice in this case (pun intended ;-)),
because it's something whose value you want to retain until the next
time it's modified. So using string for 'choice' naturally leads to
needing an .idup in the loop. The system isn't perfect, but it's pretty
good.


T

-- 
Political correctness: socially-sanctioned hypocrisy.

Feb 17 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 02/18/2012 03:16 AM, H. S. Teoh wrote:
 On Sat, Feb 18, 2012 at 02:52:15AM +0100, Alf P. Steinbach wrote:
 On 18.02.2012 02:39, H. S. Teoh wrote:
 // Outputs a randomly selected line from standard input with equal
 // likelihood.
 import std.random;
 import std.stdio;

 void main() {
 	auto n = 0;
 	string choice;
 	foreach (line; stdin.byLine()) {
 		n++;
 		if (uniform(0,n) == 0)
 			choice = line.idup;
 	}
 	writeln(choice);
 }


 P.S. Proof that the probability any line is selected is exactly 1/n
 (where n is the total number of lines read) is left as an exercise
 for the reader. ;-)

 Assuming that by "any" you mean "any particular", you would have to
 read all the lines first. Otherwise, if the code selects the first
 line with probability 1/K, then I can just input some other number of
 lines.

 But therein lies the trick. The algorithm self-adapts to the number of
 lines it reads. It does not know in advance how many lines there are,
 but guarantees that by the end of the input, the probability that any
 particular line is selected is exactly 1/n, where n is the number of
 lines read. The key lies in the way uniform() is called. :)

 (This has been tested by running the program repeatedly on the same
 input and analysing the number of occurrences of each line. The proof of
 the algorithm is left as an exercise for the reader. ;-) )


 P.S.S. The .idup is a bit ugly, but necessary, since apparently byLine()
 calls readln() with a static buffer, so choice will be silently
 overwritten if the .idup is omitted.

 That sounds ominous. One should never have to be aware of low level
 details in order to do simple string assignment or initialization,
 when the source already is a string. Does one really have to do that
 in D?

 [...]

 Well, there are two issues here. One is that stdin.byLine() returns
 char[], which cannot be assigned to string directly. Two is that the
 first version of the code didn't have the .idup (I used char[] instead
 of string for choice), and the result was garbage in the output due to
 said overwriting of buffer.

 Now I agree that one shouldn't need to know how byLine() is implemented
 in order to be able to use it, but this is the way the current Phobos
 implementation is, unfortunately.

You don't need to know how it is implemented. Everything you need to 
know is stated in its interface + documentation comment.

 However, the one upside to all this is that the type system, in a sense,
 forces you to do the right thing: byLine() returns char[] which cannot
 be assigned to string, so naturally you have to write .idup, which
 automatically also avoids the buffer overwriting issue. Using string
 instead of char[] is a logical choice in this case (pun intended ;-)),
 because it's something whose value you want to retain until the next
 time it's modified. So using string for 'choice' naturally leads to
 needing an .idup in the loop. The system isn't perfect, but it's pretty
 good.


 T

Its pretty perfect. If byLine would return string it would be horribly 
inefficient for the common case of processing the input without storing 
it. It even allows in-place modification of the current input line.

Feb 17 2012

"a" <a a.com> writes:

 Assuming that by "any" you mean "any particular", you would 
 have to read all the lines first. Otherwise, if the code 
 selects the first line with probability 1/K, then I can just 
 input some other number of lines.

I'm not sure if I understood your post correctly, but it seems to 
me that you didn't read the code carefully. The code selects the 
last read line with probability 1/n, where n is the number of 
strings already read, not the number of all strings. Say you have 
have already read n strings and they are all selected with 
probability 1/n. Now you read a new string and select it with 
probability  1 / (n+1). The probability that you didn't select 
the new string is then n / (n + 1) and the first n strings are 
selected with probability 1 / n * n / (n + 1) = 1 / (n + 1), so 
you now have n+1 lines all selected with probability 1 / (n + 1). 
So if the property that input strings are all selected with equal 
probability holds for n, it must also hold for n+1. Since it 
obviously holds for n = 1 (we just select the only string with 
probability 1 in that case), it must hold for all n.

Feb 17 2012

bearophile <bearophileHUGS lycos.com> writes:

H. S. Teoh:

 P.S.S. The .idup is a bit ugly, but necessary, since apparently byLine()
 calls readln() with a static buffer, so choice will be silently
 overwritten if the .idup is omitted.

An alternative File API that to me looks nice. This is a first part, it's for
scripting-like or not high performance purposes, it looks essentially like
Python code, every line is a newly allocated string:

import std.stdio;
void main() {
    string[] lines;
    foreach (line; File("data.dat")) {
        static assert(is(line == string));
        lines ~= line;
    }
}

If you don't want a new buffer every line you use something like this:

import std.stdio;
void main() {
    string[] lines;
    foreach (line; File("data.dat").fastLines()) {
        static assert(is(line == char[]));
        lines ~= line.idup;
    }
}

So on default it's handy, short and safe, and with a method you avoid an
allocation every line and get a mutable char[].

Maybe even this works, downloads a page and scans its lines, but maybe it's
better to add a bit of extra safety to this:
foreach (line; File("http://www.dlang.org/faq.html")) {}

Bye,
bearophile

Feb 17 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 02/18/2012 03:20 AM, bearophile wrote:
 H. S. Teoh:

 P.S.S. The .idup is a bit ugly, but necessary, since apparently byLine()
 calls readln() with a static buffer, so choice will be silently
 overwritten if the .idup is omitted.

 An alternative File API that to me looks nice. This is a first part, it's for
scripting-like or not high performance purposes, it looks essentially like
Python code, every line is a newly allocated string:

 import std.stdio;
 void main() {
      string[] lines;
      foreach (line; File("data.dat")) {
          static assert(is(line == string));
          lines ~= line;
      }
 }

 If you don't want a new buffer every line you use something like this:

 import std.stdio;
 void main() {
      string[] lines;
      foreach (line; File("data.dat").fastLines()) {
          static assert(is(line == char[]));
          lines ~= line.idup;
      }
 }

 So on default it's handy, short and safe, and with a method you avoid an
allocation every line and get a mutable char[].

 Maybe even this works, downloads a page and scans its lines, but maybe it's
better to add a bit of extra safety to this:
 foreach (line; File("http://www.dlang.org/faq.html")) {}

 Bye,
 bearophile

Note that what we have now is clear, handy, quite short and safe and 
efficient. What you propose takes away the clarity and efficiency parts. 
foreach(foo; File("data.dat")) {} // by what does this iterate?

Feb 17 2012

bearophile <bearophileHUGS lycos.com> writes:

Timon Gehr:

 Note that what we have now is clear, handy, quite short and safe and 
 efficient. What you propose takes away the clarity and efficiency parts. 

What we have now:
- by Line() is explicit, but if you don't use it is not that clear, if a Python
programmer iterates on File it gives an almost mysterious error message:
test.d(4): Error: invalid foreach aggregate (File __ctmp834 = 0;

 , __ctmp834).this("data.dat","rb")
- It's less handy because you have to add idup if you need strings (and most
times you need a string);
- It's longer than what I have suggested;
- It's not safe, because if you need to work mutable char[], it yields the same
buffer. This is a bug-prone default. A better API allocates every line on
default, and not allocates it on request. (This is how I designed my dlibs1).
It's "safe" only if you need a string, because it forces you to convert it to a
string with idup.


 foreach(foo; File("data.dat")) {} // by what does this iterate?

Iterating the lines of a file is a very common operation, so you remember what
it does.

Anyway, if you don't like File to be iterable, I'd like that attempt to give
better error message, and byLine() to yield a newly allocated string (and
another "fast"-annotated method that yields a slice of the same mutable buffer).

Bye,
bearophile

Feb 17 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Feb 17, 2012 at 09:20:59PM -0500, bearophile wrote:
 H. S. Teoh:
 
 P.S.S. The .idup is a bit ugly, but necessary, since apparently
 byLine() calls readln() with a static buffer, so choice will be
 silently overwritten if the .idup is omitted.

 
 An alternative File API that to me looks nice. This is a first part,
 it's for scripting-like or not high performance purposes, it looks
 essentially like Python code, every line is a newly allocated string:
 
 import std.stdio;
 void main() {
     string[] lines;
     foreach (line; File("data.dat")) {
         static assert(is(line == string));
         lines ~= line;
     }
 }

I don't think it's a good idea to have File() automatically iterate by
lines. What if data.dat is binary? I think it's still better to have:

	foreach (line; File("...").lines()) { ... }

where File.lines() is a lazy method that reads the file line-by-line.

For binary files, you could have File.byChunk!T() where T can be any
type, struct, etc., with the appropriate method for construction on
input.


 If you don't want a new buffer every line you use something like this:
 
 import std.stdio;
 void main() {
     string[] lines;
     foreach (line; File("data.dat").fastLines()) {
         static assert(is(line == char[]));
         lines ~= line.idup;
     }
 }
 
 So on default it's handy, short and safe, and with a method you avoid
 an allocation every line and get a mutable char[].

Makes sense.


 Maybe even this works, downloads a page and scans its lines, but maybe
 it's better to add a bit of extra safety to this:
 foreach (line; File("http://www.dlang.org/faq.html")) {}

[...]

I don't know about this, I would prefer downloading files to be in
another class, not File. As long as they provide a Range interface it's
good enough. Perhaps File and NetworkFile (or whatever you want to call
it) can implement the same interface, .lines() for safe iteration,
.fastLines() for buffered iteration.

Anyway, is anyone working on a new file I/O interface? I'd like to see
std.stream and std.stdio combined, for one thing. Or replaced with a
range-based API.


T

-- 
Frank disagreement binds closer than feigned agreement.

Feb 17 2012

"Brad Anderson" <eco gnuk.net> writes:

On Saturday, 18 February 2012 at 02:46:52 UTC, H. S. Teoh wrote:
 On Fri, Feb 17, 2012 at 09:20:59PM -0500, bearophile wrote:
 H. S. Teoh:
 
 P.S.S. The .idup is a bit ugly, but necessary, since 
 apparently
 byLine() calls readln() with a static buffer, so choice will 
 be
 silently overwritten if the .idup is omitted.

 
 An alternative File API that to me looks nice. This is a first 
 part,
 it's for scripting-like or not high performance purposes, it 
 looks
 essentially like Python code, every line is a newly allocated 
 string:
 
 import std.stdio;
 void main() {
     string[] lines;
     foreach (line; File("data.dat")) {
         static assert(is(line == string));
         lines ~= line;
     }
 }

 I don't think it's a good idea to have File() automatically 
 iterate by
 lines. What if data.dat is binary? I think it's still better to 
 have:

 	foreach (line; File("...").lines()) { ... }

 where File.lines() is a lazy method that reads the file 
 line-by-line.

 For binary files, you could have File.byChunk!T() where T can 
 be any
 type, struct, etc., with the appropriate method for 
 construction on
 input.


 If you don't want a new buffer every line you use something 
 like this:
 
 import std.stdio;
 void main() {
     string[] lines;
     foreach (line; File("data.dat").fastLines()) {
         static assert(is(line == char[]));
         lines ~= line.idup;
     }
 }
 
 So on default it's handy, short and safe, and with a method 
 you avoid
 an allocation every line and get a mutable char[].

 Makes sense.


 Maybe even this works, downloads a page and scans its lines, 
 but maybe
 it's better to add a bit of extra safety to this:
 foreach (line; File("http://www.dlang.org/faq.html")) {}

 [...]

 I don't know about this, I would prefer downloading files to be 
 in
 another class, not File. As long as they provide a Range 
 interface it's
 good enough. Perhaps File and NetworkFile (or whatever you want 
 to call
 it) can implement the same interface, .lines() for safe 
 iteration,
 .fastLines() for buffered iteration.

 Anyway, is anyone working on a new file I/O interface? I'd like 
 to see
 std.stream and std.stdio combined, for one thing. Or replaced 
 with a
 range-based API.


 T

Steve Schveighoffer's work in progress on std.io:

https://github.com/schveiguy/phobos/blob/new-io/std/io.d

Regards,
Brad Anderson

Feb 17 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Feb 18, 2012 at 04:41:09AM +0100, Brad Anderson wrote:
 On Saturday, 18 February 2012 at 02:46:52 UTC, H. S. Teoh wrote:

[...]
Anyway, is anyone working on a new file I/O interface? I'd like to
see std.stream and std.stdio combined, for one thing. Or replaced
with a range-based API.


[...]
 Steve Schveighoffer's work in progress on std.io:
 
 https://github.com/schveiguy/phobos/blob/new-io/std/io.d

[...]

Hmmm, I browsed through the code, and I like what I see. I'd like to see
string streams implemented eventually. That would be very useful.


T

-- 
"You are a very disagreeable person." "NO."

Feb 17 2012

D Programming

C/C++ Programming

Other

digitalmars.D - [your code here]