digitalmars.D.learn - Reading a file of words line by line

mark writes:

As part of learning D I want to read a file that contains one 
word per line (plus optional junk after the word) and create a 
set of all the unique words of a particular length (uppercased).

D doesn't appear to have a set type, so I'm faking one using an 
associative array whose values are always 0.
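
A minimal sketch of the associative-array-as-set idiom described above, added for illustration. The helper name `toSet` and the sample words are hypothetical, and `bool` is used as the placeholder value type instead of `int`; any value type would do.

```d
import std.stdio : writeln;

/// Builds a set from `words`; the bool value is only a placeholder.
bool[string] toSet(string[] words) {
    bool[string] set;
    foreach (w; words)
        set[w] = true; // inserting an existing key just overwrites it
    return set;
}

void main() {
    auto seen = toSet(["ABLE", "BAKE", "ABLE"]);
    assert(seen.length == 2);              // the duplicate collapsed
    assert(("ABLE" in seen) !is null);     // `in` yields a pointer, null if absent
    assert("CAKE" !in seen);

    seen.remove("BAKE");                   // remove by key
    writeln(seen.byKey);                   // iterate over the members
}
```

Membership tests use the `in` operator on the keys, so the stored values are never inspected.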

I can't help feeling that the foreach loop's block is rather more 
verbose than it could be?

----

import std.stdio;

immutable WORDFILE = "/usr/share/hunspell/en_GB.dic";
immutable WORDSIZE = 4; // Should be even

alias WordSet = int[string]; // key = word; value = 0

void main() {
     import core.time;

     auto start = MonoTime.currTime;
     auto words = getWords(WORDFILE, WORDSIZE);
     // TODO
     writeln(words.length, " words");
     writeln(MonoTime.currTime - start);
}

WordSet getWords(string filename, int wordsize) {
     import std.conv;
     import std.regex;
     import std.uni;

     WordSet words;
     auto rx = ctRegex!(r"^[a-z]+", "i");
     auto file = File(filename);
     foreach (line; file.byLine) {
         auto match = matchFirst(line, rx);
         if (!match.empty()) {
             auto word = match.hit().to!string; // I hope this assumes UTF-8?
             if (word.length == wordsize) {
                 words[word.toUpper] = 0;
             }
         }
     }
     return words;
}
----

PS I'm using ldc on Linux and think that rdmd is excellent. For 
lots of small Python programs I have I'm wondering how many would 
be faster using D and rdmd (which I think caches binaries). Also 
I've now got Mike Parker's "Learning D" on order.

Jan 14 2020

mark writes:

Should I have closed the file, i.e.,:

     auto file = File(filename);
     scope(exit) file.close(); // Add this?
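
For context on the question above: `std.stdio.File` is reference-counted, and the underlying handle is closed when the last `File` referring to it is destroyed, so an explicit `close()` is optional here (adding `scope(exit) file.close();` is harmless). A small sketch under that assumption; the helper `writeThenRead` and the scratch file name are hypothetical.

```d
import std.file : exists, readText, remove, tempDir;
import std.path : buildPath;
import std.stdio : File;
import std.string : strip;

/// Writes `text` through a scoped File, then reads the file back.
string writeThenRead(string path, string text) {
    {
        auto file = File(path, "w");
        file.writeln(text);
    } // the only File referring to the handle is destroyed here, which closes it
    return readText(path).strip; // strip the line terminator added by writeln
}

void main() {
    // Hypothetical scratch path in the system temp directory.
    auto path = buildPath(tempDir, "file_close_demo.txt");
    scope(exit) if (exists(path)) remove(path);
    assert(writeThenRead(path, "word") == "word");
}
```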

Jan 14 2020

mipri <mipri minimaltype.com> writes:

On Tuesday, 14 January 2020 at 16:39:16 UTC, mark wrote:
 I can't help feeling that the foreach loop's block is rather 
 more verbose than it could be?

     WordSet words;
     auto rx = ctRegex!(r"^[a-z]+", "i");
     auto file = File(filename);
     foreach (line; file.byLine) {
          auto match = matchFirst(line, rx);
          if (!match.empty()) {
              auto word = match.hit().to!string; // I hope this assumes UTF-8?
              if (word.length == wordsize) {
                  words[word.toUpper] = 0;
              }
          }
     }
     return words;
 }
 ----

One thing I picked up during Advent of Code last year was
std.file.slurp, which was great for reading 90% of the input
files from that contest. With that, I'd do this more like

   int[string] words;
   slurp!string("input.txt", "%s").each!(w => words[w] = 0);

Where "%s" is what slurp() expects to find on each line, and
'string' is the type it returns from that. With just a list of
words this isn't very interesting. Some of my uses from the
contest are:

   auto input = slurp!(int, int, int)(args[1], "<x=%d, y=%d, z=%d>")
       .map!(p => Moon([p[0], p[1], p[2]])).array;

   Tuple!(string, string)[] input =
       slurp!(string, string)("input.txt", "%s)%s");

Of course if you want to validate the input as you're reading
it, you still have to do extra work, but it could be in a
.filter!
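
A self-contained sketch of the `slurp` usage described above, closely following the example in the Phobos documentation. The helper `loadPairs`, the scratch file name and its contents are hypothetical. With more than one type parameter, `slurp` returns an array of tuples, one per line.

```d
import std.file : exists, remove, slurp, tempDir, write;
import std.path : buildPath;
import std.typecons : tuple;

/// Reads lines of the form "<int> <double>" into an array of tuples.
auto loadPairs(string path) {
    return slurp!(int, double)(path, "%s %s");
}

void main() {
    // Hypothetical scratch file holding two records.
    auto path = buildPath(tempDir, "slurp_demo.txt");
    write(path, "12 12.25\n345 1.125");
    scope(exit) if (exists(path)) remove(path);

    auto rows = loadPairs(path);
    assert(rows.length == 2);
    assert(rows[0] == tuple(12, 12.25));
    assert(rows[1] == tuple(345, 1.125));
}
```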

Jan 14 2020

mark writes:

Thanks for the ideas, I've now reduced the size of the getWords() 
function (even allowing for moving the imports to the top of the 
file) to this:

WordSet getWords(string filename, int wordsize) {
     string bareWord(string line) {
         auto rx = ctRegex!(r"^([a-z]+)", "i");
         auto match = matchFirst(line, rx);
         return match.empty ? "" : match.hit.to!string;
     }
     WordSet words;
     slurp!string(filename, "%s")
         .map!(line => bareWord(line))
         .filter!(word => word.length == wordsize)
         .each!(word => words[word.toUpper] = 0);
     return words;
}

Is this as compact as it _reasonably_ can be?

Jan 15 2020

dwdv <dwdv posteo.de> writes:

On 2020-01-15 16:34, mark via Digitalmars-d-learn wrote:
 Is this as compact as it _reasonably_ can be?

How about this?

auto uniqueWords(string filename, uint wordsize) {
     import std.algorithm, std.array, std.conv, std.functional, std.uni;

     return File(filename).byLine
         .map!(line => line.until!(not!isAlpha))
         .filter!(word => word.count == wordsize)
         .map!(word => word.to!string.toUpper)
         .array
         .sort
         .uniq;
}

Jan 15 2020

mark writes:

I really do need a set for the next part of the program, but 
taking your code and ideas I have now reduced the function to 
this:

WordSet getWords(string filename, int wordsize) {
     WordSet words;
     File(filename).byLine
         .map!(line => line.until!(not!isAlpha))
         .filter!(word => word.count == wordsize)
         .each!(word => words[word.to!string.toUpper] = 0);
     return words;
}

This is also 4x faster than my version that used a regex -- 
thanks!

Why did you use string.count rather than string.length?

Jan 15 2020

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Jan 15, 2020 at 07:50:31PM +0000, mark via Digitalmars-d-learn wrote:
[...]
 Why did you use string.count rather than string.length?

The .length of a `string` type is the number of bytes that it occupies,
which is not necessarily the same thing as the number of characters in
the string. E.g., if you receive a Unicode string, there may be
multi-byte characters in it.
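
A short sketch of the difference, added for illustration with hypothetical sample strings. Note that `count` on a `string` counts decoded code points, which still differs from user-perceived characters when combining marks are present; `std.uni.byGrapheme` covers that case.

```d
import std.algorithm : count;
import std.range : walkLength;
import std.uni : byGrapheme;

void main() {
    string word = "caf\u00E9";        // precomposed e-acute, two bytes in UTF-8
    assert(word.length == 5);         // UTF-8 code units (bytes)
    assert(word.count == 4);          // code points, via range decoding

    string combining = "e\u0301";     // 'e' followed by a combining acute accent
    assert(combining.length == 3);    // 1 byte + 2 bytes
    assert(combining.count == 2);     // two code points
    assert(combining.byGrapheme.walkLength == 1); // one visible character
}
```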


T

-- 
A computer doesn't mind if its programs are put to purposes that don't match
their names. -- D. Knuth

Jan 15 2020

Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:

On Wednesday, 15 January 2020 at 19:50:31 UTC, mark wrote:
 I really do need a set for the next part of the program, but 
 taking your code and ideas I have now reduced the function to 
 this:

 WordSet getWords(string filename, int wordsize) {
     WordSet words;
      File(filename).byLine
          .map!(line => line.until!(not!isAlpha))
          .filter!(word => word.count == wordsize)
          .each!(word => words[word.to!string.toUpper] = 0);
     return words;
 }

 This is also 4x faster than my version that used a regex -- 
 thanks!

 Why did you use string.count rather than string.length?

Your solution is fine, but also



import std.stdio : writeln;

void main() {
    auto file = ["word one", "my word", "word"];
    writeln(uniqueWords(file, 4));
}

auto uniqueWords(string[] file, uint wordsize) {
    import std.algorithm, std.array, std.conv, std.functional, std.uni;
    import std.typecons : tuple; // needed for tuple() below

    return file
        .map!(line => line.until!(not!isAlpha))
        .filter!(word => word.count == wordsize)
        .map!(word => word.to!string.toUpper)
        .array
        .sort
        .uniq
        .map!(x => tuple(x, 0))
        .assocArray;
}

Jan 15 2020

dwdv <dwdv posteo.de> writes:

On 2020-01-16 04:54, Jesse Phillips via Digitalmars-d-learn wrote:
 [...]
 .map!(word => word.to!string.toUpper)
 .array
 .sort
 .uniq
 .map!(x => tuple (x, 0))
 .assocArray ;
 

.each!(word => words[word.to!string.toUpper] = 0);

isn't far off, but could also be (sans imports):

return File(filename).byLine
     .map!(line => line.until!(not!isAlpha))
     .filter!(word => word.count == wordsize)
     .map!(word => word.to!string.toUpper)
     .assocArray(0.repeat);
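
For reference, a self-contained version of the pipeline above with its imports spelled out, as a sketch only. It runs over an in-memory array of lines instead of a file so that it needs no input data; the function name `wordSet` and the sample lines are hypothetical.

```d
import std.algorithm : count, filter, map, until;
import std.array : assocArray;
import std.conv : to;
import std.functional : not;
import std.range : repeat;
import std.uni : isAlpha, toUpper;

/// Collects the uppercased leading words of exactly `wordsize` letters.
int[string] wordSet(string[] lines, size_t wordsize) {
    return lines
        .map!(line => line.until!(not!isAlpha))   // keep the leading letters
        .filter!(word => word.count == wordsize)  // count code points, not bytes
        .map!(word => word.to!string.toUpper)     // materialise and uppercase
        .assocArray(0.repeat);                    // keys from the range, every value 0
}

void main() {
    auto words = wordSet(["word/S", "Ward 1", "toolong", "word"], 4);
    assert(words.length == 2);
    assert("WORD" in words && "WARD" in words);
}
```

The two-argument `assocArray` pairs each key with the next element of the value range, so the infinite `0.repeat` supplies a 0 for every key and duplicate keys simply overwrite one another.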

Jan 16 2020

mark writes:

On Thursday, 16 January 2020 at 10:10:02 UTC, dwdv wrote:
 On 2020-01-16 04:54, Jesse Phillips via Digitalmars-d-learn 
 wrote:
 [...]


[...]
 isn't far off, but could also be (sans imports):

 return File(filename).byLine
     .map!(line => line.until!(not!isAlpha))
     .filter!(word => word.count == wordsize)
     .map!(word => word.to!string.toUpper)
     .assocArray(0.repeat);

That's what I'm now using -- thanks!
(Now I can try the next bit.)

Jan 16 2020
