digitalmars.D.learn - Reading a file of words line by line

mark <mark qtrac.eu> writes:
As part of learning D I want to read a file that contains one 
word per line (plus optional junk after the word) and create a 
set of all the unique words of a particular length (uppercased).

D doesn't appear to have a set type so I'm faking one using an 
associative array whose values are always 0.
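
A minimal sketch of the associative-array-as-set idea described above. The helper names `add` and `contains` are hypothetical, introduced only for illustration; the `WordSet` alias matches the one used in the program below.

```d
import std.stdio : writeln;

alias WordSet = int[string]; // key = word; value is a dummy 0

// Hypothetical helper: inserting an existing key just overwrites the dummy value.
void add(ref WordSet set, string word) {
    set[word] = 0;
}

// Hypothetical helper: `in` yields a pointer to the value, or null if the key is absent.
bool contains(const WordSet set, string word) {
    return (word in set) !is null;
}

void main() {
    WordSet words;
    words.add("DEAL");
    words.add("LEAD");
    words.add("DEAL");               // duplicate, the set does not grow
    writeln(words.length);           // 2
    writeln(words.contains("LEAD")); // true
    words.remove("LEAD");            // built-in associative array removal
    writeln(words.contains("LEAD")); // false
}
```

Membership tests use the built-in `in` operator, so the stored value is never read; any cheap value type would serve equally well.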

I can't help feeling that the foreach loop's block is rather more 
verbose than it could be?

----

import std.stdio;

immutable WORDFILE = "/usr/share/hunspell/en_GB.dic";
immutable WORDSIZE = 4; // Should be even

alias WordSet = int[string]; // key = word; value = 0

void main() {
     import core.time;

     auto start = MonoTime.currTime;
     auto words = getWords(WORDFILE, WORDSIZE);
     // TODO
     writeln(words.length, " words");
     writeln(MonoTime.currTime - start);
}

WordSet getWords(string filename, int wordsize) {
     import std.conv;
     import std.regex;
     import std.uni;

     WordSet words;
     auto rx = ctRegex!(r"^[a-z]+", "i");
     auto file = File(filename);
     foreach (line; file.byLine) {
	auto match = matchFirst(line, rx);
	if (!match.empty()) {
	    auto word = match.hit().to!string; // I hope this assumes UTF-8?
	    if (word.length == wordsize) {
		words[word.toUpper] = 0;
	    }
	}
     }
     return words;
}
----

PS I'm using ldc on Linux and think that rdmd is excellent. For 
lots of small Python programs I have I'm wondering how many would 
be faster using D and rdmd (which I think caches binaries). Also 
I've now got Mike Parker's "Learning D" on order.
Jan 14 2020
mark <mark qtrac.eu> writes:
Should I have closed the file, i.e.,:

     auto file = File(filename);
     scope(exit) file.close(); // Add this?
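
Nobody in the thread answers this directly. As a hedged note: `std.stdio.File` is reference counted, so the underlying handle is closed when the last copy of the `File` goes out of scope, and an explicit `close()` is normally optional. The sketch below illustrates that under stated assumptions; `countLines` and the temporary file name are hypothetical and not part of the original program.

```d
import std.file : remove, tempDir, writeText = write;
import std.path : buildPath;
import std.stdio : File, writeln;

// Hypothetical helper: counts the lines in a file without closing it explicitly.
size_t countLines(string filename) {
    auto file = File(filename); // opened read-only by default
    size_t n;
    foreach (line; file.byLine)
        ++n;
    return n;
    // `file` is destroyed here; as the last reference, it closes the handle.
}

void main() {
    // Assumed scratch file, created only for this demonstration.
    auto path = buildPath(tempDir, "file_close_demo.txt");
    writeText(path, "alpha\nbeta\ngamma\n");
    scope (exit) remove(path);

    writeln(countLines(path)); // 3
}
```

Adding `scope(exit) file.close();` does no harm and makes the intent explicit, which can matter if copies of the `File` outlive the function.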
Jan 14 2020
mipri <mipri minimaltype.com> writes:
On Tuesday, 14 January 2020 at 16:39:16 UTC, mark wrote:
 I can't help feeling that the foreach loop's block is rather 
 more verbose than it could be?
     WordSet words;
     auto rx = ctRegex!(r"^[a-z]+", "i");
     auto file = File(filename);
     foreach (line; file.byLine) {
 	auto match = matchFirst(line, rx);
 	if (!match.empty()) {
 	    auto word = match.hit().to!string; // I hope this assumes UTF-8?
 	    if (word.length == wordsize) {
 		words[word.toUpper] = 0;
 	    }
 	}
     }
     return words;
 }
 ----
One thing I picked up during Advent of Code last year was 
std.file.slurp, which was great for reading 90% of the input files 
from that contest. With that, I'd do this more like

    int[string] words;
    slurp!string("input.txt", "%s").each!(w => words[w] = 0);

Where "%s" is what slurp() expects to find on each line, and 
'string' is the type it returns from that. With just a list of words 
this isn't very interesting. Some of my uses from the contest are:

    auto input = slurp!(int, int, int)(args[1], "<x=%d, y=%d, z=%d>")
        .map!(p => Moon([p[0], p[1], p[2]])).array;

    Tuple!(string, string)[] input =
        slurp!(string, string)("input.txt", "%s)%s");

Of course if you want to validate the input as you're reading it, 
you still have to do extra work, but it could be in a .filter!
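
A self-contained sketch of the slurp() usage described above, assuming a small scratch file in the same `<x=.., y=.., z=..>` shape. The helper `readMoons`, the file name, and the sample values are hypothetical, chosen only so the example can run on its own.

```d
import std.file : remove, slurp, tempDir, writeText = write;
import std.path : buildPath;
import std.stdio : writeln;

// Hypothetical helper: parses every line with the given format string.
// With several types, slurp returns an array of tuples, one per line.
auto readMoons(string path) {
    return slurp!(int, int, int)(path, "<x=%d, y=%d, z=%d>");
}

void main() {
    // Assumed sample input, written to a temporary file for the demonstration.
    auto path = buildPath(tempDir, "slurp_demo.txt");
    writeText(path, "<x=-1, y=0, z=2>\n<x=2, y=-10, z=-7>\n");
    scope (exit) remove(path);

    auto moons = readMoons(path);
    writeln(moons.length); // 2
    writeln(moons[1][1]);  // -10 (second line, y field)
}
```

Each line has to match the format string completely, so input with optional trailing junk still needs a separate cleanup step such as the regex or `until` approaches discussed in this thread.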
Jan 14 2020
mark <mark qtrac.eu> writes:
Thanks for the ideas, I've now reduced the size of the getWords() 
function (even allowing for moving the imports to the top of the 
file) to this:

WordSet getWords(string filename, int wordsize) {
     string bareWord(string line) {
	auto rx = ctRegex!(r"^([a-z]+)", "i");
	auto match = matchFirst(line, rx);
	return match.empty ? "" : match.hit.to!string;
     }
     WordSet words;
     slurp!string(filename, "%s")
	.map!(line => bareWord(line))
	.filter!(word => word.length == wordsize)
	.each!(word => words[word.toUpper] = 0);
     return words;
}

Is this as compact as it _reasonably_ can be?
Jan 15 2020
dwdv <dwdv posteo.de> writes:
On 2020-01-15 16:34, mark via Digitalmars-d-learn wrote:
 Is this as compact as it _reasonably_ can be?
How about this?

auto uniqueWords(string filename, uint wordsize) {
    import std.algorithm, std.array, std.conv, std.functional, std.uni;

    return File(filename).byLine
        .map!(line => line.until!(not!isAlpha))
        .filter!(word => word.count == wordsize)
        .map!(word => word.to!string.toUpper)
        .array
        .sort
        .uniq;
}
Jan 15 2020
mark <mark qtrac.eu> writes:
I really do need a set for the next part of the program, but 
taking your code and ideas I have now reduced the function to 
this:

WordSet getWords(string filename, int wordsize) {
     WordSet words;
     File(filename).byLine
	.map!(line => line.until!(not!isAlpha))
	.filter!(word => word.count == wordsize)
	.each!(word => words[word.to!string.toUpper] = 0);
     return words;
}

This is also 4x faster than my version that used a regex -- 
thanks!
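
One detail the pipeline above depends on: `byLine` reuses its internal buffer between iterations, so each line must be copied (here via `to!string`) before it is stored as a key. A minimal sketch of two ways to obtain independent copies follows; the helper names and the scratch file are hypothetical.

```d
import std.algorithm : map;
import std.array : array;
import std.file : remove, tempDir, writeText = write;
import std.path : buildPath;
import std.stdio : File, writeln;

// Hypothetical helper: copies each reused byLine buffer explicitly with idup.
string[] readLinesIdup(string path) {
    return File(path).byLine.map!(line => line.idup).array;
}

// Hypothetical helper: byLineCopy allocates a fresh string for every line.
string[] readLinesCopy(string path) {
    return File(path).byLineCopy.array;
}

void main() {
    // Assumed scratch file, created only for this demonstration.
    auto path = buildPath(tempDir, "byline_demo.txt");
    writeText(path, "alpha\nbeta\n");
    scope (exit) remove(path);

    writeln(readLinesIdup(path)); // ["alpha", "beta"]
    writeln(readLinesCopy(path)); // ["alpha", "beta"]
}
```

In the word-set function the copy happens only for lines that pass the length filter, which avoids allocating for words that are discarded.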

Why did you use string.count rather than string.length?
Jan 15 2020
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Jan 15, 2020 at 07:50:31PM +0000, mark via Digitalmars-d-learn wrote:
[...]
 Why did you use string.count rather than string.length?
The .length of a `string` type is the number of bytes that it 
occupies, which is not necessarily the same thing as the number of 
characters in the string. E.g., if you receive a Unicode string, 
there may be multi-byte characters in it.

T

-- 
A computer doesn't mind if its programs are put to purposes that don't match their names. -- D. Knuth
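
A short sketch of the difference described above, using an assumed sample word that contains one two-byte character. It also shows a second, related point that is an observation rather than something stated in the thread: the range returned by `until` does not provide `.length` at all, so `count` is the available way to measure it.

```d
import std.algorithm : count, until;
import std.functional : not;
import std.range.primitives : hasLength;
import std.stdio : writeln;
import std.uni : isAlpha;

void main() {
    // Assumed sample: "h\u00E9llo" is five characters, but U+00E9 takes two UTF-8 bytes.
    string word = "h\u00E9llo";
    writeln(word.length); // 6 (UTF-8 code units, i.e. bytes)
    writeln(word.count);  // 5 (decoded code points)

    // The lazy prefix produced by until has no .length property,
    // so it has to be walked to find out how long it is.
    auto prefix = "word one".until!(not!isAlpha);
    static assert(!hasLength!(typeof(prefix)));
    writeln(prefix.count); // 4
}
```

For pure ASCII words the two numbers agree, which is why the regex version with `.length` gave the same results on simple input.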
Jan 15 2020
Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Wednesday, 15 January 2020 at 19:50:31 UTC, mark wrote:
 I really do need a set for the next part of the program, but 
 taking your code and ideas I have now reduced the function to 
 this:

 WordSet getWords(string filename, int wordsize) {
     WordSet words;
     File(filename).byLine
 	.map!(line => line.until!(not!isAlpha))
 	.filter!(word => word.count == wordsize)
 	.each!(word => words[word.to!string.toUpper] = 0);
     return words;
 }

 This is also 4x faster than my version that used a regex -- 
 thanks!

 Why did you use string.count rather than string.length?
Your solution is fine, but also

void main() {
    auto file = ["word one", "my word", "word"];
    writeln(uniqueWords(file, 4));
}

auto uniqueWords(string[] file, uint wordsize) {
    import std.algorithm, std.array, std.conv, std.functional, std.uni;
    import std.typecons : tuple; // needed for tuple() below

    return file
        .map!(line => line.until!(not!isAlpha))
        .filter!(word => word.count == wordsize)
        .map!(word => word.to!string.toUpper)
        .array
        .sort
        .uniq
        .map!(x => tuple(x, 0))
        .assocArray;
}
Jan 15 2020
dwdv <dwdv posteo.de> writes:
On 2020-01-16 04:54, Jesse Phillips via Digitalmars-d-learn wrote:
 [...]
 .map!(word => word.to!string.toUpper)
 .array
 .sort
 .uniq
 .map!(x => tuple (x, 0))
 .assocArray ;
 
.each!(word => words[word.to!string.toUpper] = 0);

isn't far off, but could also be (sans imports):

return File(filename).byLine
    .map!(line => line.until!(not!isAlpha))
    .filter!(word => word.count == wordsize)
    .map!(word => word.to!string.toUpper)
    .assocArray(0.repeat);
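
For reference, a runnable sketch of the `assocArray(0.repeat)` form with its imports spelled out. To stay self-contained it takes in-memory lines instead of a file; the function name `wordSet` and the sample lines are hypothetical.

```d
import std.algorithm : count, filter, map, until;
import std.array : assocArray;
import std.conv : to;
import std.functional : not;
import std.range : repeat;
import std.stdio : writeln;
import std.uni : isAlpha, toUpper;

// Hypothetical variant of the thread's function, reading from an array of lines.
int[string] wordSet(string[] lines, size_t wordsize) {
    return lines
        .map!(line => line.until!(not!isAlpha))   // leading alphabetic run of each line
        .filter!(word => word.count == wordsize)  // keep words of the requested length
        .map!(word => word.to!string.toUpper)     // materialise and uppercase the word
        .assocArray(0.repeat);                    // pair every key with the dummy value 0
}

void main() {
    // Assumed sample input with junk after some of the words.
    auto words = wordSet(["word one", "my word", "word", "tree7", "abc"], 4);
    writeln(words.length);            // 2
    writeln(("TREE" in words) !is null); // true
    writeln(("MY" in words) !is null);   // false
}
```

Duplicate keys simply overwrite each other, so no separate sort and uniq pass is needed before building the set.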
Jan 16 2020
mark <mark qtrac.eu> writes:
On Thursday, 16 January 2020 at 10:10:02 UTC, dwdv wrote:
 On 2020-01-16 04:54, Jesse Phillips via Digitalmars-d-learn 
 wrote:
 [...]
[...]
 isn't far off, but could also be (sans imports):

 return File(filename).byLine
     .map!(line => line.until!(not!isAlpha))
     .filter!(word => word.count == wordsize)
     .map!(word => word.to!string.toUpper)
     .assocArray(0.repeat);
That's what I'm now using -- thanks! (Now I can try the next bit.)
Jan 16 2020