digitalmars.D.learn - coreutils with D trials, wc, binary vs well formed utf

btiffin <btiffin myopera.com> writes:
Hello,

New here. A little background.  Old guy, program for both work 
and recreation, GNU maintainer for the GnuCOBOL package; written 
in C, compiles COBOL via C intermediates.  Fell into the role of 
maintainer mainly due to being a documentation writer and early 
on cheerleader.  Experienced in quite a few programming 
languages, with a "10,000ish hours in" definition of expert 
expertise in C, Forth and COBOL.

Assuming that with gdc in GCC mainline now that D usage will 
continue to grow.  Also of the opinion that slow, long tail 
growth is the best kind of growth.  Not hype, not marketing, but 
adoption due to worthiness and merit.  That is the current 
headspace.  Want D to succeed, can't point to a specific why, 
just feel deep down that it should succeed and have an open ended 
relevant life span.  Thanks, Walter, Andrei, Iain, Ari, et al...

Just bumped into 
https://dlang.org/blog/2020/01/28/wc-in-d-712-characters-without-a-single-branch/

Way cool.  Then bumped into this:

prompt$ ./wc *
std.utf.UTFException /usr/lib/gcc/i686-linux-gnu/11/include/d/std/utf.d(1380):
Invalid UTF-8 sequence (at index 1)

That was from an a.out file in the directory.  Early days, very 
limited D, so answers of "just set ...", will fly over head, 
actual gdc command lines and noob jargon will sink in faster at 
this point.  Is there a(n easy-ish) way to fix up that wc.d 
source in the blog to fallback to byte stream mode when a utf-8 
reader fails an encoding?

Have good, make well,
Brian
May 24
Dukc <ajieskola gmail.com> writes:
> Is there a(n easy-ish) way to fix up that wc.d source in the
> blog to fallback to byte stream mode when a utf-8 reader fails
> an encoding?
Rewrite `toLine`:

```d
Line toLine(char[] l) pure {
    import std.array : array;
    import std.algorithm : filter;
    import std.utf : byDchar, replacementDchar;

    // byDchar yields replacementDchar for malformed input; drop those.
    auto valid = l.byDchar.filter!(c => c != replacementDchar).array;
    return Line(valid.byCodePoint.walkLength, valid.splitter.walkLength);
}
```

This just ignores malformed UTF without counting it. It has one problem though: for some reason it seems to ignore one character after a malformed code unit. I don't know why.
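As a minimal sketch of the filtering idea above, the counting step can be isolated into a small function. The name `countValidCodePoints` is hypothetical and not part of the blog's wc.d; it relies only on `std.utf.byDchar` substituting `replacementDchar` for malformed sequences instead of throwing.

```d
import std.algorithm : filter;
import std.range : walkLength;
import std.utf : byDchar, replacementDchar;

/// Hypothetical helper: counts the code points in `bytes`, skipping
/// every replacementDchar that byDchar emits for malformed UTF-8.
size_t countValidCodePoints(const(char)[] bytes)
{
    return bytes.byDchar
                .filter!(c => c != replacementDchar)
                .walkLength;
}
```

Feeding it a buffer that contains a stray byte such as `0xFF` is a quick way to reproduce the skipped-character behaviour mentioned above without running the whole program.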
May 24
Bastiaan Veelo <Bastiaan Veelo.net> writes:
On Monday, 24 May 2021 at 16:58:33 UTC, btiffin wrote:
> [...]
> Just bumped into
> https://dlang.org/blog/2020/01/28/wc-in-d-712-characters-without-a-single-branch/
> [...]
> Is there a(n easy-ish) way to fix up that wc.d source in the
> blog to fallback to byte stream mode when a utf-8 reader fails
> an encoding?
Welcome, Brian.

I have allowed myself to use exception handling and `filter`, which I regard to be no longer branch free. But it does (almost) produce the same output as GNU wc:

```d
Line toLine(char[] l) pure {
    import std.utf : UTFException, byChar;
    import std.ascii : isWhite;
    import std.algorithm : filter;

    try {
        return Line(l.byCodePoint.walkLength, l.splitter.walkLength);
    }
    catch (UTFException) {
        // Fall back to byte-wise counting when the line is not valid UTF-8.
        return Line(l.length,
                    l.byChar.splitter!(isWhite)
                     .filter!(w => w.length > 0).walkLength);
    }
}
```

The number of chars can be returned in O(1) by the `.length` property. Use of `byChar.splitter!(isWhite)` considers the ASCII values of the chars, but without the `filter` it counts too many words. The reason is that a mix of different white space characters causes problems (https://run.dlang.io/is/QzjTN0):

```d
writeln("Hello \t D".splitter!isWhite); // ["Hello", "", "", "D"]
writeln("Hello \t D".splitter);         // ["Hello", "D"]
```

This surprises me, could be a bug. So `filter!(w => w.length > 0)` filters out the "words" with zero length...

Compared to GNU wc this reports one line too many for me, though. There may be more elegant solutions than mine.

-- Bastiaan.
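The byte-wise fallback in the `catch` branch can be tried on its own. The following is a small sketch under the assumption of a hypothetical helper named `countWordsBytewise`; it uses the same `byChar`, `splitter!isWhite` and `filter` chain as the code above and nothing else.

```d
import std.algorithm : filter, splitter;
import std.ascii : isWhite;
import std.range : walkLength;
import std.utf : byChar;

/// Hypothetical helper: counts ASCII-whitespace-separated words without
/// decoding UTF-8, so malformed input cannot raise a UTFException.
size_t countWordsBytewise(const(char)[] line)
{
    return line.byChar
               .splitter!isWhite            // split at every whitespace byte
               .filter!(w => w.length > 0)  // drop the empty "words" between
               .walkLength;                 // adjacent whitespace bytes
}
```

Because every single whitespace byte acts as a separator, a run like `" \t "` produces empty pieces, which is why the `filter` step is needed before counting.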
May 24
btiffin <btiffin myopera.com> writes:
Thanks for these hints.

I'm new here, but not so much to programming.  Been following D 
since, well 2007 or 8.  Awaiting the GCCing to gel.  Nice.  Still 
a little rough on Ubuntu 18.04, dub package seems to want ldc and 
dmd from dlang borks with a segfault in start (which is probably 
a my end problem and linker searches).  The pkg-config settings 
pass -L-l, and I'm not sure gdc is overly friendly with that.  
gtkD in Ubuntu is smoother with ldc too, but tis ok.  Choice is 
good, and having three installations is still pretty easy to 
explore.

Looking forward to more bragging about D.  An early integration 
trial with GnuCOBOL (2015ish) looked promising.  New ease of use 
is making that even more promising.

And a note to contributors.  Nicely done.  With some 10 million 
programmers, and what? 750 million excel macro writers assuming a 
life time average of a line of code an hour for professional 
programmers, we live in a field that evolves at 4ish million 
hours an hour.  Nice to see how a few of those hours can really 
make a difference.

Have good.
May 24
Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Monday, 24 May 2021 at 16:58:33 UTC, btiffin wrote:
> Hello,
>
> New here. A little background.  Old guy, program for both work
> and recreation, GNU maintainer for the GnuCOBOL package;
> written in C, compiles COBOL via C intermediates.  Fell into
> the role of maintainer mainly due to being a documentation
> writer and early on cheerleader.  Experienced in quite a few
> programming languages, with a "10,000ish hours in" definition
> of expert expertise in C, Forth and COBOL.
>
> [...]
Nice that you're also hopeful about the future for D ☀️ I also think slow steady growth is the best. We just need to focus and D will be great 🍀
May 25
btiffin <btiffin myopera.com> writes:
On Tuesday, 25 May 2021 at 18:03:29 UTC, Imperatorn wrote:
> Nice that you're also hopeful about the future for D ☀️ I also
> think slow steady growth is the best.
>
> We just need to focus and D will be great 🍀
D is already great. It just needs to take over (a larger chunk of) the world. ;-)

Now for some more noob level questions. I modified Robert's entry to include totals. (Gee, I hope I'm not stepping into copyright issues - the code has no explicit license, and I haven't yet figured out what the default assumption is in the dlang.org terms of service. If this is stepping out of bounds, I'll edit out the copied sources).

```d
/++
   wc word counting
+/

/+
   Tectonics:
     gdc -o wc wc.d
+/

module wc;

import std.stdio : writefln, File;
import std.algorithm : map, fold, splitter;
import std.range : walkLength;
import std.typecons : Yes;
import std.uni : byCodePoint;

struct Line {
    size_t chars;
    size_t words;
}

struct Output {
    size_t lines;
    size_t words;
    size_t chars;
}

Output combine(Output a, Line b) pure nothrow {
    return Output(a.lines + 1, a.words + b.words, a.chars + b.chars);
}

//Line toLine(char[] l) pure {
//    return Line(l.byCodePoint.walkLength, l.splitter.walkLength);
//}

Line toLine(char[] l) pure {
    import std.array : array;
    import std.algorithm : filter;
    import std.utf : byDchar, replacementDchar;

    auto valid = l.byDchar.filter!(c => c != replacementDchar).array;
    return Line(valid.byCodePoint.walkLength, valid.splitter.walkLength);
}

void main(string[] args) {
    Output tot;
    foreach (int i, string fn; args) {
        if (i == 0) continue;   // skip the program name
        auto f = File(fn);
        Output o = f
            .byLine(Yes.keepTerminator)
            .map!(l => toLine(l))
            .fold!(combine)(Output(0, 0, 0));

        writefln!"%7u %7u %11u %s"(o.lines, o.words, o.chars, fn);
        tot.lines += o.lines;
        tot.words += o.words;
        tot.chars += o.chars;
    }
    if (args.length > 2) {
        writefln!"%7u %7u %11u %s"(tot.lines, tot.words, tot.chars, "total");
    }
}
```

This is no longer a branchless piece of code, but still a joy to toyol with.

Oh, and a by the by. I use a portmanteau of *toy* and *tool* for *toyol*. A toy tool toiled for the joy of it. *because inventing words is fun*.
I also like to add a Tectonics line to sources; the commands used to build given the source listing, using this rather archaic definition of tectonics from the dict databases.

```text
"Tectonics" gcide "The Collaborative International Dictionary of English v.0.48"
Tectonics \Tec*ton"ics\, n.
   1. The science, or the art, by which implements, vessels,
      dwellings, or other edifices, are constructed, both
      agreeably to the end for which they are designed, and in
      conformity with artistic sentiments and ideas.
      [1913 Webster]
```

One last by the by, I usually go by the nickname Bluey, and sign as Blue.

The question. Is there a more concise D idiom for adding the Output struct fields into the total? Or is it three separate statements?

Ok, one more question. Can the foreach loop that scans the argument list be folded into a chain somehow? *I'm not there yet in D learning, but might as well try and choose an idiomatic path, if possible*.

Have good,
Blue
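One possible way to shorten the three `+=` statements, offered as a sketch and not as an answer given in the thread, is to let `Output` define `+=` itself. The `opOpAssign` overload below is an addition that the posted code does not have; it walks the fields with the built-in `.tupleof` property, so it keeps working if a field is added later.

```d
struct Output {
    size_t lines;
    size_t words;
    size_t chars;

    /// Assumed addition: field-by-field `+=` over all members.
    void opOpAssign(string op : "+")(Output rhs) {
        // The foreach over .tupleof is unrolled at compile time,
        // so `i` is a constant index into rhs.tupleof.
        foreach (i, ref field; this.tupleof)
            field += rhs.tupleof[i];
    }
}
```

With that in place the loop body could read `tot += o;`, and the same operator could serve as the step of a `fold` over per-file results if the loop is turned into a range chain.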
May 25