digitalmars.D.learn - coreutils with D trials, wc, binary vs well formed utf
- btiffin (28/28) May 24 2021 Hello,
- Dukc (14/17) May 24 2021 Rewrite `toLine`:
- Bastiaan Veelo (39/44) May 24 2021 [...]
- btiffin (20/20) May 24 2021 Thanks for these hints.
- Imperatorn (4/13) May 25 2021 Nice that you're also hopeful about the future for D ☀️ I also
- btiffin (91/94) May 25 2021 D is already great. It just needs to take over (a larger chunk
Hello,

New here. A little background. Old guy, program for both work and recreation, GNU maintainer for the GnuCOBOL package; written in C, compiles COBOL via C intermediates. Fell into the role of maintainer mainly due to being a documentation writer and early on cheerleader. Experienced in quite a few programming languages, with a "10,000ish hours in" definition of expert expertise in C, Forth and COBOL.

Assuming that, with gdc in GCC mainline now, D usage will continue to grow. Also of the opinion that slow, long tail growth is the best kind of growth. Not hype, not marketing, but adoption due to worthiness and merit. That is the current headspace. Want D to succeed, can't point to a specific why, just feel deep down that it should succeed and have an open ended relevant life span.

Thanks, Walter, Andrei, Iain, Ari, et al...

Just bumped into https://dlang.org/blog/2020/01/28/wc-in-d-712-characters-without-a-single-branch/ Way cool.

Then bumped into this:

prompt$ ./wc *
std.utf.UTFException /usr/lib/gcc/i686-linux-gnu/11/include/d/std/utf.d(1380): Invalid UTF-8 sequence (at index 1)

That was from an a.out file in the directory.

Early days, very limited D, so answers of "just set ..." will fly overhead; actual gdc command lines and noob jargon will sink in faster at this point.

Is there a(n easy-ish) way to fix up that wc.d source in the blog to fallback to byte stream mode when a utf-8 reader fails an encoding?

Have good, make well,
Brian
May 24 2021
> Is there a(n easy-ish) way to fix up that wc.d source in the blog to fallback to byte stream mode when a utf-8 reader fails an encoding?

Rewrite `toLine`:

```
Line toLine(char[] l) pure {
    import std.array : array;
    import std.algorithm : filter;
    import std.utf : byDchar, replacementDchar;

    // byDchar substitutes replacementDchar for malformed input; drop those.
    auto valid = l.byDchar.filter!(c => c!=replacementDchar).array;
    return Line(valid.byCodePoint.walkLength, valid.splitter.walkLength);
}
```

This just ignores malformed UTF without counting it in. It has one problem though: for some reason it seems to ignore one character after a malformed code unit. I don't know why.
May 24 2021
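The filtering idea in the post above relies on `std.utf.byDchar` decoding lazily and yielding `replacementDchar` (U+FFFD) instead of throwing when it meets malformed input. A minimal sketch of that behaviour in isolation follows; `dropMalformed` is a hypothetical helper name, not something from the blog source, and the sample bytes are made up for illustration.

```d
import std.algorithm : canFind, filter;
import std.array : array;
import std.utf : byDchar, replacementDchar;

// Hypothetical helper: decode without throwing and drop every code
// point that byDchar reported as replacementDchar.
dchar[] dropMalformed(const(char)[] raw)
{
    return raw.byDchar.filter!(c => c != replacementDchar).array;
}

void main()
{
    import std.stdio : writeln;

    // 0xFF never occurs in well-formed UTF-8, so this input is malformed.
    const(char)[] raw = ['o', 'k', cast(char) 0xFF, ' ', 'D'];

    // True when the decoder had to substitute at least one code point.
    writeln(raw.byDchar.canFind(replacementDchar));

    // Whatever survives the filter; nothing here throws UTFException.
    writeln(dropMalformed(raw));
}
```

One caveat with this approach: a U+FFFD that is legitimately present in well-formed input is filtered out as well, so the character count can come up short for such files.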
On Monday, 24 May 2021 at 16:58:33 UTC, btiffin wrote:
> [...]
> Just bumped into https://dlang.org/blog/2020/01/28/wc-in-d-712-characters-without-a-single-branch/
> [...]
> Is there a(n easy-ish) way to fix up that wc.d source in the blog to fallback to byte stream mode when a utf-8 reader fails an encoding?

Welcome, Brian. I have allowed myself to use exception handling and `filter`, which I regard to be no longer branch free. But it does (almost) produce the same output as GNU wc:

```d
Line toLine(char[] l) pure {
    import std.utf : UTFException, byChar;
    import std.ascii : isWhite;
    import std.algorithm : filter;

    try {
        // Well-formed UTF-8: count code points and whitespace-separated words.
        return Line(l.byCodePoint.walkLength, l.splitter.walkLength);
    }
    catch (UTFException) {
        // Malformed input: fall back to counting bytes and ASCII-delimited words.
        return Line(l.length,
                    l.byChar.splitter!(isWhite)
                     .filter!(w => w.length > 0).walkLength);
    }
}
```

The number of chars can be returned in O(1) by the `.length` property. Use of `byChar.splitter!(isWhite)` considers the ASCII values of the chars, but without the `filter` it counts too many words. The reason is that a mix of different white space characters causes problems (https://run.dlang.io/is/QzjTN0):

```d
writeln("Hello \t D".splitter!isWhite); // ["Hello", "", "", "D"]
writeln("Hello \t D".splitter); // ["Hello", "D"]
```

This surprises me, could be a bug. So `filter!(w => w.length > 0)` filters out the "words" with zero length...

Compared to GNU wc this reports one line too many for me, though. There may be more elegant solutions than mine.

-- Bastiaan.
May 24 2021
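For comparison with the exception-based fallback above, here is a minimal sketch of a pure "byte stream mode" count that never decodes UTF-8 at all. It is a swapped-in technique, not the blog's or either poster's method: lines are counted as newline bytes (which is how `wc -l` is specified, and possibly why a `byLine`-based count comes out one higher on a file whose last line has no terminator), words as runs of bytes that are not ASCII whitespace, and chars as raw bytes. `ByteCounts` and `countBytes` are hypothetical names.

```d
import std.ascii : isWhite;

// Hypothetical result type for the byte-oriented count.
struct ByteCounts
{
    size_t lines;
    size_t words;
    size_t bytes;
}

// Count newline bytes, whitespace-delimited words and raw bytes.
// No UTF-8 decoding happens, so malformed input cannot throw.
ByteCounts countBytes(const(ubyte)[] data)
{
    ByteCounts c;
    c.bytes = data.length;
    bool inWord = false;
    foreach (b; data)
    {
        if (b == '\n')
            ++c.lines;
        if (isWhite(cast(char) b))
            inWord = false;
        else if (!inWord)
        {
            // First non-whitespace byte after whitespace starts a new word.
            inWord = true;
            ++c.words;
        }
    }
    return c;
}

void main(string[] args)
{
    import std.file : read;
    import std.stdio : writefln;

    foreach (fn; args[1 .. $])
    {
        // read() loads the whole file; fine for a sketch, not for huge files.
        auto c = countBytes(cast(const(ubyte)[]) read(fn));
        writefln!"%7u %7u %11u %s"(c.lines, c.words, c.bytes, fn);
    }
}
```

Built with the same kind of command line as the rest of the thread (`gdc -o wcb wcb.d`, assuming the file is saved as `wcb.d`), this should be usable on binaries such as `a.out`, at the cost of reporting bytes rather than code points for well-formed text.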
Thanks for these hints. I'm new here, but not so much to programming. Been following D since, well, 2007 or 8. Awaiting the GCCing to gel. Nice.

Still a little rough on Ubuntu 18.04: the dub package seems to want ldc, and dmd from dlang borks with a segfault in start (which is probably a my end problem and linker searches). The pkg-config settings pass -L-l, and I'm not sure gdc is overly friendly with that. gtkD in Ubuntu is smoother with ldc too, but tis ok. Choice is good, and having three installations is still pretty easy to explore.

Looking forward to more bragging about D. An early integration trial with GnuCOBOL (2015ish) looked promising. New ease of use is making that even more promising.

And a note to contributors. Nicely done. With some 10 million programmers, and what? 750 million Excel macro writers, assuming a lifetime average of a line of code an hour for professional programmers, we live in a field that evolves at 4ish million hours an hour. Nice to see how a few of those hours can really make a difference.

Have good.
May 24 2021
On Monday, 24 May 2021 at 16:58:33 UTC, btiffin wrote:
> Hello, New here. A little background. Old guy, program for both work and recreation, GNU maintainer for the GnuCOBOL package; written in C, compiles COBOL via C intermediates. Fell into the role of maintainer mainly due to being a documentation writer and early on cheerleader. Experienced in quite a few programming languages, with a "10,000ish hours in" definition of expert expertise in C, Forth and COBOL. [...]

Nice that you're also hopeful about the future for D ☀️ I also think slow steady growth is the best. We just need to focus and D will be great 🍀
May 25 2021
On Tuesday, 25 May 2021 at 18:03:29 UTC, Imperatorn wrote:
> Nice that you're also hopeful about the future for D ☀️ I also think slow steady growth is the best. We just need to focus and D will be great 🍀

D is already great. It just needs to take over (a larger chunk of) the world. ;-)

Now for some more noob level questions. I modified Robert's entry to include totals. (Gee, I hope I'm not stepping into copyright issues - the code has no explicit license, and I haven't yet figured out what the default assumption is in the dlang.org terms of service. If this is stepping out of bounds, I'll edit out the copied sources).

```d
/++
   wc word counting
+/
/+
   Tectonics: gdc -o wc wc.d
+/
module wc;

import std.stdio : writefln, File;
import std.algorithm : map, fold, splitter;
import std.range : walkLength;
import std.typecons : Yes;
import std.uni : byCodePoint;

struct Line {
    size_t chars;
    size_t words;
}

struct Output {
    size_t lines;
    size_t words;
    size_t chars;
}

Output combine(Output a, Line b) pure nothrow {
    return Output(a.lines + 1, a.words + b.words, a.chars + b.chars);
}

//Line toLine(char[] l) pure {
//    return Line(l.byCodePoint.walkLength, l.splitter.walkLength);
//}

Line toLine(char[] l) pure {
    import std.array : array;
    import std.algorithm : filter;
    import std.utf : byDchar, replacementDchar;

    auto valid = l.byDchar.filter!(c => c!=replacementDchar).array;
    return Line(valid.byCodePoint.walkLength, valid.splitter.walkLength);
}

void main(string[] args) {
    Output tot;
    foreach (int i, string fn; args) {
        if (i == 0) continue;  // args[0] is the program name
        auto f = File(fn);
        Output o = f
            .byLine(Yes.keepTerminator)
            .map!(l => toLine(l))
            .fold!(combine)(Output(0, 0, 0));
        writefln!"%7u %7u %11u %s"(o.lines, o.words, o.chars, fn);
        tot.lines += o.lines;
        tot.words += o.words;
        tot.chars += o.chars;
    }
    // Print a total line only when more than one file was given.
    if (args.length > 2) {
        writefln!"%7u %7u %11u %s"(tot.lines, tot.words, tot.chars, "total");
    }
}
```

This is no longer a branchless piece of code, but still a joy to toyol with.

Oh, and a by the by.
I use a portmanteau of *toy* and *tool* for *toyol*. A toy tool toiled for the joy of it. *because inventing words is fun*.

I also like to add a Tectonics line to sources; the commands used to build given the source listing, using this rather archaic definition of tectonics from the dict databases.

```text
"Tectonics" gcide "The Collaborative International Dictionary of English v.0.48"
Tectonics \Tec*ton"ics\, n.
   1. The science, or the art, by which implements, vessels,
      dwellings, or other edifices, are constructed, both
      agreeably to the end for which they are designed, and in
      conformity with artistic sentiments and ideas.
      [1913 Webster]
```

One last by the by, I usually go by the nickname Bluey, and sign as Blue.

The question. Is there a more concise D idiom for adding the Output struct fields into the total? Or is it three separate statements?

Ok, one more question. Can the foreach loop that scans the argument list be folded into a chain somehow? *I'm not there yet in D learning, but might as well try and choose an idiomatic path, if possible*.

Have good,
Blue
May 25 2021
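On the two closing questions, one possible direction (a sketch, not a reply from the thread) is to give `Output` an `opOpAssign` overload so that `tot += o` adds all fields at once, and then to let `fold` do the summing over any range of per-file results. The overload below walks the fields with `.tupleof`, so it keeps working if a field is added later. `totalOf` is a hypothetical helper name, and the sample numbers in `main` are made up.

```d
import std.algorithm : fold;

struct Output
{
    size_t lines;
    size_t words;
    size_t chars;

    // Enables `a += b`: add each field of rhs to the matching field here.
    // tupleof visits the fields in declaration order at compile time.
    void opOpAssign(string op : "+")(Output rhs)
    {
        foreach (i, ref field; this.tupleof)
            field += rhs.tupleof[i];
    }
}

// Hypothetical helper: sum any input range of Output values with fold.
Output totalOf(R)(R parts)
{
    return parts.fold!((acc, o) { acc += o; return acc; })(Output.init);
}

void main()
{
    import std.stdio : writefln;

    // Stand-ins for the per-file results the real program computes.
    auto perFile = [Output(3, 10, 57), Output(1, 4, 20)];

    // One statement per file instead of three.
    Output tot;
    foreach (o; perFile)
        tot += o;

    // The same sum expressed as a chain.
    auto folded = totalOf(perFile);
    assert(folded == tot);

    writefln!"%7u %7u %11u %s"(tot.lines, tot.words, tot.chars, "total");
}
```

In the posted program the argument loop could then plausibly become `args[1 .. $].map!(fn => countFile(fn))` fed into the same fold, where `countFile` is a hypothetical function wrapping the existing `byLine`/`map`/`fold` pipeline. Printing each file's line from inside such a chain mixes side effects into it, which is a design choice some would rather keep in an explicit `foreach`.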