digitalmars.D.learn - coreutils with D trials, wc, binary vs well formed utf

btiffin <btiffin myopera.com> writes:

Hello,

New here. A little background.  Old guy, program for both work 
and recreation, GNU maintainer for the GnuCOBOL package; written 
in C, compiles COBOL via C intermediates.  Fell into the role of 
maintainer mainly due to being a documentation writer and early 
on cheerleader.  Experienced in quite a few programming 
languages, with a "10,000ish hours in" definition of expert 
expertise in C, Forth and COBOL.

Assuming that, with gdc in GCC mainline now, D usage will 
continue to grow.  Also of the opinion that slow, long tail 
growth is the best kind of growth.  Not hype, not marketing, but 
adoption due to worthiness and merit.  That is the current 
headspace.  Want D to succeed, can't point to a specific why, 
just feel deep down that it should succeed and have an open ended 
relevant life span.  Thanks, Walter, Andrei, Iain, Ari, et al...

Just bumped into 
https://dlang.org/blog/2020/01/28/wc-in-d-712-characters-without-a-single-branch/

Way cool.  Then bumped into this:

prompt$ ./wc *
std.utf.UTFException /usr/lib/gcc/i686-linux-gnu/11/include/d/std/utf.d(1380):
Invalid UTF-8 sequence (at index 1)

That was from an a.out file in the directory.  Early days, very 
limited D, so answers of "just set ...", will fly over head, 
actual gdc command lines and noob jargon will sink in faster at 
this point.  Is there a(n easy-ish) way to fix up that wc.d 
source in the blog to fallback to byte stream mode when a utf-8 
reader fails an encoding?

Have good, make well,
Brian

May 24 2021

Dukc <ajieskola gmail.com> writes:

 Is there a(n easy-ish) way to fix up that wc.d source in the 
 blog to fallback to byte stream mode when a utf-8 reader fails 
 an encoding?

Rewrite `toLine`:

```d
Line toLine(char[] l) pure
{
    import std.array : array;
    import std.algorithm : filter;
    import std.utf : byDchar, replacementDchar;

    auto valid = l.byDchar.filter!(c => c != replacementDchar).array;
    return Line(valid.byCodePoint.walkLength, valid.splitter.walkLength);
}
```

This just ignores malformed UTF without counting it. It has 
one problem though: for some reason it also seems to skip one 
character after a malformed code unit. I don't know why.
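
A minimal sketch of the mechanism the rewrite relies on: `std.utf.byDchar` substitutes `replacementDchar` (U+FFFD) for malformed input instead of throwing, whereas a strict check such as `std.utf.validate` raises `UTFException`. The helper name `hasMalformed` is hypothetical and used only for illustration.

```d
import std.algorithm : canFind;
import std.exception : assertThrown;
import std.utf : byDchar, replacementDchar, UTFException, validate;

// True if lazy decoding had to substitute U+FFFD somewhere in `s`.
// (Hypothetical helper, not part of the blog's wc.d.)
bool hasMalformed(const(char)[] s)
{
    return s.byDchar.canFind(replacementDchar);
}

void main()
{
    // The byte 0xFF never occurs in well-formed UTF-8.
    string bad = "a\xffb";

    assertThrown!UTFException(validate(bad)); // strict check throws
    assert(hasMalformed(bad));                // lazy decoder substitutes instead
    assert(!hasMalformed("plain ascii"));
}
```

Note that text which legitimately contains U+FFFD is indistinguishable from malformed input under this approach, so filtering on `replacementDchar` would drop those characters too.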

May 24 2021

Bastiaan Veelo <Bastiaan Veelo.net> writes:

On Monday, 24 May 2021 at 16:58:33 UTC, btiffin wrote:
[...]
 Just bumped into 
 https://dlang.org/blog/2020/01/28/wc-in-d-712-characters-without-a-single-branch/

[...]

 Is there a(n easy-ish) way to fix up that wc.d source in the 
 blog to fallback to byte stream mode when a utf-8 reader fails 
 an encoding?

Welcome, Brian.

I have allowed myself to use exception handling and `filter`, 
which I regard as no longer branch-free. But it does (almost) 
produce the same output as GNU wc:
```d
Line toLine(char[] l) pure {
    import std.utf : UTFException, byChar;
    import std.ascii : isWhite;
    import std.algorithm : filter;
    try {
        return Line(l.byCodePoint.walkLength, l.splitter.walkLength);
    }
    catch (UTFException) {
        return Line(l.length,
                    l.byChar.splitter!(isWhite)
                     .filter!(w => w.length > 0).walkLength);
    }
}
```
The number of chars can be returned in O(1) by the `.length` 
property. Use of `byChar.splitter!(isWhite)` considers the ASCII 
values of the chars, but without the `filter` it counts too many 
words. The reason is that a mix of different white space 
characters causes problems (https://run.dlang.io/is/QzjTN0):
```d
     writeln("Hello \t D".splitter!isWhite); // ["Hello", "", "", "D"]
     writeln("Hello \t D".splitter);         // ["Hello", "D"]
```
This surprises me, could be a bug.

So `filter!(w => w.length > 0)` filters out the "words" with zero 
length...
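
A minimal sketch of the three variants side by side. As documented for `std.algorithm.iteration.splitter`, the predicate form treats every matching element as a terminator, while the no-argument form merges runs of whitespace. The function names `splitEach`, `splitRuns` and `splitEachNonEmpty` are hypothetical, chosen only for this illustration.

```d
import std.algorithm : filter, splitter;
import std.array : array;
import std.ascii : isWhite;

// Predicate form: each whitespace character ends a field, so
// adjacent whitespace characters leave empty fields between them.
string[] splitEach(string s)
{
    return s.splitter!isWhite.array;
}

// No-argument form: a run of whitespace counts as one separator.
string[] splitRuns(string s)
{
    return s.splitter.array;
}

// Predicate form with the empty fields filtered out again.
string[] splitEachNonEmpty(string s)
{
    return s.splitter!isWhite.filter!(w => w.length > 0).array;
}

void main()
{
    assert(splitEach("Hello \t D") == ["Hello", "", "", "D"]);
    assert(splitRuns("Hello \t D") == ["Hello", "D"]);
    assert(splitEachNonEmpty("Hello \t D") == ["Hello", "D"]);
}
```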

Compared to GNU wc this reports one line too many for me, though.
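
A possible explanation, not confirmed in the thread: POSIX defines the `wc` line count as the number of newline characters, while `byLine` also yields a final fragment that has no terminating newline, which is common at the end of binary files. A minimal sketch of the newline-counting variant follows; `newlineCount` is a hypothetical helper name.

```d
import std.algorithm : count;
import std.utf : byCodeUnit;

// Number of '\n' code units in `data`. Nothing is decoded, so
// malformed UTF-8 cannot make this throw.
size_t newlineCount(const(char)[] data)
{
    return data.byCodeUnit.count('\n');
}

void main()
{
    assert(newlineCount("one\ntwo\n") == 2);
    // An unterminated last line does not add to the count, whereas
    // iterating by line would still see two lines here.
    assert(newlineCount("one\ntwo") == 1);
}
```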

There may be more elegant solutions than mine.

-- Bastiaan.

May 24 2021

btiffin <btiffin myopera.com> writes:

Thanks for these hints.

I'm new here, but not so much to programming.  Been following D 
since, well 2007 or 8.  Awaiting the GCCing to gel.  Nice.  Still 
a little rough on Ubuntu 18.04, dub package seems to want ldc and 
dmd from dlang borks with a segfault in start (which is probably 
a my end problem and linker searches).  The pkg-config settings 
pass -L-l, and I'm not sure gdc is overly friendly with that.  
gtkD in Ubuntu is smoother with ldc too, but tis ok.  Choice is 
good, and having three installations is still pretty easy to 
explore.

Looking forward to more bragging about D.  An early integration 
trial with GnuCOBOL (2015ish) looked promising.  New ease of use 
is making that even more promising.

And a note to contributors.  Nicely done.  With some 10 million 
programmers (and what, 750 million Excel macro writers?), and 
assuming a lifetime average of a line of code an hour for 
professional programmers, we live in a field that evolves at 4ish 
million hours an hour.  Nice to see how a few of those hours can 
really make a difference.

Have good.

May 24 2021

Imperatorn <johan_forsberg_86 hotmail.com> writes:

On Monday, 24 May 2021 at 16:58:33 UTC, btiffin wrote:
 Hello,

 New here. A little background.  Old guy, program for both work 
 and recreation, GNU maintainer for the GnuCOBOL package; 
 written in C, compiles COBOL via C intermediates.  Fell into 
 the role of maintainer mainly due to being a documentation 
 writer and early on cheerleader.  Experienced in quite a few 
 programming languages, with a "10,000ish hours in" definition 
 of expert expertise in C, Forth and COBOL.

 [...]

Nice that you're also hopeful about the future for D ☀️ I also 
think slow steady growth is the best.

We just need to focus and D will be great 🍀

May 25 2021

btiffin <btiffin myopera.com> writes:

On Tuesday, 25 May 2021 at 18:03:29 UTC, Imperatorn wrote:

 Nice that you're also hopeful about the future for D ☀️ I also 
 think slow steady growth is the best.

 We just need to focus and D will be great 🍀

D is already great. It just needs to take over (a larger chunk 
of) the world. ;-)

Now for some more noob level questions.  I modified Robert's 
entry to include totals.

(Gee, I hope I'm not stepping into copyright issues - the code 
has no explicit license, and I haven't yet figured out what 
the default assumption is in the dlang.org terms of service.  If 
this is stepping out of bounds, I'll edit out the copied sources).

```d
/++ wc word counting +/
/+ Tectonics: gdc -o wc wc.d +/
module wc;

import std.stdio : writefln, File;
import std.algorithm : map, fold, splitter;
import std.range : walkLength;
import std.typecons : Yes;
import std.uni : byCodePoint;

struct Line {
    size_t chars;
    size_t words;
}

struct Output {
    size_t lines;
    size_t words;
    size_t chars;
}

Output combine(Output a, Line b) pure nothrow {
    return Output(a.lines + 1, a.words + b.words, a.chars + b.chars);
}

//Line toLine(char[] l) pure {
//   return Line(l.byCodePoint.walkLength, l.splitter.walkLength);
//}

Line toLine(char[] l) pure {
    import std.array : array;
    import std.algorithm : filter;
    import std.utf : byDchar, replacementDchar;

    auto valid = l.byDchar.filter!(c => c != replacementDchar).array;
    return Line(valid.byCodePoint.walkLength, valid.splitter.walkLength);
}

void main(string[] args) {
    Output tot;

    foreach (int i, string fn; args) {
        if (i == 0) continue;
        auto f = File(fn);
        Output o = f
          .byLine(Yes.keepTerminator)
          .map!(l => toLine(l))
          .fold!(combine)(Output(0, 0, 0));

        writefln!"%7u %7u %11u %s"(o.lines, o.words, o.chars, fn);

        tot.lines += o.lines;
        tot.words += o.words;
        tot.chars += o.chars;
    }
    if (args.length > 2) {
        writefln!"%7u %7u %11u %s"(tot.lines, tot.words, tot.chars, "total");
    }
}
```

This is no longer a branchless piece of code, but still a joy to 
toyol with.

Oh, and a by the by.  I use a portmanteau of *toy* and *tool* for 
*toyol*.  A toy tool toiled for the joy of it. *because inventing 
words is fun*.

I also like to add a Tectonics line to sources; the commands used 
to build given the source listing, using this rather archaic 
definition of tectonics from the dict databases.

```text
"Tectonics" gcide "The Collaborative International Dictionary of 
English v.0.48"
Tectonics \Tec*ton"ics\, n.
1. The science, or the art, by which implements, vessels,
dwellings, or other edifices, are constructed, both
agreeably to the end for which they are designed, and in
conformity with artistic sentiments and ideas.
[1913 Webster]
```

One last by the by, I usually go by the nickname Bluey, and sign 
as Blue.

The question.  Is there a more concise D idiom for adding the 
Output struct fields into the total?  Or is it three separate 
statements?

Ok, one more question.  Can the foreach loop that scans the 
argument list be folded into a chain somehow?  *I'm not there yet 
in D learning, but might as well try and choose an idiomatic 
path, if possible*.

Have good,
Blue

May 25 2021
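
For the two closing questions, one possible direction (a sketch, not an answer given in the thread): giving `Output` an `opBinary!"+"` lets the totals be summed in one expression, and the argument list can then go through the same `map`/`fold` chain as the lines of a file. The names `countLine`, `countFile` and `report` are hypothetical. `countLine` is a stand-in that counts bytes and ASCII-whitespace-separated words, so its character column differs from the code-point count in the listing above.

```d
import std.algorithm : filter, fold, map, splitter;
import std.ascii : isWhite;
import std.range : walkLength;
import std.stdio : File, writefln;
import std.typecons : Yes;
import std.utf : byChar;

struct Output
{
    size_t lines, words, chars;

    // Field-wise sum, so two results combine with a plain `+`.
    Output opBinary(string op : "+")(Output rhs) const
    {
        return Output(lines + rhs.lines, words + rhs.words, chars + rhs.chars);
    }
}

// Assumption: bytes and ASCII-whitespace words only, so malformed
// UTF-8 cannot throw. Any `toLine` from the thread could be adapted here.
Output countLine(const(char)[] l)
{
    auto words = l.byChar.splitter!isWhite.filter!(w => w.length > 0).walkLength;
    return Output(1, words, l.length);
}

Output countFile(string fn)
{
    return File(fn)
        .byLine(Yes.keepTerminator)
        .map!(l => countLine(l))
        .fold!((a, b) => a + b)(Output.init);
}

// Prints one row and passes the counts through for the fold.
Output report(string name, Output o)
{
    writefln!"%7u %7u %11u %s"(o.lines, o.words, o.chars, name);
    return o;
}

void main(string[] args)
{
    auto files = args[1 .. $];
    auto total = files
        .map!(fn => report(fn, countFile(fn)))
        .fold!((a, b) => a + b)(Output.init);
    if (files.length > 1)
        report("total", total);
}
```

Printing from inside `map` leans on `fold` consuming each element exactly once. A plain `foreach (fn; args[1 .. $])` with `tot = tot + o` is just as valid and arguably easier to read.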
