digitalmars.D.learn - block file reads and lazy utf-8 decoding

Jon D <jond noreply.com> writes:
I want to combine block reads with lazy conversion of utf-8 
characters to dchars. The solution I came up with is in the program 
below. It works fine and has good performance.

My question is whether there is a better way to do this, for 
example a different way to construct the lazy 'decodeUTF8Range' 
than writing it out in this fashion. There is quite a bit 
of power in the library and I'm still learning it. I'm wondering 
if I overlooked a useful alternative.

--Jon

Program:
-----------

import std.algorithm: each, joiner, map;
import std.conv;
import std.range;
import std.stdio;
import std.traits;
import std.utf: decodeFront;

auto decodeUTF8Range(Range)(Range charSource)
     if (isInputRange!Range && is(Unqual!(ElementType!Range) == char))
{
     static struct Result
     {
         private Range source;
         private dchar next;

         bool empty = false;
         @property dchar front() { return next; }
         void popFront() {
             if (source.empty) {
                 empty = true;
                 next = dchar.init;
             } else {
                 next = source.decodeFront;
             }
         }
     }
     auto r = Result(charSource);
     r.popFront;
     return r;
}

void main(string[] args)
{
     if (args.length != 2) { writeln("Provide one file name."); return; }

     ubyte[1024*1024] rawbuf;
     auto inputStream = args[1].File();
     inputStream
         .byChunk(rawbuf)        // Read in blocks
         .joiner                 // Join the blocks into a single input range
         .map!(a => to!char(a))  // Cast ubyte to char for decodeFront. Any better ways?
         .decodeUTF8Range        // utf8 to dchar conversion.
         .each;                  // Real work goes here.
     writeln("done");
}
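
For comparison, here is a minimal sketch of one possible alternative to the hand-written wrapper. It assumes that std.utf.byDchar is available in the Phobos release in use and that a plain cast from ubyte to char is acceptable. The helper name 'decodeChunks' is hypothetical, and an in-memory array of chunks stands in for File.byChunk so the example runs without a file.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : joiner, map;
import std.utf : byDchar;

/// Hypothetical helper: lazily decodes UTF-8 from a range of ubyte chunks.
auto decodeChunks(Chunks)(Chunks chunks)
{
    return chunks
        .joiner                    // flatten the chunks into one ubyte range
        .map!(b => cast(char) b)   // reinterpret each byte as a UTF-8 code unit
        .byDchar;                  // lazy utf8 to dchar conversion
}

void main()
{
    // Two chunks that split the two-byte encoding of 'é' (0xC3 0xA9)
    // across a chunk boundary.
    ubyte[][] chunks = [[0x68, 0xC3], [0xA9, 0x21]];
    assert(decodeChunks(chunks).equal("hé!"d));
}
```

With a file, the same chain could start from something like File(name).byChunk(1024 * 1024). One difference to keep in mind: byDchar substitutes the replacement character U+FFFD for invalid sequences, whereas decodeFront throws a UTFException by default.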
Dec 09 2015
Jon D <jond noreply.com> writes:
On Thursday, 10 December 2015 at 00:36:27 UTC, Jon D wrote:
> Question I have is if there is a better way to do this. For 
> example, a different way to construct the lazy 
> 'decodeUTF8Range' rather than writing it out in this fashion.
A further thought: the decodeUTF8Range function is basically constructing a lazy wrapper range around decodeFront, which effectively combines a 'front' and a 'popFront' operation. So perhaps there is a generic way to compose a wrapper for such functions.
> auto decodeUTF8Range(Range)(Range charSource)
>      if (isInputRange!Range && is(Unqual!(ElementType!Range) == char))
> {
>      static struct Result
>      {
>          private Range source;
>          private dchar next;
>
>          bool empty = false;
>          @property dchar front() { return next; }
>          void popFront() {
>              if (source.empty) {
>                  empty = true;
>                  next = dchar.init;
>              } else {
>                  next = source.decodeFront;
>              }
>          }
>      }
>      auto r = Result(charSource);
>      r.popFront;
>      return r;
> }
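
To make the generic-wrapper idea concrete, here is a rough sketch of what such a helper might look like. The names 'ConsumingRange' and 'consumingRange' are hypothetical and not part of Phobos. The sketch assumes the wrapped function takes the source range by ref, consumes at least one element per call, and returns the next value, as decodeFront does.

```d
import std.range.primitives : empty, isInputRange;

/// Hypothetical wrapper: each popFront calls `fun`, which both consumes
/// from `source` and produces the next element of type E.
struct ConsumingRange(alias fun, E, Range)
{
    private Range source;
    private E next;

    bool empty = false;
    @property E front() { return next; }
    void popFront()
    {
        if (source.empty)
            empty = true;
        else
            next = fun(source);
    }
}

/// Hypothetical helper that infers the element type from `fun`.
auto consumingRange(alias fun, Range)(Range source)
    if (isInputRange!Range)
{
    alias E = typeof(fun(source));
    auto r = ConsumingRange!(fun, E, Range)(source);
    r.popFront();   // prime the first element
    return r;
}

void main()
{
    import std.algorithm.comparison : equal;
    import std.utf : byCodeUnit, decodeFront;

    // byCodeUnit yields a char range without auto-decoding.
    auto decoded = "héllo".byCodeUnit
        .consumingRange!((ref r) => r.decodeFront);
    assert(decoded.equal("héllo"d));
}
```

With this, decodeUTF8Range reduces to a one-line call, and the same helper could wrap any other function that combines 'front' and 'popFront' in one step.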
Dec 09 2015