digitalmars.D - UTF8 and unary encoding

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

While looking at https://en.wikipedia.org/wiki/Unary_coding I found that 
UTF8 uses unary encoding for the length of multibyte sequences. 
Investigating further at https://en.wikipedia.org/wiki/UTF-8 reveals 
that indeed "The number of high-order 1s in the leading byte of a 
multi-byte sequence indicates the number of bytes in the sequence. When 
reading from a stream, a reader can process all fully received sequences 
without first having to wait for either the leading byte of a next 
sequence or an end-of-stream indication."

We don't use that explicitly; instead, we load each byte of 
multibyte sequences. Who'd be interested in looking into whether Phobos' 
primitives can be made faster on multibyte-rich text?


Andrei
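
[Editor's sketch] The unary length prefix described above can be illustrated in a few lines of D. This is only a minimal sketch, not Phobos code: utf8SequenceLength is a hypothetical helper name, and it assumes the caller only wants the length implied by the lead byte, without validating overlong forms (0xC0, 0xC1) or out-of-range lead bytes (0xF5 to 0xF7).

```d
import std.stdio : writeln;

/// Hypothetical helper (not part of Phobos): returns the sequence length
/// implied by a UTF-8 lead byte, or 0 if the byte cannot start a sequence.
/// No further validation is attempted.
uint utf8SequenceLength(ubyte lead) pure nothrow @nogc @safe
{
    // Count the high-order 1 bits; this run is the unary-coded length.
    uint ones = 0;
    for (uint mask = 0x80; mask != 0 && (lead & mask) != 0; mask >>= 1)
        ++ones;

    if (ones == 0)
        return 1;               // 0xxxxxxx: single-byte (ASCII) sequence
    if (ones == 1 || ones > 4)
        return 0;               // 10xxxxxx continuation byte, or 11111xxx
    return ones;                // 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4
}

void main()
{
    writeln(utf8SequenceLength(0x41)); // 'A', plain ASCII
    writeln(utf8SequenceLength(0xC3)); // lead byte of U+00E9
    writeln(utf8SequenceLength(0xE2)); // lead byte of U+20AC
    writeln(utf8SequenceLength(0xF0)); // lead byte of U+1F600
    writeln(utf8SequenceLength(0x80)); // continuation byte, not a lead byte
}
```

Only the first byte is inspected, so a reader knows how many bytes to wait for or skip before it has looked at any continuation byte.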

Sep 12 2016

Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Monday, September 12, 2016 07:37:05 Andrei Alexandrescu via Digitalmars-d 
wrote:
 While looking at https://en.wikipedia.org/wiki/Unary_coding I found that
 UTF8 uses unary encoding for the length of multibyte sequences.
 Investigating further at https://en.wikipedia.org/wiki/UTF-8 reveals
 that indeed "The number of high-order 1s in the leading byte of a
 multi-byte sequence indicates the number of bytes in the sequence. When
 reading from a stream, a reader can process all fully received sequences
 without first having to wait for either the leading byte of a next
 sequence or an end-of-stream indication."

 We don't use that explicitly; instead, we load each byte of
 multibyte sequences. Who'd be interested in looking into whether Phobos'
 primitives can be made faster on multibyte-rich text?

Aren't we already doing that with stride? It reads the number of bytes in a
code point from the first code unit and then if we're dealing with a random
access range of char or an array of char, then we skip that many code units
without reading them. Because we auto-decode in many cases, all of the
bytes still get read in places where they wouldn't need to be if we were
operating on ranges of char. Where we aren't auto-decoding, though, we
should already be taking advantage of this via stride (though obviously,
there could be specific places where the code is not skipping bytes like
it should).

Or am I misunderstanding what you're talking about doing here?

- Jonathan M Davis
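
[Editor's sketch] To make the stride point concrete, here is a small example that walks a char array with std.utf.stride, hopping from lead byte to lead byte. countCodePoints is a hypothetical name chosen for illustration, the sample string is arbitrary, and the loop assumes well-formed UTF-8 (stride throws a UTFException on an invalid lead byte).

```d
import std.stdio : writeln;
import std.utf : stride;

/// Hypothetical helper: counts code points by asking stride for the length
/// of the sequence starting at index i and jumping ahead by that amount.
/// This loop never indexes the continuation bytes itself.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    for (size_t i = 0; i < s.length; i += stride(s, i))
        ++n;
    return n;
}

void main()
{
    // One 1-byte, one 2-byte, one 3-byte and one 4-byte sequence.
    string s = "a\u00E9\u20AC\U0001F600";
    writeln(s.length);           // code units: 1 + 2 + 3 + 4 = 10
    writeln(countCodePoints(s)); // code points: 4
}
```

This operates on the raw char array rather than on auto-decoded dchar elements, which is the case where the skip described above applies.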

Sep 12 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/12/16 11:59 AM, Jonathan M Davis via Digitalmars-d wrote:
 Aren't we already doing that with stride? It reads the number of bytes in a
 code point from the first code unit and then if we're dealing with a random
 access range of char or an array of char, then we skip that many code units
 without reading them.

Oh, ok. I'd either forgotten or the code has been improved since I last 
looked at it. Thanks! -- Andrei

Sep 12 2016
