digitalmars.D - UTF8 and unary encoding

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
While looking at https://en.wikipedia.org/wiki/Unary_coding I found that 
UTF8 uses unary encoding for the length of multibyte sequences. 
Investigating further at https://en.wikipedia.org/wiki/UTF-8 reveals 
that indeed "The number of high-order 1s in the leading byte of a 
multi-byte sequence indicates the number of bytes in the sequence. When 
reading from a stream, a reader can process all fully received sequences 
without first having to wait for either the leading byte of a next 
sequence or an end-of-stream indication."
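
To make the quoted property concrete, here is a minimal C sketch (not Phobos code) that recovers the sequence length by counting the leading 1 bits of the first byte. The function name utf8_seq_len is hypothetical, and the sketch assumes the modern four-byte limit of UTF-8; it does not reject overlong or out-of-range lead bytes such as 0xC0 or 0xF5.

```c
/* Hypothetical helper, for illustration only.
 * Returns the length (1..4) of the UTF-8 sequence that starts with `lead`,
 * or 0 if `lead` cannot start a sequence (a 10xxxxxx continuation byte,
 * or a byte with five or more leading 1 bits). */
static int utf8_seq_len(unsigned char lead)
{
    int ones = 0;

    /* Unary decoding: count the high-order 1 bits up to the first 0 bit. */
    while (ones < 8 && (lead & (0x80u >> ones)))
        ones++;

    if (ones == 0)
        return 1;            /* 0xxxxxxx: single-byte (ASCII) sequence */
    if (ones >= 2 && ones <= 4)
        return ones;         /* 110xxxxx, 1110xxxx, 11110xxx */
    return 0;                /* continuation byte or invalid lead byte */
}
```

For example, 0xC3 (11000011) has two leading 1 bits and so starts a two-byte sequence, while 0x80 (10000000) is a continuation byte and yields 0.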

We don't use that explicitly; instead, we load each byte of 
multibyte sequences. Who'd be interested in looking into whether Phobos' 
primitives can be made faster on multibyte-rich text?


Andrei
Sep 12 2016
Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Monday, September 12, 2016 07:37:05 Andrei Alexandrescu via Digitalmars-d 
wrote:
 While looking at https://en.wikipedia.org/wiki/Unary_coding I found that
 UTF8 uses unary encoding for the length of multibyte sequences.
 Investigating further at https://en.wikipedia.org/wiki/UTF-8 reveals
 that indeed "The number of high-order 1s in the leading byte of a
 multi-byte sequence indicates the number of bytes in the sequence. When
 reading from a stream, a reader can process all fully received sequences
 without first having to wait for either the leading byte of a next
 sequence or an end-of-stream indication."

 We don't use that explicitly; instead, we load each byte of
 multibyte sequences. Who'd be interested in looking into whether Phobos'
 primitives can be made faster on multibyte-rich text?
Aren't we already doing that with stride? It reads the number of bytes in
a code point from the first code unit, and if we're dealing with a
random-access range of char or an array of char, we then skip that many
code units without reading them.

Because we auto-decode in many cases, all of the bytes do get read in
situations where they wouldn't need to be if we were dealing with ranges
of char. Where we aren't auto-decoding, though, we should already be
taking advantage of this in general via stride (though obviously there
could be specific places where the code is not skipping bytes like it
should).

Or am I misunderstanding what you're talking about doing here?

- Jonathan M Davis
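
The skipping described above can be sketched in C as follows. This is only an illustration of the idea, not the Phobos implementation: stride_of and count_code_points are hypothetical names, the input is assumed to be a plain byte array, and malformed bytes are simply stepped over one at a time rather than reported.

```c
#include <stddef.h>

/* Hypothetical helper: number of code units in the sequence whose first
 * byte is `b`. Bytes that cannot start a sequence yield 1 so that the
 * caller always makes progress. */
static size_t stride_of(unsigned char b)
{
    if (b < 0x80)           return 1;  /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;                          /* stray continuation or invalid lead */
}

/* Counts code points while reading only the first byte of each sequence;
 * the continuation bytes are skipped by index arithmetic and never loaded.
 * A sequence truncated at the end of the buffer still counts as one. */
static size_t count_code_points(const unsigned char *s, size_t len)
{
    size_t count = 0;
    size_t i = 0;

    while (i < len) {
        i += stride_of(s[i]);
        count++;
    }
    return count;
}
```

On the six bytes of "h\xC3\xA9llo" this reads five lead bytes and returns 5; the continuation byte 0xA9 is never examined. Auto-decoding, by contrast, has to load every byte in order to produce the code point value.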
Sep 12 2016
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/12/16 11:59 AM, Jonathan M Davis via Digitalmars-d wrote:
 Aren't we already doing that with stride? It reads the number of bytes in a
 code point from the first code unit and then if we're dealing with a random
 access range of char or an array of char, then we skip that many code units
 without reading them.
Oh, ok. I'd either forgotten or the code has been improved since I last looked at it. Thanks! -- Andrei
Sep 12 2016