digitalmars.D.learn - How to detect start of Unicode symbol and count amount of graphemes
- Uranuz (32/32) Oct 05 2014 I have struct StringStream that I use to go through and parse
- monarch_dodra (7/20) Oct 05 2014 You can use std.uni.byGrapheme to iterate by graphemes:
- Uranuz (5/11) Oct 05 2014 Maybe there is some idea how to just detect first code unit of
- Jacob Carlborg (7/11) Oct 05 2014 Have a look here [1]. For example, if you have a byte that is between
- Uranuz (13/17) Oct 06 2014 Thanks. I solved it myself already for UTF-8 encoding. There
- ketmar via Digitalmars-d-learn (5/8) Oct 06 2014 On Mon, 06 Oct 2014 17:28:43 +0000
- H. S. Teoh via Digitalmars-d-learn (17/40) Oct 06 2014 This looks wrong to me. Are you sure this finds *all* possible
- Jacob Carlborg (5/7) Oct 06 2014 No, the data I gave was to detect a complete code unit. Graphemes are
- H. S. Teoh via Digitalmars-d-learn (6/13) Oct 07 2014 [...]
- anonymous (16/26) Oct 06 2014 I think your idea of graphemes is off.
- Kagamin (4/9) Oct 06 2014 Are you trying to split strings? If you want to optimize usage of
- Nicolas F. (11/11) Oct 06 2014 Unicode is hard to deal with properly as how you deal with it is
I have a struct StringStream that I use to walk through and parse an input string. The string can be of type string, wstring or dstring. I am implementing a function popChar that reads one code unit from the stream. I want to have a *debug* mode for the parser (via a compile-time switch) in which I can get the lineIndex, codeUnitIndex and graphemeIndex. I don't want to use the *front* primitive because it autodecodes everywhere, but in debug mode I want the index of the *user-perceived character* (so decoding is needed there).

The question is: how do I detect that I have moved from one Unicode grapheme to the next when iterating over a string, wstring or dstring by code unit? Is it simple, or is it an attempt to reimplement a big piece of existing standard library code? As a result I should just increment an internal graphemeIndex.

A short version of the implementation I want follows:

    struct StringStream(String)
    {
        String str;
        size_t index;
        size_t graphemeIndex;

        auto popChar()
        {
            index++;
            if( ??? ) //How to detect new grapheme?
            {
                graphemeIndex++;
            }
            return str[index];
        }
    }

Sorry for the very simple question. I just have a mess in my head about Unicode and D strings.
Oct 05 2014
On Sunday, 5 October 2014 at 08:27:58 UTC, Uranuz wrote:
> I have a struct StringStream that I use to walk through and parse an input string. The string can be of type string, wstring or dstring. I am implementing a function popChar that reads one code unit from the stream. I want to have a *debug* mode for the parser (via a compile-time switch) in which I can get the lineIndex, codeUnitIndex and graphemeIndex. I don't want to use the *front* primitive because it autodecodes everywhere, but in debug mode I want the index of the *user-perceived character* (so decoding is needed there).
>
> The question is: how do I detect that I have moved from one Unicode grapheme to the next when iterating over a string, wstring or dstring by code unit? Is it simple, or is it an attempt to reimplement a big piece of existing standard library code?

You can use std.uni.byGrapheme to iterate by graphemes.

AFAIK, graphemes are not "self synchronizing", but codepoints are. You can pop code units until you reach the beginning of a new codepoint. From there, you can iterate by graphemes, though your first grapheme might be off.
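A minimal sketch of how the std.uni suggestion could be applied to the StringStream from the question. It assumes a compile-time flag (the hypothetical trackGraphemes parameter) stands in for the debug switch. Instead of building Grapheme objects it calls std.uni.graphemeStride, which returns the length in code units of the cluster starting at a given index. The constructor, the empty property and the graphemeEnd field are additions for illustration and are not part of the original code.

```d
import std.uni : graphemeStride;

struct StringStream(String, bool trackGraphemes = false)
{
    String str;
    size_t index;                    // code unit index

    static if (trackGraphemes)
    {
        size_t graphemeIndex;        // index of the grapheme containing str[index]
        private size_t graphemeEnd;  // code unit index one past the current grapheme
    }

    this(String s)
    {
        str = s;
        static if (trackGraphemes)
        {
            if (str.length)
                graphemeEnd = graphemeStride(str, 0);
        }
    }

    @property bool empty() const { return index >= str.length; }

    /// Returns the current code unit and advances by one code unit (no decoding).
    auto popChar()
    {
        auto unit = str[index++];
        static if (trackGraphemes)
        {
            // Crossing the end of the current cluster starts a new grapheme.
            if (index == graphemeEnd && index < str.length)
            {
                ++graphemeIndex;
                graphemeEnd = index + graphemeStride(str, index);
            }
        }
        return unit;
    }
}

unittest
{
    // 'e' + COMBINING ACUTE ACCENT form one grapheme; 'x' is the second.
    auto s = StringStream!(string, true)("e\u0301x");
    while (!s.empty)
        s.popChar();
    assert(s.index == 4);          // 1 + 2 + 1 UTF-8 code units
    assert(s.graphemeIndex == 1);  // two graphemes seen, so the last index is 1
}
```

With trackGraphemes left at false the extra fields and the boundary check are compiled out, so the non-debug parser pays nothing for them. The unittest block can be run with `dmd -unittest -main -run file.d`.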
Oct 05 2014
> You can use std.uni.byGrapheme to iterate by graphemes.
>
> AFAIK, graphemes are not "self synchronizing", but codepoints are. You can pop code units until you reach the beginning of a new codepoint. From there, you can iterate by graphemes, though your first grapheme might be off.

Maybe there is some way to detect just the first code unit of a grapheme without the overhead of using the Grapheme struct? I just tried to check if ch < 128 (for UTF-8), but this doesn't work. How do I check whether a byte is a continuation of the encoding of a single code point or whether a new sequence has started?
Oct 05 2014
On 2014-10-05 14:09, Uranuz wrote:
> Maybe there is some way to detect just the first code unit of a grapheme without the overhead of using the Grapheme struct? I just tried to check if ch < 128 (for UTF-8), but this doesn't work. How do I check whether a byte is a continuation of the encoding of a single code point or whether a new sequence has started?

Have a look here [1]. For example, if a code point is between U+0080 and U+07FF, you know that you need two bytes to get that whole code point.

[1] http://en.wikipedia.org/wiki/UTF-8#Description

-- 
/Jacob Carlborg
Oct 05 2014
> Have a look here [1]. For example, if a code point is between U+0080 and U+07FF, you know that you need two bytes to get that whole code point.
>
> [1] http://en.wikipedia.org/wiki/UTF-8#Description

Thanks. I have already solved it myself for the UTF-8 encoding. I chose an approach using bitmasks. Maybe it is not the most efficient, but it works:

    ( str[index] & 0b10000000 ) == 0 ||
    ( str[index] & 0b11100000 ) == 0b11000000 ||
    ( str[index] & 0b11110000 ) == 0b11100000 ||
    ( str[index] & 0b11111000 ) == 0b11110000

If it is true, it means that the first byte of a sequence has been found and I can count them. Am I right that this count equals the number of graphemes, or are there exceptions to this rule?

For UTF-32 the number of code units is just equal to the number of graphemes. And what about UTF-16? Is it possible to detect the first code unit of an encoding sequence?
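For well-formed input, the four masks above can be collapsed into a single test, and UTF-16 has an equivalent one. The sketch below is illustrative only: the names startsCodePoint and countCodePoints are hypothetical, and the functions assume valid UTF input (unlike the four-mask version, the UTF-8 test also accepts the invalid bytes 0xF8..0xFF as sequence starts). Note that what gets counted here is code points, not graphemes.

```d
/// UTF-8: every byte except a continuation byte (0b10xx_xxxx) starts a code point.
bool startsCodePoint(char u) pure nothrow @nogc @safe
{
    return (u & 0b1100_0000) != 0b1000_0000;
}

/// UTF-16: every code unit except a low (trailing) surrogate, 0xDC00..0xDFFF,
/// starts a code point.
bool startsCodePoint(wchar u) pure nothrow @nogc @safe
{
    return (u & 0xFC00) != 0xDC00;
}

/// UTF-32: each code unit is a whole code point.
bool startsCodePoint(dchar u) pure nothrow @nogc @safe
{
    return true;
}

/// Counts code points by looking only at code units (no decoding).
size_t countCodePoints(S)(S text)
{
    size_t n;
    // With an inferred element type, foreach over a D string yields code units.
    foreach (u; text)
        if (startsCodePoint(u))
            ++n;
    return n;
}

unittest
{
    assert(countCodePoints("h\u00E9llo") == 5);    // 6 UTF-8 code units
    assert(countCodePoints("\U0001F600"w) == 1);   // one surrogate pair
    assert(countCodePoints("e\u0301"d) == 2);      // two code points, one grapheme
}
```

The last assertion is the important one for this thread: 'e' followed by a combining accent is two code points in every encoding, yet a reader sees a single character.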
Oct 06 2014
On Mon, 06 Oct 2014 17:28:43 +0000 Uranuz via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:
> If it is true, it means that the first byte of a sequence has been found and I can count them. Am I right that this count equals the number of graphemes, or are there exceptions to this rule?

a lot. take for example RIGHT-TO-LEFT MARK, which is not a grapheme at all. and not a "composite" for that matter. ah, those joys of unicode!
Oct 06 2014
On Mon, Oct 06, 2014 at 05:28:43PM +0000, Uranuz via Digitalmars-d-learn wrote:
> > Have a look here [1]. For example, if a code point is between U+0080 and U+07FF, you know that you need two bytes to get that whole code point.
> >
> > [1] http://en.wikipedia.org/wiki/UTF-8#Description
>
> Thanks. I have already solved it myself for the UTF-8 encoding. I chose an approach using bitmasks. Maybe it is not the most efficient, but it works:
>
>     ( str[index] & 0b10000000 ) == 0 ||
>     ( str[index] & 0b11100000 ) == 0b11000000 ||
>     ( str[index] & 0b11110000 ) == 0b11100000 ||
>     ( str[index] & 0b11111000 ) == 0b11110000
>
> If it is true, it means that the first byte of a sequence has been found and I can count them. Am I right that this count equals the number of graphemes, or are there exceptions to this rule?
>
> For UTF-32 the number of code units is just equal to the number of graphemes. And what about UTF-16? Is it possible to detect the first code unit of an encoding sequence?

This looks wrong to me. Are you sure this finds *all* possible graphemes? Keep in mind that combining diacritic sequences are treated as a single grapheme; for example the sequence 'A' U+0301 U+0302 U+0303. There are several different codepoint ranges that have the combining diacritic property, and they are definitely more complicated than what you have here.

Furthermore, there are more complicated things like the Devanagari sequences (e.g., KA + VIRAMA + TA + VOWEL SIGN U), that your code certainly doesn't look like it would handle correctly.

As somebody else has said, it's generally a bad idea to work with Unicode byte sequences yourself, because Unicode is complicated, and many apparently-simple concepts actually require a lot of care to get right.

T

-- 
It won't be covered in the book. The source code has to be useful for something, after all. -- Larry Wall
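The combining-diacritic sequence mentioned above makes a compact test case. The sketch below measures it at three levels using only Phobos calls (std.range.walkLength and std.uni.byGrapheme). The Devanagari sequence is deliberately left out, since how it is clustered can depend on the Unicode version the library implements.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

unittest
{
    // 'A' + COMBINING ACUTE ACCENT + COMBINING CIRCUMFLEX ACCENT + COMBINING TILDE
    string s = "A\u0301\u0302\u0303";

    assert(s.length == 7);                 // UTF-8 code units: 1 + 2 + 2 + 2
    assert(s.walkLength == 4);             // code points (strings autodecode as ranges)
    assert(s.byGrapheme.walkLength == 1);  // one user-perceived character
}
```

Counting lead bytes would report 4 here, which is the code point count, so it overstates the number of graphemes by three.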
Oct 06 2014
On 06/10/14 19:48, H. S. Teoh via Digitalmars-d-learn wrote:
> This looks wrong to me. Are you sure this finds *all* possible graphemes?

No, the data I gave was for detecting a complete code point. Graphemes are something else; I think Uranuz is mixing up the Unicode terms.

-- 
/Jacob Carlborg
Oct 06 2014
On Tue, Oct 07, 2014 at 08:28:49AM +0200, Jacob Carlborg via Digitalmars-d-learn wrote:
> On 06/10/14 19:48, H. S. Teoh via Digitalmars-d-learn wrote:
> > This looks wrong to me. Are you sure this finds *all* possible graphemes?
>
> No, the data I gave was for detecting a complete code point. Graphemes are something else; I think Uranuz is mixing up the Unicode terms.
[...]

Ahhh, OK, then it makes sense.

T

-- 
People who are more than casually interested in computers should have at least some idea of what the underlying hardware is like. Otherwise the programs they write will be pretty weird. -- D. Knuth
Oct 07 2014
On Monday, 6 October 2014 at 17:28:45 UTC, Uranuz wrote:
>     ( str[index] & 0b10000000 ) == 0 ||
>     ( str[index] & 0b11100000 ) == 0b11000000 ||
>     ( str[index] & 0b11110000 ) == 0b11100000 ||
>     ( str[index] & 0b11111000 ) == 0b11110000
>
> If it is true, it means that the first byte of a sequence has been found and I can count them. Am I right that this count equals the number of graphemes, or are there exceptions to this rule?
>
> For UTF-32 the number of code units is just equal to the number of graphemes. And what about UTF-16? Is it possible to detect the first code unit of an encoding sequence?

I think your idea of graphemes is off.

A grapheme is made up of one or more code points. This is the same for all UTF encodings.

A code point is made of one or more code units. UTF8: between 1 and 4 I think, UTF16: 1 or 2, UTF32: always 1.

A code unit is made up of a fixed number of bytes. UTF8: 1, UTF16: 2, UTF32: 4.

So, the number of UTF8 bytes in a sequence has no relation to graphemes. For a multi-byte sequence, the number of leading ones in the UTF8 start byte is equal to the total number of bytes in that sequence (a single-byte ASCII character starts with a zero bit). I.e. when you see a 0b1110_0000 byte, the following two bytes should be continuation bytes (0b10xx_xxxx), and the three of them together encode a *code point*.

And in UTF32, the number of code units is equal to the number of *code points*, not graphemes.
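The sizes listed above can be checked directly in D. The following is a sketch, not library code: sequenceLength is a hypothetical helper that applies the leading-ones rule, and it is compared against std.utf.stride, which is the existing Phobos function for the same job.

```d
import std.utf : stride;

/// Length in bytes of the UTF-8 sequence introduced by `lead`,
/// or 0 if `lead` is a continuation byte or not a valid start byte.
uint sequenceLength(ubyte lead) pure nothrow @nogc @safe
{
    if (lead < 0x80)
        return 1;                  // ASCII: no leading ones, one byte

    uint ones = 0;
    for (ubyte mask = 0x80; lead & mask; mask >>= 1)
        ++ones;                    // count the leading one bits

    // 1 leading one marks a continuation byte; more than 4 is not valid UTF-8.
    return (ones >= 2 && ones <= 4) ? ones : 0;
}

unittest
{
    // A code unit has a fixed size per encoding.
    static assert(char.sizeof == 1 && wchar.sizeof == 2 && dchar.sizeof == 4);

    // One code point (U+1F600) takes 4, 2 and 1 code units respectively.
    assert("\U0001F600".length == 4);
    assert("\U0001F600"w.length == 2);
    assert("\U0001F600"d.length == 1);

    // EURO SIGN, U+20AC, is E2 82 AC in UTF-8: three leading ones, three bytes.
    string euro = "\u20AC";
    assert(sequenceLength(euro[0]) == 3);
    assert(stride(euro, 0) == 3);
    assert(sequenceLength(euro[1]) == 0);  // continuation byte
}
```

In real code std.utf.stride is the safer choice; the hand-written helper is only there to make the bit pattern visible.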
Oct 06 2014
On Sunday, 5 October 2014 at 12:09:34 UTC, Uranuz wrote:
> Maybe there is some way to detect just the first code unit of a grapheme without the overhead of using the Grapheme struct? I just tried to check if ch < 128 (for UTF-8), but this doesn't work. How do I check whether a byte is a continuation of the encoding of a single code point or whether a new sequence has started?

Are you trying to split strings? If you want to optimize usage of graphemes, try to check if 10 code units contain ascii symbol; when that fails, fall back to graphemes.
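One possible reading of the ASCII fast path suggested above, sketched for UTF-8 only. It looks one code unit ahead rather than ten, which is an assumption of this sketch and not what the post specifies. It relies on the fact that, under the Unicode grapheme cluster rules, the only pair of ASCII characters that forms a single cluster is CR LF. The name strideWithAsciiFastPath is hypothetical; std.uni.graphemeStride does the real work in the fallback.

```d
import std.uni : graphemeStride;

/// Length in code units of the grapheme starting at s[i],
/// avoiding the general algorithm when only ASCII is involved.
size_t strideWithAsciiFastPath(const(char)[] s, size_t i)
{
    immutable bool nextIsAsciiOrEnd = i + 1 == s.length || s[i + 1] < 0x80;

    if (s[i] < 0x80 && nextIsAsciiOrEnd)
    {
        // CR LF is the one ASCII pair that stays together as a single cluster.
        if (s[i] == '\r' && i + 1 < s.length && s[i + 1] == '\n')
            return 2;
        return 1;
    }

    // Anything non-ASCII nearby: fall back to the full grapheme logic.
    return graphemeStride(s, i);
}

unittest
{
    assert(strideWithAsciiFastPath("ab", 0) == 1);
    assert(strideWithAsciiFastPath("\r\nx", 0) == 2);
    assert(strideWithAsciiFastPath("e\u0301x", 0) == 3);  // falls back
    assert(strideWithAsciiFastPath("e\u0301x", 3) == 1);
}
```

The look-ahead matters: an ASCII letter followed by a combining mark is still one grapheme, so checking the current code unit alone would not be enough.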
Oct 06 2014
Unicode is hard to deal with properly, as how you deal with it is very context dependent.

One grapheme is a visible character and consists of one or more codepoints. One codepoint is one mapping of a byte sequence to a meaning, and consists of one or more bytes.

This you do not want to deal with yourself, as knowing which codepoints form graphemes is hard. Thankfully, std.uni exists. Specifically, look at decodeGrapheme: it pops one grapheme from an input range and returns it.

Never write code that deals with Unicode on a byte level. It will always be wrong.
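A short sketch of decodeGrapheme in use, assuming a plain string as the input range. The function takes its argument by reference and consumes exactly one cluster from the front, so the variable passed in must be an lvalue. The helper countWithDecodeGrapheme is a hypothetical name added for illustration.

```d
import std.uni : decodeGrapheme, Grapheme;

/// Counts graphemes by repeatedly popping one cluster from the front.
size_t countWithDecodeGrapheme(string text)
{
    size_t n;
    while (text.length > 0)
    {
        decodeGrapheme(text);  // advances `text` past one whole cluster
        ++n;
    }
    return n;
}

unittest
{
    string input = "e\u0301x";

    Grapheme g = decodeGrapheme(input);
    assert(g.length == 2);       // two code points in the cluster
    assert(g[0] == 'e');
    assert(g[1] == '\u0301');    // COMBINING ACUTE ACCENT
    assert(input == "x");        // the cluster was removed from the input

    assert(countWithDecodeGrapheme("e\u0301x") == 2);
}
```

When only the count or the boundaries are needed, byGrapheme or graphemeStride avoid constructing a Grapheme value for every cluster.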
Oct 06 2014