digitalmars.D.learn - Accented Characters and Counting Syllables

"Nordlöw" (10/10) Dec 06 2014 Given the fact that

H. S. Teoh via Digitalmars-d-learn (10/24) Dec 06 2014 This is a Unicode issue. What you want is neither byCodeUnit nor

"Nordlöw" (7/18) Dec 07 2014 Ok, thanks.

H. S. Teoh via Digitalmars-d-learn (6/23) Dec 07 2014 Not sure, but I wouldn't be surprised if it is. Unicode algorithms are

"Nordlöw" (5/17) Dec 08 2014 What's the best source of information for these algorithms? Is it

"Nordlöw" (4/7) Dec 08 2014 I guess

anonymous (10/15) Dec 07 2014 string already iterates over code points. So byCodePoint doesn't
"Marc Schütz" <schuetzm gmx.net> (2/3) Dec 07 2014 Huh? Why is byCodePoint.length even defined?

John Colvin (4/7) Dec 07 2014 because string has ElementType dchar (i.e. it already iterates by
"Marc Schütz" <schuetzm gmx.net> (16/19) Dec 07 2014 import std.uni;

"Nordlöw" <per.nordlow gmail.com> writes:

Given the fact that

     static assert("é".length == 2);

I was surprised that

     static assert("é".byCodeUnit.length == 2);
     static assert("é".byCodePoint.length == 2);

Isn't there a way to iterate over accented characters (in my case 
UTF-8) in D? Or is this an inherent problem in Unicode? I need 
this in a syllable counting algorithm that needs to distinguish 
accented and non-accented variants of vowels. For example café (2 
syllables) compared to babe (one syllable).

Dec 06 2014

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Sat, Dec 06, 2014 at 10:37:17PM +0000, "Nordlöw" via Digitalmars-d-learn
wrote:
 Given the fact that
 
     static assert("é".length == 2);
 
 I was surprised that
 
     static assert("é".byCodeUnit.length == 2);
     static assert("é".byCodePoint.length == 2);
 
 Isn't there a way to iterate over accented characters (in my case
 UTF-8) in D? Or is this an inherent problem in Unicode? I need this in
 a syllable counting algorithm that needs to distinguish accented and
 non-accented variants of vowels. For example café (2 syllables)
 compared to babe (one syllable).

This is a Unicode issue. What you want is neither byCodeUnit nor
byCodePoint, but byGrapheme. A grapheme is the Unicode equivalent of
what lay people would call a "character". A Unicode character (or more
precisely, a "code point") is not necessarily a complete grapheme, as
your example above shows; it's just a numerical value that uniquely
identifies an entry in the Unicode character database.
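
To make the three levels concrete, here is a minimal sketch (not from the original post) that counts the same word as code units, code points and graphemes. It assumes a current Phobos with `std.uni.byGrapheme` and `std.range.walkLength`; the helper names `codeUnits`, `codePoints` and `graphemes` are hypothetical. Note that a precomposed "é" (U+00E9) is already a single code point that merely occupies two UTF-8 code units, so the grapheme level only starts to differ once the accent is stored as a separate combining code point (U+0301).

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, named for illustration only.
size_t codeUnits(string s)  { return s.length; }               // UTF-8 bytes
size_t codePoints(string s) { return s.walkLength; }           // decoded dchars
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // user-perceived characters

void main()
{
    // Precomposed: 'é' is the single code point U+00E9 (two UTF-8 code units).
    string precomposed = "caf\u00E9";
    // Decomposed: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string decomposed = "cafe\u0301";

    assert(codeUnits(precomposed) == 5);
    assert(codePoints(precomposed) == 4);
    assert(graphemes(precomposed) == 4);

    assert(codeUnits(decomposed) == 6);
    assert(codePoints(decomposed) == 5);
    assert(graphemes(decomposed) == 4); // 'e' + accent form one grapheme
}
```

Only the grapheme count agrees for both spellings, which is why it is the level a syllable counter would probably want to work at.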


T

-- 
There are 10 kinds of people in the world: those who can count in binary, and
those who can't.

Dec 06 2014

"Nordlöw" <per.nordlow gmail.com> writes:

On Saturday, 6 December 2014 at 23:11:49 UTC, H. S. Teoh via 
Digitalmars-d-learn wrote:
 This is a Unicode issue. What you want is neither byCodeUnit nor
 byCodePoint, but byGrapheme. A grapheme is the Unicode equivalent of
 what lay people would call a "character". A Unicode character (or more
 precisely, a "code point") is not necessarily a complete grapheme, as
 your example above shows; it's just a numerical value that uniquely
 identifies an entry in the Unicode character database.


 T

Ok, thanks.

I just noticed that byGrapheme() lacks bidirectional access. 
Further it also lacks graphemeStrideBack() in complement to 
graphemeStride()? Similar to stride() and strideBack(). Is this 
difficult to implement?

Dec 07 2014

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Sun, Dec 07, 2014 at 02:30:13PM +0000, "Nordlöw" via Digitalmars-d-learn
wrote:
 On Saturday, 6 December 2014 at 23:11:49 UTC, H. S. Teoh via
 Digitalmars-d-learn wrote:
  This is a Unicode issue. What you want is neither byCodeUnit nor
  byCodePoint, but byGrapheme. A grapheme is the Unicode equivalent of
  what lay people would call a "character". A Unicode character (or
  more precisely, a "code point") is not necessarily a complete
  grapheme, as your example above shows; it's just a numerical value
  that uniquely identifies an entry in the Unicode character database.


  T

 
 Ok, thanks.
 
 I just noticed that byGrapheme() lacks bidirectional access. Further
 it also lacks graphemeStrideBack() in complement to graphemeStride()?
 Similar to stride() and strideBack(). Is this difficult to implement?

Not sure, but I wouldn't be surprised if it is. Unicode algorithms are
generally non-trivial.


T

-- 
Who told you to swim in Crocodile Lake without life insurance??

Dec 07 2014

"Nordlöw" <per.nordlow gmail.com> writes:

On Sunday, 7 December 2014 at 15:47:45 UTC, H. S. Teoh via 
Digitalmars-d-learn wrote:
 Ok, thanks.
 
 I just noticed that byGrapheme() lacks bidirectional access. Further
 it also lacks graphemeStrideBack() in complement to graphemeStride()?
 Similar to stride() and strideBack(). Is this difficult to implement?

 Not sure, but I wouldn't be surprised if it is. Unicode algorithms are
 generally non-trivial.


 T

What's the best source of information for these algorithms? Is it 
certain that grapheme iteration is backwards iterable by 
definition?

Dec 08 2014

"Nordlöw" <per.nordlow gmail.com> writes:

On Monday, 8 December 2014 at 14:57:06 UTC, Nordlöw wrote:
 What's the best source of information for these algorithms? Is 
 it certain that grapheme iteration is backwards iterable by 
 definition?

I guess

https://en.wikipedia.org/wiki/Combining_character

could be a good start.

Dec 08 2014

"anonymous" <anonymous example.com> writes:

On Saturday, 6 December 2014 at 22:37:19 UTC, Nordlöw wrote:
 Given the fact that

     static assert("é".length == 2);

 I was surprised that

     static assert("é".byCodeUnit.length == 2);
     static assert("é".byCodePoint.length == 2);

string already iterates over code points. So byCodePoint doesn't
have to do anything on it, and it just returns the same string
again.

string's .length is the number of code units. It's not compatible
with the range primitives. That's why hasLength is false for
string (and wstring). Don't use .length on ranges without
checking hasLength.

So, while "é".byCodeUnit and "é".byCodePoint have equal
`.length`s, they have different range element counts.
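
A short sketch of that point, assuming a current Phobos in which auto-decoding of narrow strings is still in effect: `hasLength` reports whether `.length` is consistent with the range primitives, and `walkLength` counts the elements that iteration actually yields.

```d
import std.range : hasLength, walkLength;
import std.utf : byCodeUnit;

// .length on narrow strings counts code units, so it is not a range length.
static assert(!hasLength!string);
static assert(!hasLength!wstring);
// dstring stores one code point per element, so its .length is a range length.
static assert(hasLength!dstring);

void main()
{
    string s = "\u00E9"; // precomposed 'é': one code point, two UTF-8 code units

    assert(s.length == 2);                // code units
    assert(s.walkLength == 1);            // elements seen when iterating the string
    assert(s.byCodeUnit.walkLength == 2); // elements seen when iterating code units

    // byCodeUnit wraps the string in a range whose .length matches its elements.
    static assert(hasLength!(typeof(s.byCodeUnit())));
}
```

In generic code, guarding `.length` with `hasLength` (or calling `walkLength` unconditionally) avoids mixing the two kinds of count.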

Dec 07 2014

"Marc Schütz" <schuetzm gmx.net> writes:

On Saturday, 6 December 2014 at 22:37:19 UTC, Nordlöw wrote:
     static assert("é".byCodePoint.length == 2);

Huh? Why is byCodePoint.length even defined?

Dec 07 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Sunday, 7 December 2014 at 13:24:28 UTC, Marc Schütz wrote:
 On Saturday, 6 December 2014 at 22:37:19 UTC, Nordlöw wrote:
    static assert("é".byCodePoint.length == 2);

 Huh? Why is byCodePoint.length even defined?

because string has ElementType dchar (i.e. it already iterates by 
codepoint), which means that byCodePoint is just the identity 
function.

Dec 07 2014

"Marc Schütz" <schuetzm gmx.net> writes:

On Sunday, 7 December 2014 at 13:24:28 UTC, Marc Schütz wrote:
 On Saturday, 6 December 2014 at 22:37:19 UTC, Nordlöw wrote:
    static assert("é".byCodePoint.length == 2);

 Huh? Why is byCodePoint.length even defined?

import std.uni;
pragma(msg, typeof("é".byCodePoint));
=> string

Something's very broken...

It's this definition in std.uni:

     Range byCodePoint(Range)(Range range)
         if(isInputRange!Range && is(Unqual!(ElementType!Range) == dchar))
     {
         return range;
     }

`Unqual!(ElementType!string)` is indeed `dchar` because of 
auto-decoding.

Filed as bug:
https://issues.dlang.org/show_bug.cgi?id=13829
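
The element-type claim can be checked at compile time. The following is a minimal sketch, assuming a Phobos with auto-decoding enabled; it deliberately makes no assumption about the concrete type `byCodePoint` returns, since that may differ between releases, and counts elements with `walkLength` instead of `.length`.

```d
import std.range : ElementType, walkLength;
import std.traits : Unqual;
import std.uni : byCodePoint;

// Iterating a narrow string as a range yields decoded code points.
static assert(is(ElementType!string == dchar));
static assert(is(Unqual!(ElementType!string) == dchar));
static assert(is(ElementType!wstring == dchar));

void main()
{
    string s = "\u00E9"; // precomposed 'é'

    // Indexing, by contrast, yields individual code units.
    static assert(is(typeof(s[0]) == immutable(char)));

    // Counting by iteration gives the number of code points,
    // whatever type byCodePoint happens to return.
    assert(s.byCodePoint.walkLength == 1);
}
```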

Dec 07 2014
