digitalmars.D - Unicode handling comparison

bearophile (7/7) Nov 27 2013 Through Reddit I have seen this small comparison of Unicode

Simen Kjærås (4/9) Nov 27 2013 Indeed it does. Have you tried with std.uni?
monarch_dodra (11/12) Nov 27 2013 I still think we're doing pretty good.
David Nadlinger (9/14) Nov 27 2013 If you need to perform this kind of operations on Unicode strings

Jacob Carlborg (5/10) Nov 27 2013 That didn't work out very well:

Adam D. Ruppe (7/8) Nov 27 2013 Yeah, I saw it too. The fix is simple:

Jacob Carlborg (4/9) Nov 27 2013 You were faster. But I created an issue as well.

bearophile (4/13) Nov 27 2013 Thank you :-)
Wyatt (12/21) Nov 27 2013 Seems like a pretty big "gotcha" from a usability standpoint;

Dicebot (2/6) Nov 27 2013 It probably is, but is Unicode gotcha, not D one.
Jacob Carlborg (4/8) Nov 27 2013 I think we should have that.
Jakob Ovrum (14/20) Nov 27 2013 What would it do that std.uni doesn't already?

Jacob Carlborg (5/6) Nov 27 2013 A class/struct that handles all these normalizations and other stuff

Jakob Ovrum (3/7) Nov 27 2013 Sounds terrible :)

Dicebot (6/15) Nov 27 2013 +1

Jacob Carlborg (5/9) Nov 27 2013 I think it's missing a final high level abstraction. As with the rest of...

Dmitry Olshansky (12/21) Nov 27 2013 This could give an idea of what Perl folks do to get the grapheme feel

H. S. Teoh (15/39) Nov 27 2013 Maybe it should be called graphemeString?
Wyatt (34/47) Nov 27 2013 Maybe. If I had called it...say, "normalisedString"? Would you
Walter Bright (7/11) Nov 28 2013 Sadly, std.array is determined to decode (i.e. convert to dchar[]) all y...

Jakob Ovrum (9/16) Nov 28 2013 Decoding by default means that algorithms can work reasonably
bearophile (9/11) Nov 28 2013 If you want to sort an array of chars you need to use a dchar[],
monarch_dodra (10/17) Nov 28 2013 I think it's great. It means by default, your strings will always

Walter Bright (11/13) Nov 28 2013 front() in std.array looks like:

H. S. Teoh (25/41) Nov 28 2013 OTOH, it is actually correct by default. If it *didn't* decode, things

Dicebot (1/1) Nov 28 2013 http://dlang.org/phobos/std_encoding.html#.AsciiString ?

monarch_dodra (11/12) Nov 28 2013 Yeah, that or just ubyte[].

Walter Bright (4/7) Nov 28 2013 It doesn't have to be merely ASCII. You can do string substring searches...

Dmitry Olshansky (19/26) Nov 28 2013 The greatest problem is surprisingly that you can't use range functions

Walter Bright (3/7) Nov 28 2013 I suspect the correct approach would be to have the range over string to...

Charles Hixson (12/28) Nov 27 2013 I don't like the overhead, and I don't know how important this is, but
Dmitry Olshansky (6/22) Nov 27 2013 It's anything but cheap.

Walter Bright (3/9) Nov 28 2013 Decoding isn't cheap, either, which is why I rant about it being the def...

Jakob Ovrum (17/22) Nov 27 2013 Most of the points are good, but the author seems to confuse

Wyatt (17/29) Nov 27 2013 I agree with the assertion that people SHOULD know how unicode

Wyatt (9/10) Nov 27 2013 Whoops, overzealous pasting. That is, "e\u0308", which composes

Jakob Ovrum (13/21) Nov 27 2013 Yes.
Dmitry Olshansky (9/20) Nov 27 2013 As much as standard defines it. (actually they talk about boundaries,

Jakob Ovrum (19/36) Nov 27 2013 I thought it was nice that std.uni had a proper terminology

Charles Hixson (22/56) Nov 27 2013 I would put things a bit more emphatically. The codepoint is analogous

Walter Bright (6/8) Nov 27 2013 Many things in Phobos either predate ranges, or are written by people wh...
Dmitry Olshansky (4/8) Nov 27 2013 Which ones? Or do you mean more like isAlpha(rangeOfCodepoints)?

Andrei Alexandrescu (3/10) Nov 27 2013 Yah, byGrapheme would be a great addition.

H. S. Teoh (8/19) Nov 27 2013 [...]

Dmitry Olshansky (13/30) Nov 27 2013 I could have sworn we had byGrapheme somewhere, well apparently not :(

Jakob Ovrum (4/6) Nov 29 2013 Simple attempt:

Simen Kjærås (40/49) Nov 27 2013 It shouldn't be hard to make, either:

Gary Willoughby (3/10) Nov 27 2013 Ha, i was just discussing that here:

"bearophile" <bearophileHUGS lycos.com> writes:

Through Reddit I have seen this small comparison of Unicode 
handling between different programming languages:

http://mortoray.com/2013/11/27/the-string-type-is-broken/

D+Phobos seem to fail most things (it produces BAFFLE):
http://dpaste.dzfl.pl/a5268c435

Bye,
bearophile

Nov 27 2013

Simen Kjærås <simen.kjaras gmail.com> writes:

On 2013-11-27 13:46, bearophile wrote:
 Through Reddit I have seen this small comparison of Unicode handling
 between different programming languages:

 http://mortoray.com/2013/11/27/the-string-type-is-broken/

 D+Phobos seem to fail most things (it produces BAFFLE):
 http://dpaste.dzfl.pl/a5268c435

Indeed it does. Have you tried with std.uni?

--
   Simen

Nov 27 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
 D+Phobos seem to fail most things (it produces BAFFLE):

I still think we're doing pretty good.

At least, we *handle* unicode at all (looking at you C++). And we 
handle *true* unicode, not BMP style UCS (looking at you 

encoding: UTF8 through UTF32, and the possibility to also have 
ASCII.

We don't yet totally handle things like diacritics or ligatures, 
but we are getting there.

As a whole, I find that D is incredibly "unicode correct enough" 
out of the box, and with no extra effort involved.

Nov 27 2013

"David Nadlinger" <code klickverbot.at> writes:

On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
 Through Reddit I have seen this small comparison of Unicode 
 handling between different programming languages:

 http://mortoray.com/2013/11/27/the-string-type-is-broken/

 D+Phobos seem to fail most things (it produces BAFFLE):
 http://dpaste.dzfl.pl/a5268c435

If you need to perform this kind of operation on Unicode strings 
in D, you can call normalize (std.uni) on the string first to 
make sure it is in one of the Normalization Forms. For example, 
just appending .normalize to your strings (which defaults to NFC) 
would make the code produce the "expected" results.

As far as I'm aware, this behavior is the result of a deliberate 
decision, as normalizing strings on the fly isn't really cheap.

David

Nov 27 2013
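
A minimal sketch of the normalization step David describes, using the article's "noël" example. It assumes `std.uni.normalize` with the `NFC` form; the helper names `sameText` and `nfcLength` are made up for illustration and are not part of Phobos.

```d
import std.range : walkLength;
import std.uni : normalize, NFC;

/// Compares two strings for equality after bringing both to NFC.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

/// Counts code points after NFC normalization.
size_t nfcLength(string s)
{
    return normalize!NFC(s).walkLength;
}

void main()
{
    string decomposed = "noe\u0308l"; // 'e' followed by U+0308 COMBINING DIAERESIS
    string composed = "no\u00EBl";    // precomposed U+00EB

    assert(decomposed != composed);         // the raw code unit sequences differ
    assert(sameText(decomposed, composed)); // equal once both are in NFC
    assert(decomposed.walkLength == 5);     // five code points before normalizing
    assert(nfcLength(decomposed) == 4);     // four after composing e + U+0308
}
```

Note that this only helps where Unicode defines a precomposed code point; a base letter with a combining mark that has no precomposed form stays as several code points even in NFC.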

Jacob Carlborg <doob me.com> writes:

On 2013-11-27 15:45, David Nadlinger wrote:

 If you need to perform this kind of operations on Unicode strings in D,
 you can call normalize (std.uni) on the string first to make sure it is
 in one of the Normalization Forms. For example, just appending
 .normalize to your strings (which defaults to NFC) would make the code
 produce the "expected" results.

That didn't work out very well:

std/uni.d(6301): Error: undefined identifier tuple

-- 
/Jacob Carlborg

Nov 27 2013

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Wednesday, 27 November 2013 at 15:03:37 UTC, Jacob Carlborg 
wrote:
 std/uni.d(6301): Error: undefined identifier tuple

Yeah, I saw it too. The fix is simple:

https://github.com/D-Programming-Language/phobos/pull/1728

tbh this makes me think version(unittest) might just be 
considered harmful. I'm sure that code passed the tests, but only 
because a vital import was in a version(unittest) section!

Nov 27 2013

Jacob Carlborg <doob me.com> writes:

On 2013-11-27 16:07, Adam D. Ruppe wrote:

 Yeah, I saw it too. The fix is simple:

 https://github.com/D-Programming-Language/phobos/pull/1728

 tbh this makes me think version(unittest) might just be considered
 harmful. I'm sure that code passed the tests, but only because a vital
 import was in a version(unittest) section!

You were faster. But I created an issue as well.

-- 
/Jacob Carlborg

Nov 27 2013

"bearophile" <bearophileHUGS lycos.com> writes:

David Nadlinger:

 If you need to perform this kind of operations on Unicode 
 strings in D, you can call normalize (std.uni) on the string 
 first to make sure it is in one of the Normalization Forms. For 
 example, just appending .normalize to your strings (which 
 defaults to NFC) would make the code produce the "expected" 
 results.

 As far as I'm aware, this behavior is the result of a 
 deliberate decision, as normalizing strings on the fly isn't 
 really cheap.

Thank you :-)

Bye,
bearophile

Nov 27 2013

"Wyatt" <wyatt.epp gmail.com> writes:

On Wednesday, 27 November 2013 at 14:45:32 UTC, David Nadlinger 
wrote:
 If you need to perform this kind of operations on Unicode 
 strings in D, you can call normalize (std.uni) on the string 
 first to make sure it is in one of the Normalization Forms. For 
 example, just appending .normalize to your strings (which 
 defaults to NFC) would make the code produce the "expected" 
 results.

Seems like a pretty big "gotcha" from a usability standpoint; 
it's not exactly intuitive.  I understand WHY this decision was 
made, but it feels like a source of code smell and weird string 
comparison errors.

 As far as I'm aware, this behavior is the result of a 
 deliberate decision, as normalizing strings on the fly isn't 
 really cheap.

I don't remember if it was brought up before, but this makes me 
wonder if something like an i18nString should exist for cases 
where it IS important.  Making i18n stuff as simple as it looks 
like it "should" be has merit, IMO.  (Maybe there's even room for 
a std.string.i18n submodule?)

-Wyatt

Nov 27 2013

"Dicebot" <public dicebot.lv> writes:

On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote:
 Seems like a pretty big "gotcha" from a usability standpoint; 
 it's not exactly intuitive.  I understand WHY this decision was 
 made, but it feels like a source of code smell and weird string 
 comparison errors.

It probably is, but it is a Unicode gotcha, not a D one.

Nov 27 2013

Jacob Carlborg <doob me.com> writes:

On 2013-11-27 17:15, Wyatt wrote:

 I don't remember if it was brought up before, but this makes me wonder
 if something like an i18nString should exist for cases where it IS
 important.  Making i18n stuff as simple as it looks like it "should" be
 has merit, IMO.  (Maybe there's even room for a std.string.i18n submodule?)

I think we should have that.

-- 
/Jacob Carlborg

Nov 27 2013

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote:
 I don't remember if it was brought up before, but this makes me 
 wonder if something like an i18nString should exist for cases 
 where it IS important.  Making i18n stuff as simple as it looks 
 like it "should" be has merit, IMO.  (Maybe there's even room 
 for a std.string.i18n submodule?)

 -Wyatt

What would it do that std.uni doesn't already?

i18nString sounds like a range of graphemes to me. I would like a 
convenient function in std.uni to get such a range of graphemes 
from a range of points, but I wouldn't want to elevate it to any 
particular status; that would be a knee-jerk reaction. D's 
granularity when it comes to Unicode is because there is an 
appropriate level of representation for each domain. Shoe-horning 
everything into a range of graphemes is something we should avoid.

In D, we can write code that is both Unicode-correct and highly 
performant, while still being simple and pleasant to read. To 
write such code, one must have a modicum of understanding of how 
Unicode works (in order to choose the right tools from the 
toolbox), but I think it's a novel compromise.

Nov 27 2013

Jacob Carlborg <doob me.com> writes:

On 2013-11-27 18:22, Jakob Ovrum wrote:

 What would it do that std.uni doesn't already?

A class/struct that handles all these normalizations and other stuff 
automatically.

-- 
/Jacob Carlborg

Nov 27 2013

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Wednesday, 27 November 2013 at 17:30:22 UTC, Jacob Carlborg 
wrote:
 On 2013-11-27 18:22, Jakob Ovrum wrote:

 What would it do that std.uni doesn't already?

 A class/struct that handles all these normalizations and other 
 stuff automatically.

Sounds terrible :)

Nov 27 2013

"Dicebot" <public dicebot.lv> writes:

On Wednesday, 27 November 2013 at 17:37:48 UTC, Jakob Ovrum wrote:
 On Wednesday, 27 November 2013 at 17:30:22 UTC, Jacob Carlborg 
 wrote:
 On 2013-11-27 18:22, Jakob Ovrum wrote:

 What would it do that std.uni doesn't already?

 A class/struct that handles all these normalizations and other 
 stuff automatically.

 Sounds terrible :)

+1

Working with graphemes is a rather expensive thing to do 
performance-wise. I like how D makes this fact obvious and 
provides continuous transition through abstraction levels here. 
It is important to make the costs obvious.

Nov 27 2013

Jacob Carlborg <doob me.com> writes:

On 2013-11-27 18:56, Dicebot wrote:

 +1

 Working with graphemes is rather expensive thing to do performance-wise.
 I like how D makes this fact obvious and provides continuous transition
 through abstraction levels here. It is important to make the costs obvious.

I think it's missing a final high level abstraction. As with the rest of 
the abstractions you're not forced to use them.

-- 
/Jacob Carlborg

Nov 27 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

27-Nov-2013 22:54, Jacob Carlborg wrote:
 On 2013-11-27 18:56, Dicebot wrote:

 +1

 Working with graphemes is rather expensive thing to do performance-wise.
 I like how D makes this fact obvious and provides continuous transition
 through abstraction levels here. It is important to make the costs
 obvious.

 I think it's missing a final high level abstraction. As with the rest of
 the abstractions you're not forced to use them.

This could give an idea of what Perl folks do to get the grapheme feel 
like a unit of string:
http://www.parrot.org/content/ucs-4-nfg-and-how-grapheme-tables-makes-it-awesome

You seriously don't want this kind of behind the scenes work taking 
place in systems language.

P.S. The text linked presents some incorrect "facts" about Unicode that 
I'm not to be held responsible for :) I do believe however that the 
general idea described is interesting and is worth trying out in 
addition to what we have in std.uni.

-- 
Dmitry Olshansky

Nov 27 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Nov 27, 2013 at 06:22:41PM +0100, Jakob Ovrum wrote:
 On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote:
I don't remember if it was brought up before, but this makes me
wonder if something like an i18nString should exist for cases
where it IS important.  Making i18n stuff as simple as it looks
like it "should" be has merit, IMO.  (Maybe there's even room for
a std.string.i18n submodule?)

-Wyatt

 
 What would it do that std.uni doesn't already?
 
 i18nString sounds like a range of graphemes to me.

Maybe it should be called graphemeString?

I'm not sure what this has to do with i18n, though. Properly done i18n
should use Unicode line-breaking algorithms and other such standardized
functions, rather than manipulating graphemes directly (which fails to
take into account double-width characters, language-specific
decomposition rules, and many other gotchas, not to mention
poorly-performing). AFAIK std.uni already provides a way to extract
graphemes when you need it (e.g., for rendering fonts), so there's
really no reason to default to graphemeString everywhere in your
program. *That* is a sign of poorly written code, IMNSHO.


 I would like a convenient function in std.uni to get such a range of
 graphemes from a range of points, but I wouldn't want to elevate it to
 any particular status; that would be a knee-jerk reaction. D's
 granularity when it comes to Unicode is because there is an
 appropriate level of representation for each domain. Shoe-horning
 everything into a range of graphemes is something we should avoid.
 
 In D, we can write code that is both Unicode-correct and highly
 performant, while still being simple and pleasant to read. To write
 such code, one must have a modicum of understanding of how Unicode
 works (in order to choose the right tools from the toolbox), but I
 think it's a novel compromise.

Agreed.


T

-- 
MASM = Mana Ada Sistem, Man!

Nov 27 2013

"Wyatt" <wyatt.epp gmail.com> writes:

On Wednesday, 27 November 2013 at 17:22:43 UTC, Jakob Ovrum wrote:
 i18nString sounds like a range of graphemes to me.

Maybe.  If I had called it...say, "normalisedString"?  Would you 
still think that?  That was an off-the-cuff name because my 
morning brain imagined that this sort of thing would be useful 
for user input where you can't make assumptions about its form.

 I would like a convenient function in std.uni to get such a 
 range of graphemes from a range of points, but I wouldn't want 
 to elevate it to any particular status; that would be a 
 knee-jerk reaction. D's granularity when it comes to Unicode is 
 because there is an appropriate level of representation for 
 each domain. Shoe-horning everything into a range of graphemes 
 is something we should avoid.

Okay, hold up.  It's a bit late to prevent everyone from diving 
down this rabbit hole, but let me be clear:

This really isn't about graphemes.  Not really.  They may be 
involved, but I think focusing on that obscures the point.

If you recall the original article, I don't think he's being 
unfair in expecting "noël" to have a length of four no matter 
how it was composed.  I don't think it's unfair to expect that 
"noël".take(3) returns "noë", and I don't think it's unfair 
that reversing it should be "lëon".  All the places where his 
expectations were defied (and more!) are implementation details.

While I stated before that I don't necessarily have anything 
against people learning more about unicode, neither do I 
fundamentally believe that's something a lot of people _need_ to 
worry about.  I'm not saying the default string in D should 
change or anything crazy like that.  All I'm suggesting is maybe, 
rather than telling people they should read a small book about 
the most arcane stuff imaginable and then explaining which tool 
does what when that doesn't take, we could just tell them "Here, 
use this library type where you need it" with the admonishment 
that it may be too slow if abused.  I think THAT could be useful.

 In D, we can write code that is both Unicode-correct and highly 
 performant, while still being simple and pleasant to read. To 
 write such code, one must have a modicum of understanding of 
 how Unicode works (in order to choose the right tools from the 
 toolbox), but I think it's a novel compromise.

See, this sways me only a little bit.  The reason for that is, 
often, convenience greatly trumps elegance or performance.  Sure 
I COULD write something in C to look for obvious bad stuff in my 
syslog, but would I bother when I have a shell with pipes, grep, 
cut, and sed?  This all isn't to say I don't LIKE performance and 
elegance; but I live, work, and play on both sides of this 
spectrum, and I'd like to think they can peacefully coexist 
without too much fuss.

-Wyatt

Nov 27 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 11/27/2013 9:22 AM, Jakob Ovrum wrote:
 In D, we can write code that is both Unicode-correct and highly performant,
 while still being simple and pleasant to read. To write such code, one must
 have a modicum of understanding of how Unicode works (in order to choose the
 right tools from the toolbox), but I think it's a novel compromise.

Sadly, std.array is determined to decode (i.e. convert to dchar[]) all your 
strings when they are used as ranges. This means that all algorithms on strings 
will be crippled as far as performance goes.

http://dlang.org/glossary.html#narrow strings

Very, very few operations on strings need decoding. The decoding should have 
gone into a separate layer.

Nov 28 2013

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright 
wrote:
 Sadly, std.array is determined to decode (i.e. convert to 
 dchar[]) all your strings when they are used as ranges. This 
 means that all algorithms on strings will be crippled as far as 
 performance goes.

 http://dlang.org/glossary.html#narrow strings

 Very, very few operations on strings need decoding. The 
 decoding should have gone into a separate layer.

Decoding by default means that algorithms can work reasonably 
with strings without being designed specifically for strings. The 
algorithms can then later be specialized for narrow strings, 
which I believe is happening for a few algorithms in 
std.algorithm like substring search.

Decoding is still available as a separate layer through std.utf, 
when more control over decoding is required.

Nov 28 2013
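
A minimal sketch of what "decoding as a separate layer through std.utf" can look like when done by hand. It assumes `std.utf.decode` and `std.utf.stride`; the helper names `decodeAll` and `countCodePoints` are hypothetical.

```d
import std.utf : decode, stride;

/// Decodes a UTF-8 string into code points, one explicit decode call at a time.
dchar[] decodeAll(string s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
        result ~= decode(s, i); // decode advances i past the sequence it read
    return result;
}

/// Counts code points by stepping over each sequence without decoding its value.
size_t countCodePoints(string s)
{
    size_t n = 0;
    for (size_t i = 0; i < s.length; i += stride(s, i))
        ++n;
    return n;
}

void main()
{
    string s = "no\u00EBl"; // U+00EB occupies two UTF-8 code units
    assert(s.length == 5);
    assert(decodeAll(s) == "no\u00EBl"d);
    assert(countCodePoints(s) == 4);
}
```

The second helper illustrates that some questions about a string (here, how many code points it holds) can be answered from the sequence lengths alone.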

"bearophile" <bearophileHUGS lycos.com> writes:

Walter Bright:

 This means that all algorithms on strings will be crippled
 as far as performance goes.

If you want to sort an array of chars you need to use a dchar[], 
or code like this:

import std.algorithm : sort;
import std.string : representation;
char[] word = "just a test".dup;
auto sword = cast(char[])word.representation.sort().release;

See:
http://d.puremagic.com/issues/show_bug.cgi?id=10162

Bye,
bearophile

Nov 28 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright
wrote:
 Sadly,

I think it's great. It means by default, your strings will always
be handled correctly. I think there's quite a few algorithms that
were written without ever taking strings into account, but still
happen to work with them.

 std.array is determined to decode (i.e. convert to dchar[]) all 
 your strings when they are used as ranges.
 This means that all algorithms on
 strings will be crippled as far as performance goes.

Quite a few algorithms in array/algorithm/string *don't* decode
the string when they don't need to actually.

 Very, very few operations on strings need decoding. The 
 decoding should have gone into a separate layer.

Which operations are you thinking of in std.array that decode
when they shouldn't?

Nov 28 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 11/28/2013 5:24 AM, monarch_dodra wrote:
 Which operations are you thinking of in std.array that decode
 when they shouldn't?

front() in std.array looks like:

@property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
{
    assert(a.length, "Attempting to fetch the front of an empty array of " ~ T.stringof);
    size_t i = 0;
    return decode(a, i);
}

So anytime I write a generic algorithm using empty, front, and popFront(), it 
decodes the strings, which is a large pessimization.

Nov 28 2013
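
A small sketch of the behaviour Walter describes, next to two ways of opting out. `std.utf.byCodeUnit` is assumed here from newer Phobos releases than the one discussed in this thread, and the three helper names are made up for illustration.

```d
import std.range : ElementType, walkLength;
import std.string : representation;
import std.traits : Unqual;
import std.utf : byCodeUnit;

// Range primitives on a narrow string decode: the element type is dchar.
static assert(is(ElementType!string == dchar));

// byCodeUnit wraps the same string as a range of char, with no decoding.
static assert(is(Unqual!(ElementType!(typeof("".byCodeUnit))) == char));

/// Length as seen through the decoding range interface (code points).
size_t decodedLength(string s) { return s.walkLength; }

/// Length as seen through the non-decoding wrapper (code units).
size_t codeUnitLength(string s) { return s.byCodeUnit.walkLength; }

/// Length of the raw byte view returned by representation.
size_t byteLength(string s) { return s.representation.length; }

void main()
{
    string s = "no\u00EBl"; // U+00EB occupies two UTF-8 code units
    assert(decodedLength(s) == 4);
    assert(codeUnitLength(s) == 5);
    assert(byteLength(s) == 5);
}
```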

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Nov 28, 2013 at 09:52:08AM -0800, Walter Bright wrote:
 On 11/28/2013 5:24 AM, monarch_dodra wrote:
Which operations are you thinking of in std.array that decode
when they shouldn't?

 
 front() in std.array looks like:
 
 @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
 {
     assert(a.length, "Attempting to fetch the front of an empty array of " ~ T.stringof);
     size_t i = 0;
     return decode(a, i);
 }
 
 So anytime I write a generic algorithm using empty, front, and
 popFront(), it decodes the strings, which is a large pessimization.

OTOH, it is actually correct by default. If it *didn't* decode, things
like std.algorithm.sort and std.range.retro would mangle all your
multibyte UTF-8 characters.

Having said that, though, it would be nice if there were a standard
ASCII string type that didn't decode by default. Always decoding strings
*is* slow, esp. when you already know that it only contains ASCII
characters. Maybe we want something like this:

	struct AsciiString {
		immutable(ubyte)[] impl;
		alias impl this;

		// This is so that .front returns char instead of ubyte
		@property char front() { return cast(char) impl[0]; }

		char opIndex(size_t idx) { ... /* ditto */ }

		... // other range methods here
	}

	AsciiString assumeAscii(string s)
	{
		return AsciiString(cast(immutable(ubyte)[]) s);
	}


T

-- 
"640K ought to be enough" -- Bill G., 1984.
"The Internet is not a primary goal for PC usage" -- Bill G., 1995.
"Linux has no impact on Microsoft's strategy" -- Bill G., 1999.

Nov 28 2013
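
One possible way to fill in the elided parts of the `AsciiString` idea, as a self-contained sketch. The member list and the ASCII check in `assumeAscii` are assumptions added for illustration, `alias this` is left out to keep the range interface unambiguous, and the type is unrelated to `std.encoding.AsciiString`.

```d
import std.algorithm : all, equal;
import std.range : ElementType, isInputRange;
import std.string : representation;

/// Hypothetical non-decoding string wrapper; front yields char, not dchar.
struct AsciiString
{
    immutable(ubyte)[] impl;

    @property bool empty() const { return impl.length == 0; }
    @property char front() const { return cast(char) impl[0]; }
    void popFront() { impl = impl[1 .. $]; }
    @property size_t length() const { return impl.length; }
    char opIndex(size_t idx) const { return cast(char) impl[idx]; }
}

static assert(isInputRange!AsciiString);
static assert(is(ElementType!AsciiString == char));

/// Wraps s without copying. The assert is an addition of this sketch:
/// it checks the ASCII assumption in non-release builds.
AsciiString assumeAscii(string s)
{
    assert(s.representation.all!(b => b < 0x80), "non-ASCII input");
    return AsciiString(s.representation);
}

void main()
{
    auto a = assumeAscii("hello");
    assert(a.length == 5);
    assert(a[1] == 'e');
    assert(a.equal("hello")); // generic algorithms see a range of char
}
```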

"Dicebot" <public dicebot.lv> writes:

http://dlang.org/phobos/std_encoding.html#.AsciiString ?

Nov 28 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 28 November 2013 at 18:55:44 UTC, Dicebot wrote:
 http://dlang.org/phobos/std_encoding.html#.AsciiString ?

Yeah, that or just ubyte[].

The problem with both of these though, is printing :/ (which
prints ugly as sin)

Something like:
struct AsciiChar
{
      private char c;
      alias c this;
}

Could be a very easy and efficient alternative.

Nov 28 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 11/28/2013 10:19 AM, H. S. Teoh wrote:
 Always decoding strings
 *is* slow, esp. when you already know that it only contains ASCII
 characters.

It doesn't have to be merely ASCII. You can do string substring searches
without any need for decoding, for example. You don't even need decoding to
do regex. Decoding is rarely needed.

Nov 28 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

28-Nov-2013 17:24, monarch_dodra wrote:
 On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright
 wrote:
 Sadly,

 I think it's great. It means by default, your strings will always
 be handled correctly. I think there's quite a few algorithms that
 were written without ever taking strings into account, but still
 happen to work with them.

The greatest problem is surprisingly that you can't apply range functions 
to the implicit code unit range even if you REALLY wanted to.

To not go far away - the only reason std.regex can't take e.g. retro of 
string:

match(retro("hleb"), ".el.");

is because of the automatic dumbing down at the moment you apply range 
adapter. What I'd need in std.regex is a codeunit range that due to 
convention also "happens to be" a range of codepoints.

The second problem is that string code is carefully special cased but 
the effort is completely wasted the moment you have a slice of char-s 
that come from anywhere else (circular buffer, for instance) than 
built-in strings.

I had a (a bit cloudy) vision of settling encoded ranges problem once 
and for good. That includes defining notion of an encoded range that is 
2 in one: some stronger (as in capabilities) range of code elements and 
the default decoded view imposed on top of it (that can be weaker).

-- 
Dmitry Olshansky

Nov 28 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 11/28/2013 11:32 AM, Dmitry Olshansky wrote:
 I had a (a bit cloudy) vision of settling encoded ranges problem once and for
 good. That includes defining notion of an encoded range that is 2 in one: some
 stronger (as in capabilities) range of code elements and the default decoded
 view imposed on top of it (that can be weaker).

I suspect the correct approach would be to have the range over string
produce bytes. If you want decoded values, then run it through an adapter
algorithm.

Nov 28 2013

Charles Hixson <charleshixsn earthlink.net> writes:

On 11/27/2013 06:45 AM, David Nadlinger wrote:
 On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
 Through Reddit I have seen this small comparison of Unicode handling 
 between different programming languages:

 http://mortoray.com/2013/11/27/the-string-type-is-broken/

 D+Phobos seem to fail most things (it produces BAFFLE):
 http://dpaste.dzfl.pl/a5268c435

 If you need to perform this kind of operations on Unicode strings in 
 D, you can call normalize (std.uni) on the string first to make sure 
 it is in one of the Normalization Forms. For example, just appending 
 .normalize to your strings (which defaults to NFC) would make the code 
 produce the "expected" results.

 As far as I'm aware, this behavior is the result of a deliberate 
 decision, as normalizing strings on the fly isn't really cheap.

 David

I don't like the overhead, and I don't know how important this is, but 
perhaps the best way to solve it would be to have string include a 
"normalization" byte, saying whether it was normalized, and if so in 
what way.  That there can be multiple ways of normalizing is painful, 
but it *is* the standard.  And this would allow normalization to be 
skipped whenever the comparison of two strings showed the same 
normalization (or lack thereof).  What to do if they're normalized 
differently is a bit of a puzzle, but most reasonable solutions would 
work for most cases, so you just need a way to override the defaults.

-- 
Charles Hixson

Nov 27 2013
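
A rough sketch of the "normalization byte" idea, written as a separate wrapper type rather than a change to `string`. Everything here (`Norm`, `TaggedString`, `toNFC`, `sameText`) is hypothetical. It departs from the proposal in one respect: two untagged strings are still normalized before comparing, since skipping that step would not detect canonically equivalent text.

```d
import std.uni : normalize, NFC;

/// One-byte tag recording which normalization form, if any, a string is known to be in.
enum Norm : ubyte { none, nfc, nfd }

/// Hypothetical string wrapper carrying the tag alongside the data.
struct TaggedString
{
    string data;
    Norm norm = Norm.none;
}

/// Returns an NFC-tagged copy, skipping the work if the tag says it is already NFC.
TaggedString toNFC(TaggedString s)
{
    if (s.norm == Norm.nfc)
        return s;
    return TaggedString(normalize!NFC(s.data), Norm.nfc);
}

/// Compares two tagged strings for canonical equivalence.
bool sameText(TaggedString a, TaggedString b)
{
    // Both known to be in the same form: a plain comparison is enough.
    if (a.norm != Norm.none && a.norm == b.norm)
        return a.data == b.data;
    // Otherwise fall back to normalizing both sides to a common form.
    return toNFC(a).data == toNFC(b).data;
}

void main()
{
    auto raw = TaggedString("noe\u0308l");        // untagged user input
    auto nfc = toNFC(TaggedString("no\u00EBl"));  // tagged as NFC
    assert(nfc.norm == Norm.nfc);
    assert(sameText(raw, nfc));
    assert(!sameText(raw, TaggedString("noel")));
}
```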

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

27-Nov-2013 18:45, David Nadlinger wrote:
 On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
 Through Reddit I have seen this small comparison of Unicode handling
 between different programming languages:

 http://mortoray.com/2013/11/27/the-string-type-is-broken/

 D+Phobos seem to fail most things (it produces BAFFLE):
 http://dpaste.dzfl.pl/a5268c435

 If you need to perform this kind of operations on Unicode strings in D,
 you can call normalize (std.uni) on the string first to make sure it is
 in one of the Normalization Forms. For example, just appending
 .normalize to your strings (which defaults to NFC) would make the code
 produce the "expected" results.

 As far as I'm aware, this behavior is the result of a deliberate
 decision, as normalizing strings on the fly isn't really cheap.

It's anything but cheap.
At the minimum imagine crawling the string and issuing a table lookup 
per codepoint.

 David


-- 
Dmitry Olshansky

Nov 27 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 11/27/2013 12:06 PM, Dmitry Olshansky wrote:
 27-Nov-2013 18:45, David Nadlinger wrote:
 As far as I'm aware, this behavior is the result of a deliberate
 decision, as normalizing strings on the fly isn't really cheap.

 It's anything but cheap.
 At the minimum imagine crawling the string and issuing a table lookup per
 codepoint.

Decoding isn't cheap, either, which is why I rant about it being the default 
behavior.

Nov 28 2013

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
 Through Reddit I have seen this small comparison of Unicode 
 handling between different programming languages:

 http://mortoray.com/2013/11/27/the-string-type-is-broken/

Most of the points are good, but the author seems to confuse 
UCS-2 with UTF-16, so the whole point about UTF-16 is plain wrong.

The author also doesn't seem to understand the Unicode 
definitions of character and grapheme, which is a shame, because 
the difference is more or less the whole point of the post.

 D+Phobos seem to fail most things (it produces BAFFLE):
 http://dpaste.dzfl.pl/a5268c435

D strings are arrays of code units and ranges of code points. The 
failure here is yours; in that you didn't use std.uni to handle 
graphemes.

On that note, I tried to use std.uni to write a simple example of 
how to correctly handle this in D, but it became apparent that 
std.uni should expose something like `byGrapheme` which lazily 
transforms a range of code points to a range of graphemes 
(probably needs a `byCodePoint` to do the converse too). The two 
extant grapheme functions, `decodeGrapheme` and `graphemeStride`, 
are *awful* for string manipulation (granted, they are probably 
perfect for text rendering).

Nov 27 2013
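
For readers on a newer Phobos than the one discussed in this thread: `std.uni` now provides `byGrapheme` and `byCodePoint`, so the article's tests can be sketched roughly as below. The three helper names are made up for illustration, and the eager `array` calls are a simplicity choice, not a requirement.

```d
import std.array : array;
import std.conv : to;
import std.range : retro, take, walkLength;
import std.uni : byCodePoint, byGrapheme;

/// Number of grapheme clusters in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

/// The first n grapheme clusters of s, re-encoded as a UTF-8 string.
string takeGraphemes(string s, size_t n)
{
    return s.byGrapheme.take(n).byCodePoint.array.to!string;
}

/// s with its grapheme clusters in reverse order; combining marks stay attached.
string reverseGraphemes(string s)
{
    return s.byGrapheme.array.retro.byCodePoint.array.to!string;
}

void main()
{
    string s = "noe\u0308l"; // "noël" written with a combining diaeresis

    assert(s.walkLength == 5);                     // code points
    assert(graphemeCount(s) == 4);                 // graphemes
    assert(takeGraphemes(s, 3) == "noe\u0308");    // "noë"
    assert(reverseGraphemes(s) == "le\u0308on");   // "lëon"
}
```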

"Wyatt" <wyatt.epp gmail.com> writes:

On Wednesday, 27 November 2013 at 15:43:11 UTC, Jakob Ovrum wrote:
 The author also doesn't seem to understand the Unicode 
 definitions of character and grapheme, which is a shame, 
 because the difference is more or less the whole point of the 
 post.

I agree with the assertion that people SHOULD know how unicode 
works if they want to work with it, but the way our docs are now 
is off-putting enough that most probably won't learn anything.  
If they know, they know; if they don't, the wall of jargon is 
intimidating and hard to grasp (more examples up front of more 
things that you'd actually use std.uni for).  Even though I'm 
decently familiar with Unicode, I was having trouble following 
all that (e.g. Isn't "noe\u0308l" a grapheme cluster according to 
std.uni?).  On the flip side, std.utf has a serious dearth of 
examples and the relationship between the two isn't clear.

 On that note, I tried to use std.uni to write a simple example 
 of how to correctly handle this in D, but it became apparent 
 that std.uni should expose something like `byGrapheme` which 
 lazily transforms a range of code points to a range of 
 graphemes (probably needs a `byCodePoint` to do the converse 
 too). The two extant grapheme functions, `decodeGrapheme` and 
 `graphemeStride`, are *awful* for string manipulation (granted, 
 they are probably perfect for text rendering).

Yes, please.  While operations on single codepoints and 
characters seem pretty robust (i.e. you can do lots of things 
with and to them), it feels like it just falls apart when you try 
to work with strings.  It honestly surprised me how many things 
in std.uni don't seem to work on ranges.

-Wyatt

Nov 27 2013

"Wyatt" <wyatt.epp gmail.com> writes:

On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:
 trouble following all that (e.g. Isn't "noe\u0308l" a grapheme

Whoops, overzealous pasting.  That is, "e\u0308", which composes 
to "ë".  A grapheme cluster seems to represent one printed 
character: "...a horizontally segmentable unit of text, 
consisting of some grapheme base (which may consist of a Korean 
syllable) together with any number of nonspacing marks applied to 
it."

Is that about right?

-Wyatt

Nov 27 2013

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Wednesday, 27 November 2013 at 16:22:58 UTC, Wyatt wrote:
 Whoops, overzealous pasting.  That is, "e\u0308", which 
 composes to "ë".  A grapheme cluster seems to represent one 
 printed character: "...a horizontally segmentable unit of text, 
 consisting of some grapheme base (which may consist of a Korean 
 syllable) together with any number of nonspacing marks applied 
 to it."

 Is that about right?

 -Wyatt

Yes.

A grapheme is also sometimes explained as being the unit that lay 
people intuitively think of as being a "character".

The difference between a grapheme and a grapheme cluster is just 
a matter of perspective, like the difference between a character 
and a code point; the former simply refers to the decoded result, 
while the latter refers to the sum of encoding parts (where the 
parts are code points for grapheme cluster, and code units for a 
code point).

Yet another example is that of the UTF-32 code unit: one UTF-32 
code unit is (currently) equal to one Unicode code point, but 
both terms are meaningful in the right context.
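
The three levels can be made concrete with a short sketch that counts the same string in code units, code points and graphemes. It assumes a Phobos release that ships `std.uni.byGrapheme` (added after this thread); the three function names are hypothetical.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, one per level of granularity.
size_t codeUnits(string s)  { return s.length; }                // UTF-8 code units
size_t codePoints(string s) { return s.walkLength; }            // decoded code points
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // grapheme clusters

void main()
{
    string s = "noe\u0308l"; // "noël" written with a combining diaeresis
    assert(codeUnits(s) == 6);  // U+0308 takes two UTF-8 code units
    assert(codePoints(s) == 5); // n, o, e, U+0308, l
    assert(graphemes(s) == 4);  // n, o, ë, l
}
```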

Nov 27 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

27-Nov-2013 20:22, Wyatt wrote:
 On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:
 trouble following all that (e.g. Isn't "noe\u0308l" a grapheme

 Whoops, overzealous pasting.  That is, "e\u0308", which composes to
 "ë".  A grapheme cluster seems to represent one printed character: "...a
 horizontally segmentable unit of text, consisting of some grapheme base
 (which may consist of a Korean syllable) together with any number of
 nonspacing marks applied to it."

 Is that about right?

As much as the standard defines it (actually, it talks about boundaries, 
and a grapheme is whatever happens to lie in between).


More specifically, D's std.uni follows the notion of the extended 
grapheme cluster. There is no need to stick with ugly legacy crap.

See also
http://www.unicode.org/reports/tr29/


-- 
Dmitry Olshansky

Nov 27 2013

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:
 I agree with the assertion that people SHOULD know how unicode 
 works if they want to work with it, but the way our docs are 
 now is off-putting enough that most probably won't learn 
 anything.  If they know, they know; if they don't, the wall of 
 jargon is intimidating and hard to grasp (more examples up 
 front of more things that you'd actually use std.uni for).  
 Even though I'm decently familiar with Unicode, I was having 
 trouble following all that (e.g. Isn't "noe\u0308l" a grapheme 
 cluster according to std.uni?).  On the flip side, std.utf has 
 a serious dearth of examples and the relationship between the 
 two isn't clear.

I thought it was nice that std.uni had a proper terminology 
section, complete with links to Unicode documents to kick-start 
beginners to Unicode. It mentions its relationship with std.utf 
right at the top.

Maybe the first paragraph is just too thin, and it's hard to see 
the big picture. Maybe it should include a small leading 
paragraph detailing the three levels of Unicode granularity that 
D/Phobos chooses; arrays of code units -> ranges of code points 
-> std.uni for graphemes and algorithms.

 Yes, please.  While operations on single codepoints and 
 characters seem pretty robust (i.e. you can do lots of things 
 with and to them), it feels like it just falls apart when you 
 try to work with strings.  It honestly surprised me how many 
 things in std.uni don't seem to work on ranges.

 -Wyatt

Most string code is Unicode-correct as long as it works on code 
points and all inputs are of the same normalization format; 
explicit grapheme-awareness is rarely a necessity. By that I mean 
the most common string operations, such as searching, getting a 
substring etc. will work without any special grapheme decoding 
(beyond normalization).

The hiccups appear when code points are shuffled around, or the 
order is changed. Apart from these rare string manipulation 
cases, grapheme awareness is necessary for rendering code.
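
A minimal sketch of the normalization point, using `std.uni.normalize` with the NFC form. The helper name `containsNormalized` is hypothetical, and normalizing on every call is only for illustration; real code would more likely normalize once at the input boundary.

```d
import std.algorithm.searching : canFind;
import std.uni : NFC, normalize;

/// Hypothetical helper: searches after bringing both strings to NFC,
/// so composed and decomposed spellings compare equal.
bool containsNormalized(string haystack, string needle)
{
    return normalize!NFC(haystack).canFind(normalize!NFC(needle));
}

void main()
{
    string decomposed = "noe\u0308l"; // e followed by a combining diaeresis
    string composed   = "no\u00EBl";  // precomposed ë

    // A plain code point search misses the match...
    assert(!decomposed.canFind(composed));
    // ...but succeeds once both sides share a normalization form.
    assert(containsNormalized(decomposed, composed));
}
```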

Nov 27 2013

Charles Hixson <charleshixsn earthlink.net> writes:

On 11/27/2013 08:53 AM, Jakob Ovrum wrote:
 On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:
 I agree with the assertion that people SHOULD know how unicode works 
 if they want to work with it, but the way our docs are now is 
 off-putting enough that most probably won't learn anything.  If they 
 know, they know; if they don't, the wall of jargon is intimidating 
 and hard to grasp (more examples up front of more things that you'd 
 actually use std.uni for).  Even though I'm decently familiar with 
 Unicode, I was having trouble following all that (e.g. Isn't 
 "noe\u0308l" a grapheme cluster according to std.uni?).  On the flip 
 side, std.utf has a serious dearth of examples and the relationship 
 between the two isn't clear.

 I thought it was nice that std.uni had a proper terminology section, 
 complete with links to Unicode documents to kick-start beginners to 
 Unicode. It mentions its relationship with std.utf right at the top.

 Maybe the first paragraph is just too thin, and it's hard to see the 
 big picture. Maybe it should include a small leading paragraph 
 detailing the three levels of Unicode granularity that D/Phobos 
 chooses; arrays of code units -> ranges of code points -> std.uni for 
 graphemes and algorithms.

 Yes, please.  While operations on single codepoints and characters 
 seem pretty robust (i.e. you can do lots of things with and to them), 
 it feels like it just falls apart when you try to work with strings.  
 It honestly surprised me how many things in std.uni don't seem to 
 work on ranges.

 -Wyatt

 Most string code is Unicode-correct as long as it works on code points 
 and all inputs are of the same normalization format; explicit 
 grapheme-awareness is rarely a necessity. By that I mean the most 
 common string operations, such as searching, getting a substring etc. 
 will work without any special grapheme decoding (beyond normalization).

 The hiccups appear when code points are shuffled around, or the order 
 is changed. Apart from these rare string manipulation cases, grapheme 
 awareness is necessary for rendering code.

I would put things a bit more emphatically.  The codepoint is analogous 
to assembler, while the character is analogous to a high-level language 
(and the binary representation is analogous to a binary 
representation).  The desire is to make the characters easy to use in a 
way that is cheap to do.  To me this means that the high-level language 
(i.e., D) should make it easy to deal with characters, possible to deal 
with codepoints, and still let you deal with binary representations if 
you really want to.  (Also note the isomorphism between assembler code 
and binary is matched by an isomorphism between codepoints and binary 
representation.)  To do this cheaply, D needs to know what kind of 
normalization each string is in.  This is likely to cost one byte per 
string, unless there's some slack in the current representation.

But is this worthwhile?  This is the direction that things will 
eventually go, but that doesn't really mean that we need to push them in 
that direction today.  But if D had a default normalization that 
occurred during I/O operations, the cost of the normalization would 
probably be lost during the impedance matching between RAM and storage.  
(Again, however, any default requires the ability to be overridden.)

Also, of course, none of this will be of any significance to ASCII.

-- 
Charles Hixson

Nov 27 2013

Walter Bright <newshound2 digitalmars.com> writes:

On 11/27/2013 8:18 AM, Wyatt wrote:
 It honestly surprised me how
 many things in std.uni don't seem to work on ranges.

Many things in Phobos either predate ranges, or are written by people who 
aren't used to ranges and don't think in terms of ranges. It's an ongoing 
issue, and one we need to improve upon.

And, of course, you're welcome to pitch in and help with pull requests on the 
documentation and implementation!

Nov 27 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

27-Nov-2013 20:18, Wyatt wrote:
 On Wednesday, 27 November 2013 at 15:43:11 UTC, Jakob Ovrum wrote:

 It
 honestly surprised me how many things in std.uni don't seem to work on
 ranges.

Which ones? Or do you mean more like isAlpha(rangeOfCodepoints)?



-- 
Dmitry Olshansky

Nov 27 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 11/27/13 7:43 AM, Jakob Ovrum wrote:
 On that note, I tried to use std.uni to write a simple example of how to
 correctly handle this in D, but it became apparent that std.uni should
 expose something like `byGrapheme` which lazily transforms a range of
 code points to a range of graphemes (probably needs a `byCodePoint` to
 do the converse too). The two extant grapheme functions,
 `decodeGrapheme` and `graphemeStride`, are *awful* for string
 manipulation (granted, they are probably perfect for text rendering).

Yah, byGrapheme would be a great addition.

Andrei

Nov 27 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Nov 27, 2013 at 10:07:43AM -0800, Andrei Alexandrescu wrote:
 On 11/27/13 7:43 AM, Jakob Ovrum wrote:
 On that note, I tried to use std.uni to write a simple example of how
 to correctly handle this in D, but it became apparent that std.uni
 should expose something like `byGrapheme` which lazily transforms a
 range of code points to a range of graphemes (probably needs a
 `byCodePoint` to do the converse too). The two extant grapheme
 functions, `decodeGrapheme` and `graphemeStride`, are *awful* for
 string manipulation (granted, they are probably perfect for text
 rendering).

 
 Yah, byGrapheme would be a great addition.

[...]

+1. This is better than the GraphemeString / i18nString proposal
elsewhere in this thread, because it discourages people from using
graphemes (poor performance) except where actually necessary.


T

-- 
He who laughs last thinks slowest.

Nov 27 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

27-Nov-2013 22:12, H. S. Teoh wrote:
 On Wed, Nov 27, 2013 at 10:07:43AM -0800, Andrei Alexandrescu wrote:
 On 11/27/13 7:43 AM, Jakob Ovrum wrote:
 On that note, I tried to use std.uni to write a simple example of how
 to correctly handle this in D, but it became apparent that std.uni
 should expose something like `byGrapheme` which lazily transforms a
 range of code points to a range of graphemes (probably needs a
 `byCodePoint` to do the converse too). The two extant grapheme
 functions, `decodeGrapheme` and `graphemeStride`, are *awful* for
 string manipulation (granted, they are probably perfect for text
 rendering).

 Yah, byGrapheme would be a great addition.

 [...]

 +1. This is better than the GraphemeString / i18nString proposal
 elsewhere in this thread, because it discourages people from using
 graphemes (poor performance) except where actually necessary.

I could have sworn we had byGrapheme somewhere, well apparently not :(

BTW I believe that GraphemeString could still be a valuable addition. I 
know of at least one good implementation that gives you O(1) grapheme 
access with nice memory footprint numbers. It has many benefits, but the 
chief problems with it are:
a) It doesn't solve interchange at all - you'd have to encode 
on write/re-code on read
b) It relies on having global shared state across the whole program, and 
that's the real show-stopper thing about it

In any case it's a direction well worth exploring.


-- 
Dmitry Olshansky

Nov 27 2013

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Wednesday, 27 November 2013 at 20:13:32 UTC, Dmitry Olshansky 
wrote:
 I could have sworn we had byGrapheme somewhere, well apparently 
 not :(

Simple attempt:

https://github.com/D-Programming-Language/phobos/pull/1736

Nov 29 2013

Simen Kjærås <simen.kjaras gmail.com> writes:

On 27.11.2013 19:07, Andrei Alexandrescu wrote:
 On 11/27/13 7:43 AM, Jakob Ovrum wrote:
 On that note, I tried to use std.uni to write a simple example of how to
 correctly handle this in D, but it became apparent that std.uni should
 expose something like `byGrapheme` which lazily transforms a range of
 code points to a range of graphemes (probably needs a `byCodePoint` to
 do the converse too). The two extant grapheme functions,
 `decodeGrapheme` and `graphemeStride`, are *awful* for string
 manipulation (granted, they are probably perfect for text rendering).

 Yah, byGrapheme would be a great addition.

It shouldn't be hard to make, either:

import std.uni : Grapheme, decodeGrapheme;
import std.traits : isSomeString;
import std.array : empty;

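struct ByGrapheme(T) if (isSomeString!T) {
     Grapheme _front;  // grapheme currently at the front
     bool _empty;      // set once the underlying string is exhausted
     T _range;         // remaining input that has not been decoded yet

     this(T value) {
         _range = value;
         popFront();   // prime _front with the first grapheme
     }

     @property
     Grapheme front() {
         assert(!empty);
         return _front;
     }

     void popFront() {
         assert(!empty);
         _empty = _range.empty;
         if (!_empty) {
             // decodeGrapheme takes its argument by ref and advances it
             _front = decodeGrapheme(_range);
         }
     }

     @property
     bool empty() {
         return _empty;
     }
}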

auto byGrapheme(T)(T value) if (isSomeString!T) {
     return ByGrapheme!T(value);
}

void main() {
     import std.stdio;
     string s = "তঃঅ৩৵பஂஅபூ௩ᐁᑦᕵᙧᚠᚳᛦᛰ¥¼Ññ";
     writeln(s.byGrapheme);
}


-- 
   Simen

Nov 27 2013

"Gary Willoughby" <dev nomad.so> writes:

On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
 Through Reddit I have seen this small comparison of Unicode 
 handling between different programming languages:

 http://mortoray.com/2013/11/27/the-string-type-is-broken/

 D+Phobos seem to fail most things (it produces BAFFLE):
 http://dpaste.dzfl.pl/a5268c435

 Bye,
 bearophile

Ha, I was just discussing that here: 
http://forum.dlang.org/thread/xmusisihhbmefeigvxvd@forum.dlang.org

Nov 27 2013
