
digitalmars.D - Unicode Normalization (and graphemes and locales)

Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote:
 How do you suggest that we handle the normalization issue? Should we just
 assume NFC like std.uni.normalize does and provide an optional template
 argument to indicate a different normalization (like normalize does)? Since
 without providing a way to deal with the normalization, we're not actually
 making the code fully correct, just faster.
The short answer is, we don't.

1. D is a systems programming language. Baking normalization, graphemes and Unicode locales in at a low level will have a disastrous negative effect on performance and size.

2. Very little systems programming work requires level 2 or 3 Unicode support.

3. Are they needed? Pedantically, yes. Practically, not necessarily.

4. What we must do is, for each algorithm, document how it handles Unicode.

5. Normalization, graphemes, and locales should all be explicitly opt-in with corresponding library code (see the sketch after this list):

   Normalization: s.normalize.algorithm()
   Graphemes: may require separate algorithms, maybe std.grapheme?
   Locales: I have no idea, given that I have not studied that issue.

6. std.string has many analogues for std.algorithm that are specific to the peculiarities of strings. I think this is a perfectly acceptable approach. For example, there are many ways to sort Unicode strings, and many of them do not fit in with std.algorithm.sort's ways. Having special std.string sorts for them would be the most practical solution.

7. At some point, as the threads on autodecoding amply illustrate, working with level 2 or level 3 Unicode requires a certain level of understanding on the part of the programmer writing the code, because there simply is no overarching correct way to do things. The programmer is going to have to understand what he is trying to accomplish with Unicode and select the code/algorithms accordingly.
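
To illustrate point 5, a minimal sketch using what std.uni already provides (normalize and byGrapheme are existing functions; the particular composed/decomposed comparison below is only illustrative):

    import std.range : walkLength;
    import std.uni : NFC, byGrapheme, normalize;

    void main()
    {
        string precomposed = "cass\u00E9";   // é as one code point (U+00E9)
        string decomposed  = "casse\u0301";  // é as 'e' + combining acute (U+0301)

        // Nothing is normalized behind your back: the two spellings differ.
        assert(precomposed != decomposed);

        // Opt-in: normalize explicitly, then compare.
        assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));

        // Opt-in grapheme iteration when cluster boundaries matter.
        assert(precomposed.byGrapheme.walkLength == 5);
        assert(decomposed.byGrapheme.walkLength == 5);
    }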
Jun 02 2016
Jack Stouffer <jack jackstouffer.com> writes:
On Friday, 3 June 2016 at 00:14:13 UTC, Walter Bright wrote:
 5. Normalization, graphemes, and locales should all be 
 explicitly opt-in with corresponding library code.
Add decoding to that list and we're right there with you.
 7. At some point, as the threads on autodecode amply 
 illustrate, working with level 2 or level 3 Unicode requires a 
 certain level of understanding on the part of the programmer 
 writing the code, because there simply is no overarching 
 correct way to do things. The programmer is going to have to 
 understand what he is trying to accomplish with Unicode and 
 select the code/algorithms accordingly.
Working at any level of Unicode in a systems programming language requires knowledge of Unicode. The thing is, because D is a systems language, we can't have the default behavior be to decode to grapheme clusters, and because of that, we have to make everything opt-in, because anything else is fundamentally wrong on some level. Once you step out of scripting-language land, you can't get around requiring Unicode knowledge. Like I said in my blog,
 Unicode is hard. Trying to hide Unicode specifics helps
 no one because it's going to bite you in the ass eventually.
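
For concreteness, the fully opt-in levels already exist as ranges in std.utf and std.uni; a tiny sketch (the counts assume a UTF-8 source string):

    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit, byDchar;

    void main()
    {
        string s = "casse\u0301";   // "cassé" with é as 'e' + combining acute

        // Every level of Unicode is an explicit, visible opt-in:
        assert(s.byCodeUnit.walkLength == 7);  // UTF-8 code units
        assert(s.byDchar.walkLength    == 6);  // code points
        assert(s.byGrapheme.walkLength == 5);  // grapheme clusters
    }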
Jun 02 2016
Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, June 02, 2016 17:14:13 Walter Bright via Digitalmars-d wrote:
 On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote:
  > How do you suggest that we handle the normalization issue? Should we just
  > assume NFC like std.uni.normalize does and provide an optional template
  > argument to indicate a different normalization (like normalize does)?
  > Since
  > without providing a way to deal with the normalization, we're not
  > actually
  > making the code fully correct, just faster.

 The short answer is, we don't.
I generally agree. The main case I was concerned about was find, where we're talking about encoding the needle to match the haystack so that we can compare with code units. I was thinking that we'd be forced to pick a normalization scheme for that, and if it didn't match the normalization of the haystack, we'd be in trouble (hence the concern about being able to specify a normalization scheme).

However, thinking about it further, that's not actually a problem. If the needle is a dchar, then code point normalization isn't an issue, because it's only ever one code point, and if the needle uses a different encoding (e.g. UTF-16 instead of UTF-8) and we re-encode it to the encoding of the haystack, that doesn't change the normalization of the needle. Even if the code units have changed, the code points that they represent are the same. So it doesn't even make sense to try to do anything with the normalization when re-encoding the needle.

So, it looks like my concern was born from not thinking the issue through thoroughly enough.

- Jonathan M Davis
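
A small illustration of that point with existing tools (std.conv.to to transcode, std.utf.byDchar for the code point view); the needle here is just example data:

    import std.algorithm.comparison : equal;
    import std.conv : to;
    import std.utf : byDchar;

    void main()
    {
        // The same needle in two encodings: the code units differ...
        wstring needle16 = "cass\u00E9"w;        // UTF-16
        string  needle8  = needle16.to!string;   // re-encoded to UTF-8

        assert(needle16.length == 5);  // UTF-16 code units
        assert(needle8.length  == 6);  // UTF-8 code units

        // ...but the code points, and therefore whatever normalization form
        // the needle was in, are untouched by the re-encoding.
        assert(equal(needle16.byDchar, needle8.byDchar));
    }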
Jun 02 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 2:24 AM, Jonathan M Davis via Digitalmars-d wrote:
 On Thursday, June 02, 2016 17:14:13 Walter Bright via Digitalmars-d wrote:
 On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote:
  > How do you suggest that we handle the normalization issue? Should we just
  > assume NFC like std.uni.normalize does and provide an optional template
  > argument to indicate a different normalization (like normalize does)?
  > Since
  > without providing a way to deal with the normalization, we're not
  > actually
  > making the code fully correct, just faster.

 The short answer is, we don't.
 I generally agree. The main case I was concerned about was find, where we're talking about encoding the needle to match the haystack so that we can compare with code units. I was thinking that we'd be forced to pick a normalization scheme for that, and if it didn't match the normalization of the haystack, we'd be in trouble (hence the concern about being able to specify a normalization scheme).

 However, thinking about it further, that's not actually a problem. If the needle is a dchar, then code point normalization isn't an issue, because it's only ever one code point, and if the needle uses a different encoding (e.g. UTF-16 instead of UTF-8) and we re-encode it to the encoding of the haystack, that doesn't change the normalization of the needle. Even if the code units have changed, the code points that they represent are the same. So it doesn't even make sense to try to do anything with the normalization when re-encoding the needle.
But consider the case where you are searching the string "cassé" for the letter 'e'. If the é is encoded as 'e' + U+0301, then you will succeed when you should fail! However, it may be that you actually want to find any occurrence of the code point 'e', including ones with combining characters attached. This is why we really need more discretion from Phobos, and less hand-holding.

There are certainly searches that will be correct. For example, searching for newline should always work in code-point space. Actually, what happens when you use a combining character on a newline? Is it an invalid Unicode sequence? Does it matter? :)

A nice function to determine whether code points or graphemes are required for comparison, given a needle, may be useful.

-Steve
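
To make the "cassé" case concrete, a small sketch (the grapheme-level check is written as a predicate over std.uni's Grapheme; that's just one way to express it):

    import std.algorithm.comparison : equal;
    import std.algorithm.searching : canFind;
    import std.uni : byGrapheme;

    void main()
    {
        string s = "casse\u0301";  // "cassé" with é as 'e' + combining U+0301

        // Searching by code point finds the base 'e' of the combining
        // sequence...
        assert(s.canFind('e'));

        // ...but at the grapheme level there is no bare "e" cluster at all;
        // the final cluster is 'e' + U+0301.
        assert(!s.byGrapheme.canFind!(g => g[].equal("e")));
    }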
Jun 03 2016
Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, June 03, 2016 07:37:59 Steven Schveighoffer via Digitalmars-d wrote:
 But consider the case where you are searching the string: "cassé"

 for the letter 'e'. If é is encoded as 'e' + U+0301, then you will
 succeed when you should fail! However, it may be that you actually want
 to find specifically any code points with 'e', including ones with
 combining characters. This is why we really need more discretion from
 Phobos, and less hand-holding.

 There are certainly searches that will be correct. For example,
 searching for newline should always work in code-point space. Actually,
 what happens when you use a combining character on newline? Is it an
 invalid unicode sequence? Does it matter? :)

 A nice function to determine whether code points or graphemes are
 required for comparison given a needle may be useful.
Well, if you know that you're dealing with a grapheme that has that problem, you can just iterate by graphemes rather than code units like find would normally. Otherwise, what you probably end up doing is searching for the needle and then verifying that the resultant range starts with the right grapheme and not just the right code point, and then calling find again to search further into the range if it was just the right code point.

Regardless, I don't see how find is really going to solve this for you unless it either assumes that you want to deal with graphemes and converts everything to graphemes, or it assumes that you want graphemes and converts to graphemes when it finds a possible match, only considering it a match if it's a match at the grapheme level. The latter wouldn't be expensive in most cases, but it _would_ be assuming that you want to operate on graphemes even though you have a range of code units or code points, and that's not necessarily the case. You might actually want to find the code units or code points in question and not care about graphemes (much as that's not likely to be typical). That could still be acceptable if we decided that you needed to use a range of ubyte/ushort/uint rather than a range of char/wchar/dchar in the case where you actually want to look for code units or code points rather than searching for a grapheme within a range of code units or code points.

But even if we don't take graphemes into account at all with a function like find, encoding the needle and searching with code units shouldn't be a problem. It's just that the programmer needs to be aware that they might end up finding only a partial grapheme if they're not careful. The alternative is to not allow searching for needles of one character type inside a haystack of another character type and force the programmer to do the encoding rather than having find do it. That wouldn't be the end of the world, but it wouldn't be as user-friendly, and I'm not sure that it would be a great idea given that we can currently do those comparisons thanks to auto-decoding; we'd effectively be losing functionality if it didn't work with other ranges of characters (or with strings, if/once auto-decoding is killed off).

Ultimately, we need to make sure that we don't prevent the programmer from handling Unicode correctly or make it more difficult in an attempt to make it easier (which is essentially what auto-decoding does), but that doesn't mean that there aren't cases where we can bake some Unicode handling into functions to increase efficiency without losing out on correctness. Making find encode the needle so that it can compare at the code unit level doesn't lose out on correctness; it just isn't sufficient for full correctness on its own.

- Jonathan M Davis
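
A rough sketch of that "verify the hit at the grapheme level, otherwise keep searching" idea; findCheckingGraphemes is a made-up name for illustration, not a proposed Phobos API:

    import std.algorithm.comparison : equal;
    import std.algorithm.searching : find;
    import std.range : take, walkLength;
    import std.uni : byGrapheme, graphemeStride;

    // Search by code units first, then only accept a hit whose leading
    // grapheme clusters really equal the needle's clusters; otherwise skip
    // past the false positive and retry. Caveat: this does not check that
    // the hit begins on a grapheme boundary of the text preceding it, which
    // a production version would also want to do.
    string findCheckingGraphemes(string haystack, string needle)
    {
        immutable needleClusters = needle.byGrapheme.walkLength;
        auto rest = haystack;
        while (true)
        {
            rest = rest.find(needle);            // cheap code-unit search
            if (rest.length == 0)
                return rest;                     // no match at all
            if (rest.byGrapheme.take(needleClusters).equal(needle.byGrapheme))
                return rest;                     // real grapheme-level match
            rest = rest[graphemeStride(rest, 0) .. $];  // skip the partial cluster
        }
    }

    void main()
    {
        // 'é' spelled as 'e' + combining acute: a bare needle "e" is rejected...
        assert(findCheckingGraphemes("casse\u0301", "e").length == 0);
        // ...but a real standalone 'e' later in the text is still found.
        assert(findCheckingGraphemes("casse\u0301 et", "e") == "et");
    }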
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 8:06 AM, Jonathan M Davis via Digitalmars-d wrote:
 On Friday, June 03, 2016 07:37:59 Steven Schveighoffer via Digitalmars-d wrote:
 But consider the case where you are searching the string: "cassé"

 for the letter 'e'. If é is encoded as 'e' + U+0301, then you will
 succeed when you should fail! However, it may be that you actually want
 to find specifically any code points with 'e', including ones with
 combining characters. This is why we really need more discretion from
 Phobos, and less hand-holding.

 There are certainly searches that will be correct. For example,
 searching for newline should always work in code-point space. Actually,
 what happens when you use a combining character on newline? Is it an
 invalid unicode sequence? Does it matter? :)

 A nice function to determine whether code points or graphemes are
 required for comparison given a needle may be useful.
Well, if you know that you're dealing with a grapheme that has that problem, you can just iterate by graphemes rather than code units like find would normally.
Yes, I agree. This is exactly the point: don't assume anything, just treat a type as it is written. And tell the user this! If you are going to search a range of code points for a code point, you may not get what you expect. If you want to do a grapheme-aware search, change it to a range of graphemes and do a grapheme search.

What I was trying to say with my example is that searching by code points, even for graphemes that definitively fit into one code point, may still not be correct in all use cases.

-Steve
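
A tiny sketch of that last point: even a needle that fits in a single code point can be missed, unless both sides explicitly agree on a normalization form first (using std.uni.normalize):

    import std.algorithm.searching : canFind;
    import std.uni : NFC, normalize;

    void main()
    {
        string s = "casse\u0301";   // "cassé" with é as 'e' + combining acute
        dchar eAcute = '\u00E9';    // é as a single code point

        // A code-point search misses it, because this haystack happens to
        // use the decomposed spelling.
        assert(!s.canFind(eAcute));

        // After explicit normalization, the single-code-point search does
        // what the user probably meant.
        assert(normalize!NFC(s).canFind(eAcute));
    }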
Jun 03 2016
"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jun 03, 2016 at 05:06:33AM -0700, Jonathan M Davis via Digitalmars-d
wrote:
 On Friday, June 03, 2016 07:37:59 Steven Schveighoffer via Digitalmars-d wrote:
 But consider the case where you are searching the string: "cassé"

 for the letter 'e'. If é is encoded as 'e' + U+0301, then you will
 succeed when you should fail! However, it may be that you actually
 want to find specifically any code points with 'e', including ones
 with combining characters. This is why we really need more
 discretion from Phobos, and less hand-holding.

 There are certainly searches that will be correct. For example,
 searching for newline should always work in code-point space.
 Actually, what happens when you use a combining character on
 newline? Is it an invalid unicode sequence? Does it matter? :)
I'm guessing it's an invalid sequence. [...]
 Well, if you know that you're dealing with a grapheme that has that
 problem, you can just iterate by graphemes rather than code units like
 find would normally. Otherwise, what you probably end up doing is
 searching for the needle and then verifying that the resultant range
 starts with the right grapheme and not just the right code point and
 then call find again to search further into the range if it was just
 the right code point.
[...]

And this is a prime illustration of why defaulting to a particular support level is not a good idea. What if the programmer wants to count how many variations of e + diacritics are in his string? Then iterating by grapheme won't work, and you'd actually want to iterate by code point. Actually, that wouldn't work either; you'd have to normalize to NFD first, then iterate by code point. Whereas if the programmer wanted to count e but not é, then you'd have to iterate by grapheme. Or if you wanted to count é but not e, then you'd have to normalize to NFC and then iterate by grapheme.

There's no getting around learning how Unicode works, and having the standard library default to something arbitrary that doesn't always do the "right" thing while pretending it does, doesn't help.


T

-- 
Why have vacation when you can work?? -- EC
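
For concreteness, those three counts can be sketched with std.uni's normalize/byGrapheme and std.utf's byDchar; the example string deliberately mixes a precomposed and a decomposed é:

    import std.algorithm.comparison : equal;
    import std.algorithm.searching : count;
    import std.uni : NFC, NFD, byGrapheme, normalize;
    import std.utf : byDchar;

    void main()
    {
        string s = "caf\u00E9 note\u0301e";   // "café notée", é spelled both ways

        // 1. Count all e-variants (e, é, è, ...): normalize to NFD so every
        //    variant starts with a base 'e', then count by code point.
        auto eVariants = normalize!NFD(s).byDchar.count('e');

        // 2. Count bare 'e' but not 'é': compare whole grapheme clusters.
        auto bareE = s.byGrapheme.count!(g => g[].equal("e"));

        // 3. Count 'é' but not 'e': normalize to NFC so 'é' is a single,
        //    predictable cluster, then compare clusters against it.
        auto eAcute = normalize!NFC(s).byGrapheme.count!(g => g[].equal("\u00E9"));

        assert(eVariants == 3);  // the é in "café", the é and final e in "notée"
        assert(bareE == 1);      // only the final e of "notée"
        assert(eAcute == 2);     // the é in "café" and the é in "notée"
    }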
Jun 03 2016