
digitalmars.D - Unicode Normalization (and graphemes and locales)

Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote:
 How do you suggest that we handle the normalization issue? Should we just
 assume NFC like std.uni.normalize does and provide an optional template
 argument to indicate a different normalization (like normalize does)? Since
 without providing a way to deal with the normalization, we're not actually
 making the code fully correct, just faster.
The short answer is, we don't.

1. D is a systems programming language. Baking normalization, graphemes and Unicode locales in at a low level will have a disastrous negative effect on performance and size.

2. Very little systems programming work requires level 2 or 3 Unicode support.

3. Are they needed? Pedantically, yes. Practically, not necessarily.

4. What we must do is, for each algorithm, document how it handles Unicode.

5. Normalization, graphemes, and locales should all be explicitly opt-in with corresponding library code (see the sketch after this list):

   Normalization: s.normalize.algorithm()
   Graphemes: may require separate algorithms, maybe std.grapheme?
   Locales: I have no idea, given that I have not studied that issue.

6. std.string has many analogues for std.algorithm that are specific to the peculiarities of strings. I think this is a perfectly acceptable approach. For example, there are many ways to sort Unicode strings, and many of them do not fit in with std.algorithm.sort's ways. Having special std.string sorts for them would be the most practical solution.

7. At some point, as the threads on autodecoding amply illustrate, working with level 2 or level 3 Unicode requires a certain level of understanding on the part of the programmer writing the code, because there simply is no overarching correct way to do things. The programmer is going to have to understand what he is trying to accomplish with Unicode and select the code/algorithms accordingly.
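
To illustrate point 5, a minimal sketch using what std.uni already provides (normalize and byGrapheme are existing functions; the particular composed/decomposed comparison below is only illustrative):

    import std.range : walkLength;
    import std.uni : NFC, byGrapheme, normalize;

    void main()
    {
        string precomposed = "cass\u00E9";   // é as one code point (U+00E9)
        string decomposed  = "casse\u0301";  // é as 'e' + combining acute (U+0301)

        // Nothing is normalized behind your back: the two spellings differ.
        assert(precomposed != decomposed);

        // Opt-in: normalize explicitly, then compare.
        assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));

        // Opt-in grapheme iteration when cluster boundaries matter.
        assert(precomposed.byGrapheme.walkLength == 5);
        assert(decomposed.byGrapheme.walkLength == 5);
    }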
Jun 02 2016
Jack Stouffer <jack jackstouffer.com> writes:
On Friday, 3 June 2016 at 00:14:13 UTC, Walter Bright wrote:
 5. Normalization, graphemes, and locales should all be 
 explicitly opt-in with corresponding library code.
Add decoding to that list and we're right there with you.
 7. At some point, as the threads on autodecode amply 
 illustrate, working with level 2 or level 3 Unicode requires a 
 certain level of understanding on the part of the programmer 
 writing the code, because there simply is no overarching 
 correct way to do things. The programmer is going to have to 
 understand what he is trying to accomplish with Unicode and 
 select the code/algorithms accordingly.
Working at any level of Unicode in a systems programming language requires knowledge of Unicode. The thing is, because D is a systems language, we can't have the default behavior be to decode to grapheme clusters, and because of that, we have to make everything opt-in, because anything else is fundamentally wrong on some level. Once you step out of scripting-language land, you can't get around requiring Unicode knowledge. Like I said in my blog,
 Unicode is hard. Trying to hide Unicode specifics helps
 no one because it's going to bite you in the ass eventually.
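
For concreteness, the fully opt-in levels already exist as ranges in std.utf and std.uni; a tiny sketch (the counts assume a UTF-8 source string):

    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit, byDchar;

    void main()
    {
        string s = "casse\u0301";   // "cassé" with é as 'e' + combining acute

        // Every level of Unicode is an explicit, visible opt-in:
        assert(s.byCodeUnit.walkLength == 7);  // UTF-8 code units
        assert(s.byDchar.walkLength    == 6);  // code points
        assert(s.byGrapheme.walkLength == 5);  // grapheme clusters
    }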
Jun 02 2016
Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, June 02, 2016 17:14:13 Walter Bright via Digitalmars-d wrote:
 On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote:
  > How do you suggest that we handle the normalization issue? Should we just
  > assume NFC like std.uni.normalize does and provide an optional template
  > argument to indicate a different normalization (like normalize does)?
  > Since
  > without providing a way to deal with the normalization, we're not
  > actually
  > making the code fully correct, just faster.

 The short answer is, we don't.
I generally agree. The main case I was concerned about was find, where we're talking about encoding the needle to match the haystack so that we can compare with code units. I was thinking that we'd be forced to pick a normalization scheme for that, and if it didn't match the normalization of the haystack, we'd be in trouble (hence the concern about being able to specify a normalization scheme).

However, thinking about it further, that's not actually a problem. If the needle is a dchar, then code point normalization isn't an issue, because it's only ever one code point, and if the needle uses a different encoding (e.g. UTF-16 instead of UTF-8) and we re-encode it to the encoding of the haystack, that doesn't change the normalization of the needle. Even if the code units have changed, the code points that they represent are the same. So it doesn't even make sense to try to do anything with the normalization when re-encoding the needle.

So, it looks like my concern was born from not thinking the issue through thoroughly enough.

- Jonathan M Davis
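
A small illustration of that point with existing tools (std.conv.to to transcode, std.utf.byDchar for the code point view); the needle here is just example data:

    import std.algorithm.comparison : equal;
    import std.conv : to;
    import std.utf : byDchar;

    void main()
    {
        // The same needle in two encodings: the code units differ...
        wstring needle16 = "cass\u00E9"w;        // UTF-16
        string  needle8  = needle16.to!string;   // re-encoded to UTF-8

        assert(needle16.length == 5);  // UTF-16 code units
        assert(needle8.length  == 6);  // UTF-8 code units

        // ...but the code points, and therefore whatever normalization form
        // the needle was in, are untouched by the re-encoding.
        assert(equal(needle16.byDchar, needle8.byDchar));
    }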
Jun 02 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 2:24 AM, Jonathan M Davis via Digitalmars-d wrote:
 On Thursday, June 02, 2016 17:14:13 Walter Bright via Digitalmars-d wrote:
 On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote:
  > How do you suggest that we handle the normalization issue? Should we just
  > assume NFC like std.uni.normalize does and provide an optional template
  > argument to indicate a different normalization (like normalize does)?
  > Since
  > without providing a way to deal with the normalization, we're not
  > actually
  > making the code fully correct, just faster.

 The short answer is, we don't.
 I generally agree. The main case I was concerned about was find, where we're talking about encoding the needle to match the haystack so that we can compare with code units. I was thinking that we'd be forced to pick a normalization scheme for that, and if it didn't match the normalization of the haystack, we'd be in trouble (hence the concern about being able to specify a normalization scheme).

 However, thinking about it further, that's not actually a problem. If the needle is a dchar, then code point normalization isn't an issue, because it's only ever one code point, and if the needle uses a different encoding (e.g. UTF-16 instead of UTF-8) and we re-encode it to the encoding of the haystack, that doesn't change the normalization of the needle. Even if the code units have changed, the code points that they represent are the same. So it doesn't even make sense to try to do anything with the normalization when re-encoding the needle.
But consider the case where you are searching the string "cassé" for the letter 'e'. If the é is encoded as 'e' + U+0301, then you will succeed when you should fail! However, it may be that you actually want to find any occurrence of the code point 'e', including ones with combining characters attached. This is why we really need more discretion from Phobos, and less hand-holding.

There are certainly searches that will be correct. For example, searching for newline should always work in code-point space. Actually, what happens when you use a combining character on a newline? Is it an invalid Unicode sequence? Does it matter? :)

A nice function to determine whether code points or graphemes are required for comparison, given a needle, may be useful.

-Steve
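
To make the "cassé" case concrete, a small sketch (the grapheme-level check is written as a predicate over std.uni's Grapheme; that's just one way to express it):

    import std.algorithm.comparison : equal;
    import std.algorithm.searching : canFind;
    import std.uni : byGrapheme;

    void main()
    {
        string s = "casse\u0301";  // "cassé" with é as 'e' + combining U+0301

        // Searching by code point finds the base 'e' of the combining
        // sequence...
        assert(s.canFind('e'));

        // ...but at the grapheme level there is no bare "e" cluster at all;
        // the final cluster is 'e' + U+0301.
        assert(!s.byGrapheme.canFind!(g => g[].equal("e")));
    }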
Jun 03 2016
Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, June 03, 2016 07:37:59 Steven Schveighoffer via Digitalmars-d wrote:
 But consider the case where you are searching the string: "cassé"

 for the letter 'e'. If é is encoded as 'e' + U+0301, then you will
 succeed when you should fail! However, it may be that you actually want
 to find specifically any code points with 'e', including ones with
 combining characters. This is why we really need more discretion from
 Phobos, and less hand-holding.

 There are certainly searches that will be correct. For example,
 searching for newline should always work in code-point space. Actually,
 what happens when you use a combining character on newline? Is it an
 invalid unicode sequence? Does it matter? :)

 A nice function to determine whether code points or graphemes are
 required for comparison given a needle may be useful.
Well, if you know that you're dealing with a grapheme that has that problem, you can just iterate by graphemes rather than code units like find would normally. Otherwise, what you probably end up doing is searching for the needle and then verifying that the resultant range starts with the right grapheme and not just the right code point, and then calling find again to search further into the range if it was just the right code point.

Regardless, I don't see how find is really going to solve this for you unless it either assumes that you want to deal with graphemes and converts everything to graphemes, or it assumes that you want graphemes and converts to graphemes when it finds a possible match, only considering it a match if it's a match at the grapheme level. The latter wouldn't be expensive in most cases, but it _would_ be assuming that you want to operate on graphemes even though you have a range of code units or code points, and that's not necessarily the case. You might actually want to find the code units or code points in question and not care about graphemes (much as that's not likely to be typical). That could still be acceptable if we decided that you needed to use a range of ubyte/ushort/uint rather than a range of char/wchar/dchar in the case where you actually want to look for code units or code points rather than searching for a grapheme within a range of code units or code points.

But even if we don't take graphemes into account at all with a function like find, encoding the needle and searching with code units shouldn't be a problem. It's just that the programmer needs to be aware that they might end up finding only a partial grapheme if they're not careful. The alternative is to not allow searching for needles of one character type inside a haystack of another character type and force the programmer to do the encoding rather than having find do it. That wouldn't be the end of the world, but it wouldn't be as user-friendly, and I'm not sure that it would be a great idea given that we can currently do those comparisons thanks to auto-decoding; we'd effectively be losing functionality if it didn't work with other ranges of characters (or with strings, if/once auto-decoding is killed off).

Ultimately, we need to make sure that we don't prevent the programmer from handling Unicode correctly or make it more difficult in an attempt to make it easier (which is essentially what auto-decoding does), but that doesn't mean that there aren't cases where we can bake some Unicode handling into functions to increase efficiency without losing out on correctness. Making find encode the needle so that it can compare at the code unit level doesn't lose out on correctness; it just isn't sufficient for full correctness on its own.

- Jonathan M Davis
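
A rough sketch of that "verify the hit at the grapheme level, otherwise keep searching" idea; findCheckingGraphemes is a made-up name for illustration, not a proposed Phobos API:

    import std.algorithm.comparison : equal;
    import std.algorithm.searching : find;
    import std.range : take, walkLength;
    import std.uni : byGrapheme, graphemeStride;

    // Search by code units first, then only accept a hit whose leading
    // grapheme clusters really equal the needle's clusters; otherwise skip
    // past the false positive and retry. Caveat: this does not check that
    // the hit begins on a grapheme boundary of the text preceding it, which
    // a production version would also want to do.
    string findCheckingGraphemes(string haystack, string needle)
    {
        immutable needleClusters = needle.byGrapheme.walkLength;
        auto rest = haystack;
        while (true)
        {
            rest = rest.find(needle);            // cheap code-unit search
            if (rest.length == 0)
                return rest;                     // no match at all
            if (rest.byGrapheme.take(needleClusters).equal(needle.byGrapheme))
                return rest;                     // real grapheme-level match
            rest = rest[graphemeStride(rest, 0) .. $];  // skip the partial cluster
        }
    }

    void main()
    {
        // 'é' spelled as 'e' + combining acute: a bare needle "e" is rejected...
        assert(findCheckingGraphemes("casse\u0301", "e").length == 0);
        // ...but a real standalone 'e' later in the text is still found.
        assert(findCheckingGraphemes("casse\u0301 et", "e") == "et");
    }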
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 8:06 AM, Jonathan M Davis via Digitalmars-d wrote:
 On Friday, June 03, 2016 07:37:59 Steven Schveighoffer via Digitalmars-d wrote:
 But consider the case where you are searching the string: "cassé"

 for the letter 'e'. If é is encoded as 'e' + U+0301, then you will
 succeed when you should fail! However, it may be that you actually want
 to find specifically any code points with 'e', including ones with
 combining characters. This is why we really need more discretion from
 Phobos, and less hand-holding.

 There are certainly searches that will be correct. For example,
 searching for newline should always work in code-point space. Actually,
 what happens when you use a combining character on newline? Is it an
 invalid unicode sequence? Does it matter? :)

 A nice function to determine whether code points or graphemes are
 required for comparison given a needle may be useful.
Well, if you know that you're dealing with a grapheme that has that problem, you can just iterate by graphemes rather than code units like find would normally.
Yes, I agree. This is exactly the point: don't assume anything, just treat a type as it is written. And tell the user this! If you are going to search a range of code points for a code point, you may not get what you expect. If you want to do a grapheme-aware search, change it to a range of graphemes and do a grapheme search.

What I was trying to say with my example is that searching by code points, even for graphemes that definitively fit into one code point, may still not be correct in all use cases.

-Steve
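
A tiny sketch of that last point: even a needle that fits in a single code point can be missed, unless both sides explicitly agree on a normalization form first (using std.uni.normalize):

    import std.algorithm.searching : canFind;
    import std.uni : NFC, normalize;

    void main()
    {
        string s = "casse\u0301";   // "cassé" with é as 'e' + combining acute
        dchar eAcute = '\u00E9';    // é as a single code point

        // A code-point search misses it, because this haystack happens to
        // use the decomposed spelling.
        assert(!s.canFind(eAcute));

        // After explicit normalization, the single-code-point search does
        // what the user probably meant.
        assert(normalize!NFC(s).canFind(eAcute));
    }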
Jun 03 2016
"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jun 03, 2016 at 05:06:33AM -0700, Jonathan M Davis via Digitalmars-d
wrote:
 On Friday, June 03, 2016 07:37:59 Steven Schveighoffer via Digitalmars-d wrote:
 But consider the case where you are searching the string: "cassé"

 for the letter 'e'. If é is encoded as 'e' + U+0301, then you will
 succeed when you should fail! However, it may be that you actually
 want to find specifically any code points with 'e', including ones
 with combining characters. This is why we really need more
 discretion from Phobos, and less hand-holding.

 There are certainly searches that will be correct. For example,
 searching for newline should always work in code-point space.
 Actually, what happens when you use a combining character on
 newline? Is it an invalid unicode sequence? Does it matter? :)
I'm guessing it's an invalid sequence. [...]
 Well, if you know that you're dealing with a grapheme that has that
 problem, you can just iterate by graphemes rather than code units like
 find would normally. Otherwise, what you probably end up doing is
 searching for the needle and then verifying that the resultant range
 starts with the right grapheme and not just the right code point and
 then call find again to search further into the range if it was just
 the right code point.
[...]

And this is a prime illustration of why defaulting to a particular support level is not a good idea. What if the programmer wants to count how many variations of e + diacritics are in his string? Then iterating by grapheme won't work, and you'd actually want to iterate by code point. Actually, that wouldn't work either; you'd have to normalize to NFD first, then iterate by code point. Whereas if the programmer wanted to count e but not é, then you'd have to iterate by grapheme. Or if you wanted to count é but not e, then you'd have to normalize to NFC and then iterate by grapheme.

There's no getting around learning how Unicode works, and having the standard library default to something arbitrary that doesn't always do the "right" thing while pretending it does, doesn't help.


T

-- 
Why have vacation when you can work?? -- EC
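
For concreteness, those three counts can be sketched with std.uni's normalize/byGrapheme and std.utf's byDchar; the example string deliberately mixes a precomposed and a decomposed é:

    import std.algorithm.comparison : equal;
    import std.algorithm.searching : count;
    import std.uni : NFC, NFD, byGrapheme, normalize;
    import std.utf : byDchar;

    void main()
    {
        string s = "caf\u00E9 note\u0301e";   // "café notée", é spelled both ways

        // 1. Count all e-variants (e, é, è, ...): normalize to NFD so every
        //    variant starts with a base 'e', then count by code point.
        auto eVariants = normalize!NFD(s).byDchar.count('e');

        // 2. Count bare 'e' but not 'é': compare whole grapheme clusters.
        auto bareE = s.byGrapheme.count!(g => g[].equal("e"));

        // 3. Count 'é' but not 'e': normalize to NFC so 'é' is a single,
        //    predictable cluster, then compare clusters against it.
        auto eAcute = normalize!NFC(s).byGrapheme.count!(g => g[].equal("\u00E9"));

        assert(eVariants == 3);  // the é in "café", the é and final e in "notée"
        assert(bareE == 1);      // only the final e of "notée"
        assert(eAcute == 2);     // the é in "café" and the é in "notée"
    }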
Jun 03 2016