
digitalmars.D - Today's programming challenge - How's your Range-Fu ?

reply Walter Bright <newshound2 digitalmars.com> writes:
Challenge level - Moderately easy

Consider the function std.string.wrap:

   http://dlang.org/phobos/std_string.html#.wrap

It takes a string as input, and returns a GC allocated string that is 
word-wrapped. It needs to be enhanced to:

1. Accept a ForwardRange as input.
2. Return a lazy ForwardRange that delivers the characters of the wrapped
result 
one by one on demand.
3. Not allocate any memory.
4. The element encoding type of the returned range must be the same as the 
element encoding type of the input.
Apr 17 2015
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Apr 17, 2015 at 02:09:07AM -0700, Walter Bright via Digitalmars-d wrote:
 Challenge level - Moderately easy

 Consider the function std.string.wrap:
 
   http://dlang.org/phobos/std_string.html#.wrap
 
 It takes a string as input, and returns a GC allocated string that is
 word-wrapped. It needs to be enhanced to:
 
 1. Accept a ForwardRange as input.
 2. Return a lazy ForwardRange that delivers the characters of the
 wrapped result one by one on demand.
 3. Not allocate any memory.
 4. The element encoding type of the returned range must be the same as
 the element encoding type of the input.
This is harder than it looks at first sight, actually. Mostly thanks to the complexity of Unicode... you need to identify zero-width, normal-width, and double-width characters, combining diacritics, and various kinds of spaces (e.g. you cannot break on a non-breaking space) and treat them accordingly. Which requires decoding. (Well, in theory std.uni could be enhanced to work directly with encoded data, but right now it doesn't. In any case this is outside the scope of this challenge, I think.)

Unfortunately, the only reliable way I currently know of to deal with the spacing of Unicode characters correctly is to segment the input with byGrapheme, which currently is GC-dependent. So this fails (3). There's also the question of what to do with bidi markings: how do you count columns in that case?

Of course, if you forgo Unicode correctness, then you *could* just word-wrap on a per-character basis (i.e., every character counts as 1 column), but this also makes the resulting code useless as far as dealing with general Unicode data is concerned -- it'd only work for ASCII and the various character ranges inherited from the old 8-bit European encodings. Not to mention that line-breaking in Chinese cannot work as prescribed anyway, because the rules are different (you can break anywhere at a character boundary except punctuation -- there is no such concept as a space character in Chinese writing). The same applies to Korean/Japanese.

So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max -- but the latter is a pretty big project (cf. the Unicode line-breaking algorithm, which is one of the TRs).

T

-- 
All problems are easy in retrospect.
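To make the non-breaking-space point concrete (an untested sketch, not part of the challenge): std.uni classifies U+00A0 as whitespace, so a naive wrapper that breaks wherever isWhite is true would break exactly where breaking is forbidden.

```d
import std.uni : isWhite;

void main()
{
    // U+00A0 (no-break space) is in Unicode category Zs, so
    // std.uni.isWhite reports it as whitespace...
    assert(isWhite('\u00A0'));

    // ...which means a wrapper that simply breaks on isWhite would
    // split "100\u00A0km" between the number and the unit, exactly
    // the break the author used NBSP to prevent.
    assert(isWhite(' ')); // ordinary space, a legal break point
}
```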
Apr 17 2015
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
 So either you have to throw out all pretenses of Unicode-correctness and
 just stick with ASCII-style per-character line-wrapping, or you have to
 live with byGrapheme with all the complexity that it entails. The former
 is quite easy to write -- I could throw it together in a couple o' hours
 max, but the latter is a pretty big project (cf. Unicode line-breaking
 algorithm, which is one of the TR's).
It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Apr 17 2015
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
 On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
 So either you have to throw out all pretenses of 
 Unicode-correctness and
 just stick with ASCII-style per-character line-wrapping, or 
 you have to
 live with byGrapheme with all the complexity that it entails. 
 The former
 is quite easy to write -- I could throw it together in a 
 couple o' hours
 max, but the latter is a pretty big project (cf. Unicode 
 line-breaking
 algorithm, which is one of the TR's).
It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Code points aren't equivalent to characters. They're not the same thing in most European languages, never mind the rest of the world. If we have a line-wrapping algorithm in Phobos that works by code points, it needs a large "THIS IS ONLY FOR SIMPLE ENGLISH TEXT" warning. Code points are a useful chunk size for some tasks and completely insufficient for others.
Apr 18 2015
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2015-04-18 09:58, John Colvin wrote:

 Code points aren't equivalent to characters. They're not the same thing
 in most European languages, never mind the rest of the world. If we have
 a line-wrapping algorithm in phobos that works by code points, it needs
 a large "THIS IS ONLY FOR SIMPLE ENGLISH TEXT" warning.
For that we have std.ascii.

-- 
/Jacob Carlborg
Apr 18 2015
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 12:58 AM, John Colvin wrote:
 On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
 On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
 So either you have to throw out all pretenses of Unicode-correctness and
 just stick with ASCII-style per-character line-wrapping, or you have to
 live with byGrapheme with all the complexity that it entails. The former
 is quite easy to write -- I could throw it together in a couple o' hours
 max, but the latter is a pretty big project (cf. Unicode line-breaking
 algorithm, which is one of the TR's).
It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Code points aren't equivalent to characters. They're not the same thing in most European languages,
I know a bit of German, for what characters is that not true?
 never mind the rest of the world. If we have a line-wrapping
 algorithm in phobos that works by code points, it needs a large "THIS IS ONLY
 FOR SIMPLE ENGLISH TEXT" warning.

 Code points are a useful chunk size for some tasks and completely insufficient
 for others.
The first order of business is making wrap() work with ranges, and otherwise work the same as it always has (it's one of the oldest Phobos functions).

There are different standard levels of Unicode support. The lowest level is working correctly with code points, which is what wrap() does. Going to a higher level of support comes after range support.

I know little about combining characters. You obviously know much more; do you want to take charge of this function?
Apr 18 2015
parent reply "Panke" <tobias pankrath.net> writes:
On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:
 On 4/18/2015 12:58 AM, John Colvin wrote:
 On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
 On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
 So either you have to throw out all pretenses of 
 Unicode-correctness and
 just stick with ASCII-style per-character line-wrapping, or 
 you have to
 live with byGrapheme with all the complexity that it 
 entails. The former
 is quite easy to write -- I could throw it together in a 
 couple o' hours
 max, but the latter is a pretty big project (cf. Unicode 
 line-breaking
 algorithm, which is one of the TR's).
It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Code points aren't equivalent to characters. They're not the same thing in most European languages,
I know a bit of German, for what characters is that not true?
Umlauts, if combining characters are used. Also words that still have their accents left after import from foreign languages, e.g. Café.

Getting all of Unicode correct seems a daunting task with a severe performance impact, especially if we need to assume that a string might be in any normalization form, or in none at all.

See also: http://unicode.org/reports/tr15/#Norm_Forms
Apr 18 2015
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 1:26 AM, Panke wrote:
 On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:
 On 4/18/2015 12:58 AM, John Colvin wrote:
 On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
 On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
 So either you have to throw out all pretenses of Unicode-correctness and
 just stick with ASCII-style per-character line-wrapping, or you have to
 live with byGrapheme with all the complexity that it entails. The former
 is quite easy to write -- I could throw it together in a couple o' hours
 max, but the latter is a pretty big project (cf. Unicode line-breaking
 algorithm, which is one of the TR's).
It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Code points aren't equivalent to characters. They're not the same thing in most European languages,
I know a bit of German, for what characters is that not true?
Umlauts, if combining characters are used. Also words that still have their accents left after import from foreign languages. E.g. Café
That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.
Apr 18 2015
next sibling parent "Panke" <tobias pankrath.net> writes:
 That doesn't make sense to me, because the umlauts and the 
 accented e all have Unicode code point assignments.
Yes, but you may have perfectly fine Unicode text where the decomposed form (base letter plus combining mark) is used. Actually, there is a normalization form for Unicode (NFD) that requires the decomposed form. To be fully correct, Phobos needs to handle that as well.
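For what it's worth, std.uni.normalize already converts between the two forms; a small sketch (NFC requires the composed form, NFD the decomposed one):

```d
import std.uni : normalize, NFC, NFD;

void main()
{
    // NFC composes: 'e' followed by U+0301 becomes the single
    // precomposed code point U+00E9 (é).
    assert(normalize!NFC("e\u0301") == "\u00E9");

    // NFD decomposes: the precomposed é is split back into the
    // base letter plus the combining acute accent.
    assert(normalize!NFD("\u00E9") == "e\u0301");
}
```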
Apr 18 2015
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2015-04-18 12:27, Walter Bright wrote:

 That doesn't make sense to me, because the umlauts and the accented e
 all have Unicode code point assignments.
This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";

    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);

    writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm

-- 
/Jacob Carlborg
Apr 18 2015
next sibling parent reply "Chris" <wendlec tcd.ie> writes:
On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
 On 2015-04-18 12:27, Walter Bright wrote:

 That doesn't make sense to me, because the umlauts and the 
 accented e
 all have Unicode code point assignments.
This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";

    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);

    writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Yep, this was the cause of some bugs I had in my program. The thing is you never know whether a text is composed or decomposed, so you have to be prepared for "é" having length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once.
Apr 18 2015
next sibling parent reply "Gary Willoughby" <dev nomad.so> writes:
On Saturday, 18 April 2015 at 11:52:52 UTC, Chris wrote:
 On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg 
 wrote:
 On 2015-04-18 12:27, Walter Bright wrote:

 That doesn't make sense to me, because the umlauts and the 
 accented e
 all have Unicode code point assignments.
This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";

    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);

    writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Yep, this was the cause of some bugs I had in my program. The thing is you never know, if a text is composed or decomposed, so you have to be prepared that "é" has length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once.
byGrapheme to the rescue: http://dlang.org/phobos/std_uni.html#byGrapheme Or is this unsuitable here?
Apr 18 2015
parent reply Jacob Carlborg <doob me.com> writes:
On 2015-04-18 14:25, Gary Willoughby wrote:

 byGrapheme to the rescue:

 http://dlang.org/phobos/std_uni.html#byGrapheme

 Or is this unsuitable here?
How is byGrapheme supposed to be used? I tried this but it doesn't do what I expected:

foreach (e ; "e\u0301".byGrapheme)
    writeln(e);

-- 
/Jacob Carlborg
Apr 18 2015
parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Saturday, 18 April 2015 at 12:48:53 UTC, Jacob Carlborg wrote:
 On 2015-04-18 14:25, Gary Willoughby wrote:

 byGrapheme to the rescue:

 http://dlang.org/phobos/std_uni.html#byGrapheme

 Or is this unsuitable here?
How is byGrapheme supposed to be used? I tried this but it doesn't do what I expected:

foreach (e ; "e\u0301".byGrapheme)
    writeln(e);
void main()
{
    import std.stdio;
    import std.uni;

    foreach (e ; "e\u0301".byGrapheme)
        writeln(e[]);
}
Apr 18 2015
prev sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, Apr 18, 2015 at 11:52:50AM +0000, Chris via Digitalmars-d wrote:
 On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
On 2015-04-18 12:27, Walter Bright wrote:

That doesn't make sense to me, because the umlauts and the accented
e all have Unicode code point assignments.
This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";

    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);

    writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Yep, this was the cause of some bugs I had in my program. The thing is you never know, if a text is composed or decomposed, so you have to be prepared that "é" has length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once.
Wait, I thought the recommended approach is to normalize first, then do string processing later? Normalizing first will eliminate inconsistencies of this sort, and allow string-processing code to use a uniform approach to handling the string. I don't think it's a good idea to manually deal with composed/decomposed issues within every individual string function.

Of course, even after normalization, you still have the issue of zero-width characters and combining diacritics, because not every language has precomposed characters handy. Using byGrapheme, within the current state of Phobos, is still the best bet for correctly counting the number of printed columns as opposed to the number of "characters" (which, in the Unicode definition, does not always match the layman's notion of "character").

Unfortunately, byGrapheme may allocate, which fails Walter's requirements. Well, to be fair, byGrapheme only *occasionally* allocates -- only for input with unusually long sequences of combining diacritics; for normal use cases you'll pretty much never have any allocations. But the language can't express the idea of "occasionally allocates"; there is only "allocates" or "@nogc". Which makes it unusable in @nogc code.

One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme.

T

-- 
Just because you survived after you did it, doesn't mean it wasn't stupid!
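For reference, this is what graphemeStride already computes today, and what it should keep computing without allocating; a quick sketch (the sizes assume UTF-8 input):

```d
import std.uni : graphemeStride;

void main()
{
    string s = "e\u0301x";

    // The first grapheme cluster is 'e' plus the combining acute:
    // 1 code unit for 'e' + 2 code units for U+0301 = 3 in UTF-8.
    assert(graphemeStride(s, 0) == 3);

    // The next grapheme, plain ASCII 'x', is a single code unit.
    assert(graphemeStride(s, 3) == 1);
}
```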
Apr 18 2015
next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
 Wait, I thought the recommended approach is to normalize first, 
 then do
 string processing later? Normalizing first will eliminate
 inconsistencies of this sort, and allow string-processing code 
 to use a
 uniform approach to handling the string. I don't think it's a 
 good idea
 to manually deal with composed/decomposed issues within every 
 individual
 string function.
1. Problem: Normalization is not closed under almost all operations, e.g. concatenating two normalized strings does not guarantee that the result is in normalized form.

2. Problem: Some Unicode algorithms, e.g. string comparison, require a normalization step. It doesn't matter which form you use, but you have to pick one.

Now we could say that all strings passed to Phobos have to be normalized as (say) NFC, and that Phobos functions thus skip the normalization.
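Problem 1 is easy to demonstrate (a small sketch): both operands below are individually in NFC, but their concatenation is not.

```d
import std.uni : normalize, NFC;

void main()
{
    string a = "e";      // already in NFC
    string b = "\u0301"; // a lone combining acute is also valid NFC

    string c = a ~ b;    // concatenation of two NFC strings...

    // ...is not in NFC: re-normalizing composes it into é (U+00E9).
    assert(normalize!NFC(c) != c);
    assert(normalize!NFC(c) == "\u00E9");
}
```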
Apr 18 2015
prev sibling next sibling parent "Chris" <wendlec tcd.ie> writes:
On Saturday, 18 April 2015 at 13:30:09 UTC, H. S. Teoh wrote:
 On Sat, Apr 18, 2015 at 11:52:50AM +0000, Chris via 
 Digitalmars-d wrote:
 On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg 
 wrote:
On 2015-04-18 12:27, Walter Bright wrote:

That doesn't make sense to me, because the umlauts and the 
accented
e all have Unicode code point assignments.
This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";

    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);

    writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Yep, this was the cause of some bugs I had in my program. The thing is you never know, if a text is composed or decomposed, so you have to be prepared that "é" has length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once.
Wait, I thought the recommended approach is to normalize first, then do string processing later? Normalizing first will eliminate inconsistencies of this sort, and allow string-processing code to use a uniform approach to handling the string. I don't think it's a good idea to manually deal with composed/decomposed issues within every individual string function.

Of course, even after normalization, you still have the issue of zero-width characters and combining diacritics, because not every language has precomposed characters handy. Using byGrapheme, within the current state of Phobos, is still the best bet for correctly counting the number of printed columns as opposed to the number of "characters" (which, in the Unicode definition, does not always match the layman's notion of "character").

Unfortunately, byGrapheme may allocate, which fails Walter's requirements. Well, to be fair, byGrapheme only *occasionally* allocates -- only for input with unusually long sequences of combining diacritics; for normal use cases you'll pretty much never have any allocations. But the language can't express the idea of "occasionally allocates"; there is only "allocates" or "@nogc". Which makes it unusable in @nogc code.

One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme.

T
This is why on OS X I always normalized strings to composed. However, there are always issues with Unicode, because, as you said, the layman's notion of what a character is is not the same as Unicode's. I wrote a utility function that uses byGrapheme and byCodePoint. It's a bit of an overhead, but I always get the correct length and character access (e.g. if txt.startsWith("é")).
Apr 18 2015
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
 One possible solution would be to modify std.uni.graphemeStride to not
 allocate, since it shouldn't need to do so just to compute the length of
 the next grapheme.
That should be done. There should be a fixed maximum codepoint count to graphemeStride.
Apr 18 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:
 On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
One possible solution would be to modify std.uni.graphemeStride to
not allocate, since it shouldn't need to do so just to compute the
length of the next grapheme.
That should be done. There should be a fixed maximum codepoint count to graphemeStride.
Why? Scanning a string for a grapheme of arbitrary length does not need allocation since you're just reading data. Unless there is some required intermediate representation that I'm not aware of? T -- "How are you doing?" "Doing what?"
Apr 18 2015
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 11:29 AM, H. S. Teoh via Digitalmars-d wrote:
 On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d
wrote:
 On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
 One possible solution would be to modify std.uni.graphemeStride to
 not allocate, since it shouldn't need to do so just to compute the
 length of the next grapheme.
That should be done. There should be a fixed maximum codepoint count to graphemeStride.
Why? Scanning a string for a grapheme of arbitrary length does not need allocation since you're just reading data. Unless there is some required intermediate representation that I'm not aware of?
If there's no need for allocation at all, why does it allocate? This should be fixed.
Apr 18 2015
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, Apr 18, 2015 at 11:37:27AM -0700, Walter Bright via Digitalmars-d wrote:
 On 4/18/2015 11:29 AM, H. S. Teoh via Digitalmars-d wrote:
On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:
On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
One possible solution would be to modify std.uni.graphemeStride to
not allocate, since it shouldn't need to do so just to compute the
length of the next grapheme.
That should be done. There should be a fixed maximum codepoint count to graphemeStride.
Why? Scanning a string for a grapheme of arbitrary length does not need allocation since you're just reading data. Unless there is some required intermediate representation that I'm not aware of?
If there's no need for allocation at all, why does it allocate? This should be fixed.
AFAICT, the only reason it allocates is because it shares the same underlying implementation as byGrapheme. There's probably a way to fix this, I just don't have the time right now to figure out the code.

T

-- 
Little children, little troubles.
Apr 18 2015
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/18/15 4:35 AM, Jacob Carlborg wrote:
 On 2015-04-18 12:27, Walter Bright wrote:

 That doesn't make sense to me, because the umlauts and the accented e
 all have Unicode code point assignments.
This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";

    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);

    writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline. Then the rest of Phobos doesn't need to mind these combining characters. -- Andrei
Apr 18 2015
next sibling parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
 Isn't this solved commonly with a normalization pass? We should 
 have a normalizeUTF() that can be inserted in a pipeline.
Yes.
 Then the rest of Phobos doesn't need to mind these combining 
 characters. -- Andrei
I don't think so. The thing is, even after normalization we still have to deal with combining characters, because in every normalization form some combining characters remain (not every combination has a precomposed code point).
Apr 18 2015
parent reply "Chris" <wendlec tcd.ie> writes:
On Saturday, 18 April 2015 at 17:04:54 UTC, Tobias Pankrath wrote:
 Isn't this solved commonly with a normalization pass? We 
 should have a normalizeUTF() that can be inserted in a 
 pipeline.
Yes.
 Then the rest of Phobos doesn't need to mind these combining 
 characters. -- Andrei
I don't think so. The thing is, even after normalization we have to deal with combining characters because in all normalization forms there will be combining characters left after normalization.
Yes, again and again I encountered length-related bugs with Unicode characters. Normalization is not 100% reliable. I don't know anyone who works with non-English characters who doesn't run into Unicode-related issues sometimes.
Apr 20 2015
parent reply "Panke" <tobias pankrath.net> writes:
 Yes, again and again I encountered length related bugs with 
 Unicode characters. Normalization is not 100% reliable.
I think it is 100% reliable, it just doesn't make the problems go away. It just guarantees that two strings normalized to the same form are binary equal iff they are equal in the Unicode sense. It says nothing about columns, string length or grapheme count.
Apr 20 2015
parent reply "Chris" <wendlec tcd.ie> writes:
On Monday, 20 April 2015 at 11:04:58 UTC, Panke wrote:
 Yes, again and again I encountered length related bugs with 
 Unicode characters. Normalization is not 100% reliable.
I think it is 100% reliable, it just doesn't make the problems go away. It just guarantees that two strings normalized to the same form are binary equal iff they are equal in the unicode sense. Nothing about columns or string length or grapheme count.
The problem is not normalization as such, the problem is with string (as opposed to dstring):

import std.uni : normalize, NFC;

void main()
{
    dstring de_one = "é";
    dstring de_two = "e\u0301";

    assert(de_one.length == 1);
    assert(de_two.length == 2);

    string e_one = "é";
    string e_two = "e\u0301";
    string random = "ab";

    assert(e_one.length == 2);
    assert(e_two.length == 3);
    assert(e_one.length == random.length);

    assert(normalize!NFC(e_one).length == 2);
    assert(normalize!NFC(e_two).length == 2);
}

This can lead to subtle bugs, cf. the lengths of random and e_one. You have to convert everything to dstring to get the "expected" result. However, this is not always desirable.
Apr 20 2015
parent reply "Panke" <tobias pankrath.net> writes:
 This can lead to subtle bugs, cf. length of random and e_one. 
 You have to convert everything to dstring to get the "expected" 
 result. However, this is not always desirable.
There are three things that you need to be aware of when handling unicode: code units, code points and graphems. In general the length of one guarantees anything about the length of the other, except for utf32, which is a 1:1 mapping between code units and code points. In this thread, we were discussing the relationship between code points and graphemes. You're examples however apply to the relationship between code units and code points. To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units. If you normalize a string (in the sequence of characters/codepoints sense, not object.string) to NFC, it will decompose every precomposed character in the string (like é, single codeunit), establish a defined order between the composite characters and then recompose a selected few graphemes (like é). This way é always ends up as a single code unit in NFC. There are dozens of other combinations where you'll still have n:1 mapping between code points and graphemes left after normalization. Example given already in this thread: putting an arrow over an latin letter is typical in math and always more than one codepoint.
Apr 20 2015
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
 To measure the columns needed to print a string, you'll need 
 the number of graphemes. (d|)?string.length gives you the 
 number of code units.
Even that's not really true. In the end it's up to the font and layout engine to decide how much space anything takes up. Unicode doesn't play nicely with the idea of text as a grid of rows and fixed-width columns of characters, although quite a lot can be (and is; see urxvt, for example) shoe-horned in.
Apr 20 2015
next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, Apr 20, 2015 at 06:03:49PM +0000, John Colvin via Digitalmars-d wrote:
 On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
To measure the columns needed to print a string, you'll need the
number of graphemes. (d|)?string.length gives you the number of code
units.
Even that's not really true. In the end it's up to the font and layout engine to decide how much space anything takes up. Unicode doesn't play nicely with the idea of text as a grid of rows and fixed-width columns of characters, although quite a lot can (and is, see urxvt for example) be shoe-horned in.
Yeah, even the grapheme count does not necessarily tell you how wide the printed string really is. The characters in the CJK block are usually rendered with fonts that are, on average, twice as wide as your typical Latin/Cyrillic character, so even applications like urxvt that shoehorn proportional-width fonts into a text grid render CJK characters as two columns rather than one. Because of this, I actually wrote a function at one time to determine the width of a given Unicode character (i.e., zero, single, or double) as displayed in urxvt. Obviously, this is no help if you need to wrap lines rendered with a proportional font. And it doesn't even attempt to work correctly with bidi text.

This is why I said at the beginning that wrapping a line of text is a LOT harder than it sounds. A function that only takes a string as input does not have the necessary information to do this correctly in all use cases. The current wrap() function doesn't even do it correctly modulo the information available: it doesn't handle combining diacritics and zero-width characters properly. In fact, it doesn't even handle control characters properly, except perhaps for \t and \n.

There are so many things wrong with the current wrap() function (and many other string-processing functions in Phobos) that it makes it look like a joke when we claim that D provides Unicode correctness out-of-the-box. The only use case where wrap() gives the correct result is when you stick with pre-Unicode Latin strings displayed on a text console. As such, I don't really see the general utility of wrap() as it currently stands, and I question its value in Phobos, as opposed to an actually more useful implementation that, for instance, correctly implements the Unicode line-breaking algorithm.

T

-- 
It said to install Windows 2000 or better, so I installed Linux instead.
Apr 20 2015
prev sibling parent reply "Panke" <tobias pankrath.net> writes:
On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:
 On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
 To measure the columns needed to print a string, you'll need 
 the number of graphemes. (d|)?string.length gives you the 
 number of code units.
Even that's not really true.
Why? Doesn't string.length give you the byte count?
Apr 20 2015
next sibling parent "rumbu" <rumbu rumbu.ro> writes:
On Monday, 20 April 2015 at 19:24:01 UTC, Panke wrote:
 On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:
 On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
 To measure the columns needed to print a string, you'll need 
 the number of graphemes. (d|)?string.length gives you the 
 number of code units.
Even that's not really true.
Why? Doesn't string.length give you the byte count?
You'll also need the unicode character display width: even if the font is monospaced, there are characters (Katakana, Hangul and even in Latin script) with variable width.

ＡＢＣＤＥＦＧＨ
ABCDEFGH

(unicode 0xff21 through 0xff28). If the text above is not correctly displayed on your computer, a Korean console can be viewed here: http://upload.wikimedia.org/wikipedia/commons/1/14/KoreanDOSPrompt.png
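[Editor's note: the fullwidth letters above are distinct code points from their ASCII counterparts, so a one-code-point-per-column assumption silently breaks on them. A minimal sketch:]

```d
void main()
{
    dstring full = "\uFF21\uFF22\uFF23"; // fullwidth ＡＢＣ, roughly 2 columns each
    dstring half = "ABC";                // ASCII ABC, 1 column each
    assert(full != half);
    assert(full.length == half.length);  // same code-point count...
    // ...but the fullwidth string occupies about twice the terminal width.
    // Phobos has no built-in East Asian Width lookup; a real implementation
    // would need the Unicode EastAsianWidth.txt data.
}
```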
Apr 20 2015
prev sibling parent reply "JohnnyK" <johnnykinsey comcast.net> writes:
On Monday, 20 April 2015 at 19:24:01 UTC, Panke wrote:
 On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:
 On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
 To measure the columns needed to print a string, you'll need 
 the number of graphemes. (d|)?string.length gives you the 
 number of code units.
Even that's not really true.
Why? Doesn't string.length give you the byte count?
I think what you are looking for is string.sizeof?

From the D reference:

.sizeof	Returns the array length multiplied by the number of bytes per array element.
.length	Returns the number of elements in the array. This is a fixed quantity for static arrays. It is of type size_t.

Isn't a string type an array of characters (char[] for UTF-8, wchar[] for UTF-16, and dchar[] for UTF-32) and not arbitrary bytes?
Apr 21 2015
parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Tuesday, 21 April 2015 at 13:06:22 UTC, JohnnyK wrote:
 On Monday, 20 April 2015 at 19:24:01 UTC, Panke wrote:
 On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:
 On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
 To measure the columns needed to print a string, you'll need 
 the number of graphemes. (d|)?string.length gives you the 
 number of code units.
Even that's not really true.
Why? Doesn't string.length give you the byte count?
I was talking about the "you'll need the number of graphemes". s.length returns the number of elements in the slice, which in the case of D's string types is the same as the number of code units.
 I think what you are looking for is string.sizeof?

 From the D reference

 .sizeof	Returns the array length multiplied by the number of 
 bytes per array element.
 .length	Returns the number of elements in the array. This is a 
 fixed quantity for static arrays. It is of type size_t.
That is for static arrays only. .sizeof for slices is just size_t.sizeof + T*.sizeof i.e. 8 on 32 bit, 16 on 64 bit.
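[Editor's note: both points can be checked at compile time. A minimal sketch:]

```d
void main()
{
    int[] slice;
    int[4] fixed;

    // A slice is a (length, pointer) pair, regardless of element count:
    static assert(slice.sizeof == size_t.sizeof + (int*).sizeof);

    // A static array's .sizeof is element size times length:
    static assert(fixed.sizeof == 4 * int.sizeof);

    // .length of a string slice counts code units, not visible characters:
    auto s = "héllo"; // 5 code points, 6 UTF-8 code units
    assert(s.length == 6);
}
```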
Apr 21 2015
prev sibling parent "Chris" <wendlec tcd.ie> writes:
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
 This can lead to subtle bugs, cf. length of random and e_one. 
 You have to convert everything to dstring to get the 
 "expected" result. However, this is not always desirable.
There are three things that you need to be aware of when handling Unicode: code units, code points and graphemes.
This is why I use a helper function that uses byCodePoint and byGrapheme. At least for my use cases it returns the correct length. However, I might think about an alternative version based on the discussion here.
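[Editor's note: Chris doesn't show his helper, but a minimal sketch of the idea, counting graphemes instead of code units, might look like the following. The name displayLength is made up here.]

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: counts grapheme clusters, which is closer to
// "characters as the user sees them" than .length (code units).
size_t displayLength(S)(S s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    assert("e\u0301".displayLength == 1); // one visible character
    assert("e\u0301".length == 3);        // three UTF-8 code units
}
```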
 In general the length of one says nothing about the 
 length of the other, except for utf32, which is a 1:1 mapping 
 between code units and code points.

 In this thread, we were discussing the relationship between 
 code points and graphemes. Your examples however apply to the 
 relationship between code units and code points.

 To measure the columns needed to print a string, you'll need 
 the number of graphemes. (d|)?string.length gives you the 
 number of code units.

 If you normalize a string (in the sequence of 
 characters/codepoints sense, not object.string) to NFC, it will 
 decompose every precomposed character in the string (like é, a 
 single code point), establish a defined order between the 
 combining characters and then recompose a selected few 
 graphemes (like é). This way é always ends up as a single code 
 point in NFC. There are dozens of other combinations where 
 you'll still have an n:1 mapping between code points and 
 graphemes left after normalization.

 Example given already in this thread: putting an arrow over a 
 Latin letter is typical in math and always more than one 
 codepoint.
Apr 20 2015
prev sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 18 April 2015 at 16:01:20 UTC, Andrei Alexandrescu 
wrote:
 On 4/18/15 4:35 AM, Jacob Carlborg wrote:
 On 2015-04-18 12:27, Walter Bright wrote:

 That doesn't make sense to me, because the umlauts and the 
 accented e
 all have Unicode code point assignments.
This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";

    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);

    writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same.

\u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline. Then the rest of Phobos doesn't need to mind these combining characters. -- Andrei
Normalisation can allow some simplifications, sometimes, but knowing whether it will or not requires a lot of a priori knowledge about the input as well as the normalisation form.
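[Editor's note: Phobos already ships the building block Andrei describes; std.uni.normalize can serve as that pipeline stage, subject to John's caveat that normalization does not collapse every combining sequence. A minimal sketch:]

```d
import std.uni : normalize, NFC;

void main()
{
    string a = "e\u0301"; // decomposed: 'e' + combining acute
    string b = "é";       // precomposed U+00E9
    assert(a != b);                // bitwise comparison fails
    assert(normalize!NFC(a) == b); // equal after NFC normalization
}
```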
Apr 19 2015
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
 \u0301 is the "combining acute accent" [1].

 [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
I won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.
Apr 18 2015
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, Apr 18, 2015 at 10:50:18AM -0700, Walter Bright via Digitalmars-d wrote:
 On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
\u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
I won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.
Well, *somebody* has to convert it to the single code point eacute, whether it's the human (if the keyboard has a single key for it), or the code interpreting keystrokes (the user may have typed it as e + combining acute), or the program that generated the combination, or the program that receives the data. When we don't know provenance of incoming data, we have to assume the worst and run normalization to be sure that we got it right. The two code-point version may also arise from string concatenation, in which case normalization has to be done again (or possibly from the point of concatenation, given the right algorithms). T -- Mediocrity has been pushed to extremes.
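[Editor's note: the concatenation case mentioned above is easy to trigger even when both inputs are individually normalized. A minimal sketch:]

```d
import std.uni : normalize, NFC;

void main()
{
    string prefix = "e";       // valid NFC on its own
    string suffix = "\u0301s"; // leading combining acute; also valid NFC
    string joined = prefix ~ suffix;

    assert(joined != "és");                // concatenation broke NFC
    assert(normalize!NFC(joined) == "és"); // must re-normalize afterwards
}
```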
Apr 18 2015
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 11:28 AM, H. S. Teoh via Digitalmars-d wrote:
 On Sat, Apr 18, 2015 at 10:50:18AM -0700, Walter Bright via Digitalmars-d
wrote:
 On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
 \u0301 is the "combining acute accent" [1].

 [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
I won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.
Well, *somebody* has to convert it to the single code point eacute, whether it's the human (if the keyboard has a single key for it), or the code interpreting keystrokes (the user may have typed it as e + combining acute), or the program that generated the combination, or the program that receives the data.
Data entry should be handled by the driver program, not a universal interchange format.
 When we don't know provenance of
 incoming data, we have to assume the worst and run normalization to be
 sure that we got it right.
I'm not arguing against the existence of the Unicode standard, I'm saying I can't figure any justification for standardizing different encodings of the same thing.
Apr 18 2015
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, Apr 18, 2015 at 11:40:08AM -0700, Walter Bright via Digitalmars-d wrote:
 On 4/18/2015 11:28 AM, H. S. Teoh via Digitalmars-d wrote:
[...]
When we don't know provenance of incoming data, we have to assume the
worst and run normalization to be sure that we got it right.
I'm not arguing against the existence of the Unicode standard, I'm saying I can't figure any justification for standardizing different encodings of the same thing.
Take it up with the Unicode consortium. :-) T -- Tech-savvy: euphemism for nerdy.
Apr 18 2015
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 1:22 PM, H. S. Teoh via Digitalmars-d wrote:
 Take it up with the Unicode consortium. :-)
I see nobody knows :-)
Apr 18 2015
prev sibling parent reply Shachar Shemesh <shachar weka.io> writes:
On 18/04/15 21:40, Walter Bright wrote:
 I'm not arguing against the existence of the Unicode standard, I'm
 saying I can't figure any justification for standardizing different
 encodings of the same thing.
A lot of areas in Unicode are due to pre-Unicode legacy.

I'm guessing here, but looking at the code points, é (U00e9 - Latin small letter E with acute) comes from Latin-1, which is designed to follow ISO-8859-1. U0301 (Combining acute accent) comes from "Combining diacritical marks".

The way I understand things, Unicode would really prefer to use U0065+U0301 rather than U00e9. Because of legacy systems, and because they would rather have the ISO-8859 code pages be 1:1 mappings, rather than 1:n mappings, they introduced code points they really would rather do without.

This also explains the "presentation forms" code pages (e.g. http://www.unicode.org/charts/PDF/UFB00.pdf). These were intended to be glyphs, rather than code points. Due to legacy reasons, it was not possible to simply discard them. They received code points, with a warning not to use these code points directly.

Also, notice that some letters can only be achieved using multiple code points. Hebrew diacritics, for example, do not, typically, have a composite form. My name fully spelled (which you rarely would do), שַׁחַר, cannot be represented with less than 6 code points, despite having only three letters.

The last paragraph isn't strictly true. You can use UFB2C + U05B7 for the first letter instead of U05E9 + U05C2 + U05B7. You would be using the presentation form which, as pointed out above, is only there for legacy.

Shachar
or shall I say שחר
Apr 18 2015
next sibling parent reply "Abdulhaq" <alynch4047 gmail.com> writes:
On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
 On 18/04/15 21:40, Walter Bright wrote:
 I'm not arguing against the existence of the Unicode standard, 
 I'm
 saying I can't figure any justification for standardizing 
 different
 encodings of the same thing.
A lot of areas in Unicode are due to pre-Unicode legacy. I'm guessing here, but looking at the code points, é (U00e9 - Latin small letter E with acute), which comes from Latin-1, which is designed to follow ISO-8859-1. U0301 (Combining acute accent) comes from "Combining diacritical marks". The way I understand things, Unicode would really prefer to use U0065+U0301 rather than U00e9. Because of legacy systems, and because they would rather have the ISO-8509 code pages be 1:1 mappings, rather than 1:n mappings, they introduced code points they really would rather do without. This also explains the "presentation forms" code pages (e.g. http://www.unicode.org/charts/PDF/UFB00.pdf). These were intended to be glyphs, rather than code points. Due to legacy reasons, it was not possible to simply discard them. They received code points, with a warning not to use these code points directly. Also, notice that some letters can only be achieved using multiple code points. Hebrew diacritics, for example, do not, typically, have a composite form. My name fully spelled (which you rarely would do), שַׁחַר, cannot be represented with less than 6 code points, despite having only three letters. The last paragraph isn't strictly true. You can use UFB2C + U05B7 for the first letter instead of U05E9 + U05C2 + U05B7. You would be using the presentation form which, as pointed above, is only there for legacy. Shachar or shall I say שחר
Yes Arabic is similar too
Apr 19 2015
parent Shachar Shemesh <shachar weka.io> writes:
On 19/04/15 10:51, Abdulhaq wrote:
 MiOn Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
 On 18/04/15 21:40, Walter Bright wrote:
 Also, notice that some letters can only be achieved using multiple
 code points. Hebrew diacritics, for example, do not, typically, have a
 composite form. My name fully spelled (which you rarely would do),
 שַׁחַר, cannot be represented with less than 6 code points, despite
 having only three letters.
Yes Arabic is similar too
Actually, the Arab presentation forms serve a slightly different purpose. In Hebrew, the presentation forms are mostly for Biblical text, where certain decorations are usually done.

For Arabic, the main reason for the presentation forms is shaping. Almost every Arabic letter can be written in up to four different forms (alone, start of word, middle of word and end of word). This means that Arabic has 28 letters, but over 100 different shapes for those letters. These days, when the font can do the shaping, the 28 letters suffice. During the DOS days, you needed to actually store those glyphs somewhere, which means that you needed to allocate a number to them.

In Hebrew, some letters also have a final form. Since the numbers are so significantly smaller, however (22 letters, 5 of which have final forms), Hebrew keyboards actually have all 27 letters on them. Going strictly by the "Unicode way", one would be expected to spell שלום with U05DE as the last letter, and let the shaping engine figure out that it should use the final form (or add a ZWNJ). Since all Hebrew code charts contained a final form Mem, however, you actually spell it with U05DD in the end, and it is considered a distinct letter.

Shachar
Apr 19 2015
prev sibling parent "Ola Fosheim Grøstad" writes:
On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
 U0065+U0301 rather than U00e9. Because of legacy systems, and 
 because they would rather have the ISO-8509 code pages be 1:1 
 mappings, rather than 1:n mappings, they introduced code points 
 they really would rather do without.
That's probably right. It is in fact a major feat to have the world adopt a new standard wholesale, but there are also difficult "semiotic" issues when you encode symbols and different languages view symbols differently (e.g. is "ä" an "a" or do you have two unique letters in the alphabet?)

Take "å": it can represent a unit (ångström) or a letter with a circle above it, or a unique letter in the alphabet. The letter "æ" can be seen as a combination of "ae" or as a unique letter. And we can expect languages, signs and practices to evolve over time too.

How can you normalize encodings without normalizing writing practice and natural language development? That would be beyond the mandate of a unicode standard organization...
Apr 19 2015
prev sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 18 April 2015 at 17:50:12 UTC, Walter Bright wrote:
 On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
 \u0301 is the "combining acute accent" [1].

 [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
I won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.
é might be obvious, but Unicode isn't just for writing European prose. Uses for combining characters include (but are *nowhere* near limited to) mathematical notation, where the combinatorial explosion of possible combinations that still belong to one grapheme cluster (character is a familiar but misleading word when talking about Unicode) would trivially become an insanely large (more atoms than in the universe levels of large) number of characters.

Unicode is a nightmarish system in some ways, but considering how incredibly difficult the problem it solves is, it's actually not too crazy.
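[Editor's note: the "vector arrow" from mathematical notation is exactly such a combining sequence. A minimal sketch:]

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string vec = "x\u20D7"; // 'x' + U+20D7 combining right arrow above: x⃗
    assert(vec.walkLength == 2);            // two code points...
    assert(vec.byGrapheme.walkLength == 1); // ...one grapheme cluster
    // No precomposed "x with arrow" code point exists; encoding every such
    // combination separately would blow up the code space.
}
```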
Apr 19 2015
parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:

 é might be obvious, but Unicode isn't just for writing European prose.
it is also to insert pictures of the animals into text.
 Unicode is a nightmarish system in some ways, but considering how
 incredibly difficult the problem it solves is, it's actually not too
 crazy.
it's not crazy, it's just broken in all possible ways: http://file.bestmx.net/ee/articles/uni_vs_code.pdf
Apr 19 2015
next sibling parent "weaselcat" <weaselcat gmail.com> writes:
On Sunday, 19 April 2015 at 19:58:28 UTC, ketmar wrote:
 On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:

 é might be obvious, but Unicode isn't just for writing 
 European prose.
it is also to insert pictures of the animals into text.
There's other uses for unicode? 🐧
Apr 19 2015
prev sibling next sibling parent reply "Nick B" <nick.barbalich gmail.com> writes:
On Sunday, 19 April 2015 at 19:58:28 UTC, ketmar wrote:
 On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:
 it's not crazy, it's just broken in all possible ways:
 http://file.bestmx.net/ee/articles/uni_vs_code.pdf
Ketmar

Great link, and a really good argument about the problems with Unicode.

Quote from 'Instead of Conclusion':

Yes. This is the root of Unicode misdesign. They mixed up two mutually exclusive approaches. They blended badly two different abstraction levels: the textual level which corresponds to a language idea and the graphical level which does not care of a language, yet cares of writing direction, subscripts, superscripts and so on. In other words we need two different Unicodes built on these two opposite principles, instead of the one built on an insane mix of controversial axioms.

end quote.

Perhaps Unicode needs to be rebuilt from the ground up?
Apr 19 2015
parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Mon, 20 Apr 2015 01:27:36 +0000, Nick B wrote:

 Perhaps Unicode needs to be rebuilt from the ground up?
alas, it's too late. now we'll live with that "unicode" crap for many years.
Apr 19 2015
parent reply "Nick B" <nick.barbalich gmail.com> writes:
On Monday, 20 April 2015 at 03:39:54 UTC, ketmar wrote:
 On Mon, 20 Apr 2015 01:27:36 +0000, Nick B wrote:

 Perhaps Unicode needs to be rebuilt from the ground up?
alas, it's too late. now we'll live with that "unicode" crap for many years.
Perhaps. or perhaps not. This community got together under Walter and Andrei leadership to building a new programming language, on the pillars of the old. Perhaps a new Unicode standard, could start that way as well ?
Apr 19 2015
parent Jacob Carlborg <doob me.com> writes:
On 2015-04-20 08:04, Nick B wrote:

 Perhaps a new Unicode standard, could start that way as well ?
https://xkcd.com/927/ -- /Jacob Carlborg
Apr 20 2015
prev sibling parent Shachar Shemesh <shachar weka.io> writes:
On 19/04/15 22:58, ketmar wrote:
 On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:

 it's not crazy, it's just broken in all possible ways:
 http://file.bestmx.net/ee/articles/uni_vs_code.pdf
This is not a very accurate depiction of Unicode. For example:

    And, moreover, BOM is meaningless without mentioning of encoding. So we have to specify encoding anyway.

No. BOM is what lets you auto-detect the encoding. If you know you will be using UTF-8, 16 or 32 with an unknown encoding, BOM will tell you which it is. That is its entire purpose, in fact. There, pretty much, goes point #1.

And then:

    Unicode contains at least writing direction control symbols (LTR is U+200E and RTL is U+200F) which role is IDENTICAL to the role of codepage-switching symbols with the associated disadvantages.

That's just ignorance of how the UBA (TR#9) works. LRM and RLM are mere invisible characters with defined directionality. Cutting them away from a substring would not invalidate your text more than cutting away actual text would under the same conditions. In any case, unlike page switching symbols, it would only affect your display, not your understanding of the text. So point #2 is out.

He has some valid argument under point #3, but also lots of !(#&$ nonsense. He is right, I think, that denoting units with separate code points makes no sense, but the rest of his arguments seem completely off. For example, asking Latin and Cyrillic to share the same region merely because some letters look alike makes no sense, implementation wise.

Points #4, #5, #6 and #7 are the same point. The main objection I have there is his assumption that the situation is, somehow, worse than it was. Yes, if you knew your encoding was Windows-1255, you could assume the text is Hebrew. Or Yiddish. And this, I think, is one of the encodings with the least number of languages riding on it. Windows-1256 has Arabic, Persian, Urdu and others. Windows-1252 has the entire western European script. As pointed out elsewhere in this thread, Spanish and French treat case folding of accented letters differently.

Also, we see that the solution he thinks would work better actually doesn't.

People living in France don't switch to a QWERTY keyboard when they want to type English. They type English with their AZERTY keyboard. There simply is no automatic way to tell what language something is typed in without a human telling you (or applying content based heuristics).

Microsoft Word stores, for each letter, the keyboard language it was typed with. This causes great problems when copying to other editors, performing searches, or simply trying to get bidirectional text to appear correctly. The problem is so bad that phone numbers where the prefix appears after the actual number are not considered bad form or unusual, even in official PR material or when sending resumes.

In fact, the only time you can count on someone to switch keyboards is when they need to switch to a language with a different alphabet. No Russian speaker will type English using the Russian layout, even if what she has to say happens to use letters with the same glyphs. You simply do not plan that much ahead.

The point I'm driving at is that just because someone posted some rant on the Internet doesn't mean it's correct. When someone says something is broken, always ask them what they suggest instead.

Shachar
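[Editor's note: the BOM-based auto-detection described above can be sketched as follows. This is illustrative only; detectBom is a made-up name, and the 4-byte UTF-32 BOMs must be checked before the UTF-16 ones, since UTF-32LE's BOM begins with UTF-16LE's.]

```d
// Minimal BOM sniffing sketch.
string detectBom(const(ubyte)[] data)
{
    static immutable ubyte[] utf32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] utf32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] utf8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] utf16le = [0xFF, 0xFE];
    static immutable ubyte[] utf16be = [0xFE, 0xFF];

    if (data.length >= 4 && data[0 .. 4] == utf32le) return "UTF-32LE";
    if (data.length >= 4 && data[0 .. 4] == utf32be) return "UTF-32BE";
    if (data.length >= 3 && data[0 .. 3] == utf8)    return "UTF-8";
    if (data.length >= 2 && data[0 .. 2] == utf16le) return "UTF-16LE";
    if (data.length >= 2 && data[0 .. 2] == utf16be) return "UTF-16BE";
    return "unknown"; // no BOM: the encoding must be specified elsewhere
}

void main()
{
    static immutable ubyte[] sample = [0xEF, 0xBB, 0xBF, 'h', 'i'];
    assert(detectBom(sample) == "UTF-8");
}
```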
Apr 19 2015
prev sibling parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
On Saturday, 18 April 2015 at 08:26:12 UTC, Panke wrote:
 On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:
 On 4/18/2015 12:58 AM, John Colvin wrote:
 On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
 On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
 So either you have to throw out all pretenses of 
 Unicode-correctness and
 just stick with ASCII-style per-character line-wrapping, or 
 you have to
 live with byGrapheme with all the complexity that it 
 entails. The former
 is quite easy to write -- I could throw it together in a 
 couple o' hours
 max, but the latter is a pretty big project (cf. Unicode 
 line-breaking
 algorithm, which is one of the TR's).
It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Code points aren't equivalent to characters. They're not the same thing in most European languages,
I know a bit of German, for what characters is that not true?
Umlauts, if combined characters are used. Also words that still have their accents left after import from foreign languages, e.g. Café.

Getting all unicode correct seems a daunting task with a severe performance impact, esp. if we need to assume that a string might have any normalization form or none at all.

See also: http://unicode.org/reports/tr15/#Norm_Forms
Another issue is that lower case and upper case letters might have different size requirements or look different depending on where in the word they are located.

For example, German ß and SS, Greek σ and ς. I know Turkish also has similar cases.

--
Paulo
Apr 18 2015
parent "Tobias Pankrath" <tobias pankrath.net> writes:
 Also another issue is that lower case letters and upper case 
 might have different size requirements or look different 
 depending on where on the word they are located.

 For example, German ß and SS, Greek σ and ς. I know Turkish 
 also has similar cases.

 --
 Paulo
While true, it does not affect wrap (the algorithm) as far as I can see.
Apr 18 2015
prev sibling parent Shachar Shemesh <shachar weka.io> writes:
On 17/04/15 19:59, H. S. Teoh via Digitalmars-d wrote:
 There's also the question of what to do with bidi markings: how do you
 handle counting the columns in that case?
Which BiDi marking are you referring to? LRM/RLM and friends? If so, don't worry: the interface, as described, is incapable of properly handling BiDi anyways.

The proper way to handle BiDi line wrapping is this. First you assign a BiDi level to each character (at which point the markings are, effectively, removed from the input, so there goes your problem). Then you calculate the glyphs' widths until the line limit is reached, and then you reorder each line according to the BiDi levels you calculated earlier.

As can be easily seen, this requires transitioning BiDi information that is per-paragraph across the line break logic, pretty much mandating multiple passes on the input. Since the requested interface does not allow that, proper BiDi line breaking is impossible with that interface.

I'll mention that not everyone takes that as a serious problem. Windows' text control, for example, calculates line breaks on the text, and then runs the BiDi algorithm on each line individually. Few people notice this. Then again, people have already grown used to BiDi text being scrambled.

Shachar
Apr 17 2015
prev sibling next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
 -- 
 All problems are easy in retrospect.
Argh, my Perl script doth mock me! T -- Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder
Apr 17 2015
prev sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
 So either you have to throw out all pretenses of Unicode-correctness
 and just stick with ASCII-style per-character line-wrapping, or you
 have to live with byGrapheme with all the complexity that it entails.
 The former is quite easy to write -- I could throw it together in a
 couple o' hours max, but the latter is a pretty big project (cf.
 Unicode line-breaking algorithm, which is one of the TR's).
[...]

Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:

import std.range.primitives;

/**
 * Range version of $(D std.string.wrap).
 *
 * Bugs:
 * This function does not conform to the Unicode line-breaking algorithm. It
 * does not take into account zero-width characters, combining diacritics,
 * double-width characters, non-breaking spaces, and bidi markings. Strings
 * containing these characters therefore may not be wrapped correctly.
 */
auto wrapped(R)(R range, in size_t columns = 80, R firstindent = null,
                R indent = null, in size_t tabsize = 8)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    import std.algorithm.iteration : map, joiner;
    import std.range : chain;
    import std.uni;

    alias CharType = ElementType!R;

    // Returns: Wrapped lines.
    struct Result
    {
        private R range, indent;
        private size_t maxCols, tabSize;
        private size_t currentCol = 0;
        private R curIndent;
        bool empty = true;
        bool atBreak = false;

        this(R _range, R _firstindent, R _indent, size_t columns,
             size_t tabsize)
        {
            this.range = _range;
            this.curIndent = _firstindent.save;
            this.indent = _indent;
            this.maxCols = columns;
            this.tabSize = tabsize;
            empty = _range.empty;
        }

        @property CharType front()
        {
            if (atBreak)
                return '\n'; // should implicit convert to wider characters
            else if (!curIndent.empty)
                return curIndent.front;
            else
                return range.front;
        }

        void popFront()
        {
            if (atBreak)
            {
                // We're at a linebreak.
                atBreak = false;
                currentCol = 0;

                // Start new line with indent
                curIndent = indent.save;
                return;
            }
            else if (!curIndent.empty)
            {
                // We're iterating over an initial indent.
                curIndent.popFront();
                currentCol++;
                return;
            }

            // We're iterating over the main range.
            range.popFront();
            if (range.empty)
            {
                empty = true;
                return;
            }

            if (range.front == '\t')
                currentCol += tabSize;
            else if (isWhite(range.front))
            {
                // Scan for next word boundary to decide whether or not to
                // break here.
                R tmp = range.save;
                assert(!tmp.empty);
                size_t col = currentCol;

                // Find start of next word
                while (!tmp.empty && isWhite(tmp.front))
                {
                    col++;
                    tmp.popFront();
                }

                // Remember start of next word so that if we need to break, we
                // won't introduce extraneous spaces to the start of the new
                // line.
                R nextWord = tmp.save;
                while (!tmp.empty && !isWhite(tmp.front))
                {
                    col++;
                    tmp.popFront();
                }
                assert(tmp.empty || isWhite(tmp.front));

                if (col > maxCols)
                {
                    // Word wrap needed. Move current range position to
                    // start of next word.
                    atBreak = true;
                    range = nextWord;
                    return;
                }
            }
            currentCol++;
        }

        @property Result save()
        {
            Result copy = this;
            copy.range = this.range.save;
            //copy.indent = this.indent.save; // probably not needed?
            copy.curIndent = this.curIndent.save;
            return copy;
        }
    }
    static assert(isForwardRange!Result);

    return Result(range, firstindent, indent, columns, tabsize);
}

unittest
{
    import std.algorithm.comparison : equal;

    auto s = ("This is a very long, artificially long, and gratuitously long "~
              "single-line sentence to serve as a test case for byParagraph.")
             .wrapped(30, ">>>>", ">>");
    assert(s.equal(
        ">>>>This is a very long,\n"~
        ">>artificially long, and\n"~
        ">>gratuitously long single-line\n"~
        ">>sentence to serve as a test\n"~
        ">>case for byParagraph."
    ));
}

I didn't bother with avoiding autodecoding -- that should be relatively easy to add, but I think it's stupid that we have to continually write workarounds in our code to get around auto-decoding. If it's so important that we don't autodecode, can we pretty please make the decision already and kill it off for good?!

T

-- 
To err is human; to forgive is not our policy. -- Samuel Adler
Apr 17 2015
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
 Well, talk is cheap, so here's a working implementation of the
 non-Unicode-correct line wrapper that uses ranges and does not allocate:
awesome! Please make a pull request for this so you get proper credit!
Apr 17 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d wrote:
 On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
Well, talk is cheap, so here's a working implementation of the
non-Unicode-correct line wrapper that uses ranges and does not
allocate:
awesome! Please make a pull request for this so you get proper credit!
Doesn't that mean I have to add the autodecoding workarounds first?

T

-- 
Life is too short to run proprietary software. -- Bdale Garbee
Apr 17 2015
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/17/2015 11:46 AM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d
wrote:
 On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
 Well, talk is cheap, so here's a working implementation of the
 non-Unicode-correct line wrapper that uses ranges and does not
 allocate:
awesome! Please make a pull request for this so you get proper credit!
Doesn't that mean I have to add the autodecoding workarounds first?
Before it gets pulled, yes, meaning that the element type of front() should match the element encoding type of Range. There's also an issue with firstindent and indent being the same range type as 'range', which is not practical, as Range is likely a voldemort type. I suggest making them simply of type 'string'; I don't see any point in making them ranges. A unit test with an input range is needed, and one with some multibyte unicode encodings.
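[To make the suggestion concrete, a hypothetical sketch of the revised
signature (not an actual pull request): the indents become plain strings, so
callers holding a voldemort range type don't have to construct matching
indent ranges:

```d
import std.range.primitives : isForwardRange, ElementType;

// Hypothetical revision: indents as strings, not as the input range type R.
auto wrapped(R)(R range, in size_t columns = 80,
                string firstindent = null, string indent = null,
                in size_t tabsize = 8)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    // ... same Result implementation as before, but with curIndent and
    // indent held as string rather than R ...
    assert(0, "sketch only");
}
```

Inside Result, front() would then need to convert the indent's chars to the
output element type, which ties into the front()/encoding-type issue above.]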
Apr 17 2015
prev sibling parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote:

 Well, talk is cheap, so here's a working implementation of the
 non-Unicode-correct line wrapper that uses ranges and does not allocate:
there is some... inconsistency: `std.string.wrap` adds final "\n" to string. ;-) but i always hated it for that.
Apr 17 2015
parent reply "Panke" <tobias pankrath.net> writes:
On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote:
 On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via 
 Digitalmars-d wrote:

 Well, talk is cheap, so here's a working implementation of the
 non-Unicode-correct line wrapper that uses ranges and does not 
 allocate:
there is some... inconsistency: `std.string.wrap` adds final "\n" to string. ;-) but i always hated it for that.
A range of lines instead of inserted \n would be a good API as well.
Apr 17 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Apr 17, 2015 at 08:44:51PM +0000, Panke via Digitalmars-d wrote:
 On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote:
On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote:

Well, talk is cheap, so here's a working implementation of the
non-Unicode-correct line wrapper that uses ranges and does not
allocate:
there is some... inconsistency: `std.string.wrap` adds final "\n" to string. ;-) but i always hated it for that.
A range of lines instead of inserted \n would be a good API as well.
Indeed, that would be even more useful: then you could just do .joiner("\n") to get the original functionality. However, I think Walter's goal here is to match the original wrap() functionality. Perhaps the prospective wrapped() function could be implemented in terms of a byWrappedLines() function that does return a range of wrapped lines.

T

-- 
The volume of a pizza of thickness a and radius z can be described by the following formula: pi zz a. -- Wouter Verhelst
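[A sketch of that layering, assuming a hypothetical byWrappedLines() that
lazily yields one wrapped line at a time; wrapped() then reduces to a thin
composition:

```d
import std.algorithm.iteration : joiner;

// Hypothetical: byWrappedLines(range, columns) returns a lazy forward range
// of wrapped lines; joiner("\n") splices them back into a flat character
// range, matching the original wrap()-style output (minus the trailing \n).
auto wrapped(R)(R range, in size_t columns = 80)
{
    return range.byWrappedLines(columns).joiner("\n");
}
```

Callers wanting lines keep byWrappedLines(); callers wanting the classic flat
string behaviour use wrapped(), and neither allocates.]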
Apr 18 2015
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 1:32 PM, H. S. Teoh via Digitalmars-d wrote:
 However, I think Walter's goal here is to match the original wrap()
 functionality.
Yes, although the overarching goal is: Minimize Need For Using GC In Phobos and the method here is to use ranges rather than having to allocate string temporaries.
Apr 18 2015