
digitalmars.D - Today's programming challenge - How's your Range-Fu ?

reply Walter Bright <newshound2 digitalmars.com> writes:
Challenge level - Moderately easy

Consider the function std.string.wrap:

   http://dlang.org/phobos/std_string.html#.wrap

It takes a string as input, and returns a GC allocated string that is 
word-wrapped. It needs to be enhanced to:

1. Accept a ForwardRange as input.
2. Return a lazy ForwardRange that delivers the characters of the wrapped
result 
one by one on demand.
3. Not allocate any memory.
4. The element encoding type of the returned range must be the same as the 
element encoding type of the input.
Apr 17 2015
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Apr 17, 2015 at 02:09:07AM -0700, Walter Bright via Digitalmars-d wrote:
 Challenge level - Moderately easy

 Consider the function std.string.wrap:
 
   http://dlang.org/phobos/std_string.html#.wrap
 
 It takes a string as input, and returns a GC allocated string that is
 word-wrapped. It needs to be enhanced to:
 
 1. Accept a ForwardRange as input.
 2. Return a lazy ForwardRange that delivers the characters of the
 wrapped result one by one on demand.
 3. Not allocate any memory.
 4. The element encoding type of the returned range must be the same as
 the element encoding type of the input.
This is harder than it looks at first sight, actually. Mostly thanks to the complexity of Unicode... you need to identify zero-width, normal-width, and double-width characters, combining diacritics, and various kinds of spaces (e.g. you cannot break on a non-breaking space) and treat them accordingly. Which requires decoding. (Well, in theory std.uni could be enhanced to work directly with encoded data, but right now it doesn't. In any case this is outside the scope of this challenge, I think.)

Unfortunately, the only reliable way I currently know of to deal with the spacing of Unicode characters correctly is to segment the input with byGrapheme, which currently is GC-dependent. So this fails (3). There's also the question of what to do with bidi markings: how do you count columns in that case?

Of course, if you forgo Unicode correctness, then you *could* just word-wrap on a per-character basis (i.e., every character counts as 1 column), but this also makes the resulting code useless as far as dealing with general Unicode data is concerned -- it'd only work for ASCII and the various character ranges inherited from the old 8-bit European encodings. Not to mention that line-breaking in Chinese cannot work as prescribed anyway, because the rules are different (you can break anywhere at a character boundary except punctuation -- there is no such concept as a space character in Chinese writing). The same applies to Korean/Japanese.

So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max -- but the latter is a pretty big project (cf. the Unicode line-breaking algorithm, which is one of the TRs).

T

-- 
All problems are easy in retrospect.
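To make the non-breaking-space point concrete (an untested sketch, not part of the challenge): std.uni classifies U+00A0 as whitespace, so a naive wrapper that breaks wherever isWhite is true would break exactly where breaking is forbidden.

```d
import std.uni : isWhite;

void main()
{
    // U+00A0 (no-break space) is in Unicode category Zs, so
    // std.uni.isWhite reports it as whitespace...
    assert(isWhite('\u00A0'));

    // ...which means a wrapper that simply breaks on isWhite would
    // split "100\u00A0km" between the number and the unit, exactly
    // the break the author used NBSP to prevent.
    assert(isWhite(' ')); // ordinary space, a legal break point
}
```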
Apr 17 2015
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
 So either you have to throw out all pretenses of Unicode-correctness and
 just stick with ASCII-style per-character line-wrapping, or you have to
 live with byGrapheme with all the complexity that it entails. The former
 is quite easy to write -- I could throw it together in a couple o' hours
 max, but the latter is a pretty big project (cf. Unicode line-breaking
 algorithm, which is one of the TR's).
It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Apr 17 2015
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
 On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
 So either you have to throw out all pretenses of 
 Unicode-correctness and
 just stick with ASCII-style per-character line-wrapping, or 
 you have to
 live with byGrapheme with all the complexity that it entails. 
 The former
 is quite easy to write -- I could throw it together in a 
 couple o' hours
 max, but the latter is a pretty big project (cf. Unicode 
 line-breaking
 algorithm, which is one of the TR's).
It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Code points aren't equivalent to characters. They're not the same thing in most European languages, never mind the rest of the world. If we have a line-wrapping algorithm in Phobos that works by code points, it needs a large "THIS IS ONLY FOR SIMPLE ENGLISH TEXT" warning. Code points are a useful chunk size for some tasks and completely insufficient for others.
Apr 18 2015
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2015-04-18 09:58, John Colvin wrote:

 Code points aren't equivalent to characters. They're not the same thing
 in most European languages, never mind the rest of the world. If we have
 a line-wrapping algorithm in phobos that works by code points, it needs
 a large "THIS IS ONLY FOR SIMPLE ENGLISH TEXT" warning.
For that we have std.ascii.

-- 
/Jacob Carlborg
Apr 18 2015
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 12:58 AM, John Colvin wrote:
 On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
 On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
 So either you have to throw out all pretenses of Unicode-correctness and
 just stick with ASCII-style per-character line-wrapping, or you have to
 live with byGrapheme with all the complexity that it entails. The former
 is quite easy to write -- I could throw it together in a couple o' hours
 max, but the latter is a pretty big project (cf. Unicode line-breaking
 algorithm, which is one of the TR's).
It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Code points aren't equivalent to characters. They're not the same thing in most European languages,
I know a bit of German, for what characters is that not true?
 never mind the rest of the world. If we have a line-wrapping
 algorithm in phobos that works by code points, it needs a large "THIS IS ONLY
 FOR SIMPLE ENGLISH TEXT" warning.

 Code points are a useful chunk size for some tasks and completely insufficient
 for others.
The first order of business is making wrap() work with ranges, and otherwise work the same as it always has (it's one of the oldest Phobos functions).

There are different standard levels of Unicode support. The lowest level is working correctly with code points, which is what wrap() does. Going to a higher level of support comes after range support.

I know little about combining characters. You obviously know much more; do you want to take charge of this function?
Apr 18 2015
parent reply "Panke" <tobias pankrath.net> writes:
On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:
 On 4/18/2015 12:58 AM, John Colvin wrote:
 On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
 On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
 So either you have to throw out all pretenses of 
 Unicode-correctness and
 just stick with ASCII-style per-character line-wrapping, or 
 you have to
 live with byGrapheme with all the complexity that it 
 entails. The former
 is quite easy to write -- I could throw it together in a 
 couple o' hours
 max, but the latter is a pretty big project (cf. Unicode 
 line-breaking
 algorithm, which is one of the TR's).
It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Code points aren't equivalent to characters. They're not the same thing in most European languages,
I know a bit of German, for what characters is that not true?
Umlauts, if combining characters are used. Also words that still have their accents left after import from foreign languages, e.g. Café.

Getting all of Unicode correct seems a daunting task with a severe performance impact, especially if we need to assume that a string might be in any normalization form, or in none at all.

See also: http://unicode.org/reports/tr15/#Norm_Forms
Apr 18 2015
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 1:26 AM, Panke wrote:
 On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:
 On 4/18/2015 12:58 AM, John Colvin wrote:
 On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
 On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
 So either you have to throw out all pretenses of Unicode-correctness and
 just stick with ASCII-style per-character line-wrapping, or you have to
 live with byGrapheme with all the complexity that it entails. The former
 is quite easy to write -- I could throw it together in a couple o' hours
 max, but the latter is a pretty big project (cf. Unicode line-breaking
 algorithm, which is one of the TR's).
It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Code points aren't equivalent to characters. They're not the same thing in most European languages,
I know a bit of German, for what characters is that not true?
Umlauts, if combining characters are used. Also words that still have their accents left after import from foreign languages. E.g. Café
That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.
Apr 18 2015
next sibling parent "Panke" <tobias pankrath.net> writes:
 That doesn't make sense to me, because the umlauts and the 
 accented e all have Unicode code point assignments.
Yes, but you may have perfectly fine Unicode text where the decomposed form (base letter plus combining mark) is used. Actually, there is a normalization form for Unicode (NFD) that requires the decomposed form. To be fully correct, Phobos needs to handle that as well.
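For what it's worth, std.uni.normalize already converts between the two forms; a small sketch (NFC requires the composed form, NFD the decomposed one):

```d
import std.uni : normalize, NFC, NFD;

void main()
{
    // NFC composes: 'e' followed by U+0301 becomes the single
    // precomposed code point U+00E9 (é).
    assert(normalize!NFC("e\u0301") == "\u00E9");

    // NFD decomposes: the precomposed é is split back into the
    // base letter plus the combining acute accent.
    assert(normalize!NFD("\u00E9") == "e\u0301");
}
```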
Apr 18 2015
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2015-04-18 12:27, Walter Bright wrote:

 That doesn't make sense to me, because the umlauts and the accented e
 all have Unicode code point assignments.
This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";

    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);

    writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm

-- 
/Jacob Carlborg
Apr 18 2015
next sibling parent reply "Chris" <wendlec tcd.ie> writes:
On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
 On 2015-04-18 12:27, Walter Bright wrote:

 That doesn't make sense to me, because the umlauts and the 
 accented e
 all have Unicode code point assignments.
This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";

    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);

    writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Yep, this was the cause of some bugs I had in my program. The thing is you never know whether a text is composed or decomposed, so you have to be prepared for "é" having length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once.
Apr 18 2015
next sibling parent reply "Gary Willoughby" <dev nomad.so> writes:
On Saturday, 18 April 2015 at 11:52:52 UTC, Chris wrote:
 On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg 
 wrote:
 On 2015-04-18 12:27, Walter Bright wrote:

 That doesn't make sense to me, because the umlauts and the 
 accented e
 all have Unicode code point assignments.
This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";

    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);

    writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Yep, this was the cause of some bugs I had in my program. The thing is you never know, if a text is composed or decomposed, so you have to be prepared that "é" has length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once.
byGrapheme to the rescue: http://dlang.org/phobos/std_uni.html#byGrapheme Or is this unsuitable here?
Apr 18 2015
parent reply Jacob Carlborg <doob me.com> writes:
On 2015-04-18 14:25, Gary Willoughby wrote:

 byGrapheme to the rescue:

 http://dlang.org/phobos/std_uni.html#byGrapheme

 Or is this unsuitable here?
How is byGrapheme supposed to be used? I tried this but it doesn't do what I expected:

foreach (e ; "e\u0301".byGrapheme)
    writeln(e);

-- 
/Jacob Carlborg
Apr 18 2015
parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Saturday, 18 April 2015 at 12:48:53 UTC, Jacob Carlborg wrote:
 On 2015-04-18 14:25, Gary Willoughby wrote:

 byGrapheme to the rescue:

 http://dlang.org/phobos/std_uni.html#byGrapheme

 Or is this unsuitable here?
How is byGrapheme supposed to be used? I tried this but it doesn't do what I expected:

foreach (e ; "e\u0301".byGrapheme)
    writeln(e);
void main()
{
    import std.stdio;
    import std.uni;

    foreach (e ; "e\u0301".byGrapheme)
        writeln(e[]);
}
Apr 18 2015
prev sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, Apr 18, 2015 at 11:52:50AM +0000, Chris via Digitalmars-d wrote:
 On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
On 2015-04-18 12:27, Walter Bright wrote:

That doesn't make sense to me, because the umlauts and the accented
e all have Unicode code point assignments.
This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";

    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);

    writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Yep, this was the cause of some bugs I had in my program. The thing is you never know, if a text is composed or decomposed, so you have to be prepared that "é" has length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once.
Wait, I thought the recommended approach is to normalize first, then do string processing later? Normalizing first will eliminate inconsistencies of this sort, and allow string-processing code to use a uniform approach to handling the string. I don't think it's a good idea to manually deal with composed/decomposed issues within every individual string function.

Of course, even after normalization, you still have the issue of zero-width characters and combining diacritics, because not every language has precomposed characters handy. Using byGrapheme, within the current state of Phobos, is still the best bet for correctly counting the number of printed columns as opposed to the number of "characters" (which, in the Unicode definition, does not always match the layman's notion of "character").

Unfortunately, byGrapheme may allocate, which fails Walter's requirements. Well, to be fair, byGrapheme only *occasionally* allocates -- only for input with unusually long sequences of combining diacritics; for normal use cases you'll pretty much never have any allocations. But the language can't express the idea of "occasionally allocates"; there is only "allocates" or "@nogc". Which makes it unusable in @nogc code.

One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme.

T

-- 
Just because you survived after you did it, doesn't mean it wasn't stupid!
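For reference, this is what graphemeStride already computes today, and what it should keep computing without allocating; a quick sketch (the sizes assume UTF-8 input):

```d
import std.uni : graphemeStride;

void main()
{
    string s = "e\u0301x";

    // The first grapheme cluster is 'e' plus the combining acute:
    // 1 code unit for 'e' + 2 code units for U+0301 = 3 in UTF-8.
    assert(graphemeStride(s, 0) == 3);

    // The next grapheme, plain ASCII 'x', is a single code unit.
    assert(graphemeStride(s, 3) == 1);
}
```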
Apr 18 2015
next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
 Wait, I thought the recommended approach is to normalize first, 
 then do
 string processing later? Normalizing first will eliminate
 inconsistencies of this sort, and allow string-processing code 
 to use a
 uniform approach to handling the string. I don't think it's a 
 good idea
 to manually deal with composed/decomposed issues within every 
 individual
 string function.
1. Problem: Normalization is not closed under almost all operations, e.g. concatenating two normalized strings does not guarantee that the result is in normalized form.

2. Problem: Some Unicode algorithms, e.g. string comparison, require a normalization step. It doesn't matter which form you use, but you have to pick one.

Now we could say that all strings passed to Phobos have to be normalized as (say) NFC, and that Phobos functions thus skip the normalization.
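Problem 1 is easy to demonstrate (a small sketch): both operands below are individually in NFC, but their concatenation is not.

```d
import std.uni : normalize, NFC;

void main()
{
    string a = "e";      // already in NFC
    string b = "\u0301"; // a lone combining acute is also valid NFC

    string c = a ~ b;    // concatenation of two NFC strings...

    // ...is not in NFC: re-normalizing composes it into é (U+00E9).
    assert(normalize!NFC(c) != c);
    assert(normalize!NFC(c) == "\u00E9");
}
```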
Apr 18 2015
prev sibling next sibling parent "Chris" <wendlec tcd.ie> writes:
On Saturday, 18 April 2015 at 13:30:09 UTC, H. S. Teoh wrote:
 On Sat, Apr 18, 2015 at 11:52:50AM +0000, Chris via 
 Digitalmars-d wrote:
 On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg 
 wrote:
On 2015-04-18 12:27, Walter Bright wrote:

That doesn't make sense to me, because the umlauts and the 
accented
e all have Unicode code point assignments.
This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";

    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);

    writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Yep, this was the cause of some bugs I had in my program. The thing is you never know, if a text is composed or decomposed, so you have to be prepared that "é" has length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once.
Wait, I thought the recommended approach is to normalize first, then do string processing later? Normalizing first will eliminate inconsistencies of this sort, and allow string-processing code to use a uniform approach to handling the string. I don't think it's a good idea to manually deal with composed/decomposed issues within every individual string function.

Of course, even after normalization, you still have the issue of zero-width characters and combining diacritics, because not every language has precomposed characters handy. Using byGrapheme, within the current state of Phobos, is still the best bet for correctly counting the number of printed columns as opposed to the number of "characters" (which, in the Unicode definition, does not always match the layman's notion of "character").

Unfortunately, byGrapheme may allocate, which fails Walter's requirements. Well, to be fair, byGrapheme only *occasionally* allocates -- only for input with unusually long sequences of combining diacritics; for normal use cases you'll pretty much never have any allocations. But the language can't express the idea of "occasionally allocates"; there is only "allocates" or "@nogc". Which makes it unusable in @nogc code.

One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme.

T
This is why on OS X I always normalized strings to composed. However, there are always issues with Unicode, because, as you said, the layman's notion of what a character is is not the same as Unicode's. I wrote a utility function that uses byGrapheme and byCodePoint. It's a bit of an overhead, but I always get the correct length and character access (e.g. if txt.startsWith("é")).
Apr 18 2015
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
 One possible solution would be to modify std.uni.graphemeStride to not
 allocate, since it shouldn't need to do so just to compute the length of
 the next grapheme.
That should be done. There should be a fixed maximum codepoint count to graphemeStride.
Apr 18 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:
 On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
One possible solution would be to modify std.uni.graphemeStride to
not allocate, since it shouldn't need to do so just to compute the
length of the next grapheme.
That should be done. There should be a fixed maximum codepoint count to graphemeStride.
Why? Scanning a string for a grapheme of arbitrary length does not need allocation since you're just reading data. Unless there is some required intermediate representation that I'm not aware of? T -- "How are you doing?" "Doing what?"
Apr 18 2015
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 11:29 AM, H. S. Teoh via Digitalmars-d wrote:
 On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d
wrote:
 On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
 One possible solution would be to modify std.uni.graphemeStride to
 not allocate, since it shouldn't need to do so just to compute the
 length of the next grapheme.
That should be done. There should be a fixed maximum codepoint count to graphemeStride.
Why? Scanning a string for a grapheme of arbitrary length does not need allocation since you're just reading data. Unless there is some required intermediate representation that I'm not aware of?
If there's no need for allocation at all, why does it allocate? This should be fixed.
Apr 18 2015
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, Apr 18, 2015 at 11:37:27AM -0700, Walter Bright via Digitalmars-d wrote:
 On 4/18/2015 11:29 AM, H. S. Teoh via Digitalmars-d wrote:
On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:
On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
One possible solution would be to modify std.uni.graphemeStride to
not allocate, since it shouldn't need to do so just to compute the
length of the next grapheme.
That should be done. There should be a fixed maximum codepoint count to graphemeStride.
Why? Scanning a string for a grapheme of arbitrary length does not need allocation since you're just reading data. Unless there is some required intermediate representation that I'm not aware of?
If there's no need for allocation at all, why does it allocate? This should be fixed.
AFAICT, the only reason it allocates is because it shares the same underlying implementation as byGrapheme. There's probably a way to fix this, I just don't have the time right now to figure out the code.

T

-- 
Little children, little troubles.
Apr 18 2015
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/18/15 4:35 AM, Jacob Carlborg wrote:
 On 2015-04-18 12:27, Walter Bright wrote:

 That doesn't make sense to me, because the umlauts and the accented e
 all have Unicode code point assignments.
This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";

    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);

    writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline. Then the rest of Phobos doesn't need to mind these combining characters. -- Andrei
Apr 18 2015
next sibling parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
 Isn't this solved commonly with a normalization pass? We should 
 have a normalizeUTF() that can be inserted in a pipeline.
Yes.
 Then the rest of Phobos doesn't need to mind these combining 
 characters. -- Andrei
I don't think so. The thing is, even after normalization we still have to deal with combining characters, because in every normalization form some combining characters remain (not every combination has a precomposed code point).
Apr 18 2015
parent reply "Chris" <wendlec tcd.ie> writes:
On Saturday, 18 April 2015 at 17:04:54 UTC, Tobias Pankrath wrote:
 Isn't this solved commonly with a normalization pass? We 
 should have a normalizeUTF() that can be inserted in a 
 pipeline.
Yes.
 Then the rest of Phobos doesn't need to mind these combining 
 characters. -- Andrei
I don't think so. The thing is, even after normalization we have to deal with combining characters because in all normalization forms there will be combining characters left after normalization.
Yes, again and again I encountered length-related bugs with Unicode characters. Normalization is not 100% reliable. I don't know anyone who works with non-English characters who doesn't run into Unicode-related issues sometimes.
Apr 20 2015
parent reply "Panke" <tobias pankrath.net> writes:
 Yes, again and again I encountered length related bugs with 
 Unicode characters. Normalization is not 100% reliable.
I think it is 100% reliable, it just doesn't make the problems go away. It just guarantees that two strings normalized to the same form are binary equal iff they are equal in the Unicode sense. It says nothing about columns, string length or grapheme count.
Apr 20 2015
parent reply "Chris" <wendlec tcd.ie> writes:
On Monday, 20 April 2015 at 11:04:58 UTC, Panke wrote:
 Yes, again and again I encountered length related bugs with 
 Unicode characters. Normalization is not 100% reliable.
I think it is 100% reliable, it just doesn't make the problems go away. It just guarantees that two strings normalized to the same form are binary equal iff they are equal in the unicode sense. Nothing about columns or string length or grapheme count.
The problem is not normalization as such, the problem is with string (as opposed to dstring):

import std.uni : normalize, NFC;

void main()
{
    dstring de_one = "é";
    dstring de_two = "e\u0301";

    assert(de_one.length == 1);
    assert(de_two.length == 2);

    string e_one = "é";
    string e_two = "e\u0301";
    string random = "ab";

    assert(e_one.length == 2);
    assert(e_two.length == 3);
    assert(e_one.length == random.length);

    assert(normalize!NFC(e_one).length == 2);
    assert(normalize!NFC(e_two).length == 2);
}

This can lead to subtle bugs, cf. the lengths of random and e_one. You have to convert everything to dstring to get the "expected" result. However, this is not always desirable.
Apr 20 2015
parent reply "Panke" <tobias pankrath.net> writes:
 This can lead to subtle bugs, cf. length of random and e_one. 
 You have to convert everything to dstring to get the "expected" 
 result. However, this is not always desirable.
There are three things that you need to be aware of when handling unicode: code units, code points and graphems. In general the length of one guarantees anything about the length of the other, except for utf32, which is a 1:1 mapping between code units and code points. In this thread, we were discussing the relationship between code points and graphemes. You're examples however apply to the relationship between code units and code points. To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units. If you normalize a string (in the sequence of characters/codepoints sense, not object.string) to NFC, it will decompose every precomposed character in the string (like é, single codeunit), establish a defined order between the composite characters and then recompose a selected few graphemes (like é). This way é always ends up as a single code unit in NFC. There are dozens of other combinations where you'll still have n:1 mapping between code points and graphemes left after normalization. Example given already in this thread: putting an arrow over an latin letter is typical in math and always more than one codepoint.
Apr 20 2015
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
 To measure the columns needed to print a string, you'll need 
 the number of graphemes. (d|)?string.length gives you the 
 number of code units.
Even that's not really true. In the end it's up to the font and layout engine to decide how much space anything takes up. Unicode doesn't play nicely with the idea of text as a grid of rows and fixed-width columns of characters, although quite a lot can be (and is; see urxvt, for example) shoe-horned in.
Apr 20 2015
next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, Apr 20, 2015 at 06:03:49PM +0000, John Colvin via Digitalmars-d wrote:
 On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
To measure the columns needed to print a string, you'll need the
number of graphemes. (d|)?string.length gives you the number of code
units.
Even that's not really true. In the end it's up to the font and layout engine to decide how much space anything takes up. Unicode doesn't play nicely with the idea of text as a grid of rows and fixed-width columns of characters, although quite a lot can (and is, see urxvt for example) be shoe-horned in.
Yeah, even the grapheme count does not necessarily tell you how wide the printed string really is. The characters in the CJK block are usually rendered with fonts that are, on average, twice as wide as your typical Latin/Cyrillic character, so even applications like urxvt that shoehorn proportional-width fonts into a text grid render CJK characters as two columns rather than one. Because of this, I actually wrote a function at one time to determine the width of a given Unicode character (i.e., zero, single, or double) as displayed in urxvt. Obviously, this is no help if you need to wrap lines rendered with a proportional font. And it doesn't even attempt to work correctly with bidi text.

This is why I said at the beginning that wrapping a line of text is a LOT harder than it sounds. A function that only takes a string as input does not have the necessary information to do this correctly in all use cases. The current wrap() function doesn't even do it correctly modulo the information available: it doesn't handle combining diacritics and zero-width characters properly. In fact, it doesn't even handle control characters properly, except perhaps for \t and \n.

There are so many things wrong with the current wrap() function (and many other string-processing functions in Phobos) that it makes it look like a joke when we claim that D provides Unicode correctness out-of-the-box. The only use case where wrap() gives the correct result is when you stick with pre-Unicode Latin strings displayed on a text console. As such, I don't really see the general utility of wrap() as it currently stands, and I question its value in Phobos, as opposed to an actually more useful implementation that, for instance, correctly implements the Unicode line-breaking algorithm.

T

-- 
It said to install Windows 2000 or better, so I installed Linux instead.
Apr 20 2015
prev sibling parent reply "Panke" <tobias pankrath.net> writes:
On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:
 On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
 To measure the columns needed to print a string, you'll need 
 the number of graphemes. (d|)?string.length gives you the 
 number of code units.
Even that's not really true.
Why? Doesn't string.length give you the byte count?
Apr 20 2015
next sibling parent "rumbu" <rumbu rumbu.ro> writes:
On Monday, 20 April 2015 at 19:24:01 UTC, Panke wrote:
 On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:
 On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
 To measure the columns needed to print a string, you'll need 
 the number of graphemes. (d|)?string.length gives you the 
 number of code units.
Even that's not really true.
Why? Doesn't string.length give you the byte count?
You'll also need the unicode character display width: even if the font is monospaced, there are characters (Katakana, Hangul and even in Latin script) with variable width.

ＡＢＣＤＥＦＧＨ
ABCDEFGH

(unicode 0xff21 through 0xff28). If the text above is not correctly displayed on your computer, a Korean console can be viewed here: http://upload.wikimedia.org/wikipedia/commons/1/14/KoreanDOSPrompt.png
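[Editor's note: the fullwidth letters above are distinct code points from their ASCII counterparts, so a one-code-point-per-column assumption silently breaks on them. A minimal sketch:]

```d
void main()
{
    dstring full = "\uFF21\uFF22\uFF23"; // fullwidth ＡＢＣ, roughly 2 columns each
    dstring half = "ABC";                // ASCII ABC, 1 column each
    assert(full != half);
    assert(full.length == half.length);  // same code-point count...
    // ...but the fullwidth string occupies about twice the terminal width.
    // Phobos has no built-in East Asian Width lookup; a real implementation
    // would need the Unicode EastAsianWidth.txt data.
}
```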
Apr 20 2015
prev sibling parent reply "JohnnyK" <johnnykinsey comcast.net> writes:
On Monday, 20 April 2015 at 19:24:01 UTC, Panke wrote:
 On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:
 On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
 To measure the columns needed to print a string, you'll need 
 the number of graphemes. (d|)?string.length gives you the 
 number of code units.
Even that's not really true.
Why? Doesn't string.length give you the byte count?
I think what you are looking for is string.sizeof?

From the D reference:

.sizeof	Returns the array length multiplied by the number of bytes per array element.
.length	Returns the number of elements in the array. This is a fixed quantity for static arrays. It is of type size_t.

Isn't a string type an array of characters (char[] for UTF-8, wchar[] for UTF-16, and dchar[] for UTF-32) and not arbitrary bytes?
Apr 21 2015
parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Tuesday, 21 April 2015 at 13:06:22 UTC, JohnnyK wrote:
 On Monday, 20 April 2015 at 19:24:01 UTC, Panke wrote:
 On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:
 On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
 To measure the columns needed to print a string, you'll need 
 the number of graphemes. (d|)?string.length gives you the 
 number of code units.
Even that's not really true.
Why? Doesn't string.length give you the byte count?
I was talking about the "you'll need the number of graphemes". s.length returns the number of elements in the slice, which in the case of D's string types is the same as the number of code units.
 I think what you are looking for is string.sizeof?

 From the D reference

 .sizeof	Returns the array length multiplied by the number of 
 bytes per array element.
 .length	Returns the number of elements in the array. This is a 
 fixed quantity for static arrays. It is of type size_t.
That is for static arrays only. .sizeof for slices is just size_t.sizeof + T*.sizeof i.e. 8 on 32 bit, 16 on 64 bit.
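[Editor's note: both points can be checked at compile time. A minimal sketch:]

```d
void main()
{
    int[] slice;
    int[4] fixed;

    // A slice is a (length, pointer) pair, regardless of element count:
    static assert(slice.sizeof == size_t.sizeof + (int*).sizeof);

    // A static array's .sizeof is element size times length:
    static assert(fixed.sizeof == 4 * int.sizeof);

    // .length of a string slice counts code units, not visible characters:
    auto s = "héllo"; // 5 code points, 6 UTF-8 code units
    assert(s.length == 6);
}
```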
Apr 21 2015
prev sibling parent "Chris" <wendlec tcd.ie> writes:
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
 This can lead to subtle bugs, cf. length of random and e_one. 
 You have to convert everything to dstring to get the 
 "expected" result. However, this is not always desirable.
There are three things that you need to be aware of when handling Unicode: code units, code points and graphemes.
This is why I use a helper function that uses byCodePoint and byGrapheme. At least for my use cases it returns the correct length. However, I might think about an alternative version based on the discussion here.
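[Editor's note: Chris doesn't show his helper, but a minimal sketch of the idea, counting graphemes instead of code units, might look like the following. The name displayLength is made up here.]

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: counts grapheme clusters, which is closer to
// "characters as the user sees them" than .length (code units).
size_t displayLength(S)(S s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    assert("e\u0301".displayLength == 1); // one visible character
    assert("e\u0301".length == 3);        // three UTF-8 code units
}
```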
 In general the length of one says nothing about the 
 length of the other, except for utf32, which is a 1:1 mapping 
 between code units and code points.

 In this thread, we were discussing the relationship between 
 code points and graphemes. Your examples however apply to the 
 relationship between code units and code points.

 To measure the columns needed to print a string, you'll need 
 the number of graphemes. (d|)?string.length gives you the 
 number of code units.

 If you normalize a string (in the sequence of 
 characters/codepoints sense, not object.string) to NFC, it will 
 decompose every precomposed character in the string (like é, a 
 single code point), establish a defined order between the 
 combining characters and then recompose a selected few 
 graphemes (like é). This way é always ends up as a single code 
 point in NFC. There are dozens of other combinations where 
 you'll still have an n:1 mapping between code points and 
 graphemes left after normalization.

 Example given already in this thread: putting an arrow over a 
 Latin letter is typical in math and always more than one 
 codepoint.
Apr 20 2015
prev sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 18 April 2015 at 16:01:20 UTC, Andrei Alexandrescu 
wrote:
 On 4/18/15 4:35 AM, Jacob Carlborg wrote:
 On 2015-04-18 12:27, Walter Bright wrote:

 That doesn't make sense to me, because the umlauts and the 
 accented e
 all have Unicode code point assignments.
This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
    dstring a = "e\u0301";
    dstring b = "é";

    assert(a != b);
    assert(a.length == 2);
    assert(b.length == 1);

    writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same.

\u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline. Then the rest of Phobos doesn't need to mind these combining characters. -- Andrei
Normalisation can allow some simplifications, sometimes, but knowing whether it will or not requires a lot of a priori knowledge about the input as well as the normalisation form.
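[Editor's note: Phobos already ships the building block Andrei describes; std.uni.normalize can serve as that pipeline stage, subject to John's caveat that normalization does not collapse every combining sequence. A minimal sketch:]

```d
import std.uni : normalize, NFC;

void main()
{
    string a = "e\u0301"; // decomposed: 'e' + combining acute
    string b = "é";       // precomposed U+00E9
    assert(a != b);                // bitwise comparison fails
    assert(normalize!NFC(a) == b); // equal after NFC normalization
}
```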
Apr 19 2015
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
 \u0301 is the "combining acute accent" [1].

 [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
I won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.
Apr 18 2015
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, Apr 18, 2015 at 10:50:18AM -0700, Walter Bright via Digitalmars-d wrote:
 On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
\u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
I won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.
Well, *somebody* has to convert it to the single code point eacute, whether it's the human (if the keyboard has a single key for it), or the code interpreting keystrokes (the user may have typed it as e + combining acute), or the program that generated the combination, or the program that receives the data. When we don't know provenance of incoming data, we have to assume the worst and run normalization to be sure that we got it right. The two code-point version may also arise from string concatenation, in which case normalization has to be done again (or possibly from the point of concatenation, given the right algorithms). T -- Mediocrity has been pushed to extremes.
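[Editor's note: the concatenation case mentioned above is easy to trigger even when both inputs are individually normalized. A minimal sketch:]

```d
import std.uni : normalize, NFC;

void main()
{
    string prefix = "e";       // valid NFC on its own
    string suffix = "\u0301s"; // leading combining acute; also valid NFC
    string joined = prefix ~ suffix;

    assert(joined != "és");                // concatenation broke NFC
    assert(normalize!NFC(joined) == "és"); // must re-normalize afterwards
}
```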
Apr 18 2015
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 11:28 AM, H. S. Teoh via Digitalmars-d wrote:
 On Sat, Apr 18, 2015 at 10:50:18AM -0700, Walter Bright via Digitalmars-d
wrote:
 On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
 \u0301 is the "combining acute accent" [1].

 [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
I won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.
Well, *somebody* has to convert it to the single code point eacute, whether it's the human (if the keyboard has a single key for it), or the code interpreting keystrokes (the user may have typed it as e + combining acute), or the program that generated the combination, or the program that receives the data.
Data entry should be handled by the driver program, not a universal interchange format.
 When we don't know provenance of
 incoming data, we have to assume the worst and run normalization to be
 sure that we got it right.
I'm not arguing against the existence of the Unicode standard, I'm saying I can't figure any justification for standardizing different encodings of the same thing.
Apr 18 2015
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, Apr 18, 2015 at 11:40:08AM -0700, Walter Bright via Digitalmars-d wrote:
 On 4/18/2015 11:28 AM, H. S. Teoh via Digitalmars-d wrote:
[...]
When we don't know provenance of incoming data, we have to assume the
worst and run normalization to be sure that we got it right.
I'm not arguing against the existence of the Unicode standard, I'm saying I can't figure any justification for standardizing different encodings of the same thing.
Take it up with the Unicode consortium. :-) T -- Tech-savvy: euphemism for nerdy.
Apr 18 2015
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 1:22 PM, H. S. Teoh via Digitalmars-d wrote:
 Take it up with the Unicode consortium. :-)
I see nobody knows :-)
Apr 18 2015
prev sibling parent reply Shachar Shemesh <shachar weka.io> writes:
On 18/04/15 21:40, Walter Bright wrote:
 I'm not arguing against the existence of the Unicode standard, I'm
 saying I can't figure any justification for standardizing different
 encodings of the same thing.
A lot of areas in Unicode are due to pre-Unicode legacy.

I'm guessing here, but looking at the code points, é (U00e9 - Latin small letter E with acute) comes from Latin-1, which is designed to follow ISO-8859-1. U0301 (Combining acute accent) comes from "Combining diacritical marks".

The way I understand things, Unicode would really prefer to use U0065+U0301 rather than U00e9. Because of legacy systems, and because they would rather have the ISO-8859 code pages be 1:1 mappings, rather than 1:n mappings, they introduced code points they really would rather do without.

This also explains the "presentation forms" code pages (e.g. http://www.unicode.org/charts/PDF/UFB00.pdf). These were intended to be glyphs, rather than code points. Due to legacy reasons, it was not possible to simply discard them. They received code points, with a warning not to use these code points directly.

Also, notice that some letters can only be achieved using multiple code points. Hebrew diacritics, for example, do not, typically, have a composite form. My name fully spelled (which you rarely would do), שַׁחַר, cannot be represented with less than 6 code points, despite having only three letters.

The last paragraph isn't strictly true. You can use UFB2C + U05B7 for the first letter instead of U05E9 + U05C2 + U05B7. You would be using the presentation form which, as pointed out above, is only there for legacy.

Shachar
or shall I say שחר
Apr 18 2015
next sibling parent reply "Abdulhaq" <alynch4047 gmail.com> writes:
On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
 On 18/04/15 21:40, Walter Bright wrote:
 I'm not arguing against the existence of the Unicode standard, 
 I'm
 saying I can't figure any justification for standardizing 
 different
 encodings of the same thing.
A lot of areas in Unicode are due to pre-Unicode legacy. I'm guessing here, but looking at the code points, é (U00e9 - Latin small letter E with acute), which comes from Latin-1, which is designed to follow ISO-8859-1. U0301 (Combining acute accent) comes from "Combining diacritical marks". The way I understand things, Unicode would really prefer to use U0065+U0301 rather than U00e9. Because of legacy systems, and because they would rather have the ISO-8509 code pages be 1:1 mappings, rather than 1:n mappings, they introduced code points they really would rather do without. This also explains the "presentation forms" code pages (e.g. http://www.unicode.org/charts/PDF/UFB00.pdf). These were intended to be glyphs, rather than code points. Due to legacy reasons, it was not possible to simply discard them. They received code points, with a warning not to use these code points directly. Also, notice that some letters can only be achieved using multiple code points. Hebrew diacritics, for example, do not, typically, have a composite form. My name fully spelled (which you rarely would do), שַׁחַר, cannot be represented with less than 6 code points, despite having only three letters. The last paragraph isn't strictly true. You can use UFB2C + U05B7 for the first letter instead of U05E9 + U05C2 + U05B7. You would be using the presentation form which, as pointed above, is only there for legacy. Shachar or shall I say שחר
Yes Arabic is similar too
Apr 19 2015
parent Shachar Shemesh <shachar weka.io> writes:
On 19/04/15 10:51, Abdulhaq wrote:
 MiOn Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
 On 18/04/15 21:40, Walter Bright wrote:
 Also, notice that some letters can only be achieved using multiple
 code points. Hebrew diacritics, for example, do not, typically, have a
 composite form. My name fully spelled (which you rarely would do),
 שַׁחַר, cannot be represented with less than 6 code points, despite
 having only three letters.
Yes Arabic is similar too
Actually, the Arab presentation forms serve a slightly different purpose. In Hebrew, the presentation forms are mostly for Biblical text, where certain decorations are usually done.

For Arabic, the main reason for the presentation forms is shaping. Almost every Arabic letter can be written in up to four different forms (alone, start of word, middle of word and end of word). This means that Arabic has 28 letters, but over 100 different shapes for those letters. These days, when the font can do the shaping, the 28 letters suffice. During the DOS days, you needed to actually store those glyphs somewhere, which means that you needed to allocate a number to them.

In Hebrew, some letters also have a final form. Since the numbers are so significantly smaller, however (22 letters, 5 of which have final forms), Hebrew keyboards actually have all 27 letters on them. Going strictly by the "Unicode way", one would be expected to spell שלום with U05DE as the last letter, and let the shaping engine figure out that it should use the final form (or add a ZWNJ). Since all Hebrew code charts contained a final form Mem, however, you actually spell it with U05DD in the end, and it is considered a distinct letter.

Shachar
Apr 19 2015
prev sibling parent "Ola Fosheim Grøstad" writes:
On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
 U0065+U0301 rather than U00e9. Because of legacy systems, and 
 because they would rather have the ISO-8509 code pages be 1:1 
 mappings, rather than 1:n mappings, they introduced code points 
 they really would rather do without.
That's probably right. It is in fact a major feat to have the world adopt a new standard wholesale, but there are also difficult "semiotic" issues when you encode symbols and different languages view symbols differently (e.g. is "ä" an "a" or do you have two unique letters in the alphabet?)

Take "å": it can represent a unit (ångström) or a letter with a circle above it, or a unique letter in the alphabet. The letter "æ" can be seen as a combination of "ae" or as a unique letter. And we can expect languages, signs and practices to evolve over time too.

How can you normalize encodings without normalizing writing practice and natural language development? That would be beyond the mandate of a unicode standard organization...
Apr 19 2015
prev sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 18 April 2015 at 17:50:12 UTC, Walter Bright wrote:
 On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
 \u0301 is the "combining acute accent" [1].

 [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
I won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.
é might be obvious, but Unicode isn't just for writing European prose. Uses for combining characters include (but are *nowhere* near limited to) mathematical notation, where the combinatorial explosion of possible combinations that still belong to one grapheme cluster (character is a familiar but misleading word when talking about Unicode) would trivially become an insanely large (more atoms than in the universe levels of large) number of characters.

Unicode is a nightmarish system in some ways, but considering how incredibly difficult the problem it solves is, it's actually not too crazy.
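[Editor's note: the "vector arrow" from mathematical notation is exactly such a combining sequence. A minimal sketch:]

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string vec = "x\u20D7"; // 'x' + U+20D7 combining right arrow above: x⃗
    assert(vec.walkLength == 2);            // two code points...
    assert(vec.byGrapheme.walkLength == 1); // ...one grapheme cluster
    // No precomposed "x with arrow" code point exists; encoding every such
    // combination separately would blow up the code space.
}
```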
Apr 19 2015
parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:

 é might be obvious, but Unicode isn't just for writing European prose.
it is also to insert pictures of the animals into text.
 Unicode is a nightmarish system in some ways, but considering how
 incredibly difficult the problem it solves is, it's actually not too
 crazy.
it's not crazy, it's just broken in all possible ways: http://file.bestmx.net/ee/articles/uni_vs_code.pdf
Apr 19 2015
next sibling parent "weaselcat" <weaselcat gmail.com> writes:
On Sunday, 19 April 2015 at 19:58:28 UTC, ketmar wrote:
 On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:

 é might be obvious, but Unicode isn't just for writing 
 European prose.
it is also to insert pictures of the animals into text.
There's other uses for unicode? 🐧
Apr 19 2015
prev sibling next sibling parent reply "Nick B" <nick.barbalich gmail.com> writes:
On Sunday, 19 April 2015 at 19:58:28 UTC, ketmar wrote:
 On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:
 it's not crazy, it's just broken in all possible ways:
 http://file.bestmx.net/ee/articles/uni_vs_code.pdf
Ketmar

Great link, and a really good argument about the problems with Unicode.

Quote from 'Instead of Conclusion':

Yes. This is the root of Unicode misdesign. They mixed up two mutually exclusive approaches. They blended badly two different abstraction levels: the textual level which corresponds to a language idea and the graphical level which does not care of a language, yet cares of writing direction, subscripts, superscripts and so on. In other words we need two different Unicodes built on these two opposite principles, instead of the one built on an insane mix of controversial axioms.

end quote.

Perhaps Unicode needs to be rebuilt from the ground up?
Apr 19 2015
parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Mon, 20 Apr 2015 01:27:36 +0000, Nick B wrote:

 Perhaps Unicode needs to be rebuilt from the ground up?
alas, it's too late. now we'll live with that "unicode" crap for many years.
Apr 19 2015
parent reply "Nick B" <nick.barbalich gmail.com> writes:
On Monday, 20 April 2015 at 03:39:54 UTC, ketmar wrote:
 On Mon, 20 Apr 2015 01:27:36 +0000, Nick B wrote:

 Perhaps Unicode needs to be rebuilt from the ground up?
alas, it's too late. now we'll live with that "unicode" crap for many years.
Perhaps. or perhaps not. This community got together under Walter and Andrei leadership to building a new programming language, on the pillars of the old. Perhaps a new Unicode standard, could start that way as well ?
Apr 19 2015
parent Jacob Carlborg <doob me.com> writes:
On 2015-04-20 08:04, Nick B wrote:

 Perhaps a new Unicode standard, could start that way as well ?
https://xkcd.com/927/ -- /Jacob Carlborg
Apr 20 2015
prev sibling parent Shachar Shemesh <shachar weka.io> writes:
On 19/04/15 22:58, ketmar wrote:
 On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:

 it's not crazy, it's just broken in all possible ways:
 http://file.bestmx.net/ee/articles/uni_vs_code.pdf
This is not a very accurate depiction of Unicode. For example:

    And, moreover, BOM is meaningless without mentioning of encoding. So we have to specify encoding anyway.

No. BOM is what lets you auto-detect the encoding. If you know you will be using UTF-8, 16 or 32 with an unknown encoding, BOM will tell you which it is. That is its entire purpose, in fact. There, pretty much, goes point #1.

And then:

    Unicode contains at least writing direction control symbols (LTR is U+200E and RTL is U+200F) which role is IDENTICAL to the role of codepage-switching symbols with the associated disadvantages.

That's just ignorance of how the UBA (TR#9) works. LRM and RLM are mere invisible characters with defined directionality. Cutting them away from a substring would not invalidate your text more than cutting away actual text would under the same conditions. In any case, unlike page switching symbols, it would only affect your display, not your understanding of the text. So point #2 is out.

He has some valid argument under point #3, but also lots of !(#&$ nonsense. He is right, I think, that denoting units with separate code points makes no sense, but the rest of his arguments seem completely off. For example, asking Latin and Cyrillic to share the same region merely because some letters look alike makes no sense, implementation wise.

Points #4, #5, #6 and #7 are the same point. The main objection I have there is his assumption that the situation is, somehow, worse than it was. Yes, if you knew your encoding was Windows-1255, you could assume the text is Hebrew. Or Yiddish. And this, I think, is one of the encodings with the least number of languages riding on it. Windows-1256 has Arabic, Persian, Urdu and others. Windows-1252 has the entire western European script. As pointed out elsewhere in this thread, Spanish and French treat case folding of accented letters differently.

Also, we see that the solution he thinks would work better actually doesn't.

People living in France don't switch to a QWERTY keyboard when they want to type English. They type English with their AZERTY keyboard. There simply is no automatic way to tell what language something is typed in without a human telling you (or applying content based heuristics).

Microsoft Word stores, for each letter, the keyboard language it was typed with. This causes great problems when copying to other editors, performing searches, or simply trying to get bidirectional text to appear correctly. The problem is so bad that phone numbers where the prefix appears after the actual number are not considered bad form or unusual, even in official PR material or when sending resumes.

In fact, the only time you can count on someone to switch keyboards is when they need to switch to a language with a different alphabet. No Russian speaker will type English using the Russian layout, even if what she has to say happens to use letters with the same glyphs. You simply do not plan that much ahead.

The point I'm driving at is that just because someone posted some rant on the Internet doesn't mean it's correct. When someone says something is broken, always ask them what they suggest instead.

Shachar
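[Editor's note: the BOM-based auto-detection described above can be sketched as follows. This is illustrative only; detectBom is a made-up name, and the 4-byte UTF-32 BOMs must be checked before the UTF-16 ones, since UTF-32LE's BOM begins with UTF-16LE's.]

```d
// Minimal BOM sniffing sketch.
string detectBom(const(ubyte)[] data)
{
    static immutable ubyte[] utf32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] utf32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] utf8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] utf16le = [0xFF, 0xFE];
    static immutable ubyte[] utf16be = [0xFE, 0xFF];

    if (data.length >= 4 && data[0 .. 4] == utf32le) return "UTF-32LE";
    if (data.length >= 4 && data[0 .. 4] == utf32be) return "UTF-32BE";
    if (data.length >= 3 && data[0 .. 3] == utf8)    return "UTF-8";
    if (data.length >= 2 && data[0 .. 2] == utf16le) return "UTF-16LE";
    if (data.length >= 2 && data[0 .. 2] == utf16be) return "UTF-16BE";
    return "unknown"; // no BOM: the encoding must be specified elsewhere
}

void main()
{
    static immutable ubyte[] sample = [0xEF, 0xBB, 0xBF, 'h', 'i'];
    assert(detectBom(sample) == "UTF-8");
}
```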
Apr 19 2015
prev sibling parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
On Saturday, 18 April 2015 at 08:26:12 UTC, Panke wrote:
 On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:
 On 4/18/2015 12:58 AM, John Colvin wrote:
 On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
 On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
 So either you have to throw out all pretenses of 
 Unicode-correctness and
 just stick with ASCII-style per-character line-wrapping, or 
 you have to
 live with byGrapheme with all the complexity that it 
 entails. The former
 is quite easy to write -- I could throw it together in a 
 couple o' hours
 max, but the latter is a pretty big project (cf. Unicode 
 line-breaking
 algorithm, which is one of the TR's).
It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Code points aren't equivalent to characters. They're not the same thing in most European languages,
I know a bit of German, for what characters is that not true?
Umlauts, if combined characters are used. Also words that still have their accents left after import from foreign languages, e.g. Café.

Getting all unicode correct seems a daunting task with a severe performance impact, esp. if we need to assume that a string might have any normalization form or none at all.

See also: http://unicode.org/reports/tr15/#Norm_Forms
Another issue is that lower case and upper case letters might have different size requirements or look different depending on where in the word they are located.

For example, German ß and SS, Greek σ and ς. I know Turkish also has similar cases.

--
Paulo
Apr 18 2015
parent "Tobias Pankrath" <tobias pankrath.net> writes:
 Also another issue is that lower case letters and upper case 
 might have different size requirements or look different 
 depending on where on the word they are located.

 For example, German ß and SS, Greek σ and ς. I know Turkish 
 also has similar cases.

 --
 Paulo
While true, it does not affect wrap (the algorithm) as far as I can see.
Apr 18 2015
prev sibling parent Shachar Shemesh <shachar weka.io> writes:
On 17/04/15 19:59, H. S. Teoh via Digitalmars-d wrote:
 There's also the question of what to do with bidi markings: how do you
 handle counting the columns in that case?
Which BiDi marking are you referring to? LRM/RLM and friends? If so, don't worry: the interface, as described, is incapable of properly handling BiDi anyways.

The proper way to handle BiDi line wrapping is this. First you assign a BiDi level to each character (at which point the markings are, effectively, removed from the input, so there goes your problem). Then you calculate the glyphs' widths until the line limit is reached, and then you reorder each line according to the BiDi levels you calculated earlier.

As can be easily seen, this requires transitioning BiDi information that is per-paragraph across the line break logic, pretty much mandating multiple passes on the input. Since the requested interface does not allow that, proper BiDi line breaking is impossible with that interface.

I'll mention that not everyone takes that as a serious problem. Windows' text control, for example, calculates line breaks on the text, and then runs the BiDi algorithm on each line individually. Few people notice this. Then again, people have already grown used to BiDi text being scrambled.

Shachar
Apr 17 2015
prev sibling next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
 -- 
 All problems are easy in retrospect.
Argh, my Perl script doth mock me! T -- Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder
Apr 17 2015
prev sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
 So either you have to throw out all pretenses of Unicode-correctness
 and just stick with ASCII-style per-character line-wrapping, or you
 have to live with byGrapheme with all the complexity that it entails.
 The former is quite easy to write -- I could throw it together in a
 couple o' hours max, but the latter is a pretty big project (cf.
 Unicode line-breaking algorithm, which is one of the TR's).
[...]

Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:

import std.range.primitives;

/**
 * Range version of $(D std.string.wrap).
 *
 * Bugs:
 * This function does not conform to the Unicode line-breaking algorithm. It
 * does not take into account zero-width characters, combining diacritics,
 * double-width characters, non-breaking spaces, and bidi markings. Strings
 * containing these characters therefore may not be wrapped correctly.
 */
auto wrapped(R)(R range, in size_t columns = 80, R firstindent = null,
                R indent = null, in size_t tabsize = 8)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    import std.algorithm.iteration : map, joiner;
    import std.range : chain;
    import std.uni;

    alias CharType = ElementType!R;

    // Returns: Wrapped lines.
    struct Result
    {
        private R range, indent;
        private size_t maxCols, tabSize;
        private size_t currentCol = 0;
        private R curIndent;
        bool empty = true;
        bool atBreak = false;

        this(R _range, R _firstindent, R _indent, size_t columns,
             size_t tabsize)
        {
            this.range = _range;
            this.curIndent = _firstindent.save;
            this.indent = _indent;
            this.maxCols = columns;
            this.tabSize = tabsize;
            empty = _range.empty;
        }

        @property CharType front()
        {
            if (atBreak)
                return '\n'; // should implicit convert to wider characters
            else if (!curIndent.empty)
                return curIndent.front;
            else
                return range.front;
        }

        void popFront()
        {
            if (atBreak)
            {
                // We're at a linebreak.
                atBreak = false;
                currentCol = 0;

                // Start new line with indent
                curIndent = indent.save;
                return;
            }
            else if (!curIndent.empty)
            {
                // We're iterating over an initial indent.
                curIndent.popFront();
                currentCol++;
                return;
            }

            // We're iterating over the main range.
            range.popFront();
            if (range.empty)
            {
                empty = true;
                return;
            }

            if (range.front == '\t')
                currentCol += tabSize;
            else if (isWhite(range.front))
            {
                // Scan for next word boundary to decide whether or not to
                // break here.
                R tmp = range.save;
                assert(!tmp.empty);
                size_t col = currentCol;

                // Find start of next word
                while (!tmp.empty && isWhite(tmp.front))
                {
                    col++;
                    tmp.popFront();
                }

                // Remember start of next word so that if we need to break, we
                // won't introduce extraneous spaces to the start of the new
                // line.
                R nextWord = tmp.save;
                while (!tmp.empty && !isWhite(tmp.front))
                {
                    col++;
                    tmp.popFront();
                }
                assert(tmp.empty || isWhite(tmp.front));

                if (col > maxCols)
                {
                    // Word wrap needed. Move current range position to
                    // start of next word.
                    atBreak = true;
                    range = nextWord;
                    return;
                }
            }
            currentCol++;
        }

        @property Result save()
        {
            Result copy = this;
            copy.range = this.range.save;
            //copy.indent = this.indent.save; // probably not needed?
            copy.curIndent = this.curIndent.save;
            return copy;
        }
    }
    static assert(isForwardRange!Result);

    return Result(range, firstindent, indent, columns, tabsize);
}

unittest
{
    import std.algorithm.comparison : equal;

    auto s = ("This is a very long, artificially long, and gratuitously long "~
              "single-line sentence to serve as a test case for byParagraph.")
             .wrapped(30, ">>>>", ">>");
    assert(s.equal(
        ">>>>This is a very long,\n"~
        ">>artificially long, and\n"~
        ">>gratuitously long single-line\n"~
        ">>sentence to serve as a test\n"~
        ">>case for byParagraph."
    ));
}

I didn't bother with avoiding autodecoding -- that should be relatively easy to add, but I think it's stupid that we have to continually write workarounds in our code to get around auto-decoding. If it's so important that we don't autodecode, can we pretty please make the decision already and kill it off for good?!

T

-- 
To err is human; to forgive is not our policy. -- Samuel Adler
Apr 17 2015
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
 Well, talk is cheap, so here's a working implementation of the
 non-Unicode-correct line wrapper that uses ranges and does not allocate:
awesome! Please make a pull request for this so you get proper credit!
Apr 17 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d wrote:
 On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
Well, talk is cheap, so here's a working implementation of the
non-Unicode-correct line wrapper that uses ranges and does not
allocate:
awesome! Please make a pull request for this so you get proper credit!
Doesn't that mean I have to add the autodecoding workarounds first?

T

-- 
Life is too short to run proprietary software. -- Bdale Garbee
Apr 17 2015
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/17/2015 11:46 AM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d
wrote:
 On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
 Well, talk is cheap, so here's a working implementation of the
 non-Unicode-correct line wrapper that uses ranges and does not
 allocate:
awesome! Please make a pull request for this so you get proper credit!
Doesn't that mean I have to add the autodecoding workarounds first?
Before it gets pulled, yes, meaning that the element type of front() should match the element encoding type of Range. There's also an issue with firstindent and indent being the same range type as 'range', which is not practical, as Range is likely a voldemort type. I suggest making them simply of type 'string'; I don't see any point in making them ranges. A unit test with an input range is needed, and one with some multibyte unicode encodings.
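[To make the suggestion concrete, a hypothetical sketch of the revised
signature (not an actual pull request): the indents become plain strings, so
callers holding a voldemort range type don't have to construct matching
indent ranges:

```d
import std.range.primitives : isForwardRange, ElementType;

// Hypothetical revision: indents as strings, not as the input range type R.
auto wrapped(R)(R range, in size_t columns = 80,
                string firstindent = null, string indent = null,
                in size_t tabsize = 8)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    // ... same Result implementation as before, but with curIndent and
    // indent held as string rather than R ...
    assert(0, "sketch only");
}
```

Inside Result, front() would then need to convert the indent's chars to the
output element type, which ties into the front()/encoding-type issue above.]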
Apr 17 2015
prev sibling parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote:

 Well, talk is cheap, so here's a working implementation of the
 non-Unicode-correct line wrapper that uses ranges and does not allocate:
there is some... inconsistency: `std.string.wrap` adds final "\n" to string. ;-) but i always hated it for that.
Apr 17 2015
parent reply "Panke" <tobias pankrath.net> writes:
On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote:
 On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via 
 Digitalmars-d wrote:

 Well, talk is cheap, so here's a working implementation of the
 non-Unicode-correct line wrapper that uses ranges and does not 
 allocate:
there is some... inconsistency: `std.string.wrap` adds final "\n" to string. ;-) but i always hated it for that.
A range of lines instead of inserted \n would be a good API as well.
Apr 17 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Apr 17, 2015 at 08:44:51PM +0000, Panke via Digitalmars-d wrote:
 On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote:
On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote:

Well, talk is cheap, so here's a working implementation of the
non-Unicode-correct line wrapper that uses ranges and does not
allocate:
there is some... inconsistency: `std.string.wrap` adds final "\n" to string. ;-) but i always hated it for that.
A range of lines instead of inserted \n would be a good API as well.
Indeed, that would be even more useful: then you could just do .joiner("\n") to get the original functionality. However, I think Walter's goal here is to match the original wrap() functionality. Perhaps the prospective wrapped() function could be implemented in terms of a byWrappedLines() function that does return a range of wrapped lines.

T

-- 
The volume of a pizza of thickness a and radius z can be described by the following formula: pi zz a. -- Wouter Verhelst
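[A sketch of that layering, assuming a hypothetical byWrappedLines() that
lazily yields one wrapped line at a time; wrapped() then reduces to a thin
composition:

```d
import std.algorithm.iteration : joiner;

// Hypothetical: byWrappedLines(range, columns) returns a lazy forward range
// of wrapped lines; joiner("\n") splices them back into a flat character
// range, matching the original wrap()-style output (minus the trailing \n).
auto wrapped(R)(R range, in size_t columns = 80)
{
    return range.byWrappedLines(columns).joiner("\n");
}
```

Callers wanting lines keep byWrappedLines(); callers wanting the classic flat
string behaviour use wrapped(), and neither allocates.]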
Apr 18 2015
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/18/2015 1:32 PM, H. S. Teoh via Digitalmars-d wrote:
 However, I think Walter's goal here is to match the original wrap()
 functionality.
Yes, although the overarching goal is: Minimize Need For Using GC In Phobos and the method here is to use ranges rather than having to allocate string temporaries.
Apr 18 2015