digitalmars.D - VLERange: a range in between BidirectionalRange and RandomAccessRange

Andrei Alexandrescu (36/36) Jan 10 2011 I've been thinking on how to better deal with Unicode strings. Currently...

Michel Fortin (26/68) Jan 11 2011 Seems like a good idea to define things formally.

Andrei Alexandrescu (9/36) Jan 11 2011 In the design as I thought of it, the effective length of one logical

spir (25/53) Jan 11 2011 I think Michel is right. If I understand correctly, VLERange addresses

Andrei Alexandrescu (31/81) Jan 11 2011 It's not about the data, it's about algorithms. Currently there are

spir (24/51) Jan 11 2011 IIUC, for the case of text, VLERange helps abstracting from the annoying...

Andrei Alexandrescu (3/34) Jan 11 2011 You should try text.front right now, you might be surprised :o).

spir (18/30) Jan 11 2011 Hum, right now incorrectly returns "a" as expected. And indeed

Michel Fortin (26/39) Jan 11 2011 Your understanding is correct.

Andrei Alexandrescu (19/54) Jan 11 2011 I disagree. When I suggested this design I was worried of

Steven Schveighoffer (11/48) Jan 11 2011 While this makes it possible to write algorithms that only accept

Andrei Alexandrescu (13/21) Jan 11 2011 But that's neither here nor there. That would return the logical element...

Steven Schveighoffer (18/39) Jan 11 2011 This solitary difference is a very thin argument -- foreach(d;

Andrei Alexandrescu (18/60) Jan 11 2011 Unfinished sentence? Anyway, for my money you just described what we

Steven Schveighoffer (24/54) Jan 13 2011 Sorry, I forgot '.' :)

Andrei Alexandrescu (30/37) Jan 13 2011 Let's take a look:

Steven Schveighoffer (22/59) Jan 13 2011 You might be looking at my previous version. The new version (recently ...

Andrei Alexandrescu (15/67) Jan 13 2011 I was looking at your latest. It's code that compiles and runs, but

Steven Schveighoffer (21/77) Jan 13 2011 iterating the code units is possible by accessing the array data. i.e. ...
Nick Sabalausky (5/7) Jan 13 2011 I dunno, spir has successfully convinced me that most of the time it's

spir (26/35) Jan 13 2011 You are right in that those 2 issues are really analog. In practice,
Lutger Blijdestijn (19/29) Jan 15 2011 I agree. This is a very informative thread, thanks spir and everybody el...

Michel Fortin (40/73) Jan 15 2011 I have my idea.

Lutger Blijdestijn (5/23) Jan 15 2011 ...

foobar (9/35) Jan 15 2011 My two cents are against this kind of design.

Michel Fortin (47/91) Jan 15 2011 Nothing prevents that in the design I proposed. Andrei's design already

foobar (19/126) Jan 15 2011 Ok, I guess I missed the "byDchar()" method.

Michel Fortin (18/43) Jan 15 2011 What I don't understand is in what way using a string type would make

foobar (10/31) Jan 15 2011 First thing, the question of possibility is irrelevant since I could als...

Jonathan M Davis (28/108) Jan 15 2011 ake

Michel Fortin (11/60) Jan 15 2011 I remember that someone already complained about this issue because he

Jonathan M Davis (18/82) Jan 15 2011 If a character literal actually became a grapheme instead of a dchar, th...

Michel Fortin (20/97) Jan 15 2011 Character literals are treated as simple numbers by the language. By

foobar (10/41) Jan 15 2011 I Understand your concern regarding a simpler implementation. You want t...

Michel Fortin (15/33) Jan 16 2011 It should also work for:

foobar (6/49) Jan 16 2011 Right. This does require compiler changes.

Michel Fortin (16/64) Jan 13 2011 That's forgetting that most of the time people care about graphemes

Andrei Alexandrescu (10/70) Jan 13 2011 I'm not so sure about that. What do you base this assessment on? Denis

Nick Sabalausky (39/46) Jan 13 2011 It's what they want, they just don't know it.

Andrei Alexandrescu (6/18) Jan 13 2011 Thanks. One further question is: in the above example with

Nick Sabalausky (9/33) Jan 13 2011 My understanding is "yes". At least that's what I've heard, and I've nev...

Nick Sabalausky (8/44) Jan 13 2011 Heh, as if that wasn't bad enough, there's also digraphs which, from wha...

Daniel Gibson (3/51) Jan 14 2011 OMG, this is really fucked up.

Steven Schveighoffer (28/67) Jan 14 2011 http://en.wikipedia.org/wiki/Unicode_normalization

Jonathan M Davis (10/85) Jan 14 2011 Well, there's plenty in std.string that already deals in strings rather ...

spir (17/49) Jan 14 2011 The problem is then whether a font knows how to display it. My usual
Michel Fortin (14/24) Jan 14 2011 Correct, there's a lot of combinations with no pre-combined form. This

Gianluigi Rubino (3/12) Jan 14 2011 All the examples given so far worked fine on my iPhone.

spir (12/15) Jan 14 2011 See my previous follow-up to nick's explanation. But the answer is yes,

Daniel Gibson (5/15) Jan 14 2011 Agreed. Up until spir mentioned graphemes in this newsgroup I always

spir (8/27) Jan 14 2011 That's what makes sense for the user in 99.9% case, thus that's what

spir (26/76) Jan 14 2011 If anyone finds a pointer to such an explanation, bravo, and than you.

Nick Sabalausky (22/35) Jan 14 2011 Yea, most Unicode explanations seem to talk all about "code-units vs

spir (67/101) Jan 16 2011 If anyone is interested, ICU's documentation is far more readable (and

Walter Bright (5/11) Jan 15 2011 I know some German, and to the best of my knowledge there are zero combi...

spir (54/89) Jan 14 2011 I'm aware of that, and I have no definitive answer to the question. The

Steven Schveighoffer (8/45) Jan 14 2011 * I don't even know how to make a grapheme that is more than one

spir (17/23) Jan 14 2011 1. See my text at

Steven Schveighoffer (18/40) Jan 14 2011 I can't read that document, it's black background with super-dark-grey

Michel Fortin (34/58) Jan 14 2011 Not in my knowledge. But I rarely deal with non-latin texts, there's

Steven Schveighoffer (21/73) Jan 15 2011 Hm... this pushes the normalization outside the type, and into the

Lutger Blijdestijn (5/19) Jan 15 2011 If its a matter of choosing which is the 'default' range, I'd think prop...

Steven Schveighoffer (28/51) Jan 15 2011 English and (if I understand correctly) most other languages. Any

foobar (5/66) Jan 15 2011 The above compromise provides zero benefit. The proposed default type st...

Steven Schveighoffer (27/83) Jan 15 2011 I feel like you might be exaggerating, but maybe I'm completely wrong on...

Steven Schveighoffer (7/24) Jan 15 2011 I see from Michel's post how normalization automatically can be bad. I ...
foobar (7/105) Jan 15 2011 That was already shown by Michel and Spir where the equality operator is...

Steven Schveighoffer (3/10) Jan 15 2011 Well said, I've changed my mind. Thanks for explaining.
spir (16/18) Jan 17 2011 In a few days, D will have an external library able to deal with those

spir (27/33) Jan 17 2011 Hello Steven,

Steven Schveighoffer (7/37) Jan 17 2011 I'll reply to this to save you the trouble. I have reversed my position...

Michel Fortin (34/88) Jan 15 2011 Why don't we build a compiler with an optimizer that generates correct

Steven Schveighoffer (5/85) Jan 15 2011 You make very good points. I concede that using dchar as the element

Michel Fortin (42/85) Jan 15 2011 Not really. It pushes the normalization to the string comparison

Steven Schveighoffer (34/108) Jan 15 2011 Are these common requirements? I thought users mostly care about

Michel Fortin (18/52) Jan 15 2011 I'm glad we agree on that now.

Steven Schveighoffer (49/94) Jan 15 2011 It's a matter of me slowly wrapping my brain around unicode and how it's...

foobar (3/119) Jan 15 2011 I like Michel's proposed semantics and I also agree with you that it sho...

Steven Schveighoffer (12/18) Jan 17 2011 A grapheme would be its own specialized type. I'd probably remove the

Jonathan M Davis (12/33) Jan 17 2011 I think that it would make good sense for a grapheme to be struct which ...

Andrei Alexandrescu (4/37) Jan 17 2011 If someone makes a careful submission of a Grapheme to Phobos as

Michel Fortin (67/137) Jan 15 2011 Actually, I don't think Unicode was so badly designed. It's just that

Andrei Alexandrescu (23/119) Jan 15 2011 I'm unclear on where this is converging to. At this point the commitment...

Jonathan M Davis (32/167) Jan 15 2011 Considering that strings are already dealt with specially in order to ha...

Michel Fortin (12/25) Jan 15 2011 Walter's argument against changing this for foreach was that it'd

Andrei Alexandrescu (4/26) Jan 16 2011 I think it's poor abstraction to represent a Grapheme as a string. It

Andrei Alexandrescu (6/10) Jan 16 2011 It would make everything related a lot (a TON) slower, and it would

Andrej Mitrovic (4/4) Jan 16 2011 And how would 3rd party libraries handle Graphemes? And C modules? I
Steven Schveighoffer (15/27) Jan 17 2011 I would have agreed with you last week. Now I understand that using dch...

Lars T. Kyllingstad (5/11) Jan 17 2011 Googling "unicode sample document" turned up a few examples. This one
Andrei Alexandrescu (8/35) Jan 17 2011 This is one extreme. Char only works for English. Dchar works for most
Andrei Alexandrescu (6/12) Jan 17 2011 Oh, one more thing. You don't need a lot of Unicode text containing

Steven Schveighoffer (6/17) Jan 17 2011 True, benchmarking doesn't apply with combining characters because we ha...
spir (19/30) Jan 17 2011 Correct. For this reason, we do not use the same source at all for

spir (39/67) Jan 17 2011 Hello Steve & Andrei,

Jonathan M Davis (19/187) Jan 15 2011 a
Michel Fortin (22/47) Jan 15 2011 There's still a disagreement about whether a string or a code unit

Andrei Alexandrescu (44/88) Jan 16 2011 Disagreement as that might be, a simple fact that needs to be taken into...

Michel Fortin (55/117) Jan 16 2011 I think the only people who should *not* care are those who have

Andrei Alexandrescu (22/136) Jan 16 2011 I love the increased precision, but again I'm not sure how many people

Daniel Gibson (17/42) Jan 16 2011 So why does D use unicode anyway?

Andrei Alexandrescu (4/52) Jan 16 2011 I think German text works well with dchar.

Jonathan M Davis (85/144) Jan 16 2011 te
Daniel Gibson (47/105) Jan 16 2011 Really? UTF32 - maybe. But IMHO even when not considering graphemes and ...

Daniel Gibson (6/116) Jan 16 2011 of course I forgot:

Michel Fortin (103/120) Jan 17 2011 As I said: all those people who are not validating the inputs to make

Andrei Alexandrescu (80/197) Jan 17 2011 The question (which I see you keep on dodging :o)) is how much text

Michel Fortin (88/207) Jan 17 2011 Not much, right now.

Andrei Alexandrescu (7/10) Jan 17 2011 But at some point you must be able to talk about individual characters

Michel Fortin (42/53) Jan 17 2011 It seems that it can. NSString only exposes individual UTF-16 code

Michel Fortin (22/30) Jan 17 2011 This makes me think of what I did with my XML parser after you made

Andrei Alexandrescu (3/29) Jan 17 2011 Very insightful. Thanks for sharing. Code it up and make a solid proposa...

Steven Wawryk (3/38) Jan 17 2011 How does this differ from Steve Schveighoffer's string_t, subtract the

Andrei Alexandrescu (3/44) Jan 18 2011 There's no string, only range...

Steven Wawryk (8/43) Jan 18 2011 Which is exactly what I asked you about. I understand that you must be

Andrei Alexandrescu (23/71) Jan 18 2011 One simple fact is that I'm not the only person who needs to look at a

Steven Wawryk (10/53) Jan 18 2011 Ok, thanks for this suggestion. But if developing a proposal as

Andrei Alexandrescu (8/73) Jan 18 2011 My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a

Steven Wawryk (15/35) Jan 18 2011 I don't think that it did. I proposed no language change, nor anything

Andrei Alexandrescu (13/45) Jan 18 2011 Adding a new string type would be disruptive. Unless I misunderstood,

Michel Fortin (59/93) Jan 18 2011 What I use right now is this (see below). I'm not sure what would be a

Andrei Alexandrescu (32/80) Jan 18 2011 [snip]

Michel Fortin (33/121) Jan 18 2011 Yes, we need a grapheme range.

spir (14/75) Jan 18 2011 On 01/18/2011 06:14 PM, Michel Fortin wrote:

spir (13/39) Jan 18 2011 This looks like a very interesting approach. And clear.
Ali Çehreli (33/37) Jan 18 2011 That's what I've been thinking. The users can choose whether they want

spir (72/110) Jan 19 2011 This is very good and helpful summary. But you do not list all relevant

spir (9/15) Jan 17 2011 Actually, there are at least 2 special cases:

Steven Schveighoffer (44/149) Jan 17 2011 I didn't read the standard, all I understand about unicode is from this ...
spir (27/32) Jan 17 2011 I think like you about pre-composed characters: they bring no real gain
Ali Çehreli (11/17) Jan 17 2011 Thanks to all that has contributed, I am also following this thread with...

spir (23/39) Jan 19 2011 This is true and false ;-)

spir (21/25) Jan 17 2011 I am unsure now about the question of a text's (apparent) natural

Gerrit Wichert (12/23) Jan 14 2011 I'm afraid that this is not a proper way to handle this problem. It may

Steven Schveighoffer (5/30) Jan 15 2011 Actually, this would only lazily *and temporarily* convert the string pe...

Joel C. Salomon (15/19) Jan 23 2011 Hebrew:

Nick Sabalausky (13/15) Jan 14 2011 How to do that on the Windows (XP) command prompt, for anyone who doesn'...

Nick Sabalausky (18/24) Jan 14 2011 Forget that step 2, that causes "Active code page: 65001" to be sent to

Andrej Mitrovic (3/14) Jan 14 2011 Does that work for you? I get back:

Nick Sabalausky (16/34) Jan 14 2011 Yea, it works for me (XP Pro SP2 32-bit), and my "chcp" is 437, not 6500...

Andrej Mitrovic (4/5) Jan 14 2011 Nope, I still get the same results (tried with different fonts, lucida

Nick Sabalausky (5/10) Jan 14 2011 Weird. Which version of windows are you on, and are you using the regula...

Andrej Mitrovic (8/11) Jan 14 2011 Okay, it appears this is an issue with Console2. I'll have to report
Andrej Mitrovic (10/12) Jan 14 2011 Woops, let me revise what I've said:

Michel Fortin (50/61) Jan 14 2011 Apple implemented all these things in the NSString class in Cocoa. They

Andrei Alexandrescu (25/42) Jan 14 2011 That's a strong indicator, but we shouldn't get ahead of ourselves.

foobar (10/39) Jan 14 2011 Combining marks do need to be supported.

Michel Fortin (18/31) Jan 14 2011 That's a good example. Although my attempt to extract the text from the

foobar (3/17) Jan 15 2011 I've looked into this and I was wrong. Ruby is a layout feature as you s...

Michel Fortin (16/63) Jan 14 2011 Then perhaps it's time we find out a way to handle non-Unicode

spir (27/29) Jan 17 2011 Text has a perf module that provides such numbers (on different stages

Andrei Alexandrescu (9/34) Jan 17 2011 Congrats on this great work. The initial numbers are in keeping with my

spir (16/52) Jan 17 2011 Andrei, would you have a look at Text's current state, mainly

Andrei Alexandrescu (28/39) Jan 17 2011 I think this is solid work that reveals good understanding of Unicode.

spir (75/115) Jan 17 2011 We are exploring a new field. (Except for the work Objective-C designers...

Andrei Alexandrescu (14/21) Jan 17 2011 Unfortunately I won't have much time to discuss all these points, but

spir (26/47) Jan 18 2011 I think it is needed to repeat again the following: Text in my view (or

Andrei Alexandrescu (3/37) Jan 18 2011 You don't provide O(n) indexing.

Jonathan M Davis (10/13) Jan 17 2011 While it would be nice at times to be able to have an index with foreach...

Andrei Alexandrescu (12/25) Jan 17 2011 It's a bit more difficult than that. When iterating a variable-length

spir (23/57) Jan 18 2011 This is a very valid point: a range's logical offset is not necessary
Steven Schveighoffer (12/45) Jan 19 2011 opApply in no way disables the range interface. It simply is used for

spir (9/21) Jan 18 2011 You are right. I fully agree, in fact. On the other hand, think at

spir (10/23) Jan 17 2011 Unfortunatly, things are complicated by _prepend_ combining marks that

spir (12/70) Jan 11 2011 People interested in solving the general problem with Unicode strings

Tomek Sowiński (28/69) Jan 11 2011
Steven Wawryk (15/15) Jan 11 2011 Sorry if I'm jumping inhere without the appropriate background, but I

Michel Fortin (30/47) Jan 11 2011 Actually, displaying a UTF-8/UTF-16 string involves a range of of

Don (4/15) Jan 12 2011 I think the only problem that we really have, is that "char[]",

Andrei Alexandrescu (34/49) Jan 12 2011 I hope to assuage part of that issue with representation(). Again, it's
spir (26/29) Jan 12 2011 I'd like to know when it happens that codepoint is the appropriate level...

Ali Çehreli (21/48) Jan 12 2011 Compare according to which alphabet's ordering? Surely not Unicode's...
Michel Fortin (25/43) Jan 12 2011 I agree with you. I don't see many use for code points.

Michel Fortin (13/28) Jan 12 2011 Crap, I meant to send this as UTF-8 with combining characters in it,

spir (9/34) Jan 13 2011 Works :-) But your first post worked as well by me: for instance <<"é"

spir (35/73) Jan 13 2011 Actually, I had once a real use case for codepoint beeing the proper
Jonathan M Davis (48/72) Jan 13 2011 ) So
spir (64/114) Jan 13 2011 D's arrays (even dchar[] & dstring) do not allow having correct results

Michel Fortin (16/24) Jan 13 2011 D is not the first language dealing correctly with Unicode strings in

spir (13/33) Jan 13 2011 Thank you very much for this information (I feel less lonely ;-).

Michel Fortin (12/16) Jan 13 2011 Mac OS sorts file names in a "natural" way since a very long time

Nick Sabalausky (3/13) Jan 13 2011 XP's explorer does that too. It's a very nice feature.

Jonathan M Davis (48/171) Jan 13 2011 ". hope you

Michel Fortin (24/38) Jan 13 2011 What's nice about Cocoa's way of handling strings is that even

Andrej Mitrovic (3/3) Jan 13 2011 OT: Spir, do you know if I can change the syntax highlighting settings

Nick Sabalausky (3/6) Jan 13 2011 I'm getting the same problem too.

Michel Fortin (7/14) Jan 13 2011 I bypassed the problem by fetching the files from the repository. But I

spir (78/104) Jan 13 2011 The problem is then: how does a library or application programmer know,

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

I've been thinking on how to better deal with Unicode strings. Currently 
strings are formally bidirectional ranges with a surreptitious random 
access interface. The random access interface accesses the support of 
the string, which is understood to hold data in a variable-encoded 
format. For as long as the programmer understands this relationship, 
code for string manipulation can be written with relative ease. However, 
there is still room for writing wrong code that looks legit.

Sometimes the best way to tackle a hairy reality is to invite it to the 
negotiation table and offer it promotion to first-class abstraction 
status. Along that vein I was thinking of defining a new range: 
VLERange, i.e. Variable Length Encoding Range. Such a range would have 
the power somewhere in between bidirectional and random access.

The primitives offered would include empty, access to front and back, 
popFront and popBack (just like BidirectionalRange), and in addition 
properties typical of random access ranges: indexing, slicing, and 
length. Note that the result of the indexing operator is not the same as 
the element type of the range, as it only represents the unit of encoding.

In addition to these (and connecting the two), a VLERange would offer 
two additional primitives:

1. size_t stepSize(size_t offset) gives the length of the step needed to 
skip to the next element.

2. size_t backstepSize(size_t offset) gives the size of the _backward_ 
step that goes to the previous element.

In both cases, offset is assumed to be at the beginning of a logical 
element of the range.
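
A minimal sketch of how the two primitives could look for a UTF-8 buffer. The struct name `Utf8Steps` is hypothetical and not part of the proposal; the sketch assumes the buffer holds valid UTF-8 and that every offset passed in sits on an element boundary.

```d
import std.stdio : writeln;

/// Hypothetical illustration of the two proposed primitives over UTF-8.
/// Assumes valid UTF-8 and offsets that sit on element boundaries.
struct Utf8Steps
{
    const(char)[] data;

    /// Number of code units from `offset` to the start of the next element.
    size_t stepSize(size_t offset) const
    {
        immutable uint lead = data[offset];
        if (lead < 0x80) return 1;            // 0xxxxxxx: single unit
        if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx: two units
        if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: three units
        return 4;                             // 11110xxx: four units
    }

    /// Number of code units from `offset` back to the start of the
    /// previous element. Requires offset > 0.
    size_t backstepSize(size_t offset) const
    {
        size_t n = 1;
        // Continuation units look like 10xxxxxx; skip them until a lead unit.
        while ((data[offset - n] & 0xC0) == 0x80) ++n;
        return n;
    }
}

void main()
{
    // 'a' (1 unit), U+00E9 (2 units), U+20AC (3 units)
    auto r = Utf8Steps("a\u00E9\u20AC");

    // Hop from element to element using only the representation.
    for (size_t i = 0; i < r.data.length; i += r.stepSize(i))
        writeln("element at offset ", i, " spans ", r.stepSize(i), " code unit(s)");

    assert(r.backstepSize(r.data.length) == 3);
    assert(r.backstepSize(3) == 2);
}
```

Neither function decodes an element; each only inspects enough units to find the neighbouring boundary, which is what lets an algorithm move through the representation without Unicode-specific knowledge.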

I suspect that a lot of functions in std.string can be written without 
Unicode-specific knowledge just by relying on such an interface. 
Moreover, algorithms can be generalized to other structures that use 
variable-length encoding, such as those used in data compression. (In 
that case, the support would be a bit array and the encoded type would 
be ubyte.)

Writing to such ranges is not addressed by this design. Ideas are welcome.

Adding VLERange would legitimize strings and would clarify their 
handling, at the cost of adding one additional concept that needs to be 
minded. Is the trade-off worthwhile?


Andrei

Jan 10 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 I've been thinking on how to better deal with Unicode strings. 
 Currently strings are formally bidirectional ranges with a 
 surreptitious random access interface. The random access interface 
 accesses the support of the string, which is understood to hold data in 
 a variable-encoded format. For as long as the programmer understands 
 this relationship, code for string manipulation can be written with 
 relative ease. However, there is still room for writing wrong code that 
 looks legit.
 
 Sometimes the best way to tackle a hairy reality is to invite it to the 
 negotiation table and offer it promotion to first-class abstraction 
 status. Along that vein I was thinking of defining a new range: 
 VLERange, i.e. Variable Length Encoding Range. Such a range would have 
 the power somewhere in between bidirectional and random access.
 
 The primitives offered would include empty, access to front and back, 
 popFront and popBack (just like BidirectionalRange), and in addition 
 properties typical of random access ranges: indexing, slicing, and 
 length. Note that the result of the indexing operator is not the same 
 as the element type of the range, as it only represents the unit of 
 encoding.

Seems like a good idea to define things formally.


 In addition to these (and connecting the two), a VLERange would offer 
 two additional primitives:
 
 1. size_t stepSize(size_t offset) gives the length of the step needed 
 to skip to the next element.
 
 2. size_t backstepSize(size_t offset) gives the size of the _backward_ 
 step that goes to the previous element.

I like the idea, but I'm not sure about this interface. What's the 
result of stepSize if your range must create two elements from one 
underlying unit? Perhaps in those cases the element type could be an 
array (to return more than one element from one iteration).

For instance, say we have a conversion range taking a Unicode string 
and converting it to ISO Latin 1. The best (lossy) conversion for "œ" 
is "oe" (one character to two characters); in this case 'front' could 
simply return "oe" (two characters) in one iteration, with stepSize 
being the size of the "œ" code point. In the same conversion process, 
encountering "e" followed by a combining "´" would return the pre-combined 
character "é" (two characters to one character).
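
To make the two cases concrete, here is a rough sketch in which each step returns an array of output units and reports how many input code points it consumed. The function `convertFront` and its signature are invented for illustration; only the two mappings mentioned above are handled, and everything else outside Latin 1 becomes '?'.

```d
/// Hypothetical helper: converts the logical element at the front of `src`
/// to ISO Latin 1. Returns the output bytes for that element and stores in
/// `consumed` how many code points of `src` it covered.
ubyte[] convertFront(const(dchar)[] src, out size_t consumed)
{
    assert(src.length > 0);
    if (src[0] == '\u0153')
    {
        // "œ": one code point in, two bytes out
        consumed = 1;
        return [cast(ubyte) 'o', cast(ubyte) 'e'];
    }
    if (src.length > 1 && src[0] == 'e' && src[1] == '\u0301')
    {
        // "e" + combining acute accent: two code points in, one byte out
        consumed = 2;
        return [cast(ubyte) 0xE9];
    }
    consumed = 1;
    // Anything else outside Latin 1 is replaced by '?' in this sketch.
    return [src[0] <= 0xFF ? cast(ubyte) src[0] : cast(ubyte) '?'];
}

void main()
{
    const(dchar)[] text = "c\u0153ur e\u0301"d;
    ubyte[] latin1;
    while (text.length)
    {
        size_t n;
        latin1 ~= convertFront(text, n);
        text = text[n .. $];   // the step is measured in input units
    }
    // "coeur " followed by the single Latin 1 byte for "é"
    immutable ubyte[] expected = [0x63, 0x6F, 0x65, 0x75, 0x72, 0x20, 0xE9];
    assert(latin1 == expected);
}
```

The step size is always a whole number of input units; it is the number of output units per step that varies, which is why returning an array from 'front' covers both directions.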


 In both cases, offset is assumed to be at the beginning of a logical 
 element of the range.
 
 I suspect that a lot of functions in std.string can be written without 
 Unicode-specific knowledge just by relying on such an interface. 
 Moreover, algorithms can be generalized to other structures that use 
 variable-length encoding, such as those used in data compression. (In 
 that case, the support would be a bit array and the encoded type would 
 be ubyte.)

Applicability to other problems seems like a valuable benefit.


 Writing to such ranges is not addressed by this design. Ideas are welcome.

Writing, as in assigning to 'front'? That's not really possible with 
variable-length units as it'd need to shift everything in case of a 
length difference. Or maybe you meant writing as in having an output 
range for variable-length elements... I'm not sure.


 Adding VLERange would legitimize strings and would clarify their 
 handling, at the cost of adding one additional concept that needs to be 
 minded. Is the trade-off worthwhile?

In my opinion it's not a trade-off at all, it's a formalization of how 
strings are handled which is better in every regard than a "special 
case". I welcome this move very much.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 11 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/11/11 4:41 AM, Michel Fortin wrote:
 On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 In addition to these (and connecting the two), a VLERange would offer
 two additional primitives:

 1. size_t stepSize(size_t offset) gives the length of the step needed
 to skip to the next element.

 2. size_t backstepSize(size_t offset) gives the size of the _backward_
 step that goes to the previous element.

 I like the idea, but I'm not sure about this interface. What's the
 result of stepSize if your range must create two elements from one
 underlying unit? Perhaps in those cases the element type could be an
 array (to return more than one element from one iteration).

 For instance, say we have a conversion range taking a Unicode string and
 converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
 "oe" (one character to two characters); in this case 'front' could
 simply return "oe" (two characters) in one iteration, with stepSize
 being the size of the "œ" code point. In the same conversion process,
 encountering "e" followed by a combining "´" would return pre-combined
 character "é" (two characters to one character).

In the design as I thought of it, the effective length of one logical 
element is one or more representation units. My understanding is that 
you are referring to a fractional number of representation units for one 
logical element.

 Writing to such ranges is not addressed by this design. Ideas are
 welcome.

 Writing, as in assigning to 'front'? That's not really possible with
 variable-length units as it'd need to shift everything in case of a
 length difference. Or maybe you meant writing as in having an output
 range for variable-length elements... I'm not sure

Well all of the above :o). Clearly assigning to e.g. front or back 
should not work. The question is what kind of API can we provide beyond 
simple append with put().


Andrei

Jan 11 2011

spir <denis.spir gmail.com> writes:

On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:
 On 1/11/11 4:41 AM, Michel Fortin wrote:
 On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 In addition to these (and connecting the two), a VLERange would offer
 two additional primitives:

 1. size_t stepSize(size_t offset) gives the length of the step needed
 to skip to the next element.

 2. size_t backstepSize(size_t offset) gives the size of the _backward_
 step that goes to the previous element.

 I like the idea, but I'm not sure about this interface. What's the
 result of stepSize if your range must create two elements from one
 underlying unit? Perhaps in those cases the element type could be an
 array (to return more than one element from one iteration).

 For instance, say we have a conversion range taking a Unicode string and
 converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
 "oe" (one character to two characters); in this case 'front' could
 simply return "oe" (two characters) in one iteration, with stepSize
 being the size of the "œ" code point. In the same conversion process,
 encountering "e" followed by a combining "´" would return pre-combined
 character "é" (two characters to one character).

 In the design as I thought of it, the effective length of one logical
 element is one or more representation units. My understanding is that
 you are referring to a fractional number of representation units for one
 logical element.

I think Michel is right. If I understand correctly, VLERange addresses 
the low-level and rather simple issue of each codepoint being encoded 
as a variable number of code units. Right?
If yes, then what is the advantage of VLERange? D already has 
string/wstring/dstring, allowing one to work with the most advantageous 
encoding for the given source data, with dstring abstracting away 
low-level encoding issues.

The main (and massively ignored) issue when manipulating unicode text is 
rather that, unlike with legacy character sets, one codepoint does *not* 
represent a character in the common sense. In character sets like latin-1:
* each code represents a character, in the common sense (eg "à")
* each character representation has the same size (1 or 2 bytes)
* each character has a single representation ("à" --> always 0xe0)
All of this is wrong with unicode. And these are complicated and 
high-level issues, that appear _after_ decoding, on codepoint sequences.
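
A small sketch of the third point, using the same "à" as in the list. Even as a dstring, where each array element is a whole code point, the two spellings differ in length and compare unequal. The variable names are illustrative only.

```d
void main()
{
    // "à" as one precomposed code point (U+00E0)
    dstring precomposed = "\u00E0"d;
    // "à" as "a" followed by a combining grave accent (U+0300)
    dstring decomposed = "a\u0300"d;

    // Same character for the reader, different code point sequences.
    assert(precomposed.length == 1);
    assert(decomposed.length == 2);
    assert(precomposed != decomposed);

    // In UTF-8 the difference shows up as 2 versus 3 code units.
    string utf8Pre = "\u00E0";
    string utf8Dec = "a\u0300";
    assert(utf8Pre.length == 2);
    assert(utf8Dec.length == 3);
}
```

Decoding code units into code points removes the encoding issue but leaves this one untouched: comparing, finding, or counting characters still needs grouping and normalisation on top.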

If VLERange is helpful in dealing with those problems, then I don't 
understand your presentation, sorry. Do you for instance mean such a 
range would, under the hood, group together codes belonging to the same 
character (thus making indexing meaningful), and/or normalise (decompose & 
order) (thus allowing to compare/find/count correctly)?


denis
_________________
vita es estrany
spir.wikidot.com

Jan 11 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/11/11 9:09 AM, spir wrote:
 On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:
 On 1/11/11 4:41 AM, Michel Fortin wrote:
 On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 In addition to these (and connecting the two), a VLERange would offer
 two additional primitives:

 1. size_t stepSize(size_t offset) gives the length of the step needed
 to skip to the next element.

 2. size_t backstepSize(size_t offset) gives the size of the _backward_
 step that goes to the previous element.

 I like the idea, but I'm not sure about this interface. What's the
 result of stepSize if your range must create two elements from one
 underlying unit? Perhaps in those cases the element type could be an
 array (to return more than one element from one iteration).

 For instance, say we have a conversion range taking a Unicode string and
 converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
 "oe" (one chararacter to two characters), in this case 'front' could
 simply return "oe" (two characters) in one iteration, with stepSize
 being the size of the "œ" code point. In the same conversion process,
 encountering "e" followed by a combining "´" would return pre-combined
 character "é" (two characters to one character).

 In the design as I thought of it, the effective length of one logical
 element is one or more representation units. My understanding is that
 you are referring to a fractional number of representation units for one
 logical element.

 I think Michel is right. If I understand correctly, VLERange addresses
 the low-level and rather simple issue of each codepoint being encoded
 as a variable number of code units. Right?
 If yes, then what is the advantage of VLERange? D already has
 string/wstring/dstring, allowing one to work with the most advantageous
 encoding for the given source data, with dstring abstracting away
 low-level encoding issues.

It's not about the data, it's about algorithms. Currently there are 
algorithms that ostensibly work for bidirectional ranges, but internally 
"cheat" by detecting that the input is actually a string, and use that 
knowledge for better implementations.

The benefit of VLERange would be that it legitimizes those algorithms. 
I wouldn't be surprised if an entire class of algorithms would in fact 
require VLERange (e.g. many of those that we commonly consider today 
"string" algorithms).

 The main (and massively ignored) issue when manipulating unicode text is
 rather that, unlike with legacy character sets, one codepoint does *not*
 represent a character in the common sense. In character sets like latin-1:
 * each code represents a character, in the common sense (eg "à")
 * each character representation has the same size (1 or 2 bytes)
 * each character has a single representation ("à" --> always 0xe0)
 All of this is wrong with unicode. And these are complicated and
 high-level issues, that appear _after_ decoding, on codepoint sequences.

 If VLERange is helpful in dealing with those problems, then I don't
 understand your presentation, sorry. Do you for instance mean such a
 range would, under the hood, group together codes belonging to the same
 character (thus making indexing meaningful), and/or normalise (decomp &
 order) (thus allowing to comp/find/count correctly).?

VLERange would offer automatic decoding in front, back, popFront, and 
popBack - just like BidirectionalRange does right now. It would also 
offer access to the representational support by means of indexing - also 
like char[] et al already do now. The difference is that VLERange being 
a formal concept, algorithms can specialize on it instead of (a) 
specializing for UTF strings or (b) specializing for BidirectionalRange 
and then manually detecting isSomeString inside. Conversely, when 
defining an algorithm you can specify VLERange as a requirement. 
Boyer-Moore is a perfect example - it doesn't work on bidirectional 
ranges, but it does work on VLERange. I suspect there are many like it.

Of course, it would help a lot if we figured out other remarkable VLERanges. 
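
For illustration, here is a minimal sketch of what an algorithm written against such a concept could look like for UTF-8. The names stepSize and backstepSize follow the proposal in this thread and are not part of Phobos; here they are assumed to be thin wrappers over std.utf.stride and std.utf.strideBack, and countElements is a hypothetical helper.

```d
import std.utf : stride, strideBack;

// Hypothetical VLERange primitives for UTF-8, built on std.utf.
// 'offset' must sit at the beginning of a logical element.
size_t stepSize(const(char)[] s, size_t offset)
{
    return stride(s, offset);
}

size_t backstepSize(const(char)[] s, size_t offset)
{
    return strideBack(s, offset);
}

// Counts logical elements by hopping over the representation,
// without decoding any of them.
size_t countElements(const(char)[] s)
{
    size_t n = 0;
    for (size_t i = 0; i < s.length; i += s.stepSize(i))
        ++n;
    return n;
}

void main()
{
    string s = "d\u00E9";           // 'd' is 1 code unit, U+00E9 is 2
    assert(s.length == 3);          // length counts code units
    assert(s.stepSize(0) == 1);
    assert(s.stepSize(1) == 2);
    assert(s.backstepSize(3) == 2); // step back from the end
    assert(countElements(s) == 2);
}
```

The point of the sketch is that countElements only needs indexing, length, and stepSize, so the same loop could in principle serve any other variable-length encoding that supplies those primitives.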
Here are a few that come to mind:

* Multibyte encodings other than UTF. Currently we have no special 
support for those beyond e.g. forward or bidirectional ranges.

* Huffman, RLE, LZ encoded buffers (and many other compressed formats)

* Vocabulary-based translation systems, e.g. associate each word with a 
number.

* Others...?

Some of these are forward-only (don't allow bidirectional access). Once 
we have a number of examples, it would be great to figure a number of 
remarkable algorithms operating on them.


Andrei

Jan 11 2011

spir <denis.spir gmail.com> writes:

On 01/11/2011 08:09 PM, Andrei Alexandrescu wrote:
 The main (and massively ignored) issue when manipulating unicode text is
 rather that, unlike with legacy character sets, one codepoint does *not*
 represent a character in the common sense. In character sets like
 latin-1:
 * each code represents a character, in the common sense (eg "à")
 * each character representation has the same size (1 or 2 bytes)
 * each character has a single representation ("à" --> always 0xe0)
 All of this is wrong with unicode. And these are complicated and
 high-level issues, that appear _after_ decoding, on codepoint sequences.

 If VLERange is helpful in dealing with those problems, then I don't
 understand your presentation, sorry. Do you for instance mean such a
 range would, under the hood, group together codes belonging to the same
 character (thus making indexing meaningful), and/or normalise (decomp &
 order) (thus allowing to comp/find/count correctly).?

 VLERange would offer automatic decoding in front, back, popFront, and
 popBack - just like BidirectionalRange does right now. It would also
 offer access to the representational support by means of indexing - also
 like char[] et al already do now.

IIUC, for the case of text, VLERange helps abstract away the annoying 
fact that a codepoint is encoded as a variable number of code units.
What I meant is issues like:

     auto text = "a\u0302"d;
     writeln(text);                  // "â"
     auto range = VLERange(text);
     // extracts characters correctly?
     auto letter = range.front();    // "a" or "â"?
     // case yes: compares correctly?
     assert(range.front() == "â");   // fail or pass?

Both fail using all unicode-aware types I know of, because
1. They do not recognise that a character is represented by an arbitrary 
number of codes (code _points_).
2. They do not use normalised forms for comp, search, count, etc...
(while in unicode a given char can have several representations).
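
The two issues can be reproduced in a few lines. This is only a sketch, and it assumes a current Phobos in which std.uni provides byGrapheme and normalize; those facilities are not what the posts in this thread had at hand. The helper names are hypothetical.

```d
import std.range : walkLength;
import std.uni;

// Number of user-perceived characters (grapheme clusters).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

// Comparison after bringing both sides to the same normal form.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string decomposed = "a\u0302";   // 'a' + combining circumflex
    string precomposed = "\u00E2";   // the same character as one code point

    assert(decomposed != precomposed);        // raw comparison fails
    assert(decomposed.walkLength == 2);       // two code points
    assert(graphemeCount(decomposed) == 1);   // one character
    assert(sameText(decomposed, precomposed));
}
```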

 The difference is that VLERange being
 a formal concept, algorithms can specialize on it instead of (a)
 specializing for UTF strings or (b) specializing for BidirectionalRange
 and then manually detecting isSomeString inside. Conversely, when
 defining an algorithm you can specify VLERange as a requirement.
 Boyer-Moore is a perfect example - it doesn't work on bidirectional
 ranges, but it does work on VLERange. I suspect there are many like it.

 Of course, it would help a lot if we figured out other remarkable VLERanges.

I think I see the point, and the general usefulness of such an 
abstraction. But it would certainly be more useful in other fields than 
text manipulation, because there are far more annoying issues (that, 
like in example above, simply prevent code correctness).

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 11 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/11/11 4:46 PM, spir wrote:
 On 01/11/2011 08:09 PM, Andrei Alexandrescu wrote:
 The main (and massively ignored) issue when manipulating unicode text is
 rather that, unlike with legacy character sets, one codepoint does *not*
 represent a character in the common sense. In character sets like
 latin-1:
 * each code represents a character, in the common sense (eg "à")
 * each character representation has the same size (1 or 2 bytes)
 * each character has a single representation ("à" --> always 0xe0)
 All of this is wrong with unicode. And these are complicated and
 high-level issues, that appear _after_ decoding, on codepoint sequences.

 If VLERange is helpful in dealing with those problems, then I don't
 understand your presentation, sorry. Do you for instance mean such a
 range would, under the hood, group together codes belonging to the same
 character (thus making indexing meaningful), and/or normalise (decomp &
 order) (thus allowing to comp/find/count correctly).?

 VLERange would offer automatic decoding in front, back, popFront, and
 popBack - just like BidirectionalRange does right now. It would also
 offer access to the representational support by means of indexing - also
 like char[] et al already do now.

 IIUC, for the case of text, VLERange helps abstracting from the annoying
 fact that a codepoint is encoded as a variable number of code units.
 What I meant is issues like:

 auto text = "a\u0302"d;
 writeln(text); // "â"
 auto range = VLERange(text);
 // extracts characters correctly?
 auto letter = range.front(); // "a" or "â"?
 // case yes: compares correctly?
 assert(range.front() == "â"); // fail or pass?

You should try text.front right now, you might be surprised :o).

Andrei

Jan 11 2011

spir <denis.spir gmail.com> writes:

On 01/12/2011 02:22 AM, Andrei Alexandrescu wrote:
 IIUC, for the case of text, VLERange helps abstracting from the annoying
 fact that a codepoint is encoded as a variable number of code units.
 What I meant is issues like:

 auto text = "a\u0302"d;
 writeln(text); // "â"
 auto range = VLERange(text);
 // extracts characters correctly?
 auto letter = range.front(); // "a" or "â"?
 // case yes: compares correctly?
 assert(range.front() == "â"); // fail or pass?

 You should try text.front right now, you might be surprised :o).

Hum, right now it incorrectly returns "a", as expected. And indeed
	assert ("â" == "a\u0302");
incorrectly fails as expected.
Both would work with legacy charsets like latin-1. This is a new issue 
introduced with UCS, that requires an additional level of abstraction 
(in addition to the one required by the distinction codepoint/codeunit!)

You may have a look at 
https://bitbucket.org/denispir/denispir-d/src/5ec6fe1e1065/Text.html for 
a rough implementation of a type that does the right thing, & at 
https://bitbucket.org/denispir/denispir-d/src/5ec6fe1e1065/U%20missing%20leve
%20of%20abstraction 
for a (far too long) explanation.
(I have tried to mention those problems a dozen times already, but for 
some reason nearly everybody seems deaf to them.)


Denis
_________________
vita es estrany
spir.wikidot.com

Jan 11 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-11 11:36:54 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/11/11 4:41 AM, Michel Fortin wrote:
 For instance, say we have a conversion range taking a Unicode string and
 converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
 "oe" (one character to two characters), in this case 'front' could
 simply return "oe" (two characters) in one iteration, with stepSize
 being the size of the "œ" code point. In the same conversion process,
 encountering "e" followed by a combining "´" would return pre-combined
 character "é" (two characters to one character).

 
 In the design as I thought of it, the effective length of one logical 
 element is one or more representation units. My understanding is that 
 you are referring to a fractional number of representation units for 
 one logical element.

Your understanding is correct.

I think both cases (one becomes many & many becomes one) are important 
and must be supported. Your proposal only deals with the 
many-becomes-one case.

I proposed returning arrays so we can deal with the one-becomes-many 
case ("œ" becoming "oe"). Another idea would be to introduce 
"substeps". When checking for the next character, in addition to 
determining its step length you could also determine the number of 
substeps in it. "œ" would have two substeps, "o" and "e", and when 
there is no longer any substep you move to the next step.

All this said, I think this should stay an implementation detail as 
this would allow a variety of strategies. Also, keeping this an 
implementation detail means that your proposed 'stepSize' and 
'backstepSize' need to be an implementation detail too (because they 
won't make sense for the one-to-many case). So they can't really be 
part of a standard VLE interface.

As far as I know, all we really need to expose to algorithms is whether 
a range has elements of variable length, because this has an impact on 
your indexing capabilities. The rest seems unnecessary to me, or am I 
missing some use cases?

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 11 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/11/11 11:13 AM, Michel Fortin wrote:
 On 2011-01-11 11:36:54 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 1/11/11 4:41 AM, Michel Fortin wrote:
 For instance, say we have a conversion range taking a Unicode string and
 converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
 "oe" (one character to two characters), in this case 'front' could
 simply return "oe" (two characters) in one iteration, with stepSize
 being the size of the "œ" code point. In the same conversion process,
 encountering "e" followed by a combining "´" would return pre-combined
 character "é" (two characters to one character).

 In the design as I thought of it, the effective length of one logical
 element is one or more representation units. My understanding is that
 you are referring to a fractional number of representation units for
 one logical element.

 Your understanding is correct.

 I think both cases (one becomes many & many becomes one) are important
 and must be supported. Your proposal only deals with the many-becomes-one
 case.

I disagree. When I suggested this design I was worried about 
over-abstracting. Now this looks like abstracting for stuff that hasn't 
even been addressed concretely yet.

Besides, using bit as an encoding unit sounds like an acceptable 
approach for anything fractional.

 I proposed returning arrays so we can deal with the one-becomes-many
 case ("œ" becoming "oe"). Another idea would be to introduce "substeps".
 When checking for the next character, in addition to determining its
 step length you could also determine the number of substeps in it. "œ"
 would have two substeps, "o" and "e", and when there is no longer any
 substep you move to the next step.

 All this said, I think this should stay an implementation detail as this
 would allow a variety of strategies. Also, keeping this an
 implementation detail means that your proposed 'stepSize' and
 'backstepSize' need to be an implementation detail too (because they
 won't make sense for the one-to-many case). So they can't really be part
 of a standard VLE interface.

If you don't have at least stepSize that tells you how large the stride 
is to get to the next element, it becomes impossible to move within the 
range using integral indexes.

 As far as I know, all we really need to expose to algorithms is whether
 a range has elements of variable length, because this has an impact on
 your indexing capabilities. The rest seems unnecessary to me, or am I
 missing some use cases?

I think you could say that you don't really need stepSize because you 
can compute it as follows:

auto r1 = r;
r1.popFront();
size_t stepSize = r.length - r1.length;

This is tenuous, inefficient, and impossible if the support range 
doesn't support length (I realize that variable-length encodings work 
over other ranges than random access, but then again this may be an 
overgeneralization).
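
As a concrete illustration for UTF-8 strings, the workaround can be compared with a direct query. This is a sketch only: stepByLength is a hypothetical helper name, and std.utf.stride stands in for the proposed stepSize.

```d
import std.range.primitives : popFront;
import std.utf : stride;

// The workaround: derive the step from the lengths before and after a pop.
size_t stepByLength(string r)
{
    auto r1 = r;
    r1.popFront();               // decodes and drops one code point
    return r.length - r1.length; // code units consumed
}

void main()
{
    string r = "\u0153uvre";     // starts with U+0153, 2 UTF-8 code units
    assert(stepByLength(r) == 2);
    assert(stride(r, 0) == 2);   // direct query, no copy and no pop
    assert(stepByLength("abc") == 1);
}
```

The subtraction only works because the support is an array that knows its length; a direct primitive makes no such demand.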


Andrei

Jan 11 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Jan 2011 22:57:36 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 I've been thinking on how to better deal with Unicode strings. Currently  
 strings are formally bidirectional ranges with a surreptitious random  
 access interface. The random access interface accesses the support of  
 the string, which is understood to hold data in a variable-encoded  
 format. For as long as the programmer understands this relationship,  
 code for string manipulation can be written with relative ease. However,  
 there is still room for writing wrong code that looks legit.

 Sometimes the best way to tackle a hairy reality is to invite it to the  
 negotiation table and offer it promotion to first-class abstraction  
 status. Along that vein I was thinking of defining a new range:  
 VLERange, i.e. Variable Length Encoding Range. Such a range would have  
 the power somewhere in between bidirectional and random access.

 The primitives offered would include empty, access to front and back,  
 popFront and popBack (just like BidirectionalRange), and in addition  
 properties typical of random access ranges: indexing, slicing, and  
 length. Note that the result of the indexing operator is not the same as  
 the element type of the range, as it only represents the unit of  
 encoding.

 In addition to these (and connecting the two), a VLERange would offer  
 two additional primitives:

 1. size_t stepSize(size_t offset) gives the length of the step needed to  
 skip to the next element.

 2. size_t backstepSize(size_t offset) gives the size of the _backward_  
 step that goes to the previous element.

 In both cases, offset is assumed to be at the beginning of a logical  
 element of the range.

 I suspect that a lot of functions in std.string can be written without  
 Unicode-specific knowledge just by relying on such an interface.  
 Moreover, algorithms can be generalized to other structures that use  
 variable-length encoding, such as those used in data compression. (In  
 that case, the support would be a bit array and the encoded type would  
 be ubyte.)

 Writing to such ranges is not addressed by this design. Ideas are  
 welcome.

 Adding VLERange would legitimize strings and would clarify their  
 handling, at the cost of adding one additional concept that needs to be  
 minded. Is the trade-off worthwhile?

While this makes it possible to write algorithms that only accept  
VLERanges, I don't think it solves the major problem with strings -- they  
are treated as arrays by the compiler.

I'd also rather see an indexing operation return the element type, and  
have a separate function to get the encoding unit.  This makes more sense  
for generic code IMO.

I noticed you never commented on my proposed string type...

That reminds me, I should update with suggested changes and re-post it.

-Steve

Jan 11 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/11/11 5:30 AM, Steven Schveighoffer wrote:
 While this makes it possible to write algorithms that only accept
 VLERanges, I don't think it solves the major problem with strings --
 they are treated as arrays by the compiler.

Except when they're not - foreach with dchar...

 I'd also rather see an indexing operation return the element type, and
 have a separate function to get the encoding unit. This makes more sense
 for generic code IMO.

But that's neither here nor there. That would return the logical element 
at a physical position. I am very doubtful that much generic code could 
work without knowing they are in fact dealing with a variable-length 
encoding.

 I noticed you never commented on my proposed string type...

 That reminds me, I should update with suggested changes and re-post it.

To be frank, I think it didn't mark a visible improvement. It solved 
some problems and brought others. There was disagreement over the 
offered primitives and their semantics.

That being said, it's good you are doing this work. In the best case, 
you could bring a compelling abstraction to the table. In the worst, 
you'll become as happy about D's strings as I am :o).


Andrei

Jan 11 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 11 Jan 2011 11:54:08 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/11/11 5:30 AM, Steven Schveighoffer wrote:
 While this makes it possible to write algorithms that only accept
 VLERanges, I don't think it solves the major problem with strings --
 they are treated as arrays by the compiler.

 Except when they're not - foreach with dchar...

This solitary difference is a very thin argument -- foreach(d;  
byDchar(str)) would be just as good without requiring compiler help.

 I'd also rather see an indexing operation return the element type, and
 have a separate function to get the encoding unit. This makes more sense
 for generic code IMO.

 But that's neither here nor there. That would return the logical element  
 at a physical position. I am very doubtful that much generic code could  
 work without knowing they are in fact dealing with a variable-length  
 encoding.

It depends on the function, and the way the indexing is implemented.

 I noticed you never commented on my proposed string type...

 That reminds me, I should update with suggested changes and re-post it.

 To be frank, I think it didn't mark a visible improvement. It solved  
 some problems and brought others. There was disagreement over the  
 offered primitives and their semantics.

It is supposed to be simple, and provide the expected interface, without  
causing any undue performance degradation.  That is, I should be able to  
do all the things with a replacement string type that I can with a char  
array today, as efficiently as I can today, except I should have to work  
to get at the code-units.  The huge benefit is that I can say "I'm dealing  
with this as an array" when I know it's safe

The disagreement will never be fully solved, as there is just as much  
disagreement about the current state of affairs ;)  e.g. should foreach  
default to using dchar?

 That being said, it's good you are doing this work. In the best case,  
 you could bring a compelling abstraction to the table. In the worst,  
 you'll become as happy about D's strings as I am :o).

I don't think I'll ever be 'happy' with the way strings sit in phobos  
currently.  I typically deal in ASCII (i.e. code units), and phobos works  
very hard to prevent that.

-Steve

Jan 11 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/11/11 11:21 AM, Steven Schveighoffer wrote:
 On Tue, 11 Jan 2011 11:54:08 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 1/11/11 5:30 AM, Steven Schveighoffer wrote:
 While this makes it possible to write algorithms that only accept
 VLERanges, I don't think it solves the major problem with strings --
 they are treated as arrays by the compiler.

 Except when they're not - foreach with dchar...

 This solitary difference is a very thin argument -- foreach(d;
 byDchar(str)) would be just as good without requiring compiler help.

 I'd also rather see an indexing operation return the element type, and
 have a separate function to get the encoding unit. This makes more sense
 for generic code IMO.

 But that's neither here nor there. That would return the logical
 element at a physical position. I am very doubtful that much generic
 code could work without knowing they are in fact dealing with a
 variable-length encoding.

 It depends on the function, and the way the indexing is implemented.

 I noticed you never commented on my proposed string type...

 That reminds me, I should update with suggested changes and re-post it.

 To be frank, I think it didn't mark a visible improvement. It solved
 some problems and brought others. There was disagreement over the
 offered primitives and their semantics.

 It is supposed to be simple, and provide the expected interface, without
 causing any undue performance degradation. That is, I should be able to
 do all the things with a replacement string type that I can with a char
 array today, as efficiently as I can today, except I should have to work
 to get at the code-units. The huge benefit is that I can say "I'm
 dealing with this as an array" when I know it's safe

Unfinished sentence? Anyway, for my money you just described what we 
have now.

 The disagreement will never be fully solved, as there is just as much
 disagreement about the current state of affairs ;) e.g. should foreach
 default to using dchar?

I disagree about the disagreement being unsolvable. I'm not rigid; if I 
saw a terrific abstraction in your string, I'd be all for it. It just 
shuffles some issues about, and although I agree it does one thing or 
two better than char[], at the end of the day it doesn't carry its weight.

 That being said, it's good you are doing this work. In the best case,
 you could bring a compelling abstraction to the table. In the worst,
 you'll become as happy about D's strings as I am :o).

 I don't think I'll ever be 'happy' with the way strings sit in phobos
 currently. I typically deal in ASCII (i.e. code units), and phobos works
 very hard to prevent that.

I wonder if we could and should extend some of the functions in 
std.string to work with ubyte[]. I did add a function called 
representation() that I didn't document yet. Essentially representation 
gives you the ubyte[], ushort[], or uint[] underneath a string, with the 
same qualifiers. Whenever you want an algorithm to work on ASCII in 
earnest, you can pass representation(s) to it instead of s.
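
A minimal sketch of that usage, assuming std.string.representation as it exists in current Phobos; sortAscii is a hypothetical helper and the input is assumed to be pure ASCII.

```d
import std.algorithm.sorting : sort;
import std.string : representation;

// Sorts the code units of an ASCII buffer in place.
void sortAscii(char[] s)
{
    // representation yields a ubyte[] view over the same memory,
    // which Phobos treats as an ordinary random-access array.
    sort(s.representation);
}

void main()
{
    char[] s = "hello".dup;
    sortAscii(s);
    assert(s == "ehllo");
}
```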

If you work a lot with ASCII, an AsciiString abstraction may be a better 
and more likely to be successful string type. Better yet, you could 
simply focus on AsciiChar and then define ASCII strings as arrays of 
AsciiChar.


Andrei

Jan 11 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 11 Jan 2011 18:00:30 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/11/11 11:21 AM, Steven Schveighoffer wrote:

 It is supposed to be simple, and provide the expected interface, without
 causing any undue performance degradation. That is, I should be able to
 do all the things with a replacement string type that I can with a char
 array today, as efficiently as I can today, except I should have to work
 to get at the code-units. The huge benefit is that I can say "I'm
 dealing with this as an array" when I know it's safe

 Unfinished sentence?

Sorry, I forgot '.' :)

 Anyway, for my money you just described what we have now.

All except the 'expected interface' part.  The string type should deal  
with dchars exclusively, since that's what it is a range of.  char[] gives  
you char's back when you index it.  Anyone who doesn't use ASCII will be  
confused by this.

Also, I expect to be able to use a char[] as an array, which Phobos  
doesn't let me do in some cases (e.g. sorting an ASCII character array).

 The disagreement will never be fully solved, as there is just as much
 disagreement about the current state of affairs ;) e.g. should foreach
 default to using dchar?

 I disagree about the disagreement being unsolvable. I'm not rigid; if I  
 saw a terrific abstraction in your string, I'd be all for it. It just  
 shuffles some issues about, and although I agree it does one thing or  
 two better than char[], at the end of the day it doesn't carry its  
 weight.

I see it as having two vast improvements:

1. If we replace char[] with a specific type for string, then char[] can  
be considered a true array by phobos, and phobos can now deal with a  
char[] array without the need to cast.
2. It protects the casual user from incorrectly using a string by making  
the default the correct API.

Those to me are very important.

 I don't think I'll ever be 'happy' with the way strings sit in phobos
 currently. I typically deal in ASCII (i.e. code units), and phobos works
 very hard to prevent that.

 I wonder if we could and should extend some of the functions in  
 std.string to work with ubyte[]. I did add a function called  
 representation() that I didn't document yet. Essentially representation  
 gives you the ubyte[], ushort[], or uint[] underneath a string, with the  
 same qualifiers. Whenever you want an algorithm to work on ASCII in  
 earnest, you can pass representation(s) to it instead of s.

This, again, fails on point 2 above.  A char[] is an array, and allows  
access to code-units, which is not the correct interface for a string.   
Supporting ubyte[] doesn't fix that problem.  Correct as the default is  
usually a theme in D...

 If you work a lot with ASCII, an AsciiString abstraction may be a better  
 and more likely to be successful string type. Better yet, you could  
 simply focus on AsciiChar and then define ASCII strings as arrays of  
 AsciiChar.

This seems like the wrong approach.  Adding a new type does not fix the  
problems with the original type.  We need to replace the original type or  
at least how it is treated by the compiler.

-Steve

Jan 13 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/13/11 8:52 AM, Steven Schveighoffer wrote:
 I see it as having two vast improvements:

 1. If we replace char[] with a specific type for string, then char[] can
 be considered a true array by phobos, and phobos can now deal with a
 char[] array without the need to cast.
 2. It protects the casual user from incorrectly using a string by making
 the default the correct API.

 Those to me are very important.

Let's take a look:

// Incorrect string code
void fun(string s) {
   foreach (i; 0 .. s.length) {
     writeln("The character in position ", i, " is ", s[i]);
   }
}

// Incorrect string_t code
void fun(string_t!char s) {
   foreach (i; 0 .. s.codeUnits) {
     writeln("The character in position ", i, " is ", s[i]);
   }
}

Both functions are incorrect, albeit in different ways. The only 
improvement I'm seeing is that the user needs to write codeUnits instead 
of length, which may make her think twice. Clearly, however, copiously 
incorrect code can be written with the proposed interface because it 
tries to hide the reality that underneath a variable-length encoding is 
being used, but doesn't hide it completely (albeit for good 
efficiency-related reasons).

But wait, there's less. Functions for random-access ranges throughout 
Phobos routinely assume fixed-length encoding, i.e. s[i + 1] lies next 
to s[i]. From a cursory look at string_t, std.range will qualify it as a 
RandomAccessRange without length. That's an odd beast but does not 
change the fixed-length encoding assumption. So you'd need to 
special-case algorithms for string_t, just like right now certain 
algorithms are specialized for string.

Where's the progress?


Andrei

Jan 13 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/13/11 8:52 AM, Steven Schveighoffer wrote:
 I see it as having two vast improvements:

 1. If we replace char[] with a specific type for string, then char[] can
 be considered a true array by phobos, and phobos can now deal with a
 char[] array without the need to cast.
 2. It protects the casual user from incorrectly using a string by making
 the default the correct API.

 Those to me are very important.

 Let's take a look:

 // Incorrect string code
 void fun(string s) {
    foreach (i; 0 .. s.length) {
      writeln("The character in position ", i, " is ", s[i]);
    }
 }

 // Incorrect string_t code
 void fun(string_t!char s) {
    foreach (i; 0 .. s.codeUnits) {
      writeln("The character in position ", i, " is ", s[i]);
    }
 }

 Both functions are incorrect, albeit in different ways. The only  
 improvement I'm seeing is that the user needs to write codeUnits instead  
 of length, which may make her think twice. Clearly, however, copiously  
 incorrect code can be written with the proposed interface because it  
 tries to hide the reality that underneath a variable-length encoding is  
 being used, but doesn't hide it completely (albeit for good  
 efficiency-related reasons).

You might be looking at my previous version.  The new version (recently  
posted) will throw an exception for that code if a multi-code-unit  
code-point is found.

It also supports this:

foreach(i, d; s)
{
    writeln("The character in position ", i, " is ", d);
}

where i is the index (might not be sequential)

 But wait, there's less. Functions for random-access ranges throughout  
 Phobos routinely assume fixed-length encoding, i.e. s[i + 1] lies next  
 to s[i]. From a cursory look at string_t, std.range will qualify it as a  
 RandomAccessRange without length. That's an odd beast but does not  
 change the fixed-length encoding assumption. So you'd need to  
 special-case algorithms for string_t, just like right now certain  
 algorithms are specialized for string.

isRandomAccessRange requires hasLength (see here:  
http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532). 
 
This is not a random access range per that definition.  But a string isn't  
a random access range anyways (it's specifically disallowed by std.range  
per that same reference).

The plan is you would *not* have to special case algorithms for string_t  
as you do currently for char[].  If that's not the case, then we haven't  
achieved much.  Simply put, we are separating out the strange nature of  
strings from arrays, so the exceptional treatment of them is handled by  
the type itself, not the functions using it.

-Steve

Jan 13 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
 On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Let's take a look:

 // Incorrect string code
 void fun(string s) {
    foreach (i; 0 .. s.length) {
      writeln("The character in position ", i, " is ", s[i]);
    }
 }

 // Incorrect string_t code
 void fun(string_t!char s) {
    foreach (i; 0 .. s.codeUnits) {
      writeln("The character in position ", i, " is ", s[i]);
    }
 }

 Both functions are incorrect, albeit in different ways. The only
 improvement I'm seeing is that the user needs to write codeUnits
 instead of length, which may make her think twice. Clearly, however,
 copiously incorrect code can be written with the proposed interface
 because it tries to hide the reality that underneath a variable-length
 encoding is being used, but doesn't hide it completely (albeit for
 good efficiency-related reasons).

 You might be looking at my previous version. The new version (recently
 posted) will throw an exception for that code if a multi-code-unit
 code-point is found.

I was looking at your latest. It's code that compiles and runs, but 
dynamically fails on some inputs. I agree that it's often better to fail 
noisily instead of silently, but in a manner of speaking the 
string-based code doesn't fail at all - it correctly iterates the code 
units of a string. This may sometimes not be what the user expected; 
most of the time they'd care about the code points.

 It also supports this:

 foreach(i, d; s)
 {
 writeln("The character in position ", i, " is ", d);
 }

 where i is the index (might not be sequential)

Well string supports that too, albeit with the nit that you need to 
specify dchar.
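
A minimal sketch of that built-in behaviour, assuming a current D compiler; the helper name codePointOffsets is made up for illustration. Naming dchar in the foreach makes the loop decode UTF-8, and the index is the offset of each code point's first code unit, so it is not sequential:

```d
import std.stdio : writeln;

// Hypothetical helper: collects the code unit offset at which each
// decoded code point starts.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    // Naming dchar makes foreach decode UTF-8; i is the offset of the
    // first code unit of each code point, so it can skip values.
    foreach (i, dchar d; s)
        offsets ~= i;
    return offsets;
}

void main()
{
    string s = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units
    foreach (i, dchar d; s)
        writeln("The character in position ", i, " is ", d);

    assert(s.length == 6);                          // code units
    assert(codePointOffsets(s) == [0, 1, 3, 4, 5]); // offset 2 is skipped
}
```

Dropping the dchar from the loop header makes the same loop visit all six code units instead.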

 But wait, there's less. Functions for random-access range throughout
 Phobos routinely assume fixed-length encoding, i.e. s[i + 1] lies next
 to s[i]. From a cursory look at string_t, std.range will qualify it as
 a RandomAccessRange without length. That's an odd beast but does not
 change the fixed-length encoding assumption. So you'd need to
 special-case algorithms for string_t, just like right now certain
 algorithms are specialized for string.

 isRandomAccessRange requires hasLength (see here:
 http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532).
 This is not a random access range per that definition.

That's an interesting twist. By the way I specified length is required 
then because I couldn't imagine having random access into something that 
I can't tell the length of. Apparently I was wrong :o).

 But a string
 isn't a random access range anyways (it's specifically disallowed by
 std.range per that same reference).

It isn't and it isn't supposed to be.

 The plan is you would *not* have to special case algorithms for string_t
 as you do currently for char[]. If that's not the case, then we haven't
 achieved much. Simply put, we are separating out the strange nature of
 strings from arrays, so the exceptional treatment of them is handled by
 the type itself, not the functions using it.

That sounds reasonable.


Andrei

Jan 13 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 13 Jan 2011 15:51:00 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
 On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Let's take a look:

 // Incorrect string code
 void fun(string s) {
 foreach (i; 0 .. s.length) {
 writeln("The character in position ", i, " is ", s[i]);
 }
 }

 // Incorrect string_t code
 void fun(string_t!char s) {
 foreach (i; 0 .. s.codeUnits) {
 writeln("The character in position ", i, " is ", s[i]);
 }
 }

 Both functions are incorrect, albeit in different ways. The only
 improvement I'm seeing is that the user needs to write codeUnits
 instead of length, which may make her think twice. Clearly, however,
 copiously incorrect code can be written with the proposed interface
 because it tries to hide the reality that underneath a variable-length
 encoding is being used, but doesn't hide it completely (albeit for
 good efficiency-related reasons).

 You might be looking at my previous version. The new version (recently
 posted) will throw an exception for that code if a multi-code-unit
 code-point is found.

 I was looking at your latest. It's code that compiles and runs, but  
 dynamically fails on some inputs. I agree that it's often better to fail  
 noisily instead of silently, but in a manner of speaking the  
 string-based code doesn't fail at all - it correctly iterates the code  
 units of a string. This may sometimes not be what the user expected;  
 most of the time they'd care about the code points.

Iterating the code units is possible by accessing the array data, i.e.  
you could do:

foreach(i, c; s.data)

if you want the code-units.

That is the point of having a separate type.  Using string_t tells the  
library "I'm using this data as a string".  Using char[] tells the library  
"I'm using this data as an array."

The difference here is, you have to *specifically* try to access the code  
units, the default is code-points.  All it does really is switch the  
default.

 It also supports this:

 foreach(i, d; s)
 {
 writeln("The character in position ", i, " is ", d);
 }

 where i is the index (might not be sequential)

 Well string supports that too, albeit with the nit that you need to  
 specify dchar.

This is not a small problem.

 isRandomAccessRange requires hasLength (see here:
 http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532).
 This is not a random access range per that definition.

 That's an interesting twist. By the way I specified length is required  
 then because I couldn't imagine having random access into something that  
 I can't tell the length of. Apparently I was wrong :o).

Yes, in fact, you could say that specifically defines VLERange ;)  But  
actually, there are two types of VLE ranges, those which can be randomly  
accessed (where determining the beginning of a code point, given a random  
index is possible) and those that cannot (where decoding depends on the  
exact order of the data).  Actually, those would not be bi-directional  
ranges anyways.
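
For UTF-8, the "randomly accessible" kind can be sketched as follows. This is only an illustration under the assumptions that the input is valid UTF-8 and that the index is in bounds; codePointStart is a hypothetical name, not a Phobos function:

```d
// Returns the offset of the first code unit of the code point that
// contains s[i]. Assumes valid UTF-8 and i < s.length.
size_t codePointStart(const(char)[] s, size_t i)
{
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

void main()
{
    string s = "a\u00E9b"; // U+00E9 occupies offsets 1 and 2
    assert(codePointStart(s, 2) == 1); // middle of the code point
    assert(codePointStart(s, 1) == 1); // already at its start
    assert(codePointStart(s, 3) == 3); // plain ASCII
}
```

An encoding whose decoding depends on earlier state would not allow this kind of local resynchronisation.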

 But a string
 isn't a random access range anyways (it's specifically disallowed by
 std.range per that same reference).

 It isn't and it isn't supposed to be.

I agree with that assessment, which is why I omitted length.

-Steve

Jan 13 2011

"Nick Sabalausky" <a a.a> writes:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:ignon1$2p4k$1 digitalmars.com...
 This may sometimes not be what the user expected; most of the time they'd 
 care about the code points.

I dunno, spir has successfully convinced me that most of the time it's 
graphemes the user cares about, not code points. Using code points is just 
as misleading as using UTF-16 code units.
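
To make the three levels concrete, here is a small sketch that assumes a current Phobos, where std.uni.byGrapheme is available (it was not at the time of this thread). The helper graphemeCount is a made-up name:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: counts user-perceived characters.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT
    string s = "e\u0301";

    assert(s.length == 3);         // UTF-8 code units
    assert(s.walkLength == 2);     // code points (strings decode to dchar)
    assert(graphemeCount(s) == 1); // graphemes
}
```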

Jan 13 2011

spir <denis.spir gmail.com> writes:

On 01/13/2011 11:00 PM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:ignon1$2p4k$1 digitalmars.com...
 This may sometimes not be what the user expected; most of the time they'd
 care about the code points.

 I dunno, spir has successfully convinced me that most of the time it's
 graphemes the user cares about, not code points. Using code points is just
 as misleading as using UTF-16 code units.

You are right that those two issues are really analogous. In practice, 
once universal text is truly and commonly used, I guess problems with 
codes-do-not-represent-characters may become far more obvious, and also 
far more serious, because (logical) errors can easily pass unseen.
[In fact, how can a programmer even know, for instance, that a search 
routine missed its target or returned a false positive when dealing 
with characters from unknown languages? Indeed, there are test data 
sets, but they are useless if the tools one uses just ignore the issues.]
Using a 16-bit representation, and thus ignoring a fair number of code 
points, is maybe less problematic because there is rather little chance 
of randomly meeting characters outside the BMP (Basic Multilingual 
Plane, the part of UCS whose code points are < 0x10000).
Outside the BMP are the scripts of less commonly studied archaeological 
languages, and various sets of images such as alchemical symbols, 
playing cards or domino tiles. I doubt they'll ever be commonly used, 
except in specialised apps where the programmer knows perfectly well 
what they are dealing with.
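
As a small illustration of the BMP boundary, assuming current Phobos (std.utf.codeLength); inBMP is a hypothetical helper:

```d
import std.utf : codeLength;

// Hypothetical helper: true when the code point lies in the
// Basic Multilingual Plane.
bool inBMP(dchar c)
{
    return c < 0x10000;
}

void main()
{
    dchar card = '\U0001F0A1'; // PLAYING CARD ACE OF SPADES
    assert(!inBMP(card));
    assert(codeLength!wchar(card) == 2); // needs a surrogate pair

    assert(inBMP('\u00E9'));
    assert(codeLength!wchar('\u00E9') == 1);

    // The same fact seen from a UTF-16 string.
    wstring w = "\U0001F0A1"w;
    assert(w.length == 2);
}
```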

A list of UCS blocks with pointers to detailed content can be found here:
http://www.fileformat.info/info/unicode/block/index.htm
Blocks over the BMP start with the line:
Linear B Syllabary 	U+10000 	U+1007F 	(88)

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 13 2011

Lutger Blijdestijn <lutger.blijdestijn gmail.com> writes:

Nick Sabalausky wrote:

 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:ignon1$2p4k$1 digitalmars.com...
 This may sometimes not be what the user expected; most of the time they'd
 care about the code points.

 
 I dunno, spir has successfully convinced me that most of the time it's
 graphemes the user cares about, not code points. Using code points is just
 as misleading as using UTF-16 code units.

I agree. This is a very informative thread, thanks spir and everybody else. 

Going back to the topic, it seems to me that a Unicode string is a 
surprisingly complicated data structure that can be viewed from multiple 
types of ranges. In the light of this thread, a dchar doesn't seem like such 
a useful type anymore; it is still a low-level abstraction for the purpose 
of correctly dealing with text. Perhaps even less useful, since it gives the 
illusion of correctness for those who are not in the know.

The algorithms in std.string can be upgraded to work correctly with all the  
issues mentioned, but the generic ones in std.algorithm will just subtly do 
the wrong thing when presented with dchar ranges. And, as I understood it, 
the purpose of a VleRange was exactly to make generic algorithms just work 
(tm) for strings. 

Is it still possible to solve this problem or are we stuck with specialized 
string algorithms? Would it work if VleRange of string was a bidirectional  
range with string slices of graphemes as the ElementType and indexing with 
code units? Often used string algorithms could be specialized for 
performance, but if not, generic algorithms would still work.

Jan 15 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn 
<lutger.blijdestijn gmail.com> said:

 Nick Sabalausky wrote:
 
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:ignon1$2p4k$1 digitalmars.com...
 
 This may sometimes not be what the user expected; most of the time they'd
 care about the code points.
 

 
 I dunno, spir has successfully convinced me that most of the time it's
 graphemes the user cares about, not code points. Using code points is just
 as misleading as using UTF-16 code units.

 
 I agree. This is a very informative thread, thanks spir and everybody else.
 
 Going back to the topic, it seems to me that a Unicode string is a
 surprisingly complicated data structure that can be viewed from multiple
 types of ranges. In the light of this thread, a dchar doesn't seem like such
 a useful type anymore; it is still a low-level abstraction for the purpose
 of correctly dealing with text. Perhaps even less useful, since it gives the
 illusion of correctness for those who are not in the know.
 
 The algorithms in std.string can be upgraded to work correctly with all the
 issues mentioned, but the generic ones in std.algorithm will just subtly do
 the wrong thing when presented with dchar ranges. And, as I understood it,
 the purpose of a VleRange was exactly to make generic algorithms just work
 (tm) for strings.
 
 Is it still possible to solve this problem or are we stuck with specialized
 string algorithms? Would it work if VleRange of string was a bidirectional
 range with string slices of graphemes as the ElementType and indexing with
 code units? Often used string algorithms could be specialized for
 performance, but if not, generic algorithms would still work.

I have my idea.

I think it'd be a good idea to improve upon Andrei's first idea -- 
which was to treat char[], wchar[], and dchar[] all as ranges of dchar 
elements -- by changing the element type to be the same as the string. 
For instance, iterating on a char[] would give you slices of char[], 
each having one grapheme.

The second component would be to make the string equality operator (==) 
for strings compare them in their normalized form, so that ("e" with 
combining acute accent) == (pre-combined "é"). I think this would make 
D support for Unicode much more intuitive.

This implies some semantic changes, mainly that everywhere you write a 
"character" you must use double-quotes (string "a") instead of single 
quote (code point 'a'), but from the user's point of view that's pretty 
much all there is to change.

There'll still be plenty of room for specialized algorithms, but their 
purpose would be limited to optimization. Correctness would be taken 
care of by the basic range interface, and foreach should follow suit 
and iterate by grapheme by default.

I wrote this example (or something similar) earlier in this thread:

	foreach (grapheme; "exposé")
		if (grapheme == "é")
			break;

In this example, even if one of these two strings use the pre-combined 
form of "é" and the other uses a combining acute accent, the equality 
would still hold since foreach iterates on full graphemes and == 
compares using normalization.
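
The comparison half of this idea can be approximated today with a helper function. This sketch assumes a current Phobos with std.uni.normalize; sameText is a made-up name and does not change what the built-in == does:

```d
import std.uni : normalize, NFC;

// Hypothetical helper: compares two strings after bringing both to NFC.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string decomposed = "e\u0301";  // "e" + combining acute accent
    string precomposed = "\u00E9";  // pre-combined form

    assert(decomposed != precomposed);         // plain == compares code units
    assert(sameText(decomposed, precomposed)); // equal once normalized
}
```

Normalizing both operands on every comparison allocates, which is why the optimisation for already-normalized input matters.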

The important thing to keep in mind here is that the grapheme-splitting 
algorithm should be optimized for the case where there is no combining 
character and the compare algorithm for the case where the string is 
already normalized, since most strings will exhibit these 
characteristics.

As for ASCII, we could make it easier to use ubyte[] for it by making 
string literals implicitly convert to ubyte[] if all their characters 
are in ASCII range.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 15 2011

Lutger Blijdestijn <lutger.blijdestijn gmail.com> writes:

Michel Fortin wrote:

 On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
 <lutger.blijdestijn gmail.com> said:

...
 
 Is it still possible to solve this problem or are we stuck with
 specialized string algorithms? Would it work if VleRange of string was a
 bidirectional range with string slices of graphemes as the ElementType
 and indexing with code units? Often used string algorithms could be
 specialized for performance, but if not, generic algorithms would still
 work.

 
 I have my idea.
 
 I think it'd be a good idea to improve upon Andrei's first idea --
 which was to treat char[], wchar[], and dchar[] all as ranges of dchar
 elements -- by changing the element type to be the same as the string.
 For instance, iterating on a char[] would give you slices of char[],
 each having one grapheme.
 

...

Yes, this is exactly what I meant, but you are much clearer. I hope this can 
be made to work!

Jan 15 2011

foobar <foo bar.com> writes:

Lutger Blijdestijn Wrote:

 Michel Fortin wrote:
 
 On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
 <lutger.blijdestijn gmail.com> said:

 ...
 
 Is it still possible to solve this problem or are we stuck with
 specialized string algorithms? Would it work if VleRange of string was a
 bidirectional range with string slices of graphemes as the ElementType
 and indexing with code units? Often used string algorithms could be
 specialized for performance, but if not, generic algorithms would still
 work.

 
 I have my idea.
 
 I think it'd be a good idea to improve upon Andrei's first idea --
 which was to treat char[], wchar[], and dchar[] all as ranges of dchar
 elements -- by changing the element type to be the same as the string.
 For instance, iterating on a char[] would give you slices of char[],
 each having one grapheme.
 

 ...
 
 Yes, this is exactly what I meant, but you are much clearer. I hope this can 
 be made to work!
 

My two cents are against this kind of design. 
The "correct" approach IMO is a 'universal text' type which is a _container_ of
said text. This type would provide ranges for the various abstraction levels.
E.g. 
text.codeUnits to iterate by codeUnits

Here's a (perhaps contrived) example:
Let's say I want to find the combining marks in some text. 

For instance, Hebrew uses combining marks for vowels (among other things) and
they are optional in the language (There's a "full" form with vowels and a
"missing" form without them).
I have a Hebrew text in the "full" form and I want to strip it and convert
it to the "missing" form. 

How would I accomplish this with your design?

Jan 15 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-15 09:09:17 -0500, foobar <foo bar.com> said:

 Lutger Blijdestijn Wrote:
 
 Michel Fortin wrote:
 
 On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
 <lutger.blijdestijn gmail.com> said:

 ...
 
 Is it still possible to solve this problem or are we stuck with
 specialized string algorithms? Would it work if VleRange of string was a
 bidirectional range with string slices of graphemes as the ElementType
 and indexing with code units? Often used string algorithms could be
 specialized for performance, but if not, generic algorithms would still
 work.

 
 I have my idea.
 
 I think it'd be a good idea to improve upon Andrei's first idea --
 which was to treat char[], wchar[], and dchar[] all as ranges of dchar
 elements -- by changing the element type to be the same as the string.
 For instance, iterating on a char[] would give you slices of char[],
 each having one grapheme.
 

 ...
 
 Yes, this is exactly what I meant, but you are much clearer. I hope this can
 be made to work!
 

 
 My two cents are against this kind of design.
 The "correct" approach IMO is a 'universal text' type which is a 
 _container_ of said text. This type would provide ranges for the 
 various abstraction levels. E.g.
 text.codeUnits to iterate by codeUnits

Nothing prevents that in the design I proposed. Andrei's design already 
implements "str".byDchar() that would work for code points. I'd suggest 
changing the API to by!char(), by!wchar(), and by!dchar() for when you 
deal with whatever kind of code unit or code point you want. This would 
be mostly symmetric to what you can already do with foreach:

	foreach (char c; "hello") {}
	foreach (wchar c; "hello") {}
	foreach (dchar c; "hello") {}
// same as:
	foreach (c; "hello".by!char()) {}
	foreach (c; "hello".by!wchar()) {}
	foreach (c; "hello".by!dchar()) {}


 Here's a (perhaps contrived) example:
 Let's say I want to find the combining marks in some text.
 
 For instance, Hebrew uses combining marks for vowels (among other 
 things) and they are optional in the language (There's a "full" form 
 with vowels and a "missing" form without them).
 I have a Hebrew text in the "full" form and I want to strip it and 
 convert it to the "missing" form.
 
 How would I accomplish this with your design?

All you need is a range that takes a string as input and give you code 
points in a decomposed form (NFD), then you use std.algorithm.filter on 
it:

	// original string
	auto str = "...";

	// create normalized decomposed string as a lazy range of dchar (NFD)
	auto decomposed = decompose(str);

	// filter to remove your favorite combining code point
	// (use the hex code you want)
	auto filtered = filter!"a != 0xFABA"(decomposed);

	// turn it back in composed form (NFC), optional
	auto recomposed = compose(filtered);

	// convert back to a string (could also be wstring or dstring)
	string result = array(recomposed.by!char());

This last line is the one doing everything. All the rest just chain 
ranges together for doing on-the-fly decomposition, filtering, and 
recomposition; the last line uses that chain of ranges to fill the array.
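
For comparison, a runnable approximation with what current Phobos offers. It assumes std.uni.normalize and std.uni.isMark, differs from the lazy decompose/compose ranges described above in that normalize is eager, and drops every combining mark rather than one chosen code point. stripMarks is a hypothetical name:

```d
import std.algorithm : filter;
import std.array : array;
import std.conv : to;
import std.uni : isMark, normalize, NFC, NFD;

// Hypothetical helper: removes every combining mark from the text,
// then recomposes what is left.
string stripMarks(string text)
{
    auto decomposed = normalize!NFD(text);               // eager, not lazy
    auto filtered = decomposed.filter!(c => !isMark(c)); // lazy range of dchar
    return normalize!NFC(filtered.array.to!string);
}

void main()
{
    // A Hebrew word written with vowel points, then without them.
    string full = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD";
    assert(stripMarks(full) == "\u05E9\u05DC\u05D5\u05DD");

    // The same function also strips Latin accents.
    assert(stripMarks("expos\u00E9") == "expose");
}
```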

A more naive implementation not taking advantage of code points but 
instead using a replacement table would also work:

	string str = "...";
	string result;
	string[string] replacements = ["é":"e"]; // change this for what you want
	foreach (grapheme; str) {
		auto replacement = grapheme in replacements; // pointer to the value, or null
		if (replacement)
			result ~= *replacement;
		else
			result ~= grapheme;
	}
	

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 15 2011

foobar <foo bar.com> writes:

Michel Fortin Wrote:

 On 2011-01-15 09:09:17 -0500, foobar <foo bar.com> said:
 
 Lutger Blijdestijn Wrote:
 
 Michel Fortin wrote:
 
 On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
 <lutger.blijdestijn gmail.com> said:

 ...
 
 Is it still possible to solve this problem or are we stuck with
 specialized string algorithms? Would it work if VleRange of string was a
 bidirectional range with string slices of graphemes as the ElementType
 and indexing with code units? Often used string algorithms could be
 specialized for performance, but if not, generic algorithms would still
 work.

 
 I have my idea.
 
 I think it'd be a good idea to improve upon Andrei's first idea --
 which was to treat char[], wchar[], and dchar[] all as ranges of dchar
 elements -- by changing the element type to be the same as the string.
 For instance, iterating on a char[] would give you slices of char[],
 each having one grapheme.
 

 ...
 
 Yes, this is exactly what I meant, but you are much clearer. I hope this can
 be made to work!
 

 
 My two cents are against this kind of design.
 The "correct" approach IMO is a 'universal text' type which is a 
 _container_ of said text. This type would provide ranges for the 
 various abstraction levels. E.g.
 text.codeUnits to iterate by codeUnits

 
 Nothing prevents that in the design I proposed. Andrei's design already 
 implements "str".byDchar() that would work for code points. I'd suggest 
 changing the API to by!char(), by!wchar(), and by!dchar() for when you 
 deal with whatever kind of code unit or code point you want. This would 
 be mostly symmetric to what you can already do with foreach:
 
 	foreach (char c; "hello") {}
 	foreach (wchar c; "hello") {}
 	foreach (dchar c; "hello") {}
 // same as:
 	foreach (c; "hello".by!char()) {}
 	foreach (c; "hello".by!wchar()) {}
 	foreach (c; "hello".by!dchar()) {}
 
 
 Here's a (perhaps contrived) example:
 Let's say I want to find the combining marks in some text.
 
 For instance, Hebrew uses combining marks for vowels (among other 
 things) and they are optional in the language (There's a "full" form 
 with vowels and a "missing" form without them).
 I have a Hebrew text in the "full" form and I want to strip it and 
 convert it to the "missing" form.
 
 How would I accomplish this with your design?

 
 All you need is a range that takes a string as input and give you code 
 points in a decomposed form (NFD), then you use std.algorithm.filter on 
 it:
 
 	// original string
 	auto str = "...";
 
 	// create normalized decomposed string as a lazy range of dchar (NFD)
 	auto decomposed = decompose(str);
 
 	// filter to remove your favorite combining code point
 	// (use the hex code you want)
 	auto filtered = filter!"a != 0xFABA"(decomposed);
 
 	// turn it back in composed form (NFC), optional
 	auto recomposed = compose(filtered);
 
 	// convert back to a string (could also be wstring or dstring)
 	string result = array(recomposed.by!char());
 
 This last line is the one doing everything. All the rest just chain 
 ranges together for doing on-the-fly decomposition, filtering, and 
 recomposition; the last line uses that chain of ranges to fill the array.
 
 A more naive implementation not taking advantage of code points but 
 instead using a replacement table would also work:
 
 	string str = "...";
 	string result;
 	string[string] replacements = ["é":"e"]; // change this for what you want
 	foreach (grapheme; str) {
 		auto replacement = grapheme in replacements; // pointer to the value, or null
 		if (replacement)
 			result ~= *replacement;
 		else
 			result ~= grapheme;
 	}
 	
 
 -- 
 Michel Fortin
 michel.fortin michelf.com
 http://michelf.com/
 

Ok, I guess I missed the "byDchar()" method. 
I envisioned the same algorithm looking like this:
 
// original string
string str = "...";

// create normalized decomposed string as a lazy range of dchar (NFD)
// Note: explicitly specify code points range:
auto decomposed = decompose(str.codePoints);

// filter to remove your favorite combining code point
auto filtered = filter!"a != 0xFABA"(decomposed);

// turn it back in composed form (NFC), optional
auto recomposed = compose(filtered);
 
// convert back to a string
// Note: a string type can be constructed from a range of code points
string result = string(recomposed);
 
The difference is that a string type is distinct from the intermediate code
point ranges (This happens in your design too albeit in a less obvious way to
the user). There is string specific code. Why not encapsulate it in a string
type instead of forcing the user to use complex APIs with templates everywhere?

Jan 15 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-15 10:59:52 -0500, foobar <foo bar.com> said:

 Ok, I guess I missed the "byDchar()" method.
 I envisioned the same algorithm looking like this:
 
 // original string
 string str = "...";
 
 // create normalized decomposed string as a lazy range of dchar (NFD)
 // Note: explicitly specify code points range:
 auto decomposed = decompose(str.codePoints);
 
 // filter to remove your favorite combining code point
 auto filtered = filter!"a != 0xFABA"(decomposed);
 
 // turn it back in composed form (NFC), optional
 auto recomposed = compose(filtered);
 
 // convert back to a string
 // Note: a string type can be constructed from a range of code points
 string result = string(recomposed);
 
 The difference is that a string type is distinct from the intermediate 
 code point ranges (This happens in your design too albeit in a less 
 obvious way to the user). There is string specific code. Why not 
 encapsulate it in a string type instead of forcing the user to use 
 complex APIs with templates everywhere?

What I don't understand is in what way using a string type would make 
the API less complex and use less templates?

More generally, in what way would your string type behave differently 
than char[], wchar[], and dchar[]? I think we need to clarify how 
you expect your string type to behave before I can answer anything. I 
mean, besides cosmetic changes such as having a codePoint property 
instead of by!dchar or byDchar, what is your string type doing 
differently?

The above algorithm is already possible with strings as they are, 
provided you implement the 'decompose' and the 'compose' function 
returning a range. In fact, you only changed two things in it: by!dchar 
became codePoints, and array() became string(). Surely you're expecting 
more benefits than that.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 15 2011

foobar <foo bar.com> writes:

Michel Fortin Wrote:

 What I don't understand is in what way using a string type would make 
 the API less complex and use less templates?
 
 More generally, in what way would your string type behave differently 
 than char[], wchar[], and dchar[]? I think we need to clarify how 
 you expect your string type to behave before I can answer anything. I 
 mean, besides cosmetic changes such as having a codePoint property 
 instead of by!dchar or byDchar, what is your string type doing 
 differently?
 
 The above algorithm is already possible with strings as they are, 
 provided you implement the 'decompose' and the 'compose' function 
 returning a range. In fact, you only changed two things in it: by!dchar 
 became codePoints, and array() became string(). Surely you're expecting 
 more benefits than that.
 
 -- 
 Michel Fortin
 michel.fortin michelf.com
 http://michelf.com/
 

First thing, the question of possibility is irrelevant since I could also write
the same algorithm in brainfuck or assembly (with a lot more code). It's never
a question of possibility but rather a question of ease of use for the user. 

What I want is to encapsulate all the low-level implementation details in one
place so that as a user I will not need to deal with them everywhere. One
such detail is the encoding. 

auto text = w"whatever"; // should be equivalent to:
auto text = new Text("whatever", Encoding.UTF16);

Now I want to write my own string function:

void func(Text a); // instead of the current:
void func(T)(T a) if (isTextType!T); // why does the USER need all this?

Of course, the Text type would do the correct thing by default, which we both
agree should be graphemes. Only if I need something advanced, like in the
previous algorithm, would I explicitly need to specify that I work on code points
or code units. 

In a sentence: "Make the common case trivial and the complex case possible".
The common case is what we Humans think of as characters (graphemes) and the
complex case is the encoding level.
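
A rough sketch of such a container, under stated assumptions: Text, its three view properties and characterCount are all hypothetical names, the storage is fixed to UTF-8 for brevity, and the views rely on std.utf.byCodeUnit, std.utf.byDchar and std.uni.byGrapheme from a current Phobos:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical container: not a Phobos type, only an illustration.
struct Text
{
    private string data; // stored as UTF-8 in this sketch

    auto codeUnits()  { return data.byCodeUnit; } // encoding level
    auto codePoints() { return data.byDchar; }    // Unicode scalar values
    auto graphemes()  { return data.byGrapheme; } // user-perceived characters
}

// A plain, non-templated function taking the container.
size_t characterCount(Text t)
{
    return t.graphemes.walkLength;
}

void main()
{
    auto t = Text("e\u0301a"); // "e" + combining acute accent, then "a"
    assert(t.codeUnits.walkLength == 4);
    assert(t.codePoints.walkLength == 3);
    assert(characterCount(t) == 2);
}
```

The function signature stays free of templates because the choice of abstraction level is made inside the type rather than by each caller.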

Jan 15 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
 On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
 <lutger.blijdestijn gmail.com> said:
 Nick Sabalausky wrote:
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:ignon1$2p4k$1 digitalmars.com...

 This may sometimes not be what the user expected; most of the time
 they'd care about the code points.

 I dunno, spir has successfully convinced me that most of the time it's
 graphemes the user cares about, not code points. Using code points is
 just as misleading as using UTF-16 code units.

 I agree. This is a very informative thread, thanks spir and everybody
 else.

 Going back to the topic, it seems to me that a Unicode string is a
 surprisingly complicated data structure that can be viewed from multiple
 types of ranges. In the light of this thread, a dchar doesn't seem like
 such a useful type anymore; it is still a low-level abstraction for the
 purpose of correctly dealing with text. Perhaps even less useful, since
 it gives the illusion of correctness for those who are not in the know.

 The algorithms in std.string can be upgraded to work correctly with all
 the issues mentioned, but the generic ones in std.algorithm will just
 subtly do the wrong thing when presented with dchar ranges. And, as I
 understood it, the purpose of a VleRange was exactly to make generic
 algorithms just work (tm) for strings.

 Is it still possible to solve this problem or are we stuck with
 specialized string algorithms? Would it work if VleRange of string was a
 bidirectional range with string slices of graphemes as the ElementType
 and indexing with code units? Often used string algorithms could be
 specialized for performance, but if not, generic algorithms would still
 work.

 I have my idea.

 I think it'd be a good idea to improve upon Andrei's first idea --
 which was to treat char[], wchar[], and dchar[] all as ranges of dchar
 elements -- by changing the element type to be the same as the string.
 For instance, iterating on a char[] would give you slices of char[],
 each having one grapheme.

 The second component would be to make the string equality operator (==)
 for strings compare them in their normalized form, so that ("e" with
 combining acute accent) == (pre-combined "é"). I think this would make
 D support for Unicode much more intuitive.

 This implies some semantic changes, mainly that everywhere you write a
 "character" you must use double-quotes (string "a") instead of single
 quote (code point 'a'), but from the user's point of view that's pretty
 much all there is to change.

 There'll still be plenty of room for specialized algorithms, but their
 purpose would be limited to optimization. Correctness would be taken
 care of by the basic range interface, and foreach should follow suit
 and iterate by grapheme by default.

 I wrote this example (or something similar) earlier in this thread:

 	foreach (grapheme; "exposé")
 		if (grapheme == "é")
 			break;

 In this example, even if one of these two strings use the pre-combined
 form of "é" and the other uses a combining acute accent, the equality
 would still hold since foreach iterates on full graphemes and ==
 compares using normalization.

 The important thing to keep in mind here is that the grapheme-splitting
 algorithm should be optimized for the case where there is no combining
 character and the compare algorithm for the case where the string is
 already normalized, since most strings will exhibit these
 characteristics.

 As for ASCII, we could make it easier to use ubyte[] for it by making
 string literals implicitly convert to ubyte[] if all their characters
 are in ASCII range.

I think that that would cause definite problems. Having the element type of the
range be the same type as the range seems like it could cause a lot of problems
in std.algorithm and the like, and it's _definitely_ going to confuse
programmers. I'd expect it to be highly bug-prone. They _need_ to be separate
types.

Now, given that dchar can't actually work completely as an element type, you'd
either need the string type to be a new type or the element type to be a new
type. So, either the string type has char[], wchar[], or dchar[] for its element
type, or char[], wchar[], and dchar[] have something like uchar as their element
type, where uchar is a struct which contains a char[], wchar[], or dchar[] which
holds a single grapheme.

I think that it's a great idea that programmers try to use substrings and slices
rather than dchar, but making the element type a slice of the original type
sounds like it's really asking for trouble.

- Jonathan M Davis

Jan 15 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:

 On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
 I have my idea.
 
 I think it'd be a good idea is to improve upon Andrei's first idea --
 which was to treat char[], wchar[], and dchar[] all as ranges of dchar
 elements -- by changing the element type to be the same as the string.
 For instance, iterating on a char[] would give you slices of char[],
 each having one grapheme.
 
 The second component would be to make the string equality operator (==)
 for strings compare them in their normalized form, so that ("e" with
 combining acute accent) == (pre-combined "é"). I think this would make
 D support for Unicode much more intuitive.
 
 This implies some semantic changes, mainly that everywhere you write a
 "character" you must use double-quotes (string "a") instead of single
 quote (code point 'a'), but from the user's point of view that's pretty
 much all there is to change.
 
 There'll still be plenty of room for specialized algorithms, but their
 purpose would be limited to optimization. Correctness would be taken
 care of by the basic range interface, and foreach should follow suit
 and iterate by grapheme by default.
 
 I wrote this example (or something similar) earlier in this thread:
 
 	foreach (grapheme; "exposé")
 		if (grapheme == "é")
 			break;
 
 In this example, even if one of these two strings use the pre-combined
 form of "é" and the other uses a combining acute accent, the equality
 would still hold since foreach iterates on full graphemes and ==
 compares using normalization.

 
 I think that that would cause definite problems. Having the element 
 type of the range be the same type as the range seems like it could 
 cause a lot of problems in std.algorithm and the like, and it's 
 _definitely_ going to confuse programmers. I'd expect it to be highly 
 bug-prone. They _need_ to be separate types.

I remember that someone already complained about this issue because he 
had a tree of ranges, and Andrei said he would take a look at this 
problem eventually. Perhaps now would be a good time.


 Now, given that dchar can't actually work completely as an element 
 type, you'd either need the string type to be a new type or the element 
 type to be a new type. So, either the string type has char[], wchar[], 
 or dchar[] for its element type, or char[], wchar[], and dchar[] have 
 something like uchar as their element type, where uchar is a struct 
 which contains a char[], wchar[], or dchar[]
 which holds a single grapheme.

Having a new type for grapheme would work too. My preference still goes 
to reusing the string type because it makes the semantic simpler to 
understand, especially when comparing graphemes with literals.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 15 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday 15 January 2011 20:45:53 Michel Fortin wrote:
 On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:
 On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
 I have my idea.
 
 I think it'd be a good idea is to improve upon Andrei's first idea --
 which was to treat char[], wchar[], and dchar[] all as ranges of dchar
 elements -- by changing the element type to be the same as the string.
 For instance, iterating on a char[] would give you slices of char[],
 each having one grapheme.
 
 The second component would be to make the string equality operator (==)
 for strings compare them in their normalized form, so that ("e" with
 combining acute accent) == (pre-combined "é"). I think this would make
 D support for Unicode much more intuitive.
 
 This implies some semantic changes, mainly that everywhere you write a
 "character" you must use double-quotes (string "a") instead of single
 quote (code point 'a'), but from the user's point of view that's pretty
 much all there is to change.
 
 There'll still be plenty of room for specialized algorithms, but their
 purpose would be limited to optimization. Correctness would be taken
 care of by the basic range interface, and foreach should follow suit
 and iterate by grapheme by default.
 
 I wrote this example (or something similar) earlier in this thread:
 
 	foreach (grapheme; "exposé")
 		if (grapheme == "é")
 			break;
 
 In this example, even if one of these two strings use the pre-combined
 form of "é" and the other uses a combining acute accent, the equality
 would still hold since foreach iterates on full graphemes and ==
 compares using normalization.

 I think that that would cause definite problems. Having the element
 type of the range be the same type as the range seems like it could
 cause a lot of problems in std.algorithm and the like, and it's
 _definitely_ going to confuse programmers. I'd expect it to be highly
 bug-prone. They _need_ to be separate types.

 I remember that someone already complained about this issue because he
 had a tree of ranges, and Andrei said he would take a look at this
 problem eventually. Perhaps now would be a good time.
 
 Now, given that dchar can't actually work completely as an element
 type, you'd either need the string type to be a new type or the element
 type to be a new type. So, either the string type has char[], wchar[],
 or dchar[] for its element type, or char[], wchar[], and dchar[] have
 something like uchar as their element type, where uchar is a struct
 which contains a char[], wchar[], or dchar[]
 which holds a single grapheme.

 Having a new type for grapheme would work too. My preference still goes
 to reusing the string type because it makes the semantic simpler to
 understand, especially when comparing graphemes with literals.

If a character literal actually became a grapheme instead of a dchar, then that
would likely solve that issue. But I fear that the semantics of having a range
be its own element type actually make understanding it _harder_, not simpler.
Being forced to compare string literals against what should be a character
would definitely confuse programmers. Making a new character or grapheme type
which represented a grapheme would be _far_ simpler to understand IMO. However,
making it work really well would likely require that the compiler know about the
grapheme type like it knows about dchar.

- Jonathan M Davis

Jan 15 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-15 23:58:30 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:

 On Saturday 15 January 2011 20:45:53 Michel Fortin wrote:
 On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:
 On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
 I have my idea.
 
 I think it'd be a good idea is to improve upon Andrei's first idea --
 which was to treat char[], wchar[], and dchar[] all as ranges of dchar
 elements -- by changing the element type to be the same as the string.
 For instance, iterating on a char[] would give you slices of char[],
 each having one grapheme.
 
 The second component would be to make the string equality operator (==)
 for strings compare them in their normalized form, so that ("e" with
 combining acute accent) == (pre-combined "é"). I think this would make
 D support for Unicode much more intuitive.
 
 This implies some semantic changes, mainly that everywhere you write a
 "character" you must use double-quotes (string "a") instead of single
 quote (code point 'a'), but from the user's point of view that's pretty
 much all there is to change.
 
 There'll still be plenty of room for specialized algorithms, but their
 purpose would be limited to optimization. Correctness would be taken
 care of by the basic range interface, and foreach should follow suit
 and iterate by grapheme by default.
 
 I wrote this example (or something similar) earlier in this thread:
 	foreach (grapheme; "exposé")
 		if (grapheme == "é")
 			break;
 
 In this example, even if one of these two strings use the pre-combined
 form of "é" and the other uses a combining acute accent, the equality
 would still hold since foreach iterates on full graphemes and ==
 compares using normalization.

 
 I think that that would cause definite problems. Having the element
 type of the range be the same type as the range seems like it could
 cause a lot of problems in std.algorithm and the like, and it's
 _definitely_ going to confuse programmers. I'd expect it to be highly
 bug-prone. They _need_ to be separate types.

 
 I remember that someone already complained about this issue because he
 had a tree of ranges, and Andrei said he would take a look at this
 problem eventually. Perhaps now would be a good time.
 
 Now, given that dchar can't actually work completely as an element
 type, you'd either need the string type to be a new type or the element
 type to be a new type. So, either the string type has char[], wchar[],
 or dchar[] for its element type, or char[], wchar[], and dchar[] have
 something like uchar as their element type, where uchar is a struct
 which contains a char[], wchar[], or dchar[]
 which holds a single grapheme.

 
 Having a new type for grapheme would work too. My preference still goes
 to reusing the string type because it makes the semantic simpler to
 understand, especially when comparing graphemes with literals.

 
 If a character literal actually became a grapheme instead of a dchar, then
 that would likely solve that issue. But I fear that the semantics of 
 having a range
 be its own element type actually make understanding it _harder_, not simpler.
 Being forced to compare a string literals against what should be a 
 character would definitely confuse programmers.

Character literals are treated as simple numbers by the language. By 
that I mean that you can write 'b' - 'a' == 1 and it'll be true. 
Arithmetic makes absolutely no sense for graphemes. If you want a 
special literal for graphemes, I'm afraid you'll have to invent 
something new. And at this point, why not use a string?
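
To make the point about character literals concrete, here is a minimal sketch of the arithmetic D allows on them. Nothing here is specific to the proposal; it only shows that a character literal is a number as far as the language is concerned.

```d
void main()
{
    // Character literals take part in integer arithmetic.
    static assert('b' - 'a' == 1);

    // The result of the subtraction is an int, not a character.
    static assert(is(typeof('b' - 'a') == int));

    // Going the other way needs a cast back to a character type.
    auto next = cast(dchar)('a' + 1);
    assert(next == 'b');
}
```

A grapheme made of several code points has no single numeric value, so none of this carries over.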


 Making a new character or grapheme type which represented a grapheme 
 would be _far_ simpler to understand IMO. However, making it work 
 really well would likely require that the compiler know about the 
 grapheme type like it knows about dchar.

I'm looking for a simple solution. One that doesn't involve inventing a 
new grapheme literal syntax or adding new types the compiler must know 
about. I'm not really opposed to any of this, but the more complicated 
the solution is, the less likely it is to be adopted.

All I'm asking is that Unicode strings behave as Unicode strings should 
behave. Making iteration use graphemes by default and string comparison 
use the normalized form by default seems like a simple way to achieve 
that goal.

The most important thing is not the implementation, but that the default 
behaviour be the right behaviour.
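
A minimal sketch of those two defaults written as plain library code, assuming a Phobos whose std.uni provides graphemeStride and normalize (later releases do). The helper names graphemeSlices and sameText are made up for illustration; this is not how foreach or == behave today.

```d
import std.stdio : writeln;
import std.uni : NFC, graphemeStride, normalize;

// Hypothetical helper: split s into slices of s, one grapheme per slice.
string[] graphemeSlices(string s)
{
    string[] slices;
    for (size_t i = 0; i < s.length; )
    {
        immutable n = graphemeStride(s, i); // code units in this grapheme
        slices ~= s[i .. i + n];
        i += n;
    }
    return slices;
}

// Hypothetical helper: equality after normalizing both sides to NFC.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string combining   = "expose\u0301"; // e followed by combining acute accent
    string precombined = "\u00E9";       // the same letter as one code point

    foreach (grapheme; graphemeSlices(combining))
    {
        if (sameText(grapheme, precombined))
        {
            writeln("found it");
            break;
        }
    }
}
```

Each element is a slice of the original string, so no new element type is needed, and the comparison succeeds even though the two sides use different code point sequences.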


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 15 2011

foobar <foo bar.com> writes:

Michel Fortin Wrote:


 Character literals are treated as simple numbers by the language. By 
 that I mean that you can write 'b' - 'a' == 1 and it'll be true. 
 Arithmetic makes absolutely no sense for graphemes. If you want a 
 special literal for graphemes, I'm afraid you'll have to invent 
 something new. And at this point, why not use a string?
 
 
 Making a new character or grapheme type which represented a grapheme 
 would be _far_ simpler to understand IMO. However, making it work 
 really well would likely require that the compiler know about the 
 grapheme type like it knows about dchar.

 
 I'm looking for a simple solution. One that doesn't involve inventing a 
 new grapheme literal syntax or adding new types the compiler most know 
 about. I'm not really opposed to any of this, but the more complicated 
 is the solution, the less likely it is to be adopted.
 
 All I'm asking is that Unicode strings behave as Unicode strings should 
 behave. Making iteration use graphemes by default and string comparison 
 use the normalized form by default seems like a simple way to achieve 
 that goal.
 
 The most important is not the implementation, but that the default 
 behaviour be the right behaviour.
 
 
 -- 
 Michel Fortin
 michel.fortin michelf.com
 http://michelf.com/
 

I understand your concern regarding a simpler implementation. You want to
minimize the disruption caused by the proposed change. 

I'd argue that creating a specialized string type as Steve suggests makes
integration *easier*. Your suggestion requires that foreach will be changed to
default to grapheme. I agree that this can be done because it will not break
silently but with Steve's string type this is unnecessary since the type itself
would provide a grapheme range interface and the compiler doesn't need to know
about this type at all. string becomes a regular library type. 

Of course, the type should support:
string foo = "bar"; 
by making an implicit conversion from current arrays (to minimize compiler
changes)

The only disruption as far as I can tell would be using 'a' type literals
instead of "a" but that will come up in compilation after string defaults to
the new type. Also, all occurrences of:
string foo = ...;
foreach (c; foo) {...} // c is now a grapheme
will now do the correct thing by default.
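
A rough sketch of what such a library type could look like, assuming std.uni.graphemeStride is available. UString is a hypothetical name, and this is not Steve's actual string_t; it only shows that a wrapper can expose a grapheme range without the compiler knowing about it.

```d
import std.stdio : writeln;
import std.uni : graphemeStride;

// Hypothetical library string type: wraps the code-unit array and is an
// input range whose elements are single-grapheme slices of that array.
struct UString
{
    private string data;

    this(string s) { data = s; }

    @property bool empty() const { return data.length == 0; }

    @property string front() const
    {
        return data[0 .. graphemeStride(data, 0)];
    }

    void popFront()
    {
        data = data[graphemeStride(data, 0) .. $];
    }
}

void main()
{
    UString foo = "u\u0308ber"; // constructed from an ordinary literal
    size_t count;
    foreach (c; foo)            // c is a one-grapheme slice
    {
        ++count;
        writeln(c);
    }
    assert(count == 4);         // u+diaeresis, b, e, r
}
```

The declaration with an initializer works because D rewrites it into a constructor call; making auto foo = "bar" produce this type is the part that would need compiler support.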

Jan 15 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-16 02:11:14 -0500, foobar <foo bar.com> said:

 I Understand your concern regarding a simpler implementation. You want 
 to minimize the disruption caused by the proposed change.
 
 I'd argue that creating a specialized string type as Steve suggests 
 makes integration *easier*. Your suggestion requires that foreach will 
 be changed to default to grapheme. I agree that this can be done 
 because it will not break silently but with Steve's string type this is 
 unnecessary since the type itself would provide a grapheme range 
 interface and the compiler doesn't need to know about this type at all. 
 string becomes a regular library type.
 
 Of course, the type should support:
 string foo = "bar";
 by making an implicit conversion from current arrays (to minimize 
 compiler changes)

It should also work for:

	auto foo = "bar";


 The only disruption as far as I can tell would be using 'a' type 
 literals instead of "a" but that will come up in compilation after 
 string defaults to the new type.

You say "after string defaults to the new type", but I don't think this 
change to the language will pass. It'll break TDPL for one thing, so 
it's surely out for D2. And I somewhat doubt it's low-level enough for 
Walter's taste.

I don't care much if the default type is an array or not, I just want 
the default type to work properly as a Unicode string. The very small 
participation to this thread from the key decision makers (Andrei and 
Walter) worries me however. I'm not even sure we'll achieve that goal.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 16 2011

foobar <foo bar.com> writes:

Michel Fortin Wrote:

 On 2011-01-16 02:11:14 -0500, foobar <foo bar.com> said:
 
 I Understand your concern regarding a simpler implementation. You want 
 to minimize the disruption caused by the proposed change.
 
 I'd argue that creating a specialized string type as Steve suggests 
 makes integration *easier*. Your suggestion requires that foreach will 
 be changed to default to grapheme. I agree that this can be done 
 because it will not break silently but with Steve's string type this is 
 unnecessary since the type itself would provide a grapheme range 
 interface and the compiler doesn't need to know about this type at all. 
 string becomes a regular library type.
 
 Of course, the type should support:
 string foo = "bar";
 by making an implicit conversion from current arrays (to minimize 
 compiler changes)

 
 It should also work for:
 
 	auto foo = "bar";

Right. This does require compiler changes.
 
 
 
 The only disruption as far as I can tell would be using 'a' type 
 literals instead of "a" but that will come up in compilation after 
 string defaults to the new type.

 
 You say "after string defaults to the new type", but I don't think this 
 change to the language will pass. It'll break TDPL for one thing, so 
 it's surely out for D2. And I somewhat doubt it's low-level enough for 
 Walter's taste.
 

string is an alias in phobos so it's more of a stdlib change but I see your
point about TDPL. I did get the feeling that Andrei is willing to make a change
if it proves worthwhile by preventing writing bad code (Which we both agree
this change accomplishes). 

 I don't care much if the default type is an array or not, I just want 
 the default type to work properly as a Unicode string. The very small 
 participation to this thread from the key decision makers (Andrei and 
 Walter) worries me however. I'm not even sure we'll achieve that goal.
 
 

Andrei did take part and even asked for links that explain the subject. Perhaps
the quiet is due to the mastermind doing research on the topic rather than
reluctance to make any changes. :)
  
 -- 
 Michel Fortin
 michel.fortin michelf.com
 http://michelf.com/

Jan 16 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-13 15:51:00 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
 On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Let's take a look:
 
 // Incorrect string code
 void fun(string s) {
 foreach (i; 0 .. s.length) {
 writeln("The character in position ", i, " is ", s[i]);
 }
 }
 
 // Incorrect string_t code
 void fun(string_t!char s) {
 foreach (i; 0 .. s.codeUnits) {
 writeln("The character in position ", i, " is ", s[i]);
 }
 }
 
 Both functions are incorrect, albeit in different ways. The only
 improvement I'm seeing is that the user needs to write codeUnits
 instead of length, which may make her think twice. Clearly, however,
 copiously incorrect code can be written with the proposed interface
 because it tries to hide the reality that underneath a variable-length
 encoding is being used, but doesn't hide it completely (albeit for
 good efficiency-related reasons).

 
 You might be looking at my previous version. The new version (recently
 posted) will throw an exception for that code if a multi-code-unit
 code-point is found.

 
 I was looking at your latest. It's code that compiles and runs, but 
 dynamically fails on some inputs. I agree that it's often better to 
 fail noisily instead of silently, but in a manner of speaking the 
 string-based code doesn't fail at all - it correctly iterates the code 
 units of a string. This may sometimes not be what the user expected; 
 most of the time they'd care about the code points.

That's forgetting that most of the time people care about graphemes 
(user-perceived characters), not code points.


 It also supports this:
 
 foreach(i, d; s)
 {
 writeln("The character in position ", i, " is ", d);
 }
 
 where i is the index (might not be sequential)

 
 Well string supports that too, albeit with the nit that you need to 
 specify dchar.

Except it breaks with combining characters. For instance, take the 
string "t̃", which is two code points -- 't' followed by combining 
tilde (U+0303) -- and you'll get the following output:

	The character in position 0 is t
	The character in position 1 is ̃

(Note that the tilde becomes combined with the preceding space character.)

The conception of character that normal people have does not match the 
notion of code points when combining characters enters the equation.
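
The case above can be reproduced with a short program. This is a sketch assuming a Phobos with std.uni.byGrapheme and a string range that decodes to code points; the counts in the asserts follow from the UTF-8 encoding of the two code points.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    string s = "t\u0303"; // 't' followed by COMBINING TILDE (U+0303)

    // Decoding to dchar walks code points: two iterations, at indices 0 and 1.
    foreach (i, dchar d; s)
        writeln("The character in position ", i, " is ", d);

    assert(s.length == 3);                // UTF-8 code units
    assert(s.walkLength == 2);            // code points
    assert(s.byGrapheme.walkLength == 1); // user-perceived characters
}
```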


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 13 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/13/11 7:09 PM, Michel Fortin wrote:
 On 2011-01-13 15:51:00 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
 On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Let's take a look:

 // Incorrect string code
 void fun(string s) {
 foreach (i; 0 .. s.length) {
 writeln("The character in position ", i, " is ", s[i]);
 }
 }

 // Incorrect string_t code
 void fun(string_t!char s) {
 foreach (i; 0 .. s.codeUnits) {
 writeln("The character in position ", i, " is ", s[i]);
 }
 }

 Both functions are incorrect, albeit in different ways. The only
 improvement I'm seeing is that the user needs to write codeUnits
 instead of length, which may make her think twice. Clearly, however,
 copiously incorrect code can be written with the proposed interface
 because it tries to hide the reality that underneath a variable-length
 encoding is being used, but doesn't hide it completely (albeit for
 good efficiency-related reasons).

 You might be looking at my previous version. The new version (recently
 posted) will throw an exception for that code if a multi-code-unit
 code-point is found.

 I was looking at your latest. It's code that compiles and runs, but
 dynamically fails on some inputs. I agree that it's often better to
 fail noisily instead of silently, but in a manner of speaking the
 string-based code doesn't fail at all - it correctly iterates the code
 units of a string. This may sometimes not be what the user expected;
 most of the time they'd care about the code points.

 That's forgetting that most of the time people care about graphemes
 (user-perceived characters), not code points.

I'm not so sure about that. What do you base this assessment on? Denis 
wrote a library that according to him does grapheme-related stuff nobody 
else does. So apparently graphemes is not what people care about 
(although it might be what they should care about).

 It also supports this:

 foreach(i, d; s)
 {
 writeln("The character in position ", i, " is ", d);
 }

 where i is the index (might not be sequential)

 Well string supports that too, albeit with the nit that you need to
 specify dchar.

 Except it breaks with combining characters. For instance, take the
 string "t̃", which is two code points -- 't' followed by combining tilde
 (U+0303) -- and you'll get the following output:

 The character in position 0 is t
 The character in position 1 is ̃

 (Note that the tilde becomes combined with the preceding space character.)

 The conception of character that normal people have does not match the
 notion of code points when combining characters enters the equation.

This might be a good time to see whether we need to address graphemes 
systematically. Could you please post a few links that would educate me 
and others in the mysteries of combining characters?


Thanks,

Andrei

Jan 13 2011

"Nick Sabalausky" <a a.a> writes:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:igoj6s$17r6$1 digitalmars.com...
 I'm not so sure about that. What do you base this assessment on? Denis 
 wrote a library that according to him does grapheme-related stuff nobody 
 else does. So apparently graphemes is not what people care about (although 
 it might be what they should care about).

It's what they want, they just don't know it.

Graphemes are what many people *think* code points are.

 This might be a good time to see whether we need to address graphemes 
 systematically. Could you please post a few links that would educate me 
 and others in the mysteries of combining characters?

Maybe someone else has a link to an explanation (I don't), but it's 
basically just this:

Three levels of abstraction from lowest to highest:
- Code Unit (ie, encoding)
- Code Point (ie, what Unicode assigns distinct numbers to)
- Grapheme (ie, what we think of as a "character")

A code-point can be made up of one or more code-units. Likewise, a grapheme 
can be made up of one or more code-points.

There are (at least) two types of code points:

- Regular ones, such as letters, digits, and punctuation.

- "Combining Characters", such as accent marks (or if you're familiar with 
Japanese, the little things in the upper-right corner that change an "s" to 
a "z" or an "h" to a "p". Or like German's umlaut - the two dots above a 
vowel). Ie, things that are not characters in their own right, but merely 
modify other characters. These can often (always?) be thought of as being 
like overlays.

If a code point representing a "combining character" exists in a string, 
then instead of being displayed as a character it merely modifies whatever 
code-point came before it.

So, for instance, if you want to store the German word for five (in all 
lower-case), there are two ways to do it:

[ 'f', {u with the umlaut}, 'n', 'f' ]

Or:

[ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

Those *both* get rendered exactly the same, and both represent the same 
four-letter sequence. In the second example, the 'u' and the {umlaut 
combining character} combine to form one grapheme. The f's and n's just 
happen to be single-code-point graphemes.

Note that while some characters exist in pre-combined form (such as the {u 
with the umlaut} above), legend has it there are others that can only be 
represented using a combining character.

It's also my understanding, though I'm not certain, that sometimes multiple 
combining characters can be used together on the same "root" character.

Caveat: There may very well be further complications that I'm not aware of. 
Heck, knowing Unicode, there probably are.
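
The three levels can be counted for both spellings of the word with a small sketch, assuming a Phobos that has std.uni.byGrapheme and a string range that decodes to code points. The helper names codeUnits, codePoints and graphemes are made up for illustration.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helpers, one per level of abstraction.
size_t codeUnits(string s)  { return s.length; }                // UTF-8 bytes
size_t codePoints(string s) { return s.walkLength; }            // decoded dchars
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // "characters"

void main()
{
    string precomposed = "f\u00FCnf";  // u with umlaut as one code point
    string combining   = "fu\u0308nf"; // u followed by COMBINING DIAERESIS

    assert(codeUnits(precomposed) == 5 && codeUnits(combining) == 6);
    assert(codePoints(precomposed) == 4 && codePoints(combining) == 5);
    assert(graphemes(precomposed) == 4 && graphemes(combining) == 4);

    writeln(precomposed, " and ", combining, " both have 4 graphemes");
}
```

Only the grapheme count agrees between the two forms, which is the sense in which graphemes match what people think of as characters.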

Jan 13 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/13/11 10:26 PM, Nick Sabalausky wrote:
[snip]
 [ 'f', {u with the umlaut}, 'n', 'f' ]

 Or:

 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.

 Note that while some characters exist in pre-combined form (such as the {u
 with the umlaut} above), legend has it there are others than can only be
 represented using a combining character.

 It's also my understanding, though I'm not certain, that sometimes multiple
 combining characters can be used together on the same "root" character.

Thanks. One further question is: in the above example with 
u-with-umlaut, there is one code point that corresponds to the entire 
combination. Are there combinations that do not have a unique code point?

Andrei

Jan 13 2011

"Nick Sabalausky" <a a.a> writes:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:igoqrm$1n5r$1 digitalmars.com...
 On 1/13/11 10:26 PM, Nick Sabalausky wrote:
 [snip]
 [ 'f', {u with the umlaut}, 'n', 'f' ]

 Or:

 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.

 Note that while some characters exist in pre-combined form (such as the 
 {u
 with the umlaut} above), legend has it there are others than can only be
 represented using a combining character.

 It's also my understanding, though I'm not certain, that sometimes 
 multiple
 combining characters can be used together on the same "root" character.

 Thanks. One further question is: in the above example with u-with-umlaut, 
 there is one code point that corresponds to the entire combination. Are 
 there combinations that do not have a unique code point?

My understanding is "yes". At least that's what I've heard, and I've never 
heard any claims of "no". I don't know of any specific ones offhand, though. 
Actually, it might be possible to use any combining character with any old 
letter or number (like maybe a 7 with an umlaut), though I'm not certain.
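
The 7-with-umlaut case can be checked with a sketch like the following, assuming a Phobos with std.uni.normalize and std.uni.byGrapheme. NFC composition merges a pair only when a precomposed code point exists for it, so the pair that stays at two code points is one without a single code point of its own.

```d
import std.range : walkLength;
import std.uni : NFC, byGrapheme, normalize;

void main()
{
    // u + COMBINING DIAERESIS has a precomposed form, U+00FC.
    assert(normalize!NFC("u\u0308") == "\u00FC");

    // 7 + COMBINING DIAERESIS has none, so NFC leaves two code points...
    string seven = normalize!NFC("7\u0308");
    assert(seven.walkLength == 2);

    // ...which still form a single grapheme.
    assert(seven.byGrapheme.walkLength == 1);
}
```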

FWIW, the Wikipedia article might help, or at least link to other things 
that might help: http://en.wikipedia.org/wiki/Combining_character

Michel or spir might have better links though.

Jan 13 2011

"Nick Sabalausky" <a a.a> writes:

"Nick Sabalausky" <a a.a> wrote in message 
news:igori7$1ovh$1 digitalmars.com...
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
 news:igoqrm$1n5r$1 digitalmars.com...
 On 1/13/11 10:26 PM, Nick Sabalausky wrote:
 [snip]
 [ 'f', {u with the umlaut}, 'n', 'f' ]

 Or:

 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.

 Note that while some characters exist in pre-combined form (such as the 
 {u
 with the umlaut} above), legend has it there are others than can only be
 represented using a combining character.

 It's also my understanding, though I'm not certain, that sometimes 
 multiple
 combining characters can be used together on the same "root" character.

 Thanks. One further question is: in the above example with u-with-umlaut, 
 there is one code point that corresponds to the entire combination. Are 
 there combinations that do not have a unique code point?

 My understanding is "yes". At least that's what I've heard, and I've never 
 heard any claims of "no". I don't know of any specific ones offhand, 
 though. Actually, it might be possible to use any combining character with 
 any old letter or number (like maybe a 7 with an umlaut), though I'm not 
 certain.

 FWIW, the Wikipedia article might help, or at least link to other things 
 that might help: http://en.wikipedia.org/wiki/Combining_character

 Michel or spir might have better links though.

Heh, as if that wasn't bad enough, there's also digraphs which, from what I 
can tell, seem to be single code-points that represent more than one 
glyph/character/grapheme:

http://en.wikipedia.org/wiki/Digraph_(orthography)#Digraphs_in_Unicode

This page may be helpful too:
http://en.wikipedia.org/wiki/Precomposed_character

Jan 13 2011

Daniel Gibson <metalcaedes gmail.com> writes:

Am 14.01.2011 08:00, schrieb Nick Sabalausky:
 "Nick Sabalausky"<a a.a>  wrote in message
 news:igori7$1ovh$1 digitalmars.com...
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:igoqrm$1n5r$1 digitalmars.com...
 On 1/13/11 10:26 PM, Nick Sabalausky wrote:
 [snip]
 [ 'f', {u with the umlaut}, 'n', 'f' ]

 Or:

 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.

 Note that while some characters exist in pre-combined form (such as the
 {u
 with the umlaut} above), legend has it there are others that can only be
 represented using a combining character.

 It's also my understanding, though I'm not certain, that sometimes
 multiple
 combining characters can be used together on the same "root" character.

 Thanks. One further question is: in the above example with u-with-umlaut,
 there is one code point that corresponds to the entire combination. Are
 there combinations that do not have a unique code point?

 My understanding is "yes". At least that's what I've heard, and I've never
 heard any claims of "no". I don't know of any specific ones offhand,
 though. Actually, it might be possible to use any combining character with
 any old letter or number (like maybe a 7 with an umlaut), though I'm not
 certain.

 FWIW, the Wikipedia article might help, or at least link to other things
 that might help: http://en.wikipedia.org/wiki/Combining_character

 Michel or spir might have better links though.

 Heh, as if that wasn't bad enough, there's also digraphs which, from what I
 can tell, seem to be single code-points that represent more than one
 glyph/character/grapheme:

 http://en.wikipedia.org/wiki/Digraph_(orthography)#Digraphs_in_Unicode

 This page may be helpful too:
 http://en.wikipedia.org/wiki/Precomposed_character

OMG, this is really fucked up.
Can't we just go back to 8bit charsets like ISO 8859-* etc? :/

Jan 14 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 14 Jan 2011 01:44:19 -0500, Nick Sabalausky <a a.a> wrote:

 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:igoqrm$1n5r$1 digitalmars.com...
 On 1/13/11 10:26 PM, Nick Sabalausky wrote:
 [snip]
 [ 'f', {u with the umlaut}, 'n', 'f' ]

 Or:

 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.

 Note that while some characters exist in pre-combined form (such as the
 {u
 with the umlaut} above), legend has it there are others that can only  
 be
 represented using a combining character.

 It's also my understanding, though I'm not certain, that sometimes
 multiple
 combining characters can be used together on the same "root" character.

 Thanks. One further question is: in the above example with  
 u-with-umlaut,
 there is one code point that corresponds to the entire combination. Are
 there combinations that do not have a unique code point?

 My understanding is "yes". At least that's what I've heard, and I've  
 never
 heard any claims of "no". I don't know of any specific ones offhand,  
 though.
 Actually, it might be possible to use any combining character with any  
 old
 letter or number (like maybe a 7 with an umlaut), though I'm not certain.

 FWIW, the Wikipedia article might help, or at least link to other things
 that might help: http://en.wikipedia.org/wiki/Combining_character

http://en.wikipedia.org/wiki/Unicode_normalization

Linked from that page, the normalization process is probably something we  
need to look at.  Using decomposed canonical form would mean we need more  
state than just which code unit we are on, plus it makes it more likely  
that a match will be found with part of a grapheme (spir or Michel brought  
it up earlier).  So I think the correct case is to use composed canonical  
form.  This is after just reading that page, so maybe I'm missing  
something.

Non-composable combinations would be a problem.  The string range is  
formed on the basis that the element type is a dchar.  If there are  
combinations that cannot be composed into a single dchar, then the element  
type has to be a dchar array (or some other type which contains all the  
info).  The other option is to simply leave them decomposed.  Then you  
risk things like partial matches.

I'm leaning towards a solution like this: While iterating a string, it  
should output dchars in normalized composed form.  But a specialized  
comparison function should be used when doing things like searches or  
regex, because it might not be possible to compose two combining  
characters.

The drawback to this is that a dchar might not be able to represent a  
grapheme (only if it cannot be composed), but I think it's too much of a  
hit in complexity and performance to make the element type of a string  
larger than a dchar.
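
To make the trade-off concrete, here is a minimal sketch. It assumes a current Phobos in which std.uni provides normalize with the NFC form (this was not available when the thread was written), and sameText is a hypothetical helper name, not an existing API. It shows the two spellings of "fünf" comparing unequal as raw strings but equal after composition, the partial-match risk of decomposed text, and a combination that cannot be composed into a single dchar:

```d
import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.uni : normalize, NFC;

/// Hypothetical helper: compare two strings after normalizing both to NFC.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed   = "f\u00FCnf";   // 'ü' as one precomposed code point
    string decomposed = "fu\u0308nf";  // 'u' followed by a combining diaeresis

    assert(composed != decomposed);          // raw comparison sees two different strings
    assert(sameText(composed, decomposed));  // equal once both are in composed form

    // The partial-match risk of leaving text decomposed: a search for
    // plain 'u' hits the first half of the grapheme.
    assert(decomposed.canFind('u'));
    assert(!composed.canFind('u'));

    // '7' + diaeresis has no precomposed form, so NFC keeps two code points.
    assert(normalize!NFC("7\u0308").walkLength == 2);
}
```

The last assertion is the case the drawback above refers to: even in composed form, one dchar is not always one grapheme.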

Those who wish to work with a more comprehensive string type can use a  
more complex string type such as the one created by spir.

Does that sound reasonable?

-Steve

Jan 14 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday 14 January 2011 04:47:59 Steven Schveighoffer wrote:
 On Fri, 14 Jan 2011 01:44:19 -0500, Nick Sabalausky <a a.a> wrote:
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:igoqrm$1n5r$1 digitalmars.com...
 
 On 1/13/11 10:26 PM, Nick Sabalausky wrote:
 [snip]
 
 [ 'f', {u with the umlaut}, 'n', 'f' ]
 
 Or:
 
 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
 
 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.
 
 Note that while some characters exist in pre-combined form (such as the
 {u
 with the umlaut} above), legend has it there are others that can only
 be
 represented using a combining character.
 
 It's also my understanding, though I'm not certain, that sometimes
 multiple
 combining characters can be used together on the same "root" character.

 
 Thanks. One further question is: in the above example with
 u-with-umlaut,
 there is one code point that corresponds to the entire combination. Are
 there combinations that do not have a unique code point?

 
 My understanding is "yes". At least that's what I've heard, and I've
 never
 heard any claims of "no". I don't know of any specific ones offhand,
 though.
 Actually, it might be possible to use any combining character with any
 old
 letter or number (like maybe a 7 with an umlaut), though I'm not certain.
 
 FWIW, the Wikipedia article might help, or at least link to other things
 that might help: http://en.wikipedia.org/wiki/Combining_character

 
 http://en.wikipedia.org/wiki/Unicode_normalization
 
 Linked from that page, the normalization process is probably something we
 need to look at.  Using decomposed canonical form would mean we need more
 state than just which code unit we are on, plus it makes it more likely
 that a match will be found with part of a grapheme (spir or Michel brought
 it up earlier).  So I think the correct case is to use composed canonical
 form.  This is after just reading that page, so maybe I'm missing
 something.
 
 Non-composable combinations would be a problem.  The string range is
 formed on the basis that the element type is a dchar.  If there are
 combinations that cannot be composed into a single dchar, then the element
 type has to be a dchar array (or some other type which contains all the
 info).  The other option is to simply leave them decomposed.  Then you
 risk things like partial matches.
 
 I'm leaning towards a solution like this: While iterating a string, it
 should output dchars in normalized composed form.  But a specialized
 comparison function should be used when doing things like searches or
 regex, because it might not be possible to compose two combining
 characters.
 
 The drawback to this is that a dchar might not be able to represent a
 grapheme (only if it cannot be composed), but I think it's too much of a
 hit in complexity and performance to make the element type of a string
 larger than a dchar.

Well, there's plenty in std.string that already deals in strings rather than 
dchar, and for the most part, any case where you couldn't fit a grapheme in a 
dchar could be covered by using a string.

 Those who wish to work with a more comprehensive string type can use a
 more complex string type such as the one created by spir.
 
 Does that sound reasonable?

We really should have something along those lines, it seems. From what little
_I_ know, the basic approach that you suggest seems like the correct one, but
perhaps someone more knowledgeable will be able to come up with a reason why
it's not a good idea. Certainly, I think that any solution that I'd come up
with would be similar to what you're suggesting.

- Jonathan M Davis

Jan 14 2011

spir <denis.spir gmail.com> writes:

On 01/14/2011 07:44 AM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:igoqrm$1n5r$1 digitalmars.com...
 On 1/13/11 10:26 PM, Nick Sabalausky wrote:
 [snip]
 [ 'f', {u with the umlaut}, 'n', 'f' ]

 Or:

 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.

 Note that while some characters exist in pre-combined form (such as the
 {u
 with the umlaut} above), legend has it there are others that can only be
 represented using a combining character.

 It's also my understanding, though I'm not certain, that sometimes
 multiple
 combining characters can be used together on the same "root" character.

 Thanks. One further question is: in the above example with u-with-umlaut,
 there is one code point that corresponds to the entire combination. Are
 there combinations that do not have a unique code point?

 My understanding is "yes". At least that's what I've heard, and I've never
 heard any claims of "no". I don't know of any specific ones offhand, though.
 Actually, it might be possible to use any combining character with any old
 letter or number (like maybe a 7 with an umlaut), though I'm not certain.

The problem is then whether a font knows how to display it. My usual 
fonts (DejaVu series, pretty good with Unicode) show:
	7̈
meaning they do not know how to combine digits with diacritics (they do 
it well with other rather strange combinations.)

But one of the relevant advantages of decomposed forms is that when 
a font doesn't know the character, it can still show at least the 
component marks, here '7' & '¨'. Which is better than nothing for a user 
who knows the writing system. If I try to display, for instance, a 
_precomposed_ syllable from a language my font does not know, I will get 
instead either a little square with the code point written inside in 
minuscule digits, or a placeholder like an inverse-video "?".


denis
_________________
vita es estrany
spir.wikidot.com

Jan 14 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-14 01:44:19 -0500, "Nick Sabalausky" <a a.a> said:

 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:igoqrm$1n5r$1 digitalmars.com...
 Thanks. One further question is: in the above example with u-with-umlaut,
 there is one code point that corresponds to the entire combination. Are
 there combinations that do not have a unique code point?

 
 My understanding is "yes". At least that's what I've heard, and I've never
 heard any claims of "no". I don't know of any specific ones offhand, though.
 Actually, it might be possible to use any combining character with any old
 letter or number (like maybe a 7 with an umlaut), though I'm not certain.

Correct, there's a lot of combinations with no pre-combined form. This 
should be no surprise given that you can apply any number of combining 
marks to any character.

	mythical 7 with an umlaut: 7̈
	mythical 7 with umlaut, ring above, and acute accent: 7̈̊́

I can't guarantee your news reader will display the above correctly, but 
it works as described in mine (Unison on Mac OS X). In fact, it should 
work in all Cocoa-based applications. This probably includes iOS-based 
devices too, but I haven't tested there.
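
For anyone who wants to check this in code rather than in a news reader, a small sketch follows. It assumes a current Phobos where std.uni offers byGrapheme (not available at the time of this thread); graphemeCount is a hypothetical helper name. It counts the mythical 7 at each level and shows that adding further combining marks never starts a new grapheme:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Number of user-perceived characters (grapheme clusters) in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // '7' + combining diaeresis, ring above and acute accent
    string seven = "7\u0308\u030A\u0301";
    assert(seven.length == 7);          // UTF-8 code units
    assert(seven.walkLength == 4);      // code points
    assert(graphemeCount(seven) == 1);  // one user-perceived character

    // Piling on more marks never starts a new grapheme.
    foreach (i; 0 .. 5)
    {
        seven ~= "\u0301";
        assert(graphemeCount(seven) == 1);
    }
}
```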


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 14 2011

Gianluigi Rubino <gianluigi{ }grsoft.org> writes:

Michel Fortin <michel.fortin michelf.com> wrote:

 
 	mythical 7 with an umlaut: 7̈
 	mythical 7 with umlaut, ring above, and acute accent: 7̈̊́
 
 I can't guarantee your news reader will display the above correctly, but
 it works as described in mine (Unison on Mac OS X). In fact, it should
 work in all Cocoa-based applications. This probably includes iOS-based
 devices too, but I haven't tested there.
 

All the examples given so far worked fine on my iPhone.

Gianluigi

Jan 14 2011

spir <denis.spir gmail.com> writes:

On 01/14/2011 07:33 AM, Andrei Alexandrescu wrote:
 Thanks. One further question is: in the above example with
 u-with-umlaut, there is one code point that corresponds to the entire
 combination. Are there combinations that do not have a unique code point?

See my previous follow-up to Nick's explanation. But the answer is yes, 
not only for usual characters, but also because a user is, 
theoretically and practically, totally free to combine base and combining 
codes, even to invent characters. The only limit is that fonts will not 
know how to display improbable combinations.
(See also my presentation text, which shows an example of dots below and 
above Greek letters.)

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 14 2011

Daniel Gibson <metalcaedes gmail.com> writes:

Am 14.01.2011 07:26, schrieb Nick Sabalausky:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:igoj6s$17r6$1 digitalmars.com...
 I'm not so sure about that. What do you base this assessment on? Denis
 wrote a library that according to him does grapheme-related stuff nobody
 else does. So apparently graphemes is not what people care about (although
 it might be what they should care about).

 It's what they want, they just don't know it.

 Graphemes are what many people *think* code points are.

Agreed. Up until spir mentioned graphemes in this newsgroup I always 
thought that one Unicode code point == one character on the screen.

I guess in the majority of use cases you want to operate on user 
perceived characters.

Jan 14 2011

spir <denis.spir gmail.com> writes:

On 01/14/2011 01:52 PM, Daniel Gibson wrote:
 Am 14.01.2011 07:26, schrieb Nick Sabalausky:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org> wrote in message
 news:igoj6s$17r6$1 digitalmars.com...
 I'm not so sure about that. What do you base this assessment on? Denis
 wrote a library that according to him does grapheme-related stuff nobody
 else does. So apparently graphemes is not what people care about
 (although
 it might be what they should care about).

 It's what they want, they just don't know it.

 Graphemes are what many people *think* code points are.

 Agreed. Up until spir mentioned graphemes in this newsgroup I always
 thought that one Unicode code point == one character on the screen.

 I guess in the majority of use cases you want to operate on user
 perceived characters.

That's what makes sense for the user in 99.9% of cases, thus that's what 
makes sense for the programmer, thus that's what makes sense for the 
language/type/lib designer.

denis
_________________
vita es estrany
spir.wikidot.com

Jan 14 2011

spir <denis.spir gmail.com> writes:

On 01/14/2011 07:26 AM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:igoj6s$17r6$1 digitalmars.com...
 I'm not so sure about that. What do you base this assessment on? Denis
 wrote a library that according to him does grapheme-related stuff nobody
 else does. So apparently graphemes is not what people care about (although
 it might be what they should care about).

 It's what they want, they just don't know it.

 Graphemes are what many people *think* code points are.

 This might be a good time to see whether we need to address graphemes
 systematically. Could you please post a few links that would educate me
 and others in the mysteries of combining characters?

 Maybe someone else has a link to an explanation (I don't), but it's
 basically just this:

If anyone finds a pointer to such an explanation, bravo, and thank you. 
(You will certainly not find it in Unicode literature, for instance.)
Nick's explanation below is good and concise. (Just 2 notes added.)

 Three levels of abstraction from lowest to highest:
 - Code Unit (ie, encoding)
 - Code Point (ie, what Unicode assigns distinct numbers to)
 - Grapheme (ie, what we think of as a "character")

 A code-point can be made up of one or more code-units. Likewise, a grapheme
 can be made up of one or more code-points.

 There are (at least) two types of code points:

 - Regular ones, such as letters, digits, and punctuation.

 - "Combining Characters", such as accent marks (or if you're familiar with
 Japanese, the little things in the upper-right corner that change an "s" to
 a "z" or an "h" to a "p". Or like German's umlaut - the two dots above a
 vowel). Ie, things that are not characters in their own right, but merely
 modify other characters. These can often (always?) be thought of as being
 like overlays.

You can also say there are 2 kinds of characters: simple like "u" & 
composite like "ü" or "ṵ̈̈". The former are coded with a single (base) code, 
the latter with one (rarely more) base code and an arbitrary number of 
combining codes.

For a majority of _common_ characters made of 2 or 3 codes (western 
language letters, korean Hangul syllables,...), precombined codes have 
been added to the set. Thus, they can be coded with a single code like 
simple characters.

[Also note, to keep things from being too simple ;-), some (few) combining 
codes called "prepend" come _before_ the base in raw code sequence...]

 If a code point representing a "combining character" exists in a string,
 then instead of being displayed as a character it merely modifies whatever
 code-point came before it.

 So, for instance, if you want to store the German word for five (in all
 lower-case), there are two ways to do it:

 [ 'f', {u with the umlaut}, 'n', 'f' ]

 Or:

 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

Note: the second form is the base form for Unicode. There are reasons for 
having chosen it (see my text), and reasons why UCS does not and simply 
cannot propose precomposed codes for all possible composite characters.
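
The three levels can be measured directly on the two spellings of "fünf". This is only a sketch, assuming a current Phobos with std.uni.byGrapheme; the Sizes struct and measure function are hypothetical names made up for the illustration:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Sizes of the same text at the three levels discussed above.
struct Sizes
{
    size_t codeUnits, codePoints, graphemes;
}

/// Measure a UTF-8 string at each level of abstraction.
Sizes measure(string s)
{
    return Sizes(s.length, s.walkLength, s.byGrapheme.walkLength);
}

void main()
{
    // Precomposed: 'ü' is one code point, two UTF-8 code units.
    assert(measure("f\u00FCnf") == Sizes(5, 4, 4));

    // Decomposed: 'u' + combining diaeresis is two code points, three code units.
    assert(measure("fu\u0308nf") == Sizes(6, 5, 4));
}
```

Only the grapheme count agrees between the two forms, which is the level a reader would call "four letters".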

 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.

 Note that while some characters exist in pre-combined form (such as the {u
 with the umlaut} above), legend has it there are others that can only be
 represented using a combining character.

 It's also my understanding, though I'm not certain, that sometimes multiple
 combining characters can be used together on the same "root" character.

There is no logical limit, only practical ones, such as how to display 3 
diacritics above the same base. You can invent a script for a mythical 
folk's language if you like :-)
Also, some examples of real language characters (Hebrew, IIRC) in 
Unicode test data sets hold up to 8 codes.

 Caveat: There may very well be further complications that I'm not aware of.
 Heck, knowing Unicode, there probably are.

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 14 2011

"Nick Sabalausky" <a a.a> writes:

"spir" <denis.spir gmail.com> wrote in message 
news:mailman.619.1295012086.4748.digitalmars-d puremagic.com...
 If anyone finds a pointer to such an explanation, bravo, and thank you. 
 (You will certainly not find it in Unicode literature, for instance.)
 Nick's explanation below is good and concise. (Just 2 notes added.)

Yea, most Unicode explanations seem to talk all about "code-units vs 
code-points" and then they'll just have a brief note like "There's also 
other things like digraphs and combining codes." And that'll be all they 
mention.

You're right about the Unicode literature. It's the usual standards-body 
documentation, same as W3C: "Instead of only some people understanding how 
this works, let's encode the documentation in legalese (and have twenty 
only-slightly-different versions) to make sure that nobody understands how 
it works."

 You can also say there are 2 kinds of characters: simple like "u" & 
 composite "ü" or "ṵ̈̈". The former are coded with a single (base) code, 
 the latter with one (rarely more) base codes and an arbitrary number of 
 combining codes.

Couple questions about the "more than one base codes":

- Do you know an example offhand?

- Does that mean like a ligature where the base codes form a single glyph, 
or does it mean that the combining code either spans or operates over 
multiple glyphs? Or can it go either way?

 For a majority of _common_ characters made of 2 or 3 codes (western 
 language letters, korean Hangul syllables,...), precombined codes have 
 been added to the set. Thus, they can be coded with a single code like 
 simple characters.

Out of curiosity, how do decomposed Hangul characters work? (Or do you 
know?) Not actually knowing any Korean, my understanding is that they're a 
set of 1 to 4 phonetic glyphs that are then combined into one glyph. So, is 
it like a series of base codes that automatically combine, or are there 
combining characters involved?

 [Also note, to avoid things be too simple ;-), some (few) combining codes 
 called "prepend" come _before_ the base in raw code sequence...]

Fun!

Jan 14 2011

spir <denis.spir gmail.com> writes:

On 01/14/2011 08:20 PM, Nick Sabalausky wrote:
 "spir"<denis.spir gmail.com>  wrote in message
 news:mailman.619.1295012086.4748.digitalmars-d puremagic.com...
 If anyone finds a pointer to such an explanation, bravo, and thank you.
 (You will certainly not find it in Unicode literature, for instance.)
 Nick's explanation below is good and concise. (Just 2 notes added.)

 Yea, most Unicode explanations seem to talk all about "code-units vs
 code-points" and then they'll just have a brief note like "There's also
 other things like digraphs and combining codes." And that'll be all they
 mention.

 You're right about the Unicode literature. It's the usual standards-body
 documentation, same as W3C: "Instead of only some people understanding how
 this works, let's encode the documentation in legalese (and have twenty
 only-slightly-different versions) to make sure that nobody understands how
 it works."

If anyone is interested, ICU's documentation is far more readable (and 
intended for programmers). ICU is *the* reference library for dealing 
with Unicode (an IBM open-source product, with C/C++/Java interfaces), 
used by many other products in the background.
ICU: http://site.icu-project.org/
user guide: http://userguide.icu-project.org/
section about text segmentation: 
http://userguide.icu-project.org/boundaryanalysis

Note that just like Unicode, they consider forming graphemes (grouping 
codes into character representations) a simple particular case of text 
segmentation, which they call "boundary analysis" (but they have the 
nice idea to use "character" instead of "grapheme").

The only mention I found in ICU's doc of the issue we have discussed 
here at length is (at http://userguide.icu-project.org/strings):
"Handling Lengths, Indexes, and Offsets in Strings

The length of a string and all indexes and offsets related to the string 
are always counted in terms of UChar code units, not in terms of UChar32 
code points. (This is the same as in common C library functions that use 
char * strings with multi-byte encodings.)

Often, a user thinks of a "character" as a complete unit in a language, 
like an 'Ä', while it may be represented with multiple Unicode code 
points including a base character and combining marks. (See the Unicode 
standard for details.) This often requires users to index and pass 
strings (UnicodeString or UChar *) with multiple code units or code 
points. It cannot be done with single-integer character types. Indexing 
of such "characters" is done with the BreakIterator class (in C: ubrk_ 
functions).

Even with such "higher-level" indexing functions, the actual index 
values will be expressed in terms of UChar code units. When more than 
one code unit is used at a time, the index value changes by more than 
one at a time. [...]"

(ICU's UChar are like D wchar.)
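
The same convention, grapheme boundaries reported as code-unit indexes, can be sketched in D. This assumes a current Phobos where std.uni.graphemeStride exists (it did not at the time of this thread); graphemeStarts is a hypothetical helper, and it only mimics the idea behind ICU's BreakIterator, not its API:

```d
import std.uni : graphemeStride;

/// Start index, in UTF-16 code units, of each grapheme in s.
size_t[] graphemeStarts(wstring s)
{
    size_t[] starts;
    // graphemeStride returns how many code units the grapheme at i occupies,
    // so the index advances by more than one for multi-code-point graphemes.
    for (size_t i = 0; i < s.length; i += graphemeStride(s, i))
        starts ~= i;
    return starts;
}

void main()
{
    // 'A' + combining diaeresis (2 code units), then 'b'
    assert(graphemeStarts("A\u0308b"w) == [size_t(0), 2]);
}
```

As in the ICU text quoted above, the "character" index jumps from 0 to 2: the positions stay in code units even though the iteration is per grapheme.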

 You can also say there are 2 kinds of characters: simple like "u" &
 composite "ü" or "ṵ̈̈". The former are coded with a single (base) code,
 the latter with one (rarely more) base codes and an arbitrary number of
 combining codes.

 Couple questions about the "more than one base codes":

 - Do you know an example offhand?

No. I know this only from its being mentioned in documentation. Unless 
we consider (see below) L jamo as base codes.

 - Does that mean like a ligature where the base codes form a single glyph,
 or does it mean that the combining code either spans or operates over
 multiple glyphs? Or can it go either way?

IIRC examples like "ij" in Dutch are only considered "compatibility 
equivalent" to the corresponding ligatures, just like e.g. "ss" for "ß" in 
German. Meaning they should not be considered equal by default (this 
would be an additional feature, and language- and app-dependent). Unlike 
base "e" + combining "^", which really == "ê".
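
The difference between the two kinds of equivalence can be sketched with the normalization forms, assuming a current Phobos where std.uni.normalize supports NFC and NFKC; canonicallyEqual and compatEqual are hypothetical helper names. One caveat to the above: as far as I know, "ß" has no decomposition of either kind in Unicode (its link to "ss" comes from case mapping and collation), so the last assertion treats it as unchanged under NFKC.

```d
import std.uni : normalize, NFC, NFKC;

/// Equal under canonical equivalence (both sides composed with NFC).
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

/// Equal under the looser compatibility equivalence (NFKC).
bool compatEqual(string a, string b)
{
    return normalize!NFKC(a) == normalize!NFKC(b);
}

void main()
{
    // Canonical: 'e' + combining circumflex is the same text as 'ê'.
    assert(canonicallyEqual("e\u0302", "\u00EA"));

    // Compatibility only: the ligature 'ĳ' (U+0133) versus the letters "ij".
    assert(!canonicallyEqual("\u0133", "ij"));
    assert(compatEqual("\u0133", "ij"));

    // 'ß' has no decomposition, so even NFKC leaves it alone.
    assert(normalize!NFKC("\u00DF") == "\u00DF");
}
```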

 For a majority of _common_ characters made of 2 or 3 codes (western
 language letters, korean Hangul syllables,...), precombined codes have
 been added to the set. Thus, they can be coded with a single code like
 simple characters.

 Out of curiosity, how do decomposed Hangul characters work? (Or do you
 know?) Not actually knowing any Korean, my understanding is that they're a
 set of 1 to 4 phonetic glyphs that are then combined into one glyph. So, is
 it like a series of base codes that automatically combine, or are there
 combining characters involved?

I know nothing about Korean language except what I studied about its 
scripting system for Unicode algorithms (but one can also code said 
algorithm blindly). See http://en.wikipedia.org/wiki/Hangul and about 
Hangul in Unicode 
http://en.wikipedia.org/wiki/Korean_language_and_computers. What I 
understand (beware, it's just wild deduction) is that there are 3 kinds of 
"jamo" scripting marks (noted L, V, T) that can combine into syllabic 
"graphemes", respectively in first, medial, and last place. These marks indeed 
somehow correspond to vocalic or consonantal phonemes.
In Unicode, in addition to such jamo, which are simple marks (like base 
letters and diacritics in Latin-based languages), there are precombined 
codes for LV and LVT combinations (like for "ä" or "û"). We could thus 
think that Hangul syllables are limited to 3 jamo.
But: according to Unicode's official "grapheme cluster boundary" algorithm 
(read: how to group code points into characters) 
(http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries), codes 
for L jamo can also be followed by, _and_ should be combined with, other 
L, V, LV or LVT codes. Similarly, LV or V should be combined with V or T, 
and LVT or T with T. (Seems logical.) So, I do not know how complicated 
a Hangul syllable can be in practice or in theory.
If there can be in practice whole syllables following other schemes than 
L / LV / LVT, then this is another example of real-language whole 
characters that cannot be coded by a single code point.
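
For the regular L / LV / LVT case, the precomposed code is not looked up but computed. Below is a sketch of that arithmetic using the constants from the Unicode standard's Hangul composition section; composeSyllable is a hypothetical name, it does no range checking, and the grapheme check at the end assumes a current Phobos with std.uni.byGrapheme:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Compose an L jamo, a V jamo and an optional T jamo into one precomposed
/// syllable. The arguments are assumed to be valid conjoining jamo.
dchar composeSyllable(dchar l, dchar v, dchar t = '\0')
{
    enum uint SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    enum uint VCount = 21, TCount = 28;

    immutable uint lIndex = l - LBase;
    immutable uint vIndex = v - VBase;
    // TBase is one below the first T jamo, so index 0 means "no trailing consonant".
    immutable uint tIndex = (t == '\0') ? 0 : t - TBase;

    return cast(dchar)(SBase + (lIndex * VCount + vIndex) * TCount + tIndex);
}

void main()
{
    // U+1112 (L) + U+1161 (V) + U+11AB (T) compose to U+D55C.
    assert(composeSyllable('\u1112', '\u1161', '\u11AB') == '\uD55C');
    // The same L and V without a T compose to U+D558.
    assert(composeSyllable('\u1112', '\u1161') == '\uD558');

    // Left decomposed, the three jamo still form a single grapheme.
    assert("\u1112\u1161\u11AB".byGrapheme.walkLength == 1);
}
```

Sequences that do not fit the L, V, optional T pattern fall outside this formula, which is the open question raised above.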


Denis
_________________
vita es estrany
spir.wikidot.com

Jan 16 2011

Walter Bright <newshound2 digitalmars.com> writes:

Nick Sabalausky wrote:
 Those *both* get rendered exactly the same, and both represent the same 
 four-letter sequence. In the second example, the 'u' and the {umlaut 
 combining character} combine to form one grapheme. The f's and n's just 
 happen to be single-code-point graphemes.

I know some German, and to the best of my knowledge there are zero combining 
characters for it. The umlauts and the ß both have their own code points.

 legend has it there are others that can only be 
 represented using a combining character.

??? I've never seen or heard of any. Not even in the old script that was in 
common use in Germany until after WW2.

Jan 15 2011

spir <denis.spir gmail.com> writes:

On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:

That's forgetting that most of the time people care about graphemes
(user-perceived characters), not code points.

I'm not so sure about that. What do you base this assessment on? Denis
wrote a library that according to him does grapheme-related stuff nobody
else does. So apparently graphemes is not what people care about
(although it might be what they should care about).

I'm aware of that, and I have no definitive answer to the question. The
issue *does* exist --as shown even by trivial examples such as Michel's
below, not corner cases. The actual question is _not_ whether code or
"grapheme" is the proper level of abstraction. To this, the answer is
clear: codes are simply meaningless in 99% of cases. (All historic software
deals with chars, conceptually, but they happen to be coded with single
codes.)
(And what about Objective-C? Why did its designers even bother with that?).

The question is rather: why do we nearly all happily go on ignoring the
issue? My present guess is a combination of factors:

* The issue is masked by the misleading use of "abstract character" in
Unicode literature. "Abstract" is very correct, but they should have
found another term than "character", say "abstract scripting mark". Their
deceptive terminological choice lets most programmers believe that
code points code characters, like in historic charsets.
(Even worse: some docs explicitly state that ICU's notion of character
matches the programming notion of character.)
* Unicode added precomposed codes for a bunch of characters, supposedly for
backward compatibility with said charsets. (But where is the gain? We need
to decode them anyway...) The consequence is, at the pedagogical level,
very bad: most text-producing software (like editors) uses such
precomposed codes when available for a given character. So
programmers can happily go on believing in the code=character myth.
(Note: the gain in space is ridiculous for western text.)
* Most characters that appear in western texts (at least "official"
characters of natural languages) have precomposed forms.
* Programmers can very easily be unaware their code is incorrect: how do
you even notice it in test output?

Thus, practically, programmers may (1) simply not know about the issue (2)
have code that really works in typical use cases for their software (3)
not notice that their code runs incorrectly.
There is also an intermediate situation between (2) & (3), similar to
old problems with previous ASCII-only apps: they work wrongly when used
in a non-english environment, but what can users do, concretely? Most
often, they just have to cope with incorrectness, reinterpret outputs
differently, and/or find workarounds by cheating with the interface.

The responsibility of designers of tools for programmers is, imo,
important. We should first make the issue clear (very difficult, it's
a ubiquitous myth to break down), and propose services that run
correctly in situations where said issue is relevant, here manipulation
of universal text, even if not very efficient at first.
On my side, and about D, I wish that most D programmers (1) are aware of
the problem (2) understand its whys & hows (3) know there is a correct
solution. Then, (4) whether they actually use it is their choice (and I
don't care whether or not they do).

It also supports this:

foreach(i, d; s)
{
writeln("The character in position ", i, " is ", d);
}

where i is the index (might not be sequential)

Well string supports that too, albeit with the nit that you need to
specify dchar.

Except it breaks with combining characters. For instance, take the
string "t̃", which is two code points -- 't' followed by combining tilde
(U+0303) -- and you'll get the following output:

The character in position 0 is t
The character in position 1 is ̃

(Note that the tilde becomes combined with the preceding space
character.)
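
A minimal sketch of the three counting levels at play here, assuming a Phobos release that ships std.uni.byGrapheme (it did not exist when this thread was written). The helper names are made up for illustration only:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, named for this example only.

/// UTF-8 code units: what .length reports for a string.
size_t countCodeUnits(string s)
{
    return s.length;
}

/// Code points: what foreach with a dchar loop variable visits.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

/// Grapheme clusters: roughly the user-perceived characters.
size_t countGraphemes(string s)
{
    return s.byGrapheme.walkLength;
}

unittest
{
    string s = "t\u0303"; // 't' followed by COMBINING TILDE (U+0303)
    assert(countCodeUnits(s) == 3);  // 1 byte for 't', 2 bytes for U+0303
    assert(countCodePoints(s) == 2);
    assert(countGraphemes(s) == 1);
}
```

The foreach(i, d; s) loop above works at the middle level, which is why it reports two "characters" for what a reader sees as one.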

The conception of character that normal people have does not match the
notion of code points when combining characters enter the equation.

This might be a good time to see whether we need to address graphemes
systematically. Could you please post a few links that would educate me
and others in the mysteries of combining characters?

Beware! far too long text.
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction
(the directory above contains the current rough implementation of Text,
plus a bit of its brother package DUnicode)

Thanks,

Andrei

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 14 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 14 Jan 2011 08:14:02 -0500, spir <denis.spir gmail.com> wrote:

 On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:

 That's forgetting that most of the time people care about graphemes
 (user-perceived characters), not code points.

 I'm not so sure about that. What do you base this assessment on? Denis
 wrote a library that according to him does grapheme-related stuff nobody
 else does. So apparently graphemes is not what people care about
 (although it might be what they should care about).

 I'm aware of that, and I have no definitive answer to the question. The  
 issue *does* exist --as shown even by trivial examples such as Michel's  
 below, not corner cases. The actual question is _not_ whether code or  
 "grapheme" is the proper level of abstraction. To this, the answer is  
 clear: codes are simply meaningless in 99% of cases. (All historic software
 deals with chars, conceptually, but they happen to be coded with single
 codes.)
 (And what about Objective-C? Why did its designers even bother with  
 that?).

 The question is rather: why do we nearly all happily go on ignoring the  
 issue? My present guess is a combination of factors:

 * The issue is masked by the misleading use of "abstract character" in  
 Unicode literature. "Abstract" is very correct, but they should have
 found another term than "character", say "abstract scripting mark". Their
 deceptive terminological choice lets most programmers believe that
 code points encode characters, as in historic charsets.
 (Even worse: some docs explicitly state that ICU's notion of character
 matches the programming notion of character.)
 * ICU added precomposed codes for a bunch of characters, supposedly for
 backward compatibility with said charsets. (But where is the gain? We need
 to decode them anyway...) The consequence is, at the pedagogical level,  
 very bad: most text-producing software (like editors) use such  
 precomposed codes when available for a given character. So that  
 programmers can happily go on believing in the code=character myth.  
 (Note: the gain in space is ridiculous for western text.)
 * Most characters that appear in western texts (at least "official"  
 characters of natural languages) have precomposed forms.
 * Programmers can very easily be unaware their code is incorrect: how do  
 you even notice it in test output?

* I don't even know how to make a grapheme that is more than one  
code-unit, let alone more than one code-point :)  Every time I try, I get  
'invalid utf sequence'.

I feel significantly ignorant on this issue, and I'm slowly getting enough  
knowledge to join the discussion, but being a dumb American who only  
speaks English, I have a hard time grasping how this shit all works.

-Steve

Jan 14 2011

spir <denis.spir gmail.com> writes:

On 01/14/2011 02:37 PM, Steven Schveighoffer wrote:
 * I don't even know how to make a grapheme that is more than one
 code-unit, let alone more than one code-point :)  Every time I try, I
 get 'invalid utf sequence'.

 I feel significantly ignorant on this issue, and I'm slowly getting
 enough knowledge to join the discussion, but being a dumb American who
 only speaks English, I have a hard time grasping how this shit all works.

1. See my text at 
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction

2.
     writeln ("A\u0308\u0330");
<A + umlaut (diaeresis) above + tilde below>
If it does not display properly, either set your terminal to UTF* or use 
a more unicode-aware font (eg DejaVu series).
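
To check such a string without relying on the terminal font, one can count the code points inside each grapheme cluster. This is only a sketch: it assumes a Phobos version providing std.uni.byGrapheme, and codePointsPerGrapheme is a hypothetical name:

```d
import std.algorithm : map;
import std.array : array;
import std.uni : byGrapheme;

/// Hypothetical helper: number of code points in each grapheme cluster.
size_t[] codePointsPerGrapheme(string s)
{
    return s.byGrapheme.map!(g => g.length).array;
}

unittest
{
    // 'x', then 'A' + COMBINING DIAERESIS (U+0308)
    // + COMBINING TILDE BELOW (U+0330), then 'y'
    assert(codePointsPerGrapheme("xA\u0308\u0330y") == [size_t(1), 3, 1]);
}
```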

The point is not playing like that with Unicode flexibility. Rather that 
composite characters are just normal thingies in most languages of the 
world. Actually, on this point, English is a rare exception (discarding 
letters imported from foreign languages like French 'à'); to the point 
of being, I guess, the only western language without any diacritic.


Denis
_________________
vita es estrany
spir.wikidot.com

Jan 14 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir gmail.com> wrote:

On 01/14/2011 02:37 PM, Steven Schveighoffer wrote:
* I don't even know how to make a grapheme that is more than one
code-unit, let alone more than one code-point :) Every time I try, I
get 'invalid utf sequence'.

I feel significantly ignorant on this issue, and I'm slowly getting
enough knowledge to join the discussion, but being a dumb American who
only speaks English, I have a hard time grasping how this shit all
works.

1. See my text at
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction

I can't read that document, it's black background with super-dark-grey
text.

2.
writeln ("A\u0308\u0330");
<A + umlaut (diaeresis) above + tilde below>
If it does not display properly, either set your terminal to UTF* or use
a more unicode-aware font (eg DejaVu series).

OK, I'll have to remember this so I can use it to test my string type ;)

The point is not playing like that with Unicode flexibility. Rather that
composite characters are just normal thingies in most languages of the
world. Actually, on this point, English is a rare exception (discarding
letters imported from foreign languages like French 'à'); to the point
of being, I guess, the only western language without any diacritic.

Is it common to have multiple modifiers on a single character? The
problem I see with using decomposed canonical form for strings is that we
would have to return a dchar[] for each 'element', which severely
complicates code that, for instance, only expects to handle English.

I was hoping to lazily transform a string into its composed canonical
form, allowing the (hopefully rare) exception when a composed character
does not exist. My thinking was that this at least gives a useful string
representation for 90% of usages, leaving the remaining 10% of usages to
find a more complex representation (like your Text type). If we only get
like 20% or 30% there by making dchar the element type, then we haven't
made it useful enough.

Either way, we need a string type that can be compared canonically for
things like searches or opEquals.

-Steve

Jan 14 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir gmail.com> wrote:
 
 The point is not playing like that with Unicode flexibility. Rather 
 that  composite characters are just normal thingies in most languages 
 of the world. Actually, on this point, English is a rare exception 
 (discarding letters imported from foreign languages like French 'à'); 
 to the point of being, I guess, the only western language without any 
 diacritic.

 
 Is it common to have multiple modifiers on a single character?

Not to my knowledge. But I rarely deal with non-Latin texts; there are 
probably some scripts out there that take advantage of this.


 The  problem I see with using decomposed canonical form for strings is 
 that we  would have to return a dchar[] for each 'element', which 
 severely  complicates code that, for instance, only expects to handle 
 English.

Actually, returning a sliced char[] or wchar[] could also be valid. 
User-perceived characters are basically a substring of one or more code 
points. I'm not sure it complicates that much the semantics of the 
language -- what's complicated about writing str.front == "a" instead 
of str.front == 'a'? -- although it probably would complicate the 
generated code and make it a little slower.

In the case of NSString in Cocoa, you can only access the 'characters' 
in their UTF-16 form. But everything from comparison to search for 
substring is done using graphemes. It's like they implemented 
specialized Unicode-aware algorithms for these functions. There's no 
genericness about how it handles graphemes.

I'm not sure yet about what would be the right approach for D.


 I was hoping to lazily transform a string into its composed canonical  
 form, allowing the (hopefully rare) exception when a composed character 
  does not exist.  My thinking was that this at least gives a useful 
 string  representation for 90% of usages, leaving the remaining 10% of 
 usages to  find a more complex representation (like your Text type).  
 If we only get  like 20% or 30% there by making dchar the element type, 
 then we haven't  made it useful enough.
 
 Either way, we need a string type that can be compared canonically for  
 things like searches or opEquals.

I wonder if normalized string comparison shouldn't be built directly in 
the char[] wchar[] and dchar[] types instead. Also bring the idea above 
that iterating on a string would yield graphemes as char[] and this 
code would work perfectly irrespective of whether you used combining 
characters:

	foreach (grapheme; "exposé") {
		if (grapheme == "é")
			break;
	}

I think a good standard to evaluate our handling of Unicode is to see 
how easy it is to do things the right way. In the above, foreach would 
slice the string grapheme by grapheme, and the == operator would 
perform a normalized comparison. While it works correctly, it's 
probably not the most efficient way to do things, however.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 14 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:

 On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir gmail.com> wrote:

 The point is not playing like that with Unicode flexibility. Rather  
 that  composite characters are just normal thingies in most languages  
 of the world. Actually, on this point, English is a rare exception
 (discarding letters imported from foreign languages like French 'à');
 to the point of being, I guess, the only western language without
 any diacritic.

  Is it common to have multiple modifiers on a single character?

 Not to my knowledge. But I rarely deal with non-Latin texts; there are
 probably some scripts out there that take advantage of this.


 The  problem I see with using decomposed canonical form for strings is  
 that we  would have to return a dchar[] for each 'element', which  
 severely  complicates code that, for instance, only expects to handle  
 English.

 Actually, returning a sliced char[] or wchar[] could also be valid.  
 User-perceived characters are basically a substring of one or more code  
 points. I'm not sure it complicates that much the semantics of the  
 language -- what's complicated about writing str.front == "a" instead of  
 str.front == 'a'? -- although it probably would complicate the generated  
 code and make it a little slower.

Hm... this pushes the normalization outside the type, and into the  
algorithms (such as find).  I was hoping to avoid that.  I think I can  
come up with an algorithm that normalizes into canonical form as it  
iterates.  It just might return part of a grapheme if the grapheme cannot  
be composed.

I do think that we could make a byGrapheme member to aid in this:

foreach(grapheme; s.byGrapheme) // grapheme is a substring that contains  
one composed grapheme.
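
For illustration, here is one way such substrings could be produced with what later Phobos versions offer. It is a sketch, not the proposed string type: graphemeSlices is a hypothetical helper built on std.uni.graphemeStride, and it does no normalization at all:

```d
import std.uni : graphemeStride;

/// Hypothetical helper: split s into slices, one per grapheme cluster.
/// Each slice aliases the original string; nothing is copied or normalized.
string[] graphemeSlices(string s)
{
    string[] parts;
    size_t i = 0;
    while (i < s.length)
    {
        immutable n = graphemeStride(s, i); // code units in this cluster
        parts ~= s[i .. i + n];
        i += n;
    }
    return parts;
}

unittest
{
    // Decomposed 'e' + acute accent, then 't' + tilde: two clusters.
    assert(graphemeSlices("e\u0301t\u0303") == ["e\u0301", "t\u0303"]);
    // A precomposed code point is a single cluster as well.
    assert(graphemeSlices("\u00E9") == ["\u00E9"]);
}
```

Since the slices are plain arrays, "e\u0301" and "\u00E9" still compare as different here; making them equal needs a normalizing comparison on top.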

 In the case of NSString in Cocoa, you can only access the 'characters'  
 in their UTF-16 form. But everything from comparison to search for  
 substring is done using graphemes. It's like they implemented  
 specialized Unicode-aware algorithms for these functions. There's no  
 genericness about how it handles graphemes.

 I'm not sure yet about what would be the right approach for D.

I hope we can use generic versions, so the type itself handles the  
conversions.  That makes any algorithm using the string range correct.

 I was hoping to lazily transform a string into its composed canonical   
 form, allowing the (hopefully rare) exception when a composed character  
  does not exist.  My thinking was that this at least gives a useful  
 string  representation for 90% of usages, leaving the remaining 10% of  
 usages to  find a more complex representation (like your Text type).   
 If we only get  like 20% or 30% there by making dchar the element type,  
 then we haven't  made it useful enough.
  Either way, we need a string type that can be compared canonically  
 for  things like searches or opEquals.

 I wonder if normalized string comparison shouldn't be built directly in  
 the char[] wchar[] and dchar[] types instead.

No, in my vision of how strings should be typed, char[] is an array, not a  
string.  It should be treated like an array of code-units, where two forms  
that create the same grapheme are considered different.

 Also bring the idea above that iterating on a string would yield  
 graphemes as char[] and this code would work perfectly irrespective of  
 whether you used combining characters:

 	foreach (grapheme; "exposé") {
 		if (grapheme == "é")
 			break;
 	}

 I think a good standard to evaluate our handling of Unicode is to see  
 how easy it is to do things the right way. In the above, foreach would  
 slice the string grapheme by grapheme, and the == operator would perform  
 a normalized comparison. While it works correctly, it's probably not the  
 most efficient way to do things, however.

I think this is a good alternative, but I'd rather not impose this on  
people like myself who deal mostly with English.  I think this should be  
possible to do with wrapper types or intermediate ranges which have  
graphemes as elements (per my suggestion above).

Does this sound reasonable?

-Steve

Jan 15 2011

Lutger Blijdestijn <lutger.blijdestijn gmail.com> writes:

Steven Schveighoffer wrote:

...
 I think a good standard to evaluate our handling of Unicode is to see
 how easy it is to do things the right way. In the above, foreach would
 slice the string grapheme by grapheme, and the == operator would perform
 a normalized comparison. While it works correctly, it's probably not the
 most efficient way to do things, however.

 
 I think this is a good alternative, but I'd rather not impose this on
 people like myself who deal mostly with English.  I think this should be
 possible to do with wrapper types or intermediate ranges which have
 graphemes as elements (per my suggestion above).
 
 Does this sound reasonable?
 
 -Steve

If it's a matter of choosing which is the 'default' range, I'd think proper 
Unicode handling is more reasonable than catering for English / ASCII only. 
Especially since this is already the case in Phobos string algorithms.

Jan 15 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn  
<lutger.blijdestijn gmail.com> wrote:

 Steven Schveighoffer wrote:

 ...
 I think a good standard to evaluate our handling of Unicode is to see
 how easy it is to do things the right way. In the above, foreach would
 slice the string grapheme by grapheme, and the == operator would  
 perform
 a normalized comparison. While it works correctly, it's probably not  
 the
 most efficient way to do things, however.

 I think this is a good alternative, but I'd rather not impose this on
 people like myself who deal mostly with English.  I think this should be
 possible to do with wrapper types or intermediate ranges which have
 graphemes as elements (per my suggestion above).

 Does this sound reasonable?

 -Steve

 If it's a matter of choosing which is the 'default' range, I'd think proper
 Unicode handling is more reasonable than catering for English / ASCII only.
 Especially since this is already the case in Phobos string algorithms.

English and (if I understand correctly) most other languages.  Any  
language which can be built from composable graphemes would work.  And in  
fact, ones that use some graphemes that cannot be composed will also work  
to some degree (for example, opEquals).

What I'm proposing (or think I'm proposing) is not exactly catering to  
English and ASCII, what I'm proposing is simply not catering to more  
complex languages such as Hebrew and Arabic.  What I'm trying to find is a  
middle ground where most languages work, and the code is simple and  
efficient, with possibilities to jump down to lower levels for performance  
(i.e. switch to char[] when you know ASCII is all you are using) or jump  
up to full unicode when necessary.

Essentially, we would have three levels of types:

char[], wchar[], dchar[] -- Considered to be arrays in every way.
string_t!T (string, wstring, dstring) -- Specialized string types that do  
normalization to dchars, but do not handle perfectly all graphemes.  Works  
with any algorithm that deals with bidirectional ranges.  This is the  
default string type, and the type for string literals.  Represented  
internally by a single char[], wchar[] or dchar[] array.
* utfstring_t!T -- specialized string to deal with full unicode, which may  
perform worse than string_t, but supports everything unicode supports.   
May require a battery of specialized algorithms.

* - name up for discussion

Also note that phobos currently does *no* normalization as far as I can  
tell for things like opEquals.  Two char[]'s that represent equivalent  
strings, but not in the same way, will compare as !=.
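
A small sketch of that difference, assuming a Phobos version that provides std.uni.normalize (added after this discussion). canonicallyEqual is a hypothetical helper, not an existing library function:

```d
import std.uni : normalize, NFC;

/// Hypothetical helper: equality after canonical composition (NFC).
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    string precomposed = "expos\u00E9";  // é as one code point
    string decomposed  = "expose\u0301"; // 'e' + COMBINING ACUTE ACCENT

    assert(precomposed != decomposed);   // plain array comparison
    assert(canonicallyEqual(precomposed, decomposed));
}
```

Normalizing both operands on every comparison allocates and is not cheap; a string type that wanted this as opEquals would probably cache or compare lazily.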

-Steve

Jan 15 2011

foobar <foo bar.com> writes:

Steven Schveighoffer Wrote:

 On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn  
 <lutger.blijdestijn gmail.com> wrote:
 
 Steven Schveighoffer wrote:

 ...
 I think a good standard to evaluate our handling of Unicode is to see
 how easy it is to do things the right way. In the above, foreach would
 slice the string grapheme by grapheme, and the == operator would  
 perform
 a normalized comparison. While it works correctly, it's probably not  
 the
 most efficient way to do things, however.

 I think this is a good alternative, but I'd rather not impose this on
 people like myself who deal mostly with English.  I think this should be
 possible to do with wrapper types or intermediate ranges which have
 graphemes as elements (per my suggestion above).

 Does this sound reasonable?

 -Steve

 If it's a matter of choosing which is the 'default' range, I'd think proper
 Unicode handling is more reasonable than catering for English / ASCII only.
 Especially since this is already the case in Phobos string algorithms.

 
 English and (if I understand correctly) most other languages.  Any  
 language which can be built from composable graphemes would work.  And in  
 fact, ones that use some graphemes that cannot be composed will also work  
 to some degree (for example, opEquals).
 
 What I'm proposing (or think I'm proposing) is not exactly catering to  
 English and ASCII, what I'm proposing is simply not catering to more  
 complex languages such as Hebrew and Arabic.  What I'm trying to find is a  
 middle ground where most languages work, and the code is simple and  
 efficient, with possibilities to jump down to lower levels for performance  
 (i.e. switch to char[] when you know ASCII is all you are using) or jump  
 up to full unicode when necessary.
 
 Essentially, we would have three levels of types:
 
 char[], wchar[], dchar[] -- Considered to be arrays in every way.
 string_t!T (string, wstring, dstring) -- Specialized string types that do  
 normalization to dchars, but do not handle perfectly all graphemes.  Works  
 with any algorithm that deals with bidirectional ranges.  This is the  
 default string type, and the type for string literals.  Represented  
 internally by a single char[], wchar[] or dchar[] array.
 * utfstring_t!T -- specialized string to deal with full unicode, which may  
 perform worse than string_t, but supports everything unicode supports.   
 May require a battery of specialized algorithms.
 
 * - name up for discussion
 
 Also note that phobos currently does *no* normalization as far as I can  
 tell for things like opEquals.  Two char[]'s that represent equivalent  
 strings, but not in the same way, will compare as !=.
 
 -Steve

The above compromise provides zero benefit. The proposed default type string_t
is incorrect and will cause bugs. I prefer the standard lib to not provide
normalization at all and force me to use a 3rd party lib rather than provide an
incomplete implementation that will give me a false sense of correctness and
cause very subtle and hard to find bugs.

Moreover, even if you ignore Hebrew as a tiny insignificant minority you
cannot do the same for Arabic which has over one *billion* people that use that
language. 

I firmly believe that in accordance with D's principle that the default
behavior should be the correct & safe option, D should have the full unicode
type (utfstring_t above) as the default. 

You need only a subset of the functionality because you only use English? For
the same reason, you don't want the Unicode overhead? Use an ASCII type
instead. In the same vein, a geneticist should use a DNA sequence type and not
Unicode text.

Jan 15 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 15 Jan 2011 13:21:12 -0500, foobar <foo bar.com> wrote:

 Steven Schveighoffer Wrote:

 English and (if I understand correctly) most other languages.  Any
 language which can be built from composable graphemes would work.  And  
 in
 fact, ones that use some graphemes that cannot be composed will also  
 work
 to some degree (for example, opEquals).

 What I'm proposing (or think I'm proposing) is not exactly catering to
 English and ASCII, what I'm proposing is simply not catering to more
 complex languages such as Hebrew and Arabic.  What I'm trying to find  
 is a
 middle ground where most languages work, and the code is simple and
 efficient, with possibilities to jump down to lower levels for  
 performance
 (i.e. switch to char[] when you know ASCII is all you are using) or jump
 up to full unicode when necessary.

 Essentially, we would have three levels of types:

 char[], wchar[], dchar[] -- Considered to be arrays in every way.
 string_t!T (string, wstring, dstring) -- Specialized string types that  
 do
 normalization to dchars, but do not handle perfectly all graphemes.   
 Works
 with any algorithm that deals with bidirectional ranges.  This is the
 default string type, and the type for string literals.  Represented
 internally by a single char[], wchar[] or dchar[] array.
 * utfstring_t!T -- specialized string to deal with full unicode, which  
 may
 perform worse than string_t, but supports everything unicode supports.
 May require a battery of specialized algorithms.

 * - name up for discussion

 Also note that phobos currently does *no* normalization as far as I can
 tell for things like opEquals.  Two char[]'s that represent equivalent
 strings, but not in the same way, will compare as !=.

 -Steve

 The above compromise provides zero benefit. The proposed default type  
 string_t is incorrect and will cause bugs. I prefer the standard lib to  
 not provide normalization at all and force me to use a 3rd party lib  
 rather than provide an incomplete implementation that will give me a  
 false sense of correctness and cause very subtle and hard to find bugs.

I feel like you might be exaggerating, but maybe I'm completely wrong on  
this, I'm not well-versed in unicode, or even languages that require  
unicode.  The clear benefit I see is that with a string type which  
normalizes to canonical code points, you can use this in any algorithm  
without having it be unicode-aware for *most languages*.  At least, that  
is how I see it.  I'm looking at it as a code-reuse proposition.

It's like calendars.  There are quite a few different calendars in  
different cultures.  But most people use a Gregorian calendar.  So we have  
three options:

a) Use a Gregorian calendar, and leave the other calendars to a 3rd party  
library
b) Use a complicated calendar system where Gregorian calendars are treated  
with equal respect to all other calendars, none are the default.
c) Use a Gregorian calendar by default, but include the other calendars as  
a separate module for those who wish to use them.

I'm looking at my proposal as more of a c) solution.

Can you show how normalization causes subtle bugs?

 Moreover, even if you ignore Hebrew as a tiny insignificant minority
 you cannot do the same for Arabic which has over one *billion* people  
 that use that language.

I hope that the medium type works 'good enough' for those languages, with  
the high level type needed for advanced usages.  At a minimum, comparison  
and substring should work for all languages.

 I firmly believe that in accordance with D's principle that the default  
 behavior should be the correct & safe option, D should have the full  
 unicode type (utfstring_t above) as the default.

 You need only a subset of the functionality because you only use  
 English? For the same reason, you don't want the Unicode overhead? Use  
 an ASCII type instead. In the same vein, a geneticist should use a DNA
 sequence type and not Unicode text.

Or French, or Spanish, or German, etc...

Look, even the lowest level is valid unicode, but if you want to start  
extracting individual graphemes, you need more machinery.  In 99% of  
cases, I'd think you want to use strings as strings, not as sequences of  
graphemes, or code-units.

-Steve

Jan 15 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 15 Jan 2011 14:51:47 -0500, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 I feel like you might be exaggerating, but maybe I'm completely wrong on  
 this, I'm not well-versed in unicode, or even languages that require  
 unicode.  The clear benefit I see is that with a string type which  
 normalizes to canonical code points, you can use this in any algorithm  
 without having it be unicode-aware for *most languages*.  At least, that  
 is how I see it.  I'm looking at it as a code-reuse proposition.

 It's like calendars.  There are quite a few different calendars in  
 different cultures.  But most people use a Gregorian calendar.  So we  
 have three options:

 a) Use a Gregorian calendar, and leave the other calendars to a 3rd  
 party library
 b) Use a complicated calendar system where Gregorian calendars are  
 treated with equal respect to all other calendars, none are the default.
 c) Use a Gregorian calendar by default, but include the other calendars  
 as a separate module for those who wish to use them.

 I'm looking at my proposal as more of a c) solution.

 Can you show how normalization causes subtle bugs?

I see from Michel's post how normalization automatically can be bad.  I  
also see that it can be wasteful.  So I've shifted my position.

Now I agree that we need a full unicode-compliant string type as the  
default.  See my reply to Michel for more info on my revised proposal.

-Steve

Jan 15 2011

foobar <foo bar.com> writes:

Steven Schveighoffer Wrote:

 On Sat, 15 Jan 2011 13:21:12 -0500, foobar <foo bar.com> wrote:
 
 Steven Schveighoffer Wrote:

 
 English and (if I understand correctly) most other languages.  Any
 language which can be built from composable graphemes would work.  And  
 in
 fact, ones that use some graphemes that cannot be composed will also  
 work
 to some degree (for example, opEquals).

 What I'm proposing (or think I'm proposing) is not exactly catering to
 English and ASCII, what I'm proposing is simply not catering to more
 complex languages such as Hebrew and Arabic.  What I'm trying to find  
 is a
 middle ground where most languages work, and the code is simple and
 efficient, with possibilities to jump down to lower levels for  
 performance
 (i.e. switch to char[] when you know ASCII is all you are using) or jump
 up to full unicode when necessary.

 Essentially, we would have three levels of types:

 char[], wchar[], dchar[] -- Considered to be arrays in every way.
 string_t!T (string, wstring, dstring) -- Specialized string types that  
 do
 normalization to dchars, but do not handle perfectly all graphemes.   
 Works
 with any algorithm that deals with bidirectional ranges.  This is the
 default string type, and the type for string literals.  Represented
 internally by a single char[], wchar[] or dchar[] array.
 * utfstring_t!T -- specialized string to deal with full unicode, which  
 may
 perform worse than string_t, but supports everything unicode supports.
 May require a battery of specialized algorithms.

 * - name up for discussion

 Also note that phobos currently does *no* normalization as far as I can
 tell for things like opEquals.  Two char[]'s that represent equivalent
 strings, but not in the same way, will compare as !=.

 -Steve

 The above compromise provides zero benefit. The proposed default type  
 string_t is incorrect and will cause bugs. I prefer the standard lib to  
 not provide normalization at all and force me to use a 3rd party lib  
 rather than provide an incomplete implementation that will give me a  
 false sense of correctness and cause very subtle and hard to find bugs.

 
 I feel like you might be exaggerating, but maybe I'm completely wrong on  
 this, I'm not well-versed in unicode, or even languages that require  
 unicode.  The clear benefit I see is that with a string type which  
 normalizes to canonical code points, you can use this in any algorithm  
 without having it be unicode-aware for *most languages*.  At least, that  
 is how I see it.  I'm looking at it as a code-reuse proposition.
 
 It's like calendars.  There are quite a few different calendars in  
 different cultures.  But most people use a Gregorian calendar.  So we have  
 three options:
 
 a) Use a Gregorian calendar, and leave the other calendars to a 3rd party  
 library
 b) Use a complicated calendar system where Gregorian calendars are treated  
 with equal respect to all other calendars, none are the default.
 c) Use a Gregorian calendar by default, but include the other calendars as  
 a separate module for those who wish to use them.
 
 I'm looking at my proposal as more of a c) solution.
 

The calendar example is a very good one. What you're saying is equivalent to
saying that most people use Gregorian, but for efficiency and other reasons
you don't want to implement Feb 29th. 

 Can you show how normalization causes subtle bugs?
 

That was already shown by Michel and Spir, where the equality operator is
incorrect due to diacritics (the example with exposé). Your solution makes this
far worse, since it reduces the bug to far fewer cases, making the problem far
less obvious.
One would test with exposé, which will work, and then run another test (let's
say in Hebrew) that will *not* work, and unless the programmer is a Unicode
expert (which is very unlikely) the programmer is left scratching his head.
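
For illustration, here is a minimal sketch of that failure mode. It assumes a present-day Phobos in which std.uni provides normalize (that function is not one of the types proposed in this thread), and canonicallyEqual is a hypothetical helper named only for this example. The last assertion shows why composing to single code points only hides the issue: a grapheme with no precomposed form still spans two code points after NFC.

```d
import std.range : walkLength;
import std.uni : normalize, NFC;

/// Compares two strings after bringing both to NFC, so that canonically
/// equivalent spellings compare equal. (Hypothetical helper, not a Phobos API.)
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "expos\u00E9";  // é stored as the single code point U+00E9
    string decomposed  = "expose\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    // Plain == compares code units, so the two spellings differ.
    assert(precomposed != decomposed);
    // After normalization they are equal.
    assert(canonicallyEqual(precomposed, decomposed));

    // 'x' + combining acute has no precomposed code point, so NFC leaves
    // two code points: a dchar-per-character view still splits the grapheme.
    assert(normalize!NFC("x\u0301").walkLength == 2);
}
```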

 Moreover, even if you ignore Hebrew as a tiny insignificant minority  
 you cannot do the same for Arabic which has over one *billion* people  
 that use that language.

 
 I hope that the medium type works 'good enough' for those languages, with  
 the high level type needed for advanced usages.  At a minimum, comparison  
 and substring should work for all languages.
 

As I explained above, 'good enough' in this case is far worse because it masks
the problem.  Also, if you want comparison to work in all languages including
Hebrew/Arabic, then it simply isn't good enough.

 I firmly believe that in accordance with D's principle that the default  
 behavior should be the correct & safe option, D should have the full  
 unicode type (utfstring_t above) as the default.

 You need only a subset of the functionality because you only use  
 English? For the same reason, you don't want the Unicode overhead? Use  
 an ASCII type instead. In the same vein, a geneticist should use a DNA  
 sequence type and not Unicode text.

 
 Or French, or Spanish, or German, etc...
 
 Look, even the lowest level is valid unicode, but if you want to start  
 extracting individual graphemes, you need more machinery.  In 99% of  
 cases, I'd think you want to use strings as strings, not as sequences of  
 graphemes, or code-units.
 
 -Steve

I'd like to have full Unicode support. I think it is a good thing for D to have
in order to expand in the world. As an alternative, I'd settle for loud errors
that make absolutely clear to the non-Unicode expert programmer that D simply
does NOT support e.g. Normalization. 

As Spir already said, Unicode is something few understand and even its own
official docs do not explain such issues properly. We should not confuse users
even further with incomplete support.

Jan 15 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 15 Jan 2011 15:46:11 -0500, foobar <foo bar.com> wrote:


 I'd like to have full Unicode support. I think it is a good thing for D  
 to have in order to expand in the world. As an alternative, I'd settle  
 for loud errors that make absolutely clear to the non-Unicode expert  
 programmer that D simply does NOT support e.g. Normalization.

 As Spir already said, Unicode is something few understand and even its  
 own official docs do not explain such issues properly. We should not  
 confuse users even further with incomplete support.

Well said, I've changed my mind.  Thanks for explaining.

-Steve

Jan 15 2011

spir <denis.spir gmail.com> writes:

On 01/15/2011 09:46 PM, foobar wrote:
 I'd like to have full Unicode support. I think it is a good thing for D to
have in order to expand in the world. As an alternative, I'd settle for loud
errors that make absolutely clear to the non-Unicode expert programmer that D
simply does NOT support e.g. Normalization.

 As Spir already said, Unicode is something few understand and even its own
official docs do not explain such issues properly. We should not confuse users
even further with incomplete support.

In a few days, D will have an external library able to deal with those 
issues, hopefully correctly and clearly for client programmers. 
Possibly, its design is not the best possible approach (esp. for 
efficiency: Michel made me doubt that, and my competence in this 
field is close to nothing). But it has the merit of existing and of 
providing a clear example of the correct semantics. Let us use it as a 
base for experimentation.

Then, everything can be redesigned from scratch if we realise I was 
initially completely wrong. In any case, it would certainly be a far 
easier and faster job to do now, after having explored the issues at 
length, and with a reference implementation at hand.

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 17 2011

spir <denis.spir gmail.com> writes:

On 01/15/2011 08:51 PM, Steven Schveighoffer wrote:
 Moreover, even if you ignore Hebrew as a tiny insignificant minority
 you cannot do the same for Arabic which has over one *billion* people
 that use that language.

 I hope that the medium type works 'good enough' for those languages,
 with the high level type needed for advanced usages.  At a minimum,
 comparison and substring should work for all languages.

Hello Steven,

How does an application know that a given text, which supposedly is 
written in a given natural language (as for instance indicated by an 
html header) does not also hold terms from other languages? There are 
various occasions for this: quotations, use of foreign words, pointers...

A side-issue is raised by precomposed codes for composite characters. 
For most languages of the world, I guess (but am unsure), all "official" 
characters have single-code representations. Good, but unfortunately 
this is not enforced by the standard (instead, the decomposed form can 
sensibly be considered the base form, but this is another topic).
So even if one knows for sure that all characters of all texts an 
app will ever deal with can be mapped to single codes, to be safe one 
would have to normalise to NFC anyway (Normalization Form C, composed). 
Then, where is the actual gain? In fact, it is a loss because NFC is more 
costly than NFD (decomposed) -- actually, the standard NFC algorithm first 
decomposes to NFD to get a unique representation that can 
then be more easily (re)composed via simple mappings.
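
As a small sketch of the two forms, assuming a present-day Phobos where std.uni exposes normalize together with the NFC and NFD constants (this is not the external library announced earlier in this thread):

```d
import std.uni : normalize, NFC, NFD;

void main()
{
    string composed   = "\u00E9";  // é as one precomposed code point
    string decomposed = "e\u0301"; // 'e' followed by a combining acute accent

    // NFD splits the precomposed code point into base letter + combining mark.
    assert(normalize!NFD(composed) == decomposed);
    // NFC recombines the pair into the precomposed code point.
    assert(normalize!NFC(decomposed) == composed);
    // Either form can serve as the common representation for comparison.
    assert(normalize!NFD(composed) == normalize!NFD(decomposed));
}
```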

For further information:
Unicode's normalisation algos: http://unicode.org/reports/tr15/
list of technical reports: http://unicode.org/reports/
(Unicode's technical reports are far more readable than the standard 
itself, but unfortunately often refer to it.)

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 17 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 17 Jan 2011 10:14:19 -0500, spir <denis.spir gmail.com> wrote:

 On 01/15/2011 08:51 PM, Steven Schveighoffer wrote:
 More over, even if you ignore Hebrew as a tiny insignificant minority
 you cannot do the same for Arabic which has over one *billion* people
 that use that language.

 I hope that the medium type works 'good enough' for those languages,
 with the high level type needed for advanced usages.  At a minimum,
 comparison and substring should work for all languages.

 Hello Steven,

 How does an application know that a given text, which supposedly is  
 written in a given natural language (as for instance indicated by an  
 html header) does not also hold terms from other languages? There are  
 various occasions for this: quotations, use of foreign words, pointers...

 A side-issue is raised by precomposed codes for composite characters.  
 For most languages of the world, I guess (but unsure), all "official"  
 characters have single-code representations. Good, but unfortunately  
 this is not enforced by the standard (instead, the decomposed form can  
 sensibly be considered the base form, but this is another topic).
 So even if one knows for sure that all characters of all texts an  
 app will ever deal with can be mapped to single codes, to be safe one  
 would have to normalise to NFC anyway (Normalised Form Composed). Then,  
 where is the actual gain? In fact, it is a loss because NFC is more  
 costly than NFD (Decomposed) -- actually, the standard NFC algo first  
 decomposes to NFD to initially get a unique representation that can  
 then be more easily (re)composed via simple mappings.

 For further information:
 Unicode's normalisation algos: http://unicode.org/reports/tr15/
 list of technical reports: http://unicode.org/reports/
 (Unicode's technical reports are far more readable than the standard  
 itself, but unfortunately often refer to it.)

I'll reply to this to save you the trouble.  I have reversed my position  
since writing a lot of these posts.

In summary, I think strings should default to an element type of a  
grapheme, which should be implemented via a slice of the original data.   
Updated string type forthcoming.

-Steve

Jan 17 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-15 12:39:32 -0500, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn  
 <lutger.blijdestijn gmail.com> wrote:
 
 Steven Schveighoffer wrote:
 
 ...
 I think a good standard to evaluate our handling of Unicode is to see
 how easy it is to do things the right way. In the above, foreach would
 slice the string grapheme by grapheme, and the == operator would perform
 a normalized comparison. While it works correctly, it's probably not the
 most efficient way to do things however.

 
 I think this is a good alternative, but I'd rather not impose this on
 people like myself who deal mostly with English.  I think this should be
 possible to do with wrapper types or intermediate ranges which have
 graphemes as elements (per my suggestion above).
 
 Does this sound reasonable?
 
 -Steve

 
 If it's a matter of choosing which is the 'default' range, I'd think proper
 unicode handling is more reasonable than catering for english / ascii only.
 Especially since this is already the case in phobos string algorithms.

 
 English and (if I understand correctly) most other languages.  Any  
 language which can be built from composable graphemes would work.  And 
 in  fact, ones that use some graphemes that cannot be composed will 
 also work  to some degree (for example, opEquals).
 
 What I'm proposing (or think I'm proposing) is not exactly catering to  
 English and ASCII, what I'm proposing is simply not catering to more  
 complex languages such as Hebrew and Arabic.  What I'm trying to find 
 is a  middle ground where most languages work, and the code is simple 
 and  efficient, with possibilities to jump down to lower levels for 
 performance  (i.e. switch to char[] when you know ASCII is all you are 
 using) or jump  up to full unicode when necessary.

Why don't we build a compiler with an optimizer that generates correct 
code *almost* all of the time? If you are worried about it not 
producing correct code for a given function, you can just add 
"pragma(correct_code)" in front of that function to disable the risky 
optimizations. No harm done, right?

One thing I see very often, often on US web sites but also elsewhere, 
is that if you enter a name with an accented letter in a form (say 
Émilie), very often the accented letter gets changed to another 
semi-random character later in the process. Why? Because somewhere in 
the process lies an encoding mismatch that no one thought about and no 
one tested for. At the very least, the form should have rejected those 
unexpected characters and shown an error when it could.

Now, with proper Unicode handling up to the code point level, this kind 
of problem probably won't happen as often because the whole stack works 
with UTF encodings. But are you going to validate all of your inputs to 
make sure they have no combining code point?

Don't assume that because you're in the United States no one will try 
to enter characters where you don't expect them. People love to play 
with Unicode symbols for fun, putting them in their name, signature, or 
even domain names (✪df.ws). Just wait until they discover they can 
combine them. ☺̰̎! There is also a variety of combining mathematical 
symbols with no pre-combined form, such as ≸. Writing in Arabic, 
Hebrew, Korean, or some other foreign language isn't a prerequisite to 
use combining characters.


 Essentially, we would have three levels of types:
 
 char[], wchar[], dchar[] -- Considered to be arrays in every way.
 string_t!T (string, wstring, dstring) -- Specialized string types that 
 do  normalization to dchars, but do not handle perfectly all graphemes. 
  Works  with any algorithm that deals with bidirectional ranges.  This 
 is the  default string type, and the type for string literals.  
 Represented  internally by a single char[], wchar[] or dchar[] array.
 * utfstring_t!T -- specialized string to deal with full unicode, which 
 may  perform worse than string_t, but supports everything unicode 
 supports.   May require a battery of specialized algorithms.
 
 * - name up for discussion
 
 Also note that phobos currently does *no* normalization as far as I can 
  tell for things like opEquals.  Two char[]'s that represent equivalent 
  strings, but not in the same way, will compare as !=.

Basically, you're suggesting that the default way should be to handle 
Unicode *almost* right. And then, if you want to handle things *really* 
right you need to be explicit about it by using "utfstring_t"? I 
understand your motivation, but it sounds backward to me.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 15 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 15 Jan 2011 15:31:23 -0500, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2011-01-15 12:39:32 -0500, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:

 On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn   
 <lutger.blijdestijn gmail.com> wrote:

 Steven Schveighoffer wrote:
  ...
 I think a good standard to evaluate our handling of Unicode is to see
 how easy it is to do things the right way. In the above, foreach  
 would
 slice the string grapheme by grapheme, and the == operator would   
 perform
 a normalized comparison. While it works correctly, it's probably  
 not  the
 most efficient way to do thing however.

  I think this is a good alternative, but I'd rather not impose this on
 people like myself who deal mostly with English.  I think this should  
 be
 possible to do with wrapper types or intermediate ranges which have
 graphemes as elements (per my suggestion above).
  Does this sound reasonable?
  -Steve

  If its a matter of choosing which is the 'default' range, I'd think   
 proper
 unicode handling is more reasonable than catering for english / ascii   
 only.
 Especially since this is already the case in phobos string algorithms.

  English and (if I understand correctly) most other languages.  Any   
 language which can be built from composable graphemes would work.  And  
 in  fact, ones that use some graphemes that cannot be composed will  
 also work  to some degree (for example, opEquals).
  What I'm proposing (or think I'm proposing) is not exactly catering  
 to  English and ASCII, what I'm proposing is simply not catering to  
 more  complex languages such as Hebrew and Arabic.  What I'm trying to  
 find is a  middle ground where most languages work, and the code is  
 simple and  efficient, with possibilities to jump down to lower levels  
 for performance  (i.e. switch to char[] when you know ASCII is all you  
 are using) or jump  up to full unicode when necessary.

 Why don't we build a compiler with an optimizer that generates correct  
 code *almost* all of the time? If you are worried about it not producing  
 correct code for a given function, you can just add  
 "pragma(correct_code)" in front of that function to disable the risky  
 optimizations. No harm done, right?

 One thing I see very often, often on US web sites but also elsewhere, is  
 that if you enter a name with an accented letter in a form (say Émilie),  
 very often the accented letter gets changed to another semi-random  
 character later in the process. Why? Because somewhere in the process  
 lies an encoding mismatch that no one thought about and no one tested  
 for. At the very least, the form should have rejected those unexpected  
 characters and shown an error when it could.

 Now, with proper Unicode handling up to the code point level, this kind  
 of problem probably won't happen as often because the whole stack works  
 with UTF encodings. But are you going to validate all of your inputs to  
 make sure they have no combining code point?

 Don't assume that because you're in the United States no one will try to  
 enter characters where you don't expect them. People love to play with  
 Unicode symbols for fun, putting them in their name, signature, or even  
 domain names (✪df.ws). Just wait until they discover they can combine  
 them. ☺̰̎! There is also a variety of combining mathematical symbols  
 with no pre-combined form, such as ≸. Writing in Arabic, Hebrew,  
 Korean, or some other foreign language isn't a prerequisite to use  
 combining characters.


 Essentially, we would have three levels of types:
  char[], wchar[], dchar[] -- Considered to be arrays in every way.
 string_t!T (string, wstring, dstring) -- Specialized string types that  
 do  normalization to dchars, but do not handle perfectly all graphemes.  
  Works  with any algorithm that deals with bidirectional ranges.  This  
 is the  default string type, and the type for string literals.   
 Represented  internally by a single char[], wchar[] or dchar[] array.
 * utfstring_t!T -- specialized string to deal with full unicode, which  
 may  perform worse than string_t, but supports everything unicode  
 supports.   May require a battery of specialized algorithms.
  * - name up for discussion
  Also note that phobos currently does *no* normalization as far as I  
 can  tell for things like opEquals.  Two char[]'s that represent  
 equivalent  strings, but not in the same way, will compare as !=.

 Basically, you're suggesting that the default way should be to handle  
 Unicode *almost* right. And then, if you want to handle things *really*  
 right you need to be explicit about it by using "utfstring_t"? I  
 understand your motivation, but it sounds backward to me.

You make very good points.  I concede that using dchar as the element  
type is not correct for unicode strings.

-Steve

Jan 15 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-15 11:59:04 -0500, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin  
 <michel.fortin michelf.com> wrote:
 
 Actually, returning a sliced char[] or wchar[] could also be valid.  
 User-perceived characters are basically a substring of one or more code 
  points. I'm not sure it complicates that much the semantics of the  
 language -- what's complicated about writing str.front == "a" instead 
 of  str.front == 'a'? -- although it probably would complicate the 
 generated  code and make it a little slower.

 
 Hm... this pushes the normalization outside the type, and into the  
 algorithms (such as find).
 
 I was hoping to avoid that.

Not really. It pushes the normalization to the string comparison 
operator, as explained later.


 I think I can  come up with an algorithm that normalizes into canonical 
 form as it  iterates.  It just might return part of a grapheme if the 
 grapheme cannot  be composed.

The problem with normalization while iterating is that you lose 
information about which code points actually make up the grapheme. If 
you wanted to count the number of graphemes containing a particular code 
point, you've lost that information.

Moreover, if all you want is to count the number of graphemes, 
normalizing the characters is a waste of time.

I suggested in another post that we implement ranges for decomposing 
and recomposing a string on the fly into its normalized form. That's 
basically the same thing as you suggest, but it'd have to be explicit 
to avoid the problem above.
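
To make the information loss concrete, here is a rough sketch. It assumes the byGrapheme and normalize functions found in std.uni of present-day Phobos (not the ranges proposed above), and countGraphemesWith is a hypothetical helper written only for this illustration.

```d
import std.uni : byGrapheme, normalize, NFC;

/// Counts the graphemes of `s` that contain the code point `cp`.
/// (Hypothetical helper written for this illustration.)
size_t countGraphemesWith(string s, dchar cp)
{
    size_t count = 0;
    foreach (g; s.byGrapheme)
    {
        // A Grapheme can be indexed by code point.
        foreach (i; 0 .. g.length)
        {
            if (g[i] == cp)
            {
                ++count;
                break; // count each grapheme at most once
            }
        }
    }
    return count;
}

void main()
{
    // Both accented letters are spelled as 'e' + U+0301 COMBINING ACUTE ACCENT.
    string text = "re\u0301sume\u0301";

    // On the original data the combining accent is still visible.
    assert(countGraphemesWith(text, '\u0301') == 2);
    // After composing to NFC the accent is folded into U+00E9 and is gone.
    assert(countGraphemesWith(normalize!NFC(text), '\u0301') == 0);
}
```

Keeping the grapheme as a slice of the original data leaves both views available; normalizing eagerly keeps only the second.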


 I wonder if normalized string comparison shouldn't be built directly in 
  the char[] wchar[] and dchar[] types instead.

 
 No, in my vision of how strings should be typed, char[] is an array, 
 not a  string.  It should be treated like an array of code-units, where 
 two forms  that create the same grapheme are considered different.

Well, I agree there's a need for that sometimes. But if what you want is 
just a dumb array of code units, why not use ubyte[], ushort[] and 
uint[] instead?

It seems to me that the whole point of having a different type for 
char[], wchar[], and dchar[] is that you know they are Unicode strings 
and can treat them as such. And if you treat them as Unicode strings, 
then perhaps the runtime and the compiler should too, for consistency's 
sake.


 Also bring the idea above that iterating on a string would yield  
 graphemes as char[] and this code would work perfectly irrespective of  
 whether you used combining characters:
 
 	foreach (grapheme; "exposé") {
 		if (grapheme == "é")
 			break;
 	}
 
 I think a good standard to evaluate our handling of Unicode is to see  
 how easy it is to do things the right way. In the above, foreach would  
 slice the string grapheme by grapheme, and the == operator would 
 perform a normalized comparison. While it works correctly, it's 
 probably not the most efficient way to do things however.

 
 I think this is a good alternative, but I'd rather not impose this on  
 people like myself who deal mostly with English.

I'm not suggesting we impose it, just that we make it the default. If 
you want to iterate by dchar, wchar, or char, just write:

	foreach (dchar c; "exposé") {}
	foreach (wchar c; "exposé") {}
	foreach (char c; "exposé") {}
	// or
	foreach (dchar c; "exposé".by!dchar()) {}
	foreach (wchar c; "exposé".by!wchar()) {}
	foreach (char c; "exposé".by!char()) {}

and it'll work. But the default would be a slice containing the 
grapheme, because this is the right way to represent a Unicode 
character.
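
As a rough sketch of the levels involved, using what exists in present-day D rather than the by!T() syntax proposed above: foreach with an explicit element type already transcodes a string, and std.uni.byGrapheme is assumed here for the grapheme view. countAs is a hypothetical helper for this example only.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Counts how many elements of type T a foreach over `s` produces.
/// (Hypothetical helper; T is expected to be char, wchar or dchar.)
size_t countAs(T)(string s)
{
    size_t n = 0;
    foreach (T c; s)
        ++n;
    return n;
}

void main()
{
    // "exposé" spelled with a combining accent: 'e' followed by U+0301.
    string s = "expose\u0301";

    assert(countAs!char(s) == 8);         // UTF-8 code units
    assert(countAs!wchar(s) == 7);        // UTF-16 code units
    assert(countAs!dchar(s) == 7);        // code points
    assert(s.byGrapheme.walkLength == 6); // user-perceived characters
}
```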


 I think this should be  possible to do with wrapper types or 
 intermediate ranges which have  graphemes as elements (per my 
 suggestion above).

I think it should be the reverse. If you want your code to break when 
it encounters multi-code-point graphemes then it's your choice, but you 
should have to make your choice explicit. The default should be to 
handle strings correctly.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 15 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 15 Jan 2011 13:32:10 -0500, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2011-01-15 11:59:04 -0500, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:

 On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin   
 <michel.fortin michelf.com> wrote:

 Actually, returning a sliced char[] or wchar[] could also be valid.   
 User-perceived characters are basically a substring of one or more  
 code  points. I'm not sure it complicates that much the semantics of  
 the  language -- what's complicated about writing str.front == "a"  
 instead of  str.front == 'a'? -- although it probably would complicate  
 the generated  code and make it a little slower.

  Hm... this pushes the normalization outside the type, and into the   
 algorithms (such as find).
  I was hoping to avoid that.

 Not really. It pushes the normalization to the string comparison  
 operator, as explained later.


 I think I can  come up with an algorithm that normalizes into canonical  
 form as it  iterates.  It just might return part of a grapheme if the  
 grapheme cannot  be composed.

 The problem with normalization while iterating is that you lose  
 information about which code points actually make up the grapheme. If  
 you wanted to count the number of graphemes containing a particular code  
 point, you've lost that information.

Are these common requirements?  I thought users mostly care about  
graphemes, not code points.  Asking in the dark here, since I have next to  
zero experience with unicode strings.

 Moreover, if all you want is to count the number of grapheme,  
 normalizing the character is a waste of time.

This is true.  I can see this being a common need.

 I suggested in another post that we implement ranges for decomposing and  
 recomposing on-the-fly a string in its normalized form. That's basically  
 the same thing as you suggest, but it'd have to be explicit to avoid the  
 problem above.

OK, I see your point.

 I wonder if normalized string comparison shouldn't be built directly  
 in  the char[] wchar[] and dchar[] types instead.

  No, in my vision of how strings should be typed, char[] is an array,  
 not a  string.  It should be treated like an array of code-units, where  
 two forms  that create the same grapheme are considered different.

 Well, I agree there's a need for that sometime. But if what you want is  
 just a dumb array of code units, why not use ubyte[], ushort[] and  
 uint[] instead?

Because ubyte[], ushort[] and uint[] do not say that their data is unicode  
text.  The point is, if I want to write a function that takes UTF-8, ubyte[]  
opens it up to any data, not just UTF-8 data.  But if we have a method of  
iterating code units as you specify below, then I think we are OK.

 It seems to me that the whole point of having a different type for  
 char[], wchar[], and dchar[] is that you know they are Unicode strings  
 and can treat them as such. And if you treat them as Unicode strings,  
 then perhaps the runtime and the compiler should too, for consistency's  
 sake.

I'd agree with you, but then there's that pesky [] after it indicating  
it's an array.  For consistency's sake, I'd say the compiler should treat  
T[] as an array of T's.

 Also bring the idea above that iterating on a string would yield   
 graphemes as char[] and this code would work perfectly irrespective  
 of  whether you used combining characters:
  	foreach (grapheme; "exposé") {
 		if (grapheme == "é")
 			break;
 	}
  I think a good standard to evaluate our handling of Unicode is to  
 see  how easy it is to do things the right way. In the above, foreach  
 would  slice the string grapheme by grapheme, and the == operator  
 would perform  a normalized comparison. While it works correctly, it's  
 probably not the  most efficient way to do thing however.

  I think this is a good alternative, but I'd rather not impose this on   
 people like myself who deal mostly with English.

 I'm not suggesting we impose it, just that we make it the default. If  
 you want to iterate by dchar, wchar, or char, just write:

 	foreach (dchar c; "exposé") {}
 	foreach (wchar c; "exposé") {}
 	foreach (char c; "exposé") {}
 	// or
 	foreach (dchar c; "exposé".by!dchar()) {}
 	foreach (wchar c; "exposé".by!wchar()) {}
 	foreach (char c; "exposé".by!char()) {}

 and it'll work. But the default would be a slice containing the  
 grapheme, because this is the right way to represent a Unicode character.

I think this is a good idea.  I previously was nervous about it, but I'm  
not sure it makes a huge difference.  Returning a char[] is certainly less  
work than normalizing a grapheme into one or more code points, and then  
returning them.  All that it takes is to detect all the code points within  
the grapheme.  Normalization can be done if needed, but would probably  
have to output another char[], since a normalized grapheme can occupy more  
than one dchar.

What if I modified my proposed string_t type to return T[] as its element  
type, as you say, and string literals are typed as string_t!(whatever)?   
In addition, the restrictions I imposed on slicing a code point actually  
get imposed on slicing a grapheme.  That is, it is illegal to substring a  
string_t in a way that slices through a grapheme (and by deduction, a code  
point)?
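
A sketch of what such a slicing check could look like, assuming std.uni.graphemeStride from present-day Phobos. onGraphemeBoundary is a hypothetical name, and walking from the start of the string is chosen for simplicity, not speed.

```d
import std.uni : graphemeStride;

/// Returns true if `index` (in code units) falls on a grapheme boundary of `s`.
/// (Hypothetical helper; a string_t could run this check before slicing.)
bool onGraphemeBoundary(string s, size_t index)
{
    size_t i = 0;
    // Hop from grapheme start to grapheme start until we reach or pass index.
    while (i < index && i < s.length)
        i += graphemeStride(s, i);
    return i == index;
}

void main()
{
    // Byte layout: e x p o s e (0..5), U+0301 (6..7), '!' (8)
    string s = "expose\u0301!";

    assert(onGraphemeBoundary(s, 5));  // just before the final 'e'
    assert(!onGraphemeBoundary(s, 6)); // between 'e' and its combining accent
    assert(!onGraphemeBoundary(s, 7)); // inside the UTF-8 sequence of U+0301
    assert(onGraphemeBoundary(s, 8));  // just before '!'
}
```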

Actually, we would need a grapheme to be its own type, because comparing  
two char[]'s that don't contain equivalent bits and having them be equal,  
violates the expectation that char[] is an array.

So the string_t!char would return a grapheme_t!char (names to be  
discussed) as its element type.
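
Roughly, such an element type might look like the sketch below. GraphemeSlice is a hypothetical stand-in for grapheme_t!char, and the comparison leans on std.uni.normalize from present-day Phobos; normalizing on every comparison is the simplest thing that works, not the most efficient.

```d
import std.uni : normalize, NFC;

/// A grapheme held as a slice of the original text.
/// (Hypothetical stand-in for the grapheme_t!char discussed above.)
struct GraphemeSlice
{
    string data; // slice of the original text, left untouched

    /// Two graphemes are equal if they are canonically equivalent.
    bool opEquals(const GraphemeSlice rhs) const
    {
        string lhsData = data;
        string rhsData = rhs.data;
        return normalize!NFC(lhsData) == normalize!NFC(rhsData);
    }
}

void main()
{
    auto a = GraphemeSlice("\u00E9");  // precomposed é
    auto b = GraphemeSlice("e\u0301"); // 'e' + combining acute accent

    assert(a == b);           // grapheme comparison is normalized
    assert(a.data != b.data); // the underlying arrays still compare bit for bit
}
```

This keeps char[] behaving as a plain array while the wrapper carries the Unicode-aware equality.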

 I think this should be  possible to do with wrapper types or  
 intermediate ranges which have  graphemes as elements (per my  
 suggestion above).

 I think it should be the reverse. If you want your code to break when it  
 encounters multi-code-point graphemes then it's your choice, but you  
 should have to make your choice explicit. The default should be to  
 handle strings correctly.

You are probably right.

-Steve

Jan 15 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 I'm not suggesting we impose it, just that we make it the default. If  
 you want to iterate by dchar, wchar, or char, just write:
 
 	foreach (dchar c; "exposé") {}
 	foreach (wchar c; "exposé") {}
 	foreach (char c; "exposé") {}
 	// or
 	foreach (dchar c; "exposé".by!dchar()) {}
 	foreach (wchar c; "exposé".by!wchar()) {}
 	foreach (char c; "exposé".by!char()) {}
 
 and it'll work. But the default would be a slice containing the  
 grapheme, because this is the right way to represent a Unicode 
 character.

 
 I think this is a good idea.  I previously was nervous about it, but 
 I'm  not sure it makes a huge difference.  Returning a char[] is 
 certainly less  work than normalizing a grapheme into one or more code 
 points, and then  returning them.  All that it takes is to detect all 
 the code points within  the grapheme.  Normalization can be done if 
 needed, but would probably  have to output another char[], since a 
 normalized grapheme can occupy more  than one dchar.

I'm glad we agree on that now.


 What if I modified my proposed string_t type to return T[] as its 
 element  type, as you say, and string literals are typed as 
 string_t!(whatever)?   In addition, the restrictions I imposed on 
 slicing a code point actually  get imposed on slicing a grapheme.  That 
 is, it is illegal to substring a  string_t in a way that slices through 
 a grapheme (and by deduction, a code  point)?

I'm not opposed to that on principle. I'm a little uneasy about having 
so many types representing a string however. Some other raw comments:

I agree that things would be more coherent if char[], wchar[], and 
dchar[] behaved like other arrays, but I can't really see a 
justification for those types to be in the language if there's nothing 
special about them (why not a library type?). If strings and arrays of 
code units are distinct, slicing in the middle of a grapheme or in the 
middle of a code point could throw an error, but for performance 
reasons it should probably check for that only when array bounds 
checking is turned on (that would require compiler support however).


 Actually, we would need a grapheme to be its own type, because 
 comparing  two char[]'s that don't contain equivalent bits and having 
 them be equal,  violates the expectation that char[] is an array.
 
 So the string_t!char would return a grapheme_t!char (names to be  
 discussed) as its element type.

Or you could make a grapheme a string_t. ;-)


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 15 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:

 I'm not suggesting we impose it, just that we make it the default. If   
 you want to iterate by dchar, wchar, or char, just write:
  	foreach (dchar c; "exposé") {}
 	foreach (wchar c; "exposé") {}
 	foreach (char c; "exposé") {}
 	// or
 	foreach (dchar c; "exposé".by!dchar()) {}
 	foreach (wchar c; "exposé".by!wchar()) {}
 	foreach (char c; "exposé".by!char()) {}
  and it'll work. But the default would be a slice containing the   
 grapheme, because this is the right way to represent a Unicode  
 character.

  I think this is a good idea.  I previously was nervous about it, but  
 I'm  not sure it makes a huge difference.  Returning a char[] is  
 certainly less  work than normalizing a grapheme into one or more code  
 points, and then  returning them.  All that it takes is to detect all  
 the code points within  the grapheme.  Normalization can be done if  
 needed, but would probably  have to output another char[], since a  
 normalized grapheme can occupy more  than one dchar.

 I'm glad we agree on that now.

It's a matter of me slowly wrapping my brain around Unicode and how it's  
used.  It seems like a typical committee-defined standard where there  
are 10 ways to do everything; I was trying to weed out the lesser-used (or  
so I perceived) pieces to allow a more implementable library.  It's doubly  
hard for me since I have limited experience with other languages, and I've  
never tried to write them with a computer (my language classes in high  
school were back in the days of actually writing stuff down on paper).

I once told a colleague who was on a standards committee that their  
proposed KLV standard (key length value) was ridiculous.  The wise  
committee had decided that in order to avoid future issues, the length  
would be encoded as a single byte if < 128, or 128 + length of the length  
field for anything higher.  This means you could potentially have to parse  
and process a 127-byte integer!
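
To make the encoding concrete, here is a minimal sketch of a decoder for the length field as described above. The names `DecodedLength` and `decodeLength` are made up for illustration, big-endian byte order is an assumption (the post does not say), and the sketch deliberately refuses length fields wider than 8 bytes instead of handling the 127-byte case:

```d
/// Result of decoding one length field (hypothetical helper type).
struct DecodedLength
{
    ulong value;      // the decoded length
    size_t consumed;  // how many bytes the length field occupied
}

/// Decodes a length field: one byte if < 128, otherwise
/// (128 + n) followed by n bytes holding the length.
DecodedLength decodeLength(const(ubyte)[] buf)
{
    assert(buf.length > 0, "empty buffer");
    immutable first = buf[0];
    if (first < 128)
        return DecodedLength(first, 1);   // short form: the byte is the length

    immutable size_t n = first - 128;     // long form: n length bytes follow
    assert(n >= 1 && n <= ulong.sizeof, "length field too wide for this sketch");
    assert(buf.length > n, "truncated length field");

    ulong value = 0;
    foreach (b; buf[1 .. 1 + n])          // assumed big-endian
        value = (value << 8) | b;
    return DecodedLength(value, 1 + n);
}

void main()
{
    ubyte[] shortForm = [0x05];
    ubyte[] longForm  = [0x82, 0x01, 0x00];   // 0x82 = 128 + 2 length bytes
    assert(decodeLength(shortForm) == DecodedLength(5, 1));
    assert(decodeLength(longForm)  == DecodedLength(256, 3));
}
```

A first byte of 255 would announce 127 further length bytes, which is the case the anecdote complains about.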

 What if I modified my proposed string_t type to return T[] as its  
 element  type, as you say, and string literals are typed as  
 string_t!(whatever)?   In addition, the restrictions I imposed on  
 slicing a code point actually  get imposed on slicing a grapheme.  That  
 is, it is illegal to substring a  string_t in a way that slices through  
 a grapheme (and by deduction, a code  point)?

 I'm not opposed to that on principle. I'm a little uneasy about having  
 so many types representing a string however. Some other raw comments:

 I agree that things would be more coherent if char[], wchar[], and  
 dchar[] behaved like other arrays, but I can't really see a  
 justification for those types to be in the language if there's nothing  
 special about them (why not a library type?).

I would not be opposed to getting rid of those types.  But I am very  
opposed to char[] not being an array.  If you want a string to be  
something other than an array, make it have a different syntax.  We also  
have to consider C compatibility.

However, we are in radical-change mode then, and this is probably pushed  
to D3 ;)  If we can find some way to fix the situation without  
invalidating TDPL, we should strive for that first IMO.

 If strings and arrays of code units are distinct, slicing in the middle  
 of a grapheme or in the middle of a code point could throw an error, but  
 for performance reasons it should probably check for that only when  
 array bounds checking is turned on (that would require compiler support  
 however).

Not really, it could use assert, but that throws an assert error instead  
of a RangeError.  Of course, both are errors and will abort the program.   
I do wish there was a version(noboundscheck) to do this kind of stuff  
with...
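
A rough sketch of the assert-based approach, for UTF-8 only. The names `onCodePointBoundary` and `checkedSlice` are hypothetical, the check covers code-point boundaries but not grapheme boundaries, and the assert disappears when compiling with -release:

```d
/// True if index i is a place where a UTF-8 code point may start,
/// or the end of the string.
bool onCodePointBoundary(const(char)[] s, size_t i)
{
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return i == s.length || (i < s.length && (s[i] & 0xC0) != 0x80);
}

/// Slices s[lo .. hi], asserting that neither end cuts a code point.
/// Failure raises an AssertError rather than a RangeError.
inout(char)[] checkedSlice(inout(char)[] s, size_t lo, size_t hi)
{
    assert(onCodePointBoundary(s, lo) && onCodePointBoundary(s, hi),
           "slice would cut through a code point");
    return s[lo .. hi];
}

void main()
{
    string s = "expos\u00E9";            // the last character takes two code units
    assert(s.length == 7);
    assert(checkedSlice(s, 0, 5) == "expos");
    assert(!onCodePointBoundary(s, 6));  // index 6 is the second byte of U+00E9
}
```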

 Actually, we would need a grapheme to be its own type, because  
 comparing  two char[]'s that don't contain equivalent bits and having  
 them be equal,  violates the expectation that char[] is an array.
  So the string_t!char would return a grapheme_t!char (names to be   
 discussed) as its element type.

 Or you could make a grapheme a string_t. ;-)

I'm a little uneasy having a range return itself as its element type.  For  
all intents and purposes, a grapheme is a string of one 'element', so it  
could potentially be a string_t.

It does seem daunting to have so many types, but at the same time, types  
convey relationships at compile time that can make coding impossible to  
get wrong, or make things actually possible when having a single type  
doesn't.

I'll give you an example from a previous life:

Tango had a type called DateTime.  This type represented *either* a point  
in time, or a span of time (depending on how you used it).  But I proposed  
we switch to two distinct types, one for a point in time, one for a span  
of time.  It was argued that both were so similar, why couldn't we just  
keep one type?  The answer is simple -- having them be separate types  
allows me to express relationships that the compiler enforces.  For  
example, you can add two time spans together, but you can't add two points  
in time together.  Or maybe you want a function to accept a time span  
(like a sleep operation).  If there was only one type, then  
sleep(DateTime.now()) compiles and sleeps for what, 2011 years? ;)

I feel that making extra types when the relationship between them is  
important is worth the possible repetition of functionality.  Catching  
bugs during compilation is soooo much better than experiencing them during  
runtime.
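
The point/span split can be sketched in a few lines. This is not Tango's actual API; `TimePoint`, `TimeSpan`, and `sleepFor` are names invented for the illustration, and ticks are an arbitrary unit:

```d
struct TimeSpan
{
    long ticks;

    // span + span = span
    TimeSpan opBinary(string op : "+")(TimeSpan rhs) const
    {
        return TimeSpan(ticks + rhs.ticks);
    }
}

struct TimePoint
{
    long ticks;

    // point + span = point
    TimePoint opBinary(string op : "+")(TimeSpan rhs) const
    {
        return TimePoint(ticks + rhs.ticks);
    }

    // point - point = span
    TimeSpan opBinary(string op : "-")(TimePoint rhs) const
    {
        return TimeSpan(ticks - rhs.ticks);
    }
}

// Placeholder: a real implementation would block for the given span.
void sleepFor(TimeSpan d) {}

void main()
{
    auto start = TimePoint(1_000);
    auto later = start + TimeSpan(250);
    assert(later - start == TimeSpan(250));

    // Both mistakes are rejected at compile time.
    static assert(!__traits(compiles, start + later));
    static assert(!__traits(compiles, sleepFor(start)));

    sleepFor(later - start);
}
```

With a single type, both rejected expressions would compile and misbehave at run time.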

-Steve

Jan 15 2011

foobar <foo bar.com> writes:

Steven Schveighoffer Wrote:

 On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin  
 <michel.fortin michelf.com> wrote:
 
 On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:

 I'm not suggesting we impose it, just that we make it the default. If   
 you want to iterate by dchar, wchar, or char, just write:
  	foreach (dchar c; "exposé") {}
 	foreach (wchar c; "exposé") {}
 	foreach (char c; "exposé") {}
 	// or
 	foreach (dchar c; "exposé".by!dchar()) {}
 	foreach (wchar c; "exposé".by!wchar()) {}
 	foreach (char c; "exposé".by!char()) {}
  and it'll work. But the default would be a slice containing the   
 grapheme, because this is the right way to represent a Unicode  
 character.

  I think this is a good idea.  I previously was nervous about it, but  
 I'm  not sure it makes a huge difference.  Returning a char[] is  
 certainly less  work than normalizing a grapheme into one or more code  
 points, and then  returning them.  All that it takes is to detect all  
 the code points within  the grapheme.  Normalization can be done if  
 needed, but would probably  have to output another char[], since a  
 normalized grapheme can occupy more  than one dchar.

 I'm glad we agree on that now.

 
 It's a matter of me slowly wrapping my brain around unicode and how it's  
 used.  It seems like it's a typical committee defined standard where there  
 are 10 ways to do everything, I was trying to weed out the lesser used (or  
 so I perceived) pieces to allow a more implementable library.  It's doubly  
 hard for me since I have limited experience with other languages, and I've  
 never tried to write them with a computer (my language classes in high  
 school were back in the days of actually writing stuff down on paper).
 
 I once told a colleague who was on a standards committee that their  
 proposed KLV standard (key length value) was ridiculous.  The wise  
 committee had decided that in order to avoid future issues, the length  
 would be encoded as a single byte if < 128, or 128 + length of the length  
 field for anything higher.  This means you could potentially have to parse  
 and process a 127-byte integer!
 
 What if I modified my proposed string_t type to return T[] as its  
 element  type, as you say, and string literals are typed as  
 string_t!(whatever)?   In addition, the restrictions I imposed on  
 slicing a code point actually  get imposed on slicing a grapheme.  That  
 is, it is illegal to substring a  string_t in a way that slices through  
 a grapheme (and by deduction, a code  point)?

 I'm not opposed to that on principle. I'm a little uneasy about having  
 so many types representing a string however. Some other raw comments:

 I agree that things would be more coherent if char[], wchar[], and  
 dchar[] behaved like other arrays, but I can't really see a  
 justification for those types to be in the language if there's nothing  
 special about them (why not a library type?).

 
 I would not be opposed to getting rid of those types.  But I am very  
 opposed to char[] not being an array.  If you want a string to be  
 something other than an array, make it have a different syntax.  We also  
 have to consider C compatibility.
 
 However, we are in radical-change mode then, and this is probably pushed  
 to D3 ;)  If we can find some way to fix the situation without  
 invalidating TDPL, we should strive for that first IMO.
 
 If strings and arrays of code units are distinct, slicing in the middle  
 of a grapheme or in the middle of a code point could throw an error, but  
 for performance reasons it should probably check for that only when  
 array bounds checking is turned on (that would require compiler support  
 however).

 
 Not really, it could use assert, but that throws an assert error instead  
 of a RangeError.  Of course, both are errors and will abort the program.   
 I do wish there was a version(noboundscheck) to do this kind of stuff  
 with...
 
 Actually, we would need a grapheme to be its own type, because  
 comparing  two char[]'s that don't contain equivalent bits and having  
 them be equal,  violates the expectation that char[] is an array.
  So the string_t!char would return a grapheme_t!char (names to be   
 discussed) as its element type.

 Or you could make a grapheme a string_t. ;-)

 
 I'm a little uneasy having a range return itself as its element type.  For  
 all intents and purposes, a grapheme is a string of one 'element', so it  
 could potentially be a string_t.
 
 It does seem daunting to have so many types, but at the same time, types  
 convey relationships at compile time that can make coding impossible to  
 get wrong, or make things actually possible when having a single type  
 doesn't.
 
 I'll give you an example from a previous life:
 
 Tango had a type called DateTime.  This type represented *either* a point  
 in time, or a span of time (depending on how you used it).  But I proposed  
 we switch to two distinct types, one for a point in time, one for a span  
 of time.  It was argued that both were so similar, why couldn't we just  
 keep one type?  The answer is simple -- having them be separate types  
 allows me to express relationships that the compiler enforces.  For  
 example, you can add two time spans together, but you can't add two points  
 in time together.  Or maybe you want a function to accept a time span  
 (like a sleep operation).  If there was only one type, then  
 sleep(DateTime.now()) compiles and sleeps for what, 2011 years? ;)
 
 I feel that making extra types when the relationship between them is  
 important is worth the possible repetition of functionality.  Catching  
 bugs during compilation is soooo much better than experiencing them during  
 runtime.
 
 -Steve

I like Michel's proposed semantics and I also agree with you that it should be
a distinct string type and not break consistency of regular arrays. 

Regarding your last point: Do you mean that a grapheme would be a sub-type of
string? (a specialization where the string represents a single element)? If so,
then it sounds good to me.

Jan 15 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 15 Jan 2011 17:19:48 -0500, foobar <foo bar.com> wrote:

 I like Michel's proposed semantics and I also agree with you that it  
 should be a distinct string type and not break consistency of regular  
 arrays.

 Regarding your last point: Do you mean that a grapheme would be a  
 sub-type of string? (a specialization where the string represents a  
 single element)? If so, then it sounds good to me.

A grapheme would be its own specialized type.  I'd probably remove the  
range primitives to really differentiate it.  Unfortunately, due to the  
inability to statically check this, the invariant would have to be a  
runtime check.  Most likely this check would be disabled in release mode.

This can cause problems, and I can see why it is attractive to use strings  
to implement graphemes, but that also has its problems.  With grapheme  
being its own type, we are providing a way to optimize functions, and  
allow further restrictions on function parameters.

At the end of the day, perhaps grapheme *should* just be a string.  We'll  
have to see how this breaks in practice, either way.

-Steve

Jan 17 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Monday 17 January 2011 04:08:08 Steven Schveighoffer wrote:
 On Sat, 15 Jan 2011 17:19:48 -0500, foobar <foo bar.com> wrote:
 I like Michel's proposed semantics and I also agree with you that it
 should be a distinct string type and not break consistency of regular
 arrays.
 
 Regarding your last point: Do you mean that a grapheme would be a
 sub-type of string? (a specialization where the string represents a
 single element)? If so, then it sounds good to me.

 
 A grapheme would be its own specialized type.  I'd probably remove the
 range primitives to really differentiate it.  Unfortunately, due to the
 inability to statically check this, the invariant would have to be a
 runtime check.  Most likely this check would be disabled in release mode.
 
 This can cause problems, and I can see why it is attractive to use strings
 to implement graphemes, but that also has its problems.  With grapheme
 being its own type, we are providing a way to optimize functions, and
 allow further restrictions on function parameters.
 
 At the end of the day, perhaps grapheme *should* just be a string.  We'll
 have to see how this breaks in practice, either way.

I think that it would make good sense for a grapheme to be a struct which holds a 
string, as Andrei suggested:

struct Grapheme(Char) if (isSomeChar!Char)
{
     private const Char[] rep;
     ...
}

I really think that trying to use strings to represent graphemes is asking for 
it. The element type of a range should be different from the type of the range 
itself.

- Jonathan M Davis

Jan 17 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/17/11 6:25 AM, Jonathan M Davis wrote:
 On Monday 17 January 2011 04:08:08 Steven Schveighoffer wrote:
 On Sat, 15 Jan 2011 17:19:48 -0500, foobar<foo bar.com>  wrote:
 I like Michel's proposed semantics and I also agree with you that it
 should be a distinct string type and not break consistency of regular
 arrays.

 Regarding your last point: Do you mean that a grapheme would be a
 sub-type of string? (a specialization where the string represents a
 single element)? If so, then it sounds good to me.

 A grapheme would be its own specialized type.  I'd probably remove the
 range primitives to really differentiate it.  Unfortunately, due to the
 inability to statically check this, the invariant would have to be a
 runtime check.  Most likely this check would be disabled in release mode.

 This can cause problems, and I can see why it is attractive to use strings
 to implement graphemes, but that also has its problems.  With grapheme
 being its own type, we are providing a way to optimize functions, and
 allow further restrictions on function parameters.

 At the end of the day, perhaps grapheme *should* just be a string.  We'll
 have to see how this breaks in practice, either way.

 I think that it would make good sense for a grapheme to be a struct which holds a
 string, as Andrei suggested:

 struct Grapheme(Char) if (isSomeChar!Char)
 {
       private const Char[] rep;
       ...
 }

 I really think that trying to use strings to represent graphemes is asking for
 it. The element type of a range should be different from the type of the range
 itself.

 - Jonathan M Davis

If someone makes a careful submission of a Grapheme to Phobos as 
described above, it has a high chance of being accepted.

Andrei

Jan 17 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin  
 <michel.fortin michelf.com> wrote:
 
 On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:
 
 I'm not suggesting we impose it, just that we make it the default. If   
 you want to iterate by dchar, wchar, or char, just write:
 	foreach (dchar c; "exposé") {}
 	foreach (wchar c; "exposé") {}
 	foreach (char c; "exposé") {}
 	// or
 	foreach (dchar c; "exposé".by!dchar()) {}
 	foreach (wchar c; "exposé".by!wchar()) {}
 	foreach (char c; "exposé".by!char()) {}
  and it'll work. But the default would be a slice containing the   
 grapheme, because this is the right way to represent a Unicode  
 character.

  I think this is a good idea.  I previously was nervous about it, but  
 I'm  not sure it makes a huge difference.  Returning a char[] is  
 certainly less  work than normalizing a grapheme into one or more code  
 points, and then  returning them.  All that it takes is to detect all  
 the code points within  the grapheme.  Normalization can be done if  
 needed, but would probably  have to output another char[], since a  
 normalized grapheme can occupy more  than one dchar.

 
 I'm glad we agree on that now.

 
 It's a matter of me slowly wrapping my brain around unicode and how 
 it's  used.  It seems like it's a typical committee defined standard 
 where there  are 10 ways to do everything, I was trying to weed out the 
 lesser used (or  so I perceived) pieces to allow a more implementable 
 library.  It's doubly  hard for me since I have limited experience with 
 other languages, and I've  never tried to write them with a computer 
 (my language classes in high  school were back in the days of actually 
 writing stuff down on paper).

Actually, I don't think Unicode was so badly designed. It's just that 
nobody had an idea of the real scope of the problem they had in hand at 
first, and so they had to add a lot of things but wanted to keep things 
backward-compatible. We're at Unicode 6.0 now, can you name one other 
standard that evolved enough to get 6 major versions? I'm surprised 
it's not worse given all that it must support.

That said, I'm sure if someone could redesign Unicode by breaking 
backward-compatibility we'd have something simpler. You could probably 
get rid of pre-combined characters and reduce the number of 
normalization forms. But would you be able to get rid of normalization 
entirely? I don't think so. Reinventing Unicode is probably not worth 
it.


 I'm not opposed to that on principle. I'm a little uneasy about having  
 so many types representing a string however. Some other raw comments:
 
 I agree that things would be more coherent if char[], wchar[], and  
 dchar[] behaved like other arrays, but I can't really see a  
 justification for those types to be in the language if there's nothing  
 special about them (why not a library type?).

 
 I would not be opposed to getting rid of those types.  But I am very  
 opposed to char[] not being an array.  If you want a string to be  
 something other than an array, make it have a different syntax.  We 
 also  have to consider C compatibility.
 
 However, we are in radical-change mode then, and this is probably 
 pushed  to D3 ;)  If we can find some way to fix the situation without  
 invalidating TDPL, we should strive for that first IMO.

Indeed, the change would probably be too radical for D2.

I think we agree that the default type should behave as a Unicode 
string, not an array of characters. I understand your opposition to 
conflating arrays of char with strings, and I agree with you to a 
certain extent that it could have been done better. But we can't really 
change the type of string literals, can we? The only thing we can 
change (I hope) at this point is how iterating on strings works.

Walter said earlier that he opposes changing foreach's default element 
type to dchar for char[] and wchar[] (as Andrei did for ranges) on the 
grounds that it would silently break D1 compatibility. This is a valid 
point in my opinion.

I think you're right when you say that not treating char[] as an array 
of characters breaks, to a certain extent, C compatibility. Another 
valid point.

That said, I want to emphasize that iterating by grapheme, contrary to 
iterating by dchar, does not break any code *silently*. The compiler 
will complain loudly that you're comparing a string to a char, so 
you'll have to change your code somewhere if you want things to 
compile. You'll have to look at the code and decide what to do.

One more thing:

NSString in Cocoa is in essence the same thing as I'm proposing here: 
an array of UTF-16 code units, but with string behaviour. It supports 
by-code-unit indexing, but appending, comparing, searching for 
substrings, etc. all behave correctly as a Unicode string. Again, I 
agree that it's probably not the best design, but I can tell you it 
works well in practice. In fact, NSString doesn't even expose the 
concept of grapheme, it just uses them internally, and you're pretty 
much limited to the built-in operations. I think what we have here in 
concept is much better... even if it somewhat conflates code-unit 
arrays and strings.


 Or you could make a grapheme a string_t. ;-)

 
 I'm a little uneasy having a range return itself as its element type.  
 For  all intents and purposes, a grapheme is a string of one 'element', 
 so it  could potentially be a string_t.
 
 It does seem daunting to have so many types, but at the same time, 
 types  convey relationships at compile time that can make coding 
 impossible to  get wrong, or make things actually possible when having 
 a single type  doesn't.
 
 I'll give you an example from a previous life:
 
 [...]
 I feel that making extra types when the relationship between them is  
 important is worth the possible repetition of functionality.  Catching  
 bugs during compilation is soooo much better than experiencing them 
 during  runtime.

I can understand the utility of a separate type in your DateTime 
example, but in this case I fail to see any advantage.

I mean, a grapheme is a slice of a string, can have multiple code 
points (like a string), can be appended the same way as a string, can 
be composed or decomposed using canonical normalization or 
compatibility normalization (like a string), and should be sorted, 
uppercased, and lowercased according to Unicode rules (like a string). 
Basically, a grapheme is just a string that happens to contain only one 
grapheme. What would a custom type do differently than a string?
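
The "just a string" view can be illustrated with normalization, which applies to a one-grapheme slice exactly as it does to a longer string. A minimal sketch, assuming the `std.uni.normalize` function found in current Phobos:

```d
import std.uni;

void main()
{
    string precomposed = "\u00E9";   // e-acute as a single code point
    string decomposed  = "e\u0301";  // 'e' followed by a combining acute accent

    // As arrays of code units the two spellings differ...
    assert(precomposed != decomposed);

    // ...but canonical normalization maps one onto the other.
    assert(normalize!NFC(decomposed) == precomposed);
    assert(normalize!NFD(precomposed) == decomposed);
}
```

Both values are one grapheme long, and both are ordinary strings as far as these operations are concerned.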

Also, grapheme == "a" is easy to understand because both are strings. 
But if a grapheme is a separate type, what would a grapheme literal 
look like?

So in the end I don't think a grapheme needs a specific type, at least 
not for general purpose text processing. If I split a string on 
whitespace, do I get a range where elements are of type "word"? No, 
just sliced strings.

That said, I'm much less concerned by the type used to represent a 
grapheme than by the Unicode correctness. I'm not opposed to a separate 
type, I just don't really see the point.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 15 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/15/11 4:45 PM, Michel Fortin wrote:
 On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer"
 <schveiguy yahoo.com> said:

 On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin
 <michel.fortin michelf.com> wrote:

 On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"
 <schveiguy yahoo.com> said:

 I'm not suggesting we impose it, just that we make it the default.
 If you want to iterate by dchar, wchar, or char, just write:
 foreach (dchar c; "exposé") {}
 foreach (wchar c; "exposé") {}
 foreach (char c; "exposé") {}
 // or
 foreach (dchar c; "exposé".by!dchar()) {}
 foreach (wchar c; "exposé".by!wchar()) {}
 foreach (char c; "exposé".by!char()) {}
 and it'll work. But the default would be a slice containing the
 grapheme, because this is the right way to represent a Unicode
 character.

 I think this is a good idea. I previously was nervous about it, but
 I'm not sure it makes a huge difference. Returning a char[] is
 certainly less work than normalizing a grapheme into one or more
 code points, and then returning them. All that it takes is to detect
 all the code points within the grapheme. Normalization can be done
 if needed, but would probably have to output another char[], since a
 normalized grapheme can occupy more than one dchar.

 I'm glad we agree on that now.

 It's a matter of me slowly wrapping my brain around unicode and how
 it's used. It seems like it's a typical committee defined standard
 where there are 10 ways to do everything, I was trying to weed out the
 lesser used (or so I perceived) pieces to allow a more implementable
 library. It's doubly hard for me since I have limited experience with
 other languages, and I've never tried to write them with a computer
 (my language classes in high school were back in the days of actually
 writing stuff down on paper).

 Actually, I don't think Unicode was so badly designed. It's just that
 nobody had an idea of the real scope of the problem they had in hand at
 first, and so they had to add a lot of things but wanted to keep things
 backward-compatible. We're at Unicode 6.0 now, can you name one other
 standard that evolved enough to get 6 major versions? I'm surprised it's
 not worse given all that it must support.

 That said, I'm sure if someone could redesign Unicode by breaking
 backward-compatibility we'd have something simpler. You could probably
 get rid of pre-combined characters and reduce the number of
 normalization forms. But would you be able to get rid of normalization
 entirely? I don't think so. Reinventing Unicode is probably not worth it.


 I'm not opposed to that on principle. I'm a little uneasy about
 having so many types representing a string however. Some other raw
 comments:

 I agree that things would be more coherent if char[], wchar[], and
 dchar[] behaved like other arrays, but I can't really see a
 justification for those types to be in the language if there's
 nothing special about them (why not a library type?).

 I would not be opposed to getting rid of those types. But I am very
 opposed to char[] not being an array. If you want a string to be
 something other than an array, make it have a different syntax. We
 also have to consider C compatibility.

 However, we are in radical-change mode then, and this is probably
 pushed to D3 ;) If we can find some way to fix the situation without
 invalidating TDPL, we should strive for that first IMO.

 Indeed, the change would probably be too radical for D2.

 I think we agree that the default type should behave as a Unicode
 string, not an array of characters. I understand your opposition to
 conflating arrays of char with strings, and I agree with you to a
 certain extent that it could have been done better. But we can't really
 change the type of string literals, can we? The only thing we can change
 (I hope) at this point is how iterating on strings works.

 Walter said earlier that he opposes changing foreach's default element
 type to dchar for char[] and wchar[] (as Andrei did for ranges) on the
 grounds that it would silently break D1 compatibility. This is a valid
 point in my opinion.

 I think you're right when you say that not treating char[] as an array
 of characters breaks, to a certain extent, C compatibility. Another valid
 point.

 That said, I want to emphasize that iterating by grapheme, contrary to
 iterating by dchar, does not break any code *silently*. The compiler
 will complain loudly that you're comparing a string to a char, so you'll
 have to change your code somewhere if you want things to compile. You'll
 have to look at the code and decide what to do.

 One more thing:

 NSString in Cocoa is in essence the same thing as I'm proposing here: an
 array of UTF-16 code units, but with string behaviour. It supports
 by-code-unit indexing, but appending, comparing, searching for
 substrings, etc. all behave correctly as a Unicode string. Again, I
 agree that it's probably not the best design, but I can tell you it
 works well in practice. In fact, NSString doesn't even expose the
 concept of grapheme, it just uses them internally, and you're pretty
 much limited to the built-in operations. I think what we have here in
 concept is much better... even if it somewhat conflates code-unit arrays
 and strings.

I'm unclear on where this is converging. At this point the commitment 
of the language and its standard library to (a) UTF array representation 
and (b) code-point conceptualization is quite strong. Changing that 
would be quite difficult and disruptive, and the benefits are virtually 
nonexistent for most of D's user base.

It may be more realistic to consider using what we have as back-end for 
grapheme-oriented processing. For example:

struct Grapheme(Char) if (isSomeChar!Char)
{
     private const Char[] rep;
     ...
}

auto byGrapheme(S)(S s) if (isSomeString!S)
{
    ...
}

string s = "Hello";
foreach (g; byGrapheme(s))
{
     ...
}
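
For readers trying this out today: current Phobos ships `std.uni.byGrapheme` and `std.uni.Grapheme`, which are close in spirit to the sketch above. A minimal example, assuming a recent compiler and standard library, that shows the three levels of iteration discussed in this thread:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "exposé" spelled with a combining acute accent (U+0301) after a plain 'e'.
    string s = "expose\u0301";

    assert(s.length == 8);                 // UTF-8 code units
    assert(s.walkLength == 7);             // code points (strings decode to dchar as ranges)
    assert(s.byGrapheme.walkLength == 6);  // graphemes, i.e. user-perceived characters
}
```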

Andrei

Jan 15 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday 15 January 2011 15:59:27 Andrei Alexandrescu wrote:
 On 1/15/11 4:45 PM, Michel Fortin wrote:
 On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer"
 <schveiguy yahoo.com> said:

 On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin
 <michel.fortin michelf.com> wrote:

 On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"
 <schveiguy yahoo.com> said:

 I'm not suggesting we impose it, just that we make it the default.
 If you want to iterate by dchar, wchar, or char, just write:
 foreach (dchar c; "exposé") {}
 foreach (wchar c; "exposé") {}
 foreach (char c; "exposé") {}
 // or
 foreach (dchar c; "exposé".by!dchar()) {}
 foreach (wchar c; "exposé".by!wchar()) {}
 foreach (char c; "exposé".by!char()) {}
 and it'll work. But the default would be a slice containing the
 grapheme, because this is the right way to represent a Unicode
 character.

 I think this is a good idea. I previously was nervous about it, but
 I'm not sure it makes a huge difference. Returning a char[] is
 certainly less work than normalizing a grapheme into one or more
 code points, and then returning them. All that it takes is to detect
 all the code points within the grapheme. Normalization can be done
 if needed, but would probably have to output another char[], since a
 normalized grapheme can occupy more than one dchar.

 I'm glad we agree on that now.

 It's a matter of me slowly wrapping my brain around unicode and how
 it's used. It seems like it's a typical committee defined standard
 where there are 10 ways to do everything, I was trying to weed out the
 lesser used (or so I perceived) pieces to allow a more implementable
 library. It's doubly hard for me since I have limited experience with
 other languages, and I've never tried to write them with a computer
 (my language classes in high school were back in the days of actually
 writing stuff down on paper).

 Actually, I don't think Unicode was so badly designed. It's just that
 nobody had an idea of the real scope of the problem they had in hand at
 first, and so they had to add a lot of things but wanted to keep things
 backward-compatible. We're at Unicode 6.0 now, can you name one other
 standard that evolved enough to get 6 major versions? I'm surprised it's
 not worse given all that it must support.

 That said, I'm sure if someone could redesign Unicode by breaking
 backward-compatibility we'd have something simpler. You could probably
 get rid of pre-combined characters and reduce the number of
 normalization forms. But would you be able to get rid of normalization
 entirely? I don't think so. Reinventing Unicode is probably not worth it.

 I'm not opposed to that on principle. I'm a little uneasy about
 having so many types representing a string however. Some other raw
 comments:

 I agree that things would be more coherent if char[], wchar[], and
 dchar[] behaved like other arrays, but I can't really see a
 justification for those types to be in the language if there's
 nothing special about them (why not a library type?).

 I would not be opposed to getting rid of those types. But I am very
 opposed to char[] not being an array. If you want a string to be
 something other than an array, make it have a different syntax. We
 also have to consider C compatibility.

 However, we are in radical-change mode then, and this is probably
 pushed to D3 ;) If we can find some way to fix the situation without
 invalidating TDPL, we should strive for that first IMO.

 Indeed, the change would probably be too radical for D2.

 I think we agree that the default type should behave as a Unicode
 string, not an array of characters. I understand your opposition to
 conflating arrays of char with strings, and I agree with you to a
 certain extent that it could have been done better. But we can't really
 change the type of string literals, can we? The only thing we can change
 (I hope) at this point is how iterating on strings works.

 Walter said earlier that he opposes changing foreach's default element
 type to dchar for char[] and wchar[] (as Andrei did for ranges) on the
 ground that it would silently break D1 compatibility. This is a valid
 point in my opinion.

 I think you're right when you say that not treating char[] as an array
 of characters breaks, to a certain extent, C compatibility. Another valid
 point.

 That said, I want to emphasize that iterating by grapheme, contrary to
 iterating by dchar, does not break any code *silently*. The compiler
 will complain loudly that you're comparing a string to a char, so you'll
 have to change your code somewhere if you want things to compile. You'll
 have to look at the code and decide what to do.

 One more thing:

 NSString in Cocoa is in essence the same thing as I'm proposing here: an
 array of UTF-16 code units, but with string behaviour. It supports
 by-code-unit indexing, but appending, comparing, searching for
 substrings, etc. all behave correctly as a Unicode string. Again, I
 agree that it's probably not the best design, but I can tell you it
 works well in practice. In fact, NSString doesn't even expose the
 concept of grapheme, it just uses them internally, and you're pretty
 much limited to the built-in operations. I think what we have here in
 concept is much better... even if it somewhat conflates code-unit arrays
 and strings.

 I'm unclear on where this is converging to. At this point the commitment
 of the language and its standard library to (a) UTF array representation
 and (b) code points conceptualization is quite strong. Changing that
 would be quite difficult and disruptive, and the benefits are virtually
 nonexistent for most of D's user base.

 It may be more realistic to consider using what we have as back-end for
 grapheme-oriented processing. For example:

 struct Grapheme(Char) if (isSomeChar!Char)
 {
      private const Char[] rep;
      ...
 }

 auto byGrapheme(S)(S s) if (isSomeString!S)
 {
     ...
 }

 string s = "Hello";
 foreach (g; byGrapheme(s))
 {
      ...
 }

Considering that strings are already dealt with specially in order to have an
element of dchar, I wouldn't think that it would be all that disruptive to make
it so that they had an element type of Grapheme instead. Wouldn't that then fix
all of std.algorithm and the like without really disrupting anything?

The issue of foreach remains, but without being willing to change what foreach
defaults to, you can't really fix it - though I'd suggest that we at least make
it a warning to iterate over strings without specifying the type. And if foreach
were made to understand Grapheme like it understands dchar, then you could do

foreach(Grapheme g; str) { ... }

and have the compiler warn about

foreach(g; str) { ... }

and tell you to use Grapheme if you want to be comparing actual characters.
Regardless, by making strings ranges of Grapheme rather than dchar, I would
think that we would solve most of the problem. At minimum, we'd have pretty much
the same problems that we have right now with char and wchar arrays, but we'd
get rid of a whole class of unicode problems. So, nothing would be worse, but
some of it would be better.

- Jonathan M Davis

Jan 15 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-15 22:25:47 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:

 The issue of foreach remains, but without being willing to change what 
 foreach defaults to, you can't really fix it - though I'd suggest that 
 we at least make it a warning to iterate over strings without 
 specifying the type. And if foreach were made to understand Grapheme 
 like it understands dchar, then you could do
 
 foreach(Grapheme g; str) { ... }
 
 and have the compiler warn about
 
 foreach(g; str) { ... }
 
 and tell you to use Grapheme if you want to be comparing actual characters.

Walter's argument against changing this for foreach was that it'd 
*silently* break compatibility with existing D1 code. Changing the 
default to a grapheme makes this argument obsolete: since a grapheme is 
essentially a string, you can't compare it with char or wchar or dchar 
directly, so it'll break at compile time with an error and you'll have 
to decide what to do.

So Walter would have to find another argument to defend the status quo.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 15 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/15/11 10:47 PM, Michel Fortin wrote:
 On 2011-01-15 22:25:47 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:

 The issue of foreach remains, but without being willing to change what
 foreach defaults to, you can't really fix it - though I'd suggest that
 we at least make it a warning to iterate over strings without
 specifying the type. And if foreach were made to understand Grapheme
 like it understands dchar, then you could do

 foreach(Grapheme g; str) { ... }

 and have the compiler warn about

 foreach(g; str) { ... }

 and tell you to use Grapheme if you want to be comparing actual
 characters.

 Walter's argument against changing this for foreach was that it'd
 *silently* break compatibility with existing D1 code. Changing the
 default to a grapheme makes this argument obsolete: since a grapheme is
 essentially a string, you can't compare it with char or wchar or dchar
 directly, so it'll break at compile time with an error and you'll have
 to decide what to do.

 So Walter would have to find another argument to defend the status quo.

I think it's poor abstraction to represent a Grapheme as a string. It 
should be its own type.

Andrei

Jan 16 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/15/11 9:25 PM, Jonathan M Davis wrote:
 Considering that strings are already dealt with specially in order to have an
 element of dchar, I wouldn't think that it would be all that disruptive to make
 it so that they had an element type of Grapheme instead. Wouldn't that then fix
 all of std.algorithm and the like without really disrupting anything?

It would make everything related a lot (a TON) slower, and it would 
break all client code that uses dchar as the element type, or is 
otherwise unprepared to use Graphemes explicitly. There is no question 
there will be disruption.

Andrei

Jan 16 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

And how would 3rd party libraries handle Graphemes? And C modules? I
think making these Graphemes the default would make quite a mess,
since you would have to convert back and forth between char[] and
Grapheme[] all the time (right?).

Jan 16 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/15/11 9:25 PM, Jonathan M Davis wrote:
 Considering that strings are already dealt with specially in order to have an
 element of dchar, I wouldn't think that it would be all that disruptive to make
 it so that they had an element type of Grapheme instead. Wouldn't that then fix
 all of std.algorithm and the like without really disrupting anything?

 It would make everything related a lot (a TON) slower, and it would  
 break all client code that uses dchar as the element type, or is  
 otherwise unprepared to use Graphemes explicitly. There is no question  
 there will be disruption.

I would have agreed with you last week.  Now I understand that using dchar  
is just as useless for unicode as using char.

Will it be slower?  Perhaps.  A TON slower?  Probably not.

But it will be correct.  Correct and slow is better than incorrect and  
fast.  If I showed you a shortest-path algorithm that ran in O(V) time,  
but didn't always find the shortest path, would you call it a success?

We need to get some real numbers together.  I'll see what I can create for  
a type, but someone else needs to supply the input :)  I'm on short supply  
of unicode data, and any attempts I've made to create some result in  
failure.  I have one example of one composed character in this thread that  
I can cling to, but in order to supply some real numbers, we need a large  
amount of data.

-Steve

Jan 17 2011

"Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:

On Mon, 17 Jan 2011 07:44:17 -0500, Steven Schveighoffer wrote:

 We need to get some real numbers together.  I'll see what I can create
 for a type, but someone else needs to supply the input :)  I'm on short
 supply of unicode data, and any attempts I've made to create some result
 in failure.  I have one example of one composed character in this thread
 that I can cling to, but in order to supply some real numbers, we need a
 large amount of data.

Googling "unicode sample document" turned up a few examples.  This one 
looks promising:

http://www.humancomp.org/unichtm/unichtm.htm

-Lars

Jan 17 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
 On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 1/15/11 9:25 PM, Jonathan M Davis wrote:
 Considering that strings are already dealt with specially in order to have an
 element of dchar, I wouldn't think that it would be all that disruptive to make
 it so that they had an element type of Grapheme instead. Wouldn't that then fix
 all of std.algorithm and the like without really disrupting anything?

 It would make everything related a lot (a TON) slower, and it would
 break all client code that uses dchar as the element type, or is
 otherwise unprepared to use Graphemes explicitly. There is no question
 there will be disruption.

 I would have agreed with you last week. Now I understand that using
 dchar is just as useless for unicode as using char.

This is one extreme. Char only works for English. Dchar works for most 
languages. It won't work for a few. That doesn't make it useless for 
languages that work with it.

 Will it be slower? Perhaps. A TON slower? Probably not.

It will be a ton slower.

 But it will be correct. Correct and slow is better than incorrect and
 fast. If I showed you a shortest-path algorithm that ran in O(V) time,
 but didn't always find the shortest path, would you call it a success?

The comparison doesn't apply.

 We need to get some real numbers together. I'll see what I can create
 for a type, but someone else needs to supply the input :) I'm on short
 supply of unicode data, and any attempts I've made to create some result
 in failure. I have one example of one composed character in this thread
 that I can cling to, but in order to supply some real numbers, we need a
 large amount of data.

I very much appreciate that you're doing actual work on this.


Andrei

Jan 17 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
 We need to get some real numbers together. I'll see what I can create
 for a type, but someone else needs to supply the input :) I'm on short
 supply of unicode data, and any attempts I've made to create some result
 in failure. I have one example of one composed character in this thread
 that I can cling to, but in order to supply some real numbers, we need a
 large amount of data.

Oh, one more thing. You don't need a lot of Unicode text containing 
combining characters to write benchmarks. (You do need it for testing 
purposes.) Most text won't contain combining characters anyway, so after 
you implement graphemes, just benchmark them on regular text.

Andrei

Jan 17 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 17 Jan 2011 10:00:57 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
 We need to get some real numbers together. I'll see what I can create
 for a type, but someone else needs to supply the input :) I'm on short
 supply of unicode data, and any attempts I've made to create some result
 in failure. I have one example of one composed character in this thread
 that I can cling to, but in order to supply some real numbers, we need a
 large amount of data.

 Oh, one more thing. You don't need a lot of Unicode text containing  
 combining characters to write benchmarks. (You do need it for testing  
 purposes.) Most text won't contain combining characters anyway, so after  
 you implement graphemes, just benchmark them on regular text.

True, benchmarking doesn't apply with combining characters because we have  
nothing to compare it to.  The current scheme fails on it anyways, so it  
by default would be the best solution.

-Steve

Jan 17 2011

spir <denis.spir gmail.com> writes:

On 01/17/2011 04:00 PM, Andrei Alexandrescu wrote:
 On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
 We need to get some real numbers together. I'll see what I can create
 for a type, but someone else needs to supply the input :) I'm on short
 supply of unicode data, and any attempts I've made to create some result
 in failure. I have one example of one composed character in this thread
 that I can cling to, but in order to supply some real numbers, we need a
 large amount of data.

 Oh, one more thing. You don't need a lot of Unicode text containing
 combining characters to write benchmarks. (You do need it for testing
 purposes.) Most text won't contain combining characters anyway, so after
 you implement graphemes, just benchmark them on regular text.

Correct. For this reason, we do not use the same source at all for 
correctness and performance testing.
It is impossible to define a typical or representative source (who
judges?). But at the very minimum, source texts for perf measurement should
mix languages as diverse as possible, including some material from the
ones known to be problematic and/or atypical (English, Korean, Hebrew...).
The following (ripped and composed from ICU data sets) is just that:
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt

Content:
12 natural languages
34767 bytes = utf8 code units
--> 20133 code points
--> 22033 normal codes (NFD decomposed)
--> 19205 piles = true characters

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 17 2011

spir <denis.spir gmail.com> writes:

On 01/17/2011 01:44 PM, Steven Schveighoffer wrote:
 On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 1/15/11 9:25 PM, Jonathan M Davis wrote:
 Considering that strings are already dealt with specially in order to have an
 element of dchar, I wouldn't think that it would be all that disruptive to make
 it so that they had an element type of Grapheme instead. Wouldn't that then fix
 all of std.algorithm and the like without really disrupting anything?

 It would make everything related a lot (a TON) slower, and it would
 break all client code that uses dchar as the element type, or is
 otherwise unprepared to use Graphemes explicitly. There is no question
 there will be disruption.

 I would have agreed with you last week. Now I understand that using
 dchar is just as useless for unicode as using char.

 Will it be slower? Perhaps. A TON slower? Probably not.

 But it will be correct. Correct and slow is better than incorrect and
 fast. If I showed you a shortest-path algorithm that ran in O(V) time,
 but didn't always find the shortest path, would you call it a success?

 We need to get some real numbers together. I'll see what I can create
 for a type, but someone else needs to supply the input :) I'm on short
 supply of unicode data, and any attempts I've made to create some result
 in failure. I have one example of one composed character in this thread
 that I can cling to, but in order to supply some real numbers, we need a
 large amount of data.

 -Steve

Hello Steve & Andrei,


I see 2 questions: (1) whether we should provide Unicode correctness as
a default or not, along with the related points of level of abstraction &
normalisation; (2) what is the best way to implement such correctness?
Let us put aside (1) for a while; anyway, nothing prevents us from
experimenting while waiting for an agreement. Such experiments would in fact
feed the debate with real facts instead of "airy" ideas.

It seems there are 2 opposite approaches to Unicode correctness. Mine
was to build a type that systematically abstracts UCS-created issues
(that real whole characters are coded by mini-arrays of codes I call
"code piles", that those piles have variable lengths, _and_ that
characters may even have several representations). Then, in my wild
guesses, every text manipulation method should obviously be "flash
fast", actually faster than any on-the-fly algo by several orders of
magnitude. But Michel made me doubt that point.

The other approach is precisely to provide the needed abstraction ("piling"
and normalisation) on the fly, as proposed by Michel and as
Objective-C does, IIUC. This way seems to me closer to a kind of
redesign of Steven's new String type and/or Andrei's VLERange.

As you say, we need real timing numbers to decide. I think we should 
measure at least 2 routines:
* indexing (or better iteration?) which only requires "piling"
* counting occurrences of a given character or slice, which requires 
both piling and normalisation
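
The first of these measurements could be set up roughly as below. This is only a sketch under assumptions: it relies on std.uni.byGrapheme and std.datetime.stopwatch from a present-day Phobos (neither was available when this thread was written), the helper names are made up for illustration, and the input is placeholder ASCII text rather than the multilingual sample discussed here.

```d
import std.array : replicate;
import std.datetime.stopwatch : benchmark;
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;

// Hypothetical helpers: walk the whole text once at each level.
size_t countCodePoints(string text) { return text.walkLength; }
size_t countGraphemes(string text)  { return text.byGrapheme.walkLength; }

void main()
{
    // Placeholder input (an assumption); a real run would load a
    // multilingual sample file instead.
    string text = "The quick brown fox jumps over the lazy dog. ".replicate(10_000);

    size_t sink; // accumulates results so the loops do real work

    void byCodePoint() { sink += countCodePoints(text); }
    void byGraphemes() { sink += countGraphemes(text); }

    // Run each function 10 times and collect the total durations.
    auto times = benchmark!(byCodePoint, byGraphemes)(10);

    writefln("by code point: %s", times[0]);
    writefln("by grapheme:   %s", times[1]);
    writefln("(checksum %s)", sink);
}
```

This covers only the "piling" (iteration) case; counting occurrences would add a normalisation step on top.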

I do not feel like implementing such routines for the on-the-fly
version, and have no time for this in the coming days; but if anyone
volunteers, feel free to rip code and data from Text's current
implementation if it may help.

As source text, we can use the one at 
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt 
(already my source for perf measures). Its only merit is that it is a
text (about unicode!) in twelve rather different languages.

[My intuitive guess is that Michel is wrong by orders of magnitude --but
again I know next to nothing about code performance.]


Denis
_________________
vita es estrany
spir.wikidot.com

Jan 17 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday 15 January 2011 19:25:47 Jonathan M Davis wrote:
 On Saturday 15 January 2011 15:59:27 Andrei Alexandrescu wrote:
 On 1/15/11 4:45 PM, Michel Fortin wrote:
 On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer"
 <schveiguy yahoo.com> said:
 On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin
 <michel.fortin michelf.com> wrote:
 On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"
 <schveiguy yahoo.com> said:
 I'm not suggesting we impose it, just that we make it the default.
 If you want to iterate by dchar, wchar, or char, just write:
 foreach (dchar c; "exposé") {}
 foreach (wchar c; "exposé") {}
 foreach (char c; "exposé") {}
 // or
 foreach (dchar c; "exposé".by!dchar()) {}
 foreach (wchar c; "exposé".by!wchar()) {}
 foreach (char c; "exposé".by!char()) {}
 and it'll work. But the default would be a slice containing the
 grapheme, because this is the right way to represent a Unicode
 character.

 I think this is a good idea. I previously was nervous about it, but
 I'm not sure it makes a huge difference. Returning a char[] is
 certainly less work than normalizing a grapheme into one or more
 code points, and then returning them. All that it takes is to detect
 all the code points within the grapheme. Normalization can be done
 if needed, but would probably have to output another char[], since a
 normalized grapheme can occupy more than one dchar.

 I'm glad we agree on that now.

 It's a matter of me slowly wrapping my brain around unicode and how
 it's used. It seems like it's a typical committee defined standard
 where there are 10 ways to do everything, I was trying to weed out the
 lesser used (or so I perceived) pieces to allow a more implementable
 library. It's doubly hard for me since I have limited experience with
 other languages, and I've never tried to write them with a computer
 (my language classes in high school were back in the days of actually
 writing stuff down on paper).

 Actually, I don't think Unicode was so badly designed. It's just that
 nobody had an idea of the real scope of the problem they had in hand at
 first, and so they had to add a lot of things but wanted to keep things
 backward-compatible. We're at Unicode 6.0 now, can you name one other
 standard that evolved enough to get 6 major versions? I'm surprised
 it's not worse given all that it must support.

 That said, I'm sure if someone could redesign Unicode by breaking
 backward-compatibility we'd have something simpler. You could probably
 get rid of pre-combined characters and reduce the number of
 normalization forms. But would you be able to get rid of normalization
 entirely? I don't think so. Reinventing Unicode is probably not worth
 it.

 I'm not opposed to that on principle. I'm a little uneasy about
 having so many types representing a string however. Some other raw
 comments:

 I agree that things would be more coherent if char[], wchar[], and
 dchar[] behaved like other arrays, but I can't really see a
 justification for those types to be in the language if there's
 nothing special about them (why not a library type?).

 I would not be opposed to getting rid of those types. But I am very
 opposed to char[] not being an array. If you want a string to be
 something other than an array, make it have a different syntax. We
 also have to consider C compatibility.

 However, we are in radical-change mode then, and this is probably
 pushed to D3 ;) If we can find some way to fix the situation without
 invalidating TDPL, we should strive for that first IMO.

 Indeed, the change would probably be too radical for D2.

 I think we agree that the default type should behave as a Unicode
 string, not an array of characters. I understand your opposition to
 conflating arrays of char with strings, and I agree with you to a
 certain extent that it could have been done better. But we can't really
 change the type of string literals, can we? The only thing we can
 change (I hope) at this point is how iterating on strings works.

 Walter said earlier that he opposes changing foreach's default element
 type to dchar for char[] and wchar[] (as Andrei did for ranges) on the
 ground that it would silently break D1 compatibility. This is a valid
 point in my opinion.

 I think you're right when you say that not treating char[] as an array
 of characters breaks, to a certain extent, C compatibility. Another
 valid point.

 That said, I want to emphasize that iterating by grapheme, contrary to
 iterating by dchar, does not break any code *silently*. The compiler
 will complain loudly that you're comparing a string to a char, so
 you'll have to change your code somewhere if you want things to
 compile. You'll have to look at the code and decide what to do.

 One more thing:

 NSString in Cocoa is in essence the same thing as I'm proposing here:
 an array of UTF-16 code units, but with string behaviour. It supports
 by-code-unit indexing, but appending, comparing, searching for
 substrings, etc. all behave correctly as a Unicode string. Again, I
 agree that it's probably not the best design, but I can tell you it
 works well in practice. In fact, NSString doesn't even expose the
 concept of grapheme, it just uses them internally, and you're pretty
 much limited to the built-in operations. I think what we have here in
 concept is much better... even if it somewhat conflates code-unit
 arrays and strings.

 I'm unclear on where this is converging to. At this point the commitment
 of the language and its standard library to (a) UTF array representation
 and (b) code points conceptualization is quite strong. Changing that
 would be quite difficult and disruptive, and the benefits are virtually
 nonexistent for most of D's user base.

 It may be more realistic to consider using what we have as back-end for
 grapheme-oriented processing. For example:

 struct Grapheme(Char) if (isSomeChar!Char)
 {
      private const Char[] rep;
      ...
 }

 auto byGrapheme(S)(S s) if (isSomeString!S)
 {
     ...
 }

 string s = "Hello";
 foreach (g; byGrapheme(s))
 {
      ...
 }

 Considering that strings are already dealt with specially in order to have
 an element of dchar, I wouldn't think that it would be all that
 disruptive to make it so that they had an element type of Grapheme
 instead. Wouldn't that then fix all of std.algorithm and the like without
 really disrupting anything?

 The issue of foreach remains, but without being willing to change what
 foreach defaults to, you can't really fix it - though I'd suggest that we
 at least make it a warning to iterate over strings without specifying the
 type. And if foreach were made to understand Grapheme like it understands
 dchar, then you could do

 foreach(Grapheme g; str) { ... }

 and have the compiler warn about

 foreach(g; str) { ... }

 and tell you to use Grapheme if you want to be comparing actual characters.
 Regardless, by making strings ranges of Grapheme rather than dchar, I would
 think that we would solve most of the problem. At minimum, we'd have pretty
 much the same problems that we have right now with char and wchar arrays,
 but we'd get rid of a whole class of unicode problems. So, nothing would
 be worse, but some of it would be better.

I suppose that the one major omission though is that string comparisons would be
by code unit, not graphemes, which would be a problem. == could be made to use
graphemes instead, but then you couldn't compare them by code units or code
points unless you cast to ubyte[], ushort[], or uint[]... It would still
probably be worth making == use graphemes though.
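
To illustrate the comparison problem, here is a minimal sketch. It assumes a present-day Phobos, where std.uni.normalize is available, and it swaps in normalization to NFC before comparing as a stand-in for "== by grapheme", which the thread does not spell out. The helper name sameText is made up for illustration.

```d
import std.uni : NFC, normalize;

// Hypothetical helper: compare two strings after normalizing both to NFC,
// so that canonically equivalent spellings compare equal.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "expos\u00E9";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
    string decomposed  = "expose\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT

    // Built-in == compares code units, so the two spellings differ.
    assert(precomposed != decomposed);

    // After normalization they are the same text.
    assert(sameText(precomposed, decomposed));
}
```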

- Jonathan M Davis

Jan 15 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-15 18:59:27 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 I'm unclear on where this is converging to. At this point the 
 commitment of the language and its standard library to (a) UTF array 
 representation and (b) code points conceptualization is quite strong. 
 Changing that would be quite difficult and disruptive, and the benefits 
 are virtually nonexistent for most of D's user base.

There's still a disagreement about whether a string or a code unit 
array should be the default string representation, and whether 
iterating on a code unit array should give you code unit or grapheme 
elements. Of those who participated in the discussion, I don't 
think anyone is disputing the idea that a grapheme element is better 
than a dchar element for iterating over a string.


 It may be more realistic to consider using what we have as back-end for 
 grapheme-oriented processing.
 For example:
 
 struct Grapheme(Char) if (isSomeChar!Char)
 {
      private const Char[] rep;
      ...
 }
 
 auto byGrapheme(S)(S s) if (isSomeString!S)
 {
     ...
 }
 
 string s = "Hello";
 foreach (g; byGrapheme(s))
 {
      ...
 }

No doubt it's easier to implement it that way. The problem is that in 
most cases it won't be used. How many people really know what is a 
grapheme? Of those, how many will forget to use byGrapheme at one time 
or another? And so in most programs string manipulation will misbehave 
in the presence of combining characters or unnormalized strings.

If you want to help D programmers write correct code when it comes to 
Unicode manipulation, you need to help them iterate on real characters 
(graphemes), and you need the algorithms to apply to real characters 
(graphemes), not the approximation of a Unicode character that is a 
code point.
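
The gap between the three levels can be made concrete with a small sketch. It assumes a present-day Phobos, where std.uni.byGrapheme exists (it did not when this thread was written); the helper names are made up for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, named here only for illustration.
size_t codeUnits(string s)  { return s.length; }                // UTF-8 code units
size_t codePoints(string s) { return s.walkLength; }            // decoded dchars
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // user-perceived characters

void main()
{
    // "expose" followed by U+0301 COMBINING ACUTE ACCENT.
    string s = "expose\u0301";

    assert(codeUnits(s) == 8);  // 6 ASCII bytes + 2 bytes for U+0301
    assert(codePoints(s) == 7); // the accent is a separate code point
    assert(graphemes(s) == 6);  // 'e' plus the accent form one grapheme
}
```

Any algorithm that works per dchar sees seven elements here, while a reader sees six characters.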


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 15 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/15/11 10:45 PM, Michel Fortin wrote:
 On 2011-01-15 18:59:27 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 I'm unclear on where this is converging to. At this point the
 commitment of the language and its standard library to (a) UTF array
 representation and (b) code points conceptualization is quite strong.
 Changing that would be quite difficult and disruptive, and the
 benefits are virtually nonexistent for most of D's user base.

 There's still a disagreement about whether a string or a code unit array
 should be the default string representation, and whether iterating on a
 code unit array should give you code unit or grapheme elements. Of those
 who participated in the discussion, I don't think anyone is
 disputing the idea that a grapheme element is better than a dchar
 element for iterating over a string.

Disagreement as that might be, a simple fact that needs to be taken into 
account is that as of right now all of Phobos uses UTF arrays for string 
representation and dchar as element type.

Besides, for one I do dispute the idea that a grapheme element is better 
than a dchar element for iterating over a string. The grapheme has the 
attractiveness of being theoretically clean but at the same time is 
woefully inefficient and helps languages that few D users need to work 
with. At least that's my perception, and we need some serious numbers 
instead of convincing rhetoric to make a big decision.

It's all a matter of picking one's trade-offs. Clearly ASCII is out as 
no serious amount of non-English text can be trafficked without 
diacritics. So switching to UTF makes a lot of sense, and that's what D did.

When I introduced std.range and std.algorithm, they'd handle char[] and 
wchar[] no differently than any other array. A lot of algorithms simply 
did the wrong thing by default, so I attempted to fix that situation by 
defining byDchar(). So instead of passing some string str to an 
algorithm, one would pass byDchar(str).

A couple of weeks went by in testing that state of affairs, and before 
long I figured that I needed to insert byDchar() virtually _everywhere_. 
There were a couple of algorithms (e.g. Boyer-Moore) that happened to 
work with arrays for subtle reasons (needless to say, they won't work 
with graphemes at all). But by and large the situation was that the 
simple and intuitive code was wrong and that the correct code 
necessitated inserting byDchar().

So my next decision, which understandably some of the people who didn't 
go through the experiment may find unintuitive, was to make byDchar() 
the default. This cleaned up a lot of crap in std itself and saved a lot 
of crap in the yet-unwritten client code.

I think it's reasonable to understand why I'm happy with the current 
state of affairs. It is better than anything we've had before and better 
than everything else I've tried.

Now, thanks to the effort people have spent in this group (thank you!), 
I have an understanding of the grapheme issue. I guarantee that 
grapheme-level iteration will come at a high cost in both 
efficiency and changes in std. The languages that need composing 
characters for producing meaningful text are few and far between, so it 
makes sense to confine support for them to libraries that are not the 
default, unless we find ways to not disrupt everyone else.

 It may be more realistic to consider using what we have as back-end
 for grapheme-oriented processing.
 For example:

 struct Grapheme(Char) if (isSomeChar!Char)
 {
     private const Char[] rep;
     ...
 }

 auto byGrapheme(S)(S s) if (isSomeString!S)
 {
     ...
 }

 string s = "Hello";
 foreach (g; byGrapheme(s))
 {
     ...
 }

 No doubt it's easier to implement it that way. The problem is that in
 most cases it won't be used. How many people really know what is a
 grapheme?

How many people really should care?

 Of those, how many will forget to use byGrapheme at one time
 or another? And so in most programs string manipulation will misbehave
 in the presence of combining characters or unnormalized strings.

But most strings don't contain combining characters or unnormalized strings.

 If you want to help D programmers write correct code when it comes to
 Unicode manipulation, you need to help them iterate on real characters
 (graphemes), and you need the algorithms to apply to real characters
 (graphemes), not the approximation of a Unicode character that is a code
 point.

I don't think the situation is as clean cut, as grave, and as urgent as 
you say.
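
To make the three levels under discussion concrete, here is a minimal sketch that counts the same text in code units, code points, and graphemes. It assumes a later Phobos release that provides std.uni.byGrapheme (in this thread byGrapheme is only a proposal); the helper names codeUnitCount, codePointCount, and graphemeCount are made up for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Code units: the raw array length of a UTF-8 string.
size_t codeUnitCount(string s)
{
    return s.length;
}

/// Code points: what range-based iteration over a string yields (dchar).
size_t codePointCount(string s)
{
    return s.walkLength;
}

/// Graphemes: user-perceived characters, each possibly several code points.
/// Assumes std.uni.byGrapheme is available in the Phobos being used.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}
```

For "e\u0301" (e followed by U+0301 COMBINING ACUTE ACCENT) the three functions give 3, 2, and 1; for the precomposed "\u00E9" they give 2, 1, and 1.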


Andrei

Jan 16 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/15/11 10:45 PM, Michel Fortin wrote:
 No doubt it's easier to implement it that way. The problem is that in
 most cases it won't be used. How many people really know what is a
 grapheme?

 
 How many people really should care?

I think the only people who should *not* care are those who have 
validated that the input does not contain any combining code point. If 
you know the input *can't* contain combining code points, then it's 
safe to ignore them.

If we don't make correct Unicode handling the default, someday someone 
is going to ask a developer to fix a problem where his system doesn't 
handle some text correctly. Later that day, he'll come to the 
realization that almost none of his D code and none of the D libraries 
he uses handle Unicode correctly, and he'll say: can't fix this. His 
peer working on a similar Objective-C program will have a good laugh.

Sure, correct Unicode handling is slower and more complicated to 
implement, but at least you know you'll get the right results.


 Of those, how many will forget to use byGrapheme at one time
 or another? And so in most programs string manipulation will misbehave
 in the presence of combining characters or unnormalized strings.

 
 But most strings don't contain combining characters or unnormalized strings.

I think we should expect combining marks to be used more and more as 
our OS text system and fonts start supporting them better. Them being 
rare might be true today, but what do you know about tomorrow?

A few years ago, many Unicode symbols didn't even show up correctly on 
Windows. Today, we have Unicode domain names and people start putting 
funny symbols in them (for instance: <http://◉.ws>). I haven't seen it 
yet, but we'll surely see combining characters in domain names soon 
enough (if only as a way to make fun of programs that can't handle 
Unicode correctly). Well, let me be the first to make fun of such 
programs: <http://☺̭̏.michelf.com/>.

Also, not all combining characters are marks meant to be used by some 
foreign languages. Some are used for mathematics for instance. Or you 
could use 20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay 
indicating some kind of prohibition.


 If you want to help D programmers write correct code when it comes to
 Unicode manipulation, you need to help them iterate on real characters
 (graphemes), and you need the algorithms to apply to real characters
 (graphemes), not the approximation of a Unicode character that is a code
 point.

 
 I don't think the situation is as clean cut, as grave, and as urgent as 
 you say.

I agree it's probably not as clean cut as I say (I'm trying to keep 
complicated things simple here), but it's something important to decide 
early because the cost of changing it increases as more code is written.


Quoting the first part of the same post (out of order):

 Disagreement as that might be, a simple fact that needs to be taken 
 into account is that as of right now all of Phobos uses UTF arrays for 
 string representation and dchar as element type.
 
 Besides, for one I do dispute the idea that a grapheme element is 
 better than a dchar element for iterating over a string. The grapheme 
 has the attractiveness of being theoretically clean but at the same 
 time is woefully inefficient and helps languages that few D users need 
 to work with. At least that's my perception, and we need some serious 
 numbers instead of convincing rhetoric to make a big decision.

You'll no doubt get more performance from a grapheme-aware specialized 
algorithm working directly on code points than by iterating on 
graphemes returned as string slices. But both will give *correct* 
results.

Implementing a specialized algorithm of this kind becomes an 
optimization, and it's likely you'll want an optimized version for most 
string algorithms.

I'd like to have some numbers too about performance, but I have none at 
this time.


 It's all a matter of picking one's trade-offs. Clearly ASCII is out as 
 no serious amount of non-English text can be trafficked without 
 diacritics. So switching to UTF makes a lot of sense, and that's what D 
 did.
 
 When I introduced std.range and std.algorithm, they'd handle char[] and 
 wchar[] no differently than any other array. A lot of algorithms simply 
 did the wrong thing by default, so I attempted to fix that situation by 
 defining byDchar(). So instead of passing some string str to an 
 algorithm, one would pass byDchar(str).
 
 A couple of weeks went by in testing that state of affairs, and before 
 late I figured that I need to insert byDchar() virtually _everywhere_. 
 There were a couple of algorithms (e.g. Boyer-Moore) that happened to 
 work with arrays for subtle reasons (needless to say, they won't work 
 with graphemes at all). But by and large the situation was that the 
 simple and intuitive code was wrong and that the correct code 
 necessitated inserting byDchar().
 
 So my next decision, which understandably some of the people who didn't 
 go through the experiment may find unintuitive, was to make byDchar() 
 the default. This cleaned up a lot of crap in std itself and saved a 
 lot of crap in the yet-unwritten client code.

But were your algorithms *correct* in the first place? I'd argue that 
by making byDchar the default you've not saved yourself from any crap 
because dchar isn't the right layer of abstraction.


 I think it's reasonable to understand why I'm happy with the current 
 state of affairs. It is better than anything we've had before and 
 better than everything else I've tried.

It is indeed easy to understand why you're happy with the current state 
of affairs: you never had to deal with multi-code-point characters and 
can't imagine yourself having to deal with them on a semi-frequent 
basis. Other people won't be so happy with this state of affairs, but 
they'll probably notice only after most of their code has been written 
unaware of the problem.


 Now, thanks to the effort people have spent in this group (thank you!), 
 I have an understanding of the grapheme issue. I guarantee that 
 grapheme-level iteration will have a high cost incurred to it: 
 efficiency and changes in std. The languages that need composing 
 characters for producing meaningful text are few and far between, so it 
 makes sense to confine support for them to libraries that are not the 
 default, unless we find ways to not disrupt everyone else.

We all are more aware of the problem now, that's a good thing. :-)


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 16 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 1/15/11 10:45 PM, Michel Fortin wrote:
 No doubt it's easier to implement it that way. The problem is that in
 most cases it won't be used. How many people really know what is a
 grapheme?

 How many people really should care?

 I think the only people who should *not* care are those who have
 validated that the input does not contain any combining code point. If
 you know the input *can't* contain combining code points, then it's safe
 to ignore them.

I agree. Now let me ask again: how many people really should care?

 If we don't make correct Unicode handling the default, someday someone
 is going to ask a developer to fix a problem where his system doesn't
 handle some text correctly. Later that day, he'll come to the
 realization that almost none of his D code and none of the D libraries
 he use handle unicode correctly, and he'll say: can't fix this. His peer
 working on a similar Objective-C program will have a good laugh.

 Sure, correct Unicode handling is slower and more complicated to
 implement, but at least you know you'll get the right results.

I love the increased precision, but again I'm not sure how many people 
ever manipulate text with combining characters. Meanwhile they'll 
complain that D is slower than other languages.

 Of those, how many will forget to use byGrapheme at one time
 or another? And so in most programs string manipulation will misbehave
 in the presence of combining characters or unnormalized strings.

 But most strings don't contain combining characters or unnormalized
 strings.

 I think we should expect combining marks to be used more and more as our
 OS text system and fonts start supporting them better. Them being rare
 might be true today, but what do you know about tomorrow?

I don't think languages will acquire more diacritics soon. I do hope, of 
course, that D applications gain more usage in the Arabic, Hebrew etc. 
world.

 A few years ago, many Unicode symbols didn't even show up correctly on
 Windows. Today, we have Unicode domain names and people start putting
 funny symbols in them (for instance: <http://◉.ws>). I haven't seen it
 yet, but we'll surely see combining characters in domain names soon
 enough (if only as a way to make fun of programs that can't handle
 Unicode correctly). Well, let me be the first to make fun of such
 programs: <http://☺̭̏.michelf.com/>.

Would you bet the language on that?

 Also, not all combining characters are marks meant to be used by some
 foreign languages. Some are used for mathematics for instance. Or you
 could use 20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay
 indicating some kind of prohibition.


 If you want to help D programmers write correct code when it comes to
 Unicode manipulation, you need to help them iterate on real characters
 (graphemes), and you need the algorithms to apply to real characters
 (graphemes), not the approximation of a Unicode character that is a code
 point.

 I don't think the situation is as clean cut, as grave, and as urgent
 as you say.

 I agree it's probably not as clean cut as I say (I'm trying to keep
 complicated things simple here), but it's something important to decide
 early because the cost of changing it increase as more code is written.

Agreed.

 Quoting the first part of the same post (out of order):

 Disagreement as that might be, a simple fact that needs to be taken
 into account is that as of right now all of Phobos uses UTF arrays for
 string representation and dchar as element type.

 Besides, for one I do dispute the idea that a grapheme element is
 better than a dchar element for iterating over a string. The grapheme
 has the attractiveness of being theoretically clean but at the same
 time is woefully inefficient and helps languages that few D users need
 to work with. At least that's my perception, and we need some serious
 numbers instead of convincing rhetoric to make a big decision.

 You'll no doubt get more performance from a grapheme-aware specialized
 algorithm working directly on code points than by iterating on graphemes
 returned as string slices. But both will give *correct* results.

 Implementing a specialized algorithm of this kind becomes an
 optimization, and it's likely you'll want an optimized version for most
 string algorithms.

 I'd like to have some numbers too about performance, but I have none at
 this time.

I spent a fair amount of time comparing ASCII vs. Unicode code speed. 
The fact of the matter is that the overhead is measurable and often 
high. Also it occurs at a very core level. For starters, the grapheme 
itself is larger and has one extra indirection. I am confident the 
marginal overhead for graphemes would be considerable.

 It's all a matter of picking one's trade-offs. Clearly ASCII is out as
 no serious amount of non-English text can be trafficked without
 diacritics. So switching to UTF makes a lot of sense, and that's what
 D did.

 When I introduced std.range and std.algorithm, they'd handle char[]
 and wchar[] no differently than any other array. A lot of algorithms
 simply did the wrong thing by default, so I attempted to fix that
 situation by defining byDchar(). So instead of passing some string str
 to an algorithm, one would pass byDchar(str).

 A couple of weeks went by in testing that state of affairs, and before
 late I figured that I need to insert byDchar() virtually _everywhere_.
 There were a couple of algorithms (e.g. Boyer-Moore) that happened to
 work with arrays for subtle reasons (needless to say, they won't work
 with graphemes at all). But by and large the situation was that the
 simple and intuitive code was wrong and that the correct code
 necessitated inserting byDchar().

 So my next decision, which understandably some of the people who
 didn't go through the experiment may find unintuitive, was to make
 byDchar() the default. This cleaned up a lot of crap in std itself and
 saved a lot of crap in the yet-unwritten client code.

 But were your algorithms *correct* in the first place? I'd argue that by
 making byDchar the default you've not saved yourself from any crap
 because dchar isn't the right layer of abstraction.

It was correct for all but a couple languages. Again: most of today's 
languages don't ever need combining characters.

 I think it's reasonable to understand why I'm happy with the current
 state of affairs. It is better than anything we've had before and
 better than everything else I've tried.

 It is indeed easy to understand why you're happy with the current state
 of affairs: you never had to deal with multi-code-point character and
 can't imagine yourself having to deal with them on a semi-frequent
 basis.

Do you, and can you?

 Other people won't be so happy with this state of affairs, but
 they'll probably notice only after most of their code has been written
 unaware of the problem.

They can't be unaware and write said code.

 Now, thanks to the effort people have spent in this group (thank
 you!), I have an understanding of the grapheme issue. I guarantee that
 grapheme-level iteration will have a high cost incurred to it:
 efficiency and changes in std. The languages that need composing
 characters for producing meaningful text are few and far between, so
 it makes sense to confine support for them to libraries that are not
 the default, unless we find ways to not disrupt everyone else.

 We all are more aware of the problem now, that's a good thing. :-)

All I wish is it's not blown out of proportion. It fares rather low on 
my list of library issues that D has right now.


Andrei

Jan 16 2011

Daniel Gibson <metalcaedes gmail.com> writes:

Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:
 On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 But most strings don't contain combining characters or unnormalized
 strings.

 I think we should expect combining marks to be used more and more as our
 OS text system and fonts start supporting them better. Them being rare
 might be true today, but what do you know about tomorrow?

 I don't think languages will acquire more diacritics soon. I do hope, of
 course, that D applications gain more usage in the Arabic, Hebrew etc.
 world.

So why does D use unicode anyway?
If you don't care about not-often used languages anyway, you could have 
used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide 
which encoding he wants/needs).

You could as well say "we don't need to use dchar to represent a proper 
code point, wchar is enough for most use cases and has less overhead 
anyway".


 I think it's reasonable to understand why I'm happy with the current
 state of affairs. It is better than anything we've had before and
 better than everything else I've tried.

 It is indeed easy to understand why you're happy with the current state
 of affairs: you never had to deal with multi-code-point character and
 can't imagine yourself having to deal with them on a semi-frequent
 basis.

 Do you, and can you?

 Other people won't be so happy with this state of affairs, but
 they'll probably notice only after most of their code has been written
 unaware of the problem.

 They can't be unaware and write said code.

Fun fact: Germany recently introduced a new ID card and some of the 
software that was developed for this and is used in some record sections 
fucks up when a name contains diacritics.

I think especially when you're handling names (and much software does, I 
think) it's crucial to have proper support for all kinds of chars.
Of course many programmers are not aware that just because umlauts and ß 
work, it doesn't mean that all other kinds of strange characters work as well.


Cheers,
- Daniel

Jan 16 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/16/11 6:42 PM, Daniel Gibson wrote:
 Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:
 On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 But most strings don't contain combining characters or unnormalized
 strings.

 I think we should expect combining marks to be used more and more as our
 OS text system and fonts start supporting them better. Them being rare
 might be true today, but what do you know about tomorrow?

 I don't think languages will acquire more diacritics soon. I do hope, of
 course, that D applications gain more usage in the Arabic, Hebrew etc.
 world.

 So why does D use unicode anyway?
 If you don't care about not-often used languages anyway, you could have
 used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide
 which encoding he wants/needs).

 You could as well say "we don't need to use dchar to represent a proper
 code point, wchar is enough for most use cases and has fewer overhead
 anyway".

I consider UTF8 superior to all of the above.

 I think it's reasonable to understand why I'm happy with the current
 state of affairs. It is better than anything we've had before and
 better than everything else I've tried.

 It is indeed easy to understand why you're happy with the current state
 of affairs: you never had to deal with multi-code-point character and
 can't imagine yourself having to deal with them on a semi-frequent
 basis.

 Do you, and can you?

 Other people won't be so happy with this state of affairs, but
 they'll probably notice only after most of their code has been written
 unaware of the problem.

 They can't be unaware and write said code.

 Fun fact: Germany recently introduced a new ID card and some of the
 software that was developed for this and is used in some record sections
 fucks up when a name contains diacritics.

 I think especially when you're handling names (and much software does, I
 think) it's crucial to have proper support for all kinds of chars.
 Of course many programmers are not aware that, if Umlaute and ß works it
 doesn't mean that all other kinds of strange characters work as well.


 Cheers,
 - Daniel

I think German text works well with dchar.


Andrei

Jan 16 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday 16 January 2011 18:45:26 Andrei Alexandrescu wrote:
 On 1/16/11 6:42 PM, Daniel Gibson wrote:
 Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:
 On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 But most strings don't contain combining characters or unnormalized
 strings.

 I think we should expect combining marks to be used more and more as
 our OS text system and fonts start supporting them better. Them being
 rare might be true today, but what do you know about tomorrow?

 I don't think languages will acquire more diacritics soon. I do hope, of
 course, that D applications gain more usage in the Arabic, Hebrew etc.
 world.

 So why does D use unicode anyway?
 If you don't care about not-often used languages anyway, you could have
 used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide
 which encoding he wants/needs).

 You could as well say "we don't need to use dchar to represent a proper
 code point, wchar is enough for most use cases and has fewer overhead
 anyway".

 I consider UTF8 superior to all of the above.

 I think it's reasonable to understand why I'm happy with the current
 state of affairs. It is better than anything we've had before and
 better than everything else I've tried.

 It is indeed easy to understand why you're happy with the current state
 of affairs: you never had to deal with multi-code-point character and
 can't imagine yourself having to deal with them on a semi-frequent
 basis.

 Do you, and can you?

 Other people won't be so happy with this state of affairs, but
 they'll probably notice only after most of their code has been written
 unaware of the problem.

 They can't be unaware and write said code.

 Fun fact: Germany recently introduced a new ID card and some of the
 software that was developed for this and is used in some record sections
 fucks up when a name contains diacritics.

 I think especially when you're handling names (and much software does, I
 think) it's crucial to have proper support for all kinds of chars.
 Of course many programmers are not aware that, if Umlaute and ß works it
 doesn't mean that all other kinds of strange characters work as well.


 Cheers,
 - Daniel

 I think German text works well with dchar.

I think that whether dchar will be enough will depend primarily on where the 
unicode is coming from and what the programmer is doing with it. There's plenty 
which will just work regardless of whether code points are pre-combined or not, 
and there's other stuff which will have subtle bugs if they're not pre-combined.

For the most part, Western languages should have pre-combined characters, but 
whether a program sees them in combined form or not will depend on where the 
text comes from. If it comes from a file, then it all depends on the program 
which wrote the file. If it comes from the console, then it depends on what that 
console does. If it comes from a socket or pipe or whatnot, then it depends on 
whatever program is sending the data.

So, the question becomes what the norm is? Are unicode characters normally 
pre-combined or left as separate code points? The majority of English text will 
be fine regardless, since English only uses accented characters and the like 
when including foreign words, but most any other European language will have 
accented characters and then it's an open question. If it's more likely that a D 
program will receive pre-combined characters than not, then many programs will 
likely be safe treating a code point as a character. But if the odds are high 
that a D program will receive characters which are not yet combined, then 
certain sets of text will invariably result in bugs in your average D program.
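
The kind of bug described here is easy to reproduce: the same visible text compares unequal when one side is pre-combined and the other is not. The sketch below is only an illustration, assuming a Phobos release that provides std.uni.normalize; the helper name sameText is hypothetical.

```d
import std.uni : normalize, NFC;

/// True when a and b are the same text once both are brought to
/// Normalization Form C (composed form). sameText is a made-up helper name.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}
```

With a plain == comparison, "\u00E9" (pre-combined é) and "e\u0301" (e plus combining acute accent) are different strings; sameText treats them as equal. Normalizing once at the input boundary (file, console, socket) is one way to keep the rest of a program working on dchar.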

I don't think that there's much question that from a performance standpoint and 
from the standpoint of trying to avoid breaking TDPL and a lot of pre-existing 
code, we should continue to treat a code point - a dchar - as an abstract 
character. Moving to graphemes could really harm performance - and there _are_ 
plenty of programs that couldn't care less about unicode. However, it's quite 
clear that in a number of circumstances, that's going to result in buggy code. 
The question then is whether it's okay to take a performance hit just to 
correctly handle unicode. And I expect that a _lot_ of people are going to say 
no to that.

D already does better at handling unicode than many other languages, so it's 
definitely a step up as it is. The cost for handling unicode completely 
correctly is quite high from the sounds of it - all of a sudden you're 
effectively (if not literally) dealing with arrays of arrays instead of arrays. 
So, I think that it's a viable option to say that the default path that D will 
take is the _mostly_ correct but still reasonably efficient path, and then - 
through 3rd party libraries or possibly even with a module in Phobos - we'll 
provide a means to handle unicode 100% correctly for those who really care.

At minimum, we need the tools to handle unicode correctly, but if we can't 
handle it both correctly and efficiently, then I'm afraid that it's just not 
going to be reasonable to handle it correctly - especially if we can handle it 
_almost_ correctly and still be efficient.

Regardless, the real question is how likely a D program is to deal with unicode 
which is not pre-combined. If the odds are relatively low in the general case, 
then sticking to dchar should be fine. But if the odds are relatively high, then 
not going to graphemes could mean that there will be a _lot_ of buggy D programs 
out there.

- Jonathan M Davis

Jan 16 2011

Daniel Gibson <metalcaedes gmail.com> writes:

Am 17.01.2011 03:45, schrieb Andrei Alexandrescu:
 On 1/16/11 6:42 PM, Daniel Gibson wrote:
 Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:
 On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 But most strings don't contain combining characters or unnormalized
 strings.

 I think we should expect combining marks to be used more and more as our
 OS text system and fonts start supporting them better. Them being rare
 might be true today, but what do you know about tomorrow?

 I don't think languages will acquire more diacritics soon. I do hope, of
 course, that D applications gain more usage in the Arabic, Hebrew etc.
 world.

 So why does D use unicode anyway?
 If you don't care about not-often used languages anyway, you could have
 used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide
 which encoding he wants/needs).

 You could as well say "we don't need to use dchar to represent a proper
 code point, wchar is enough for most use cases and has fewer overhead
 anyway".

 I consider UTF8 superior to all of the above.

Really? UTF32 - maybe. But IMHO even when not considering graphemes and such 
UTF8 sucks hard in comparison to those because one code point consists of 1-4 
code units (even in German 1-2 code units).
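
As a small illustration of the code unit counts mentioned above, here is a sketch using std.utf.codeLength; the wrapper names utf8Units and utf16Units are made up for this example.

```d
import std.utf : codeLength;

/// Number of UTF-8 code units (char) needed to encode one code point: 1 to 4.
size_t utf8Units(dchar c)
{
    return codeLength!char(c);
}

/// Number of UTF-16 code units (wchar) needed to encode one code point: 1 or 2.
size_t utf16Units(dchar c)
{
    return codeLength!wchar(c);
}
```

For instance, ü ('\u00FC') takes 2 code units in UTF-8 but 1 in UTF-16, while a code point outside the BMP such as '\U0001F600' takes 4 and 2 respectively.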

 I think it's reasonable to understand why I'm happy with the current
 state of affairs. It is better than anything we've had before and
 better than everything else I've tried.

 It is indeed easy to understand why you're happy with the current state
 of affairs: you never had to deal with multi-code-point character and
 can't imagine yourself having to deal with them on a semi-frequent
 basis.

 Do you, and can you?

 Other people won't be so happy with this state of affairs, but
 they'll probably notice only after most of their code has been written
 unaware of the problem.

 They can't be unaware and write said code.

 Fun fact: Germany recently introduced a new ID card and some of the
 software that was developed for this and is used in some record sections
 fucks up when a name contains diacritics.

 I think especially when you're handling names (and much software does, I
 think) it's crucial to have proper support for all kinds of chars.
 Of course many programmers are not aware that, if Umlaute and ß works it
 doesn't mean that all other kinds of strange characters work as well.


 Cheers,
 - Daniel

 I think German text works well with dchar.

Yes, but even in Germany there are people whose names contain "strange" 
characters ;)
Is it common to have programs that deal with text in a specific language but 
not with names?


I do understand your resistance to support Unicode properly - it's a lot of 
trouble and makes things inefficient (more inefficient than UTF8/16 already are 
because of that code point != code unit thing).
Another thing is that due to bad support from fonts or console/GUI technology 
it may happen (quite often) that one grapheme is *not* displayed as a single 
character, thus messing up formatting anyway (still, you probably shouldn't cut 
a string within a grapheme).

So here's what I think can be done (and, at least the first two points, 
especially the first, should be done):

1. Mention the Grapheme and Digraph situation in string related documentation 
(std.string and maybe string-related stuff in std.algorithm like Splitter) to 
make sure people who use Phobos are aware of the problem. Then at least they 
can't say that nobody told them when their Objective-C using colleagues are 
laughing at their broken unicode-support ;)

2. Maybe add some functions that *do* deal with this.
Like "bool isPartOfGrapheme(dchar c)" or "bool isDigraph(dchar c)" so people 
can check themselves, if they just split their string within a grapheme or 
something.

3. Include a proper Unicode-string type/module, if somebody has the time and 
knowledge to develop one. spir already started something like that AFAIK and 
Steven Schveighoffer also is even working on a complete string type - maybe 
these efforts could be combined?
I guess default strings will stay mostly the way they are (but please add an 
ASCII type or allow ubyte[] asciiStr = "asdf";).
Having an additional type in Phobos that works correctly in all cases (e.g. 
Arabic, Hebrew, Japanese, ..) would be really great, though.

   UniString uStr = new UniString("sdfüñẫ");
   UniString uStr2 = uStr[3..$]; // "üñẫ"
   UniGraph ug = uStr[5]; // 'ẫ'
   size_t i = uStr2.length; // 3
something like that maybe (of course plus a lot of other stuff like proper 
comparison for different encodings of the same char like a modified icmp() 
discussed before).
But something like
   size_t len = uniLen("sdfüñẫ"); // 6
   string s = uniSlice(str, 3, str.length); // == str.uniSlice(3, str.length);
etc may be just as good.
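
The free-function variant could be sketched roughly as follows. This is only an illustration under assumptions: it relies on std.uni.byGrapheme and std.uni.graphemeStride from a later Phobos release, uniLen and uniSlice are the hypothetical names from this post, and graphemeOffset is an extra made-up helper. Indices count graphemes, so every call walks the string (O(n)).

```d
import std.range : walkLength;
import std.uni : byGrapheme, graphemeStride;

/// Length of s in graphemes (hypothetical name taken from the post).
size_t uniLen(string s)
{
    return s.byGrapheme.walkLength;
}

/// Code unit offset at which grapheme number n starts,
/// or s.length when s has fewer than n graphemes.
size_t graphemeOffset(string s, size_t n)
{
    size_t pos = 0;
    while (n > 0 && pos < s.length)
    {
        pos += graphemeStride(s, pos); // code units in the grapheme at pos
        --n;
    }
    return pos;
}

/// Slice of s covering graphemes [from, to); indices past the end are clamped.
string uniSlice(string s, size_t from, size_t to)
{
    assert(from <= to, "invalid grapheme range");
    immutable start = graphemeOffset(s, from);
    immutable end = start + graphemeOffset(s[start .. $], to - from);
    return s[start .. end];
}
```

With s = "sdf\u00FC\u00F1\u1EAB" (the "sdfüñẫ" example), uniLen(s) is 6 and uniSlice(s, 3, 6) is "üñẫ"; the same holds for the decomposed spelling, where each accented letter is several code points. Because the slice is an ordinary string, it still works with the existing dchar-based algorithms.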

(I hope this all made sense)

 Andrei

Cheers,
- Daniel

Jan 16 2011

Daniel Gibson <metalcaedes gmail.com> writes:

Am 17.01.2011 04:38, schrieb Daniel Gibson:
 Am 17.01.2011 03:45, schrieb Andrei Alexandrescu:
 On 1/16/11 6:42 PM, Daniel Gibson wrote:
 Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:
 On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 But most strings don't contain combining characters or unnormalized
 strings.

 I think we should expect combining marks to be used more and more as our
 OS text system and fonts start supporting them better. Them being rare
 might be true today, but what do you know about tomorrow?

 I don't think languages will acquire more diacritics soon. I do hope, of
 course, that D applications gain more usage in the Arabic, Hebrew etc.
 world.

 So why does D use unicode anyway?
 If you don't care about not-often used languages anyway, you could have
 used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide
 which encoding he wants/needs).

 You could as well say "we don't need to use dchar to represent a proper
 code point, wchar is enough for most use cases and has less overhead
 anyway".

 I consider UTF8 superior to all of the above.

 Really? UTF32 - maybe. But IMHO even when not considering graphemes and such
 UTF8 sucks hard in comparison to those because one code point consists of 1-4
 code units (even in German 1-2 code units).

 I think it's reasonable to understand why I'm happy with the current
 state of affairs. It is better than anything we've had before and
 better than everything else I've tried.

 It is indeed easy to understand why you're happy with the current state
 of affairs: you never had to deal with multi-code-point character and
 can't imagine yourself having to deal with them on a semi-frequent
 basis.

 Do you, and can you?

 Other people won't be so happy with this state of affairs, but
 they'll probably notice only after most of their code has been written
 unaware of the problem.

 They can't be unaware and write said code.

 Fun fact: Germany recently introduced a new ID card and some of the
 software that was developed for this and is used in some record sections
 fucks up when a name contains diacritics.

 I think especially when you're handling names (and much software does, I
 think) it's crucial to have proper support for all kinds of chars.
 Of course many programmers are not aware that, if Umlaute and ß works it
 doesn't mean that all other kinds of strange characters work as well.


 Cheers,
 - Daniel

 I think German text works well with dchar.

 Yes, but even in Germany there are people whose names contain "strange"
 characters ;)
 Is it common to have programs that deal with text in a specific language but
 not with names?


 I do understand your resistance to support Unicode properly - it's a lot of
 trouble and makes things inefficient (more inefficient than UTF8/16 already are
 because of that code point != code unit thing).
 Another thing is that due to bad support from fonts or console/GUI technology
 it may happen (quite often) that one grapheme is *not* displayed as a single
 character, thus messing up formatting anyway (Still you probably should cut a
 string within a grapheme).

I meant you should *not* cut a string within a grapheme.

 So here's what I think can be done (and, at least the first two points,
 especially the first, should be done):

 1. Mention the Grapheme and Digraph situation in string related documentation
 (std.string and maybe string-related stuff in std.algorithm like Splitter) to
 make sure people who use Phobos are aware of the problem. Then at least they
 can't say that nobody told them when their Objective-C using colleagues are
 laughing at their broken unicode-support ;)

 2. Maybe add some functions that *do* deal with this.
 Like "bool isPartOfGrapheme(dchar c)" or "bool isDigraph(dchar c)" so people
 can check themselves, if they just split their string within a grapheme or
 something.

 3. Include a proper Unicode-string type/module, if somebody has the time and
 knowledge to develop one. spir already started something like that AFAIK and
 Steven Schveighoffer also is even working on a complete string type - maybe
 these efforts could be combined?
 I guess default strings will stay mostly the way they are (but please add an
 ASCII type or allow ubyte[] asciiStr = "asdf";).
 Having an additional type in Phobos that works correctly in all cases (e.g.
 Arabic, Hebrew, Japanese, ..) would be really great, though.

 UniString uStr = new UniString("sdfüñẫ");
 UniString uStr2 = uStr[3..$]; // "üñẫ"
 UniGraph ug = uStr[5]; // 'ẫ'
 size_t i = uStr2.length; // 3

of course I forgot:
   string s = uStr2.toString();
   dstring s2 = uStr2.toDString();
to convert it back to a "normal" string

 something like that maybe (of course plus a lot of other stuff like proper
 comparison for different encodings of the same char like a modified icmp()
 discussed before).
 But something like
 size_t len = uniLen("sdfüñẫ"); // 6
 string s = uniSlice(str, 3, str.length); // == str.uniSlice(3, str.length);
 etc may be just as good.

 (I hope this all made sense)

 Andrei

 Cheers,
 - Daniel

Jan 16 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-16 18:58:54 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 
 On 1/15/11 10:45 PM, Michel Fortin wrote:
 No doubt it's easier to implement it that way. The problem is that in
 most cases it won't be used. How many people really know what is a
 grapheme?

 
 How many people really should care?

 
 I think the only people who should *not* care are those who have
 validated that the input does not contain any combining code point. If
 you know the input *can't* contain combining code points, then it's safe
 to ignore them.

 
 I agree. Now let me ask again: how many people really should care?

As I said: all those people who are not validating the inputs to make 
sure they don't contain combining code points. As far as I know, no one 
is doing that, so that means everybody should use algorithms capable of 
handling multi-code-point graphemes. If someone indeed is doing this 
validation, he'll probably also be smart enough to make his algorithms 
to work with dchars.

That said, no one should really have to care but those who implement 
the string manipulation functions. The idea behind making the grapheme 
the element type is to make it easier to write grapheme-aware string 
manipulation functions, even if you don't know about graphemes. But the 
reality is probably more mixed than that.

 - - -

I gave some thought to all this, and came to an interesting 
realization that made me refine the proposal. The new proposal is 
disruptive perhaps as much as the first, but in a different way.

But first, let's state a few facts to reframe the current discussion:

Fact 1: most people don't know Unicode very well
Fact 2: most people are confused by code units, code points, graphemes, 
and what is a 'character'
Fact 3: most people won't bother with all this, they'll just use the 
basic language facilities and assume everything works correctly if it 
works correctly for them

Now, let's define two goals:

Goal 1: make most people's string operations work correctly
Goal 2: make most people's string operations work fast

To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not 
sure we agree on this, but let's continue.

From the above 3 facts, we can deduce that a user won't want to bother 
using byDchar, byGrapheme, or byWhatever when using algorithms. You 
were annoyed by having to write byDchar everywhere, so you changed the 
element type to always be dchar and you don't have to write byDchar 
anymore. That's understandable and perfectly reasonable.

The problem is of course that it doesn't give you correct results. Most 
of the time what you really want is to use graphemes; dchar just happens 
to be a good approximation of that which works most of the time.

Iterating by grapheme is somewhat problematic, and it degrades 
performance. Same for comparing graphemes for normalized equivalence. 
That's all true. I'm not too sure what we can do about that. It can be 
optimized, but it's very understandable that some people won't be 
satisfied by the performance and will want to avoid graphemes.

Speaking of optimization, I do understand that iterating by grapheme 
using the range interface won't give you the best performance. It's 
certainly convenient as it enables the reuse of existing algorithms 
with graphemes, but more specialized algorithms and interfaces might be 
more suited.

One observation I made with having dchar as the default element type is 
that not all algorithms really need to deal with dchar. If I'm 
searching for code point 'a' in a UTF-8 string, decoding code units 
into code points is a waste of time. Why? Because the only way to 
represent code point 'a' is by having code unit 'a'. And guess what? 
Almost the same optimization can apply to graphemes: if you're 
searching for 'a' in a grapheme-aware manner in a UTF-8 string, all you 
have to do is search for the UTF-8 code unit 'a', then check if the 'a' 
code unit is followed by a combining mark code point to confirm it is 
really an 'a', not a composed grapheme. Iterating the string by code 
unit is enough for these cases, and it'd increase performance by a lot.
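
A rough sketch of that search, assuming a Phobos release that ships std.uni.graphemeStride. findAsciiGrapheme is a hypothetical name; the function scans code units and only looks at what follows a hit:

```d
import std.uni : graphemeStride;

/// Returns the code-unit index of the first occurrence of the ASCII
/// character `needle` that is a whole grapheme by itself, or -1.
/// (Hypothetical helper, not part of Phobos.)
ptrdiff_t findAsciiGrapheme(string haystack, char needle)
{
    assert(needle < 0x80, "only ASCII needles are handled here");
    foreach (i, char c; haystack) // iterates code units, no decoding
    {
        if (c != needle)
            continue;
        // Decode only after a hit: if the grapheme starting here is one
        // code unit long, no combining mark follows the needle.
        if (graphemeStride(haystack, i) == 1)
            return cast(ptrdiff_t) i;
    }
    return -1;
}

void main()
{
    // 'a' + combining acute accent, then "bc", then a bare 'a'
    string s = "a\u0301bca";
    assert(findAsciiGrapheme(s, 'a') == 5); // skips the accented one
    assert(findAsciiGrapheme(s, 'z') == -1);
}
```

The sketch relies on an ASCII code unit never appearing inside a multi-byte UTF-8 sequence, which is why the scan itself needs no decoding.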

So making dchar the default type is no doubt convenient because it 
abstracts things enough so that generic algorithms can work with 
strings, but it has a performance penalty that you don't always need. I 
made an example using UTF-8, it applies even more to UTF-16. And it 
applies to grapheme-aware manipulations too.

This penalty with generic algorithms comes from the fact that they take 
a predicate of the form "a == 'a'" or "a == b", which is ill-suited for 
strings because you always need to fully decode the string (by dchar or 
by graphemes) for the purpose of calling the predicate. Given that 
comparing characters for something else than equality or them being 
part of a set is very rarely something you do, generic algorithms miss 
a big optimization opportunity here.

 - - -

So here's what I think we should do:

Todo 1: disallow generic algorithms on naked strings: string-specific 
Unicode-aware algorithms should be used instead; they can share the 
same name if their usage is similar

Todo 2: to use a generic algorithm with a string, you must dress the 
string using one of toDchar, toGrapheme, toCodeUnits; this way your 
intentions are clear

Todo 3: string-specific algorithms can be implemented as simple wrappers 
for generic algorithms with the string dressed correctly for the task, 
or they can implement more sophisticated algorithms to increase 
performance
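
The "dressing" in Todo 2 can be sketched with adapters that later Phobos releases provide under different names: std.utf.byCodeUnit, std.utf.byDchar and std.uni.byGrapheme stand in here for the toCodeUnits, toDchar and toGrapheme proposed above. The three counting helpers are hypothetical names used only for illustration:

```d
import std.algorithm.searching : count;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

/// Length of s seen as a range of UTF-8 code units.
size_t codeUnits(string s)  { return s.byCodeUnit.walkLength; }

/// Length of s seen as a range of code points.
size_t codePoints(string s) { return s.byDchar.walkLength; }

/// Length of s seen as a range of graphemes.
size_t graphemes(string s)  { return s.byGrapheme.walkLength; }

void main()
{
    // 'e' + combining acute accent, then 't'
    string s = "e\u0301t";
    assert(codeUnits(s) == 4);
    assert(codePoints(s) == 3);
    assert(graphemes(s) == 2);

    // A generic algorithm applied to an explicitly dressed string:
    // count the combining acute accents at the code point level.
    assert(s.byDchar.count('\u0301') == 1);
}
```

The same string yields three different lengths, so naming the view at the call site makes the intended unit explicit.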

There are two major benefits to this approach:

Benefit 1: if indeed you really don't want the performance penalty that 
comes with checking for composed graphemes, you can bypass it at some 
specific places in your code using byDchar, or you can disable it 
altogether by modifying the string-specific algorithms and recompiling 
Phobos.

Benefit 2: we don't have to rush to implement graphemes in the 
Unicode-aware algorithms. Just make sure the interface for 
string-specific algorithms *can* accept graphemes, and we can roll out 
support for them at a later time once we have a decent implementation.

Also, all this is leaving the question open as to what to do when 
someone uses the string as a range. In my opinion, it should either 
iterate on code units (because the string is actually an array, and 
because that's what foreach does) or simply disallow iteration (asking 
that you dress the string first using toCodeUnit, toDchar, or 
toGrapheme).

Do you like that more?


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 17 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/17/11 10:34 AM, Michel Fortin wrote:
 On 2011-01-16 18:58:54 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 1/15/11 10:45 PM, Michel Fortin wrote:
 No doubt it's easier to implement it that way. The problem is that in
 most cases it won't be used. How many people really know what is a
 grapheme?

 How many people really should care?

 I think the only people who should *not* care are those who have
 validated that the input does not contain any combining code point. If
 you know the input *can't* contain combining code points, then it's safe
 to ignore them.

 I agree. Now let me ask again: how many people really should care?

 As I said: all those people who are not validating the inputs to make
 sure they don't contain combining code points.

The question (which I see you keep on dodging :o)) is how much text 
contains combining code points.

I have worked in NLP for years, and still do. I even worked on Arabic 
text (albeit Romanized). I work with Wikipedia. I use Unicode all the 
time, but I have yet to have trouble with a combining character. I was 
just vaguely aware of their existence up until this discussion, but just 
waved it away and guess what - it worked for me.

It does not serve us well to rigidly claim that the only good way of 
doing anything Unicode is to care about graphemes. Even NSString exposes 
the UTF16 underlying encoding and provides dedicated functions for 
grapheme-based processing. For one thing, if you care about the width of 
a word in printed text (one of the cases where graphemes are important), 
you need font information. And - surprise! - some fonts do NOT support 
combining characters and print signs next to one another instead of 
superimposing them, so the "wrong" method of counting characters is more 
informative.

 As far as I know, no one
 is doing that, so that means everybody should use algorithms capable of
 handling multi-code-point graphemes. If someone indeed is doing this
 validation, he'll probably also be smart enough to make his algorithms
 to work with dchars.

I am not sure everybody should use graphemes.

 That said, no one should really have to care but those who implement the
 string manipulation functions. The idea behind making the grapheme the
 element type is to make it easier to write grapheme-aware string
 manipulation functions, even if you don't know about graphemes. But the
 reality is probably more mixed than that.

The reality is indeed more mixed. Inevitably at some point the API needs 
to answer the question: "what is the first character of this string?" 
Transparency is not possible. You break all string code out there.

 - - -

 I gave some thought to all this, and came to an interesting
 realization that made me refine the proposal. The new proposal is
 disruptive perhaps as much as the first, but in a different way.

 But first, let's state a few facts to reframe the current discussion:

 Fact 1: most people don't know Unicode very well
 Fact 2: most people are confused by code units, code points, graphemes,
 and what is a 'character'
 Fact 3: most people won't bother with all this, they'll just use the
 basic language facilities and assume everything works correctly if it
 works correctly for them

Nice :o).

 Now, let's define two goals:

 Goal 1: make most people's string operations work correctly
 Goal 2: make most people's string operations work fast

Goal 3: don't break all existing code
Goal 4: make most people's string-based code easy to write and understand

 To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not
 sure we agree on this, but let's continue.

I think we disagree about what "most" means. For you it means "people 
who don't understand Unicode well but deal with combining characters 
anyway". For me it's "the largest percentage of D users across various 
writing systems".

  From the above 3 facts, we can deduce that a user won't want to bother
 using byDchar, byGrapheme, or byWhatever when using algorithms. You
 were annoyed by having to write byDchar everywhere, so you changed the
 element type to always be dchar and you don't have to write byDchar
 anymore. That's understandable and perfectly reasonable.

 The problem is of course that it doesn't give you correct results. Most
 of the time what you really want is to use graphemes; dchar just happens
 to be a good approximation of that which works most of the time.

Again, it's a matter of tradeoffs. I chose dchar because char was plain 
_wrong_ most of the time, not because char was a pretty darn good 
approximation that worked for most people most of the time. The fact 
remains that dchar _is_ a pretty darn good approximation that also has 
pretty darn good speed. So I'd say that I _still_ want to use dchar most 
of the time.

Committing to graphemes would complicate APIs for _everyone_ and would 
make things slower for _everyone_ for the sake of combining characters 
that _never_ occur in _most_ people's text. This is bad design, pure and 
simple. A good design is to cater for the majority and provide dedicated 
APIs for the few.

 Iterating by grapheme is somewhat problematic, and it degrades
 performance.

Yes.

 Same for comparing graphemes for normalized equivalence.

Yes, although I think you can optimize code such that comparing two 
strings wholesale only has a few more comparisons on the critical path. 
That would be still slower, but not as slow as iterating by grapheme in 
a naive implementation.
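
One possible reading of "a few more comparisons on the critical path", as a rough sketch only: compare code units first and pay for normalization only when they differ. canonicallyEqual is a hypothetical name, and the sketch assumes a Phobos release that ships std.uni.normalize:

```d
import std.uni : NFC, normalize;

/// True if a and b are canonically equivalent Unicode strings.
/// (Hypothetical helper, not part of Phobos.)
bool canonicallyEqual(string a, string b)
{
    // Fast path: identical code units need no Unicode processing at all.
    if (a == b)
        return true;
    // Slow path: bring both sides to the same normal form, then compare.
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    // Precomposed e-acute versus 'e' + combining acute accent
    assert(canonicallyEqual("\u00E9", "e\u0301"));
    assert(canonicallyEqual("abc", "abc"));
    assert(!canonicallyEqual("abc", "abd"));
}
```

A tighter version would normalize only around the first mismatch instead of both whole strings; the sketch keeps the slow path simple on purpose.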

 That's all true. I'm not too sure what we can do about that. It can be
 optimized, but it's very understandable that some people won't be
 satisfied by the performance and will want to avoid graphemes.

I agree.

 Speaking of optimization, I do understand that iterating by grapheme
 using the range interface won't give you the best performance. It's
 certainly convenient as it enables the reuse of existing algorithms with
 graphemes, but more specialized algorithms and interfaces might be more
 suited.

Even the specialized algorithms will be significantly slower.

 One observation I made with having dchar as the default element type is
 that not all algorithms really need to deal with dchar. If I'm searching
 for code point 'a' in a UTF-8 string, decoding code units into code
 points is a waste of time. Why? Because the only way to represent code
 point 'a' is by having code unit 'a'.

Right. That's why many algorithms in std are specialized for such cases.

 And guess what? Almost the same
 optimization can apply to graphemes: if you're searching for 'a' in a
 grapheme-aware manner in a UTF-8 string, all you have to do is search
 for the UTF-8 code unit 'a', then check if the 'a' code unit is followed
 by a combining mark code point to confirm it is really an 'a', not a
 composed grapheme. Iterating the string by code unit is enough for these
 cases, and it'd increase performance by a lot.

Unfortunately it all breaks as soon as you go beyond one code point. You 
can't search efficiently, you can't compare efficiently. Boyer-Moore and 
friends are out.

I'm not saying that we shouldn't implement the correct operations! I'm 
just not convinced they should be the default.

 So making dchar the default type is no doubt convenient because it
 abstracts things enough so that generic algorithms can work with
 strings, but it has a performance penalty that you don't always need. I
 made an example using UTF-8, it applies even more to UTF-16. And it
 applies to grapheme-aware manipulations too.

It is true that UTF manipulation incurs overhead. The tradeoff has many 
dimensions: UTF-16 is bulkier and less cache friendly, ASCII is not 
sufficient for most people, the UTF decoding overhead is not that 
high... it's difficult to find the sweetest spot.

 This penalty with generic algorithms comes from the fact that they take
 a predicate of the form "a == 'a'" or "a == b", which is ill-suited for
 strings because you always need to fully decode the string (by dchar or
 by graphemes) for the purpose of calling the predicate. Given that
 comparing characters for something else than equality or them being part
 of a set is very rarely something you do, generic algorithms miss a big
 optimization opportunity here.

How can we improve that? You can't argue for an inefficient scheme just 
because what we have isn't as efficient as it could possibly be.

 - - -

 So here's what I think we should do:

 Todo 1: disallow generic algorithms on naked strings: string-specific
 Unicode-aware algorithms should be used instead; they can share the same
 name if their usage is similar

I don't understand this. We already do this, and by "Unicode-aware" we 
understand using dchar throughout. This is transparent to client code.

 Todo 2: to use a generic algorithm with a string, you must dress the
 string using one of toDchar, toGrapheme, toCodeUnits; this way your
 intentions are clear

Breaks a lot of existing code. Won't fly with Walter unless it solves 
world hunger. Nevertheless I [...] change about the built-in strings is 
that they implicitly are two things at the same time. Asking for 
representation should be explicit.

 Todo 3: string-specific algorithms can be implemented as simple wrappers
 for generic algorithms with the string dressed correctly for the task,
 or they can implement more sophisticated algorithms to increase performance

One thing I like about the current scheme is that all 
bidirectional-range algorithms work out of the box with all strings, and 
lend themselves to optimization whenever you want to.

This will have trouble passing Walter's wanking test. Mine too; every 
time I need to write a bunch of forwarding functions, that's a signal 
something went wrong somewhere. Remember MFC? :o)

 There are two major benefits to this approach:

 Benefit 1: if indeed you really don't want the performance penalty that
 comes with checking for composed graphemes, you can bypass it at some
 specific places in your code using byDchar, or you can disable it
 altogether by modifying the string-specific algorithms and recompiling
 Phobos.

 Benefit 2: we don't have to rush to implement graphemes in the
 Unicode-aware algorithms. Just make sure the interface for
 string-specific algorithms *can* accept graphemes, and we can roll out
 support for them at a later time once we have a decent implementation.

I'm not seeing the drawbacks. Hurts everyone for the sake of a few, 
breaks existing code, makes all string processing a mess, would-be users 
will throw their hands in the air seeing the simplest examples, but 
we'll have the satisfaction of high-five-ing one another telling 
ourselves that we did the right thing.

 Also, all this is leaving the question open as to what to do when
 someone uses the string as a range. In my opinion, it should either
 iterate on code units (because the string is actually an array, and
 because that's what foreach does) or simply disallow iteration (asking
 that you dress the string first using toCodeUnit, toDchar, or toGrapheme).

 Do you like that more?

This is not about liking. I like doing the right thing as much as you 
do, and I think Phobos shows that. Clearly doing the right thing through 
and through is handling combining characters appropriately. The problem 
is keeping all desiderata in careful balance.


Andrei

Jan 17 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-17 12:33:04 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/17/11 10:34 AM, Michel Fortin wrote:
 As I said: all those people who are not validating the inputs to make
 sure they don't contain combining code points.

 
 The question (which I see you keep on dodging :o)) is how much text 
 contains combining code points.

Not much, right now.

The problem is that the answer to this question is likely to change as 
Unicode support improves in operating systems and applications. 
Shouldn't we future-proof Phobos?


 It does not serve us well to rigidly claim that the only good way of 
 doing anything Unicode is to care about graphemes.

For the time being we can probably afford it.


 Even NSString exposes the UTF16 underlying encoding and provides 
 dedicated functions for grapheme-based processing. For one thing, if 
 you care about the width of a word in printed text (one of the cases 
 where graphemes are important), you need font information. And - 
 surprise! - some fonts do NOT support combining characters and print 
 signs next to one another instead of superimposing them, so the "wrong" 
 method of counting characters is more informative.

Generally what OS X does in those cases is display that character in 
another font. That said, counting graphemes is never a good way to tell 
how much space some text will take (unless the application enforces a 
fixed width per grapheme). It's more useful for telling the number of 
characters in a text document, similar to a word count.


 That said, no one should really have to care but those who implement the
 string manipulation functions. The idea behind making the grapheme the
 element type is to make it easier to write grapheme-aware string
 manipulation functions, even if you don't know about graphemes. But the
 reality is probably more mixed than that.

 
 The reality is indeed more mixed. Inevitably at some point the API 
 needs to answer the question: "what is the first character of this 
 string?" Transparency is not possible. You break all string code out 
 there.

I'm not sure what you mean by that.


 - - -
 
 I gave some thought to all this, and came to an interesting
 realization that made me refine the proposal. The new proposal is
 disruptive perhaps as much as the first, but in a different way.
 
 But first, let's state a few facts to reframe the current discussion:
 
 Fact 1: most people don't know Unicode very well
 Fact 2: most people are confused by code units, code points, graphemes,
 and what is a 'character'
 Fact 3: most people won't bother with all this, they'll just use the
 basic language facilities and assume everything works correctly if it
 works correctly for them

 
 Nice :o).
 
 Now, let's define two goals:
 
 Goal 1: make most people's string operations work correctly
 Goal 2: make most people's string operations work fast

 
 Goal 3: don't break all existing code
 Goal 4: make most people's string-based code easy to write and understand

Those are worthy goals too.


 To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not
 sure we agree on this, but let's continue.

 
 I think we disagree about what "most" means. For you it means "people 
 who don't understand Unicode well but deal with combining characters 
 anyway". For me it's "the largest percentage of D users across various 
 writing systems".

It's not just D users, it's also for the users of programs written by D 
users. I can't count how many times I've seen accented characters 
mishandled on websites and elsewhere, and I probably have an aversion 
to doing the same thing to people of other cultures and languages.

If the operating system supports combining marks, users have an 
expectation that applications running on it will deal with them 
correctly too, and they'll (rightfully) blame your application if it 
doesn't work. Same for websites.

I understand that in some situations you don't want to deal with 
graphemes even if you theoretically should, but I don't think it should 
be the default.


 One observation I made with having dchar as the default element type is
 that not all algorithms really need to deal with dchar. If I'm searching
 for code point 'a' in a UTF-8 string, decoding code units into code
 points is a waste of time. Why? Because the only way to represent code
 point 'a' is by having code unit 'a'.

 
 Right. That's why many algorithms in std are specialized for such cases.
 
 And guess what? Almost the same
 optimization can apply to graphemes: if you're searching for 'a' in a
 grapheme-aware manner in a UTF-8 string, all you have to do is search
 for the UTF-8 code unit 'a', then check if the 'a' code unit is followed
 by a combining mark code point to confirm it is really an 'a', not a
 composed grapheme. Iterating the string by code unit is enough for these
 cases, and it'd increase performance by a lot.

 
 Unfortunately it all breaks as soon as you go beyond one code point. 
 You can't search efficiently, you can't compare efficiently. 
 Boyer-Moore and friends are out.

Ok. Say you were searching for the needle "étoilé" in a UTF-8 
haystack, I see two ways to extend the optimization described above:

1. search for the easy part "toil", then check its surrounding 
graphemes to confirm it's really "étoilé"
2. search for a code point matching 'é' or 'e', then confirm that the 
code points following it form the right graphemes.

Implementing the second one can be done by converting the needle to a 
regular expression operating at code-unit level. With that you can 
search efficiently for the needle directly in code units without having 
to decode and/or normalize the whole haystack.
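
The first option can be sketched as follows, assuming a Phobos release that ships std.uni.graphemeStride, std.uni.combiningClass and std.uni.normalize. findEtoile and prevGraphemeStart are hypothetical names, and stepping backwards over code points with a non-zero combining class is only an approximation of real grapheme segmentation:

```d
import std.string : indexOf;
import std.uni : NFC, combiningClass, graphemeStride, normalize;
import std.utf : decode, strideBack;

/// Code-unit offset where the grapheme ending just before `end` begins.
/// Approximation: steps back over code points that are combining marks.
size_t prevGraphemeStart(string s, size_t end)
{
    size_t pos = end;
    while (pos > 0)
    {
        pos -= strideBack(s, pos);   // move to the previous code point
        size_t tmp = pos;
        immutable dchar c = decode(s, tmp);
        if (combiningClass(c) == 0)  // reached a base character
            break;
    }
    return pos;
}

/// Finds "étoilé" in any canonical form; returns its code-unit offset
/// in haystack, or -1 if it does not occur.
ptrdiff_t findEtoile(string haystack)
{
    enum string needle = "\u00E9toil\u00E9"; // precomposed (NFC) form
    enum string core = "toil";               // the easy ASCII part
    size_t from = 0;
    while (from < haystack.length)
    {
        // Plain code-unit substring search, no decoding.
        immutable hit = indexOf(haystack, core, from);
        if (hit < 0)
            return -1;
        immutable coreStart = cast(size_t) hit;
        immutable coreEnd = coreStart + core.length;
        if (coreStart > 0 && coreEnd < haystack.length)
        {
            // Widen the candidate by one grapheme on each side and
            // normalize only that small window.
            immutable start = prevGraphemeStart(haystack, coreStart);
            immutable stop = coreEnd + graphemeStride(haystack, coreEnd);
            if (normalize!NFC(haystack[start .. stop]) == needle)
                return cast(ptrdiff_t) start;
        }
        from = coreStart + 1;
    }
    return -1;
}

void main()
{
    assert(findEtoile("une \u00E9toil\u00E9 ici") == 4); // precomposed
    assert(findEtoile("e\u0301toile\u0301") == 0);       // decomposed
    assert(findEtoile("etoile") == -1);                  // no accents
}
```

Only the few code units around each candidate are decoded and normalized; the rest of the haystack is scanned as raw code units.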


 This penalty with generic algorithms comes from the fact that they take
 a predicate of the form "a == 'a'" or "a == b", which is ill-suited for
 strings because you always need to fully decode the string (by dchar or
 by graphemes) for the purpose of calling the predicate. Given that
 comparing characters for something else than equality or them being part
 of a set is very rarely something you do, generic algorithms miss a big
 optimization opportunity here.

 
 How can we improve that? You can't argue for an inefficient scheme just 
 because what we have isn't as efficient as it could possibly be.

You ask what's inefficient about generic algorithms having customizable 
predicates? You can't implement the above optimization if you can't 
guarantee the predicate is "==". That said, perhaps we can detect "==" 
and only apply the optimization then.

Being able to specify the predicate doesn't gain you much for strings, 
because a < 'a' doesn't make much sense. All you need to check for is 
equality with some value or membership of given character set, both of 
which can use the optimization above.


 So here's what I think we should do:
 
 Todo 1: disallow generic algorithms on naked strings: string-specific
 Unicode-aware algorithms should be used instead; they can share the same
 name if their usage is similar

 
 I don't understand this. We already do this, and by "Unicode-aware" we 
 understand using dchar throughout. This is transparent to client code.

That's probably because you haven't understood the intent (I might not 
have made it very clear either).

The problem I see currently is that you rely on dchar being the element 
type. That should be an implementation detail, not something client 
code can see or rely on. By making it an implementation detail, you can 
later make grapheme-aware algorithms the default without changing the 
API. Since you're the gatekeeper to Phobos, you can make this change 
conditional to getting an acceptable level of performance out of the 
grapheme-aware algorithms, or on other factors like the amount of 
combining characters you encounter in the wild in the next few years.

So the general string functions would implement your compromise (using 
dchar) but not commit indefinitely to it. Someone who really wants to 
work in code points can use toDchar, someone who wants to deal with 
graphemes uses toGraphemes, someone who doesn't care won't choose 
anything and gets the default behaviour of compromise.

All you need to do for this is document it, and try to make sure the 
string APIs don't force the implementation to work with code points.


 Todo 2: to use a generic algorithm with a string, you must dress the
 string using one of toDchar, toGrapheme, toCodeUnits; this way your
 intentions are clear

 
 Breaks a lot of existing code.
 
 Won't fly with Walter unless it solves world hunger. Nevertheless I 
 [...] change about the built-in strings is that they implicitly are 
 two things at the same time. Asking for representation should be 
 explicit.

No, it doesn't break anything. This is just the continuation of what I 
tried to explain above: if you want to be sure you're working with 
graphemes or dchar, say it.

Also, it said nothing about iteration or foreach, so I'm not sure why 
it wouldn't fly with Walter. It can stay as it is, except for one 
thing: you and Walter should really get on the same wavelength 
regarding ElementType!(char[]) and foreach(c; string). I don't care 
that much which is the default, but they absolutely need to be the same.
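
The mismatch can be sketched as follows, assuming current Phobos behaviour where ElementType of a narrow string is dchar while a plain foreach yields code units. The two count* helpers are hypothetical names for illustration.

```d
import std.range.primitives : ElementType;

// The range primitives see a string as a range of code points...
static assert(is(ElementType!string == dchar));

/// ...but a plain foreach visits code units.
size_t countForeachDefault(string s)
{
    size_t n;
    foreach (c; s)
    {
        // c is a (qualified) char here, not a dchar
        static assert(is(typeof(c) : char));
        ++n;
    }
    return n;
}

/// Asking for dchar explicitly makes foreach decode code points.
size_t countForeachDchar(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}
```

On any non-ASCII input the two loops disagree, which is the inconsistency described above.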


 Todo 3: string-specific algorithms can be implemented as simple wrappers
 for generic algorithms with the string dressed correctly for the task,
 or they can implement more sophisticated algorithms to increase performance

 
 One thing I like about the current scheme is that all 
 bidirectional-range algorithms work out of the box with all strings, 
 and lend themselves to optimization whenever you want to.

I like this as the default behaviour too. I think however that you 
should restrict the algorithms that work out of the box to those which 
can also work with graphemes. This way you can change the behaviour in 
the future and support graphemes by a simple upgrade of Phobos.

Algorithms that don't work with graphemes would still work with toDchar.

So what doesn't work with graphemes? Predicates such as "a < b" for 
instance. That's pretty much it.


 This will have trouble passing Walter's wanking test. Mine too; every 
 time I need to write a bunch of forwarding functions, that's a signal 
 something went wrong somewhere. Remember MFC? :o)

The idea is that we write the API as it would apply to graphemes, but 
we implement it using dchar for the time being. Some function 
signatures might have to differ a bit.


 Do you like that more?

 
 This is not about liking. I like doing the right thing as much as you 
 do, and I think Phobos shows that. Clearly doing the right thing 
 through and through is handling combining characters appropriately. The 
 problem is keeping all desiderata in careful balance.

Well then, don't you find it balanced enough? I'm not asking that 
everything be done with graphemes. I'm not even asking that anything be 
done with graphemes by default. I'm only asking that we keep the API 
clean enough so we can pass to graphemes by default in the future 
without having to rewrite all the code everywhere to use byGrapheme. If 
this isn't the right balance.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 17 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/17/11 2:29 PM, Michel Fortin wrote:
 The problem I see currently is that you rely on dchar being the element
 type. That should be an implementation detail, not something client code
 can see or rely on.

But at some point you must be able to talk about individual characters 
in a text. It can't be something that client code doesn't see!!!

SuperDuperText txt;
auto c = giveMeTheFirstCharacter(txt);

What is the type of c? That is visible to the client!


Andrei

Jan 17 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-17 15:49:26 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/17/11 2:29 PM, Michel Fortin wrote:
 The problem I see currently is that you rely on dchar being the element
 type. That should be an implementation detail, not something client code
 can see or rely on.

 
 But at some point you must be able to talk about individual characters 
 in a text. It can't be something that client code doesn't see!!!

It seems that it can. NSString only exposes individual UTF-16 code 
units directly (or semi-directly via an accessor method), even though 
searching and comparing is grapheme-aware. I'm not saying it's a good 
design, but it certainly can work in practice.

In any case, I didn't mean to say the client code shouldn't be aware of 
the characters in a string. I meant that the client shouldn't assume 
the algorithm works at the same layer as ElementType!(string) for a 
given string type. Even if ElementType!(string) is dchar, the default 
function you get if you don't use any of toCodeUnit, toDchar, or 
toGrapheme can work at the dchar or grapheme level if it makes more 
sense that way.

In other words, the client says: "I have two strings, compare them!" 
The client didn't specify if they should be compared by char, wchar, 
dchar, or by normalized grapheme; so we do what's sensible. That's what 
I call the 'default' string functions, those you get when you don't ask 
for anything specific. They should have a signature making them able to 
work at the grapheme level, even though they might not for practical 
reasons (performance). This way if it becomes more important or 
practical to support graphemes, it's easy to evolve to them.
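
As a rough illustration of why the comparison level matters, the sketch below compares two spellings of "é". It assumes std.uni.normalize from later Phobos releases as a stand-in for a "normalized" comparison; both function names are hypothetical, and this is not a claim about how a default comparison would actually be implemented.

```d
import std.uni : normalize, NFC;

/// Compares the raw UTF-8 code units, which is what == does on strings.
bool sameCodeUnits(string a, string b)
{
    return a == b;
}

/// Compares after bringing both sides to NFC, so precomposed and
/// decomposed spellings of the same text are treated as equal.
bool sameAfterNormalization(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}
```

With "\u00E9" on one side and "e\u0301" on the other, the first function reports a difference and the second does not, yet both have the same signature. That is the sense in which the default could change later without changing the API.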


 SuperDuperText txt;
 auto c = giveMeTheFirstCharacter(txt);
 
 What is the type of c? That is visible to the client!

That depends on how you implement the giveMeTheFirstCharacter function. :-)

More seriously, you have four choices:

1. code unit
2. code point
3. grapheme
4. require the client to state explicitly which kind of 'character' he 
wants; 'character' being an overloaded word, it's reasonable to ask for 
disambiguation.

You and Walter can't come to understand each other between 1 and 2, 
regarding foreach and ranges. To keep things consistent with what I 
said above I'd tend to say 4, but that's weird for something that looks 
like an array. My second choice goes for 1 when it comes to 
consistency, and 3 when it comes to correctness, and 2 when it comes to 
being practical.
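
A minimal sketch of choices 1 to 3, to show that each gives giveMeTheFirstCharacter a different return type. The three first* names are hypothetical, and Grapheme and byGrapheme are assumed from std.uni in later Phobos releases.

```d
import std.range.primitives : front;
import std.uni : byGrapheme, Grapheme;

/// Choice 1: the first code unit, no decoding at all.
char firstCodeUnit(string s)
{
    assert(s.length > 0);
    return s[0];
}

/// Choice 2: the first code point, decoded from UTF-8.
dchar firstCodePoint(string s)
{
    assert(s.length > 0);
    return s.front;
}

/// Choice 3: the first grapheme, which may hold several code points.
Grapheme firstGrapheme(string s)
{
    assert(s.length > 0);
    return s.byGrapheme.front;
}
```

Choice 4 amounts to not providing an unqualified "first character" at all and making the caller pick one of the three.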

Given something is going to be inconsistent either way, I'd say any of 
the above is acceptable. But please make sure you and Walter agree on 
the default element type for ranges and foreach.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 17 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com> said:

 More seriously, you have four choices:
 
 1. code unit
 2. code point
 3. grapheme
 4. require the client to state explicitly which kind of 'character' he 
 wants; 'character' being an overloaded word, it's reasonable to ask for 
 disambiguation.

This makes me think of what I did with my XML parser after you made 
code points the element type for strings. Basically, the parser now 
uses 'front' and 'popFront' whenever it needs to get the next code 
point, but most of the time it uses 'frontUnit' and 'popFrontUnit' 
instead (which I had to add) when testing for or skipping an ASCII 
character is sufficient. This way I avoid a lot of unnecessary decoding 
of code points.

For this to work, the same range must let you skip either a unit or a 
code point. If I were using a separate range with a call to toDchar or 
toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't 
have helped much because the new range would essentially become a new 
slice independent of the original, so you can't interleave "I want to 
advance by one unit" with "I want to advance by one code point".

So perhaps the best interface for strings would be to provide multiple 
range-like interfaces that you can use at the level you want.

I'm not sure if this is a good idea, but I thought I should at least 
share my experience.
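
The interleaving described above can be sketched like this. The frontUnit and popFrontUnit helpers here are simplified string-only stand-ins for the ones mentioned in the post, and nextNonSpace is a hypothetical parser step, not code from the actual XML parser.

```d
import std.range.primitives : empty, front, popFront;

// Simplified unit-level helpers (string only), named after the post.
char frontUnit(string s) { assert(s.length > 0); return s[0]; }
void popFrontUnit(ref string s) { assert(s.length > 0); s = s[1 .. $]; }

/// Skips ASCII spaces one code unit at a time (no decoding needed),
/// then decodes and consumes exactly one code point from the SAME slice.
dchar nextNonSpace(ref string input)
{
    while (input.length > 0 && input.frontUnit == ' ')
        input.popFrontUnit();

    assert(!input.empty, "nothing left after the spaces");
    dchar c = input.front; // decodes one code point
    input.popFront();      // advances past all of its code units
    return c;
}
```

Because both kinds of step act on the same slice, the position never has to be copied back from a separate wrapper range.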


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 17 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/17/11 9:48 PM, Michel Fortin wrote:
 On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
 said:

 More seriously, you have four choices:

 1. code unit
 2. code point
 3. grapheme
 4. require the client to state explicitly which kind of 'character' he
 wants; 'character' being an overloaded word, it's reasonable to ask
 for disambiguation.

 This makes me think of what I did with my XML parser after you made code
 points the element type for strings. Basically, the parser now uses
 'front' and 'popFront' whenever it needs to get the next code point, but
 most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
 had to add) when testing for or skipping an ASCII character is
 sufficient. This way I avoid a lot of unnecessary decoding of code points.

 For this to work, the same range must let you skip either a unit or a
 code point. If I were using a separate range with a call to toDchar or
 toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
 have helped much because the new range would essentially become a new
 slice independent of the original, so you can't interleave "I want to
 advance by one unit" with "I want to advance by one code point".

 So perhaps the best interface for strings would be to provide multiple
 range-like interfaces that you can use at the level you want.

 I'm not sure if this is a good idea, but I thought I should at least
 share my experience.

Very insightful. Thanks for sharing. Code it up and make a solid proposal!

Andrei

Jan 17 2011

Steven Wawryk <stevenw acres.com.au> writes:

On 18/01/11 16:46, Andrei Alexandrescu wrote:
 On 1/17/11 9:48 PM, Michel Fortin wrote:
 On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
 said:

 More seriously, you have four choices:

 1. code unit
 2. code point
 3. grapheme
 4. require the client to state explicitly which kind of 'character' he
 wants; 'character' being an overloaded word, it's reasonable to ask
 for disambiguation.

 This makes me think of what I did with my XML parser after you made code
 points the element type for strings. Basically, the parser now uses
 'front' and 'popFront' whenever it needs to get the next code point, but
 most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
 had to add) when testing for or skipping an ASCII character is
 sufficient. This way I avoid a lot of unnecessary decoding of code
 points.

 For this to work, the same range must let you skip either a unit or a
 code point. If I were using a separate range with a call to toDchar or
 toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
 have helped much because the new range would essentially become a new
 slice independent of the original, so you can't interleave "I want to
 advance by one unit" with "I want to advance by one code point".

 So perhaps the best interface for strings would be to provide multiple
 range-like interfaces that you can use at the level you want.

 I'm not sure if this is a good idea, but I thought I should at least
 share my experience.

 Very insightful. Thanks for sharing. Code it up and make a solid proposal!

 Andrei

How does this differ from Steve Schveighoffer's string_t, subtract the 
indexing and slicing of code-points, plus a bidirectional grapheme range?

Jan 17 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/18/11 1:58 AM, Steven Wawryk wrote:
 On 18/01/11 16:46, Andrei Alexandrescu wrote:
 On 1/17/11 9:48 PM, Michel Fortin wrote:
 On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
 said:

 More seriously, you have four choices:

 1. code unit
 2. code point
 3. grapheme
 4. require the client to state explicitly which kind of 'character' he
 wants; 'character' being an overloaded word, it's reasonable to ask
 for disambiguation.

 This makes me think of what I did with my XML parser after you made code
 points the element type for strings. Basically, the parser now uses
 'front' and 'popFront' whenever it needs to get the next code point, but
 most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
 had to add) when testing for or skipping an ASCII character is
 sufficient. This way I avoid a lot of unnecessary decoding of code
 points.

 For this to work, the same range must let you skip either a unit or a
 code point. If I were using a separate range with a call to toDchar or
 toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
 have helped much because the new range would essentially become a new
 slice independent of the original, so you can't interleave "I want to
 advance by one unit" with "I want to advance by one code point".

 So perhaps the best interface for strings would be to provide multiple
 range-like interfaces that you can use at the level you want.

 I'm not sure if this is a good idea, but I thought I should at least
 share my experience.

 Very insightful. Thanks for sharing. Code it up and make a solid
 proposal!

 Andrei

 How does this differ from Steve Schveighoffer's string_t, subtract the
 indexing and slicing of code-points, plus a bidirectional grapheme range?

There's no string, only range...

Andrei

Jan 18 2011

Steven Wawryk <stevenw acres.com.au> writes:

On 19/01/11 02:40, Andrei Alexandrescu wrote:
 On 1/18/11 1:58 AM, Steven Wawryk wrote:
 On 18/01/11 16:46, Andrei Alexandrescu wrote:
 On 1/17/11 9:48 PM, Michel Fortin wrote:
 This makes me think of what I did with my XML parser after you made
 code
 points the element type for strings. Basically, the parser now uses
 'front' and 'popFront' whenever it needs to get the next code point,
 but
 most of the time it uses 'frontUnit' and 'popFrontUnit' instead
 (which I
 had to add) when testing for or skipping an ASCII character is
 sufficient. This way I avoid a lot of unnecessary decoding of code
 points.

 For this to work, the same range must let you skip either a unit or a
 code point. If I were using a separate range with a call to toDchar or
 toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
 have helped much because the new range would essentially become a new
 slice independent of the original, so you can't interleave "I want to
 advance by one unit" with "I want to advance by one code point".

 So perhaps the best interface for strings would be to provide multiple
 range-like interfaces that you can use at the level you want.

 I'm not sure if this is a good idea, but I thought I should at least
 share my experience.

 Very insightful. Thanks for sharing. Code it up and make a solid
 proposal!

 Andrei

 How does this differ from Steve Schveighoffer's string_t, subtract the
 indexing and slicing of code-points, plus a bidirectional grapheme range?

 There's no string, only range...

Which is exactly what I asked you about.  I understand that you must be 
very busy, but how do I get you to look at the actual technical content 
of something?  Is there something in the way I phrase things that makes 
you dismiss my introductory motivation without looking into the content?

I don't mean this as a criticism.  I really want to know because I'm 
considering a proposal on a different topic but wasn't sure it's worth 
it as there seems to be a barrier to getting things considered.

Jan 18 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/18/11 6:00 PM, Steven Wawryk wrote:
 On 19/01/11 02:40, Andrei Alexandrescu wrote:
 On 1/18/11 1:58 AM, Steven Wawryk wrote:
 On 18/01/11 16:46, Andrei Alexandrescu wrote:
 On 1/17/11 9:48 PM, Michel Fortin wrote:
 This makes me think of what I did with my XML parser after you made
 code
 points the element type for strings. Basically, the parser now uses
 'front' and 'popFront' whenever it needs to get the next code point,
 but
 most of the time it uses 'frontUnit' and 'popFrontUnit' instead
 (which I
 had to add) when testing for or skipping an ASCII character is
 sufficient. This way I avoid a lot of unnecessary decoding of code
 points.

 For this to work, the same range must let you skip either a unit or a
 code point. If I were using a separate range with a call to toDchar or
 toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
 have helped much because the new range would essentially become a new
 slice independent of the original, so you can't interleave "I want to
 advance by one unit" with "I want to advance by one code point".

 So perhaps the best interface for strings would be to provide multiple
 range-like interfaces that you can use at the level you want.

 I'm not sure if this is a good idea, but I thought I should at least
 share my experience.

 Very insightful. Thanks for sharing. Code it up and make a solid
 proposal!

 Andrei

 How does this differ from Steve Schveighoffer's string_t, subtract the
 indexing and slicing of code-points, plus a bidirectional grapheme
 range?

 There's no string, only range...

 Which is exactly what I asked you about. I understand that you must be
 very busy, but how do I get you to look at the actual technical content
 of something? Is there something in the way I phrase things that makes
 you dismiss my introductory motivation without looking into the content?

 I don't mean this as a criticism. I really want to know because I'm
 considering a proposal on a different topic but wasn't sure it's worth
 it as there seems to be a barrier to getting things considered.

One simple fact is that I'm not the only person who needs to look at a 
design. If you want to propose something for inclusion in Phobos, please 
put the code in good shape, document it properly, and make a submission 
in this newsgroup following the Boost model. I get one vote and everyone 
else gets a vote.

Looking back at our exchanges in search for a perceived dismissive 
attitude on my part (apologies if it seems that way - it was 
unintentional), I infer your annoyance stems from my answer to this:

 How does this differ from Steve Schveighoffer's string_t,
 subtract the indexing and slicing of code-points, plus a
 bidirectional grapheme range?



I happen to have discussed at length my beef with Steve's proposal. Now 
in one sentence you change the proposed design on the fly without 
fleshing out the consequences, add to it again without substantiation, 
and presumably expect me to come with a salient analysis of the result. 
I don't think it's fair to characterize my answer to that as dismissive, 
nor to pressure me into expanding on it.

Finally, let me say again what I have already said a few times: in order 
to experiment with grapheme-based processing, we need a byGrapheme 
range. There is no need for a new string class. We need a range over the 
existing string types. That would allow us to play with graphemes, 
assess their efficiency and ubiquity, and would ultimately put us in a 
better position when it comes to deciding whether it makes sense to make 
grapheme a character type or the default character type.
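
For reference, this is roughly what such an experiment looks like with the byGrapheme range that std.uni provides in later Phobos releases (it was not available when this was written). The two reverse* helpers are hypothetical names; the sketch only shows that a range over an ordinary string is enough and no new string type is involved.

```d
import std.array : array;
import std.conv : to;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme;

/// Reverses code point by code point; combining marks end up
/// attached to the wrong base character.
string reverseByCodePoint(string s)
{
    return s.retro.array.to!string;
}

/// Reverses grapheme by grapheme; each base character keeps its marks.
string reverseByGrapheme(string s)
{
    return s.byGrapheme   // range of Grapheme over the plain string
        .array
        .retro
        .byCodePoint      // flatten graphemes back into code points
        .array
        .to!string;
}
```

For "noe\u0308l" the code point version moves the diaeresis onto the "l", while the grapheme version keeps it on the "e".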


Andrei

Jan 18 2011

Steven Wawryk <stevenw acres.com.au> writes:

On 19/01/11 11:37, Andrei Alexandrescu wrote:
 On 1/18/11 6:00 PM, Steven Wawryk wrote:
 Which is exactly what I asked you about. I understand that you must be
 very busy, but how do I get you to look at the actual technical content
 of something? Is there something in the way I phrase things that makes
 you dismiss my introductory motivation without looking into the content?

 I don't mean this as a criticism. I really want to know because I'm
 considering a proposal on a different topic but wasn't sure it's worth
 it as there seems to be a barrier to getting things considered.

 One simple fact is that I'm not the only person who needs to look at a
 design. If you want to propose something for inclusion in Phobos, please
 put the code in good shape, document it properly, and make a submission
 in this newsgroup following the Boost model. I get one vote and everyone
 else gets a vote.

Ok, thanks for this suggestion.  But if developing a proposal as 
concrete code is a lot of work that may be rejected, is there a way to 
sound out the idea first before deciding to commit to developing it?


 Looking back at our exchanges in search for a perceived dismissive
 attitude on my part (apologies if it seems that way - it was
 unintentional), I infer your annoyance stems from my answer to this:

 How does this differ from Steve Schveighoffer's string_t,
 subtract the indexing and slicing of code-points, plus a
 bidirectional grapheme range?




No, this was just a summary.  Here is the post that you answered 
dismissively: news://news.digitalmars.com:119/ih030g$1ok1$1 digitalmars.com

 In the interest of moving this on, would it become acceptable to you if:

 1. indexing and slicing of the code-point range were removed?
 2. any additional ranges are exposed to the user according to decisions
 made about graphemes, etc?
 3. other constructive criticisms were accommodated?

 Steve


 On 15/01/11 03:33, Andrei Alexandrescu wrote:
 On 1/14/11 5:06 AM, Steven Schveighoffer wrote:
 I respectfully disagree. A stream built on fixed-sized units, but with
 variable length elements, where you can determine the start of an
 element in O(1) time given a random index absolutely provides
 random-access. It just doesn't provide length.

 I equally respectfully disagree. I think random access is defined as
 accessing the ith element in O(1) time. That's not the case here.

 Andrei
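
The two positions quoted above can be put side by side in a small sketch for UTF-8. codePointStart and indexOfNthCodePoint are hypothetical names; toUTFindex is the std.utf function that converts a code point count into a code unit index.

```d
import std.utf : toUTFindex;

/// Given ANY code unit index i, finds the start of the code point that
/// contains it. UTF-8 continuation bytes look like 10xxxxxx, and there are
/// at most three of them per code point, so the loop is bounded: O(1).
size_t codePointStart(string s, size_t i)
{
    assert(i < s.length);
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

/// Finds the code unit index of the n-th code point (0-based).
/// This has to walk the string from the beginning: O(n).
size_t indexOfNthCodePoint(string s, size_t n)
{
    return toUTFindex(s, n);
}
```

The first function is what the "random access without length" argument relies on; the second is the operation the stricter definition of random access asks to be O(1).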



 I happen to have discussed at length my beef with Steve's proposal. Now
 in one sentence you change the proposed design on the fly without
 fleshing out the consequences, add to it again without substantiation,
 and presumably expect me to come with a salient analysis of the result.
 I don't think it's fair to characterize my answer to that as dismissive,
 nor to pressure me into expanding on it.

Sorry, I could have given more context.  But you didn't discuss what I 
asked, based on the observation that your detailed criticisms of Steve's 
proposal all related to a single aspect of it.

Steve

Jan 18 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/18/11 7:48 PM, Steven Wawryk wrote:
 On 19/01/11 11:37, Andrei Alexandrescu wrote:
 On 1/18/11 6:00 PM, Steven Wawryk wrote:
 Which is exactly what I asked you about. I understand that you must be
 very busy, but how do I get you to look at the actual technical content
 of something? Is there something in the way I phrase things that makes
 you dismiss my introductory motivation without looking into the content?

 I don't mean this as a criticism. I really want to know because I'm
 considering a proposal on a different topic but wasn't sure it's worth
 it as there seems to be a barrier to getting things considered.

 One simple fact is that I'm not the only person who needs to look at a
 design. If you want to propose something for inclusion in Phobos, please
 put the code in good shape, document it properly, and make a submission
 in this newsgroup following the Boost model. I get one vote and everyone
 else gets a vote.

 Ok, thanks for this suggestion. But if developing a proposal as concrete
 code is a lot of work that may be rejected, is there a way to sound out
 the idea first before deciding to commit to developing it?

This is the best place as far as I know.

 Looking back at our exchanges in search for a perceived dismissive
 attitude on my part (apologies if it seems that way - it was
 unintentional), I infer your annoyance stems from my answer to this:

 How does this differ from Steve Schveighoffer's string_t,
 subtract the indexing and slicing of code-points, plus a
 bidirectional grapheme range?

 No, this was just a summary. Here is the post that you answered
 dismissively: news://news.digitalmars.com:119/ih030g$1ok1$1 digitalmars.com

My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a 
response. If you found that dismissive, I'd be hard pressed to improve 
it. To quote myself:

 I believe the proposed scheme:

 1. Changes the language in a major way;

 2. Is highly disruptive;

 3. Improves the status quo in only minor ways.

 I'd be much more willing to improve things by e.g. defining the
 representation() function I talked about a bit ago, and other less
 disruptive additions.

That took into consideration your amendments.

  >
  > In the interest of moving this on, would it become acceptable to you if:
  >
  > 1. indexing and slicing of the code-point range were removed?
  > 2. any additional ranges are exposed to the user according to decisions
  > made about graphemes, etc?
  > 3. other constructive criticisms were accommodated?
  >
  > Steve
  >
  >
  > On 15/01/11 03:33, Andrei Alexandrescu wrote:
  >> On 1/14/11 5:06 AM, Steven Schveighoffer wrote:
  >>> I respectfully disagree. A stream built on fixed-sized units, but with
  >>> variable length elements, where you can determine the start of an
  >>> element in O(1) time given a random index absolutely provides
  >>> random-access. It just doesn't provide length.
  >>
  >> I equally respectfully disagree. I think random access is defined as
  >> accessing the ith element in O(1) time. That's not the case here.
  >>
  >> Andrei
  >

 I happen to have discussed at length my beef with Steve's proposal. Now
 in one sentence you change the proposed design on the fly without
 fleshing out the consequences, add to it again without substantiation,
 and presumably expect me to come with a salient analysis of the result.
 I don't think it's fair to characterize my answer to that as dismissive,
 nor to pressure me into expanding on it.

 Sorry, I could have given more context. But you didn't discuss what I
 asked, based on the observation that your detailed criticisms of Steve's
 proposal all related to a single aspect of it.

I really don't know what to add to make my answer more meaningful.

Andrei

Jan 18 2011

Steven Wawryk <stevenw acres.com.au> writes:

On 19/01/11 13:53, Andrei Alexandrescu wrote:
 My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a
 response. If you found that dismissive, I'd be hard pressed to improve
 it. To quote myself:

 I believe the proposed scheme:

 1. Changes the language in a major way;

 2. Is highly disruptive;

 3. Improves the status quo in only minor ways.

 I'd be much more willing to improve things by e.g. defining the
 representation() function I talked about a bit ago, and other less
 disruptive additions.

 That took into consideration your amendments.

I don't think that it did.  I proposed no language change, nor anything 
disruptive.  The change in status quo I proposed was essentially the 
same one you encouraged here, about a type that gives the user the 
choice of what kind of range to be operated on.  It appears to me that 
you were responding to some perception you had about Steve's full 
proposal (that may have been triggered by something I said in the 
introduction), not what I actually said in the content.

So, I would still be interested to know how to sound out this newsgroup 
with an idea (before coding commitment) and have the suggestions 
considered on something more than a superficial level.

Is the newsgroup too busy?  Should there be people nominated to screen 
ideas that are worth looking at?  Should I use a completely different 
approach?  Your suggestions so far I will take into account, but it 
still looks like there's a barrier to me.


 Sorry, I could have given more context. But you didn't discuss what I
 asked, based on the observation that your detailed criticisms of Steve's
 proposal all related to a single aspect of it.

 I really don't know what to add to make my answer more meaningful.


 Andrei

Jan 18 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/18/11 9:46 PM, Steven Wawryk wrote:
 On 19/01/11 13:53, Andrei Alexandrescu wrote:
 My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a
 response. If you found that dismissive, I'd be hard pressed to improve
 it. To quote myself:

 I believe the proposed scheme:

 1. Changes the language in a major way;

 2. Is highly disruptive;

 3. Improves the status quo in only minor ways.

 I'd be much more willing to improve things by e.g. defining the
 representation() function I talked about a bit ago, and other less
 disruptive additions.

 That took into consideration your amendments.

 I don't think that it did. I proposed no language change, nor anything
 disruptive.

Adding a new string type would be disruptive. Unless I misunderstood, 
there is still a new string type in Steve's proposal, and one that would 
be the default one, even after the amendments you mentioned. That is a 
problem because people write this:

auto s = "hello";

and the question is, what is the type of s.

  The change in status quo I proposed was essentially the same
 one you encouraged here, about a type that gives the user the choice of
 what kind of range to be operated on. It appears to me that you were
 responding to some perception you had about Steve's full proposal (that
 may have been triggered by something I said in the introduction), not
 what I actually said in the content.

If that's what it is, great. To clarify: no new string type, only a 
range that iterates one grapheme at a time over existing strings.

 So, I would still be interested to know how to sound out this newsgroup
 with an idea (before coding commitment) and have the suggestions
 considered on something more than a superficial level.

 Is the newsgroup too busy? Should there be people nominated to screen
 ideas that are worth looking at? Should I use a completely different
 approach? Your suggestions so far I will take into account, but it still
 looks like there's a barrier to me.

My perception is that you want to minimize risks before starting to 
invest work into this. I'm not sure how you can do that.

Andrei

Jan 18 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-18 01:16:13 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/17/11 9:48 PM, Michel Fortin wrote:
 On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
 said:
 
 More seriously, you have four choices:
 
 1. code unit
 2. code point
 3. grapheme
 4. require the client to state explicitly which kind of 'character' he
 wants; 'character' being an overloaded word, it's reasonable to ask
 for disambiguation.

 
 This makes me think of what I did with my XML parser after you made code
 points the element type for strings. Basically, the parser now uses
 'front' and 'popFront' whenever it needs to get the next code point, but
 most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
 had to add) when testing for or skipping an ASCII character is
 sufficient. This way I avoid a lot of unnecessary decoding of code points.
 
 For this to work, the same range must let you skip either a unit or a
 code point. If I were using a separate range with a call to toDchar or
 toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
 have helped much because the new range would essentially become a new
 slice independent of the original, so you can't interleave "I want to
 advance by one unit" with "I want to advance by one code point".
 
 So perhaps the best interface for strings would be to provide multiple
 range-like interfaces that you can use at the level you want.
 
 I'm not sure if this is a good idea, but I thought I should at least
 share my experience.

 
 Very insightful. Thanks for sharing. Code it up and make a solid proposal!

What I use right now is this (see below). I'm not sure what would be a 
good name for it though. The expectation is that I'll get either an 
ASCII char or something out of ASCII range if it isn't ASCII.

The abstraction doesn't seem very 'solid' to me, in the sense that I 
can't see how it'd apply to ranges other than strings, so it's only 
useful for strings (the character array kind), and it's only useful as 
a workaround since you made ElementType!(char[]) a dchar. Well, any 
range returning char, dchar, or wchar could map frontUnit to front and 
popFrontUnit to popFront to keep things working, but it makes the 
optimization rather pointless. I don't really have an idea where to go 
from here.


char frontUnit(string input) {
	assert(input.length > 0);
	return input[0];
}
wchar frontUnit(wstring input) {
	assert(input.length > 0);
	return input[0];
}
dchar frontUnit(dstring input) {
	assert(input.length > 0);
	return input[0];
}

void popFrontUnit(ref string input) {
	assert(input.length > 0);
	input = input[1..$];
}
void popFrontUnit(ref wstring input) {
	assert(input.length > 0);
	input = input[1..$];
}
void popFrontUnit(ref dstring input) {
	assert(input.length > 0);
	input = input[1..$];
}

version (unittest) {
	import std.string : front, popFront;
}

unittest {
	string test = "été";
	assert(test.length == 5);
	
	string test2 = test;
	assert(test2.front == 'é');
	test2.popFront();
	assert(test2.length == 3); // removed "é" which is two UTF-8 code units
	
	string test3 = test;
	assert(test3.frontUnit == "é"c[0]);
	test3.popFrontUnit();
	assert(test3.length == 4); // removed the first half of "é", which is one UTF-8 code unit
}
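
The six overloads above differ only in the string type, so they could probably be collapsed into two templates. This is only a sketch: the names frontUnit and popFrontUnit are kept from the code above, and the isSomeString constraint is my assumption about which types should be accepted.

```d
import std.range.primitives : empty, front, popFront;
import std.traits : isSomeString;

// Hypothetical generic helper: returns the first code unit without decoding.
auto frontUnit(S)(S input) if (isSomeString!S)
{
    assert(input.length > 0);
    return input[0];
}

// Hypothetical generic helper: drops exactly one code unit.
void popFrontUnit(S)(ref S input) if (isSomeString!S)
{
    assert(input.length > 0);
    input = input[1 .. $];
}
```

With these, the same calls work on string, wstring and dstring, and they can still be interleaved with the decoding front and popFront on the same slice.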


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 18 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/18/11 7:17 AM, Michel Fortin wrote:
 On 2011-01-18 01:16:13 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 1/17/11 9:48 PM, Michel Fortin wrote:
 On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
 said:

 More seriously, you have four choices:

 1. code unit
 2. code point
 3. grapheme
 4. require the client to state explicitly which kind of 'character' he
 wants; 'character' being an overloaded word, it's reasonable to ask
 for disambiguation.

 This makes me think of what I did with my XML parser after you made code
 points the element type for strings. Basically, the parser now uses
 'front' and 'popFront' whenever it needs to get the next code point, but
 most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
 had to add) when testing for or skipping an ASCII character is
 sufficient. This way I avoid a lot of unnecessary decoding of code
 points.

 For this to work, the same range must let you skip either a unit or a
 code point. If I were using a separate range with a call to toDchar or
 toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
 have helped much because the new range would essentially become a new
 slice independent of the original, so you can't interleave "I want to
 advance by one unit" with "I want to advance by one code point".

 So perhaps the best interface for strings would be to provide multiple
 range-like interfaces that you can use at the level you want.

 I'm not sure if this is a good idea, but I thought I should at least
 share my experience.

 Very insightful. Thanks for sharing. Code it up and make a solid
 proposal!

 What I use right now is this (see below). I'm not sure what would be a
 good name for it though. The expectation is that I'll get either an
 ASCII char or something out of ASCII range if it isn't ASCII.

 The abstraction doesn't seem very 'solid' to me, in the sense that I
 can't see how it'd apply to ranges other than strings, so it's only
 useful for strings (the character array kind), and it's only useful as a
 workaround since you made ElementType!(char[]) a dchar. Well, any range
 returning char,dchar,wchar could map frontUnit to front and popFrontUnit
 to popFront to keep things working, but it makes the optimization rather
 pointless. I don't really have an idea where to go from here.

[snip]

I was thinking along the lines of:

struct Grapheme
{
     private string support_;
     ...
}

struct ByGrapheme
{
     private string iteratee_;
     bool empty();
     Grapheme front();
     void popFront();
     // Additional funs
     dchar frontCodePoint();
     void popFrontCodePoint();
     char frontCodeUnit();
     void popFrontCodeUnit();
     ...
}

// helper function
ByGrapheme byGrapheme(string s);

// usage
string s = ...;
size_t i;
foreach (g; byGrapheme(s))
{

}

We need this range in Phobos.
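
For readers coming to this thread later: today's Phobos has a byGrapheme range in std.uni. It is not the design sketched above (it has no frontCodePoint or frontCodeUnit members), but it shows the three levels side by side. The Counts struct and countLevels function are hypothetical names used only for this illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper type holding the size of a string at each level.
struct Counts
{
    size_t units;     // UTF-8 code units
    size_t points;    // code points
    size_t graphemes; // grapheme clusters
}

Counts countLevels(string s)
{
    // s.length counts code units, walkLength on a string decodes code points,
    // and byGrapheme groups code points into grapheme clusters.
    return Counts(s.length, s.walkLength, s.byGrapheme.walkLength);
}
```

For "noe\u0308l" (an 'e' followed by a combining diaeresis) the three counts differ, which is exactly the ambiguity of "character" discussed above.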


Andrei

Jan 18 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-18 11:38:45 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/18/11 7:17 AM, Michel Fortin wrote:
 On 2011-01-18 01:16:13 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 
 On 1/17/11 9:48 PM, Michel Fortin wrote:
 On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
 said:
 
 More seriously, you have four choices:
 
 1. code unit
 2. code point
 3. grapheme
 4. require the client to state explicitly which kind of 'character' he
 wants; 'character' being an overloaded word, it's reasonable to ask
 for disambiguation.

 
 This makes me think of what I did with my XML parser after you made code
 points the element type for strings. Basically, the parser now uses
 'front' and 'popFront' whenever it needs to get the next code point, but
 most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
 had to add) when testing for or skipping an ASCII character is
 sufficient. This way I avoid a lot of unnecessary decoding of code
 points.
 
 For this to work, the same range must let you skip either a unit or a
 code point. If I were using a separate range with a call to toDchar or
 toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
 have helped much because the new range would essentially become a new
 slice independent of the original, so you can't interleave "I want to
 advance by one unit" with "I want to advance by one code point".
 
 So perhaps the best interface for strings would be to provide multiple
 range-like interfaces that you can use at the level you want.
 
 I'm not sure if this is a good idea, but I thought I should at least
 share my experience.

 
 Very insightful. Thanks for sharing. Code it up and make a solid
 proposal!

 
 What I use right now is this (see below). I'm not sure what would be a
 good name for it though. The expectation is that I'll get either an
 ASCII char or something out of ASCII range if it isn't ASCII.
 
 The abstraction doesn't seem very 'solid' to me, in the sense that I
 can't see how it'd apply to ranges other than strings, so it's only
 useful for strings (the character array kind), and it's only useful as a
 workaround since you made ElementType!(char[]) a dchar. Well, any range
 returning char,dchar,wchar could map frontUnit to front and popFrontUnit
 to popFront to keep things working, but it makes the optimization rather
 pointless. I don't really have an idea where to go from here.

 [snip]
 
 I was thinking along the lines of:
 
 struct Grapheme
 {
      private string support_;
      ...
 }
 
 struct ByGrapheme
 {
      private string iteratee_;
      bool empty();
      Grapheme front();
      void popFront();
      // Additional funs
      dchar frontCodePoint();
      void popFrontCodePoint();
      char frontCodeUnit();
      void popFrontCodeUnit();
      ...
 }
 
 // helper function
 ByGrapheme byGrapheme(string s);
 
 // usage
 string s = ...;
 size_t i;
 foreach (g; byGrapheme(s))
 {

 }
 
 We need this range in Phobos.

Yes, we need a grapheme range.

But that's not what my thing was about. It was about shortcutting code 
point decoding when it isn't necessary while still keeping the ability 
to decode to code points when iterating on the same range. For 
instance, here's a simple made up example:

	string s = "<hello>";
	if (!s.empty && s.frontUnit == '<')
		s.popFrontUnit(); // skip
	while (!s.empty && s.frontUnit != '>')
		s.popFront(); // do something with each code point
	if (!s.empty && s.frontUnit == '>')
		s.popFrontUnit(); // skip
	assert(s.empty);

Here, since I know I'm testing and skipping for '<', an ASCII 
character, decoding the code point is wasted time, so I skip that 
decoding. The problem is that this optimization can't happen with a 
range that abstracts things at the code point level. I can do it with 
strings because strings still allow you to access code units through 
the indexing operators, but this can't really apply to ranges of code 
points in general.

And parsing with a range of code units would also be a pain, because even 
if I'm testing for '<' as the first character, sometimes I really need 
to advance by code point and test for code points.

One thing that might be interesting is benchmarking my XML parser by 
replacing every instance of frontUnit and popFrontUnit with front and 
popFront. That won't change the results, but it'd give us an idea of 
the overhead of unnecessarily decoding code points.
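
A rough way to set up such a comparison, assuming a current Phobos with std.datetime.stopwatch. This is not the XML parser, just a toy scan over a hypothetical sample string; scanDecoding, scanUnits and report are made-up names, and the timings will vary by machine and compiler flags.

```d
import std.datetime.stopwatch : benchmark;
import std.range.primitives : empty, front, popFront;
import std.stdio : writefln;

immutable string sample = "<hello w\u00F6rld>"; // hypothetical input

// Tests each position through front, which decodes a full code point.
size_t scanDecoding()
{
    string s = sample;
    size_t n;
    while (!s.empty && s.front != '>') { s.popFront(); ++n; }
    return n;
}

// Tests each position through the raw code unit; '>' is ASCII, so no decoding is needed.
size_t scanUnits()
{
    string s = sample;
    size_t n;
    while (s.length > 0 && s[0] != '>') { s.popFront(); ++n; }
    return n;
}

// Runs both variants the same number of times and prints the elapsed durations.
void report()
{
    auto times = benchmark!(scanDecoding, scanUnits)(100_000);
    writefln("decoding: %s, code units: %s", times[0], times[1]);
}
```

Both functions visit the same code points, so any difference in the reported times comes from the test on the front element.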


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 18 2011

spir <denis.spir gmail.com> writes:

On 01/18/2011 06:14 PM, Michel Fortin wrote:

On 2011-01-18 11:38:45 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:
 I was thinking along the lines of:

 struct Grapheme
 {
 private string support_;
 ...
 }

 struct ByGrapheme
 {
 private string iteratee_;
 bool empty();
 Grapheme front();
 void popFront();
 // Additional funs
 dchar frontCodePoint();
 void popFrontCodePoint();
 char frontCodeUnit();
 void popFrontCodeUnit();
 ...
 }

 // helper function
 ByGrapheme byGrapheme(string s);

 // usage
 string s = ...;
 size_t i;
 foreach (g; byGrapheme(s))
 {

 }

 We need this range in Phobos.

 Yes, we need a grapheme range.

 But that's not what my thing was about. It was about shortcutting code
 point decoding when it isn't necessary while still keeping the ability
 to decode to code points when iterating on the same range. For instance,
 here's a simple made up example:

 string s = "<hello>";
 if (!s.empty && s.frontUnit == '<')
 s.popFrontUnit(); // skip
 while (!s.empty && s.frontUnit != '>')
 s.popFront(); // do something with each code point
 if (!s.empty && s.frontUnit == '>')
 s.popFrontUnit(); // skip
 assert(s.empty);

 Here, since I know I'm testing and skipping for '<', an ASCII character,
 decoding the code point is wasted time, so I skip that decoding. The
 problem is that this optimization can't happen with a range that
 abstracts things at the code point level. I can do it with strings
 because strings still allow you to access code units through the
 indexing operators, but this can't really apply to ranges of code points
 in general.

 And parsing with a range of code units would also be a pain, because even
 if I'm testing for '<' as the first character, sometimes I really need
 to advance by code point and test for code points.

This means a single string type that exposes various _synchronised_ range 
levels (code unit, code point, grapheme), doesn't it? As opposed to 
Andrei's approach of ranges being structures external to string types, 
which, IIUC, thus advance independently?

 One thing that might be interesting is benchmarking my XML parser by
 replacing every instance of frontUnit and popFrontUnit with front and
 popFront. That won't change the results, but it'd give us an idea of
 the overhead of unnecessarily decoding code points.

Yes, would you have time to do it? I would be interested in such perf 
measurements. (--> your idea about a Text variant, for which I would 
like to know whether it's still worth decoding systematically.)

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 18 2011

spir <denis.spir gmail.com> writes:

On 01/18/2011 04:48 AM, Michel Fortin wrote:
 On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
 said:

 More seriously, you have four choices:

 1. code unit
 2. code point
 3. grapheme
 4. require the client to state explicitly which kind of 'character' he
 wants; 'character' being an overloaded word, it's reasonable to ask
 for disambiguation.

 This makes me think of what I did with my XML parser after you made code
 points the element type for strings. Basically, the parser now uses
 'front' and 'popFront' whenever it needs to get the next code point, but
 most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
 had to add) when testing for or skipping an ASCII character is
 sufficient. This way I avoid a lot of unnecessary decoding of code points.

 For this to work, the same range must let you skip either a unit or a
 code point. If I were using a separate range with a call to toDchar or
 toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
 have helped much because the new range would essentially become a new
 slice independent of the original, so you can't interleave "I want to
 advance by one unit" with "I want to advance by one code point".

 So perhaps the best interface for strings would be to provide multiple
 range-like interfaces that you can use at the level you want.

 I'm not sure if this is a good idea, but I thought I should at least
 share my experience.

This looks like a very interesting approach. And clear.
I guess range synchronisation would be based on an internal lowest-level 
(code unit) index. Then, you also need internal validity-checking and/or 
offsetting routines when a higher-level range is used after a lower-level 
one has been used. (I mean e.g. to ensure start-of-code-point after a 
code unit popFront, or throw an error.)
Also, how to avoid duplicating many operational functions (e.g. finding a 
given slice) for each level?
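
To make the validity check concrete, here is a minimal sketch for UTF-8 only. The function names startsAtCodePoint and checkedPopFront are hypothetical; the only fact relied on is that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```d
import std.exception : enforce;
import std.range.primitives : popFront;

// True if s is empty or its first code unit starts a code point (UTF-8 only).
bool startsAtCodePoint(const(char)[] s)
{
    // A continuation byte looks like 10xxxxxx, so mask the top two bits.
    return s.length == 0 || (s[0] & 0xC0) != 0x80;
}

// Hypothetical checked variant: refuses to advance by code point
// when a previous code-unit pop left the slice inside a code point.
void checkedPopFront(ref string s)
{
    enforce(startsAtCodePoint(s), "range is positioned inside a code point");
    s.popFront();
}
```

The check costs one mask and one comparison per call, so it could also be compiled in only for debug builds.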

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 18 2011

Ali Çehreli <acehreli yahoo.com> writes:

Michel Fortin wrote:
 On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
 said:

 So perhaps the best interface for strings would be to provide multiple
 range-like interfaces that you can use at the level you want.

That's what I've been thinking. The users can choose whether they want 
random access or not. A grapheme-aware string can provide random access 
at a space cost, or no random access for efficient space use.

I see 5 layers in string processing. Layers 1 and 2 are currently 
handled by D, sometimes in an unclear way. e.g. char[] may be used as an 
array of code units or an array of code points depending on the type of 
iteration.

1) Code units: This is what D provides with its string types

This layer models RandomAccessRange

2) Code points: This is what D and Phobos provide for example with 
foreach(d; stride(s, 1))

dchar[] models RandomAccessRange at this layer

char[] and wchar[] model ForwardRange at this layer

(If I understand it correctly, Steven Schveighoffer is trying to provide 
a pseudo-RandomAccessRange to char[] and wchar[] with his string type.)

3) Graphemes: This is what the string type that spir is working on 
provides. There could be at least two types:

3a) RandomAccessGraphemeRange: Has random access but the data type is large

3b) ForwardGraphemeRange: space-efficient but does not provide random access

I think the programmers would be happy to be able to choose.

4) Letters: Uses either 3a or 3b. This is the layer where the idea of a 
writing system enters the picture: lower/upper case transformations and 
sorting happen at this layer. (I have a library that tries to handle 
this layer but is ignorant of graphemes; I am waiting for spir's string 
type. ;))

4a) Models RandomAccessRange if based on a RandomAccessGraphemeRange

4b) Models ForwardRange if based on a ForwardGraphemeRange

5) Text: Collection of Letters. This is where a name like "ali & tim" is 
correctly capitalized as "ALİ & TIM" because the text consists of two 
separate writing systems. (The same library that I mentioned in 4 tries 
to handle this layer as well.)
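
As a small illustration of how layers 1 and 2 coexist on the same char[] data, the element type named in foreach selects the layer. The function names countUnits and countPoints are made up for this sketch.

```d
// Layer 1: iterate the raw UTF-8 code units.
size_t countUnits(string s)
{
    size_t n;
    foreach (char c; s) ++n;
    return n;
}

// Layer 2: asking for dchar makes foreach decode code points on the fly.
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar d; s) ++n;
    return n;
}
```

For "exposé" the two counts differ by one, because 'é' takes two code units in UTF-8.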

Ali

Jan 18 2011

spir <denis.spir gmail.com> writes:

On 01/19/2011 08:43 AM, Ali Çehreli wrote:
 Michel Fortin wrote:
  > On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
  > said:

  > So perhaps the best interface for strings would be to provide multiple
  > range-like interfaces that you can use at the level you want.

 That's what I've been thinking. The users can choose whether they want
 random access or not. A grapheme-aware string can provide random access
 at a space cost, or no random access for efficient space use.

 I see 5 layers in string processing. Layers 1 and 2 are currently
 handled by D, sometimes in an unclear way. e.g. char[] may be used as an
 array of code units or an array of code points depending on the type of
 iteration.

This is a very good and helpful summary. But you do not list all relevant 
aspects of the question, I guess. Defining which codes belong to a given 
grapheme (what I call "piling") is necessary for true O(1) 
random access, but that is not all. More importantly, all operations 
involving equality comparison (find, count, replace, ...) require 
normalisation --in addition to piling.
A few notes:

 1) Code units: This is what D provides with its string types

 This layer models RandomAccessRange

This level is a pure implementation artifact that simply cannot make any 
sense (from the user's and thus the programmer's point of view).
Any kind of text manipulation (slice, find, replace...) may lead to 
random incorrectness, except when source texts can be guaranteed to hold 
plain ASCII (which may be hard to prove).
Conversely, pieces of text only passed around by an app do not require 
any more costly representation, in terms of time (decoding) or space. In 
addition, concat works provided all pieces share the same encoding 
(ASCII being a subset of most historic charsets and of UTF-8).

 2) Code points: This is what D and Phobos provide for example with
 foreach(d; stride(s, 1))

 dchar[] models RandomAccessRange at this layer

 char[] and wchar[] model ForwardRange at this layer

 (If I understand it correctly, Steven Schveighoffer is trying to provide
 a pseudo-RandomAccessRange to char[] and wchar[] with his string type.)

This level is also a kind of implementation artifact, compared to 
historic charsets, but actually based on a real fact of natural 
languages: they hold composite characters that can thus be coded by 
combining lower-level codes which represent "scripting marks" (base & 
combining ones).
For this reason, this level can make some sense. My latest guess is that 
apps that consider text as an object of study (read: linguistic apps), 
instead of a means, may regularly need to operate at this level, in 
addition to the next one.
Normalisation can be applied at this level --and is necessary for the 
above kind of use case. But using it for operations requiring compare 
will typically also require "piling", that is the next level, if only to 
determine what is to be compared.
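
To show why comparison needs normalisation, here is a sketch using normalize from std.uni in today's Phobos (it did not exist when this was written). The sameText helper is a hypothetical name, and it only handles canonical equivalence, not the "piling" step.

```d
import std.uni : normalize, NFC;

// Compares two strings after bringing both to the same normalisation form.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}
```

A precomposed "\u00E9" and the sequence "e\u0301" differ as code point arrays, yet compare equal through sameText.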

 3) Graphemes: This is what the string type that spir is working on provides.
 There could be at least two types:

This is the meaningful level for, probably, nearly all applications.

 3a) RandomAccessGraphemeRange: Has random access but the data type is large

I guess this is Text's approach? Text is "flash fast" indeed for any 
operation benefiting from random-access. But not only: since it 
normalises its input, it should be far faster for any operation using 
compare (rough evaluations suggest a speed ratio of 1 to 2 orders of 
magnitude).
The cost is high in terms of space, which in turn certainly reduces its 
speed gain in the general case, because of cache (miss) effects. (Thank 
you Michel for making this clear.)

 3b) ForwardGraphemeRange: space-efficient but does not provide random
 access

Is it what Andrei expects, namely a Grapheme type with a corresponding 
ByGrapheme iterator IIUC?
Time efficiency of operations?

3) metadata RandomAccessGraphemeRange

Michel Fortin suggested (off list) an alternative approach to Text: 
instead of actually "piling" at construction time, just store metadata 
upon grapheme bounds. The core benefit is indeed to keep "normal" text 
storage (meaning *char[], for modification): would this point please 
Andrei better?
I let you evaluate various consequences of this change (mostly positive, 
I guess). The same metadata principle could certainly be used for 
further optimisations, but this is another story.
I'm motivated to implement this variant; it looks like the best of both 
worlds to me. (support welcome ;-)
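
A minimal sketch of the metadata idea, to fix ideas. It assumes graphemeStride from std.uni in a current Phobos for finding grapheme bounds; the GraphemeIndex type and its member names are hypothetical, and no normalisation is done here.

```d
import std.uni : graphemeStride;

struct GraphemeIndex
{
    string text;     // ordinary UTF-8 storage, left untouched
    size_t[] starts; // code-unit offset where each grapheme begins

    this(string s)
    {
        text = s;
        // One pass over the text records every grapheme boundary.
        for (size_t i = 0; i < s.length; i += graphemeStride(s, i))
            starts ~= i;
    }

    // Number of graphemes.
    size_t length() const { return starts.length; }

    // O(1) access: returns the n-th grapheme as a slice of the original text.
    string opIndex(size_t n) const
    {
        immutable end = n + 1 < starts.length ? starts[n + 1] : text.length;
        return text[starts[n] .. end];
    }
}
```

The space cost is one size_t per grapheme on top of the unchanged text, instead of one array per grapheme.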

 I think the programmers would be happy to be able to choose.

 4) Letters: Uses either 3a or 3b. This is the layer where the idea of a
 writing system enters the picture: lower/upper case transformations and
 sorting happen at this layer. (I have a library that tries to handle
 this layer but is ignorant of graphemes; I am waiting for spir's string
 type. ;))

 4a) Models RandomAccessRange if based on a RandomAccessGraphemeRange

 4b) Models ForwardRange if based on a ForwardGraphemeRange

I do not understand what this level means. For me, letters are, 
precisely, archetypical true characters, meaning level 3.

[Note: "grapheme", used by Unicode to denote the common sense of 
"character", is simply wrong: "sh" and "ti" are graphemes in English 
(for the same phoneme /ʃ/), not characters; and tab, §, or © are 
probably not considered graphemes by linguists, while they are 
characters. This is the reason why I try to avoid this term and use 
"character", like ICU's doc, to avoid even more confusion.]

 5) Text: Collection of Letters. This is where a name like "ali & tim" is
 correctly capitalized as "ALİ & TIM" because the text consists of two
 separate writing systems. (The same library that I mentioned in 4 tries
 to handle this layer as well.)

This is an immensely complicated field. Note that it has nothing to do 
with text & character representation issues: whatever the character set, 
one has to confront problems like uppercase of 'i', 'ss' vs 'ß', 
definition of "letter" or "character", matching, sorting order...
Text does not even try to address natural language issues. Instead it 
deals only, but hopefully clearly & correctly, with restoring a simple and 
safe representation for client apps.

 Ali

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 19 2011

spir <denis.spir gmail.com> writes:

On 01/17/2011 05:34 PM, Michel Fortin wrote:
 As I said: all those people who are not validating the inputs to make
 sure they don't contain combining code points. As far as I know, no one
 is doing that, so that means everybody should use algorithms capable of
 handling multi-code-point graphemes. If someone indeed is doing this
 validation, he'll probably also be smart enough to make his algorithms
 to work with dchars.

Actually, there are at least 2 special cases:
* apps that only deal with pre-unicode source stuff
* apps that only deal with source stuff "mechanically" generated by 
text-producing software which itself guarantees single-code-only graphemes

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 17 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 15 Jan 2011 17:45:37 -0500, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:

 On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin   
 <michel.fortin michelf.com> wrote:

 On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"   
 <schveiguy yahoo.com> said:

 I'm not suggesting we impose it, just that we make it the default.  
 If   you want to iterate by dchar, wchar, or char, just write:
  	foreach (dchar c; "exposé") {}
 	foreach (wchar c; "exposé") {}
 	foreach (char c; "exposé") {}
 	// or
 	foreach (dchar c; "exposé".by!dchar()) {}
 	foreach (wchar c; "exposé".by!wchar()) {}
 	foreach (char c; "exposé".by!char()) {}
  and it'll work. But the default would be a slice containing the    
 grapheme, because this is the right way to represent a Unicode   
 character.

  I think this is a good idea.  I previously was nervous about it,  
 but  I'm  not sure it makes a huge difference.  Returning a char[]  
 is  certainly less  work than normalizing a grapheme into one or more  
 code  points, and then  returning them.  All that it takes is to  
 detect all  the code points within  the grapheme.  Normalization can  
 be done if  needed, but would probably  have to output another  
 char[], since a  normalized grapheme can occupy more  than one dchar.

  I'm glad we agree on that now.

  It's a matter of me slowly wrapping my brain around unicode and how  
 it's  used.  It seems like it's a typical committee defined standard  
 where there  are 10 ways to do everything, I was trying to weed out the  
 lesser used (or  so I perceived) pieces to allow a more implementable  
 library.  It's doubly  hard for me since I have limited experience with  
 other languages, and I've  never tried to write them with a computer  
 (my language classes in high  school were back in the days of actually  
 writing stuff down on paper).

 Actually, I don't think Unicode was so badly designed. It's just that  
 nobody had an idea of the real scope of the problem they had in hand at  
 first, and so they had to add a lot of things but wanted to keep things  
 backward-compatible. We're at Unicode 6.0 now, can you name one other  
 standard that evolved enough to get 6 major versions? I'm surprised it's  
 not worse given all that it must support.

I didn't read the standard, all I understand about unicode is from this NG  
;)  What I meant was the ability to do things more than one way seems like  
a committee-designed standard.  Usually with one of those, you have one  
party who "absolutely needs" one way of doing things (most likely because  
all their code is based on it), and other parties who want it a different  
way.  When compromises occur, the end result is, you have a standard  
that's unnecessarily difficult to implement.

 Indeed, the change would probably be too radical for D2.

 I think we agree that the default type should behave as a Unicode  
 string, not an array of characters. I understand your opposition to  
 conflating arrays of char with strings, and I agree with you to a  
 certain extent that it could have been done better. But we can't really  
 change the type of string literals, can we. The only thing we can change  
 (I hope) at this point is how iterating on strings work.

I was hoping to change string literal types.  If we don't do that, we have  
a half-ass solution.  I don't think it's going to be impossible, because  
string, wstring, dstring are all aliases.

In fact, with my current proposed type, this already works:

mystring s = "hello";

But this doesn't:

auto s = "hello"; // still typed as immutable(char)[]

This isn't so bad, just require one to specify the type, right?  Well, it  
fails miserably here:

void foo(mystring s) {...}
foo("hello"); // fails to match.

In order to have a string type, string literals have to be typed as that  
type.
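
Here is a self-contained sketch of the three cases. MyString is a hypothetical stand-in for my proposed type (just a wrapper with a constructor taking a string), and foo and demo are made-up names.

```d
// Hypothetical wrapper standing in for the proposed string type.
struct MyString
{
    string data;
    this(string s) { data = s; }
}

size_t foo(MyString s) { return s.data.length; }

void demo()
{
    MyString a = "hello";                   // OK: initialization calls the constructor
    auto b = "hello";                       // still immutable(char)[]
    static assert(is(typeof(b) == string));
    // No implicit construction for function arguments, so this does not compile:
    static assert(!__traits(compiles, foo("hello")));
    assert(foo(MyString("hello")) == 5);    // the caller has to spell out the type
    assert(a.data == b);
}
```

The last static assert is the one that hurts: every call site passing a literal would need an explicit wrapper.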

 Walter said earlier that he oppose changing foreach's default element  
 type to dchar for char[] and wchar[] (as Andrei did for ranges) on the  
 ground that it would silently break D1 compatibility. This is a valid  
 point in my opinion.

 I think you're right when you say that not treating char[] as an array  
 of character breaks, to a certain extent, C compatibility. Another valid  
 point.

 That said, I want to emphasize that iterating by grapheme, contrary to  
 iterating by dchar, does not break any code *silently*. The compiler  
 will complain loudly that you're comparing a string to a char, so you'll  
 have to change your code somewhere if you want things to compile. You'll  
 have to look at the code and decide what to do.

Changing iteration and not indexing is not going to fix the mess we have  
right now.

 One more thing:

 NSString in Cocoa is in essence the same thing as I'm proposing here: an  
 array of UTF-16 code units, but with string behaviour. It supports  
 by-code-unit indexing, but appending, comparing, searching for  
 substrings, etc. all behave correctly as a Unicode string. Again, I  
 agree that it's probably not the best design, but I can tell you it  
 works well in practice. In fact, NSString doesn't even expose the  
 concept of grapheme, it just uses them internally, and you're pretty  
 much limited to the built-in operation. I think what we have here in  
 concept is much better... even if it somewhat conflates code-unit arrays  
 and strings.

But is NSString typed the *exact same* as an array, or is it a wrapper for  
an array?  Looking at the docs, it appears it is not.

 Or you could make a grapheme a string_t. ;-)

  I'm a little uneasy having a range return itself as its element type.   
 For  all intents and purposes, a grapheme is a string of one 'element',  
 so it  could potentially be a string_t.
  It does seem daunting to have so many types, but at the same time,  
 types  convey relationships at compile time that can make coding  
 impossible to  get wrong, or make things actually possible when having  
 a single type  doesn't.
  I'll give you an example from a previous life:
  [...]
 I feel that making extra types when the relationship between them is   
 important is worth the possible repetition of functionality.  Catching   
 bugs during compilation is soooo much better than experiencing them  
 during  runtime.

 I can understand the utility of a separate type in your DateTime  
 example, but in this case I fail to see any advantage.

 I mean, a grapheme is a slice of a string, can have multiple code points  
 (like a string), can be appended the same way as a string, can be  
 composed or decomposed using canonical normalization or compatibility  
 normalization (like a string), and should be sorted, uppercased, and  
 lowercased according to Unicode rules (like a string). Basically, a  
 grapheme is just a string that happens to contain only one grapheme.  
 What would a custom type do differently than a string?

A grapheme type would not be a range, it would be an element of the string  
range.  You could not append to it (otherwise, that makes it into a  
string).

In all other respects, it should act similar to a string (as you say,  
printing, upper-casing, comparison, etc.)

 Also, grapheme == "a" is easy to understand because both are strings.  
 But if a grapheme is a separate type, what would a grapheme literal look  
 like?

A grapheme should be comparable to a string literal.  It should be  
assignable to a string literal.  The drawback is we would need a runtime  
check to ensure the string literal was actually one grapheme.  Some  
compiler help in this regard would be useful, but I'm not sure how the  
mechanics would work (you couldn't exactly type a literal differently  
based on its contents).  Another possibility is to come up with a  
different syntax to denote grapheme literals.
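
A sketch of what the runtime check could look like, assuming graphemeStride from std.uni in a current Phobos. GraphemeValue and isSingleGrapheme are hypothetical names, not the candidate type itself.

```d
import std.uni : graphemeStride;

// True if s consists of exactly one grapheme.
bool isSingleGrapheme(string s)
{
    return s.length > 0 && graphemeStride(s, 0) == s.length;
}

// Hypothetical grapheme type: built from a string literal, checked at run time.
struct GraphemeValue
{
    string data;

    this(string s)
    {
        assert(isSingleGrapheme(s), "literal must contain exactly one grapheme");
        data = s;
    }

    // Comparable to a string literal, as discussed above.
    bool opEquals(string rhs) const { return data == rhs; }
}
```

There is no append operation on purpose, since appending would turn the value into a string.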

 So in the end I don't think a grapheme needs a specific type, at least  
 not for general purpose text processing. If I split a string on  
 whitespace, do I get a range where elements are of type "word"? No, just  
 sliced strings.

It is not clear that using a separate type is the "right answer."  It may  
be that an element of a string should be a string.  This does work in  
other languages that don't have a concept of a character.  An extra type  
however, allows us to have more concrete positions to work with.

 That said, I'm much less concerned by the type used to represent a  
 grapheme than by the Unicode correctness. I'm not opposed to a separate  
 type, I just don't really see the point.

I will try to explain better by making an actual candidate type.

-Steve

Jan 17 2011

spir <denis.spir gmail.com> writes:

On 01/15/2011 11:45 PM, Michel Fortin wrote:
 That said, I'm sure if someone could redesign Unicode by breaking
 backward-compatibility we'd have something simpler. You could probably
 get rid of pre-combined characters and reduce the number of
 normalization forms. But would you be able to get rid of normalization
 entirely? I don't think so. Reinventing Unicode is probably not worth it.

I think like you about pre-composed characters: they bring no real gain 
(even for easing the passage from historic charsets, since texts must be 
decoded anyway, so mapping to single or multiple codes costs nothing).
But they add complication to the design by proposing 2 parallel 
representation schemes (one character <--> one "code pile" versus one 
character <--> one precomposed code). And they put much weight on the back 
of software (and programmers) with respect to correct indexing/slicing and 
comparison, search, count, etc. This is where normalisation forms enter the game.
My choice would be:
* decomposed form only
* ordering imposed by the standard at text-composition time
==> no normalisation because everything is normalised from scratch.

What remains is only what I call "piling". But we cannot easily get rid of 
it --without separators in standard UTF encodings.
I had the idea of UTF-33 ;-): an alternative freely agreed-upon encoding 
that just says (in addition to UTF-32) that the content is already 
normalised (NFD decomposed and ordered): either so produced initially or 
already processed. So that software can happily read texts in and only 
think about piling if needed. UTF-33+ would add "grapheme" separators (a 
costly solution in terms of space) to get rid of piling.
The aim indeed being to avoid stupidly doing the same job multiple 
times on the same text.

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 17 2011

Ali Çehreli <acehreli yahoo.com> writes:

Thanks to all who have contributed, I am also following this thread with 
great interest. :)

Michel Fortin wrote:
 I mean, a grapheme is a slice of a string, can have multiple code points
 (like a string), can be appended the same way as a string, can be
 composed or decomposed using canonical normalization or compatibility
 normalization (like a string), and should be sorted, uppercased, and
 lowercased according to Unicode rules (like a string). Basically, a
 grapheme is just a string that happens to contain only one grapheme.

I would like to stress the fact that Unicode knows nothing about 
sorting, uppercasing, or lowercasing.

Those operations are tied to the alphabet (or writing system) that a 
certain grapheme happens to belong to at a given time. For example, we 
cannot uppercase the letter i without knowing what alphabet we are 
dealing with. Two possibilities: I and İ (I dot above).

It is the same issue with sorting.
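
To make the dependency concrete, here is a minimal D sketch. It assumes a current Phobos in which std.uni.toUpper applies the default, language-neutral mapping to a single code point; toUpperFor and its turkish flag are hypothetical names invented for this illustration, not library API.

```d
import std.stdio : writeln;
import std.uni : toUpper;

/// Hypothetical helper (not a Phobos function): upper-cases one code point,
/// applying the Turkish dotted-i rule when `turkish` is true.
dchar toUpperFor(dchar c, bool turkish)
{
    if (turkish && c == 'i')
        return '\u0130';   // LATIN CAPITAL LETTER I WITH DOT ABOVE
    return toUpper(c);     // default mapping, knows nothing about the language
}

void main()
{
    writeln(toUpperFor('i', false)); // I
    writeln(toUpperFor('i', true));  // İ
}
```

The point of the sketch is only that the caller has to supply the language; nothing in the code point itself says which mapping is wanted.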

Ali

Jan 17 2011

spir <denis.spir gmail.com> writes:

On 01/18/2011 06:11 AM, Ali Çehreli wrote:
 Thanks to all that has contributed, I am also following this thread with
 great interest. :)

 Michel Fortin wrote:
  > I mean, a grapheme is a slice of a string, can have multiple code points
  > (like a string), can be appended the same way as a string, can be
  > composed or decomposed using canonical normalization or compatibility
  > normalization (like a string), and should be sorted, uppercased, and
  > lowercased according to Unicode rules (like a string). Basically, a
  > grapheme is just a string that happens to contain only one grapheme.

 I would like to stress the fact that Unicode knows nothing about
 sorting, uppercasing, or lowercasing.

 Those operations are tied to the alphabet (or writing system) that a
 certain grapheme happens to belong to at a given time. For example, we
 cannot uppercase the letter i without knowing what alphabet we are
 dealing with. Two possibilities: I and İ (I dot above).

 It is the same issue with sorting.

This is true and false ;-)

You are right, indeed, that issues like sorting are 
language-specific and, moreover, use-case-specific. The Turkish case 
is a good example. For another one, in French I do not even know 
whether there is an official rule! Anyway, whatever the answer, even e.g. 
famous newspapers and official documents have used different rules. Most of 
them drop accents on uppercase, possibly because of computer 
limitations; there is a recent move (back) toward accented uppercase.
This is very annoying: "Hélène" has 2 consistent uppercase versions, 
both in use. Conversely, how is software supposed to guess the lowercase 
version of "HELENE"?

As for Unicode, it still defines norms for casing and so-called collation 
(comparison, for sorting) algorithms. I don't know much more; I have never 
applied them personally, for reasons like the ones above. The full list of its 
technical docs can be found at http://unicode.org/reports/. See in 
particular http://unicode.org/reports/tr10/ for collation. 
(Unfortunately, case mapping is now part of the core standard doc, so 
that it's hard to get at.)

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 19 2011

spir <denis.spir gmail.com> writes:

On 01/15/2011 05:59 PM, Steven Schveighoffer wrote:
 I think this is a good alternative, but I'd rather not impose this on
 people like myself who deal mostly with English.  I think this should be
 possible to do with wrapper types or intermediate ranges which have
 graphemes as elements (per my suggestion above).

I am unsure now about the question of a text's (apparent) natural 
language in relation to Unicode issues. For instance English, precisely, 
seems to often include foreign words literally (or is it a kind of 
pedantry from highly educated people?). In fact, users are free to 
include whatever characters they like, as long as their text-composition 
interface allows it. All main OSes, I guess, now have at least one 
standard way to type in characters (or code points) that are not directly 
accessible on keyboards, and applications sometimes offer another.
Some kinds of users love to play with such flexibility. So, maybe, the 
right question is not the one of natural language but of 
text-composition means. I guess that as soon as a human user may have 
freely typed or edited a text, we cannot guarantee much about its actual 
content, what do you think?
The case of historic ASCII-only text is relevant, indeed, but will quickly 
become less so. And how does an application writer recognise such texts 
without iterating over the whole content? (The encoding is UTF-8 compatible.)

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 17 2011

Gerrit Wichert <gwichert yahoo.com> writes:

Am 14.01.2011 15:34, schrieb Steven Schveighoffer:
 Is it common to have multiple modifiers on a single character?  The
 problem I see with using decomposed canonical form for strings is that
 we would have to return a dchar[] for each 'element', which severely
 complicates code that, for instance, only expects to handle English.

 I was hoping to lazily transform a string into its composed canonical
 form, allowing the (hopefully rare) exception when a composed
 character does not exist.  My thinking was that this at least gives a
 useful string representation for 90% of usages, leaving the remaining
 10% of usages to find a more complex representation (like your Text
 type).  If we only get like 20% or 30% there by making dchar the
 element type, then we haven't made it useful enough.

I'm afraid that this is not a proper way to handle this problem. It may
be better for a language not to 'translate' by default.
If the user wants to convert the code points, this can be requested on
demand. But premature default conversion is a subtle way to lose
information that may be important.
Imagine we want to write a tool for dealing with the input/output of some
other ignorant legacy software. Even if it is only text files, that
software may choke on some converted input. So I believe that it is very
important that we are able to reproduce strings in exactly the form in
which we read them.

Gerrit

Jan 14 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 14 Jan 2011 15:54:19 -0500, Gerrit Wichert <gwichert yahoo.com>  
wrote:

 Am 14.01.2011 15:34, schrieb Steven Schveighoffer:
 Is it common to have multiple modifiers on a single character?  The
 problem I see with using decomposed canonical form for strings is that
 we would have to return a dchar[] for each 'element', which severely
 complicates code that, for instance, only expects to handle English.

 I was hoping to lazily transform a string into its composed canonical
 form, allowing the (hopefully rare) exception when a composed
 character does not exist.  My thinking was that this at least gives a
 useful string representation for 90% of usages, leaving the remaining
 10% of usages to find a more complex representation (like your Text
 type).  If we only get like 20% or 30% there by making dchar the
 element type, then we haven't made it useful enough.

 I'm afraid that this is not a proper way to handle this problem. It may
 be better for a language not to 'translate' by default.
 If the user wants to convert the code points, this can be requested on
 demand. But premature default conversion is a subtle way to lose
 information that may be important.
 Imagine we want to write a tool for dealing with the input/output of some
 other ignorant legacy software. Even if it is only text files, that
 software may choke on some converted input. So I believe that it is very
 important that we are able to reproduce strings in exactly the form in
 which we read them.

Actually, this would only lazily *and temporarily* convert the string per  
grapheme.  Essentially, the original is left alone, so no harm there.
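
A minimal sketch of that idea, assuming a current Phobos whose std.uni provides graphemeStride and normalize (the design discussed in this thread predates them). The helper name composedGraphemes is hypothetical, and for brevity it collects results into an array; a real design would expose the same loop as a lazy range.

```d
import std.stdio : writeln;
import std.uni;

/// Hypothetical helper: walks `s` grapheme by grapheme and yields a
/// temporary NFC-composed copy of each one. `s` itself is never modified.
string[] composedGraphemes(string s)
{
    string[] result;
    size_t i = 0;
    while (i < s.length)
    {
        immutable n = graphemeStride(s, i);  // code units in this grapheme
        auto slice = s[i .. i + n];          // view into the original text
        result ~= normalize!NFC(slice);      // composed form of this grapheme only
        i += n;
    }
    return result;
}

void main()
{
    string original = "e\u0301x";            // 'e' + COMBINING ACUTE ACCENT, then 'x'
    writeln(composedGraphemes(original));    // ["é", "x"]
    writeln(original == "e\u0301x");         // true: the source text is untouched
}
```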

-Steve.

Jan 15 2011

"Joel C. Salomon" <joelcsalomon gmail.com> writes:

On 01/14/2011 09:34 AM, Steven Schveighoffer wrote:
 Is it common to have multiple modifiers on a single character?  The
 problem I see with using decomposed canonical form for strings is that
 we would have to return a dchar[] for each 'element', which severely
 complicates code that, for instance, only expects to handle English.

Hebrew:
• Almost every letter in a printed Hebrew bible has at least one of—
  ‣ vowel marker (the Hebrew alphabet is otherwise consonantal) and
  ‣ a /dagesh/ dot, indicating the difference between /b/ & /v/, or
    between /mm/ and /m/;
• almost every word has at least one letter with a cantillation mark in
  addition to the above; and
• other marks too complicated & off-topic to explain.

Vietnamese uses Latin letters with accents playing multiple roles, so
there are often two or three accent marks on a single letter; e.g., the
name of the creator of pdfTeX is spelled “Hàn Thế Thành”, with two
accents on the “e”.
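
The Vietnamese case can be checked with a short D sketch, assuming a current Phobos std.uni (normalize, byGrapheme). The helper name nfdCodePoints is invented for the example. It decomposes the single code point U+1EBF ("ế") and shows that it becomes a base letter plus two combining marks while remaining one grapheme.

```d
import std.conv : to;
import std.range : walkLength;
import std.stdio : writefln;
import std.uni;

/// Hypothetical helper: canonical decomposition (NFD) as an array of code points.
dstring nfdCodePoints(string s)
{
    return normalize!NFD(s).to!dstring;
}

void main()
{
    auto d = nfdCodePoints("\u1EBF");  // LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE
    foreach (dchar c; d)
        writefln("U+%04X", cast(uint) c);  // U+0065, U+0302, U+0301
    writefln("%s code points, %s grapheme",
             d.length, d.byGrapheme.walkLength);  // 3 code points, 1 grapheme
}
```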

I’m sure there are others.

—Joel

Jan 23 2011

"Nick Sabalausky" <a a.a> writes:

"spir" <denis.spir gmail.com> wrote in message 
news:mailman.624.1295013588.4748.digitalmars-d puremagic.com...
 If it does not display properly, either set your terminal to UTF* or use a 
 more unicode-aware font (eg DejaVu series).

How to do that on the Windows (XP) command prompt, for anyone who doesn't 
know:

Step 1:

Right-click title bar, "Properties", "Font" tab, set font to "Lucida 
Console" (It'll look weird at first, but you get used to it.)

Step 2 (I had to google this step):

For just the current terminal session: Run "chcp 65001" (i.e. "CHange Code 
Page"). Also, you can run "chcp" to just see what codepage you're already set 
to.

To make it work permanently: Put "chcp 65001" into the registry key 
"HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun"

Jan 14 2011

"Nick Sabalausky" <a a.a> writes:

"Nick Sabalausky" <a a.a> wrote in message 
news:igq9u6$1bqu$1 digitalmars.com...
 Step 2 (I had to google this step):

 For just the current terminal session: Run "chcp 65001". (Ie "CHange Code 
 Page) Also, you can run "chcp" to just see what codepage you're already 
 set to.

 To make it work permanently: Put "chcp 65001" into the registry key 
 "HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun"

Forget that step 2, that causes "Active code page: 65001" to be sent to 
stdout *every* time system() is invoked. We shouldn't be relying on that. 
*This* is what should be done (and this really should be done in all D 
command line apps - or better yet, put into the runtime):

import std.stdio;

version(Windows)
{
    import std.c.windows.windows;
    // Win32 console API, declared by hand; BOOL and UINT come from the import above.
    extern(Windows) export BOOL SetConsoleOutputCP(UINT);
}

void main()
{
    // 65001 is the UTF-8 code page.
    version(Windows) SetConsoleOutputCP(65001);

    writeln("HuG says: Fukken Über Death Terminal");
}

See also: http://d.puremagic.com/issues/show_bug.cgi?id=1448

Jan 14 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 1/14/11, Nick Sabalausky <a a.a> wrote:
 import std.stdio;

 version(Windows)
 {
     import std.c.windows.windows;
     extern(Windows) export BOOL SetConsoleOutputCP(UINT);
 }

 void main()
 {
     version(Windows) SetConsoleOutputCP(65001);

     writeln("HuG says: Fukken Über Death Terminal");
 }

Does that work for you? I get back:
HuG says: Fukken =C3=9Cber Death Terminal

Jan 14 2011

"Nick Sabalausky" <a a.a> writes:

"Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message 
news:mailman.631.1295038817.4748.digitalmars-d puremagic.com...
On 1/14/11, Nick Sabalausky <a a.a> wrote:
 import std.stdio;

 version(Windows)
 {
     import std.c.windows.windows;
     extern(Windows) export BOOL SetConsoleOutputCP(UINT);
 }

 void main()
 {
     version(Windows) SetConsoleOutputCP(65001);

     writeln("HuG says: Fukken Über Death Terminal");
 }

Does that work for you? I get back:
HuG says: Fukken Über Death Terminal

Yea, it works for me (XP Pro SP2 32-bit), and my "chcp" is 437, not 65001. 
The NG or copy-paste might have messed it up. Try with a code-point escape 
sequence:

import std.stdio;

version(Windows)
{
    import std.c.windows.windows;
    extern(Windows) export BOOL SetConsoleOutputCP(UINT);
}

void main()
{
    version(Windows) SetConsoleOutputCP(65001);

    writeln("HuG says: Fukken \u00DCber Death Terminal");
}

Jan 14 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 1/14/11, Nick Sabalausky <a a.a> wrote:
 Try with a code-point escape sequence

Nope, I still get the same results (tried with different fonts, lucida
etc.., but I don't think it's a font issue). Maybe I have my settings
messed up or something.

Jan 14 2011

"Nick Sabalausky" <a a.a> writes:

"Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message 
news:mailman.633.1295044452.4748.digitalmars-d puremagic.com...
 On 1/14/11, Nick Sabalausky <a a.a> wrote:
 Try with a code-point escape sequence

 Nope, I still get the same results (tried with different fonts, lucida
 etc.., but I don't think it's a font issue). Maybe I have my settings
 messed up or something.

Weird. Which version of windows are you on, and are you using the regular 
command line or powershell or something else? If you run "chcp 65001" from 
the cmd line first, does it work then?

Jan 14 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 1/14/11, Nick Sabalausky <a a.a> wrote:
 Weird. Which version of windows are you on, and are you using the regular
 command line or powershell or something else? If you run "chcp 65001" from
 the cmd line first, does it work then?

Okay, it appears this is an issue with Console2. I'll have to report
it to the dev, although he hasn't fixed much of anything for ages
already. I'm really contemplating writing my own shell by now. (no
Linux jokes now, please. :p)

Works fine in cmd.exe, Lucida font without calling 65001 manually. In
fact, it works with the 437 code page as well when I comment out
SetConsoleOutputCP.

Jan 14 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 1/15/11, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:
 fact, it works with the 437 code page as well when I comment out
 SetConsoleOutputCP.

Woops, let me revise what I've said:

If the code has the call to change the codepage, then I'll get back
the correct result in console.
If it doesn't, I have to switch the codepage manually.

I don't know what the problem with Console2 is, but if I change
cmd.exe to always use a Lucida font then Console2 will output the
correct result (even though I'm using fixedsys in Console2).

This is getting too specific and I don't want to hijack the thread.
Everything is working fine now. Thx. :)

Jan 14 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/13/11 7:09 PM, Michel Fortin wrote:
 That's forgetting that most of the time people care about graphemes
 (user-perceived characters), not code points.

 
 I'm not so sure about that. What do you base this assessment on? Denis 
 wrote a library that according to him does grapheme-related stuff 
 nobody else does. So apparently graphemes is not what people care about 
 (although it might be what they should care about).

Apple implemented all these things in the NSString class in Cocoa. They 
did all this work on Unicode at the beginning of Mac OS X, at a time 
when making such changes wouldn't break anything.

It's a hard thing to change later when you have code that depends on the 
old behaviour. It's a complicated matter and not so many people will 
understand the issues, so it's no wonder many languages just deal with 
code points.


 This might be a good time to see whether we need to address graphemes 
 systematically. Could you please post a few links that would educate me 
 and others in the mysteries of combining characters?

As usual, Wikipedia offers a good summary and a couple of references. 
Here's the part about combining characters: 
<http://en.wikipedia.org/wiki/Combining_character>.

There are basically four ranges of code points which are combining:
- Combining Diacritical Marks (0300–036F)
- Combining Diacritical Marks Supplement (1DC0–1DFF)
- Combining Diacritical Marks for Symbols (20D0–20FF)
- Combining Half Marks (FE20–FE2F)

A code point followed by one or more code points in these ranges is 
conceptually a single character (a grapheme).
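
The rule just stated can be sketched in D as follows. The two helper names are invented for this illustration, and the four blocks are taken from the list above; they are not exhaustive (the Hebrew points discussed elsewhere in this thread, for instance, live in the Hebrew block). The last line compares against byGrapheme from a current Phobos std.uni, which is assumed to be available.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

/// True if `c` lies in one of the four blocks listed above.
/// Hypothetical helper; deliberately not a complete combining-mark test.
bool inListedCombiningBlock(dchar c)
{
    return (c >= 0x0300 && c <= 0x036F)   // Combining Diacritical Marks
        || (c >= 0x1DC0 && c <= 0x1DFF)   // ... Supplement
        || (c >= 0x20D0 && c <= 0x20FF)   // ... for Symbols
        || (c >= 0xFE20 && c <= 0xFE2F);  // Combining Half Marks
}

/// Naive grapheme count: every code point outside the listed blocks starts one.
size_t countGraphemesNaive(string s)
{
    size_t n = 0;
    foreach (dchar c; s)                  // `dchar` decodes to code points
        if (!inListedCombiningBlock(c))
            ++n;
    return n;
}

void main()
{
    string s = "e\u0301a\u0300\u0323";    // e + acute, a + grave + dot below
    writeln(s.walkLength);                // 5 code points
    writeln(countGraphemesNaive(s));      // 2 graphemes by the naive rule
    writeln(s.byGrapheme.walkLength);     // 2 graphemes according to std.uni
}
```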

But for comparing strings correctly, you need to determine 
canonical equivalence. Wikipedia describes it in its Unicode normalization 
article <http://en.wikipedia.org/wiki/Unicode_normalization>. The full 
algorithm specification can be found here: 
<http://unicode.org/reports/tr15/>. The canonical form 
has both a composed and a decomposed variant, the first using 
pre-combined characters when possible, the second not using any 
pre-combined character. Combining marks are not the only concern: there 
are a few single-code-point characters which have a duplicate somewhere 
else in the code point table.

Also, there are two normalizations: the canonical one (described above) 
and the compatibility one, which is more lax (making the ligature "ﬂ" 
equivalent to "fl", for instance). If a user searches for some 
text in a document, it's probably better to search using the 
compatibility normalization so that "ﬂower" (with ligature) and 
"flower" (without ligature) can match each other. If you want to search 
case-insensitively, then you'll need to implement the collation 
algorithm, but that's going further afield.
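
The difference between the two equivalences can be sketched in D, assuming a current Phobos std.uni with normalize and the NFC/NFKC forms. The function names canonicallyEqual and compatEqual are made up for this example.

```d
import std.stdio : writeln;
import std.uni;

/// Hypothetical helper: equal under canonical equivalence (compare NFC forms).
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

/// Hypothetical helper: equal under compatibility equivalence (compare NFKC forms).
bool compatEqual(string a, string b)
{
    return normalize!NFKC(a) == normalize!NFKC(b);
}

void main()
{
    // Precomposed é versus e + COMBINING ACUTE ACCENT
    writeln("\u00E9" == "e\u0301");                     // false: code units differ
    writeln(canonicallyEqual("\u00E9", "e\u0301"));     // true

    // U+FB02 LATIN SMALL LIGATURE FL versus plain "fl"
    writeln(canonicallyEqual("\uFB02ower", "flower"));  // false
    writeln(compatEqual("\uFB02ower", "flower"));       // true
}
```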

If you're wondering which direction to take, this official FAQ seems 
like a good resource (especially the first few questions):
<http://www.unicode.org/faq/normalization.html>

One important thing to note is that most of the time, strings come 
already in the normalized pre-composed form. So the normalization 
algorithm should be optimized for the case it has nothing to do. That's 
what is said in section 1.3 Description of the Normalization Algorithm 
in the specification: 
<http://www.unicode.org/reports/tr15/#Description_Norm>.
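
One way to picture the "nothing to do" fast path is the D sketch below. It rests on an assumption worth stating loudly: that text made only of code points below U+0300 is already in NFC, which is how I read the quick-check data in the specification. The wrapper name toNFCFast is hypothetical, and normalize comes from a current Phobos std.uni.

```d
import std.stdio : writeln;
import std.uni;

/// Hypothetical wrapper: returns `s` itself when a cheap scan finds nothing
/// that could need composing or reordering, and normalizes otherwise.
/// Assumption: code points below U+0300 never change under NFC.
string toNFCFast(string s)
{
    foreach (dchar c; s)
    {
        if (c >= 0x0300)              // first code point that might need work
            return normalize!NFC(s);
    }
    return s;                         // same slice back: no copy, no allocation
}

void main()
{
    string plain = "d\u00E9j\u00E0 vu";        // precomposed, Latin-1 range only
    writeln(toNFCFast(plain) is plain);        // true: returned untouched
    writeln(toNFCFast("e\u0301") == "\u00E9"); // true: slow path composed it
}
```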


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 14 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/14/11 7:50 AM, Michel Fortin wrote:
 On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 1/13/11 7:09 PM, Michel Fortin wrote:
 That's forgetting that most of the time people care about graphemes
 (user-perceived characters), not code points.

 I'm not so sure about that. What do you base this assessment on? Denis
 wrote a library that according to him does grapheme-related stuff
 nobody else does. So apparently graphemes is not what people care
 about (although it might be what they should care about).

 Apple implemented all these things in the NSString class in Cocoa. They
 did all this work on Unicode at the beginning of Mac OS X, at a time
 where making such changes wouldn't break anything.

 It's a hard thing to change later when you have code that depend on the
 old behaviour. It's a complicated matter and not so many people will
 understand the issues, so it's no wonder many languages just deal with
 code points.

That's a strong indicator, but we shouldn't get ahead of ourselves.

D took a certain risk by defaulting to Unicode at a time where the 
dominant extant systems languages left the decision to more or less 
exotic libraries, Java used UTF16 de jure but UCS2 de facto, and other 
languages were just starting to adopt Unicode.

I think that risk was justified because the relative loss in speed was 
often acceptable and the gains were there. Even so, there are people in 
this who protest against the loss in efficiency and argue that life is 
harder for ASCII users.

Switching to variable-length representation of graphemes as bundles of 
dchars and committing to that through and through will bring with it a 
larger hit in efficiency and an increased difficulty in usage. I agree 
that at some level that's the "right" thing to do, but I don't yet have the 
feeling that combining characters are a widely-adopted winner. For the 
most part, fonts don't support combining characters, and as a font 
dilettante I can tell that putting arbitrary sets of diacritics on top 
of characters is not what one should be doing as it'll look terrible. 
Unicode is begrudgingly acknowledging combining characters. Only a 
handful of libraries deal with them. I don't know how many applications 
need or care for them, versus how many applications do fine with 
precombined characters. I have trouble getting combining characters to 
combine on this machine in any of the applications I use - and this is a 
Mac.


Andrei

Jan 14 2011

foobar <foo bar.com> writes:

Andrei Alexandrescu Wrote:

 That's a strong indicator, but we shouldn't get ahead of ourselves.
 
 D took a certain risk by defaulting to Unicode at a time where the 
 dominant extant systems languages left the decision to more or less 
 exotic libraries, Java used UTF16 de jure but UCS2 de facto, and other 
 languages were just starting to adopt Unicode.
 
 I think that risk was justified because the relative loss in speed was 
 often acceptable and the gains were there. Even so, there are people in 
 this who protest against the loss in efficiency and argue that life is 
 harder for ASCII users.
 
 Switching to variable-length representation of graphemes as bundles of 
 dchars and committing to that through and through will bring with it a 
 larger hit in efficiency and an increased difficulty in usage. I agree 
 that at a level that's the "right" thing to do, but I don't have yet the 
 feeling that combining characters are a widely-adopted winner. For the 
 most part, fonts don't support combining characters, and as a font 
 dilettante I can tell that putting arbitrary sets of diacritics on top 
 of characters is not what one should be doing as it'll look terrible. 
 Unicode is begrudgingly acknowledging combining characters. Only a 
 handful of libraries deal with them. I don't know how many applications 
 need or care for them, versus how many applications do fine with 
 precombined characters. I have trouble getting combining characters to 
 combine on this machine in any of the applications I use - and this is a 
 Mac.
 
 
 Andrei

Combining marks do need to be supported.
Some languages use combining marks extensively (see my other post) and of
course fonts for those languages exist and they do support this. Mac doesn't
support all languages so I'm unsure if it's the best example out there. 
Here's an example of the Hebrew bible: 
http://www.scripture4all.org/OnlineInterlinear/Hebrew_Index.htm

Just look at any of the PDFs there to see what Hebrew looks like with all
sorts of different marks. 
In the same vein I could have found a Japanese text with ruby (where a Kanji
letter has on top of it Hiragana text that tells you how to read it)

Using a dchar as a string element instead of a proper grapheme will make it
really hard to work with texts in such languages. 

Regarding efficiency concerns for ASCII users - there's no rule that forces us
to have a single string type,  just look for comparison at how many integral
types D has. I believe that the correct thing is to have a 'universal string'
type be the default (just like int is for integral types) and provide
additional types for other commonly useful encodings such as ASCII.

A geneticist for instance should use a 'DNA' type that encodes the four DNA
letters instead of an ASCII string or even worse, a universal (Unicode) string.
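
As a rough sketch of what such a dedicated type could look like in D: four letters need only two bits each, so four of them fit in one byte. Base and Dna are hypothetical names invented here; nothing like them exists in Phobos as far as I know.

```d
import std.stdio : writeln;

/// Hypothetical nucleotide type: four values, so two bits suffice.
enum Base : ubyte { A, C, G, T }

/// Sketch of a packed DNA sequence storing four bases per byte.
struct Dna
{
    private ubyte[] data;
    private size_t count;

    /// Appends one base, starting a new byte every fourth element.
    void put(Base b)
    {
        if (count % 4 == 0)
            data ~= 0;
        data[$ - 1] |= cast(ubyte)(b << ((count % 4) * 2));
        ++count;
    }

    /// Reads back the base stored at position `i`.
    Base opIndex(size_t i) const
    {
        assert(i < count);
        return cast(Base)((data[i / 4] >> ((i % 4) * 2)) & 0b11);
    }

    size_t length() const { return count; }
}

void main()
{
    Dna d;
    foreach (b; [Base.G, Base.A, Base.T, Base.T, Base.A, Base.C])
        d.put(b);
    writeln(d.length);  // 6 bases, held in 2 bytes
    writeln(d[2]);      // T
}
```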

Jan 14 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-14 18:02:32 -0500, foobar <foo bar.com> said:

 Combining marks do need to be supported.
 Some languages use combining marks extensively (see my other post) and 
 of course font for those languages exist and they do support this. Mac 
 doesn't support all languages so I'm unsure if it's the best example 
 out there.
 here's an example of the Hebrew bible:
 http://www.scripture4all.org/OnlineInterlinear/Hebrew_Index.htm
 
 Just look at the any of the PDFs there to see how Hebrew looks like 
 with all sorts of different marks.

That's a good example. Although my attempt to extract the text from the 
PDF wasn't perfect, I can confirm that the marks I got in the 
copy-pasted text are indeed combining code points, not pre-combined 
ones.

This character for instance has a combining mark: "יָ"; and it can't be 
represented by a pre-combined code point because there is no 
pre-combined form for it (or at least I couldn't find one). Some Hebrew 
characters have a pre-combined form for the middle dot and some other 
marks, presumably the most common ones, but it was clearly insufficient 
for this text.
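
This can be checked with a short D sketch, assuming a current Phobos std.uni. The two helper names are invented for the example; the input is HEBREW LETTER YOD (U+05D9) followed by HEBREW POINT QAMATS (U+05B8). Composing normalization leaves both code points in place, yet the pair still counts as one grapheme.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni;

/// Hypothetical helper: number of code points left after NFC composition.
size_t codePointsAfterNFC(string s)
{
    return normalize!NFC(s).walkLength;  // walkLength on a string counts code points
}

/// Hypothetical helper: number of user-perceived characters.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string yodQamats = "\u05D9\u05B8";
    writeln(codePointsAfterNFC(yodQamats));  // 2: nothing to compose into
    writeln(graphemeCount(yodQamats));       // 1: still a single character to the reader
}
```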


 In the same vein I could have found a Japanese text with ruby (where a 
 Kanji letter has on top of it Hiragana text that tells you how to read 
 it)

Are you sure those are combining code points? I thought ruby was a 
layout feature, not something part of Unicode. And I can't find 
combining code points that would match those.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 14 2011

foobar <foo bar.com> writes:

Michel Fortin Wrote:

 In the same vein I could have found a Japanese text with ruby (where a 
 Kanji letter has on top of it Hiragana text that tells you how to read 
 it)

 
 Are you sure those are combining code points? I thought ruby was a 
 layout feature, not something part of Unicode. And I can't find 
 combining code points that would match those.
 

I've looked into this and I was wrong. Ruby is a layout feature as you said. 
Sorry for the confusion. 

 
 -- 
 Michel Fortin
 michel.fortin michelf.com
 http://michelf.com/

Jan 15 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-14 17:04:08 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/14/11 7:50 AM, Michel Fortin wrote:
 On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 
 On 1/13/11 7:09 PM, Michel Fortin wrote:
 That's forgetting that most of the time people care about graphemes
 (user-perceived characters), not code points.

 
 I'm not so sure about that. What do you base this assessment on? Denis
 wrote a library that according to him does grapheme-related stuff
 nobody else does. So apparently graphemes is not what people care
 about (although it might be what they should care about).

 
 Apple implemented all these things in the NSString class in Cocoa. They
 did all this work on Unicode at the beginning of Mac OS X, at a time
 where making such changes wouldn't break anything.
 
 It's a hard thing to change later when you have code that depend on the
 old behaviour. It's a complicated matter and not so many people will
 understand the issues, so it's no wonder many languages just deal with
 code points.

 
 That's a strong indicator, but we shouldn't get ahead of ourselves.
 
 D took a certain risk by defaulting to Unicode at a time where the 
 dominant extant systems languages left the decision to more or less 
 exotic libraries, Java used UTF16 de jure but UCS2 de facto, and other 
 languages were just starting to adopt Unicode.
 
 I think that risk was justified because the relative loss in speed was 
 often acceptable and the gains were there. Even so, there are people in 
 this who protest against the loss in efficiency and argue that life is 
 harder for ASCII users.

Then perhaps it's time we find out a way to handle non-Unicode 
encodings too. We can get away treating ASCII strings as Unicode 
strings because of a useful property of UTF-8, but should we really do 
this?

Also, it'd really help this discussion to have some hard numbers about 
the cost of decoding graphemes.


 Switching to variable-length representation of graphemes as bundles of 
 dchars and committing to that through and through will bring with it a 
 larger hit in efficiency and an increased difficulty in usage. I agree 
 that at a level that's the "right" thing to do, but I don't have yet 
 the feeling that combining characters are a widely-adopted winner. For 
 the most part, fonts don't support combining characters, and as a font 
 dilettante I can tell that putting arbitrary sets of diacritics on top 
 of characters is not what one should be doing as it'll look terrible. 
 Unicode is begrudgingly acknowledging combining characters. Only a 
 handful of libraries deal with them. I don't know how many applications 
 need or care for them, versus how many applications do fine with 
 precombined characters. I have trouble getting combining characters to 
 combine on this machine in any of the applications I use - and this is 
 a Mac.

I'm using the character palette: Edit menu > Special Characters... from 
there you can insert arbitrary code points. Use the search function of 
the palette to get code points with "combining" in their names, then 
click the big character box on the lower left to insert them. Have fun!


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 14 2011

spir <denis.spir gmail.com> writes:

On 01/15/2011 12:21 AM, Michel Fortin wrote:
 Also, it'd really help this discussion to have some hard numbers about
 the cost of decoding graphemes.

Text has a perf module that provides such numbers (on different stages 
of Text object construction). The measured algos are not yet 
stabilised, so said numbers regularly change, but in the right 
direction ;-)
You can try the current version at 
https://bitbucket.org/denispir/denispir-d/src (the perf module is called 
chrono.d)

For information, recently, the cost of full text construction (decoding, 
normalisation (both decomp & ordering), piling) was about 5 times 
decoding alone, the heavy part (~ 70%) being piling. But Stephan just 
informed me about a new gain in piling I have not yet tested.
This performance places our library in-between Windows native tools and 
ICU in terms of speed. Which is imo rather good for a brand new tool 
written in a still unstable language.
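
For readers who want to get numbers of their own, here is a minimal timing sketch in D. It does not use the Text library or chrono.d; it assumes a current Phobos (std.datetime.stopwatch.benchmark and std.uni.byGrapheme) and only compares plain decoding with grapheme segmentation on a made-up sample. The helper names and the sample text are invented for this illustration, and no particular ratio is claimed.

```d
import std.array : replicate;
import std.datetime.stopwatch : benchmark;
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;

/// Decoding only: counts code points.
size_t countCodePoints(string s) { return s.walkLength; }

/// Decoding plus grapheme segmentation: counts graphemes.
size_t countGraphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // Decomposed sample text, repeated to get a measurable workload.
    auto text = replicate("Ha\u0300n The\u0302\u0301 Tha\u0300nh ", 10_000);

    size_t sink;  // accumulates results so the calls are not optimized away
    void decodeOnly() { sink += countCodePoints(text); }
    void graphemes()  { sink += countGraphemes(text); }

    auto times = benchmark!(decodeOnly, graphemes)(10);
    writefln("decoding:  %s", times[0]);
    writefln("graphemes: %s", times[1]);
    writefln("ratio:     %.1f",
             cast(double) times[1].total!"hnsecs" / times[0].total!"hnsecs");
    writefln("(checksum %s)", sink);
}
```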

I have carefully read your arguments that Text's approach of 
systematically "piling" and normalising source texts is not the right 
one from an efficiency point of view, even for strict use cases of 
universal text manipulation (because the relative space cost would 
indirectly cause time cost due to cache effects). Instead, you state we 
should "pile" and/or normalise on the fly. But I am, similarly to you, 
rather doubtful on this point without any numbers available.
So, let us produce some benchmark results on both approaches if you like.


Denis
_________________
vita es estrany
spir.wikidot.com

Jan 17 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/17/11 10:55 AM, spir wrote:
 On 01/15/2011 12:21 AM, Michel Fortin wrote:
 Also, it'd really help this discussion to have some hard numbers about
 the cost of decoding graphemes.

 Text has a perf module that provides such numbers (on different stages
 of Text object construction) (but the measured algos are not yet
 stabilised, so that said numbers regularly change, but in the right
 sense ;-)
 You can try the current version at
 https://bitbucket.org/denispir/denispir-d/src (the perf module is called
 chrono.d)

 For information, recently, the cost of full text construction: decoding,
 normalisation (both decomp & ordering), piling, was about 5 times
 decoding alone. The heavy part (~ 70%) beeing piling. But Stephan just
 informed me about a new gain in piling I have not yet tested.
 This performance places our library in-between Windows native tools and
 ICU in terms of speed. Which is imo rather good for a brand new tool
 written in a still unstable language.

 I have carefully read your arguments on Text's approach to
 systematically "pile" and normalise source texts not being the right
 one from an efficiency point of view. Even for strict use cases of
 universal text manipulation (because the relative space cost would
 indirectly cause time cost due to cache effects). Instead, you state we
 should "pile" and/or normalise on the fly. But I am, similarly to you,
 rather doubtful on this point without any numbers available.
 So, let us produce some benchmark results on both approaches if you like.

Congrats on this great work. The initial numbers are in keeping with my 
expectation; UTF adds for certain primitives up to 3x overhead compared 
to ASCII, and I expect combining character handling to bring about as 
much on top of that.

Your work and Steve's won't go to waste; one way or another we need to 
add grapheme-based processing to D. I think it would be great if later 
on a Phobos submission was made.


Andrei

Jan 17 2011

spir <denis.spir gmail.com> writes:

On 01/17/2011 06:36 PM, Andrei Alexandrescu wrote:
 On 1/17/11 10:55 AM, spir wrote:
 On 01/15/2011 12:21 AM, Michel Fortin wrote:
 Also, it'd really help this discussion to have some hard numbers about
 the cost of decoding graphemes.

 Text has a perf module that provides such numbers (on different stages
 of Text object construction) (but the measured algos are not yet
 stabilised, so that said numbers regularly change, but in the right
 sense ;-)
 You can try the current version at
 https://bitbucket.org/denispir/denispir-d/src (the perf module is called
 chrono.d)

 For information, recently, the cost of full text construction: decoding,
 normalisation (both decomp & ordering), piling, was about 5 times
 decoding alone. The heavy part (~ 70%) being piling. But Stephan just
 informed me about a new gain in piling I have not yet tested.
 This performance places our library in-between Windows native tools and
 ICU in terms of speed. Which is imo rather good for a brand new tool
 written in a still unstable language.

 I have carefully read your arguments on Text's approach to
 systematically "pile" and normalise source texts not being the right
 one from an efficiency point of view. Even for strict use cases of
 universal text manipulation (because the relative space cost would
 indirectly cause time cost due to cache effects). Instead, you state we
 should "pile" and/or normalise on the fly. But I am, similarly to you,
 rather doubtful on this point without any numbers available.
 So, let us produce some benchmark results on both approaches if you like.

 Congrats on this great work. The initial numbers are in keeping with my
 expectation; UTF adds for certain primitives up to 3x overhead compared
 to ASCII, and I expect combining character handling to bring about as
 much on top of that.

 Your work and Steve's won't go to waste; one way or another we need to
 add grapheme-based processing to D. I think it would be great if later
 on a Phobos submission was made.

Andrei, would you have a look at Text's current state, mainly 
the interface, when you have time for that (no hurry) at 
https://bitbucket.org/denispir/denispir-d/src
It is actually a bit more than just a string type considering true 
characters as natural elements.
* It is a textual type providing a client interface of common text 
manipulation methods similar to ones in common high-level languages.
(including the fact that a character is a singleton string)
* The repo also holds the main module (unicodedata) of Text's sister lib 
(dunicode), providing access to various unicode algos and data.
(We are about to merge the 2 libs into a new repository.)

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 17 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/17/11 12:23 PM, spir wrote:
 Andrei, would you have a look at Text's current state, mainly
 the interface, when you have time for that (no hurry) at
 https://bitbucket.org/denispir/denispir-d/src
 It is actually a bit more than just a string type considering true
 characters as natural elements.
 * It is a textual type providing a client interface of common text
 manipulation methods similar to ones in common high-level languages.
 (including the fact that a character is a singleton string)
 * The repo also holds the main module (unicodedata) of Text's sister lib
 (dunicode), providing access to various unicode algos and data.
 (We are about to merge the 2 libs into a new repository.)

I think this is solid work that reveals good understanding of Unicode. 
That being said, there are a few things I disagree about and I don't 
think it can be integrated into Phobos. One thing is that it looks a lot 
more like D1 code than D2. D2 code of this kind is automatically 
expected to play nice with the rest of Phobos (ranges and algorithms). 
As it is, the code is an island that implements its own algorithms 
(mostly by equivalent handwritten code).

In detail:

* Line 130: representing a text as a dchar[][] has its advantages but 
major efficiency issues. To be frank I think it's a disaster. I think a 
representation building on UTF strings directly is bound to be vastly 
better.

* 163: equality does what std.algorithm.equal does.

* 174: equality also does what std.algorithm.equal does (possibly with a 
custom pred)

* 189: TextException is unnecessary

* 340: Unless properly motivated, iteration with opApply is archaic and 
inefficient.

* 370: Why lose the information that the result is in fact a single Pile?

* 430, 456, 474: contains, indexOf, count and probably others should use 
generic algorithms, not duplicate them.

* 534: replace is std.array.replace

* 623: copy copies the piles shallowly (not sure if that's a problem)

As I mentioned before - why not focus on defining a Grapheme type (what 
you call Pile, but using UTF encoding) and defining a ByGrapheme range 
that iterates a UTF-encoded string by grapheme?
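
To make the suggestion concrete, here is a minimal sketch of iterating a UTF-8 string by grapheme without changing its representation. It assumes a Phobos release that ships std.uni.byGrapheme (later releases do; it was not available when this thread was written). The helper name graphemeCount is made up for the illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper: counts user-perceived characters lazily,
/// without building any intermediate array of piles.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'a'.
    string s = "e\u0301a";
    assert(s.length == 4);          // UTF-8 code units: 1 + 2 + 1
    assert(s.walkLength == 3);      // code points (strings decode to dchar)
    assert(graphemeCount(s) == 2);  // graphemes
}
```

Because byGrapheme is an ordinary range, it composes with the generic algorithms instead of needing handwritten equivalents.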


Andrei

Jan 17 2011

spir <denis.spir gmail.com> writes:

On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
 On 1/17/11 12:23 PM, spir wrote:
 Andrei, would you have a look at Text's current state, mainly
 the interface, when you have time for that (no hurry) at
 https://bitbucket.org/denispir/denispir-d/src
 It is actually a bit more than just a string type considering true
 characters as natural elements.
 * It is a textual type providing a client interface of common text
 manipulation methods similar to ones in common high-level languages.
 (including the fact that a character is a singleton string)
 * The repo also holds the main module (unicodedata) of Text's sister lib
 (dunicode), providing access to various unicode algos and data.
 (We are about to merge the 2 libs into a new repository.)

 I think this is solid work that reveals good understanding of Unicode.
 That being said, there are a few things I disagree about and I don't
 think it can be integrated into Phobos.

We are exploring a new field. (Except for the work Objective-C designers 
did -- but we just discovered it.)

 One thing is that it looks a lot
 more like D1 code than D2. D2 code of this kind is automatically
 expected to play nice with the rest of Phobos (ranges and algorithms).
 As it is, the code is an island that implements its own algorithms
 (mostly by equivalent handwritten code).

Right. We precisely initially wanted to let it play nicely with the rest 
of new Phobos. This meant mainly provide a range interface, which also 
gives access to std.algorithm routines. But we were blocked by current 
bugs related to ranges. I have posted about those issues (you may 
remember having replied to this post).

 In detail:

 * Line 130: representing a text as a dchar[][] has its advantages but
 major efficiency issues. To be frank I think it's a disaster. I think a
 representation building on UTF strings directly is bound to be vastly
 better.

I don't understand your point. Where is the difference with D's builtin 
types, then?

Also, which efficiency issue do you mention? Upon text object 
construction, we do agree and I have given some data. But this happens 
only once; it is an investment intended to provide correctness first, 
and efficiency of _every_ operation on constructed text.
As for the speed of such methods / algorithms operating _correctly_ on 
universal text: precisely since there is no alternative to Text (yet), 
there are also no performance data available to judge by.

(What about comparing Objective-C's NSString to Text's current 
performance for indexing, slicing, searching, counting,...? Even in its 
current experimental stage, I bet it would not be ridiculous, rather the 
opposite. But I may be completely wrong.)

 * 163: equality does what std.algorithm.equal does.

 * 174: equality also does what std.algorithm.equal does (possibly with a
 custom pred)

Right, these are unimportant helper functions at the "pile" level. (Initially 
introduced because builtin "==" showed strange inefficiency in our case. 
I may test again later.)

 * 189: TextException is unnecessary

Agreed.

 * 340: Unless properly motivated, iteration with opApply is archaic and
 inefficient.

See range bug evoked above. opApply is the only workaround AFAIK.
Also, ranges cannot yet provide indexed iteration like
	foreach(i, char ; text) {...}

 * 370: Why lose the information that the result is in fact a single Pile?

I don't know what information loss you mean.

Generally speaking, Pile is more or less an implementation detail used 
to internally represent a true character; while Text is the important thing.
At one time we had to choose whether or not to make Pile an obviously exposed 
type as well. I chose (after some exchange on the topic) not to do it 
for a few reasons:
* Simplicity: one type does all the job well.
* Avoid confusion due to conflict with historic string types whose 
elements (codes=characters) were atomic thingies. This was also a reason 
not to name it simply "Character"; "Pile" for me was supposed to rather 
evoke the technical side than the meaningful side.
* Lightness of the interface: if we expose Pile obviously, then we need 
to double all methods that may take or return a single character, like 
searching, counting, replacing etc... and also possibly indexing and 
iteration.

In fact, the resulting interface is more or less like a string type in 
high-level languages such as Python; with the motivating difference that 
it operates correctly on universal text.

Now, it seems you rather expect, maybe, the character/pile type to be 
the important thing and Text to just be a sequence of them? (possibly 
even unnecessary to be defined formally)

 * 430, 456, 474: contains, indexOf, count and probably others should use
 generic algorithms, not duplicate them.

 * 534: replace is std.array.replace

I had to write algos because most of them in std.algorithm require a 
range interface, IIUC; and also for testing purpose.

 * 623: copy copies the piles shallowly (not sure if that's a problem)

Had the same interrogation.

 As I mentioned before - why not focus on defining a Grapheme type (what
 you call Pile, but using UTF encoding) and defining a ByGrapheme range
 that iterates a UTF-encoded string by grapheme?

Dunno. This simply was not my approach. Seems to me Text as is provides 
clients with an interface as simple and clear as possible, while 
operating correctly in the background.

It seems if you just build a ByGrapheme iterator, then you have no other 
choice than abstracting on the fly (constructing piles on the fly for 
operations like indexing and normalising them in addition for searching, 
counting...).
As I said in other posts, this may be the right thing to do from an 
efficiency point of view, but this remains to be proven. I bet the 
opposite, in fact, that --with same implementation language and same 
investment in optimisation-- the approach defining a true textual type 
like Text is inevitably more efficient by orders of magnitude (*). Again, 
Text construction initial cost is an investment. Prove me wrong (**).

 Andrei

Denis


(*) Except, probably, for the choice of making the ElementType a 
singleton Text (seems costly).
(**) I'm now aware of the high speed loss Text certainly suffers from 
representing characters as mini-arrays, but I guess it is marginally 
relevant compared to the gain of not piling and normalising for every 
operation.
_________________
vita es estrany
spir.wikidot.com

Jan 17 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/17/11 5:13 PM, spir wrote:
 On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
 * Line 130: representing a text as a dchar[][] has its advantages but
 major efficiency issues. To be frank I think it's a disaster. I think a
 representation building on UTF strings directly is bound to be vastly
 better.

 I don't understand your point. Where is the difference with D's builtin
 types, then?

Unfortunately I won't have much time to discuss all these points, but 
this is a simple one: using dchar[][] wastes memory and time. You need 
to build on a flatter representation. Don't confuse the abstraction you 
are building with its underlying representation. The difference between 
your abstraction and char[]/wchar[]/dchar[] (which I strongly recommend 
you to build on) is that the abstractions offer different, higher-level 
primitives that the representation doesn't.
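
A rough sketch of what "abstraction on top of a flat representation" could mean: the text stays one UTF-8 string, and grapheme-level indexing is layered on top through a table of offsets. The type FlatText and its members are hypothetical, and std.uni.graphemeStride is assumed to be available (it is in later Phobos releases, not in the one current for this thread).

```d
import std.uni : graphemeStride;

// Cost of the nested representation: every pile is a slice (pointer +
// length) in addition to 4 bytes per code point.
static assert(dchar.sizeof == 4);
static assert((dchar[]).sizeof == 2 * size_t.sizeof);

/// Hypothetical type: grapheme-level indexing over one flat UTF-8 buffer.
struct FlatText
{
    string data;      // underlying representation: a plain UTF-8 string
    size_t[] starts;  // code-unit offset at which each grapheme begins

    this(string s)
    {
        data = s;
        for (size_t i = 0; i < s.length; i += graphemeStride(s, i))
            starts ~= i;
    }

    /// Number of graphemes.
    size_t length() { return starts.length; }

    /// The i-th grapheme as a slice of `data` (no copy, O(1)).
    string opIndex(size_t i)
    {
        immutable end = i + 1 < starts.length ? starts[i + 1] : data.length;
        return data[starts[i] .. end];
    }
}

void main()
{
    auto t = FlatText("e\u0301a");
    assert(t.length == 2);
    assert(t[0] == "e\u0301");
    assert(t[1] == "a");
}
```

This sketch does no normalisation; it only shows that the higher-level primitives need not dictate the storage layout.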

Let me repeat again: if anyone in this community wants to put work in a 
forward range that iterates one grapheme at a time, that work would be 
very valuable because it will allow us to experiment with graphemes in a 
non-disruptive way while benefiting of a host of algorithms. ByGrapheme 
and friends will help more than defining new string types.


Andrei

Jan 17 2011

spir <denis.spir gmail.com> writes:

On 01/18/2011 03:52 AM, Andrei Alexandrescu wrote:
 On 1/17/11 5:13 PM, spir wrote:
 On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
 * Line 130: representing a text as a dchar[][] has its advantages but
 major efficiency issues. To be frank I think it's a disaster. I think a
 representation building on UTF strings directly is bound to be vastly
 better.

 I don't understand your point. Where is the difference with D's builtin
 types, then?

 Unfortunately I won't have much time to discuss all these points, but
 this is a simple one: using dchar[][] wastes memory and time. You need
 to build on a flatter representation. Don't confuse the abstraction you
 are building with its underlying representation. The difference between
 your abstraction and char[]/wchar[]/dchar[] (which I strongly recommend
 you to build on) is that the abstractions offer different, higher-level
 primitives that the representation doesn't.

I think the following needs repeating: Text in my view (or 
whatever variant solution to work correctly with universal text) is 
_not_ intended as a basic string type, even less as the default.
If programmers can guarantee all their app's input will only ever hold 
single-codepoint characters, _or_ if they just pass pieces of text 
around without manipulation, then such a tool is big overkill.

It has a time cost at Text construction time, which I consider as an 
investment. It has also some space & time cost for operations, which 
should be only slightly relevant compared to the speed offered by the simple 
fact that routines can then operate just (actually nearly) like with historic 
charsets.
Indexing is just normal O(1) indexing, possibly plus producing the 
result. Not O(n) across the source with building piles along the way. 
(1000X slower, 1000000X slower?)
Counting is just O(n) with mini-array compares, not building & 
normalising piles across the whole code sequence. (10X, 100X slower?)

 Let me repeat again: if anyone in this community wants to put work in a
 forward range that iterates one grapheme at a time, that work would be
 very valuable because it will allow us to experiment with graphemes in a
 non-disruptive way while benefiting of a host of algorithms. ByGrapheme
 and friends will help more than defining new string types.

Right. I understand your point-of-view, esp "non-disruptive".
But then, how to avoid the possibly huge inefficiency evoked above? We 
have no true perf numbers yet, right, for any alternative to Text's 
approach. But for this reason we also should not randomly speak of this 
approach's space & time costs. Compared to what?


Denis
_________________
vita es estrany
spir.wikidot.com

Jan 18 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/18/11 7:25 AM, spir wrote:
 On 01/18/2011 03:52 AM, Andrei Alexandrescu wrote:
 On 1/17/11 5:13 PM, spir wrote:
 On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
 * Line 130: representing a text as a dchar[][] has its advantages but
 major efficiency issues. To be frank I think it's a disaster. I think a
 representation building on UTF strings directly is bound to be vastly
 better.

 I don't understand your point. Where is the difference with D's builtin
 types, then?

 Unfortunately I won't have much time to discuss all these points, but
 this is a simple one: using dchar[][] wastes memory and time. You need
 to build on a flatter representation. Don't confuse the abstraction you
 are building with its underlying representation. The difference between
 your abstraction and char[]/wchar[]/dchar[] (which I strongly recommend
 you to build on) is that the abstractions offer different, higher-level
 primitives that the representation doesn't.

 I think it is needed to repeat again the following: Text in my view (or
 whatever variant solution to work correctly with universal text) is
 _not_ intended as a basic string type, even less default.
 If programmers can guarantee all their app's input will ever hold
 single-codepoint characters only, _or_ if they just pass pieces of text
 around without manipulation, then such a tool is big overkill.

 It has a time cost at Text construction time, which I consider as an
 investment. It has also some space & time cost for operations that
 should be only slightly relevant compared to speed offered by the simple
 facts routines can then operate just (actually nearly) like with historic
 charsets.
 Indexing is just normal O(1) indexing, possibly plus producing the
 result. Not O(n) across the source with building piles along the way.
 (1000X slower, 1000000X slower?)
 Counting is just O(n) with mini-array compares, not building &
 normalising piles across the whole code sequence. (10X, 100X slower?)

You don't provide O(n) indexing.

Andrei

Jan 18 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Monday 17 January 2011 15:13:42 spir wrote:
 See range bug evoked above. opApply is the only workaround AFAIK.
 Also, ranges cannot yet provide indexed iteration like
 	foreach(i, char ; text) {...}

While it would be nice at times to be able to have an index with foreach when 
using ranges, I would point out that it's trivial to just declare a variable 
which you increment each iteration, so it's easy to get an index even when 
using foreach with ranges. Certainly, I wouldn't consider the lack of index 
with foreach and ranges a good reason to use opApply instead of ranges. There 
may be other reasons which make it worthwhile, but it's so trivial to get an 
index that the loss of range abilities (particularly the ability to use such 
ranges with std.algorithm) dwarfs it in importance.
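
A small sketch of the counter approach, for any input range. The function name labelled is just an illustration, not something from Text or Phobos.

```d
import std.algorithm : filter;
import std.conv : to;
import std.range : iota;

/// Labels each element of any input range with its position.
string[] labelled(R)(R range)
{
    string[] result;
    size_t i = 0;                       // the hand-maintained index
    foreach (element; range)
    {
        result ~= i.to!string ~ ":" ~ element.to!string;
        ++i;
    }
    return result;
}

void main()
{
    // A lazy, non-array range: the even numbers below 10.
    auto evens = iota(10).filter!(n => n % 2 == 0);
    assert(labelled(evens) == ["0:0", "1:2", "2:4", "3:6", "4:8"]);
}
```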

- Jonathan M Davis

Jan 17 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/17/11 11:48 PM, Jonathan M Davis wrote:
 On Monday 17 January 2011 15:13:42 spir wrote:
 See range bug evoked above. opApply is the only workaround AFAIK.
 Also, ranges cannot yet provide indexed iteration like
 	foreach(i, char ; text) {...}

 While it would be nice at times to be able to have an index with foreach when
 using ranges, I would point out that it's trivial to just declare a variable
 which you increment each iteration, so it's easy to get an index even when
 using foreach with ranges. Certainly, I wouldn't consider the lack of index
 with foreach and ranges a good reason to use opApply instead of ranges. There
 may be other reasons which make it worthwhile, but it's so trivial to get an
 index that the loss of range abilities (particularly the ability to use such
 ranges with std.algorithm) dwarfs it in importance.

 - Jonathan M Davis

It's a bit more difficult than that. When iterating a variable-length 
encoded range, what you need more than the current item being iterated 
is the physical offset reached inside the range. That's not all that 
difficult either as the range can always provide an extra primitive, but 
a bit annoying (e.g. because it makes iteration with foreach impossible 
if you want the index, unless you return a tuple with each step).

At any rate, I agree with two things - one, we need to fix the foreach 
situation. Two, even before we find a fix, at this point committing to 
iteration with opApply essentially commits the iteratee to an island 
where all basic algorithms need to be reinvented from first principles.
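
As a sketch of the "return a tuple with each step" option: a forward-iterating range over a UTF-8 string whose front is a pair of the physical code-unit offset and the decoded code point. The type WithOffsets is hypothetical; std.utf.decode and std.typecons.tuple are the only library calls used.

```d
import std.typecons : Tuple, tuple;
import std.utf : decode;

/// Hypothetical range: yields (code-unit offset, code point) pairs.
struct WithOffsets
{
    string data;
    size_t pos;   // physical offset of the current element

    @property bool empty() { return pos >= data.length; }

    @property Tuple!(size_t, dchar) front()
    {
        size_t next = pos;
        dchar c = decode(data, next);   // advances `next` past one code point
        return tuple(pos, c);
    }

    void popFront()
    {
        size_t next = pos;
        decode(data, next);
        pos = next;
    }
}

void main()
{
    size_t[] offsets;
    dchar[] points;
    foreach (step; WithOffsets("h\u00E9!"))
    {
        offsets ~= step[0];
        points ~= step[1];
    }
    // 'h' is 1 code unit, U+00E9 is 2, so '!' starts at offset 3.
    assert(offsets == [size_t(0), 1, 3]);
    assert(points == "h\u00E9!"d);
}
```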


Andrei

Jan 17 2011

spir <denis.spir gmail.com> writes:

On 01/18/2011 07:11 AM, Andrei Alexandrescu wrote:
 On 1/17/11 11:48 PM, Jonathan M Davis wrote:
 On Monday 17 January 2011 15:13:42 spir wrote:
 See range bug evoked above. opApply is the only workaround AFAIK.
 Also, ranges cannot yet provide indexed iteration like
 foreach(i, char ; text) {...}

 While it would be nice at times to be able to have an index with foreach when
 using ranges, I would point out that it's trivial to just declare a variable
 which you increment each iteration, so it's easy to get an index even when
 using foreach with ranges. Certainly, I wouldn't consider the lack of index
 with foreach and ranges a good reason to use opApply instead of ranges. There
 may be other reasons which make it worthwhile, but it's so trivial to get an
 index that the loss of range abilities (particularly the ability to use such
 ranges with std.algorithm) dwarfs it in importance.

 - Jonathan M Davis

 It's a bit more difficult than that. When iterating a variable-length
 encoded range, what you need more than the current item being iterated
 is the physical offset reached inside the range. That's not all that
 difficult either as the range can always provide an extra primitive, but
 a bit annoying (e.g. because it makes iteration with foreach impossible
 if you want the index, unless you return a tuple with each step).

This is a very valid point: a range's logical offset is not necessarily 
equal to its physical (hum) offset, even on a plain sequence.
But for the case of Text it is in fact, precisely because codepoints 
have been grouped in "piles" each representing a true character 
(grapheme). This is actually one third of the purpose of Text (the 
others being to ensure unique representation of each character, and to 
provide users with a clear interface).
Thus, Jonathan's point simply applies to Text.

 At any rate, I agree with two things - one, we need to fix the foreach
 situation. Two, even before we find a fix, at this point committing to
 iteration with opApply essentially commits the iteratee to an island
 where all basic algorithms need to be reinvented from first principles.

I agree. The situation would be different if D had not proposed indexed 
iteration already, and programmers would routinely count manually and/or 
call an extra range primitive, as you say.

As for using opApply: it works fine nevertheless, at least for a first 
rough implementation like in the case of Text.
Reinventing basic algos is not an issue at this stage, as long as they 
are simple enough, and mainly for testing. (Actually, it can be an 
advantage in avoiding integration issues, possibly due to D's current 
beta stage --I mean bugs that pop up only when combining given features-- 
like we had e.g. with range & formatValue).

 Andrei

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 18 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 18 Jan 2011 01:11:04 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/17/11 11:48 PM, Jonathan M Davis wrote:
 On Monday 17 January 2011 15:13:42 spir wrote:
 See range bug evoked above. opApply is the only workaround AFAIK.
 Also, ranges cannot yet provide indexed iteration like
 	foreach(i, char ; text) {...}

 While it would be nice at times to be able to have an index with foreach when
 using ranges, I would point out that it's trivial to just declare a variable
 which you increment each iteration, so it's easy to get an index even when
 using foreach with ranges. Certainly, I wouldn't consider the lack of index
 with foreach and ranges a good reason to use opApply instead of ranges. There
 may be other reasons which make it worthwhile, but it's so trivial to get an
 index that the loss of range abilities (particularly the ability to use such
 ranges with std.algorithm) dwarfs it in importance.

 - Jonathan M Davis

 It's a bit more difficult than that. When iterating a variable-length  
 encoded range, what you need more than the current item being iterated  
 is the physical offset reached inside the range. That's not all that  
 difficult either as the range can always provide an extra primitive, but  
 a bit annoying (e.g. because it makes iteration with foreach impossible  
 if you want the index, unless you return a tuple with each step).

 At any rate, I agree with two things - one, we need to fix the foreach  
 situation. Two, even before we find a fix, at this point committing to  
 iteration with opApply essentially commits the iteratee to an island  
 where all basic algorithms need to be reinvented from first principles.

opApply in no way disables the range interface.  It simply is used for  
foreach.  So the only "algorithm" which is different is foreach.  If you  
use the range primitives, opApply is nowhere to be found.

That being said, we have an annoying situation in all this.  opApply  
cannot be used to foreach using indexes *and* ranges are used to foreach  
elements.  If one opApply is found, the compiler gives up on using the  
range functions for foreach (this is reflected in my most recent string_t  
code).  This means you will have to implement a "wrapper" opApply around  
the range primitives in order to also implement indexing foreach.
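
A sketch of that wrapper, using a made-up container (Numbers is not from string_t or Text). The range primitives stay available to generic algorithms, and both opApply overloads simply loop over them.

```d
import std.algorithm : equal;

/// Hypothetical container exposing range primitives plus foreach-with-index.
struct Numbers
{
    int[] data;

    // Range primitives: what generic algorithms use.
    @property bool empty() { return data.length == 0; }
    @property int front() { return data[0]; }
    void popFront() { data = data[1 .. $]; }

    // Element-only foreach, written out because foreach stops considering
    // the range primitives once any opApply is present.
    int opApply(scope int delegate(int) dg)
    {
        for (auto r = this; !r.empty; r.popFront())
            if (auto result = dg(r.front))
                return result;
        return 0;
    }

    // Indexed foreach: foreach (i, x; numbers)
    int opApply(scope int delegate(size_t, int) dg)
    {
        size_t i = 0;
        for (auto r = this; !r.empty; r.popFront(), ++i)
            if (auto result = dg(i, r.front))
                return result;
        return 0;
    }
}

void main()
{
    auto n = Numbers([10, 20, 30]);

    size_t[] idx;
    int[] vals;
    foreach (i, x; n) { idx ~= i; vals ~= x; }
    assert(idx == [size_t(0), 1, 2]);
    assert(vals == [10, 20, 30]);

    // The range interface is untouched by the opApply overloads.
    assert(equal(n, [10, 20, 30]));
}
```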

-Steve

Jan 19 2011

spir <denis.spir gmail.com> writes:

On 01/18/2011 06:48 AM, Jonathan M Davis wrote:
 On Monday 17 January 2011 15:13:42 spir wrote:
 See range bug evoked above. opApply is the only workaround AFAIK.
 Also, ranges cannot yet provide indexed iteration like
 	foreach(i, char ; text) {...}

 While it would be nice at times to be able to have an index with foreach when
 using ranges, I would point out that it's trivial to just declare a variable
 which you increment each iteration, so it's easy to get an index even when
 using foreach with ranges. Certainly, I wouldn't consider the lack of index
 with foreach and ranges a good reason to use opApply instead of ranges. There
 may be other reasons which make it worthwhile, but it's so trivial to get an
 index that the loss of range abilities (particularly the ability to use such
 ranges with std.algorithm) dwarfs it in importance.

You are right. I fully agree, in fact. On the other hand, think of the 
expectations of users of a library providing iteration on "naturally" 
sequential thingies. The point is that D makes indexed iteration 
available elsewhere.

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 18 2011

spir <denis.spir gmail.com> writes:

On 01/14/2011 04:50 PM, Michel Fortin wrote:
 This might be a good time to see whether we need to address graphemes
 systematically. Could you please post a few links that would educate
 me and others in the mysteries of combining characters?

 As usual, Wikipedia offers a good summary and a couple of references.
 Here's the part about combining characters:
 <http://en.wikipedia.org/wiki/Combining_character>.

 There's basically four ranges of code points which are combining:
 - Combining Diacritical Marks (0300–036F)
 - Combining Diacritical Marks Supplement (1DC0–1DFF)
 - Combining Diacritical Marks for Symbols (20D0–20FF)
 - Combining Half Marks (FE20–FE2F)

 A code point followed by one or more code points in these ranges is
 conceptually a single character (a grapheme).

Unfortunately, things are complicated by _prepend_ combining marks that 
occur in a code sequence _before_ the base character.
The Unicode algorithm is described here: 
http://unicode.org/reports/tr29/ section 3 (humanly readable ;-). See 
esp. the first table in section 3.1.
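
For reference, the simple rule quoted above (a code point followed by marks from the four listed blocks) can be sketched as below. This is deliberately naive: it ignores prepend marks and the rest of the TR29 rules, and the function names are made up for the illustration.

```d
/// True if `c` lies in one of the four combining-mark blocks listed above.
bool inCombiningBlock(dchar c)
{
    return (c >= 0x0300 && c <= 0x036F)    // Combining Diacritical Marks
        || (c >= 0x1DC0 && c <= 0x1DFF)    // ... Supplement
        || (c >= 0x20D0 && c <= 0x20FF)    // ... for Symbols
        || (c >= 0xFE20 && c <= 0xFE2F);   // Combining Half Marks
}

/// Naive grouping: each pile is a code point plus any directly following
/// code points from those blocks.
dstring[] naivePiles(dstring text)
{
    dstring[] piles;
    size_t start = 0;
    foreach (i, c; text)
    {
        if (i > 0 && !inCombiningBlock(c))
        {
            piles ~= text[start .. i];
            start = i;
        }
    }
    if (text.length > 0)
        piles ~= text[start .. $];
    return piles;
}

void main()
{
    assert(naivePiles("e\u0301a"d) == ["e\u0301"d, "a"d]);
    assert(naivePiles("ab"d) == ["a"d, "b"d]);
    assert(naivePiles(""d).length == 0);
}
```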

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 17 2011

spir <denis.spir gmail.com> writes:

On 01/11/2011 02:30 PM, Steven Schveighoffer wrote:
 On Mon, 10 Jan 2011 22:57:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 I've been thinking on how to better deal with Unicode strings.
 Currently strings are formally bidirectional ranges with a
 surreptitious random access interface. The random access interface
 accesses the support of the string, which is understood to hold data
 in a variable-encoded format. For as long as the programmer
 understands this relationship, code for string manipulation can be
 written with relative ease. However, there is still room for writing
 wrong code that looks legit.

 Sometimes the best way to tackle a hairy reality is to invite it to
 the negotiation table and offer it promotion to first-class
 abstraction status. Along that vein I was thinking of defining a new
 range: VLERange, i.e. Variable Length Encoding Range. Such a range
 would have the power somewhere in between bidirectional and random
 access.

 The primitives offered would include empty, access to front and back,
 popFront and popBack (just like BidirectionalRange), and in addition
 properties typical of random access ranges: indexing, slicing, and
 length. Note that the result of the indexing operator is not the same
 as the element type of the range, as it only represents the unit of
 encoding.

 In addition to these (and connecting the two), a VLERange would offer
 two additional primitives:

 1. size_t stepSize(size_t offset) gives the length of the step needed
 to skip to the next element.

 2. size_t backstepSize(size_t offset) gives the size of the _backward_
 step that goes to the previous element.

 In both cases, offset is assumed to be at the beginning of a logical
 element of the range.

 I suspect that a lot of functions in std.string can be written without
 Unicode-specific knowledge just by relying on such an interface.
 Moreover, algorithms can be generalized to other structures that use
 variable-length encoding, such as those used in data compression. (In
 that case, the support would be a bit array and the encoded type would
 be ubyte.)

 Writing to such ranges is not addressed by this design. Ideas are
 welcome.

 Adding VLERange would legitimize strings and would clarify their
 handling, at the cost of adding one additional concept that needs to
 be minded. Is the trade-off worthwhile?

 While this makes it possible to write algorithms that only accept
 VLERanges, I don't think it solves the major problem with strings --
 they are treated as arrays by the compiler.

 I'd also rather see an indexing operation return the element type, and
 have a separate function to get the encoding unit. This makes more sense
 for generic code IMO.

 I noticed you never commented on my proposed string type...

 That reminds me, I should update with suggested changes and re-post it.

People interested in solving the general problem with Unicode strings 
may have a look at https://bitbucket.org/denispir/denispir-d. All 
constructive feedback welcome.
(This will be asked for review in a short while. The main / client 
interface module is Text.d. A (long) presentation of the issues, 
reasons, solution can be found in the text called "U missing level of 
abstraction")

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 11 2011

Tomek Sowiński <just ask.me> writes:

Andrei Alexandrescu wrote:

 I've been thinking on how to better deal with Unicode strings. Currently
 strings are formally bidirectional ranges with a surreptitious random
 access interface. The random access interface accesses the support of
 the string, which is understood to hold data in a variable-encoded
 format. For as long as the programmer understands this relationship,
 code for string manipulation can be written with relative ease. However,
 there is still room for writing wrong code that looks legit.

 Sometimes the best way to tackle a hairy reality is to invite it to the
 negotiation table and offer it promotion to first-class abstraction
 status. Along that vein I was thinking of defining a new range:
 VLERange, i.e. Variable Length Encoding Range. Such a range would have
 the power somewhere in between bidirectional and random access.

 The primitives offered would include empty, access to front and back,
 popFront and popBack (just like BidirectionalRange), and in addition
 properties typical of random access ranges: indexing, slicing, and
 length.

For some compressions implementing *back is troublesome if not impossible...

 Note that the result of the indexing operator is not the same as
 the element type of the range, as it only represents the unit of encoding.

It's worth mentioning explicitly -- a VLERange is dually typed. That matters
for searching. Statically check whether the original and encoded types match;
if so, perform a fast search directly on the encoded elements. I think an
important feature of a VLERange should be dropping itself down to an
encoded-typed range, so that front and back return raw data.

Dual typing will also affect foreach -- in the general case you'd want to
choose whether to decode or not by typing the element.

I can't stop thinking that VLERange is a two-piece bikini making a bare
random-access range safe to look at, and that you can take off when partners
have confidence, not a limited random-access probing facility to span the
void between front and back.

 In addition to these (and connecting the two), a VLERange would offer
 two additional primitives:

 1. size_t stepSize(size_t offset) gives the length of the step needed to
 skip to the next element.

 2. size_t backstepSize(size_t offset) gives the size of the _backward_
 step that goes to the previous element.

 In both cases, offset is assumed to be at the beginning of a logical
 element of the range.

So when I move the spinner in an iPod, I get catapulted into position with
the raw data opIndex and from there I try to work my way to the next frame
to start playback. Sounds promising.
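
For the UTF-8 case, the two proposed primitives can be roughly sketched on top of std.utf.stride and std.utf.strideBack. This is only an illustration under assumptions: stepSize and backstepSize below are hypothetical free functions taking the string explicitly, not members of any existing range, and reading the backward offset as "just past the element to step over" is one possible interpretation of the proposal.

```d
import std.utf : stride, strideBack;

// Hypothetical helper: number of code units to skip from `offset`
// (which must sit at the start of a code point) to the next one.
size_t stepSize(string s, size_t offset)
{
    return stride(s, offset);
}

// Hypothetical helper: number of code units to step back from `offset`
// (which must sit just past a code point) to reach its start.
size_t backstepSize(string s, size_t offset)
{
    return strideBack(s, offset);
}

void main()
{
    string s = "a\u00E9\u20AC"; // 1 + 2 + 3 UTF-8 code units

    assert(stepSize(s, 0) == 1);
    assert(stepSize(s, 1) == 2);
    assert(stepSize(s, 3) == 3);

    assert(backstepSize(s, s.length) == 3);
    assert(backstepSize(s, 3) == 2);
    assert(backstepSize(s, 1) == 1);
}
```

A generic algorithm would only need these two step functions plus raw indexing; nothing in it has to know that the encoding happens to be UTF-8.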

 I suspect that a lot of functions in std.string can be written without
 Unicode-specific knowledge just by relying on such an interface.
 Moreover, algorithms can be generalized to other structures that use
 variable-length encoding, such as those used in data compression. (In
 that case, the support would be a bit array and the encoded type would
 be ubyte.)

I agree, acknowledging encoding/compression as a general direction will
bring substantial benefits.

 Writing to such ranges is not addressed by this design. Ideas are welcome.

Yeah, we can address outputting later, that's fair.

 Adding VLERange would legitimize strings and would clarify their
 handling, at the cost of adding one additional concept that needs to be
 minded. Is the trade-off worthwhile?

Well, the only way to find out is to try it. My advice: VLERanges originated
as a solution to the string problem, so start with a non-string incarnation.
Having at least two (one, we know, is string) plugs that fit the same socket
will spur confidence in the abstraction.

-- 
Tomek

Jan 11 2011

Steven Wawryk <stevenw acres.com.au> writes:

Sorry if I'm jumping in here without the appropriate background, but I 
don't understand why jumping through these hoops is necessary.  Please 
let me know if I'm missing anything.

Many problems can be solved by another layer of indirection.  Isn't a 
string essentially a bidirectional range of code points built on top of 
a random access range of code units?  It seems to me that each 
abstraction separately already fits within the existing D range 
framework and all the difficulties arise as a consequence of trying to 
lump them into a single abstraction.

Why not choose which of these abstractions is most appropriate in a 
given situation instead of trying to shoe-horn both concepts into a 
single abstraction, and provide for easy conversion between them?  When 
character representation is the primary requirement then make it a 
bidirectional range of code points.  When storage representation and 
random access is required then make it a random access range of code units.
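
The two views described above can be illustrated with a short sketch. It assumes a current Phobos in which string is auto-decoded by the range primitives and std.string.representation is available; nothing here is specific to the VLERange proposal.

```d
import std.range : walkLength;
import std.range.primitives : ElementType, isBidirectionalRange,
    isRandomAccessRange;
import std.string : representation;

// Code point view: string itself, decoded on the fly by front/popFront.
static assert(isBidirectionalRange!string);
static assert(!isRandomAccessRange!string);
static assert(is(ElementType!string == dchar));

// Code unit view: the same memory seen as plain bytes.
static assert(isRandomAccessRange!(immutable(ubyte)[]));

void main()
{
    string s = "h\u00E9llo";        // U+00E9 takes two UTF-8 code units
    auto units = s.representation;  // immutable(ubyte)[], no copy

    assert(units.length == 6);      // six code units
    assert(s.walkLength == 5);      // five code points
    assert(units[1] == 0xC3);       // O(1) access to a raw code unit
}
```

Switching between the views costs nothing at run time, which is the "easy conversion" the paragraph above asks for.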

Jan 11 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw acres.com.au> said:

 Sorry if I'm jumping in here without the appropriate background, but I 
 don't understand why jumping through these hoops is necessary.  Please 
 let me know if I'm missing anything.
 
 Many problems can be solved by another layer of indirection.  Isn't a 
 string essentially a bidirectional range of code points built on top of 
 a random access range of code units?

Actually, displaying a UTF-8/UTF-16 string involves a range of 
glyphs layered over a range of graphemes layered over a range of code 
points layered over a range of code units. Glyphs represent the visual 
characters you can get from a font; they often map one-to-one with 
graphemes but not always (ligatures for instance). Graphemes are what 
people generally reason about when they see text (the so-called 
"user-perceived characters"); they often map one-to-one with code 
points but not always (combining marks for instance). Code points are a 
list of standardized codes representing various elements of a string, 
and code units basically encode the code points.

If you're writing an XML, JSON or whatever else parser you'll probably 
care about code points. If you're advancing the insertion point in a 
text field or counting the number of user-perceived characters you'll 
probably want to deal with graphemes. For searching a substring inside 
a string, or comparing strings, you'll probably want to deal with either 
graphemes or collation elements (collation elements are layered on top 
of code points). To print a string you'll need to map graphemes to the 
glyphs from a particular font.

Reducing string operations to code point manipulations will only work 
as long as all your graphemes, collation elements, or glyphs map 
one-to-one with code points.
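
To make the lower three layers concrete, here is a small sketch counting code units, code points and graphemes for the same text. It assumes std.uni.byGrapheme from a current Phobos (not available when this thread was written); Counts and countLevels are hypothetical names used only for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical record holding one count per layer.
struct Counts
{
    size_t codeUnits;
    size_t codePoints;
    size_t graphemes;
}

Counts countLevels(string s)
{
    return Counts(s.length,                 // UTF-8 code units
                  s.walkLength,             // decoded code points
                  s.byGrapheme.walkLength); // user-perceived characters
}

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT
    assert(countLevels("e\u0301") == Counts(3, 2, 1));
    // precomposed U+00E9
    assert(countLevels("\u00E9") == Counts(2, 1, 1));
}
```

Both inputs display as the same character, yet only the grapheme count agrees between them.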


 It seems to me that each abstraction separately already fits within the 
 existing D range framework and all the difficulties arise as a 
 consequence of trying to lump them into a single abstraction.

It's true that each of these abstractions can fit within the existing 
range framework.


 Why not choose which of these abstractions is most appropriate in a 
 given situation instead of trying to shoe-horn both concepts into a 
 single abstraction, and provide for easy conversion between them?  When 
 character representation is the primary requirement then make it a 
 bidirectional range of code points.  When storage representation and 
 random access is required then make it a random access range of code 
 units.

I think you're right. The need for a new concept isn't that great, and 
it gets complicated really fast.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 11 2011

Don <nospam nospam.com> writes:

Michel Fortin wrote:
 On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw acres.com.au> said:
 Why not choose which of these abstractions is most appropriate in a 
 given situation instead of trying to shoe-horn both concepts into a 
 single abstraction, and provide for easy conversion between them?  
 When character representation is the primary requirement then make it 
 a bidirectional range of code points.  When storage representation and 
 random access is required then make it a random access range of code 
 units.

 
 I think you're right. The need for a new concept isn't that great, and 
 it gets complicated really fast.

I think the only problem that we really have, is that "char[]", 
"dchar[]" implies that code points is always the appropriate level of 
abstraction.

Jan 12 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/12/11 11:28 AM, Don wrote:
 Michel Fortin wrote:
 On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw acres.com.au> said:
 Why not choose which of these abstractions is most appropriate in a
 given situation instead of trying to shoe-horn both concepts into a
 single abstraction, and provide for easy conversion between them?
 When character representation is the primary requirement then make it
 a bidirectional range of code points. When storage representation and
 random access is required then make it a random access range of code
 units.

 I think you're right. The need for a new concept isn't that great, and
 it gets complicated really fast.

 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

I hope to assuage part of that issue with representation(). Again, it's 
not documented yet (mainly because of the famous ddoc bug that prevents 
auto functions from carrying documentation). Here it is:

/**
  * Returns the representation type of a string, which is the same type
  * as the string except the character type is replaced by $(D ubyte),
  * $(D ushort), or $(D uint) depending on the character width.
  *
  * Example:
----
string s = "hello";
static assert(is(typeof(representation(s)) == immutable(ubyte)[]));
----
  */
/*private*/ auto representation(Char)(Char[] s) if (isSomeChar!Char)
{
     // Get representation type
     static if (Char.sizeof == 1) enum t = "ubyte";
     else static if (Char.sizeof == 2) enum t = "ushort";
     else static if (Char.sizeof == 4) enum t = "uint";
     else static assert(false); // can't happen due to isSomeChar!Char

     // Get representation qualifier
     static if (is(Char == immutable)) enum q = "immutable";
     else static if (is(Char == const)) enum q = "const";
     else static if (is(Char == shared)) enum q = "shared";
     else enum q = "";

     // Result type is qualifier(RepType)[]
     static if (q.length)
         return mixin("cast(" ~ q ~ "(" ~ t ~ ")[]) s");
     else
         return mixin("cast(" ~ t ~ "[]) s");
}


Andrei

Jan 12 2011

spir <denis.spir gmail.com> writes:

On 01/12/2011 08:28 PM, Don wrote:
 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

I'd like to know when it happens that codepoint is the appropriate level 
of abstraction.
* If pieces of text are not manipulated, meaning just used in the 
application, or just transferred via the application as is (from file / 
input / literal to any kind of output), then any kind of encoding just 
works. One can even concatenate, provided all pieces use the same 
encoding. --> _lower_ level than codepoint is OK.
* But any kind of manipulation (indexing, slicing, compare, search, count, 
replace, not to speak about regex/parsing) requires operating at the 
_higher_ level of characters (in the common sense). Just like with 
historic character sets in which codes used to represent characters (not 
lower-level thingies as in UCS). Else, one reads, compares, changes 
meaningless bits of text.

As I see it now, we need 2 types:
* One plain string similar to good old ones (bytestring would do the 
job, since most unicode is utf8 encoded) for the first kind of use 
above. With optional validity check when it's supposed to be unicode text.
* One higher-level type abstracting from codepoint (not code unit) 
issues, restoring the necessary properties: (1) each character is one 
element in the sequence (2) each character is always represented the 
same way.


Denis
_________________
vita es estrany
spir.wikidot.com

Jan 12 2011

Ali Çehreli <acehreli yahoo.com> writes:

spir wrote:
 On 01/12/2011 08:28 PM, Don wrote:
 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

 I'd like to know when it happens that codepoint is the appropriate level
 of abstraction.

When on a document that describes code points... :)

 * If pieces of text are not manipulated, meaning just used in the
 application, or just transferred via the application as is (from file /
 input / literal to any kind of output), then any kind of encoding just
 works. One can even concatenate, provided all pieces use the same
 encoding. --> _lower_ level than codepoint is OK.
 * But any of manipulation (indexing, slicing, compare,

Compare according to which alphabet's ordering? Surely not Unicode's... 
(I may be alone in this, but ordering is tied to an alphabet (or writing 
system), not locale.)

I try to solve that issue with my trileri library:

   http://code.google.com/p/trileri/source/browse/#svn%2Ftrunk%2Ftr

Warning: the code is in Turkish and is not aware of the concept of 
collation at all; it has its own simplistic view of text, where every 
character is an entity that can be lower/upper cased to a single character.

 search, count,
 replace, not to speak about regex/parsing) requires operating at the
 _higher_ level of characters (in the common sense).

I don't know this about Unicode: should e and ´ (acute accent) be always 
collated? If so, wouldn't it be impossible to put those two in that 
order say, in a text book? (Perhaps Unicode defines a way to stop 
collation.)

 Just like with
 historic character sets in which codes used to represent characters (not
 lower-level thingies as in UCS). Else, one reads, compares, changes
 meaningless bits of text.

 As I see it now, we need 2 types:

I think we need more than 2 types...

 * One plain string similar to good old ones (bytestring would do the
 job, since most unicode is utf8 encoded) for the first kind of use
 above. With optional validity check when it's supposed to be unicode text.

Agreed. D gives us three UTF encodings, but I am not sure that there is 
only one abstraction above that.

 * One hiher-level type abstracting from codepoint (not code unit)
 issues, restoring the necessary properties: (1) each character is one
 element in the sequence (2) each character is always represented the
 same way.

I think VLERange should solve only the variable-length-encoding issue. 
It should not get into higher abstractions.

Ali

Jan 12 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-12 14:57:58 -0500, spir <denis.spir gmail.com> said:

 On 01/12/2011 08:28 PM, Don wrote:
 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

 
 I'd like to know when it happens that codepoint is the appropriate 
 level of abstraction.

I agree with you. I don't see many uses for code points.

One of these uses is writing a parser for a format defined in terms of 
code points (XML for instance). But beyond that, I don't see one.


 * If pieces of text are not manipulated, meaning just used in the 
 application, or just transferred via the application as is (from file / 
 input / literal to any kind of output), then any kind of encoding just 
 works. One can even concatenate, provided all pieces use the same 
 encoding. --> _lower_ level than codepoint is OK.
 * But any of manipulation (indexing, slicing, compare, search, count, 
 replace, not to speak about regex/parsing) requires operating at the 
 _higher_ level of characters (in the common sense). Just like with 
 historic character sets in which codes used to represent characters 
 (not lower-level thingies as in UCS). Else, one reads, compares, 
 changes meaningless bits of text.

Very true. In the same way that code points can span multiple code 
units, user-perceived characters (graphemes) can span multiple code 
points.

A funny exercise to make a fool of an algorithm working only with code 
points would be to replace the word "fortune" in a text containing the 
word "fortuné". If the last "é" is expressed as two code points, as "e" 
followed by a combining acute accent (this: é), replacing occurrences 
of "fortune" by "expose" would also replace "fortuné" with "exposé" 
because the combining acute accent remains as the code point following 
the word. Quite amusing, but it doesn't really make sense that it works 
like that.

In the case of "é", we're lucky enough to also have a pre-combined 
character to encode it as a single code point, so encountering "é" 
written as two code points is quite rare. But not all combinations of 
marks and characters can be represented as a single code point. The 
correct thing to do is to treat "é" (single code point) and "é" ("e" + 
combining acute accent) as equivalent.
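
The exercise can be reproduced in a few lines. This is a minimal sketch using std.array.replace, which matches at the code unit / code point level and knows nothing about combining marks:

```d
import std.array : replace;

void main()
{
    // "fortuné" spelled with "e" + U+0301 COMBINING ACUTE ACCENT
    string text = "fortune\u0301";

    // A code-point-level replace happily matches the first seven
    // code points and leaves the combining accent behind.
    string result = text.replace("fortune", "expose");

    assert(result == "expose\u0301"); // displays as "exposé"
}
```

With the precomposed spelling "fortun\u00E9" the same call leaves the text untouched, so the outcome depends on how the accent happens to be encoded.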

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 12 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-12 19:45:36 -0500, Michel Fortin <michel.fortin michelf.com> said:

 A funny exercise to make a fool of an algorithm working only with code 
 points would be to replace the word "fortune" in a text containing the 
 word "fortuné". If the last "é" is expressed as two code points, as "e" 
 followed by a combining acute accent (this: é), replacing occurrences 
 of "fortune" by "expose" would also replace "fortuné" with "exposé" 
 because the combining acute accent remains as the code point following 
 the word. Quite amusing, but it doesn't really make sense that it works 
 like that.
 
 In the case of "é", we're lucky enough to also have a pre-combined 
 character to encode it as a single code point, so encountering "é" 
 written as two code points is quite rare. But not all combinations of 
 marks and characters can be represented as a single code point. The 
 correct thing to do is to treat "é" (single code point) and "é" ("e" + 
 combining acute accent) as equivalent.

Crap, I meant to send this as UTF-8 with combining characters in it, 
but my news client converted everything to ISO-8859-1.

I'm not sure it'll work, but here's my second attempt at posting real 
combining marks:

	Single code point: é
	e with combining mark: é
	t with combining mark: t̂
	t with two combining marks: t̂̃

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 12 2011

spir <denis.spir gmail.com> writes:

On 01/13/2011 01:51 AM, Michel Fortin wrote:
 On 2011-01-12 19:45:36 -0500, Michel Fortin <michel.fortin michelf.com>
 said:

 A funny exercise to make a fool of an algorithm working only with code
 points would be to replace the word "fortune" in a text containing the
 word "fortuné". If the last "é" is expressed as two code points, as
 "e" followed by a combining acute accent (this: é), replacing
 occurrences of "fortune" by "expose" would also replace "fortuné" with
 "exposé" because the combining acute accent remains as the code point
 following the word. Quite amusing, but it doesn't really make sense
 that it works like that.

 In the case of "é", we're lucky enough to also have a pre-combined
 character to encode it as a single code point, so encountering "é"
 written as two code points is quite rare. But not all combinations of
 marks and characters can be represented as a single code point. The
 correct thing to do is to treat "é" (single code point) and "é" ("e" +
 combining acute accent) as equivalent.

 Crap, I meant to send this as UTF-8 with combining characters in it, but
 my news client converted everything to ISO-8859-1.

 I'm not sure it'll work, but here's my second attempt at posting real
 combining marks:

 Single code point: é
 e with combining mark: é
 t with combining mark: t̂
 t with two combining marks: t̂̃

Works :-) But your first post worked for me as well: for instance <<"é" 
("e" + combining acute accent)>> was displayed "é" as a single accented 
letter. I guess maybe your email client did not convert into iso-8859-1 
on sending, but on reading (mine is set for utf-8).

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 13 2011

spir <denis.spir gmail.com> writes:

On 01/13/2011 01:45 AM, Michel Fortin wrote:
 On 2011-01-12 14:57:58 -0500, spir <denis.spir gmail.com> said:

 On 01/12/2011 08:28 PM, Don wrote:
 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

 I'd like to know when it happens that codepoint is the appropriate
 level of abstraction.

 I agree with you. I don't see many use for code points.

 One of these uses is writing a parser for a format defined in term of
 code points (XML for instance). But beyond that, I don't see one.

Actually, I once had a real use case for codepoint being the proper
level of abstraction: a linguistic app of which one operational func
counts occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you
see what I mean.
Once the text is properly NFD decomposed, each of those marks is coded
as a codepoint. (But if it's not decomposed, then most of those marks
are probably hidden by precomposed codes coding characters like "ä".) So
even such an app benefits from a higher-level type basically
operating on normalised (NFD) characters.

 * If pieces of text are not manipulated, meaning just used in the
 application, or just transferred via the application as is (from file
 / input / literal to any kind of output), then any kind of encoding
 just works. One can even concatenate, provided all pieces use the same
 encoding. --> _lower_ level than codepoint is OK.
 * But any of manipulation (indexing, slicing, compare, search, count,
 replace, not to speak about regex/parsing) requires operating at the
 _higher_ level of characters (in the common sense). Just like with
 historic character sets in which codes used to represent characters
 (not lower-level thingies as in UCS). Else, one reads, compares,
 changes meaningless bits of text.

 Very true. In the same way that code points can span on multiple code
 units, user-perceived characters (graphemes) can span on multiple code
 points.

 A funny exercise to make a fool of an algorithm working only with code
 points would be to replace the word "fortune" in a text containing the
 word "fortuné". If the last "é" is expressed as two code points, as "e"
 followed by a combining acute accent (this: é), replacing occurrences of
 "fortune" by "expose" would also replace "fortuné" with "exposé" because
 the combining acute accent remains as the code point following the word.
 Quite amusing, but it doesn't really make sense that it works like that.

 In the case of "é", we're lucky enough to also have a pre-combined
 character to encode it as a single code point, so encountering "é"
 written as two code points is quite rare. But not all combinations of
 marks and characters can be represented as a single code point. The
 correct thing to do is to treat "é" (single code point) and "é" ("e" +
 combining acute accent) as equivalent.

You'll find another example in the introduction of the text at
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction

About your last remark, this is precisely one of the two abstractions my
Text type provides: it groups together in "piles" the codes that belong to
the same "true" character (grapheme) like "é". So the resulting
text representation is a sequence of "piles", each representing a
character. Consequence: indexing, slicing, etc. work sensibly (and
other operations are even faster, for they do not need to perform that
"piling" again & again).
In addition to that, the string is first NFD-normalised, thus each
character has one and only one representation. Consequence: search,
count, replace, etc., and compare (*) work as expected. In your case:
// 2 forms of "é"
assert(Text("\u00E9") == Text("\u0065\u0301"));
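
The same equivalence can be checked without Text, using only normalisation. The sketch below assumes std.uni.normalize from a current Phobos (it did not exist at the time of this thread); sameText is a hypothetical helper, and it illustrates canonical equivalence only, not the piling that Text performs.

```d
import std.uni : normalize, NFC, NFD;

// Hypothetical helper: canonical equivalence by comparing NFD forms.
bool sameText(string a, string b)
{
    return normalize!NFD(a) == normalize!NFD(b);
}

void main()
{
    // The raw strings differ code unit by code unit...
    assert("\u00E9" != "\u0065\u0301");

    // ...but compare equal once both are brought to the same form.
    assert(sameText("\u00E9", "\u0065\u0301"));

    // Composition goes the other way.
    assert(normalize!NFC("\u0065\u0301") == "\u00E9");
}
```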

Denis

(*) According to UCS coding, not language-specific idiosyncrasies.
More generally, Text abstracts from lower-level issues _introduced_ by
UCS, Unicode's character set. It does not cope with script-, language-,
culture-, domain-, app-specific needs such as custom text sorting
rules. Some base routines for such operations are provided by Text's
brother lib DUnicode (access to some code properties, safe concat,
casefolded compare, NF* normalisation).
_________________
vita es estrany
spir.wikidot.com

Jan 13 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday 13 January 2011 01:49:31 spir wrote:
 On 01/13/2011 01:45 AM, Michel Fortin wrote:
 On 2011-01-12 14:57:58 -0500, spir <denis.spir gmail.com> said:
 On 01/12/2011 08:28 PM, Don wrote:
 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

 I'd like to know when it happens that codepoint is the appropriate
 level of abstraction.

 I agree with you. I don't see many use for code points.

 One of these uses is writing a parser for a format defined in term of
 code points (XML for instance). But beyond that, I don't see one.

 Actually, I had once a real use case for codepoint being the proper
 level of abstraction: a linguistic app of which one operational func
 counts occurrences of "scripting marks" like 'a' & '¨' in "ä". hope you
 see what I mean.
 Once the text is properly NFD decomposed, each of those marks is coded
 as a codepoint. (But if it's not decomposed, then most of those marks
 are probably hidden by precomposed codes coding characters like "ä".) So
 that even such an app benefits from a higher-level type basically
 operating on normalised (NFD) characters.

There's also the question of efficiency. On the whole, string operations can be
very expensive - particularly when you're doing a lot of them. The fact that D's
arrays are so powerful may reduce the problem in D, but in general, if you're
doing a lot with strings, it can get costly, performance-wise.

The question then is what is the cost of actually having strings abstracted to
the point that they really are ranges of characters rather than code units or
code points or whatever? If the cost is large enough, then dealing with strings
as arrays as they currently are and having the occasional unicode issue could
very well be worth it. As it is, there are plenty of people who don't want to
have to care about unicode in the first place, since the programs that they write
only deal with ASCII characters. The fact that D makes it so easy to deal with
unicode code points is a definite improvement, but taking the abstraction to the
point that you're definitely dealing with characters rather than code units or
code points could be too costly.

Now, if it can be done efficiently, then having unicode dealt with properly
without the programmer having to worry about it would be a big boon. As it is,
D's handling of unicode is a big boon, even if it doesn't deal with graphemes
and the like.

So, I think that we definitely should have an abstraction for unicode which uses
characters as the elements in the range and doesn't have to care about the
underlying encoding of the characters (except perhaps picking whether char,
wchar, or dchar is used internally, and therefore how much space it requires).
However, I'm not at all convinced that such an abstraction can be done efficiently
enough to make it the default way of handling strings.

- Jonathan M Davis

Jan 13 2011

spir <denis.spir gmail.com> writes:

On 01/13/2011 11:16 AM, Jonathan M Davis wrote:
 On Thursday 13 January 2011 01:49:31 spir wrote:
 On 01/13/2011 01:45 AM, Michel Fortin wrote:
 On 2011-01-12 14:57:58 -0500, spir<denis.spir gmail.com>  said:
 On 01/12/2011 08:28 PM, Don wrote:
 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

 I'd like to know when it happens that codepoint is the appropriate
 level of abstraction.

 I agree with you. I don't see many use for code points.

 One of these uses is writing a parser for a format defined in term of
 code points (XML for instance). But beyond that, I don't see one.

 Actually, I had once a real use case for codepoint beeing the proper
 level of abstraction: a linguistic app of which one operational func
 counts occurrences of "scripting marks" like 'a'&  '¨' in "ä". hope you
 see what I mean.
 Once the text is properly NFD decomposed, each of those marks in coded
 as a codepoint. (But if it's not decomposed, then most of those marks
 are probably hidden by precomposed codes coding characters like "ä".) So
 that even such an app benefits from a higher-level type basically
 operating on normalised (NFD) characters.

 There's also the question of efficiency. On the whole, string operations can be
 very expensive - particularly when you're doing a lot of them. The fact that D's
 arrays are so powerful may reduce the problem in D, but in general, if you're
 doing a lot with strings, it can get costly, performance-wise.

D's arrays (even dchar[] & dstring) do not allow having correct results 
when dealing with UCS/Unicode text in the general case. See Michel's 
example (and several ones I posted on this list, and the text at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction 
for a very lengthy explanation).
You and some other people seem to still confuse Unicode's low-level 
issue of codepoint vs code unit with the higher-level issue of codes 
_not_ representing characters in the common sense ("graphemes").

The text pointed to above was written precisely to introduce this issue 
because obviously no-one wants to face it... (E.g. each time I raise it on 
this list it is ignored, except by Michel, but the same is true 
everywhere else, including on the Unicode mailing list!). The core of 
the problem is the misleading term "abstract character" which 
deceivingly lets programmers believe that a codepoint codes a 
character, like in historic character sets -- which is *wrong*. No 
Unicode document AFAIK explains this. This is a case of unsaid lie.
Compared to legacy charsets, dealing with Unicode actually requires *2* 
levels of abstraction... (one to decode codepoints from code units, one 
to construct characters from codepoints)

Note that D's stdlib currently provides no means to do this, not even on 
the fly. You'd have to interface with e.g. ICU (a C/C++/Java Unicode 
library) (good luck ;-). But even ICU, as well as supposedly unicode-aware 
types or libraries for any language, would not give you an abstraction 
producing correct results for Michel's example. For instance, Python3 
code fails as miserably as any other. AFAIK, D is the first and only 
language having such a tool (Text.d at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).

 The question then is what is the cost of actually having strings abstracted to
 the point that they really are ranges of characters rather than code units or
 code points or whatever? If the cost is large enough, then dealing with strings
 as arrays as they currently are and having the occasional unicode issue could
 very well be worth it. As it is, there are plenty of people who don't want to
 have to care about unicode in the first place, since the programs that they write
 only deal with ASCII characters. The fact that D makes it so easy to deal with
 unicode code points is a definite improvement, but taking the abstraction to the
 point that you're definitely dealing with characters rather than code units or
 code points could be too costly.

When _manipulating_ text (indexing, search, changing), you have the 
choice between:
* On the fly abstraction (composing characters on the fly, and/or 
normalising them), for each operation for each piece of text (including 
parameters, including literals).
* Use of a type that constructs this abstraction once only for each 
piece of text.
Note that a single count operation is forced to construct this 
abstraction on the fly for the whole text... (and for the searched snippet).
Also note that optimisation is probably easier in the second case, for 
the abstraction operation is then standard.
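
To make the two options concrete, here is a small Python sketch of a count operation (Python only so that it runs as is; the helper name `nfd` and the sample strings are illustrative assumptions). It also shows that normalising alone is not yet the character-level abstraction: a search for a bare "a" still matches inside "ä".

```python
import unicodedata

def nfd(s):
    """Canonical decomposition of a string (hypothetical helper)."""
    return unicodedata.normalize("NFD", s)

text = "b\u00e4r ba\u0308r"    # "bär bär": precomposed ä, then decomposed ä
needle = "\u00e4"              # precomposed ä

# Code point level: only the precomposed occurrence is found.
naive = text.count(needle)

# On-the-fly abstraction: text AND needle are normalised for this one operation.
on_the_fly = nfd(text).count(nfd(needle))

# Build once, reuse: the text is normalised a single time for many operations.
prepared = nfd(text)
counts = [prepared.count(nfd(n)) for n in ("\u00e4", "a\u0308", "r")]

# Without grouping code points into characters, "a" is still found inside "ä".
false_hits = prepared.count("a")

print(naive, on_the_fly, counts, false_hits)  # 1 2 [2, 2, 2] 2
```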

 Now, if it can be done efficiently, then having unicode dealt with properly
 without the programmer having to worry about it would be a big boon. As it is,
 D's handling of unicode is a big boon, even if it doesn't deal with graphemes
 and the like.

It has a cost at initial Text construction time. Currently, on my very 
slow computer, 1MB of source text requires ~ 500 ms (decoding + 
decomposition + ordering + "piling" codes into characters). Decoding 
only, using D's builtin std.utf.decode, takes about 100 ms.
The bottleneck is piling: 70% of the time on average, on a test case 
mixing texts from a dozen natural languages. We would be very glad to 
get the community's help in optimising this phase :-)
(We have already progressed very much in terms of speed, but are now 
reaching the limits of our competence.)

 So, I think that we definitely should have an abstraction for unicode which uses
 characters as the elements in the range and doesn't have to care about the
 underlying encoding of the characters (except perhaps picking whether char,
 wchar, or dchar is used internally, and therefore how much space it requires).
 However, I'm not at all convinced that such an abstraction can be done efficiently
 enough to make it the default way of handling strings.

If you only have ASCII, or if you don't manipulate text at all, then as 
said in a previous post any string representation works fine (whatever 
the encoding it possibly uses under the hood).
D's builtin char/dchar/wchar and string/dstring/wstring are very nice 
and well done, but they are not necessary in such a use case. Actually, 
as shown by Steven's repeated complaints, they rather get in the way when 
dealing with non-Unicode source data (IIUC, by assuming string elements 
are UTF codes).

And they do not even try to solve the real issues one necessarily meets 
when manipulating unicode texts, which are due to UCS's coding format. 
Thus my previous statement: the level of codepoints is nearly never the 
proper level of abstraction.

 - Jonathan M Davis

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 13 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-13 06:48:46 -0500, spir <denis.spir gmail.com> said:

 Note that D's stdlib currently provides no means to do this, not even 
 on the fly. You'd have to interface with e.g. ICU (a C/C++/Java Unicode 
 library) (good luck ;-). But even ICU, like the supposedly 
 Unicode-aware types or libraries of any language, would not give you an 
 abstraction producing correct results for Michel's example. For 
 instance, Python3 code fails as miserably as any other. AFAIK, D is the 
 first and only language having such a tool (Text.d at 
 https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).

D is not the first language dealing correctly with Unicode strings in 
this manner. Objective-C's NSString class search and compare methods 
deal with characters with combining marks correctly. If you want to 
compare code points, you can do so explicitly using the NSLiteralSearch 
option, but the default is to compare the canonical version (at the 
grapheme level).
<http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/SearchingStrings.html%23//apple_ref/doc/uid/20000149-CJBBGBAI>

In Cocoa, string sorting and case-insensitive comparison are also 
dependent on the user's locale settings, although you can also specify 
your own locale if the user's locale is not what you want.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 13 2011

spir <denis.spir gmail.com> writes:

On 01/13/2011 02:47 PM, Michel Fortin wrote:
 On 2011-01-13 06:48:46 -0500, spir <denis.spir gmail.com> said:

 Note that D's stdlib currently provides no means to do this, not even
 on the fly. You'd have to interface with e.g. ICU (a C/C++/Java Unicode
 library) (good luck ;-). But even ICU, like the supposedly
 Unicode-aware types or libraries of any language, would not give you an
 abstraction producing correct results for Michel's example. For
 instance, Python3 code fails as miserably as any other. AFAIK, D is
 the first and only language having such a tool (Text.d at
 https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).

 D is not the first language dealing correctly with Unicode strings in
 this manner. Objective-C's NSString class search and compare methods
 deal with characters with combining marks correctly. If you want to
 compare code points, you can do so explicitly using the NSLiteralSearch
 option, but the default is to compare the canonical version (at the
 grapheme level).
 <http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/SearchingStrings.html%23//apple_ref/doc/uid/20000149-CJBBGBAI>

Thank you very much for this information (I feel less lonely ;-).
I'll have a look at this NSString class ASAP, looks like it does 
The-Right-Thing as default (an Apple product...)

 In Cocoa, string sorting and case-insensitive comparison are also dependent
 on the user's locale settings, although you can also specify your own
 locale if the user's locale is not what you want.

On this point, I'm more doubtful. (Locale settings do not guarantee 
anything about the right way of sorting for a given domain, a given app, a 
given use case. There is an infinity of potential choices. But maybe 
it's the right default? See kde trying to invent a, hum, "natural", way of 
sorting file names...)

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 13 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-13 14:11:44 -0500, spir <denis.spir gmail.com> said:

 In Cocoa, string sorting and case-insensitive comparison are also 
 dependent on the user's locale settings, although you can also specify 
 your own locale if the user's locale is not what you want.

 [...] See kde trying to invent a, hum, "natural", way of sorting file names...)

Mac OS has sorted file names in a "natural" way for a very long time 
(since Mac OS 8 I believe). By natural, I mean that numbers inside the 
file name are sorted in numeric order while the rest is sorted 
character by character. For instance "My File 2" will go before "My 
File 10" in file listings because 2 is less than 10.
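
A minimal sketch of such an ordering, in Python so that it runs as is. This is not Mac OS's or NSString's actual algorithm; the name `natural_key` and the rule "split on runs of digits, compare digit runs as integers" are assumptions made for illustration.

```python
import re

def natural_key(name):
    """Sort key: runs of digits compare as numbers, the rest as text.

    re.split with a capturing group always yields text at even positions
    and digit runs at odd positions, so keys compare like with like."""
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

names = ["My File 10", "My File 2", "My File 1"]

plain = sorted(names)                     # character by character
natural = sorted(names, key=natural_key)  # numbers in numeric order

print(plain)    # ['My File 1', 'My File 10', 'My File 2']
print(natural)  # ['My File 1', 'My File 2', 'My File 10']
```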

There's an option in NSString comparison methods to use this ordering, 
but it's not the default.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 13 2011

"Nick Sabalausky" <a a.a> writes:

"Michel Fortin" <michel.fortin michelf.com> wrote in message 
news:igo5v2$gq2$1 digitalmars.com...
 On 2011-01-13 14:11:44 -0500, spir <denis.spir gmail.com> said:

 In Cocoa, string sorting and case-insensitive comparison are also 
 dependent on the user's locale settings, although you can also specify 
 your own locale if the user's locale is not what you want.

 [...] See kde trying to invent a, hum, "natural", way of sorting file names...)

 Mac OS has sorted file names in a "natural" way for a very long time (since 
 Mac OS 8 I believe). By natural, I mean that numbers inside the file name 
 are sorted in numeric order while the rest is sorted character by 
 character. For instance "My File 2" will go before "My File 10" in file 
 listings because 2 is less than 10.

XP's explorer does that too. It's a very nice feature.

Jan 13 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday 13 January 2011 03:48:46 spir wrote:
 On 01/13/2011 11:16 AM, Jonathan M Davis wrote:
 On Thursday 13 January 2011 01:49:31 spir wrote:
 On 01/13/2011 01:45 AM, Michel Fortin wrote:
 On 2011-01-12 14:57:58 -0500, spir<denis.spir gmail.com>  said:
 On 01/12/2011 08:28 PM, Don wrote:
 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

 I'd like to know when it happens that codepoint is the appropriate
 level of abstraction.

 I agree with you. I don't see many uses for code points.

 One of these uses is writing a parser for a format defined in terms of
 code points (XML for instance). But beyond that, I don't see one.

 Actually, I had once a real use case for codepoint being the proper
 level of abstraction: a linguistic app of which one operational func
 counts occurrences of "scripting marks" like 'a' & '¨' in "ä". hope you
 see what I mean.
 Once the text is properly NFD decomposed, each of those marks is coded
 as a codepoint. (But if it's not decomposed, then most of those marks
 are probably hidden by precomposed codes coding characters like "ä".) So
 that even such an app benefits from a higher-level type basically
 operating on normalised (NFD) characters.

 There's also the question of efficiency. On the whole, string operations
 can be very expensive - particularly when you're doing a lot of them.
 The fact that D's arrays are so powerful may reduce the problem in D,
 but in general, if you're doing a lot with strings, it can get costly,
 performance-wise.

 D's arrays (even dchar[] & dstring) do not allow getting correct results
 when dealing with UCS/Unicode text in the general case. See Michel's
 example (and several I posted on this list, and the text at
 https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction
 for a very lengthy explanation).
 You and some other people still seem to confuse Unicode's low-level
 issue of codepoint vs code unit with the higher-level issue of codes
 _not_ representing characters in the common sense ("graphemes").

 The text linked above was written precisely to introduce this issue,
 because obviously no one wants to face it... (E.g. each time I raise it on
 this list it is ignored, except by Michel, but the same is true
 everywhere else, including on the Unicode mailing list!) The core of
 the problem is the misleading term "abstract character", which
 deceptively lets programmers believe that a codepoint codes a
 character, like in historic character sets -- which is *wrong*. No
 Unicode document AFAIK explains this. This is a case of lying by omission.
 Compared to legacy charsets, dealing with Unicode actually requires *2*
 levels of abstraction (one to decode codepoints from code units, one
 to construct characters from codepoints).

 Note that D's stdlib currently provides no means to do this, not even on
 the fly. You'd have to interface with e.g. ICU (a C/C++/Java Unicode
 library) (good luck ;-). But even ICU, like the supposedly Unicode-aware
 types or libraries of any language, would not give you an abstraction
 producing correct results for Michel's example. For instance, Python3
 code fails as miserably as any other. AFAIK, D is the first and only
 language having such a tool (Text.d at
 https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).

 The question then is what is the cost of actually having strings
 abstracted to the point that they really are ranges of characters rather
 than code units or code points or whatever? If the cost is large enough,
 then dealing with strings as arrays as they currently are and having the
 occasional unicode issue could very well be worth it. As it is, there
 are plenty of people who don't want to have to care about unicode in the
 first place, since the programs that they write only deal with ASCII
 characters. The fact that D makes it so easy to deal with unicode code
 points is a definite improvement, but taking the abstraction to the
 point that you're definitely dealing with characters rather than code
 units or code points could be too costly.

 When _manipulating_ text (indexing, search, changing), you have the
 choice between:
 * On the fly abstraction (composing characters on the fly, and/or
 normalising them), for each operation for each piece of text (including
 parameters, including literals).
 * Use of a type that constructs this abstraction once only for each
 piece of text.
 Note that a single count operation is forced to construct this
 abstraction on the fly for the whole text... (and for the searched
 snippet). Also note that optimisation is probably easier in the second
 case, for the abstraction operation is then standard.

 Now, if it can be done efficiently, then having unicode dealt with
 properly without the programmer having to worry about it would be a big
 boon. As it is, D's handling of unicode is a big boon, even if it
 doesn't deal with graphemes and the like.

 It has a cost at initial Text construction time. Currently, on my very
 slow computer, 1MB of source text requires ~ 500 ms (decoding +
 decomposition + ordering + "piling" codes into characters). Decoding
 only, using D's builtin std.utf.decode, takes about 100 ms.
 The bottleneck is piling: 70% of the time on average, on a test case
 mixing texts from a dozen natural languages. We would be very glad to
 get the community's help in optimising this phase :-)
 (We have already progressed very much in terms of speed, but are now
 reaching the limits of our competence.)

 So, I think that we definitely should have an abstraction for unicode
 which uses characters as the elements in the range and doesn't have to
 care about the underlying encoding of the characters (except perhaps
 picking whether char, wchar, or dchar is used internally, and therefore
 how much space it requires). However, I'm not at all convinced that such
 an abstraction can be done efficiently enough to make it the default way
 of handling strings.

 If you only have ASCII, or if you don't manipulate text at all, then as
 said in a previous post any string representation works fine (whatever
 the encoding it possibly uses under the hood).
 D's builtin char/dchar/wchar and string/dstring/wstring are very nice
 and well done, but they are not necessary in such a use case. Actually,
 as shown by Steven's repeated complaints, they rather get in the way when
 dealing with non-Unicode source data (IIUC, by assuming string elements
 are UTF codes).

 And they do not even try to solve the real issues one necessarily meets
 when manipulating unicode texts, which are due to UCS's coding format.
 Thus my previous statement: the level of codepoints is nearly never the
 proper level of abstraction.

I wasn't saying that code points are guaranteed to be characters. I was saying
that in most cases they are, so if efficiency is an issue, then having properly
abstract characters could be too costly. However, having a range type which
properly abstracts characters and deals with whatever graphemes and
normalization and whatnot that it has to would be a very good thing to have. The
real question is whether it can be made efficient enough to even consider using it
normally instead of just when you know that you're really going to need it.

The fact that you're seeing such a large drop in performance with your Text type
definitely would support the idea that it could be just plain too expensive to
use such a type in the average case. Even something like a 20% drop in
performance could be devastating if you're dealing with code which does a lot of
string processing. Regardless though, there will obviously be cases where you'll
need something like your Text type if you want to process unicode correctly.

However, regardless of what the best way to handle unicode is in general, I
think that it's painfully clear that your average programmer doesn't know much
about unicode. Even understanding the nuances between char, wchar, and dchar is
more than your average programmer seems to understand at first. The idea that a
char wouldn't be guaranteed to be an actual character is not something that many
programmers take to immediately. It's quite foreign to how chars are typically
dealt with in other languages, and many programmers never worry about unicode at
all, only dealing with ASCII. So, not only is unicode a rather disgusting
problem, but it's not one that your average programmer begins to grasp as far as
I've seen. Unless the issue is abstracted away completely, it takes a fair bit
of explaining to understand how to deal with unicode properly.

- Jonathan M Davis

Jan 13 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-13 07:10:09 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:

 However, regardless of what the best way to handle unicode is in 
 general, I think that it's painfully clear that your average programmer 
 doesn't know much about unicode. Even understanding the nuances between 
 char, wchar, and dchar is more than your average programmer seems to 
 understand at first. The idea that a char wouldn't be guaranteed to be 
 an actual character is not something that many programmers take to 
 immediately. It's quite foreign to how chars are typically dealt with 
 in other languages, and many programmers never worry about unicode at 
 all, only dealing with ASCII. So, not only is unicode a rather 
 disgusting problem, but it's not one that your average programmer 
 begins to grasp as far as I've seen. Unless the issue is abstracted 
 away completely, it takes a fair bit of explaining to understand how to 
 deal with unicode properly.

What's nice about Cocoa's way of handling strings is that even 
programmers not bothering about it get things right most of the time. 
Strings are compared in their canonical form (graphemes), unless you 
request a literal comparison; and they are sorted and compared 
case-insensitively according to the user's locale, unless you specify 
your own locale settings. Its only major pitfall is that indexing is 
done on UTF-16 code units.
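
The difference between literal and canonical comparison, and the UTF-16 indexing pitfall, can be sketched in Python (used here only so the example runs as is). This is not how NSString is implemented; `canonical_equal` and `utf16_length` are hypothetical helpers, and locale-dependent collation is left out.

```python
import unicodedata

def canonical_equal(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def utf16_length(s):
    """Number of UTF-16 code units needed to store s."""
    return len(s.encode("utf-16-le")) // 2

precomposed = "caf\u00e9"     # é as one code point
decomposed = "cafe\u0301"     # e + combining acute accent

literal = precomposed == decomposed                   # code point by code point
canonical = canonical_equal(precomposed, decomposed)  # canonical form

# Indexing pitfall: one code point outside the BMP takes two UTF-16 units.
clef = "\U0001D11E"           # MUSICAL SYMBOL G CLEF

print(literal, canonical, len(clef), utf16_length(clef))  # False True 1 2
```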

The cost for this correctness is a small performance penalty, but I 
think it's the right path to take. For when performance or access to 
code points is important, the programmer should still be able to go 
down one layer and play with code points directly.

That said, we need to make sure the performance drop is minimal. I 
somewhat doubt that spir's approach of storing strings as an array 
of piles of characters is the right approach for most usage scenarios, 
but this area would need a little more research. spir's approach is 
certainly the ultimate step in correctness as it allows O(1) indexing 
of graphemes, but personally I'd favor not having indexing and just doing 
on-the-fly decoding at the grapheme level when performing various 
string operations.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 13 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

OT: Spir, do you know if I can change the syntax highlighting settings
on bitbucket? I can't see anything with these gray on dark-gray
colors: http://i.imgur.com/SmLk1.jpg

Jan 13 2011

"Nick Sabalausky" <a a.a> writes:

"Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message 
news:mailman.604.1294932704.4748.digitalmars-d puremagic.com...
 OT: Spir, do you know if I can change the syntax highlighting settings
 on bitbucket? I can't see anything with these gray on dark-gray
 colors: http://i.imgur.com/SmLk1.jpg

I'm getting the same problem too.

Jan 13 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-13 15:39:14 -0500, "Nick Sabalausky" <a a.a> said:

 "Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message
 news:mailman.604.1294932704.4748.digitalmars-d puremagic.com...
 OT: Spir, do you know if I can change the syntax highlighting settings
 on bitbucket? I can't see anything with these gray on dark-gray
 colors: http://i.imgur.com/SmLk1.jpg

 
 I'm getting the same problem too.

I bypassed the problem by fetching the files from the repository. But I 
agree it's very annoying.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 13 2011

spir <denis.spir gmail.com> writes:

On 01/13/2011 01:10 PM, Jonathan M Davis wrote:
 I wasn't saying that code points are guaranteed to be characters. I was saying
 that in most cases they are, so if efficiency is an issue, then having properly
 abstract characters could be too costly.

The problem is then: how does a library or application programmer know 
for sure that all true characters (graphemes) in all source texts their 
software will ever deal with are coded with a single codepoint?
If you cope with ASCII only, now & forever, then you know that.
If you do not manipulate text at all, then the question vanishes.

Otherwise, you cannot know, I guess. The problem is partially masked because 
most of us currently process only western-language sources, for whose 
scripts there exist precomposed codes for every _predefined_ character, 
and text-producing software (like editors) usually uses precomposed codes 
when available. Hope I'm clear.
(I hope this use of precomposed codes will change, because the gain in 
space for western languages is ridiculously small, while the cost in 
processing is significant.)
In the future, all of this may change, so that the issue would more 
often be obvious to programmers dealing with international text. 
Note that even now nothing prevents a user (including a programmer in 
source code!), let alone text-producing software, from using decomposed 
coding (the right choice imo). And there are true characters for which no 
precomposed code is defined, and you can "invent" as many fancy 
characters as you like. All of this is valid Unicode and must be 
properly dealt with.
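
As a rough way to test a given text for this, the following Python sketch (Python only so it runs as is) composes the text as far as possible with NFC and then looks for leftover combining marks. The function name is hypothetical and the check is a simplifying assumption: it only detects combining marks, not other multi-code-point characters.

```python
import unicodedata

def has_multi_code_point_characters(text):
    """True if, even after NFC composition, some character still needs
    a base code point plus at least one combining mark."""
    composed = unicodedata.normalize("NFC", text)
    return any(unicodedata.combining(cp) != 0 for cp in composed)

western = "ba\u0308r"   # decomposed "bär": NFC finds the precomposed ä
fancy = "q\u0303"       # q + combining tilde: no precomposed code exists

print(has_multi_code_point_characters(western))  # False
print(has_multi_code_point_characters(fancy))    # True
```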

 However, having a range type which
 properly abstracts characters and deals with whatever graphemes and
 normalization and whatnot that it has to would be a very good thing
 to have. The real question is whether it can be made efficient enough to
 even consider using it normally instead of just when you know that
 you're really going to need it.

About ranges: we initially planned to expose a range interface in our type 
for iteration, instead of opApply, for better integration with the coming D2 
style and algorithms. But we had to drop it due to a few range bugs 
exposed in a previous thread (search for "range usability" IIRC).

 The fact that you're seeing such a large drop in performance with your Text type
 definitely would support the idea that it could be just plain too expensive to
 use such a type in the average case. Even something like a 20% drop in
 performance could be devastating if you're dealing with code which does a lot of
 string processing. Regardless though, there will obviously be cases where you'll
 need something like your Text type if you want to process unicode correctly.

The question of efficiency is not as you present it. If you cannot 
guarantee that every character is coded by a single code (in all pieces 
of text, including params and literals), then you *must* construct an 
abstraction at the level of true characters -- and probably even 
normalise them.
You have the choice of doing it on the fly for _every_ operation, or 
using a tool like the type Text. In the latter case, not only is everything 
far simpler for client code, but the abstraction is constructed only 
once (and forever ;-).

In the first case, the cost is the same (or rather higher because 
optimisation can probably be more efficient for a single standard case 
than for various operation cases); but _multiplied_ by the number of 
operations you need to perform on each piece of text. Thus, for a given 
operation, you get the slowest possible run: for instance indexing is 
O(k*n) where k is the cost of "piling" a single char, and n the char 
count...

In the second case, the efficiency issue happens only initially for each 
piece of text. Then, every operation is as fast as possible: indexing is 
indeed O(1).
But: this O(1) is slightly slower than with historic charsets because 
characters are now represented by mini code arrays instead of single 
codes. The same point applies even more for every operation involving 
compares (search, count, replace). We cannot solve this: it is due to 
UCS's coding scheme.
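
The two cases can be sketched as follows, in Python so that the example runs as is. This is not the actual Text.d; the class and function names are hypothetical and the piling rule (base code point plus following combining marks, after NFD) is a simplification.

```python
import unicodedata

def char_at_on_the_fly(source, index):
    """First case: re-pile characters from the start on every call."""
    if index < 0:
        raise IndexError(index)
    position, pile = -1, ""
    for cp in unicodedata.normalize("NFD", source):
        if pile and unicodedata.combining(cp) != 0:
            pile += cp               # mark belongs to the current character
            continue
        if position == index:        # previous character was the one wanted
            return pile
        position += 1
        pile = cp
    if position == index:
        return pile
    raise IndexError(index)

class Text:
    """Second case: pile once at construction, then index directly."""
    def __init__(self, source):
        self.piles = []
        for cp in unicodedata.normalize("NFD", source):
            if self.piles and unicodedata.combining(cp) != 0:
                self.piles[-1] += cp
            else:
                self.piles.append(cp)

    def __len__(self):
        return len(self.piles)

    def __getitem__(self, index):
        return self.piles[index]     # plain list indexing

    def count(self, character):
        # Each compare is between small code arrays, not single codes.
        key = unicodedata.normalize("NFD", character)
        return sum(1 for pile in self.piles if pile == key)

source = "ba\u0308r \u00e0 la"       # "bär à la", mixed coding forms
text = Text(source)
print(len(text), text.count("\u00e4"), text.count("a"))  # 8 1 1
```

Note that `text.count("a")` finds only the bare "a" of "la", not the base letters inside "ä" and "à".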

 However, regardless of what the best way to handle unicode is in general, I
 think that it's painfully clear that your average programmer doesn't know much
 about unicode.

True. Even those who think they are informed. Because Unicode's docs 
not only all ignore the problem, but contribute to creating it by using the 
deceptive term "abstract character" (and often worse, "character" alone) 
to denote what a codepoint codes. All articles I have ever read _about_ 
Unicode by third parties simply follow suit. Raising this issue on the Unicode 
mailing list usually results in plain silence.

 Even understanding the nuances between char, wchar, and dchar is
 more than your average programmer seems to understand at first. The idea that a
 char wouldn't be guaranteed to be an actual character is not something that many
 programmers take to immediately. It's quite foreign to how chars are typically
 dealt with in other languages, and many programmers never worry about unicode at
 all, only dealing with ASCII.

(average programmer ? ;-)
This is not so much foreign to "how chars are typically dealt with in other 
languages" as to how characters were coded in historic charsets. 
Other languages ignore the issue, and thus run incorrectly with 
universal text, the same way D's builtin tools do.
About ASCII, note that the only kind of source it is able to encode is 
plain English text, without any bit of fancy stuff in it. A single 
non-breaking space, "≥", "×" (product U+00D7), or a letter 
imported from a foreign language as in "à la", is enough to leave ASCII; 
the same goes for "αβγ", not to mention "©" & "®"...
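
A quick way to check this for the examples just listed, sketched in Python (the helper name `fits_in_ascii` is hypothetical):

```python
def fits_in_ascii(text):
    """True if every code point of text can be encoded as ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

samples = [
    "plain English text",
    "a\u00a0b",            # non-breaking space
    "2 \u2265 1",          # ≥
    "3 \u00d7 4",          # × (U+00D7)
    "\u00e0 la",           # à la
    "\u03b1\u03b2\u03b3",  # αβγ
    "\u00a9",              # ©
    "\u00ae",              # ®
]

for sample in samples:
    print(repr(sample), fits_in_ascii(sample))
```

Only the first sample passes; each of the others contains at least one code point above U+007F.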

 So, not only is unicode a rather disgusting
 problem, but it's not one that your average programmer begins to grasp as far as
 I've seen. Unless the issue is abstracted away completely, it takes a fair bit
 of explaining to understand how to deal with unicode properly.

Please have a look at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3, read 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction, 
and try https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/Text.d
Any feedback welcome (esp. on reformulating the text concisely ;-)

 - Jonathan M Davis

Denis
_________________
vita es estrany
spir.wikidot.com

Jan 13 2011
