digitalmars.D - VLERange: a range in between BidirectionalRange and RandomAccessRange

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
I've been thinking on how to better deal with Unicode strings. Currently 
strings are formally bidirectional ranges with a surreptitious random 
access interface. The random access interface accesses the support of 
the string, which is understood to hold data in a variable-encoded 
format. For as long as the programmer understands this relationship, 
code for string manipulation can be written with relative ease. However, 
there is still room for writing wrong code that looks legit.

Sometimes the best way to tackle a hairy reality is to invite it to the 
negotiation table and offer it promotion to first-class abstraction 
status. Along that vein I was thinking of defining a new range: 
VLERange, i.e. Variable Length Encoding Range. Such a range would have 
the power somewhere in between bidirectional and random access.

The primitives offered would include empty, access to front and back, 
popFront and popBack (just like BidirectionalRange), and in addition 
properties typical of random access ranges: indexing, slicing, and 
length. Note that the result of the indexing operator is not the same as 
the element type of the range, as it only represents the unit of encoding.

In addition to these (and connecting the two), a VLERange would offer 
two additional primitives:

1. size_t stepSize(size_t offset) gives the length of the step needed to 
skip to the next element.

2. size_t backstepSize(size_t offset) gives the size of the _backward_ 
step that goes to the previous element.

In both cases, offset is assumed to be at the beginning of a logical 
element of the range.
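
For UTF-8 strings the two primitives could be sketched roughly as below. This is only an illustration, not part of the proposal: the free functions stepSize and backstepSize are hypothetical names taken from the description above, and they simply forward to std.utf.stride and std.utf.strideBack from present-day Phobos.

```d
import std.utf : stride, strideBack;

// Hypothetical free functions mirroring the two proposed primitives.
// `offset` must sit at the start of an encoded element (or, for
// backstepSize, possibly at the very end of the string).
size_t stepSize(string s, size_t offset)
{
    return stride(s, offset);      // code units in the element starting at offset
}

size_t backstepSize(string s, size_t offset)
{
    return strideBack(s, offset);  // code units in the element ending just before offset
}

void main()
{
    string s = "a\u0153b";                  // 'a', U+0153 (two UTF-8 code units), 'b'
    assert(s.length == 4);                  // length counts code units, not elements
    assert(stepSize(s, 0) == 1);            // step over 'a'
    assert(stepSize(s, 1) == 2);            // step over U+0153
    assert(backstepSize(s, 3) == 2);        // from 'b' back to U+0153
    assert(backstepSize(s, s.length) == 1); // from the end back to 'b'
}
```

Indexing still yields code units here; only front, back and the two step functions know about element boundaries.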

I suspect that a lot of functions in std.string can be written without 
Unicode-specific knowledge just by relying on such an interface. 
Moreover, algorithms can be generalized to other structures that use 
variable-length encoding, such as those used in data compression. (In 
that case, the support would be a bit array and the encoded type would 
be ubyte.)

Writing to such ranges is not addressed by this design. Ideas are welcome.

Adding VLERange would legitimize strings and would clarify their 
handling, at the cost of adding one additional concept that needs to be 
minded. Is the trade-off worthwhile?


Andrei
Jan 10 2011
Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 I've been thinking on how to better deal with Unicode strings. 
 Currently strings are formally bidirectional ranges with a 
 surreptitious random access interface. The random access interface 
 accesses the support of the string, which is understood to hold data in 
 a variable-encoded format. For as long as the programmer understands 
 this relationship, code for string manipulation can be written with 
 relative ease. However, there is still room for writing wrong code that 
 looks legit.
 
 Sometimes the best way to tackle a hairy reality is to invite it to the 
 negotiation table and offer it promotion to first-class abstraction 
 status. Along that vein I was thinking of defining a new range: 
 VLERange, i.e. Variable Length Encoding Range. Such a range would have 
 the power somewhere in between bidirectional and random access.
 
 The primitives offered would include empty, access to front and back, 
 popFront and popBack (just like BidirectionalRange), and in addition 
 properties typical of random access ranges: indexing, slicing, and 
 length. Note that the result of the indexing operator is not the same 
 as the element type of the range, as it only represents the unit of 
 encoding.

Seems like a good idea to define things formally.
 In addition to these (and connecting the two), a VLERange would offer 
 two additional primitives:
 
 1. size_t stepSize(size_t offset) gives the length of the step needed 
 to skip to the next element.
 
 2. size_t backstepSize(size_t offset) gives the size of the _backward_ 
 step that goes to the previous element.

I like the idea, but I'm not sure about this interface. What's the result of stepSize if your range must create two elements from one underlying unit? Perhaps in those cases the element type could be an array (to return more than one element from one iteration).

For instance, say we have a conversion range taking a Unicode string and converting it to ISO Latin 1. The best (lossy) conversion for "œ" is "oe" (one character to two characters). In this case 'front' could simply return "oe" (two characters) in one iteration, with stepSize being the size of the "œ" code point. In the same conversion process, encountering "e" followed by a combining "´" would return the pre-combined character "é" (two characters to one character).
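
The one-becomes-many half of this idea, where each iteration yields a small array, might look like the sketch below. Everything here is an assumption made for illustration: toLatin1 is a hypothetical helper that special-cases only U+0153, and the many-becomes-one case (combining marks) is not handled at all.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : joiner, map;
import std.array : array;
import std.string : representation;

// Hypothetical helper: lossy conversion of one code point to Latin-1 bytes.
// Only U+0153 is special-cased; a real table would be much larger.
ubyte[] toLatin1(dchar c)
{
    if (c == '\u0153') return [cast(ubyte) 'o', cast(ubyte) 'e'];
    if (c <= 0xFF) return [cast(ubyte) c];   // Latin-1 matches the first 256 code points
    return [cast(ubyte) '?'];                // anything else is replaced
}

void main()
{
    // Each element of the converted range is an array of bytes.
    auto pieces = "c\u0153ur".map!toLatin1.array;
    assert(pieces.length == 4);     // four source code points
    assert(pieces[1].length == 2);  // U+0153 produced two bytes in one step
    assert(equal(pieces.joiner, "coeur".representation));
}
```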
 In both cases, offset is assumed to be at the beginning of a logical 
 element of the range.
 
 I suspect that a lot of functions in std.string can be written without 
 Unicode-specific knowledge just by relying on such an interface. 
 Moreover, algorithms can be generalized to other structures that use 
 variable-length encoding, such as those used in data compression. (In 
 that case, the support would be a bit array and the encoded type would 
 be ubyte.)

Applicability to other problems seems like a valuable benefit.
 Writing to such ranges is not addressed by this design. Ideas are welcome.

Writing, as in assigning to 'front'? That's not really possible with variable-length units as it'd need to shift everything in case of a length difference. Or maybe you meant writing as in having an output range for variable-length elements... I'm not sure.
 Adding VLERange would legitimize strings and would clarify their 
 handling, at the cost of adding one additional concept that needs to be 
 minded. Is the trade-off worthwhile?

In my opinion it's not a trade-off at all; it's a formalization of how strings are handled, which is better in every regard than a "special case". I welcome this move very much.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 11 2011
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/11/11 4:41 AM, Michel Fortin wrote:
 On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 In addition to these (and connecting the two), a VLERange would offer
 two additional primitives:

 1. size_t stepSize(size_t offset) gives the length of the step needed
 to skip to the next element.

 2. size_t backstepSize(size_t offset) gives the size of the _backward_
 step that goes to the previous element.

I like the idea, but I'm not sure about this interface. What's the result of stepSize if your range must create two elements from one underlying unit? Perhaps in those cases the element type could be an array (to return more than one element from one iteration). For instance, say we have a conversion range taking a Unicode string and converting it to ISO Latin 1. The best (lossy) conversion for "œ" is "oe" (one character to two characters). In this case 'front' could simply return "oe" (two characters) in one iteration, with stepSize being the size of the "œ" code point. In the same conversion process, encountering "e" followed by a combining "´" would return the pre-combined character "é" (two characters to one character).

In the design as I thought of it, the effective length of one logical element is one or more representation units. My understanding is that you are referring to a fractional number of representation units for one logical element.
 Writing to such ranges is not addressed by this design. Ideas are
 welcome.

Writing, as in assigning to 'front'? That's not really possible with variable-length units as it'd need to shift everything in case of a length difference. Or maybe you meant writing as in having an output range for variable-length elements... I'm not sure.

Well, all of the above :o). Clearly assigning to e.g. front or back should not work. The question is what kind of API we can provide beyond simple append with put().

Andrei
Jan 11 2011
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/11/11 9:09 AM, spir wrote:
 On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:
 On 1/11/11 4:41 AM, Michel Fortin wrote:
 On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 In addition to these (and connecting the two), a VLERange would offer
 two additional primitives:

 1. size_t stepSize(size_t offset) gives the length of the step needed
 to skip to the next element.

 2. size_t backstepSize(size_t offset) gives the size of the _backward_
 step that goes to the previous element.

I like the idea, but I'm not sure about this interface. What's the result of stepSize if your range must create two elements from one underlying unit? Perhaps in those cases the element type could be an array (to return more than one element from one iteration). For instance, say we have a conversion range taking a Unicode string and converting it to ISO Latin 1. The best (lossy) conversion for "œ" is "oe" (one character to two characters). In this case 'front' could simply return "oe" (two characters) in one iteration, with stepSize being the size of the "œ" code point. In the same conversion process, encountering "e" followed by a combining "´" would return the pre-combined character "é" (two characters to one character).

In the design as I thought of it, the effective length of one logical element is one or more representation units. My understanding is that you are referring to a fractional number of representation units for one logical element.

I think Michel is right. If I understand correctly, VLERange addresses the low-level and rather simple issue of each codepoint being encoded as a variable number of code units. Right? If yes, then what is the advantage of VLERange? D already has string/wstring/dstring, allowing one to work with the most advantageous encoding according to the given source data, and dstring abstracts from low-level encoding issues.

It's not about the data, it's about algorithms. Currently there are algorithms that ostensibly work for bidirectional ranges, but internally "cheat" by detecting that the input is actually a string, and use that knowledge for better implementations. The benefit of VLERange would be that it legitimizes those algorithms. I wouldn't be surprised if an entire class of algorithms would in fact require VLERange (e.g. many of those that we commonly consider today "string" algorithms).
 The main (and massively ignored) issue when manipulating unicode text is
 rather that, unlike with legacy character sets, one codepoint does *not*
 represent a character in the common sense. In character sets like latin-1:
 * each code represents a character, in the common sense (eg "à")
 * each character representation has the same size (1 or 2 bytes)
 * each character has a single representation ("à" --> always 0xe0)
 All of this is wrong with unicode. And these are complicated and
 high-level issues, that appear _after_ decoding, on codepoint sequences.

 If VLERange is helpful in dealing with those problems, then I don't
 understand your presentation, sorry. Do you for instance mean such a
 range would, under the hood, group together codes belonging to the same
 character (thus making indexing meaningful), and/or normalise (decomp &
 order) (thus allowing to comp/find/count correctly).?

VLERange would offer automatic decoding in front, back, popFront, and popBack - just like BidirectionalRange does right now. It would also offer access to the representational support by means of indexing - also like char[] et al already do now. The difference is that, VLERange being a formal concept, algorithms can specialize on it instead of (a) specializing for UTF strings or (b) specializing for BidirectionalRange and then manually detecting isSomeString inside.

Conversely, when defining an algorithm you can specify VLERange as a requirement. Boyer-Moore is a perfect example - it doesn't work on bidirectional ranges, but it does work on VLERange. I suspect there are many like it.

Of course, it would help a lot if we figured out other remarkable VLERanges. Here are a few that come to mind:

* Multibyte encodings other than UTF. Currently we have no special support for those beyond e.g. forward or bidirectional ranges.
* Huffman, RLE, LZ encoded buffers (and many other compressed formats)
* Vocabulary-based translation systems, e.g. associate each word with a number.
* Others...?

Some of these are forward-only (don't allow bidirectional access). Once we have a number of examples, it would be great to figure out a number of remarkable algorithms operating on them.

Andrei
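
Specializing on a formal concept instead of detecting isSomeString could look roughly like the following sketch. The trait isVLERange, the UFCS helpers stepSize and backstepSize, and the algorithm countElements are all hypothetical names invented for this illustration; nothing of the sort exists in Phobos.

```d
import std.range.primitives : isBidirectionalRange;
import std.utf : stride, strideBack;

// Hypothetical primitives for UTF-8 strings, reached through UFCS.
size_t stepSize(string s, size_t offset) { return stride(s, offset); }
size_t backstepSize(string s, size_t offset) { return strideBack(s, offset); }

// Hypothetical trait: a bidirectional range that also exposes indexing,
// length and the two step primitives.
enum isVLERange(R) = isBidirectionalRange!R && is(typeof((R r) {
    size_t fwd = r.stepSize(0);
    size_t bwd = r.backstepSize(r.length);
    auto unit = r[0];
}));

// An algorithm that states the requirement in its constraint and walks
// the support by integral index instead of decoding every element.
size_t countElements(R)(R r) if (isVLERange!R)
{
    size_t n = 0;
    for (size_t i = 0; i < r.length; i += r.stepSize(i))
        ++n;
    return n;
}

void main()
{
    static assert(isVLERange!string);
    static assert(!isVLERange!(int[]));     // no step primitives for int[]
    assert(countElements("a\u0153b") == 3); // 4 code units, 3 elements
}
```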
Jan 11 2011
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/11/11 4:46 PM, spir wrote:
 On 01/11/2011 08:09 PM, Andrei Alexandrescu wrote:
 The main (and massively ignored) issue when manipulating unicode text is
 rather that, unlike with legacy character sets, one codepoint does *not*
 represent a character in the common sense. In character sets like
 latin-1:
 * each code represents a character, in the common sense (eg "à")
 * each character representation has the same size (1 or 2 bytes)
 * each character has a single representation ("à" --> always 0xe0)
 All of this is wrong with unicode. And these are complicated and
 high-level issues, that appear _after_ decoding, on codepoint sequences.

 If VLERange is helpful in dealing with those problems, then I don't
 understand your presentation, sorry. Do you for instance mean such a
 range would, under the hood, group together codes belonging to the same
 character (thus making indexing meaningful), and/or normalise (decomp &
 order) (thus allowing to comp/find/count correctly).?

VLERange would offer automatic decoding in front, back, popFront, and popBack - just like BidirectionalRange does right now. It would also offer access to the representational support by means of indexing - also like char[] et al already do now.

IIUC, for the case of text, VLERange helps abstracting from the annoying fact that a codepoint is encoded as a variable number of code units. What I meant is issues like:

 auto text = "a\u0302"d;
 writeln(text); // "â"
 auto range = VLERange(text);
 // extracts characters correctly?
 auto letter = range.front(); // "a" or "â"?
 // case yes: compares correctly?
 assert(range.front() == "â"); // fail or pass?

You should try text.front right now, you might be surprised :o).

Andrei
Jan 11 2011
spir <denis.spir gmail.com> writes:
On 01/12/2011 02:22 AM, Andrei Alexandrescu wrote:
 IIUC, for the case of text, VLERange helps abstracting from the annoying
 fact that a codepoint is encoded as a variable number of code units.
 What I meant is issues like:

 auto text = "a\u0302"d;
 writeln(text); // "â"
 auto range = VLERange(text);
 // extracts characters correctly?
 auto letter = range.front(); // "a" or "â"?
 // case yes: compares correctly?
 assert(range.front() == "â"); // fail or pass?

You should try text.front right now, you might be surprised :o).

Hum, right now it incorrectly returns "a", as expected. And indeed

 assert("â" == "a\u0302");

incorrectly fails, as expected. Both would work with legacy charsets like latin-1. This is a new issue introduced with UCS, which requires an additional level of abstraction (in addition to the one required by the codepoint/code unit distinction!).

You may have a look at https://bitbucket.org/denispir/denispir-d/src/5ec6fe1e1065/Text.html for a rough implementation of a type that does the right thing, and at https://bitbucket.org/denispir/denispir-d/src/5ec6fe1e1065/U%20missing%20leve %20of%20abstraction for a (far too long) explanation.

(I have tried to mention those problems a dozen times already, but for some reason nearly everybody seems deaf to them.)

Denis
_________________
vita es estrany
spir.wikidot.com
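
The two failures described here (code points versus characters, and unnormalized comparison) can be reproduced with a short sketch. It assumes present-day Phobos, where std.uni offers byGrapheme and normalize; those functions are not part of the VLERange proposal and are used only to show the extra level of abstraction being asked for.

```d
import std.range.primitives : front, walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    dstring text = "a\u0302"d;                // 'a' followed by a combining circumflex

    assert(text.length == 2);                 // two code points...
    assert(text.front == 'a');                // ...and front yields only the base letter
    assert(text.byGrapheme.walkLength == 1);  // but one user-perceived character

    // The precomposed form U+00E2 differs code point by code point,
    // so a plain comparison fails until both sides are normalized.
    assert("\u00E2"d != text);
    assert(normalize!NFC(text) == "\u00E2"d);
}
```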
Jan 11 2011
Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-11 11:36:54 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/11/11 4:41 AM, Michel Fortin wrote:
 For instance, say we have a conversion range taking a Unicode string and
 converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
 "oe" (one character to two characters), in this case 'front' could
 simply return "oe" (two characters) in one iteration, with stepSize
 being the size of the "œ" code point. In the same conversion process,
 encountering "e" followed by a combining "´" would return pre-combined
 character "é" (two characters to one character).

In the design as I thought of it, the effective length of one logical element is one or more representation units. My understanding is that you are referring to a fractional number of representation units for one logical element.

Your understanding is correct. I think both cases (one becomes many & many becomes one) are important and must be supported. Your proposal only deals with the many-becomes-one case.

I proposed returning arrays so we can deal with the one-becomes-many case ("œ" becoming "oe"). Another idea would be to introduce "substeps". When checking for the next character, in addition to determining its step length you could also determine the number of substeps in it. "œ" would have two substeps, "o" and "e", and when there is no longer any substep you move to the next step.

All this said, I think this should stay an implementation detail as this would allow a variety of strategies. Also, keeping this an implementation detail means that your proposed 'stepSize' and 'backstepSize' need to be an implementation detail too (because they won't make sense for the one-to-many case). So they can't really be part of a standard VLE interface.

As far as I know, all we really need to expose to algorithms is whether a range has elements of variable length, because this has an impact on your indexing capabilities. The rest seems unnecessary to me, or am I missing some use cases?

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 11 2011
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/11/11 11:13 AM, Michel Fortin wrote:
 On 2011-01-11 11:36:54 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 1/11/11 4:41 AM, Michel Fortin wrote:
 For instance, say we have a conversion range taking a Unicode string and
 converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
 "oe" (one character to two characters), in this case 'front' could
 simply return "oe" (two characters) in one iteration, with stepSize
 being the size of the "œ" code point. In the same conversion process,
 encountering "e" followed by a combining "´" would return pre-combined
 character "é" (two characters to one character).

In the design as I thought of it, the effective length of one logical element is one or more representation units. My understanding is that you are referring to a fractional number of representation units for one logical element.

Your understanding is correct. I think both cases (one becomes many & many becomes one) are important and must be supported. Your proposal only deals with the many-becomes-one case.

I disagree. When I suggested this design I was worried about over-abstracting. Now this looks like abstracting for stuff that hasn't even been addressed concretely yet. Besides, using bit as an encoding unit sounds like an acceptable approach for anything fractional.
 I proposed returning arrays so we can deal with the one-becomes-many
 case ("œ" becoming "oe"). Another idea would be to introduce "substeps".
 When checking for the next character, in addition to determining its
 step length you could also determine the number of substeps in it. "œ"
 would have two substeps, "o" and "e", and when there is no longer any
 substep you move to the next step.

 All this said, I think this should stay an implementation detail as this
 would allow a variety of strategies. Also, keeping this an
 implementation detail means that your proposed 'stepSize' and
 'backstepSize' need to be an implementation detail too (because they
 won't make sense for the one-to-many case). So they can't really be part
 of a standard VLE interface.

If you don't have at least stepSize that tells you how large the stride is to get to the next element, it becomes impossible to move within the range using integral indexes.
 As far as I know, all we really need to expose to algorithms is whether
 a range has elements of variable length, because this has an impact on
 your indexing capabilities. The rest seems unnecessary to me, or am I
 missing some use cases?

I think you could say that you don't really need stepSize because you can compute it as follows:

 auto r1 = r;
 r1.popFront();
 size_t stepSize = r.length - r1.length;

This is tenuous, inefficient, and impossible if the support range doesn't support length (I realize that variable-length encodings work over other ranges than random access, but then again this may be an overgeneralization).

Andrei
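
Wrapped into a function, the workaround above can be tried on an ordinary string. This is a minimal sketch: stepSizeByPop is a hypothetical name, and it relies on popFront for narrow strings decoding one code point, as std.range.primitives does today.

```d
import std.range.primitives : popFront;

// Hypothetical helper: derive the step at the front of the range by
// popping a copy and comparing code-unit lengths.
size_t stepSizeByPop(string r)
{
    auto r1 = r;        // copying the slice is cheap; the data is shared
    r1.popFront();      // skips one decoded element
    return r.length - r1.length;
}

void main()
{
    assert(stepSizeByPop("\u0153uvre") == 2); // U+0153 takes two UTF-8 code units
    assert(stepSizeByPop("abc") == 1);
}
```

Note that this only answers the question for the front of the range, which is part of why it is a poor substitute for stepSize(offset).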
Jan 11 2011
spir <denis.spir gmail.com> writes:
On 01/11/2011 08:09 PM, Andrei Alexandrescu wrote:
 The main (and massively ignored) issue when manipulating unicode text is
 rather that, unlike with legacy character sets, one codepoint does *not*
 represent a character in the common sense. In character sets like
 latin-1:
 * each code represents a character, in the common sense (eg "à")
 * each character representation has the same size (1 or 2 bytes)
 * each character has a single representation ("à" --> always 0xe0)
 All of this is wrong with unicode. And these are complicated and
 high-level issues, that appear _after_ decoding, on codepoint sequences.

 If VLERange is helpful in dealing with those problems, then I don't
 understand your presentation, sorry. Do you for instance mean such a
 range would, under the hood, group together codes belonging to the same
 character (thus making indexing meaningful), and/or normalise (decomp &
 order) (thus allowing to comp/find/count correctly).?

VLERange would offer automatic decoding in front, back, popFront, and popBack - just like BidirectionalRange does right now. It would also offer access to the representational support by means of indexing - also like char[] et al already do now.

IIUC, for the case of text, VLERange helps abstracting from the annoying fact that a codepoint is encoded as a variable number of code units. What I meant is issues like:

 auto text = "a\u0302"d;
 writeln(text); // "â"
 auto range = VLERange(text);
 // extracts characters correctly?
 auto letter = range.front(); // "a" or "â"?
 // case yes: compares correctly?
 assert(range.front() == "â"); // fail or pass?

Both fail using all unicode-aware types I know of, because

1. They do not recognise that a character is represented by an arbitrary number of codes (code _points_).
2. They do not use normalised forms for comp, search, count, etc... (while in unicode a given char can have several representations).
 The difference is that VLERange being
 a formal concept, algorithms can specialize on it instead of (a)
 specializing for UTF strings or (b) specializing for BidirectionalRange
 and then manually detecting isSomeString inside. Conversely, when
 defining an algorithm you can specify VLERange as a requirement.
 Boyer-Moore is a perfect example - it doesn't work on bidirectional
 ranges, but it does work on VLERange. I suspect there are many like it.

 Of course, it would help a lot if we figured other remarkable VLERanges.

I think I see the point, and the general usefulness of such an abstraction. But it would certainly be more useful in other fields than text manipulation, because with text there are far more annoying issues (which, as in the example above, simply prevent code correctness).

Denis
_________________
vita es estrany
spir.wikidot.com
Jan 11 2011
spir <denis.spir gmail.com> writes:
On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:
 On 1/11/11 4:41 AM, Michel Fortin wrote:
 On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 In addition to these (and connecting the two), a VLERange would offer
 two additional primitives:

 1. size_t stepSize(size_t offset) gives the length of the step needed
 to skip to the next element.

 2. size_t backstepSize(size_t offset) gives the size of the _backward_
 step that goes to the previous element.

I like the idea, but I'm not sure about this interface. What's the result of stepSize if your range must create two elements from one underlying unit? Perhaps in those cases the element type could be an array (to return more than one element from one iteration). For instance, say we have a conversion range taking a Unicode string and converting it to ISO Latin 1. The best (lossy) conversion for "œ" is "oe" (one character to two characters). In this case 'front' could simply return "oe" (two characters) in one iteration, with stepSize being the size of the "œ" code point. In the same conversion process, encountering "e" followed by a combining "´" would return the pre-combined character "é" (two characters to one character).

In the design as I thought of it, the effective length of one logical element is one or more representation units. My understanding is that you are referring to a fractional number of representation units for one logical element.

I think Michel is right. If I understand correctly, VLERange addresses the low-level and rather simple issue of each codepoint being encoded as a variable number of code units. Right? If yes, then what is the advantage of VLERange? D already has string/wstring/dstring, allowing one to work with the most advantageous encoding according to the given source data, and dstring abstracts from low-level encoding issues.

The main (and massively ignored) issue when manipulating unicode text is rather that, unlike with legacy character sets, one codepoint does *not* represent a character in the common sense. In character sets like latin-1:

* each code represents a character, in the common sense (eg "à")
* each character representation has the same size (1 or 2 bytes)
* each character has a single representation ("à" --> always 0xe0)

All of this is wrong with unicode. And these are complicated and high-level issues, that appear _after_ decoding, on codepoint sequences.

If VLERange is helpful in dealing with those problems, then I don't understand your presentation, sorry. Do you for instance mean such a range would, under the hood, group together codes belonging to the same character (thus making indexing meaningful), and/or normalise (decomp & order) (thus allowing to comp/find/count correctly)?

denis
_________________
vita es estrany
spir.wikidot.com
Jan 11 2011
"Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Jan 2011 22:57:36 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 I've been thinking on how to better deal with Unicode strings. Currently  
 strings are formally bidirectional ranges with a surreptitious random  
 access interface. The random access interface accesses the support of  
 the string, which is understood to hold data in a variable-encoded  
 format. For as long as the programmer understands this relationship,  
 code for string manipulation can be written with relative ease. However,  
 there is still room for writing wrong code that looks legit.

 Sometimes the best way to tackle a hairy reality is to invite it to the  
 negotiation table and offer it promotion to first-class abstraction  
 status. Along that vein I was thinking of defining a new range:  
 VLERange, i.e. Variable Length Encoding Range. Such a range would have  
 the power somewhere in between bidirectional and random access.

 The primitives offered would include empty, access to front and back,  
 popFront and popBack (just like BidirectionalRange), and in addition  
 properties typical of random access ranges: indexing, slicing, and  
 length. Note that the result of the indexing operator is not the same as  
 the element type of the range, as it only represents the unit of  
 encoding.

 In addition to these (and connecting the two), a VLERange would offer  
 two additional primitives:

 1. size_t stepSize(size_t offset) gives the length of the step needed to  
 skip to the next element.

 2. size_t backstepSize(size_t offset) gives the size of the _backward_  
 step that goes to the previous element.

 In both cases, offset is assumed to be at the beginning of a logical  
 element of the range.

 I suspect that a lot of functions in std.string can be written without  
 Unicode-specific knowledge just by relying on such an interface.  
 Moreover, algorithms can be generalized to other structures that use  
 variable-length encoding, such as those used in data compression. (In  
 that case, the support would be a bit array and the encoded type would  
 be ubyte.)

 Writing to such ranges is not addressed by this design. Ideas are  
 welcome.

 Adding VLERange would legitimize strings and would clarify their  
 handling, at the cost of adding one additional concept that needs to be  
 minded. Is the trade-off worthwhile?

While this makes it possible to write algorithms that only accept VLERanges, I don't think it solves the major problem with strings -- they are treated as arrays by the compiler.

I'd also rather see an indexing operation return the element type, and have a separate function to get the encoding unit. This makes more sense for generic code IMO.

I noticed you never commented on my proposed string type...

That reminds me, I should update with suggested changes and re-post it.

-Steve
Jan 11 2011
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/11/11 5:30 AM, Steven Schveighoffer wrote:
 While this makes it possible to write algorithms that only accept
 VLERanges, I don't think it solves the major problem with strings --
 they are treated as arrays by the compiler.

Except when they're not - foreach with dchar...
 I'd also rather see an indexing operation return the element type, and
 have a separate function to get the encoding unit. This makes more sense
 for generic code IMO.

But that's neither here nor there. That would return the logical element at a physical position. I am very doubtful that much generic code could work without knowing they are in fact dealing with a variable-length encoding.
 I noticed you never commented on my proposed string type...

 That reminds me, I should update with suggested changes and re-post it.

To be frank, I think it didn't mark a visible improvement. It solved some problems and brought others. There was disagreement over the offered primitives and their semantics.

That being said, it's good you are doing this work. In the best case, you could bring a compelling abstraction to the table. In the worst, you'll become as happy about D's strings as I am :o).

Andrei
Jan 11 2011
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/11/11 11:21 AM, Steven Schveighoffer wrote:
 On Tue, 11 Jan 2011 11:54:08 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 1/11/11 5:30 AM, Steven Schveighoffer wrote:
 While this makes it possible to write algorithms that only accept
 VLERanges, I don't think it solves the major problem with strings --
 they are treated as arrays by the compiler.

Except when they're not - foreach with dchar...

This solitary difference is a very thin argument -- foreach(d; byDchar(str)) would be just as good without requiring compiler help.
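
The difference under discussion can be seen in a few lines. In this sketch the first two loops use only the language's foreach over a string, while the third uses std.utf.byDchar from present-day Phobos as a stand-in for the byDchar(str) mentioned above; whether that library function matches what was meant in the thread is an assumption.

```d
import std.utf : byDchar;

void main()
{
    string s = "a\u0153";   // 'a' plus U+0153: three UTF-8 code units, two code points
    size_t units, points, viaRange;

    foreach (c; s) ++units;             // plain array iteration: c is a char
    foreach (dchar d; s) ++points;      // the compiler decodes because of the dchar type
    foreach (d; s.byDchar) ++viaRange;  // same result through a library range

    assert(units == 3);
    assert(points == 2);
    assert(viaRange == 2);
}
```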
 I'd also rather see an indexing operation return the element type, and
 have a separate function to get the encoding unit. This makes more sense
 for generic code IMO.

But that's neither here nor there. That would return the logical element at a physical position. I am very doubtful that much generic code could work without knowing they are in fact dealing with a variable-length encoding.

It depends on the function, and the way the indexing is implemented.
 I noticed you never commented on my proposed string type...

 That reminds me, I should update with suggested changes and re-post it.

To be frank, I think it didn't mark a visible improvement. It solved some problems and brought others. There was disagreement over the offered primitives and their semantics.

It is supposed to be simple, and provide the expected interface, without causing any undue performance degradation. That is, I should be able to do all the things with a replacement string type that I can with a char array today, as efficiently as I can today, except I should have to work to get at the code-units. The huge benefit is that I can say "I'm dealing with this as an array" when I know it's safe

Unfinished sentence? Anyway, for my money you just described what we have now.
 The disagreement will never be fully solved, as there is just as much
 disagreement about the current state of affairs ;) e.g. should foreach
 default to using dchar?

I disagree about the disagreement being unsolvable. I'm not rigid; if I saw a terrific abstraction in your string, I'd be all for it. It just shuffles some issues about, and although I agree it does one thing or two better than char[], at the end of the day it doesn't carry its weight.
 That being said, it's good you are doing this work. In the best case,
 you could bring a compelling abstraction to the table. In the worst,
 you'll become as happy about D's strings as I am :o).

I don't think I'll ever be 'happy' with the way strings sit in phobos currently. I typically deal in ASCII (i.e. code units), and phobos works very hard to prevent that.

I wonder if we could and should extend some of the functions in std.string to work with ubyte[]. I did add a function called representation() that I didn't document yet. Essentially representation gives you the ubyte[], ushort[], or uint[] underneath a string, with the same qualifiers. Whenever you want an algorithm to work on ASCII in earnest, you can pass representation(s) to it instead of s. If you work a lot with ASCII, an AsciiString abstraction may be a better and more likely to be successful string type. Better yet, you could simply focus on AsciiChar and then define ASCII strings as arrays of AsciiChar. Andrei
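For reference, a minimal sketch of the representation() idea described above, assuming a current Phobos where std.string.representation is available. The helper name countByte is hypothetical and only illustrates treating a string as plain code units:

```d
import std.algorithm.searching : count;
import std.string : representation;

// Hypothetical helper: counts occurrences of an ASCII byte by looking
// only at the raw code units, with no UTF-8 decoding involved.
size_t countByte(string s, char c)
{
    // For a string, representation yields immutable(ubyte)[]
    immutable(ubyte)[] bytes = s.representation;
    return bytes.count(cast(ubyte) c);
}

void main()
{
    static assert(is(typeof("abc".representation) == immutable(ubyte)[]));
    assert(countByte("hello, world", 'l') == 3);
}
```

Because the algorithm sees ubyte[] rather than a string, it is treated as an ordinary random-access array.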
Jan 11 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/13/11 8:52 AM, Steven Schveighoffer wrote:
 I see it as having two vast improvements:

 1. If we replace char[] with a specific type for string, then char[] can
 be considered a true array by phobos, and phobos can now deal with a
 char[] array without the need to cast.
 2. It protects the casual user from incorrectly using a string by making
 the default the correct API.

 Those to me are very important.

Let's take a look:

    // Incorrect string code
    void fun(string s) {
        foreach (i; 0 .. s.length) {
            writeln("The character in position ", i, " is ", s[i]);
        }
    }

    // Incorrect string_t code
    void fun(string_t!char s) {
        foreach (i; 0 .. s.codeUnits) {
            writeln("The character in position ", i, " is ", s[i]);
        }
    }

Both functions are incorrect, albeit in different ways. The only improvement I'm seeing is that the user needs to write codeUnits instead of length, which may make her think twice. Clearly, however, copiously incorrect code can be written with the proposed interface because it tries to hide the reality that underneath a variable-length encoding is being used, but doesn't hide it completely (albeit for good efficiency-related reasons).

But wait, there's less. Functions for random-access range throughout Phobos routinely assume fixed-length encoding, i.e. s[i + 1] lies next to s[i]. From a cursory look at string_t, std.range will qualify it as a RandomAccessRange without length. That's an odd beast but does not change the fixed-length encoding assumption. So you'd need to special-case algorithms for string_t, just like right now certain algorithms are specialized for string. Where's the progress?

Andrei
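To make the difference between code-unit positions and code-point positions concrete, here is a small sketch assuming UTF-8 strings and the decoding form of foreach. The helper name codePointOffsets is hypothetical:

```d
import std.range : walkLength;

// Hypothetical helper: collects the code-unit offset at which each
// code point of s starts.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    // Asking foreach for dchar decodes UTF-8; i advances by 1 to 4 code units.
    foreach (i, dchar c; s)
        offsets ~= i;
    return offsets;
}

void main()
{
    string s = "expos\u00E9!";      // U+00E9 takes two UTF-8 code units
    assert(s.length == 8);          // number of code units
    assert(s.walkLength == 7);      // number of code points
    size_t[] expected = [0, 1, 2, 3, 4, 5, 7];
    assert(codePointOffsets(s) == expected);
}
```

The offsets are not sequential: the position after the accented letter skips from 5 to 7, which is exactly where s[i] in the loops above would hand back half a character.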
Jan 13 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
 On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Let's take a look:

 // Incorrect string code
 void fun(string s) {
 foreach (i; 0 .. s.length) {
 writeln("The character in position ", i, " is ", s[i]);
 }
 }

 // Incorrect string_t code
 void fun(string_t!char s) {
 foreach (i; 0 .. s.codeUnits) {
 writeln("The character in position ", i, " is ", s[i]);
 }
 }

 Both functions are incorrect, albeit in different ways. The only
 improvement I'm seeing is that the user needs to write codeUnits
 instead of length, which may make her think twice. Clearly, however,
 copiously incorrect code can be written with the proposed interface
 because it tries to hide the reality that underneath a variable-length
 encoding is being used, but doesn't hide it completely (albeit for
 good efficiency-related reasons).

You might be looking at my previous version. The new version (recently posted) will throw an exception for that code if a multi-code-unit code-point is found.

I was looking at your latest. It's code that compiles and runs, but dynamically fails on some inputs. I agree that it's often better to fail noisily instead of silently, but in a manner of speaking the string-based code doesn't fail at all - it correctly iterates the code units of a string. This may sometimes not be what the user expected; most of the time they'd care about the code points.
 It also supports this:

 foreach(i, d; s)
 {
 writeln("The character in position ", i, " is ", d);
 }

 where i is the index (might not be sequential)

Well string supports that too, albeit with the nit that you need to specify dchar.
 But wait, there's less. Functions for random-access range throughout
 Phobos routinely assume fixed-length encoding, i.e. s[i + 1] lies next
 to s[i]. From a cursory look at string_t, std.range will qualify it as
 a RandomAccessRange without length. That's an odd beast but does not
 change the fixed-length encoding assumption. So you'd need to
 special-case algorithms for string_t, just like right now certain
 algorithms are specialized for string.

isRandomAccessRange requires hasLength (see here: http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532). This is not a random access range per that definition.

That's an interesting twist. By the way I specified length is required then because I couldn't imagine having random access into something that I can't tell the length of. Apparently I was wrong :o).
 But a string
 isn't a random access range anyways (it's specifically disallowed by
 std.range per that same reference).

It isn't and it isn't supposed to be.
 The plan is you would *not* have to special case algorithms for string_t
 as you do currently for char[]. If that's not the case, then we haven't
 achieved much. Simply put, we are separating out the strange nature of
 strings from arrays, so the exceptional treatment of them is handled by
 the type itself, not the functions using it.

That sounds reasonable. Andrei
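The range classification discussed above can be checked at compile time. This is a sketch assuming a current Phobos in which narrow strings are still decoded to dchar by the range primitives; the function name narrowStringIsRandomAccess is hypothetical:

```d
import std.range.primitives : hasLength, isBidirectionalRange,
    isRandomAccessRange;
import std.string : representation;

// Narrow strings are bidirectional ranges of dchar, without length.
static assert( isBidirectionalRange!string);
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);

// UTF-32 strings and the raw representation are fixed-width, so they
// qualify as random-access ranges.
static assert( isRandomAccessRange!dstring);
static assert( isRandomAccessRange!(typeof("".representation)));

// Hypothetical helper exposing one of the checks at run time.
bool narrowStringIsRandomAccess()
{
    return isRandomAccessRange!string;
}

void main()
{
    assert(!narrowStringIsRandomAccess());
}
```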
Jan 13 2011
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:ignon1$2p4k$1 digitalmars.com...
 This may sometimes not be what the user expected; most of the time they'd 
 care about the code points.

I dunno, spir has successfully convinced me that most of the time it's graphemes the user cares about, not code points. Using code points is just as misleading as using UTF-16 code units.
Jan 13 2011
parent reply Lutger Blijdestijn <lutger.blijdestijn gmail.com> writes:
Nick Sabalausky wrote:

 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:ignon1$2p4k$1 digitalmars.com...
 This may sometimes not be what the user expected; most of the time they'd
 care about the code points.

I dunno, spir has successfully convinced me that most of the time it's graphemes the user cares about, not code points. Using code points is just as misleading as using UTF-16 code units.

I agree. This is a very informative thread, thanks spir and everybody else.

Going back to the topic, it seems to me that a unicode string is a surprisingly complicated data structure that can be viewed from multiple types of ranges. In the light of this thread, a dchar doesn't seem like such a useful type anymore; it is still a low-level abstraction for the purpose of correctly dealing with text. Perhaps even less useful, since it gives the illusion of correctness for those who are not in the know.

The algorithms in std.string can be upgraded to work correctly with all the issues mentioned, but the generic ones in std.algorithm will just subtly do the wrong thing when presented with dchar ranges. And, as I understood it, the purpose of a VleRange was exactly to make generic algorithms just work (tm) for strings.

Is it still possible to solve this problem or are we stuck with specialized string algorithms? Would it work if VleRange of string was a bidirectional range with string slices of graphemes as the ElementType and indexing with code units? Often used string algorithms could be specialized for performance, but if not, generic algorithms would still work.
Jan 15 2011
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn 
<lutger.blijdestijn gmail.com> said:

 Nick Sabalausky wrote:
 
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:ignon1$2p4k$1 digitalmars.com...
 
 This may sometimes not be what the user expected; most of the time they'd
 care about the code points.
 

I dunno, spir has successfully convinced me that most of the time it's graphemes the user cares about, not code points. Using code points is just as misleading as using UTF-16 code units.

I agree. This is a very informative thread, thanks spir and everybody else.

Going back to the topic, it seems to me that a unicode string is a surprisingly complicated data structure that can be viewed from multiple types of ranges. In the light of this thread, a dchar doesn't seem like such a useful type anymore; it is still a low-level abstraction for the purpose of correctly dealing with text. Perhaps even less useful, since it gives the illusion of correctness for those who are not in the know.

The algorithms in std.string can be upgraded to work correctly with all the issues mentioned, but the generic ones in std.algorithm will just subtly do the wrong thing when presented with dchar ranges. And, as I understood it, the purpose of a VleRange was exactly to make generic algorithms just work (tm) for strings.

Is it still possible to solve this problem or are we stuck with specialized string algorithms? Would it work if VleRange of string was a bidirectional range with string slices of graphemes as the ElementType and indexing with code units? Often used string algorithms could be specialized for performance, but if not, generic algorithms would still work.

I have my idea.

I think a good idea is to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.

The second component would be to make the string equality operator (==) for strings compare them in their normalized form, so that ("e" with combining acute accent) == (pre-combined "é"). I think this would make D support for Unicode much more intuitive.

This implies some semantic changes, mainly that everywhere you write a "character" you must use double-quotes (string "a") instead of single quote (code point 'a'), but from the user's point of view that's pretty much all there is to change.

There'll still be plenty of room for specialized algorithms, but their purpose would be limited to optimization. Correctness would be taken care of by the basic range interface, and foreach should follow suit and iterate by grapheme by default.

I wrote this example (or something similar) earlier in this thread:

    foreach (grapheme; "exposé")
        if (grapheme == "é")
            break;

In this example, even if one of these two strings uses the pre-combined form of "é" and the other uses a combining acute accent, the equality would still hold since foreach iterates on full graphemes and == compares using normalization.

The important thing to keep in mind here is that the grapheme-splitting algorithm should be optimized for the case where there is no combining character and the compare algorithm for the case where the string is already normalized, since most strings will exhibit these characteristics.

As for ASCII, we could make it easier to use ubyte[] for it by making string literals implicitly convert to ubyte[] if all their characters are in ASCII range.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
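The two ingredients of this proposal, grapheme iteration and normalized comparison, can be approximated with library calls. The sketch below assumes a current Phobos with std.uni.byGrapheme and std.uni.normalize; the helper sameText is hypothetical and is not the built-in == being proposed:

```d
import std.range : walkLength;
import std.uni;

// Hypothetical helper: equality after canonical composition (NFC).
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed   = "expos\u00E9";   // pre-combined e with acute accent
    string decomposed = "expose\u0301";  // e followed by combining acute accent

    assert(composed != decomposed);          // plain == compares code units
    assert(sameText(composed, decomposed));  // equal once normalized

    assert(decomposed.walkLength == 7);             // code points
    assert(decomposed.byGrapheme.walkLength == 6);  // graphemes
    assert(composed.byGrapheme.walkLength == 6);
}
```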
Jan 15 2011
next sibling parent reply Lutger Blijdestijn <lutger.blijdestijn gmail.com> writes:
Michel Fortin wrote:

 On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
 <lutger.blijdestijn gmail.com> said:

 
 Is it still possible to solve this problem or are we stuck with
 specialized string algorithms? Would it work if VleRange of string was a
 bidirectional range with string slices of graphemes as the ElementType
 and indexing with code units? Often used string algorithms could be
 specialized for performance, but if not, generic algorithms would still
 work.

I have my idea. I think a good idea is to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.

Yes, this is exactly what I meant, but you are much clearer. I hope this can be made to work!
Jan 15 2011
parent reply foobar <foo bar.com> writes:
Lutger Blijdestijn Wrote:

 Michel Fortin wrote:
 
 On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
 <lutger.blijdestijn gmail.com> said:

 
 Is it still possible to solve this problem or are we stuck with
 specialized string algorithms? Would it work if VleRange of string was a
 bidirectional range with string slices of graphemes as the ElementType
 and indexing with code units? Often used string algorithms could be
 specialized for performance, but if not, generic algorithms would still
 work.

I have my idea. I think a good idea is to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.

Yes, this is exactly what I meant, but you are much clearer. I hope this can be made to work!

My two cents are against this kind of design. The "correct" approach IMO is a 'universal text' type which is a _container_ of said text. This type would provide ranges for the various abstraction levels. E.g. text.codeUnits to iterate by codeUnits.

Here's a (perhaps contrived) example: Let's say I want to find the combining marks in some text.

For instance, Hebrew uses combining marks for vowels (among other things) and they are optional in the language (There's a "full" form with vowels and a "missing" form without them). I have a Hebrew text in the "full" form and I want to strip it and convert it to the "missing" form.

How would I accomplish this with your design?
Jan 15 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-15 09:09:17 -0500, foobar <foo bar.com> said:

 Lutger Blijdestijn Wrote:
 
 Michel Fortin wrote:
 
 On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
 <lutger.blijdestijn gmail.com> said:

 
 Is it still possible to solve this problem or are we stuck with
 specialized string algorithms? Would it work if VleRange of string was a
 bidirectional range with string slices of graphemes as the ElementType
 and indexing with code units? Often used string algorithms could be
 specialized for performance, but if not, generic algorithms would still
 work.

I have my idea. I think a good idea is to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.

Yes, this is exactly what I meant, but you are much clearer. I hope this can be made to work!

My two cents are against this kind of design. The "correct" approach IMO is a 'universal text' type which is a _container_ of said text. This type would provide ranges for the various abstraction levels. E.g. text.codeUnits to iterate by codeUnits

Nothing prevents that in the design I proposed. Andrei's design already implements "str".byDchar() that would work for code points. I'd suggest changing the API to by!char(), by!wchar(), and by!dchar() for when you deal with whatever kind of code unit or code point you want. This would be mostly symmetric to what you can already do with foreach:

    foreach (char c; "hello") {}
    foreach (wchar c; "hello") {}
    foreach (dchar c; "hello") {}

    // same as:
    foreach (c; "hello".by!char()) {}
    foreach (c; "hello".by!wchar()) {}
    foreach (c; "hello".by!dchar()) {}
 Here's a (perhaps contrived) example:
 Let's say I want to find the combining marks in some text.
 
 For instance, Hebrew uses combining marks for vowels (among other 
 things) and they are optional in the language (There's a "full" form 
 with vowels and a "missing" form without them).
 I have a Hebrew text with in the "full" form and I want to strip it and 
 convert it to the "missing" form.
 
 How would I accomplish this with your design?

All you need is a range that takes a string as input and gives you code points in a decomposed form (NFD), then you use std.algorithm.filter on it:

    // original string
    auto str = "...";

    // create normalized decomposed string as a lazy range of dchar (NFD)
    auto decomposed = decompose(str);

    // filter to remove your favorite combining code point (use the hex code you want)
    auto filtered = filter!"a != 0xFABA"(decomposed);

    // turn it back in composed form (NFC), optional
    auto recomposed = compose(filtered);

    // convert back to a string (could also be wstring or dstring)
    string result = array(recomposed.by!char());

This last line is the one doing everything. All the rest just chain ranges together for doing on-the-fly decomposition, filtering, and recomposition; the last line uses that chain of ranges to fill the array.

A more naive implementation not taking advantage of code points but instead using a replacement table would also work:

    string str = "...";
    string result;
    string[string] replacements = ["é":"e"]; // change this for what you want
    foreach (grapheme; str) {
        // 'in' yields a pointer to the value, or null if the key is absent
        auto replacement = grapheme in replacements;
        if (replacement)
            result ~= *replacement;
        else
            result ~= grapheme;
    }

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
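For comparison, here is a sketch of the same decompose-and-filter pipeline using functions that exist in Phobos today. It is not the lazy decompose()/compose() design described above: normalize!NFD is eager and returns a string, and the function name stripMarks is hypothetical. It removes every combining mark rather than one chosen code point:

```d
import std.algorithm.iteration : filter;
import std.array : array;
import std.uni;
import std.utf : toUTF8;

// Hypothetical helper: decomposes s (NFD), drops all combining marks,
// and re-encodes the remaining code points as UTF-8.
string stripMarks(string s)
{
    auto decomposed = normalize!NFD(s);  // eager, yields a string
    return decomposed
        .filter!(c => !isMark(c))        // iterates by code point (dchar)
        .array                           // dchar[]
        .toUTF8;
}

void main()
{
    // Hebrew "shalom" with vowel points, written with escapes
    string full    = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD";
    string missing = "\u05E9\u05DC\u05D5\u05DD";
    assert(stripMarks(full) == missing);
    assert(stripMarks("expos\u00E9") == "expose");
}
```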
Jan 15 2011
parent reply foobar <foo bar.com> writes:
Michel Fortin Wrote:

 On 2011-01-15 09:09:17 -0500, foobar <foo bar.com> said:
 
 Lutger Blijdestijn Wrote:
 
 Michel Fortin wrote:
 
 On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
 <lutger.blijdestijn gmail.com> said:

 
 Is it still possible to solve this problem or are we stuck with
 specialized string algorithms? Would it work if VleRange of string was a
 bidirectional range with string slices of graphemes as the ElementType
 and indexing with code units? Often used string algorithms could be
 specialized for performance, but if not, generic algorithms would still
 work.

I have my idea. I think a good idea is to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.

Yes, this is exactly what I meant, but you are much clearer. I hope this can be made to work!

My two cents are against this kind of design. The "correct" approach IMO is a 'universal text' type which is a _container_ of said text. This type would provide ranges for the various abstraction levels. E.g. text.codeUnits to iterate by codeUnits

Nothing prevents that in the design I proposed. Andrei's design already implements "str".byDchar() that would work for code points. I'd suggest changing the API to by!char(), by!wchar(), and by!dchar() for when you deal with whatever kind of code unit or code point you want. This would be mostly symmetric to what you can already do with foreach:

    foreach (char c; "hello") {}
    foreach (wchar c; "hello") {}
    foreach (dchar c; "hello") {}

    // same as:
    foreach (c; "hello".by!char()) {}
    foreach (c; "hello".by!wchar()) {}
    foreach (c; "hello".by!dchar()) {}
 Here's a (perhaps contrived) example:
 Let's say I want to find the combining marks in some text.
 
 For instance, Hebrew uses combining marks for vowels (among other 
 things) and they are optional in the language (There's a "full" form 
 with vowels and a "missing" form without them).
 I have a Hebrew text with in the "full" form and I want to strip it and 
 convert it to the "missing" form.
 
 How would I accomplish this with your design?

All you need is a range that takes a string as input and gives you code points in a decomposed form (NFD), then you use std.algorithm.filter on it:

    // original string
    auto str = "...";

    // create normalized decomposed string as a lazy range of dchar (NFD)
    auto decomposed = decompose(str);

    // filter to remove your favorite combining code point (use the hex code you want)
    auto filtered = filter!"a != 0xFABA"(decomposed);

    // turn it back in composed form (NFC), optional
    auto recomposed = compose(filtered);

    // convert back to a string (could also be wstring or dstring)
    string result = array(recomposed.by!char());

This last line is the one doing everything. All the rest just chain ranges together for doing on-the-fly decomposition, filtering, and recomposition; the last line uses that chain of ranges to fill the array.

A more naive implementation not taking advantage of code points but instead using a replacement table would also work:

    string str = "...";
    string result;
    string[string] replacements = ["é":"e"]; // change this for what you want
    foreach (grapheme; str) {
        // 'in' yields a pointer to the value, or null if the key is absent
        auto replacement = grapheme in replacements;
        if (replacement)
            result ~= *replacement;
        else
            result ~= grapheme;
    }

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Ok, I guess I missed the "byDchar()" method. I envisioned the same algorithm looking like this:

    // original string
    string str = "...";

    // create normalized decomposed string as a lazy range of dchar (NFD)
    // Note: explicitly specify code points range:
    auto decomposed = decompose(str.codePoints);

    // filter to remove your favorite combining code point
    auto filtered = filter!"a != 0xFABA"(decomposed);

    // turn it back in composed form (NFC), optional
    auto recomposed = compose(filtered);

    // convert back to a string
    // Note: a string type can be constructed from a range of code points
    string result = string(recomposed);

The difference is that a string type is distinct from the intermediate code point ranges (This happens in your design too albeit in a less obvious way to the user). There is string specific code. Why not encapsulate it in a string type instead of forcing the user to use complex APIs with templates everywhere?
Jan 15 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-15 10:59:52 -0500, foobar <foo bar.com> said:

 Ok, I guess I missed the "byDchar()" method.
 I envisioned the same algorithm looking like this:
 
 // original string
 string str = "...";
 
 // create normalized decomposed string as a lazy range of dchar (NFD)
 // Note: explicitly specify code points range:
 auto decomposed = decompose(str.codePoints);
 
 // filter to remove your favorite combining code point
 auto filtered = filter!"a != 0xFABA"(decomposed);
 
 // turn it back in composed form (NFC), optional
 auto recomposed = compose(filtered);
 
 // convert back to a string
 // Note: a string type can be constructed from a range of code points
 string result = string(recomposed);
 
 The difference is that a string type is distinct from the intermediate 
 code point ranges (This happens in your design too albeit in a less 
 obvious way to the user). There is string specific code. Why not 
 encapsulate it in a string type instead of forcing the user to use 
 complex APIs with templates everywhere?

What I don't understand is in what way using a string type would make the API less complex and use less templates?

More generally, in what way would your string type behave differently than char[], wchar[], and dchar[]? I think we need to clarify how you expect your string type to behave before I can answer anything. I mean, beside cosmetic changes such as having a codePoint property instead of by!dchar or byDchar, what is your string type doing differently?

The above algorithm is already possible with strings as they are, provided you implement the 'decompose' and the 'compose' function returning a range. In fact, you only changed two things in it: by!dchar became codePoints, and array() became string(). Surely you're expecting more benefits than that.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 15 2011
parent foobar <foo bar.com> writes:
Michel Fortin Wrote:

 What I don't understand is in what way using a string type would make 
 the API less complex and use less templates?
 
 More generally, in what way would your string type behave differently 
 than char[], wchar[], and dchar[]? I think we need to clarify how 
 you expect your string type to behave before I can answer anything. I 
 mean, beside cosmetic changes such as having a codePoint property 
 instead of by!dchar or byDchar, what is your string type doing 
 differently?
 
 The above algorithm is already possible with strings as they are, 
 provided you implement the 'decompose' and the 'compose' function 
 returning a range. In fact, you only changed two things in it: by!dchar 
 became codePoints, and array() became string(). Surely you're expecting 
 more benefits than that.
 
 -- 
 Michel Fortin
 michel.fortin michelf.com
 http://michelf.com/
 

First thing, the question of possibility is irrelevant since I could also write the same algorithm in brainfuck or assembly (with a lot more code). It's never a question of possibility but rather a question of ease of use for the user.

What I want is to encapsulate all the low-level implementation details in one place so that as a user I will not need to deal with this everywhere. One such detail is the encoding.

    auto text = w"whatever";
    // should be equivalent to:
    auto text = new Text("whatever", Encoding.UTF16);

Now I want to write my own string function:

    void func(Text a);
    // instead of the current:
    void func(T)(T a) if (isTextType!T); // why does the USER need all this?

Of course, the Text type would do the correct thing by default, which we both agree should be graphemes. Only if I need something advanced like in the previous algorithm, then I explicitly need to specify that I work on code points or code units.

In a sentence: "Make the common case trivial and the complex case possible". The common case is what we Humans think of as characters (graphemes) and the complex case is the encoding level.
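A rough sketch of what such a container could look like, built on ranges that exist in Phobos today. The struct Text and its member names are hypothetical and only mirror the text.codeUnits / codePoints naming used in this thread; it stores UTF-8 only and does not make graphemes the default:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical container: one stored string, three views onto it.
struct Text
{
    string data;

    auto codeUnits()  { return data.byCodeUnit; }  // raw UTF-8 code units
    auto codePoints() { return data.byDchar; }     // decoded dchar values
    auto graphemes()  { return data.byGrapheme; }  // user-perceived characters
}

void main()
{
    auto t = Text("e\u0301");  // e followed by combining acute accent
    assert(t.codeUnits.walkLength == 3);
    assert(t.codePoints.walkLength == 2);
    assert(t.graphemes.walkLength == 1);
}
```

A function taking Text needs no template constraint, which is the ease-of-use point being argued; the cost is that every generic algorithm has to be told which view to operate on.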
Jan 15 2011
prev sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:

 On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
 I have my idea.
 
 I think a good idea is to improve upon Andrei's first idea --
 which was to treat char[], wchar[], and dchar[] all as ranges of dchar
 elements -- by changing the element type to be the same as the string.
 For instance, iterating on a char[] would give you slices of char[],
 each having one grapheme.
 
 The second component would be to make the string equality operator (==)
 for strings compare them in their normalized form, so that ("e" with
 combining acute accent) == (pre-combined "é"). I think this would make
 D support for Unicode much more intuitive.
 
 This implies some semantic changes, mainly that everywhere you write a
 "character" you must use double-quotes (string "a") instead of single
 quote (code point 'a'), but from the user's point of view that's pretty
 much all there is to change.
 
 There'll still be plenty of room for specialized algorithms, but their
 purpose would be limited to optimization. Correctness would be taken
 care of by the basic range interface, and foreach should follow suit
 and iterate by grapheme by default.
 
 I wrote this example (or something similar) earlier in this thread:
 
 	foreach (grapheme; "exposé")
 		if (grapheme == "é")
 			break;
 
 In this example, even if one of these two strings use the pre-combined
 form of "é" and the other uses a combining acute accent, the equality
 would still hold since foreach iterates on full graphemes and ==
 compares using normalization.

I think that that would cause definite problems. Having the element type of the range be the same type as the range seems like it could cause a lot of problems in std.algorithm and the like, and it's _definitely_ going to confuse programmers. I'd expect it to be highly bug-prone. They _need_ to be separate types.

I remember that someone already complained about this issue because he had a tree of ranges, and Andrei said he would take a look at this problem eventually. Perhaps now would be a good time.
 Now, given that dchar can't actually work completely as an element 
 type, you'd either need the string type to be a new type or the element 
 type to be a new type. So, either the string type has char[], wchar[], 
 or dchar[] for its element type, or char[], wchar[], and dchar[] have 
 something like uchar as their element type, where uchar is a struct 
 which contains a char[], wchar[], or dchar[]
 which holds a single grapheme.

Having a new type for grapheme would work too. My preference still goes to reusing the string type because it makes the semantics simpler to understand, especially when comparing graphemes with literals.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 15 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-15 23:58:30 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:

 On Saturday 15 January 2011 20:45:53 Michel Fortin wrote:
 On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:
 On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
 I have my idea.
 
 I think a good idea is to improve upon Andrei's first idea --
 which was to treat char[], wchar[], and dchar[] all as ranges of dchar
 elements -- by changing the element type to be the same as the string.
 For instance, iterating on a char[] would give you slices of char[],
 each having one grapheme.
 
 The second component would be to make the string equality operator (==)
 for strings compare them in their normalized form, so that ("e" with
 combining acute accent) == (pre-combined "é"). I think this would make
 D support for Unicode much more intuitive.
 
 This implies some semantic changes, mainly that everywhere you write a
 "character" you must use double-quotes (string "a") instead of single
 quote (code point 'a'), but from the user's point of view that's pretty
 much all there is to change.
 
 There'll still be plenty of room for specialized algorithms, but their
 purpose would be limited to optimization. Correctness would be taken
 care of by the basic range interface, and foreach should follow suit
 and iterate by grapheme by default.
 
 I wrote this example (or something similar) earlier in this thread:
 	foreach (grapheme; "exposé")
 		if (grapheme == "é")
 			break;

 In this example, even if one of these two strings use the pre-combined
 form of "é" and the other uses a combining acute accent, the equality
 would still hold since foreach iterates on full graphemes and ==
 compares using normalization.

I think that that would cause definite problems. Having the element type of the range be the same type as the range seems like it could cause a lot of problems in std.algorithm and the like, and it's _definitely_ going to confuse programmers. I'd expect it to be highly bug-prone. They _need_ to be separate types.

I remember that someone already complained about this issue because he had a tree of ranges, and Andrei said he would take a look at this problem eventually. Perhaps now would be a good time.
 Now, given that dchar can't actually work completely as an element
 type, you'd either need the string type to be a new type or the element
 type to be a new type. So, either the string type has char[], wchar[],
 or dchar[] for its element type, or char[], wchar[], and dchar[] have
 something like uchar as their element type, where uchar is a struct
 which contains a char[], wchar[], or dchar[]
 which holds a single grapheme.

Having a new type for grapheme would work too. My preference still goes to reusing the string type because it makes the semantic simpler to understand, especially when comparing graphemes with literals.

If a character literal actually became a grapheme instead of a dchar, then that would likely solve that issue. But I fear that the semantics of having a range be its own element type actually make understanding it _harder_, not simpler. Being forced to compare string literals against what should be a character would definitely confuse programmers.

Character literals are treated as simple numbers by the language. By that I mean that you can write 'b' - 'a' == 1 and it'll be true. Arithmetic makes absolutely no sense for graphemes. If you want a special literal for graphemes, I'm afraid you'll have to invent something new. And at this point, why not use a string?
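
To make the point about character literals concrete, here is a minimal sketch. The helper name `letterIndex` is made up for illustration; everything else is plain language behaviour.

```d
import std.stdio : writeln;

// Hypothetical helper: position of a lowercase ASCII letter in the alphabet.
// This only works because a character literal is an integer code point value.
int letterIndex(dchar c)
{
    return cast(int)(c - 'a');
}

void main()
{
    // Character literals take part in ordinary integer arithmetic.
    static assert('b' - 'a' == 1);

    // An ASCII literal is typed as a single UTF-8 code unit.
    static assert(is(typeof('a') == char));

    writeln(letterIndex('d')); // prints 3
}
```

Nothing comparable can be defined for a grapheme such as 'e' plus a combining accent, since it is a sequence of code points rather than one number.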
 Making a new character or grapheme type which represented a grapheme 
 would be _far_ simpler to understand IMO. However, making it work 
 really well would likely require that the compiler know about the 
 grapheme type like it knows about dchar.

I'm looking for a simple solution. One that doesn't involve inventing a new grapheme literal syntax or adding new types the compiler must know about. I'm not really opposed to any of this, but the more complicated the solution is, the less likely it is to be adopted.

All I'm asking is that Unicode strings behave as Unicode strings should behave. Making iteration use graphemes by default and string comparison use the normalized form by default seems like a simple way to achieve that goal. The most important thing is not the implementation, but that the default behaviour be the right behaviour.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 15 2011
parent reply foobar <foo bar.com> writes:
Michel Fortin Wrote:


 Character literals are treated as simple numbers by the language. By 
 that I mean that you can write 'b' - 'a' == 1 and it'll be true. 
 Arithmetic makes absolutely no sense for graphemes. If you want a 
 special literal for graphemes, I'm afraid you'll have to invent 
 something new. And at this point, why not use a string?
 
 
 Making a new character or grapheme type which represented a grapheme 
 would be _far_ simpler to understand IMO. However, making it work 
 really well would likely require that the compiler know about the 
 grapheme type like it knows about dchar.

I'm looking for a simple solution. One that doesn't involve inventing a new grapheme literal syntax or adding new types the compiler must know about. I'm not really opposed to any of this, but the more complicated the solution is, the less likely it is to be adopted.

All I'm asking is that Unicode strings behave as Unicode strings should behave. Making iteration use graphemes by default and string comparison use the normalized form by default seems like a simple way to achieve that goal. The most important thing is not the implementation, but that the default behaviour be the right behaviour.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

I understand your concern regarding a simpler implementation. You want to minimize the disruption caused by the proposed change.

I'd argue that creating a specialized string type as Steve suggests makes integration *easier*. Your suggestion requires that foreach be changed to default to grapheme. I agree that this can be done because it will not break silently, but with Steve's string type this is unnecessary since the type itself would provide a grapheme range interface and the compiler doesn't need to know about this type at all. string becomes a regular library type.

Of course, the type should support:
	string foo = "bar";
by making an implicit conversion from current arrays (to minimize compiler changes).

The only disruption as far as I can tell would be using 'a' type literals instead of "a", but that will come up in compilation after string defaults to the new type.

Also, all occurrences of:
	string foo = ...;
	foreach (c; foo) {...} // c is now a grapheme
will now do the correct thing by default.
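
A rough sketch of what such a library type could look like, under stated assumptions: the type name `Text` and its members `graphemes` and `codeUnits` are invented for illustration, and `std.uni.byGrapheme` is a Phobos facility that only became available after this thread. It is not Steve's actual design.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical library string type; name and members are illustrative only.
struct Text
{
    private string data;

    // Allows `Text foo = "bar";` without compiler support.
    this(string s) { data = s; }

    // Default view: a range of graphemes (user-perceived characters).
    auto graphemes() { return data.byGrapheme; }

    // Explicit, opt-in access to the underlying UTF-8 code units.
    string codeUnits() { return data; }
}

void main()
{
    Text foo = "expose\u0301"; // "exposé" with a combining acute accent

    size_t n = 0;
    foreach (g; foo.graphemes) // g is a grapheme, not a code unit
        ++n;

    assert(n == 6);                    // six user-perceived characters
    assert(foo.codeUnits.length == 8); // eight UTF-8 code units underneath
}
```

The compiler only sees an ordinary struct here; the grapheme behaviour comes entirely from the range returned by `graphemes`.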
Jan 15 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-16 02:11:14 -0500, foobar <foo bar.com> said:

 I understand your concern regarding a simpler implementation. You want
 to minimize the disruption caused by the proposed change.
 
 I'd argue that creating a specialized string type as Steve suggests 
 makes integration *easier*. Your suggestion requires that foreach will 
 be changed to default to grapheme. I agree that this can be done 
 because it will not break silently but with Steve's string type this is 
 unnecessary since the type itself would provide a grapheme range 
 interface and the compiler doesn't need to know about this type at all. 
 string becomes a regular library type.
 
 Of course, the type should support:
 string foo = "bar";
 by making an implicit conversion from current arrays (to minimize 
 compiler changes)

It should also work for: auto foo = "bar";
 The only disruption as far as I can tell would be using 'a' type 
 literals instead of "a" but that will come up in compilation after 
 string defaults to the new type.

You say "after string defaults to the new type", but I don't think this change to the language will pass. It'll break TDPL for one thing, so it's surely out for D2. And I somewhat doubt it's low-level enough for Walter's taste. I don't care much if the default type is an array or not, I just want the default type to work properly as a Unicode string. The very small participation to this thread from the key decision makers (Andrei and Walter) worries me however. I'm not even sure we'll achieve that goal. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 16 2011
parent foobar <foo bar.com> writes:
Michel Fortin Wrote:

 On 2011-01-16 02:11:14 -0500, foobar <foo bar.com> said:
 
 I understand your concern regarding a simpler implementation. You want
 to minimize the disruption caused by the proposed change.
 
 I'd argue that creating a specialized string type as Steve suggests 
 makes integration *easier*. Your suggestion requires that foreach will 
 be changed to default to grapheme. I agree that this can be done 
 because it will not break silently but with Steve's string type this is 
 unnecessary since the type itself would provide a grapheme range 
 interface and the compiler doesn't need to know about this type at all. 
 string becomes a regular library type.
 
 Of course, the type should support:
 string foo = "bar";
 by making an implicit conversion from current arrays (to minimize 
 compiler changes)

It should also work for: auto foo = "bar";

Right. This does require compiler changes.
 
 
 The only disruption as far as I can tell would be using 'a' type 
 literals instead of "a" but that will come up in compilation after 
 string defaults to the new type.

You say "after string defaults to the new type", but I don't think this change to the language will pass. It'll break TDPL for one thing, so it's surely out for D2. And I somewhat doubt it's low-level enough for Walter's taste.

string is an alias in phobos so it's more of a stdlib change but I see your point about TDPL. I did get the feeling that Andrei is willing to make a change if it proves worthwhile by preventing writing bad code (Which we both agree this change accomplishes).
 I don't care much if the default type is an array or not, I just want 
 the default type to work properly as a Unicode string. The very small 
 participation to this thread from the key decision makers (Andrei and 
 Walter) worries me however. I'm not even sure we'll achieve that goal.
 
 

Andrei did take part and even asked for links that explain the subject. Perhaps the quiet is due to the mastermind doing research on the topic rather than reluctance to make any changes. :)
 -- 
 Michel Fortin
 michel.fortin michelf.com
 http://michelf.com/
 

Jan 16 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
 On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
 <lutger.blijdestijn gmail.com> said:
 Nick Sabalausky wrote:
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:ignon1$2p4k$1 digitalmars.com...
 This may sometimes not be what the user expected; most of the time
 they'd care about the code points.

I dunno, spir has successfully convinced me that most of the time it's graphemes the user cares about, not code points. Using code points is just as misleading as using UTF-16 code units.

I agree. This is a very informative thread, thanks spir and everybody else.

Going back to the topic, it seems to me that a unicode string is a surprisingly complicated data structure that can be viewed from multiple types of ranges. In the light of this thread, a dchar doesn't seem like such a useful type anymore, it is still a low level abstraction for the purpose of correctly dealing with text. Perhaps even less useful, since it gives the illusion of correctness for those who are not in the know.

The algorithms in std.string can be upgraded to work correctly with all the issues mentioned, but the generic ones in std.algorithm will just subtly do the wrong thing when presented with dchar ranges. And, as I understood it, the purpose of a VleRange was exactly to make generic algorithms just work (tm) for strings.

Is it still possible to solve this problem or are we stuck with specialized string algorithms? Would it work if VleRange of string was a bidirectional range with string slices of graphemes as the ElementType and indexing with code units? Often used string algorithms could be specialized for performance, but if not, generic algorithms would still work.

I have my idea.

I think it'd be a good idea to improve upon Andrei's first idea -- which was to treat char[], wchar[], and dchar[] all as ranges of dchar elements -- by changing the element type to be the same as the string. For instance, iterating on a char[] would give you slices of char[], each having one grapheme.

The second component would be to make the string equality operator (==) for strings compare them in their normalized form, so that ("e" with combining acute accent) == (pre-combined "é"). I think this would make D support for Unicode much more intuitive.

 This implies some semantic changes, mainly that everywhere you write a
 "character" you must use double-quotes (string "a") instead of single
 quote (code point 'a'), but from the user's point of view that's pretty
 much all there is to change.

 There'll still be plenty of room for specialized algorithms, but their
 purpose would be limited to optimization. Correctness would be taken
 care of by the basic range interface, and foreach should follow suit
 and iterate by grapheme by default.

 I wrote this example (or something similar) earlier in this thread:

 	foreach (grapheme; "exposé")
 		if (grapheme == "é")
 			break;

 In this example, even if one of these two strings use the pre-combined
 form of "é" and the other uses a combining acute accent, the equality
 would still hold since foreach iterates on full graphemes and ==
 compares using normalization.

 The important thing to keep in mind here is that the grapheme-splitting
 algorithm should be optimized for the case where there is no combining
 character and the compare algorithm for the case where the string is
 already normalized, since most strings will exhibit these
 characteristics.

 As for ASCII, we could make it easier to use ubyte[] for it by making
 string literals implicitly convert to ubyte[] if all their characters
 are in ASCII range.

I think that that would cause definite problems. Having the element type of the range be the same type as the range seems like it could cause a lot of problems in std.algorithm and the like, and it's _definitely_ going to confuse programmers. I'd expect it to be highly bug-prone. They _need_ to be separate types.

Now, given that dchar can't actually work completely as an element type, you'd either need the string type to be a new type or the element type to be a new type. So, either the string type has char[], wchar[], or dchar[] for its element type, or char[], wchar[], and dchar[] have something like uchar as their element type, where uchar is a struct which contains a char[], wchar[], or dchar[] which holds a single grapheme.

I think that it's a great idea that programmers try to use substrings and slices rather than dchar, but making the element type a slice of the original type sounds like it's really asking for trouble.

- Jonathan M Davis
Jan 15 2011
prev sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-13 15:51:00 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
 On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Let's take a look:
 
 // Incorrect string code
 void fun(string s) {
 foreach (i; 0 .. s.length) {
 writeln("The character in position ", i, " is ", s[i]);
 }
 }
 
 // Incorrect string_t code
 void fun(string_t!char s) {
 foreach (i; 0 .. s.codeUnits) {
 writeln("The character in position ", i, " is ", s[i]);
 }
 }
 
 Both functions are incorrect, albeit in different ways. The only
 improvement I'm seeing is that the user needs to write codeUnits
 instead of length, which may make her think twice. Clearly, however,
 copiously incorrect code can be written with the proposed interface
 because it tries to hide the reality that underneath a variable-length
 encoding is being used, but doesn't hide it completely (albeit for
 good efficiency-related reasons).

You might be looking at my previous version. The new version (recently posted) will throw an exception for that code if a multi-code-unit code-point is found.

I was looking at your latest. It's code that compiles and runs, but dynamically fails on some inputs. I agree that it's often better to fail noisily instead of silently, but in a manner of speaking the string-based code doesn't fail at all - it correctly iterates the code units of a string. This may sometimes not be what the user expected; most of the time they'd care about the code points.

That's forgetting that most of the time people care about graphemes (user-perceived characters), not code points.
 It also supports this:
 
 foreach(i, d; s)
 {
 writeln("The character in position ", i, " is ", d);
 }
 
 where i is the index (might not be sequential)

Well string supports that too, albeit with the nit that you need to specify dchar.

Except it breaks with combining characters. For instance, take the string "t̃", which is two code points -- 't' followed by combining tilde (U+0303) -- and you'll get the following output:

	The character in position 0 is t
	The character in position 1 is ̃

(Note that the tilde becomes combined with the preceding space character.)

The conception of character that normal people have does not match the notion of code points when combining characters enters the equation.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
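
The behaviour described above can be reproduced with a short program. This is a sketch for illustration; the helper `codePointOffsets` is a made-up name, and the string is written with a `\u0303` escape so the combining tilde is visible in the source.

```d
import std.stdio : writeln;

// Hypothetical helper: the code unit index at which foreach-with-dchar
// reports each decoded code point.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    foreach (i, dchar d; s) // decodes UTF-8; i is the code unit index
        offsets ~= i;
    return offsets;
}

void main()
{
    string s = "t\u0303"; // 't' followed by U+0303 COMBINING TILDE

    foreach (i, dchar d; s)
        writeln("The character in position ", i, " is ", d);

    // One user-perceived character, but two code points in three code units.
    size_t[] expected = [0, 1];
    assert(codePointOffsets(s) == expected);
    assert(s.length == 3);
}
```

The loop body runs twice even though a reader sees a single character.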
Jan 13 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/13/11 7:09 PM, Michel Fortin wrote:
 On 2011-01-13 15:51:00 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
 On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Let's take a look:

 // Incorrect string code
 void fun(string s) {
 foreach (i; 0 .. s.length) {
 writeln("The character in position ", i, " is ", s[i]);
 }
 }

 // Incorrect string_t code
 void fun(string_t!char s) {
 foreach (i; 0 .. s.codeUnits) {
 writeln("The character in position ", i, " is ", s[i]);
 }
 }

 Both functions are incorrect, albeit in different ways. The only
 improvement I'm seeing is that the user needs to write codeUnits
 instead of length, which may make her think twice. Clearly, however,
 copiously incorrect code can be written with the proposed interface
 because it tries to hide the reality that underneath a variable-length
 encoding is being used, but doesn't hide it completely (albeit for
 good efficiency-related reasons).

You might be looking at my previous version. The new version (recently posted) will throw an exception for that code if a multi-code-unit code-point is found.

I was looking at your latest. It's code that compiles and runs, but dynamically fails on some inputs. I agree that it's often better to fail noisily instead of silently, but in a manner of speaking the string-based code doesn't fail at all - it correctly iterates the code units of a string. This may sometimes not be what the user expected; most of the time they'd care about the code points.

That's forgetting that most of the time people care about graphemes (user-perceived characters), not code points.

I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).
 It also supports this:

 foreach(i, d; s)
 {
 writeln("The character in position ", i, " is ", d);
 }

 where i is the index (might not be sequential)

Well string supports that too, albeit with the nit that you need to specify dchar.

Except it breaks with combining characters. For instance, take the string "t̃", which is two code points -- 't' followed by combining tilde (U+0303) -- and you'll get the following output: The character in position 0 is t The character in position 1 is ̃ (Note that the tilde becomes combined with the preceding space character.) The conception of character that normal people have does not match the notion of code points when combining characters enters the equation.

This might be a good time to see whether we need to address graphemes systematically. Could you please post a few links that would educate me and others in the mysteries of combining characters? Thanks, Andrei
Jan 13 2011
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:igoj6s$17r6$1 digitalmars.com...
 I'm not so sure about that. What do you base this assessment on? Denis 
 wrote a library that according to him does grapheme-related stuff nobody 
 else does. So apparently graphemes is not what people care about (although 
 it might be what they should care about).

It's what they want, they just don't know it. Graphemes are what many people *think* code points are.
 This might be a good time to see whether we need to address graphemes 
 systematically. Could you please post a few links that would educate me 
 and others in the mysteries of combining characters?

Maybe someone else has a link to an explanation (I don't), but it's basically just this:

Three levels of abstraction from lowest to highest:
- Code Unit (ie, encoding)
- Code Point (ie, what Unicode assigns distinct numbers to)
- Grapheme (ie, what we think of as a "character")

A code-point can be made up of one or more code-units. Likewise, a grapheme can be made up of one or more code-points.

There are (at least) two types of code points:
- Regular ones, such as letters, digits, and punctuation.
- "Combining Characters", such as accent marks (or if you're familiar with Japanese, the little things in the upper-right corner that change an "s" to a "z" or an "h" to a "p". Or like German's umlaut - the two dots above a vowel). Ie, things that are not characters in their own right, but merely modify other characters. These can often (always?) be thought of as being like overlays.

If a code point representing a "combining character" exists in a string, then instead of being displayed as a character it merely modifies whatever code-point came before it.

So, for instance, if you want to store the German word for five (in all lower-case), there are two ways to do it:

[ 'f', {u with the umlaut}, 'n', 'f' ]

Or:

[ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes.

Note that while some characters exist in pre-combined form (such as the {u with the umlaut} above), legend has it there are others that can only be represented using a combining character.

It's also my understanding, though I'm not certain, that sometimes multiple combining characters can be used together on the same "root" character.

Caveat: There may very well be further complications that I'm not aware of. Heck, knowing Unicode, there probably are.
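
The three levels can be counted for both spellings of the word. A minimal sketch, assuming `std.uni.byGrapheme` and `std.utf.byDchar` from Phobos releases later than this thread; the `Counts` struct and `countLevels` function are names invented for the example.

```d
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;
import std.utf : byDchar;

// Sizes of one string at the three levels of abstraction.
struct Counts
{
    size_t codeUnits;  // UTF-8 bytes
    size_t codePoints; // numbers assigned by Unicode
    size_t graphemes;  // what a reader thinks of as characters
}

Counts countLevels(string s)
{
    return Counts(s.length, s.byDchar.walkLength, s.byGrapheme.walkLength);
}

void main()
{
    string precomposed = "f\u00FCnf"; // u-with-umlaut as one code point
    string combining = "fu\u0308nf";  // 'u' + U+0308 COMBINING DIAERESIS

    foreach (s; [precomposed, combining])
    {
        auto c = countLevels(s);
        writefln("%s code units, %s code points, %s graphemes",
                 c.codeUnits, c.codePoints, c.graphemes);
    }

    // Same four letters either way; only the lower levels differ.
    assert(countLevels(precomposed) == Counts(5, 4, 4));
    assert(countLevels(combining) == Counts(6, 5, 4));
}
```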
Jan 13 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/13/11 10:26 PM, Nick Sabalausky wrote:
[snip]
 [ 'f', {u with the umlaut}, 'n', 'f' ]

 Or:

 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.

 Note that while some characters exist in pre-combined form (such as the {u
 with the umlaut} above), legend has it there are others that can only be
 represented using a combining character.

 It's also my understanding, though I'm not certain, that sometimes multiple
 combining characters can be used together on the same "root" character.

Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point? Andrei
Jan 13 2011
parent reply "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:igoqrm$1n5r$1 digitalmars.com...
 On 1/13/11 10:26 PM, Nick Sabalausky wrote:
 [snip]
 [ 'f', {u with the umlaut}, 'n', 'f' ]

 Or:

 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.

 Note that while some characters exist in pre-combined form (such as the 
 {u
 with the umlaut} above), legend has it there are others that can only be
 represented using a combining character.

 It's also my understanding, though I'm not certain, that sometimes 
 multiple
 combining characters can be used together on the same "root" character.

Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?

My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain. FWIW, the Wikipedia article might help, or at least link to other things that might help: http://en.wikipedia.org/wiki/Combining_character Michel or spir might have better links though.
Jan 13 2011
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:igori7$1ovh$1 digitalmars.com...
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
 news:igoqrm$1n5r$1 digitalmars.com...
 On 1/13/11 10:26 PM, Nick Sabalausky wrote:
 [snip]
 [ 'f', {u with the umlaut}, 'n', 'f' ]

 Or:

 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.

 Note that while some characters exist in pre-combined form (such as the 
 {u
 with the umlaut} above), legend has it there are others that can only be
 represented using a combining character.

 It's also my understanding, though I'm not certain, that sometimes 
 multiple
 combining characters can be used together on the same "root" character.

Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?

My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain. FWIW, the Wikipedia article might help, or at least link to other things that might help: http://en.wikipedia.org/wiki/Combining_character Michel or spir might have better links though.

Heh, as if that wasn't bad enough, there's also digraphs which, from what I can tell, seem to be single code-points that represent more than one glyph/character/grapheme: http://en.wikipedia.org/wiki/Digraph_(orthography)#Digraphs_in_Unicode This page may be helpful too: http://en.wikipedia.org/wiki/Precomposed_character
Jan 13 2011
parent Daniel Gibson <metalcaedes gmail.com> writes:
Am 14.01.2011 08:00, schrieb Nick Sabalausky:
 "Nick Sabalausky"<a a.a>  wrote in message
 news:igori7$1ovh$1 digitalmars.com...
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:igoqrm$1n5r$1 digitalmars.com...
 On 1/13/11 10:26 PM, Nick Sabalausky wrote:
 [snip]
 [ 'f', {u with the umlaut}, 'n', 'f' ]

 Or:

 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.

 Note that while some characters exist in pre-combined form (such as the
 {u
 with the umlaut} above), legend has it there are others that can only be
 represented using a combining character.

 It's also my understanding, though I'm not certain, that sometimes
 multiple
 combining characters can be used together on the same "root" character.

Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?

My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain. FWIW, the Wikipedia article might help, or at least link to other things that might help: http://en.wikipedia.org/wiki/Combining_character Michel or spir might have better links though.

Heh, as if that wasn't bad enough, there's also digraphs which, from what I can tell, seem to be single code-points that represent more than one glyph/character/grapheme: http://en.wikipedia.org/wiki/Digraph_(orthography)#Digraphs_in_Unicode This page may be helpful too: http://en.wikipedia.org/wiki/Precomposed_character

OMG, this is really fucked up. Can't we just go back to 8bit charsets like ISO 8859-* etc? :/
Jan 14 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday 14 January 2011 04:47:59 Steven Schveighoffer wrote:
 On Fri, 14 Jan 2011 01:44:19 -0500, Nick Sabalausky <a a.a> wrote:
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:igoqrm$1n5r$1 digitalmars.com...
 
 On 1/13/11 10:26 PM, Nick Sabalausky wrote:
 [snip]
 
 [ 'f', {u with the umlaut}, 'n', 'f' ]
 
 Or:
 
 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
 
 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.
 
 Note that while some characters exist in pre-combined form (such as the
 {u
 with the umlaut} above), legend has it there are others that can only
 be
 represented using a combining character.
 
 It's also my understanding, though I'm not certain, that sometimes
 multiple
 combining characters can be used together on the same "root" character.

Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?

My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain. FWIW, the Wikipedia article might help, or at least link to other things that might help: http://en.wikipedia.org/wiki/Combining_character

http://en.wikipedia.org/wiki/Unicode_normalization

Linked from that page, the normalization process is probably something we need to look at.

Using decomposed canonical form would mean we need more state than just which code unit we are on, plus it creates more likelihood that a match will be found with part of a grapheme (spir or Michel brought it up earlier). So I think the correct case is to use composed canonical form. This is after just reading that page, so maybe I'm missing something.

Non-composable combinations would be a problem. The string range is formed on the basis that the element type is a dchar. If there are combinations that cannot be composed into a single dchar, then the element type has to be a dchar array (or some other type which contains all the info). The other option is to simply leave them decomposed. Then you risk things like partial matches.

I'm leaning towards a solution like this: While iterating a string, it should output dchars in normalized composed form. But a specialized comparison function should be used when doing things like searches or regex, because it might not be possible to compose two combining characters.

The drawback to this is that a dchar might not be able to represent a grapheme (only if it cannot be composed), but I think it's too much of a hit in complexity and performance to make the element type of a string larger than a dchar.

Well, there's plenty in std.string that already deals in strings rather than dchar, and for the most part, any case where you couldn't fit a grapheme in a dchar could be covered by using a string.
 Those who wish to work with a more comprehensive string type can use a
 more complex string type such as the one created by spir.
 
 Does that sound reasonable?

We really should have something along those lines it seems. From what little _I_ know, the basic approach that you suggest seems like the correct one, but perhaps someone more knowledgeable will be able to come up with a reason why it's not a good idea. Certainly, I think that any solution that I'd come up with would be similar to what you're suggesting. - Jonathan M Davis
Jan 14 2011
prev sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-14 01:44:19 -0500, "Nick Sabalausky" <a a.a> said:

 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:igoqrm$1n5r$1 digitalmars.com...
 Thanks. One further question is: in the above example with u-with-umlaut,
 there is one code point that corresponds to the entire combination. Are
 there combinations that do not have a unique code point?

My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain.

Correct, there are a lot of combinations with no pre-combined form. This should be no surprise given that you can apply any number of combining marks to any character.

	mythical 7 with an umlaut: 7̈
	mythical 7 with umlaut, ring above, and acute accent: 7̈̊́

I can't guarantee your news reader will display the above correctly, but it works as described in mine (Unison on Mac OS X). In fact, it should work in all Cocoa-based applications. This probably includes iOS-based devices too, but I haven't tested there.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
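The claim about missing pre-combined forms can be checked with a small sketch. It assumes a Phobos release whose std.uni provides `normalize` and the `NFC` form (newer than the library discussed in this thread); the helper name `nfcCodePoints` is made up for illustration.

```d
import std.range : walkLength;
import std.uni : NFC, normalize;

/// Number of code points left after NFC (canonical composition).
/// Hypothetical helper, not part of Phobos.
size_t nfcCodePoints(string s)
{
    // walkLength over a string decodes UTF-8, so it counts code points.
    return normalize!NFC(s).walkLength;
}

void main()
{
    // 'u' + COMBINING DIAERESIS has a pre-combined form (U+00FC).
    assert(nfcCodePoints("u\u0308") == 1);
    // '7' + COMBINING DIAERESIS has none, so both code points remain.
    assert(nfcCodePoints("7\u0308") == 2);
}
```

If this holds, a range that yields one dchar per element cannot represent the second case as a single element, whichever normalization form is chosen.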
Jan 14 2011
parent Gianluigi Rubino <gianluigi{ }grsoft.org> writes:
Michel Fortin <michel.fortin michelf.com> wrote:

 
 	mythical 7 with an umlaut: 7̈
 	mythical 7 with umlaut, ring above, and acute accent: 7̈̊́
 
 I can't guarantee your news reader will display the above correctly, but
 it works as described in mine (Unison on Mac OS X). In fact, it should
 work in all Cocoa-based applications. This probably includes iOS-based
 devices too, but I haven't tested there.
 

Gianluigi
Jan 14 2011
prev sibling next sibling parent Daniel Gibson <metalcaedes gmail.com> writes:
Am 14.01.2011 07:26, schrieb Nick Sabalausky:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:igoj6s$17r6$1 digitalmars.com...
 I'm not so sure about that. What do you base this assessment on? Denis
 wrote a library that according to him does grapheme-related stuff nobody
 else does. So apparently graphemes is not what people care about (although
 it might be what they should care about).

It's what they want, they just don't know it. Graphemes are what many people *think* code points are.

Agreed. Up until spir mentioned graphemes in this newsgroup I always thought that one Unicode code point == one character on the screen. I guess in the majority of use cases you want to operate on user perceived characters.
Jan 14 2011
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
 Those *both* get rendered exactly the same, and both represent the same 
 four-letter sequence. In the second example, the 'u' and the {umlaut 
 combining character} combine to form one grapheme. The f's and n's just 
 happen to be single-code-point graphemes.

I know some German, and to the best of my knowledge there are zero combining characters for it. The umlauts and the ß both have their own code points.
 legend has it there are others that can only be 
 represented using a combining character.

??? I've never seen or heard of any. Not even in the old script that was in common use in Germany until after WW2.
Jan 15 2011
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/13/11 7:09 PM, Michel Fortin wrote:
 That's forgetting that most of the time people care about graphemes
 (user-perceived characters), not code points.

I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).

Apple implemented all these things in the NSString class in Cocoa. They did all this work on Unicode at the beginning of Mac OS X, at a time when making such changes wouldn't break anything. It's a hard thing to change later when you have code that depends on the old behaviour. It's a complicated matter and not so many people will understand the issues, so it's no wonder many languages just deal with code points.
 This might be a good time to see whether we need to address graphemes 
 systematically. Could you please post a few links that would educate me 
 and others in the mysteries of combining characters?

As usual, Wikipedia offers a good summary and a couple of references. Here's the part about combining characters: <http://en.wikipedia.org/wiki/Combining_character>.

There are basically four ranges of code points which are combining:
- Combining Diacritical Marks (0300–036F)
- Combining Diacritical Marks Supplement (1DC0–1DFF)
- Combining Diacritical Marks for Symbols (20D0–20FF)
- Combining Half Marks (FE20–FE2F)

A code point followed by one or more code points in these ranges is conceptually a single character (a grapheme).

But for comparing strings correctly, you need to determine the canonical equivalence. Wikipedia describes it in the Unicode Normalization article <http://en.wikipedia.org/wiki/Unicode_normalization>. The full algorithm specification can be found here: <http://unicode.org/reports/tr15/>. The canonical form has both a composed and a decomposed variant, the first trying to use pre-combined characters when possible, the second not using any pre-combined character. Not only combining marks are concerned: there are a few single-code-point characters which have a duplicate somewhere else in the code point table.

Also, there are two normalizations: the canonical one (described above) and the compatibility one, which is more lax (making the ligature "ﬂ" equivalent to "fl", for instance). If a user searches for some text in a document, it's probably better to search using the compatibility normalization so that "ﬂower" (with ligature) and "flower" (without ligature) can match each other. If you want to search case-insensitively, then you'll need to implement the collation algorithm, but that's getting further.

If you're wondering which direction to take, this official FAQ seems like a good resource (especially the first few questions): <http://www.unicode.org/faq/normalization.html>

One important thing to note is that most of the time, strings come already in the normalized pre-composed form. So the normalization algorithm should be optimized for the case where it has nothing to do. That's what is said in section 1.3 Description of the Normalization Algorithm in the specification: <http://www.unicode.org/reports/tr15/#Description_Norm>.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
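A minimal sketch of the difference between canonical and compatibility equivalence described above, assuming a Phobos release whose std.uni offers `normalize` with the `NFC`, `NFD` and `NFKC` forms. The helper names `canonicallyEqual` and `compatiblyEqual` are illustrative only, not library functions.

```d
import std.uni : NFC, NFD, NFKC, normalize;

/// Canonical equivalence: equal once both sides are in NFC.
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

/// Compatibility equivalence: equal under the laxer NFKC form.
bool compatiblyEqual(string a, string b)
{
    return normalize!NFKC(a) == normalize!NFKC(b);
}

void main()
{
    // Pre-combined U+00FC versus 'u' followed by U+0308.
    assert("f\u00FCnf" != "fu\u0308nf");   // raw code units differ
    assert(canonicallyEqual("f\u00FCnf", "fu\u0308nf"));

    // U+FB02 is the "fl" ligature: only compatibility-equivalent to "fl".
    assert(!canonicallyEqual("\uFB02ower", "flower"));
    assert(compatiblyEqual("\uFB02ower", "flower"));

    // The decomposed canonical form spells out the combining mark.
    assert(normalize!NFD("f\u00FCnf") == "fu\u0308nf");
}
```

Normalizing both operands on every comparison is the naive approach; a search routine would more plausibly normalize the needle once and the haystack lazily.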
Jan 14 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/14/11 7:50 AM, Michel Fortin wrote:
 On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 1/13/11 7:09 PM, Michel Fortin wrote:
 That's forgetting that most of the time people care about graphemes
 (user-perceived characters), not code points.

I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).

Apple implemented all these things in the NSString class in Cocoa. They did all this work on Unicode at the beginning of Mac OS X, at a time when making such changes wouldn't break anything. It's a hard thing to change later when you have code that depends on the old behaviour. It's a complicated matter and not so many people will understand the issues, so it's no wonder many languages just deal with code points.

That's a strong indicator, but we shouldn't get ahead of ourselves.

D took a certain risk by defaulting to Unicode at a time when the dominant extant systems languages left the decision to more or less exotic libraries, Java used UTF16 de jure but UCS2 de facto, and other languages were just starting to adopt Unicode.

I think that risk was justified because the relative loss in speed was often acceptable and the gains were there. Even so, there are people in this group who protest against the loss in efficiency and argue that life is harder for ASCII users.

Switching to variable-length representation of graphemes as bundles of dchars and committing to that through and through will bring with it a larger hit in efficiency and an increased difficulty in usage. I agree that at a level that's the "right" thing to do, but I don't yet have the feeling that combining characters are a widely-adopted winner. For the most part, fonts don't support combining characters, and as a font dilettante I can tell that putting arbitrary sets of diacritics on top of characters is not what one should be doing as it'll look terrible. Unicode is begrudgingly acknowledging combining characters. Only a handful of libraries deal with them. I don't know how many applications need or care for them, versus how many applications do fine with precombined characters. I have trouble getting combining characters to combine on this machine in any of the applications I use - and this is a Mac.

Andrei
Jan 14 2011
next sibling parent reply foobar <foo bar.com> writes:
Andrei Alexandrescu Wrote:

 That's a strong indicator, but we shouldn't get ahead of ourselves.
 
 D took a certain risk by defaulting to Unicode at a time where the 
 dominant extant systems languages left the decision to more or less 
 exotic libraries, Java used UTF16 de jure but UCS2 de facto, and other 
 languages were just starting to adopt Unicode.
 
 I think that risk was justified because the relative loss in speed was 
 often acceptable and the gains were there. Even so, there are people in 
 this group who protest against the loss in efficiency and argue that life is 
 harder for ASCII users.
 
 Switching to variable-length representation of graphemes as bundles of 
 dchars and committing to that through and through will bring with it a 
 larger hit in efficiency and an increased difficulty in usage. I agree 
 that at a level that's the "right" thing to do, but I don't have yet the 
 feeling that combining characters are a widely-adopted winner. For the 
 most part, fonts don't support combining characters, and as a font 
 dilettante I can tell that putting arbitrary sets of diacritics on top 
 of characters is not what one should be doing as it'll look terrible. 
 Unicode is begrudgingly acknowledging combining characters. Only a 
 handful of libraries deal with them. I don't know how many applications 
 need or care for them, versus how many applications do fine with 
 precombined characters. I have trouble getting combining characters to 
 combine on this machine in any of the applications I use - and this is a 
 Mac.
 
 
 Andrei

Combining marks do need to be supported. Some languages use combining marks extensively (see my other post) and of course fonts for those languages exist and they do support this. Mac doesn't support all languages so I'm unsure if it's the best example out there. Here's an example of the Hebrew bible: http://www.scripture4all.org/OnlineInterlinear/Hebrew_Index.htm

Just look at any of the PDFs there to see what Hebrew looks like with all sorts of different marks. In the same vein I could have found a Japanese text with ruby (where a Kanji letter has on top of it Hiragana text that tells you how to read it).

Using a dchar as a string element instead of a proper grapheme will make it really hard to work with texts in such languages.

Regarding efficiency concerns for ASCII users: there's no rule that forces us to have a single string type; just look, for comparison, at how many integral types D has. I believe that the correct thing is to have a 'universal string' type be the default (just like int is for integral types) and provide additional types for other commonly useful encodings such as ASCII. A geneticist for instance should use a 'DNA' type that encodes the four DNA letters instead of an ASCII string or, even worse, a universal (Unicode) string.
Jan 14 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-14 18:02:32 -0500, foobar <foo bar.com> said:

 Combining marks do need to be supported.
 Some languages use combining marks extensively (see my other post) and 
 of course font for those languages exist and they do support this. Mac 
 doesn't support all languages so I'm unsure if it's the best example 
 out there.
 here's an example of the Hebrew bible:
 http://www.scripture4all.org/OnlineInterlinear/Hebrew_Index.htm
 
 Just look at the any of the PDFs there to see how Hebrew looks like 
 with all sorts of different marks.

That's a good example. Although my attempt to extract the text from the PDF wasn't perfect, I can confirm that the marks I got in the copy-pasted text are indeed combining code points, not pre-combined ones. This character for instance has a combining mark: "יָ"; and it can't be represented by a pre-combined code point because there is no pre-combined form for it (or at least I couldn't find one). Some Hebrew characters have a pre-combined form for the middle dot and some other marks, presumably the most common ones, but it was clearly insufficient for this text.
 In the same vein I could have found a Japanese text with ruby (where a 
 Kanji letter has on top of it Hiragana text that tells you how to read 
 it)

Are you sure those are combining code points? I thought ruby was a layout feature, not something part of Unicode. And I can't find combining code points that would match those. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 14 2011
parent foobar <foo bar.com> writes:
Michel Fortin Wrote:

 In the same vein I could have found a Japanese text with ruby (where a 
 Kanji letter has on top of it Hiragana text that tells you how to read 
 it)

Are you sure those are combining code points? I thought ruby was a layout feature, not something part of Unicode. And I can't find combining code points that would match those.

I've looked into this and I was wrong. Ruby is a layout feature as you said. Sorry for the confusion.
 
 -- 
 Michel Fortin
 michel.fortin michelf.com
 http://michelf.com/
 

Jan 15 2011
prev sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-14 17:04:08 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/14/11 7:50 AM, Michel Fortin wrote:
 On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 
 On 1/13/11 7:09 PM, Michel Fortin wrote:
 That's forgetting that most of the time people care about graphemes
 (user-perceived characters), not code points.

I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).

Apple implemented all these things in the NSString class in Cocoa. They did all this work on Unicode at the beginning of Mac OS X, at a time when making such changes wouldn't break anything. It's a hard thing to change later when you have code that depends on the old behaviour. It's a complicated matter and not so many people will understand the issues, so it's no wonder many languages just deal with code points.

That's a strong indicator, but we shouldn't get ahead of ourselves. D took a certain risk by defaulting to Unicode at a time when the dominant extant systems languages left the decision to more or less exotic libraries, Java used UTF16 de jure but UCS2 de facto, and other languages were just starting to adopt Unicode. I think that risk was justified because the relative loss in speed was often acceptable and the gains were there. Even so, there are people in this group who protest against the loss in efficiency and argue that life is harder for ASCII users.

Then perhaps it's time we find a way to handle non-Unicode encodings too. We can get away with treating ASCII strings as Unicode strings because of a useful property of UTF-8, but should we really do this? Also, it'd really help this discussion to have some hard numbers about the cost of decoding graphemes.
 Switching to variable-length representation of graphemes as bundles of 
 dchars and committing to that through and through will bring with it a 
 larger hit in efficiency and an increased difficulty in usage. I agree 
 that at a level that's the "right" thing to do, but I don't have yet 
 the feeling that combining characters are a widely-adopted winner. For 
 the most part, fonts don't support combining characters, and as a font 
 dilettante I can tell that putting arbitrary sets of diacritics on top 
 of characters is not what one should be doing as it'll look terrible. 
 Unicode is begrudgingly acknowledging combining characters. Only a 
 handful of libraries deal with them. I don't know how many applications 
 need or care for them, versus how many applications do fine with 
 precombined characters. I have trouble getting combining characters to 
 combine on this machine in any of the applications I use - and this is 
 a Mac.

I'm using the character palette: Edit menu >Special Characters... from there you can insert arbitrary code points. Use the search function of the palette to get code points with "combining" in their names, then click the big character box on the lower left to insert them. Have fun! -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 14 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/17/11 10:55 AM, spir wrote:
 On 01/15/2011 12:21 AM, Michel Fortin wrote:
 Also, it'd really help this discussion to have some hard numbers about
 the cost of decoding graphemes.

Text has a perf module that provides such numbers (on different stages of Text object construction). The measured algos are not yet stabilised, so said numbers regularly change, but in the right direction ;-) You can try the current version at https://bitbucket.org/denispir/denispir-d/src (the perf module is called chrono.d).

For information, recently, the cost of full text construction (decoding, normalisation (both decomp & ordering), piling) was about 5 times that of decoding alone, the heavy part (~ 70%) being piling. But Stephan just informed me about a new gain in piling I have not yet tested. This performance places our library in-between Windows native tools and ICU in terms of speed, which is imo rather good for a brand new tool written in a still unstable language.

I have carefully read your arguments on Text's approach of systematically "piling" and normalising source texts not being the right one from an efficiency point of view, even for strict use cases of universal text manipulation (because the relative space cost would indirectly cause time cost due to cache effects). Instead, you state we should "pile" and/or normalise on the fly. But I am, similarly to you, rather doubtful on this point without any numbers available. So, let us produce some benchmark results on both approaches if you like.

Congrats on this great work. The initial numbers are in keeping with my expectation; UTF adds for certain primitives up to 3x overhead compared to ASCII, and I expect combining character handling to bring about as much on top of that. Your work and Steve's won't go to waste; one way or another we need to add grapheme-based processing to D. I think it would be great if later on a Phobos submission was made. Andrei
Jan 17 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/17/11 12:23 PM, spir wrote:
 Andrei, would you have a look at Text's current state, mainly
 the interface, when you have time for that (no hurry) at
 https://bitbucket.org/denispir/denispir-d/src
 It is actually a bit more than just a string type considering true
 characters as natural elements.
 * It is a textual type providing a client interface of common text
 manipulation methods similar to ones in common high-level languages.
 (including the fact that a character is a singleton string)
 * The repo also holds the main module (unicodedata) of Text's sister lib
 (dunicode), providing access to various unicode algos and data.
 (We are about to merge the 2 libs into a new repository.)

I think this is solid work that reveals good understanding of Unicode. That being said, there are a few things I disagree about and I don't think it can be integrated into Phobos.

One thing is that it looks a lot more like D1 code than D2. D2 code of this kind is automatically expected to play nice with the rest of Phobos (ranges and algorithms). As it is, the code is an island that implements its own algorithms (mostly by equivalent handwritten code).

In detail:

* Line 130: representing a text as a dchar[][] has its advantages but major efficiency issues. To be frank I think it's a disaster. I think a representation building on UTF strings directly is bound to be vastly better.
* 163: equality does what std.algorithm.equal does.
* 174: equality also does what std.algorithm.equal does (possibly with a custom pred)
* 189: TextException is unnecessary
* 340: Unless properly motivated, iteration with opApply is archaic and inefficient.
* 370: Why lose the information that the result is in fact a single Pile?
* 430, 456, 474: contains, indexOf, count and probably others should use generic algorithms, not duplicate them.
* 534: replace is std.array.replace
* 623: copy copies the piles shallowly (not sure if that's a problem)

As I mentioned before - why not focus on defining a Grapheme type (what you call Pile, but using UTF encoding) and defining a ByGrapheme range that iterates a UTF-encoded string by grapheme?

Andrei
Jan 17 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/17/11 5:13 PM, spir wrote:
 On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
 * Line 130: representing a text as a dchar[][] has its advantages but
 major efficiency issues. To be frank I think it's a disaster. I think a
 representation building on UTF strings directly is bound to be vastly
 better.

I don't understand your point. Where is the difference with D's builtin types, then?

Unfortunately I won't have much time to discuss all these points, but this is a simple one: using dchar[][] wastes memory and time. You need to build on a flatter representation. Don't confuse the abstraction you are building with its underlying representation. The difference between your abstraction and char[]/wchar[]/dchar[] (which I strongly recommend you to build on) is that the abstractions offer different, higher-level primitives that the representation doesn't.

Let me repeat again: if anyone in this community wants to put work into a forward range that iterates one grapheme at a time, that work would be very valuable because it will allow us to experiment with graphemes in a non-disruptive way while benefiting from a host of algorithms. ByGrapheme and friends will help more than defining new string types.

Andrei
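As a rough illustration of a grapheme range layered over an ordinary UTF-8 string, here is a sketch that assumes a Phobos release in which std.uni provides `byGrapheme` and `Grapheme` (these are not part of the library being discussed in this thread). The helper names `graphemeCount` and `codePointsPerGrapheme` are hypothetical.

```d
import std.algorithm : equal, map;
import std.array : array;
import std.range : walkLength;
import std.uni : byGrapheme;

/// Counts user-perceived characters; the storage stays a plain string.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

/// Code points per grapheme, computed with a generic algorithm (map).
size_t[] codePointsPerGrapheme(string s)
{
    return s.byGrapheme.map!(g => g.length).array;
}

void main()
{
    string s = "fu\u0308nf";        // 'u' followed by COMBINING DIAERESIS
    assert(s.length == 6);          // UTF-8 code units
    assert(s.walkLength == 5);      // code points
    assert(graphemeCount(s) == 4);  // graphemes
    assert(equal(codePointsPerGrapheme(s), [1, 2, 1, 1]));
}
```

The range composes with std.algorithm and std.range as any other input range does, so no algorithm has to be rewritten for it.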
Jan 17 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/18/11 7:25 AM, spir wrote:
 On 01/18/2011 03:52 AM, Andrei Alexandrescu wrote:
 On 1/17/11 5:13 PM, spir wrote:
 On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
 * Line 130: representing a text as a dchar[][] has its advantages but
 major efficiency issues. To be frank I think it's a disaster. I think a
 representation building on UTF strings directly is bound to be vastly
 better.

I don't understand your point. Where is the difference with D's builtin types, then?

Unfortunately I won't have much time to discuss all these points, but this is a simple one: using dchar[][] wastes memory and time. You need to build on a flatter representation. Don't confuse the abstraction you are building with its underlying representation. The difference between your abstraction and char[]/wchar[]/dchar[] (which I strongly recommend you to build on) is that the abstractions offer different, higher-level primitives that the representation doesn't.

I think it is necessary to repeat the following: Text in my view (or whatever variant solution to work correctly with universal text) is _not_ intended as a basic string type, even less as the default. If programmers can guarantee all their app's input will only ever hold single-codepoint characters, _or_ if they just pass pieces of text around without manipulation, then such a tool is big overkill. It has a time cost at Text construction time, which I consider an investment. It also has some space & time cost for operations, which should be only slightly relevant compared to the speed gained from the simple fact that routines can then operate just (actually nearly) like with historic charsets. Indexing is just normal O(1) indexing, possibly plus producing the result. Not O(n) across the source with building piles along the way. (1000X slower, 1000000X slower?) Counting is just O(n) with mini-array compares, not building & normalising piles across the whole code sequence. (10X, 100X slower?)

You don't provide O(n) indexing. Andrei
Jan 18 2011
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Monday 17 January 2011 15:13:42 spir wrote:
 See range bug evoked above. opApply is the only workaround AFAIK.
 Also, ranges cannot yet provide indexed iteration like
 	foreach(i, char ; text) {...}

While it would be nice at times to be able to have an index with foreach when using ranges, I would point out that it's trivial to just declare a variable which you increment each iteration, so it's easy to get an index even when using foreach with ranges. Certainly, I wouldn't consider the lack of index with foreach and ranges a good reason to use opApply instead of ranges. There may be other reasons which make it worthwhile, but it's so trivial to get an index that the loss of range abilities (particularly the ability to use such ranges with std.algorithm) dwarfs it in importance. - Jonathan M Davis
Jan 17 2011
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/17/11 11:48 PM, Jonathan M Davis wrote:
 On Monday 17 January 2011 15:13:42 spir wrote:
 See range bug evoked above. opApply is the only workaround AFAIK.
 Also, ranges cannot yet provide indexed iteration like
 	foreach(i, char ; text) {...}

While it would be nice at times to be able to have an index with foreach when using ranges, I would point out that it's trivial to just declare a variable which you increment each iteration, so it's easy to get an index even when using foreach with ranges. Certainly, I wouldn't consider the lack of index with foreach and ranges a good reason to use opApply instead of ranges. There may be other reasons which make it worthwhile, but it's so trivial to get an index that the loss of range abilities (particularly the ability to use such ranges with std.algorithm) dwarfs it in importance. - Jonathan M Davis

It's a bit more difficult than that. When iterating a variable-length encoded range, what you need more than the current item being iterated is the physical offset reached inside the range. That's not all that difficult either as the range can always provide an extra primitive, but a bit annoying (e.g. because it makes iteration with foreach impossible if you want the index, unless you return a tuple with each step). At any rate, I agree with two things - one, we need to fix the foreach situation. Two, even before we find a fix, at this point committing to iteration with opApply essentially commits the iteratee to an island where all basic algorithms need to be reinvented from first principles. Andrei
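The distinction between logical index and physical offset can be sketched as follows. Iterating a string with `foreach (i, dchar c; s)` is built into the language and reports code-unit offsets; `graphemeStride` is assumed to be available from std.uni in a recent Phobos, and `graphemesWithOffsets` is a hypothetical helper that returns a tuple per step.

```d
import std.typecons : Tuple, tuple;
import std.uni : graphemeStride;

/// Returns (physical offset, slice) for each grapheme of s.
/// Hypothetical helper, not part of Phobos.
Tuple!(size_t, string)[] graphemesWithOffsets(string s)
{
    Tuple!(size_t, string)[] result;
    size_t offset = 0;
    while (offset < s.length)
    {
        // Length, in code units, of the grapheme starting at offset.
        immutable step = graphemeStride(s, offset);
        result ~= tuple(offset, s[offset .. offset + step]);
        offset += step;
    }
    return result;
}

void main()
{
    // Code point iteration: the index is a code-unit offset, not a count.
    size_t[] starts;
    foreach (i, dchar c; "fu\u0308nf")
        starts ~= i;
    assert(starts == [0, 1, 2, 4, 5]);

    // Grapheme iteration: logical index 2 sits at physical offset 4.
    auto parts = graphemesWithOffsets("fu\u0308nf");
    assert(parts.length == 4);
    assert(parts[1][0] == 1 && parts[1][1] == "u\u0308");
    assert(parts[2][0] == 4);
}
```

A manually incremented counter would report 2 for the third grapheme, while slicing the underlying string needs 4, which is why the offset has to come from the range itself.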
Jan 17 2011
prev sibling parent spir <denis.spir gmail.com> writes:
On 01/18/2011 07:11 AM, Andrei Alexandrescu wrote:
 On 1/17/11 11:48 PM, Jonathan M Davis wrote:
 On Monday 17 January 2011 15:13:42 spir wrote:
 See range bug evoked above. opApply is the only workaround AFAIK.
 Also, ranges cannot yet provide indexed iteration like
 foreach(i, char ; text) {...}

While it would be nice at times to be able to have an index with foreach when using ranges, I would point out that it's trivial to just declare a variable which you increment each iteration, so it's easy to get an index even when using foreach with ranges. Certainly, I wouldn't consider the lack of index with foreach and ranges a good reason to use opApply instead of ranges. There may be other reasons which make it worthwhile, but it's so trivial to get an index that the loss of range abilities (particularly the ability to use such ranges with std.algorithm) dwarfs it in importance. - Jonathan M Davis

It's a bit more difficult than that. When iterating a variable-length encoded range, what you need more than the current item being iterated is the physical offset reached inside the range. That's not all that difficult either as the range can always provide an extra primitive, but a bit annoying (e.g. because it makes iteration with foreach impossible if you want the index, unless you return a tuple with each step).

This is a very valid point: a range's logical offset is not necessarily equal to its physical (hum) offset, even on a plain sequence. But for the case of Text it is in fact, precisely because codepoints have been grouped in "piles" each representing a true character (grapheme). This is actually one third of the purpose of Text (the others being to ensure unique representation of each character, and to provide users with a clear interface). Thus, Jonathan's point simply applies to Text.
 At any rate, I agree with two things - one, we need to fix the foreach
 situation. Two, even before we find a fix, at this point committing to
 iteration with opApply essentially commits the iteratee to an island
 where all basic algorithms need to be reinvented from first principles.

I agree. The situation would be different if D had not proposed indexed iteration already, and programmers would routinely count manually and/or call an extra range primitive, as you say. Upon using opApply: it works fine nevertheless, at least for a first rough implementation like in the case of Text. Reinventing basic algos is not an issue at this stage, as long as they are simple enough, and mainly for testing. (Actually, it can be an advantage in avoiding integration issues, possibly due to D's current beta stage --I mean bugs that pop up only when combining given features-- like we had eg with range & formatValue).
 Andrei

Denis _________________ vita es estrany spir.wikidot.com
Jan 18 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/18/2011 03:52 AM, Andrei Alexandrescu wrote:
 On 1/17/11 5:13 PM, spir wrote:
 On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
 * Line 130: representing a text as a dchar[][] has its advantages but
 major efficiency issues. To be frank I think it's a disaster. I think a
 representation building on UTF strings directly is bound to be vastly
 better.

I don't understand your point. Where is the difference with D's builtin types, then?

Unfortunately I won't have much time to discuss all these points, but this is a simple one: using dchar[][] wastes memory and time. You need to build on a flatter representation. Don't confuse the abstraction you are building with its underlying representation. The difference between your abstraction and char[]/wchar[]/dchar[] (which I strongly recommend you to build on) is that the abstractions offer different, higher-level primitives that the representation doesn't.

I think it is necessary to repeat the following: Text in my view (or whatever variant solution to work correctly with universal text) is _not_ intended as a basic string type, even less as the default. If programmers can guarantee all their app's input will only ever hold single-codepoint characters, _or_ if they just pass pieces of text around without manipulation, then such a tool is big overkill. It has a time cost at Text construction time, which I consider an investment. It also has some space & time cost for operations, which should be only slightly relevant compared to the speed offered by the simple fact that routines can then operate just (actually nearly) like with historic charsets. Indexing is just normal O(1) indexing, possibly plus producing the result. Not O(n) across the source with building piles along the way. (1000X slower, 1000000X slower?) Counting is just O(n) with mini-array compares, not building & normalising piles across the whole code sequence. (10X, 100X slower?)
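To make the O(1)-indexing argument concrete, here is a minimal sketch that keeps a flat UTF-8 string and pays a one-time construction cost to record where each grapheme starts. It is not Text's actual implementation (which stores piles as mini-arrays); GraphemeIndex is a hypothetical name, and std.uni.graphemeStride comes from present-day Phobos, which postdates this thread.

```d
import std.stdio : writeln;
import std.uni : graphemeStride;

/// Hypothetical illustration: flat UTF-8 storage plus a table of grapheme offsets.
struct GraphemeIndex
{
    string data;      // flat UTF-8 storage
    size_t[] starts;  // code-unit offset of each grapheme, plus data.length as sentinel

    this(string s)
    {
        data = s;
        // One-time O(n) pass: record the offset at which every grapheme begins.
        for (size_t i = 0; i < s.length; i += graphemeStride(s, i))
            starts ~= i;
        starts ~= s.length;
    }

    /// Number of graphemes (user-perceived characters).
    @property size_t length() { return starts.length - 1; }

    /// O(1) indexing: returns the n-th grapheme as a slice of the original string.
    string opIndex(size_t n) { return data[starts[n] .. starts[n + 1]]; }
}

void main()
{
    // 'e' followed by U+0301 (combining acute accent), then 'a'.
    auto t = GraphemeIndex("e\u0301a");
    writeln(t.length);    // number of graphemes, not code units
    writeln(t[1]);        // second grapheme
}
```

The trade-off is the one debated here: the offset table costs extra memory and construction time, in exchange for indexing that does not rescan the string.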
 Let me repeat again: if anyone in this community wants to put work in a
 forward range that iterates one grapheme at a time, that work would be
 very valuable because it will allow us to experiment with graphemes in a
 non-disruptive way while benefiting of a host of algorithms. ByGrapheme
 and friends will help more than defining new string types.

Right. I understand your point-of-view, esp "non-disruptive". But then, how to avoid the possibly huge inefficiency evoked above? We have no true perf numbers yet, right, for any alternative to Text's approach. But for this reason we also should not randomly speak of this approach's space & time costs. Compared to what? Denis _________________ vita es estrany spir.wikidot.com
Jan 18 2011
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 18 Jan 2011 01:11:04 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/17/11 11:48 PM, Jonathan M Davis wrote:
 On Monday 17 January 2011 15:13:42 spir wrote:
 See range bug evoked above. opApply is the only workaround AFAIK.
 Also, ranges cannot yet provide indexed iteration like
 	foreach(i, char ; text) {...}

While it would be nice at times to be able to have an index with foreach when using ranges, I would point out that it's trivial to just declare a variable which you increment each iteration, so it's easy to get an index even when using foreach with ranges. Certainly, I wouldn't consider the lack of index with foreach and ranges a good reason to use opApply instead of ranges. There may be other reasons which make it worthwhile, but it's so trivial to get an index that the loss of range abilities (particularly the ability to use such ranges with std.algorithm) dwarfs it in importance. - Jonathan M Davis

It's a bit more difficult than that. When iterating a variable-length encoded range, what you need more than the current item being iterated is the physical offset reached inside the range. That's not all that difficult either as the range can always provide an extra primitive, but a bit annoying (e.g. because it makes iteration with foreach impossible if you want the index, unless you return a tuple with each step). At any rate, I agree with two things - one, we need to fix the foreach situation. Two, even before we find a fix, at this point committing to iteration with opApply essentially commits the iteratee to an island where all basic algorithms need to be reinvented from first principles.
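As an illustration of the "physical offset" mentioned above, the sketch below walks a UTF-8 string code point by code point while tracking the code-unit offset of each one. It uses std.utf.stride from Phobos, which plays the role of the stepSize primitive proposed at the top of this thread; codePointOffsets is a hypothetical helper name.

```d
import std.stdio : writefln;
import std.utf : stride;

/// Returns the code-unit offset at which each code point of `s` starts.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    size_t i = 0;
    while (i < s.length)
    {
        offsets ~= i;
        i += stride(s, i); // length in code units of the code point at offset i
    }
    return offsets;
}

void main()
{
    // U+00E9 occupies two code units in UTF-8, so logical and physical
    // positions diverge after it.
    foreach (logical, physical; codePointOffsets("a\u00E9b"))
        writefln("code point %s starts at offset %s", logical, physical);
}
```

A manually incremented counter only gives the logical position; the offset array above is what slicing the underlying storage actually needs.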

opApply in no way disables the range interface. It is simply used for foreach. So the only "algorithm" which is different is foreach. If you use the range primitives, opApply is nowhere to be found. That being said, we have an annoying situation in all this. You cannot have opApply handle foreach with indexes *and* have the range primitives handle foreach over elements. If one opApply is found, the compiler gives up on using the range functions for foreach (this is reflected in my most recent string_t code). This means you will have to implement a "wrapper" opApply around the range primitives in order to also implement indexing foreach. -Steve
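A minimal sketch of such a wrapper, assuming a hypothetical CodePoints struct (this is not the string_t code referred to above): the range primitives stay available to std.algorithm, and an opApply built on top of them supplies the index for foreach.

```d
import std.algorithm.comparison : equal;
import std.stdio : writefln;
import std.utf : decode, stride;

/// Hypothetical range over the code points of a UTF-8 string.
struct CodePoints
{
    string data;

    // Input-range primitives, used by std.algorithm and friends.
    @property bool empty() { return data.length == 0; }
    @property dchar front() { size_t i = 0; return decode(data, i); }
    void popFront() { data = data[stride(data) .. $]; }

    // Wrapper around the primitives above; serves `foreach (i, c; ...)`.
    int opApply(scope int delegate(size_t, dchar) dg)
    {
        size_t i = 0;
        for (auto r = this; !r.empty; r.popFront(), ++i)
            if (auto result = dg(i, r.front))
                return result; // non-zero means break/return inside the loop body
        return 0;
    }
}

void main()
{
    // Indexed foreach goes through opApply.
    foreach (i, c; CodePoints("h\u00E9y"))
        writefln("%s: %s", i, c);

    // Generic algorithms still see an ordinary input range.
    assert(equal(CodePoints("h\u00E9y"), "h\u00E9y"d));
}
```

The index here is the logical position of the code point; a variant reporting the code-unit offset would accumulate stride(data) instead.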
Jan 19 2011
prev sibling parent spir <denis.spir gmail.com> writes:
On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
 On 1/17/11 12:23 PM, spir wrote:
 Andrei, would you have a look at Text's current state, mainly
 the interface, when you have time for that (no hurry) at
 https://bitbucket.org/denispir/denispir-d/src
 It is actually a bit more than just a string type considering true
 characters as natural elements.
 * It is a textual type providing a client interface of common text
 manipulation methods similar to ones in common high-level languages.
 (including the fact that a character is a singleton string)
 * The repo also holds the main module (unicodedata) of Text's sister lib
 (dunicode), providing access to various unicode algos and data.
 (We are about to merge the 2 libs into a new repository.)

I think this is solid work that reveals good understanding of Unicode. That being said, there are a few things I disagree about and I don't think it can be integrated into Phobos.

We are exploring a new field. (Except for the work Objective-C designers did -- but we just discovered it.)
 One thing is that it looks a lot
 more like D1 code than D2. D2 code of this kind is automatically
 expected to play nice with the rest of Phobos (ranges and algorithms).
 As it is, the code is an island that implements its own algorithms
 (mostly by equivalent handwritten code).

Right. Initially, we wanted precisely to let it play nicely with the rest of the new Phobos. This mainly meant providing a range interface, which also gives access to std.algorithm routines. But we were blocked by current bugs related to ranges. I have posted about those issues (you may remember having replied to this post).
 In detail:

 * Line 130: representing a text as a dchar[][] has its advantages but
 major efficiency issues. To be frank I think it's a disaster. I think a
 representation building on UTF strings directly is bound to be vastly
 better.

I don't understand your point. Where is the difference with D's builtin types, then? Also, which efficiency issue do you mean? Regarding text object construction, we do agree, and I have given some data. But this happens only once; it is an investment intended to provide correctness first, and efficiency of _every_ operation on constructed text. Regarding the speed of such methods / algorithms operating _correctly_ on universal text: precisely since there is no alternative to Text (yet), there are also no available performance data to judge by. (What about comparing Objective-C's NSString to Text's current performance for indexing, slicing, searching, counting,...? Even in its current experimental stage, I bet it would not be ridiculous, rather the opposite. But I may be completely wrong.)
 * 163: equality does what std.algorithm.equal does.

 * 174: equality also does what std.algorithm.equal does (possibly with a
 custom pred)

Right, these are unimportant tool funcs at the "pile" level. (Initially introduced because builtin "==" showed strange inefficiency in our case. May test again later.)
 * 189: TextException is unnecessary

Agreed.
 * 340: Unless properly motivated, iteration with opApply is archaic and
 inefficient.

See range bug evoked above. opApply is the only workaround AFAIK. Also, ranges cannot yet provide indexed iteration like foreach(i, char ; text) {...}
 * 370: Why lose the information that the result is in fact a single Pile?

I don't know what information loss you mean. Generally speaking, Pile is more or less an implementation detail used to internally represent a true character, while Text is the important thing. At one time we had to choose whether or not to make Pile an obviously exposed type as well. I chose (after some exchange on the topic) not to do it, for a few reasons:
* Simplicity: one type does all the job well.
* Avoid confusion due to conflict with historic string types whose elements (codes=characters) were atomic thingies. This was also a reason not to name it simply "Character"; "Pile" for me was supposed to evoke the technical side rather than the meaningful side.
* Lightness of the interface: if we expose Pile obviously, then we need to double all methods that may take or return a single character, like searching, counting, replacing etc... and also possibly indexing and iteration.
In fact, the resulting interface is more or less like a string type in high-level languages such as Python, with the motivating difference that it operates correctly on universal text. Now, it seems you rather expect, maybe, the character/pile type to be the important thing and Text to just be a sequence of them? (possibly even unnecessary to be defined formally)
 * 430, 456, 474: contains, indexOf, count and probably others should use
 generic algorithms, not duplicate them.

 * 534: replace is std.array.replace

I had to write algos because most of them in std.algorithm require a range interface, IIUC; and also for testing purposes.
 * 623: copy copies the piles shallowly (not sure if that's a problem)

Had the same interrogation.
 As I mentioned before - why not focus on defining a Grapheme type (what
 you call Pile, but using UTF encoding) and defining a ByGrapheme range
 that iterates a UTF-encoded string by grapheme?

Dunno. This simply was not my approach. It seems to me that Text as is provides clients with an interface as simple and clear as possible, while operating correctly in the background. It seems that if you just build a ByGrapheme iterator, then you have no other choice than abstracting on the fly (constructing piles on the fly for operations like indexing, and normalising them in addition for searching, counting...). As I said in other posts, this may be the right thing to do from an efficiency point of view, but this remains to be proven. I bet the opposite, in fact: that, with the same implementation language and the same investment in optimisation, the approach defining a true textual type like Text is inevitably more efficient by orders of magnitude (*). Again, Text's initial construction cost is an investment. Prove me wrong (**).
 Andrei

Denis (*) Except, probably, for the choice of making the ElementType a singleton Text (seems costly). (**) I'm now aware of the high speed loss Text certainly suffers from representing characters as mini-arrays, but I guess it is marginally relevant compared to the gain of not piling and normalising for every operation. _________________ vita es estrany spir.wikidot.com
Jan 17 2011
prev sibling parent spir <denis.spir gmail.com> writes:
On 01/17/2011 06:36 PM, Andrei Alexandrescu wrote:
 On 1/17/11 10:55 AM, spir wrote:
 On 01/15/2011 12:21 AM, Michel Fortin wrote:
 Also, it'd really help this discussion to have some hard numbers about
 the cost of decoding graphemes.

Text has a perf module that provides such numbers (on different stages of Text object construction) (but the measured algos are not yet stabilised, so that said numbers regularly change, but in the right direction ;-) You can try the current version at https://bitbucket.org/denispir/denispir-d/src (the perf module is called chrono.d) For information, recently, the cost of full text construction (decoding, normalisation (both decomp & ordering), piling) was about 5 times that of decoding alone, the heavy part (~ 70%) being piling. But Stephan just informed me about a new gain in piling I have not yet tested. This performance places our library in between Windows native tools and ICU in terms of speed. Which is imo rather good for a brand new tool written in a still unstable language. I have carefully read your arguments on Text's approach of systematically "piling" and normalising source texts not being the right one from an efficiency point of view, even for strict use cases of universal text manipulation (because the relative space cost would indirectly cause time cost due to cache effects). Instead, you state we should "pile" and/or normalise on the fly. But I am, similarly to you, rather doubtful on this point without any numbers available. So, let us produce some benchmark results on both approaches if you like.

Congrats on this great work. The initial numbers are in keeping with my expectation; UTF adds for certain primitives up to 3x overhead compared to ASCII, and I expect combining character handling to bring about as much on top of that. Your work and Steve's won't go to waste; one way or another we need to add grapheme-based processing to D. I think it would be great if later on a Phobos submission was made.

Andrei, would you have a look at Text's current state, mainly the interface, when you have time for that (no hurry) at https://bitbucket.org/denispir/denispir-d/src
It is actually a bit more than just a string type considering true characters as natural elements.
* It is a textual type providing a client interface of common text manipulation methods similar to ones in common high-level languages. (including the fact that a character is a singleton string)
* The repo also holds the main module (unicodedata) of Text's sister lib (dunicode), providing access to various unicode algos and data.
(We are about to merge the 2 libs into a new repository.)
Denis _________________ vita es estrany spir.wikidot.com
Jan 17 2011
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir gmail.com> wrote:
 
 The point is not playing like that with Unicode flexibility. Rather 
 that  composite characters are just normal thingies in most languages 
 of the  world. Actually, on this point, english is a rare exception 
 (discarding  letters imported from foreign languages like french ''); 
 to the point  of being, I guess, the only western language without any 
 diacritic.

Is it common to have multiple modifiers on a single character?

Not to my knowledge. But I rarely deal with non-Latin texts; there are probably some scripts out there that take advantage of this.
 The  problem I see with using decomposed canonical form for strings is 
 that we  would have to return a dchar[] for each 'element', which 
 severely  complicates code that, for instance, only expects to handle 
 English.

Actually, returning a sliced char[] or wchar[] could also be valid. User-perceived characters are basically a substring of one or more code points. I'm not sure it complicates the semantics of the language that much -- what's complicated about writing str.front == "a" instead of str.front == 'a'? -- although it probably would complicate the generated code and make it a little slower. In the case of NSString in Cocoa, you can only access the 'characters' in their UTF-16 form. But everything from comparison to substring search is done using graphemes. It's like they implemented specialized Unicode-aware algorithms for these functions. There's no genericness about how it handles graphemes. I'm not sure yet what would be the right approach for D.
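The "substring per user-perceived character" idea can be sketched as follows. This is only an illustration, not something proposed verbatim in this thread: graphemeSlices is a hypothetical helper, and std.uni.graphemeStride belongs to present-day Phobos, which postdates this discussion.

```d
import std.stdio : writeln;
import std.uni : graphemeStride;

/// Splits `s` into slices, one per grapheme, without copying or normalizing.
string[] graphemeSlices(string s)
{
    string[] parts;
    for (size_t i = 0; i < s.length; )
    {
        immutable n = graphemeStride(s, i); // code units in the grapheme at offset i
        parts ~= s[i .. i + n];
        i += n;
    }
    return parts;
}

void main()
{
    // The first element is a two-code-point slice: 'e' + U+0301.
    auto parts = graphemeSlices("e\u0301a");
    writeln(parts.length);
    // Comparing against a string literal instead of a character literal.
    writeln(parts[1] == "a");
}
```

Each element is a view into the original storage, so the original code points stay visible to the caller.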
 I was hoping to lazily transform a string into its composed canonical  
 form, allowing the (hopefully rare) exception when a composed character 
  does not exist.  My thinking was that this at least gives a useful 
 string  representation for 90% of usages, leaving the remaining 10% of 
 usages to  find a more complex representation (like your Text type).  
 If we only get  like 20% or 30% there by making dchar the element type, 
 then we haven't  made it useful enough.
 
 Either way, we need a string type that can be compared canonically for  
 things like searches or opEquals.

I wonder if normalized string comparison shouldn't be built directly into the char[], wchar[] and dchar[] types instead. Also bring in the idea above that iterating on a string would yield graphemes as char[], and this code would work perfectly irrespective of whether you used combining characters:

	foreach (grapheme; "exposé") {
		if (grapheme == "é")
			break;
	}

I think a good standard to evaluate our handling of Unicode is to see how easy it is to do things the right way. In the above, foreach would slice the string grapheme by grapheme, and the == operator would perform a normalized comparison. While it works correctly, it's probably not the most efficient way to do things, however.

-- Michel Fortin michel.fortin michelf.com http://michelf.com/
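For comparison, the same behaviour can be approximated today with explicit library calls rather than built-in operators. The sketch below assumes present-day Phobos (std.uni.normalize and std.uni.byGrapheme, both newer than this thread); sameText and graphemeCount are hypothetical helper names.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme, normalize, NFC;

/// Normalized comparison: equal if both strings have the same NFC form.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

/// Counts user-perceived characters rather than code units or code points.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string composed = "expos\u00E9";     // final letter as the single code point U+00E9
    string decomposed = "expose\u0301";  // 'e' followed by U+0301 (combining acute)

    writeln(composed == decomposed);         // built-in == compares code units
    writeln(sameText(composed, decomposed)); // normalized comparison
    writeln(graphemeCount(decomposed));
}
```

Normalizing both operands on every comparison is the simple approach, not the efficient one; it allocates whenever an input is not already in NFC.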
Jan 14 2011
next sibling parent reply Lutger Blijdestijn <lutger.blijdestijn gmail.com> writes:
Steven Schveighoffer wrote:

...
 I think a good standard to evaluate our handling of Unicode is to see
 how easy it is to do things the right way. In the above, foreach would
 slice the string grapheme by grapheme, and the == operator would perform
 a normalized comparison. While it works correctly, it's probably not the
 most efficient way to do thing however.

I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English. I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above). Does this sound reasonable? -Steve

If it's a matter of choosing which is the 'default' range, I'd think proper Unicode handling is more reasonable than catering for English / ASCII only. Especially since this is already the case in Phobos string algorithms.
Jan 15 2011
next sibling parent reply foobar <foo bar.com> writes:
Steven Schveighoffer Wrote:

 On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn  
 <lutger.blijdestijn gmail.com> wrote:
 
 Steven Schveighoffer wrote:

 ...
 I think a good standard to evaluate our handling of Unicode is to see
 how easy it is to do things the right way. In the above, foreach would
 slice the string grapheme by grapheme, and the == operator would  
 perform
 a normalized comparison. While it works correctly, it's probably not  
 the
 most efficient way to do thing however.

I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English. I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above). Does this sound reasonable? -Steve

If it's a matter of choosing which is the 'default' range, I'd think proper Unicode handling is more reasonable than catering for English / ASCII only. Especially since this is already the case in Phobos string algorithms.

English and (if I understand correctly) most other languages. Any language which can be built from composable graphemes would work. And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals). What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII, what I'm proposing is simply not catering to more complex languages such as Hebrew and Arabic. What I'm trying to find is a middle ground where most languages work, and the code is simple and efficient, with possibilities to jump down to lower levels for performance (i.e. switch to char[] when you know ASCII is all you are using) or jump up to full unicode when necessary.

Essentially, we would have three levels of types:

char[], wchar[], dchar[] -- Considered to be arrays in every way.
string_t!T (string, wstring, dstring) -- Specialized string types that do normalization to dchars, but do not handle perfectly all graphemes. Works with any algorithm that deals with bidirectional ranges. This is the default string type, and the type for string literals. Represented internally by a single char[], wchar[] or dchar[] array.
* utfstring_t!T -- specialized string to deal with full unicode, which may perform worse than string_t, but supports everything unicode supports. May require a battery of specialized algorithms.

* - name up for discussion

Also note that phobos currently does *no* normalization as far as I can tell for things like opEquals. Two char[]'s that represent equivalent strings, but not in the same way, will compare as !=.

-Steve

The above compromise provides zero benefit. The proposed default type string_t is incorrect and will cause bugs. I prefer the standard lib to not provide normalization at all and force me to use a 3rd party lib rather than provide an incomplete implementation that will give me a false sense of correctness and cause very subtle and hard to find bugs. Moreover, even if you ignore Hebrew as a tiny insignificant minority, you cannot do the same for Arabic, which has over one *billion* people that use that language. I firmly believe that in accordance with D's principle that the default behavior should be the correct & safe option, D should have the full unicode type (utfstring_t above) as the default. You need only a subset of the functionality because you only use English? For the same reason, you don't want the Unicode overhead? Use an ASCII type instead. In the same vein, a geneticist should use a DNA sequence type and not Unicode text.
Jan 15 2011
parent foobar <foo bar.com> writes:
Steven Schveighoffer Wrote:

 On Sat, 15 Jan 2011 13:21:12 -0500, foobar <foo bar.com> wrote:
 
 Steven Schveighoffer Wrote:

 English and (if I understand correctly) most other languages.  Any
 language which can be built from composable graphemes would work.  And  
 in
 fact, ones that use some graphemes that cannot be composed will also  
 work
 to some degree (for example, opEquals).

 What I'm proposing (or think I'm proposing) is not exactly catering to
 English and ASCII, what I'm proposing is simply not catering to more
 complex languages such as Hebrew and Arabic.  What I'm trying to find  
 is a
 middle ground where most languages work, and the code is simple and
 efficient, with possibilities to jump down to lower levels for  
 performance
 (i.e. switch to char[] when you know ASCII is all you are using) or jump
 up to full unicode when necessary.

 Essentially, we would have three levels of types:

 char[], wchar[], dchar[] -- Considered to be arrays in every way.
 string_t!T (string, wstring, dstring) -- Specialized string types that  
 do
 normalization to dchars, but do not handle perfectly all graphemes.   
 Works
 with any algorithm that deals with bidirectional ranges.  This is the
 default string type, and the type for string literals.  Represented
 internally by a single char[], wchar[] or dchar[] array.
 * utfstring_t!T -- specialized string to deal with full unicode, which  
 may
 perform worse than string_t, but supports everything unicode supports.
 May require a battery of specialized algorithms.

 * - name up for discussion

 Also note that phobos currently does *no* normalization as far as I can
 tell for things like opEquals.  Two char[]'s that represent equivalent
 strings, but not in the same way, will compare as !=.

 -Steve

The above compromise provides zero benefit. The proposed default type string_t is incorrect and will cause bugs. I prefer the standard lib to not provide normalization at all and force me to use a 3rd party lib rather than provide an incomplete implementation that will give me a false sense of correctness and cause very subtle and hard to find bugs.

I feel like you might be exaggerating, but maybe I'm completely wrong on this, I'm not well-versed in unicode, or even languages that require unicode. The clear benefit I see is that with a string type which normalizes to canonical code points, you can use this in any algorithm without having it be unicode-aware for *most languages*. At least, that is how I see it. I'm looking at it as a code-reuse proposition.

It's like calendars. There are quite a few different calendars in different cultures. But most people use a Gregorian calendar. So we have three options:

a) Use a Gregorian calendar, and leave the other calendars to a 3rd party library
b) Use a complicated calendar system where Gregorian calendars are treated with equal respect to all other calendars, none are the default.
c) Use a Gregorian calendar by default, but include the other calendars as a separate module for those who wish to use them.

I'm looking at my proposal as more of a c) solution.

The calendar example is a very good one. What you're saying is equivalent to saying that most people use Gregorian, but for efficiency and other reasons you do not want to implement Feb 29th.
 Can you show how normalization causes subtle bugs?
 

That was already shown by Michel and Spir, where the equality operator is incorrect due to diacritics (the example with exposé). Your solution makes this far worse, since it will reduce the bug to far fewer cases, making the problem far less obvious. One would test with exposé, which will work, and then run another test (let's say in Hebrew) that will *not* work, and unless the programmer is a Unicode expert (which is very unlikely) the programmer is left scratching his head.
 Moreover, even if you ignore Hebrew as a tiny insignificant minority  
 you cannot do the same for Arabic which has over one *billion* people  
 that use that language.

I hope that the medium type works 'good enough' for those languages, with the high level type needed for advanced usages. At a minimum, comparison and substring should work for all languages.

As I explained above, 'good enough' in this case is far worse because it masks the problem. Also, if you want comparison to work in all languages including Hebrew/Arabic, then it simply isn't good enough.
 I firmly believe that in accordance with D's principle that the default  
 behavior should be the correct & safe option, D should have the full  
 unicode type (utfstring_t above) as the default.

 You need only a subset of the functionality because you only use  
 English? For the same reason, you don't want the Unicode overhead? Use  
 an ASCII type instead. In the same vein, a geneticist should use a DNA  
 sequence type and not Unicode text.

Or French, or Spanish, or German, etc... Look, even the lowest level is valid unicode, but if you want to start extracting individual graphemes, you need more machinery. In 99% of cases, I'd think you want to use strings as strings, not as sequences of graphemes, or code-units. -Steve

I'd like to have full Unicode support. I think it is a good thing for D to have in order to expand in the world. As an alternative, I'd settle for loud errors that make absolutely clear to the non-Unicode-expert programmer that D simply does NOT support e.g. normalization. As Spir already said, Unicode is something few understand, and even its own official docs do not explain such issues properly. We should not confuse users even further with incomplete support.
Jan 15 2011
prev sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-15 12:39:32 -0500, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn  
 <lutger.blijdestijn gmail.com> wrote:
 
 Steven Schveighoffer wrote:
 
 ...
 I think a good standard to evaluate our handling of Unicode is to see
 how easy it is to do things the right way. In the above, foreach would
 slice the string grapheme by grapheme, and the == operator would  perform
 a normalized comparison. While it works correctly, it's probably not  the
 most efficient way to do thing however.

I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English. I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above). Does this sound reasonable? -Steve

If it's a matter of choosing which is the 'default' range, I'd think proper Unicode handling is more reasonable than catering for English / ASCII only. Especially since this is already the case in Phobos string algorithms.

English and (if I understand correctly) most other languages. Any language which can be built from composable graphemes would work. And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals). What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII, what I'm proposing is simply not catering to more complex languages such as Hebrew and Arabic. What I'm trying to find is a middle ground where most languages work, and the code is simple and efficient, with possibilities to jump down to lower levels for performance (i.e. switch to char[] when you know ASCII is all you are using) or jump up to full unicode when necessary.

Why don't we build a compiler with an optimizer that generates correct code *almost* all of the time? If you are worried about it not producing correct code for a given function, you can just add "pragma(correct_code)" in front of that function to disable the risky optimizations. No harm done, right? One thing I see very often, often on US web sites but also elsewhere, is that if you enter a name with an accented letter in a form (say Émilie), very often the accented letter gets changed to another semi-random character later in the process. Why? Because somewhere in the process lies an encoding mismatch that no one thought about and no one tested for. At the very least, the form should have rejected those unexpected characters and shown an error when it could. Now, with proper Unicode handling up to the code point level, this kind of problem probably won't happen as often because the whole stack works with UTF encodings. But are you going to validate all of your inputs to make sure they have no combining code point? Don't assume that because you're in the United States no one will try to enter characters where you don't expect them. People love to play with Unicode symbols for fun, putting them in their name, signature, or even domain names (✪df.ws). Just wait until they discover they can combine them. ☺̰̎! There is also a variety of combining mathematical symbols with no pre-combined form, such as ≸. Writing in Arabic, Hebrew, Korean, or some other foreign language isn't a prerequisite to use combining characters.
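The limitation discussed here, that composing to canonical form only helps when a precomposed code point exists, can be checked directly. This sketch assumes present-day Phobos (std.uni.normalize, which postdates this thread); codePointsAfterNFC is a hypothetical helper, and 'q' with an acute accent is chosen as an example of a sequence with no precomposed form.

```d
import std.stdio : writeln;
import std.uni : normalize, NFC;
import std.utf : count;

/// Number of code points left after composing `s` to NFC.
size_t codePointsAfterNFC(string s)
{
    return count(normalize!NFC(s));
}

void main()
{
    // 'e' + U+0301 composes to the single code point U+00E9.
    writeln(codePointsAfterNFC("e\u0301"));
    // 'q' + U+0301 has no precomposed form, so two code points remain
    // even though the user perceives a single character.
    writeln(codePointsAfterNFC("q\u0301"));
}
```

Code that treats one dchar as one character therefore stays correct for the first input and silently breaks on the second, which is the kind of masked bug being argued about.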
 Essentially, we would have three levels of types:
 
 char[], wchar[], dchar[] -- Considered to be arrays in every way.
 string_t!T (string, wstring, dstring) -- Specialized string types that 
 do  normalization to dchars, but do not handle perfectly all graphemes. 
  Works  with any algorithm that deals with bidirectional ranges.  This 
 is the  default string type, and the type for string literals.  
 Represented  internally by a single char[], wchar[] or dchar[] array.
 * utfstring_t!T -- specialized string to deal with full unicode, which 
 may  perform worse than string_t, but supports everything unicode 
 supports.   May require a battery of specialized algorithms.
 
 * - name up for discussion
 
 Also note that phobos currently does *no* normalization as far as I can 
  tell for things like opEquals.  Two char[]'s that represent equivalent 
  strings, but not in the same way, will compare as !=.

Basically, you're suggesting that the default way should be to handle Unicode *almost* right. And then, if you want to handle things *really* right, you need to be explicit about it by using "utfstring_t"? I understand your motivation, but it sounds backward to me. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 15 2011
prev sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-15 11:59:04 -0500, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin  
 <michel.fortin michelf.com> wrote:
 
 Actually, returning a sliced char[] or wchar[] could also be valid.
 User-perceived characters are basically a substring of one or more code
 points. I'm not sure it complicates that much the semantics of the
 language -- what's complicated about writing str.front == "a" instead
 of str.front == 'a'? -- although it probably would complicate the
 generated code and make it a little slower.

Hm... this pushes the normalization outside the type, and into the algorithms (such as find). I was hoping to avoid that.

Not really. It pushes the normalization to the string comparison operator, as explained later.
 I think I can come up with an algorithm that normalizes into canonical
 form as it iterates. It just might return part of a grapheme if the
 grapheme cannot be composed.

The problem with normalization while iterating is that you lose information about which code points are actually part of the grapheme. If you wanted to count the number of graphemes with a particular code point, you've lost that information. Moreover, if all you want is to count the number of graphemes, normalizing the characters is a waste of time. I suggested in another post that we implement ranges for decomposing and recomposing on-the-fly a string in its normalized form. That's basically the same thing as you suggest, but it'd have to be explicit to avoid the problem above.
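A small sketch of that counting case, assuming a later Phobos where std.uni offers byGrapheme and its Grapheme type (both postdate this thread). graphemesContaining is a hypothetical name. No normalization happens, so the combining code point is still visible:

```d
import std.uni : byGrapheme;

/// Hypothetical helper: counts the graphemes of `s` that contain `cp`.
size_t graphemesContaining(string s, dchar cp)
{
    size_t count;
    foreach (g; s.byGrapheme)
    {
        // Inspect the code points of this grapheme as stored in the string.
        foreach (i; 0 .. g.length)
        {
            if (g[i] == cp)
            {
                ++count;
                break;
            }
        }
    }
    return count;
}

void main()
{
    // Two decomposed é (e + U+0301) and one precomposed é (U+00E9).
    string s = "e\u0301te\u0301 \u00E9";
    assert(graphemesContaining(s, '\u0301') == 2);
    assert(graphemesContaining(s, '\u00E9') == 1);
}
```

Had the string been composed to NFC while iterating, the first count would be 0 and the second 3, which is the information loss described above.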
 I wonder if normalized string comparison shouldn't be built directly in 
  the char[] wchar[] and dchar[] types instead.

No, in my vision of how strings should be typed, char[] is an array, not a string. It should be treated like an array of code-units, where two forms that create the same grapheme are considered different.

Well, I agree there's a need for that sometime. But if what you want is just a dumb array of code units, why not use ubyte[], ushort[] and uint[] instead? It seems to me that the whole point of having a different type for char[], wchar[], and dchar[] is that you know they are Unicode strings and can treat them as such. And if you treat them as Unicode strings, then perhaps the runtime and the compiler should too, for consistency's sake.
 Also bring the idea above that iterating on a string would yield
 graphemes as char[] and this code would work perfectly irrespective of
 whether you used combining characters:
 
 	foreach (grapheme; "exposé") {
 		if (grapheme == "é")
 			break;
 	}
 
 I think a good standard to evaluate our handling of Unicode is to see
 how easy it is to do things the right way. In the above, foreach would
 slice the string grapheme by grapheme, and the == operator would
 perform a normalized comparison. While it works correctly, it's
 probably not the most efficient way to do things however.

I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English.

I'm not suggesting we impose it, just that we make it the default. If you want to iterate by dchar, wchar, or char, just write:

	foreach (dchar c; "exposé") {}
	foreach (wchar c; "exposé") {}
	foreach (char c; "exposé") {}
	// or
	foreach (dchar c; "exposé".by!dchar()) {}
	foreach (wchar c; "exposé".by!wchar()) {}
	foreach (char c; "exposé".by!char()) {}

and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character.
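For reference, the explicitly typed foreach forms already work on built-in strings today, and the three element types really do yield different counts. A minimal sketch (countAs is a hypothetical helper; the by!T syntax above is only a proposal and is not used here):

```d
/// Hypothetical helper: number of iterations when `s` is traversed
/// with loop variable type T (char, wchar or dchar).
size_t countAs(T)(string s)
{
    size_t n;
    foreach (T c; s)   // the compiler transcodes to T as it iterates
        ++n;
    return n;
}

void main()
{
    string precomposed = "expos\u00E9";   // é as U+00E9
    assert(countAs!char(precomposed)  == 7);  // UTF-8 code units, é takes two
    assert(countAs!wchar(precomposed) == 6);  // UTF-16 code units
    assert(countAs!dchar(precomposed) == 6);  // code points

    string decomposed = "expose\u0301";   // e + combining acute accent
    assert(countAs!char(decomposed)  == 8);
    assert(countAs!wchar(decomposed) == 7);
    assert(countAs!dchar(decomposed) == 7);  // still not six "characters"
}
```

None of the three gives 6 for both spellings, which is the gap that iterating by grapheme is meant to close.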
 I think this should be  possible to do with wrapper types or 
 intermediate ranges which have  graphemes as elements (per my 
 suggestion above).

I think it should be the reverse. If you want your code to break when it encounters multi-code-point graphemes then it's your choice, but you should have to make your choice explicit. The default should be to handle strings correctly. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 15 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 I'm not suggesting we impose it, just that we make it the default. If  
 you want to iterate by dchar, wchar, or char, just write:
 
 	foreach (dchar c; "exposé") {}
 	foreach (wchar c; "exposé") {}
 	foreach (char c; "exposé") {}
 	// or
 	foreach (dchar c; "exposé".by!dchar()) {}
 	foreach (wchar c; "exposé".by!wchar()) {}
 	foreach (char c; "exposé".by!char()) {}
 
 and it'll work. But the default would be a slice containing the  
 grapheme, because this is the right way to represent a Unicode 
 character.

I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.

I'm glad we agree on that now.
 What if I modified my proposed string_t type to return T[] as its
 element type, as you say, and string literals are typed as
 string_t!(whatever)? In addition, the restrictions I imposed on
 slicing a code point actually get imposed on slicing a grapheme. That
 is, it is illegal to substring a string_t in a way that slices through
 a grapheme (and by deduction, a code point)?

I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string however. Some other raw comments: I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?). If strings and arrays of code units are distinct, slicing in the middle of a grapheme or in the middle of a code point could throw an error, but for performance reasons it should probably check for that only when array bounds checking is turned on (that would require compiler support however).
 Actually, we would need a grapheme to be its own type, because 
 comparing  two char[]'s that don't contain equivalent bits and having 
 them be equal,  violates the expectation that char[] is an array.
 
 So the string_t!char would return a grapheme_t!char (names to be  
 discussed) as its element type.

Or you could make a grapheme a string_t. ;-) -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 15 2011
next sibling parent reply foobar <foo bar.com> writes:
Steven Schveighoffer Wrote:

 On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin  
 <michel.fortin michelf.com> wrote:
 
 On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:

 I'm not suggesting we impose it, just that we make it the default. If   
 you want to iterate by dchar, wchar, or char, just write:
  	foreach (dchar c; "exposé") {}
 	foreach (wchar c; "exposé") {}
 	foreach (char c; "exposé") {}
 	// or
 	foreach (dchar c; "exposé".by!dchar()) {}
 	foreach (wchar c; "exposé".by!wchar()) {}
 	foreach (char c; "exposé".by!char()) {}
  and it'll work. But the default would be a slice containing the   
 grapheme, because this is the right way to represent a Unicode  
 character.

I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.

I'm glad we agree on that now.

It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee defined standard where there are 10 ways to do everything, I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper). I once told a colleague who was on a standards committee that their proposed KLV standard (key length value) was ridiculous. The wise committee had decided that in order to avoid future issues, the length would be encoded as a single byte if < 128, or 128 + length of the length field for anything higher. This means you could potentially have to parse and process a 127-byte integer!
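The KLV length rule described above can be sketched as a decoder. This is only an illustration under stated assumptions: the byte order of the long form is assumed to be big-endian (the post does not say), and the sketch refuses length fields wider than 8 bytes because they would not fit in a ulong. DecodedLength and decodeLength are hypothetical names.

```d
import std.exception : enforce;

/// Hypothetical result type: the decoded length and how many bytes encoded it.
struct DecodedLength
{
    ulong value;
    size_t consumed;
}

/// Decodes a length that is either a single byte below 128, or
/// 128 + n followed by n bytes holding the length (assumed big-endian).
DecodedLength decodeLength(const(ubyte)[] buf)
{
    enforce(buf.length > 0, "empty buffer");
    immutable ubyte first = buf[0];
    if (first < 128)
        return DecodedLength(first, 1);      // short form: the byte is the length

    immutable size_t n = first - 128;        // long form: n length bytes follow
    // The format allows n up to 127; this sketch caps it at 8 (an assumption).
    enforce(n >= 1 && n <= 8, "length field too wide for this sketch");
    enforce(buf.length >= 1 + n, "truncated length field");

    ulong value = 0;
    foreach (b; buf[1 .. 1 + n])
        value = (value << 8) | b;
    return DecodedLength(value, 1 + n);
}

void main()
{
    assert(decodeLength([0x05]).value == 5);
    assert(decodeLength([0x82, 0x01, 0x00]).value == 256);
    assert(decodeLength([0x82, 0x01, 0x00]).consumed == 3);
}
```

The cap is the practical answer to the 127-byte integer complaint: a real parser has to pick some limit, even though the encoding itself does not.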
 What if I modified my proposed string_t type to return T[] as its  
 element  type, as you say, and string literals are typed as  
 string_t!(whatever)?   In addition, the restrictions I imposed on  
 slicing a code point actually  get imposed on slicing a grapheme.  That  
 is, it is illegal to substring a  string_t in a way that slices through  
 a grapheme (and by deduction, a code  point)?

I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string however. Some other raw comments: I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?).

I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility. However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.
 If strings and arrays of code units are distinct, slicing in the middle  
 of a grapheme or in the middle of a code point could throw an error, but  
 for performance reasons it should probably check for that only when  
 array bounds checking is turned on (that would require compiler support  
 however).

Not really, it could use assert, but that throws an assert error instead of a RangeError. Of course, both are errors and will abort the program. I do wish there was a version(noboundscheck) to do this kind of stuff with...
 Actually, we would need a grapheme to be its own type, because  
 comparing  two char[]'s that don't contain equivalent bits and having  
 them be equal,  violates the expectation that char[] is an array.
  So the string_t!char would return a grapheme_t!char (names to be   
 discussed) as its element type.

Or you could make a grapheme a string_t. ;-)

I'm a little uneasy having a range return itself as its element type. For all intents and purposes, a grapheme is a string of one 'element', so it could potentially be a string_t. It does seem daunting to have so many types, but at the same time, types convey relationships at compile time that can make coding impossible to get wrong, or make things actually possible when having a single type doesn't. I'll give you an example from a previous life: Tango had a type called DateTime. This type represented *either* a point in time, or a span of time (depending on how you used it). But I proposed we switch to two distinct types, one for a point in time, one for a span of time. It was argued that both were so similar, why couldn't we just keep one type? The answer is simple -- having them be separate types allows me to express relationships that the compiler enforces. For example, you can add two time spans together, but you can't add two points in time together. Or maybe you want a function to accept a time span (like a sleep operation). If there was only one type, then sleep(DateTime.now()) compiles and sleeps for what, 2011 years? ;) I feel that making extra types when the relationship between them is important is worth the possible repetition of functionality. Catching bugs during compilation is soooo much better than experiencing them during runtime. -Steve
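The point-versus-span argument can be sketched in a few lines. TimePoint, TimeSpan and sleepTicks are hypothetical names for illustration and are not Tango's actual API; Phobos later took the same route with std.datetime.SysTime and core.time.Duration.

```d
/// Hypothetical type: a length of time.
struct TimeSpan
{
    long ticks;

    // span + span and span - span are both spans
    TimeSpan opBinary(string op)(TimeSpan rhs) const
        if (op == "+" || op == "-")
    {
        return TimeSpan(mixin("ticks " ~ op ~ " rhs.ticks"));
    }
}

/// Hypothetical type: a point in time.
struct TimePoint
{
    long ticks;

    // point +/- span is a point
    TimePoint opBinary(string op)(TimeSpan rhs) const
        if (op == "+" || op == "-")
    {
        return TimePoint(mixin("ticks " ~ op ~ " rhs.ticks"));
    }

    // point - point is a span; point + point is deliberately not defined
    TimeSpan opBinary(string op)(TimePoint rhs) const
        if (op == "-")
    {
        return TimeSpan(ticks - rhs.ticks);
    }
}

/// Stand-in for a sleep function: it only accepts a span.
long sleepTicks(TimeSpan d) { return d.ticks; }

// Both mistakes from the post are rejected at compile time.
static assert(!__traits(compiles, TimePoint(1) + TimePoint(2)));
static assert(!__traits(compiles, sleepTicks(TimePoint(1))));

void main()
{
    auto start = TimePoint(100);
    auto end = start + TimeSpan(50);
    assert((end - start).ticks == 50);
    assert(sleepTicks(TimeSpan(10) + TimeSpan(5)) == 15);
}
```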

I like Michel's proposed semantics and I also agree with you that it should be a distinct string type and not break consistency of regular arrays. Regarding your last point: Do you mean that a grapheme would be a sub-type of string? (a specialization where the string represents a single element)? If so, then it sounds good to me.
Jan 15 2011
parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Monday 17 January 2011 04:08:08 Steven Schveighoffer wrote:
 On Sat, 15 Jan 2011 17:19:48 -0500, foobar <foo bar.com> wrote:
 I like Michel's proposed semantics and I also agree with you that it
 should be a distinct string type and not break consistency of regular
 arrays.
 
 Regarding your last point: Do you mean that a grapheme would be a
 sub-type of string? (a specialization where the string represents a
 single element)? If so, then it sounds good to me.

A grapheme would be its own specialized type. I'd probably remove the range primitives to really differentiate it. Unfortunately, due to the inability to statically check this, the invariant would have to be a runtime check. Most likely this check would be disabled in release mode. This can cause problems, and I can see why it is attractive to use strings to implement graphemes, but that also has its problems. With grapheme being its own type, we are providing a way to optimize functions, and allow further restrictions on function parameters. At the end of the day, perhaps grapheme *should* just be a string. We'll have to see how this breaks in practice, either way.

I think that it would make good sense for a grapheme to be a struct which holds a string, as Andrei suggested:

	struct Grapheme(Char) if (isSomeChar!Char)
	{
	    private const Char[] rep;
	    ...
	}

I really think that trying to use strings to represent graphemes is asking for it. The element of a range should be a different type than that of the range itself. - Jonathan M Davis
Jan 17 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/17/11 6:25 AM, Jonathan M Davis wrote:
 On Monday 17 January 2011 04:08:08 Steven Schveighoffer wrote:
 On Sat, 15 Jan 2011 17:19:48 -0500, foobar<foo bar.com>  wrote:
 I like Michel's proposed semantics and I also agree with you that it
 should be a distinct string type and not break consistency of regular
 arrays.

 Regarding your last point: Do you mean that a grapheme would be a
 sub-type of string? (a specialization where the string represents a
 single element)? If so, then it sounds good to me.

A grapheme would be its own specialized type. I'd probably remove the range primitives to really differentiate it. Unfortunately, due to the inability to statically check this, the invariant would have to be a runtime check. Most likely this check would be disabled in release mode. This can cause problems, and I can see why it is attractive to use strings to implement graphemes, but that also has its problems. With grapheme being its own type, we are providing a way to optimize functions, and allow further restrictions on function parameters. At the end of the day, perhaps grapheme *should* just be a string. We'll have to see how this breaks in practice, either way.

I think that it would make good sense for a grapheme to be a struct which holds a string, as Andrei suggested:

	struct Grapheme(Char) if (isSomeChar!Char)
	{
	    private const Char[] rep;
	    ...
	}

I really think that trying to use strings to represent graphemes is asking for it. The element of a range should be a different type than that of the range itself. - Jonathan M Davis
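One way the struct sketched above could be filled in, purely as an illustration: the members below are assumptions, not anything proposed in the thread, and the normalized comparison relies on std.uni.normalize from a later Phobos. The field is declared const(Char)[] rather than const Char[] so that values stay assignable. Note that std.uni later gained its own, differently designed Grapheme type; it is not imported here.

```d
import std.traits : isSomeChar;
import std.uni : normalize, NFC;

/// Illustrative wrapper around the code units of one grapheme.
struct Grapheme(Char) if (isSomeChar!Char)
{
    private const(Char)[] rep;

    this(const(Char)[] codeUnits) { rep = codeUnits; }

    /// The underlying slice, unnormalized.
    const(Char)[] codeUnits() const { return rep; }

    /// Grapheme == Grapheme compares canonical forms, not raw code units.
    bool opEquals(const Grapheme rhs) const
    {
        return normalize!NFC(rep) == normalize!NFC(rhs.rep);
    }

    /// Grapheme == "literal" works the same way.
    bool opEquals(const(Char)[] rhs) const
    {
        return normalize!NFC(rep) == normalize!NFC(rhs);
    }
}

void main()
{
    auto a = Grapheme!char("\u00E9");    // precomposed é
    auto b = Grapheme!char("e\u0301");   // e + combining acute accent

    assert(a.codeUnits != b.codeUnits);  // the arrays differ
    assert(a == b);                      // the graphemes do not
    assert(b == "\u00E9");
}
```

This keeps char[] comparing as a plain array while the wrapper carries the Unicode-aware equality, which is the separation argued for earlier in the thread.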

If someone makes a careful submission of a Grapheme to Phobos as described above, it has a high chance of being accepted. Andrei
Jan 17 2011
prev sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin  
 <michel.fortin michelf.com> wrote:
 
 On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:
 
 I'm not suggesting we impose it, just that we make it the default. If   
 you want to iterate by dchar, wchar, or char, just write:
 	foreach (dchar c; "exposé") {}
 	foreach (wchar c; "exposé") {}
 	foreach (char c; "exposé") {}
 	// or
 	foreach (dchar c; "exposé".by!dchar()) {}
 	foreach (wchar c; "exposé".by!wchar()) {}
 	foreach (char c; "exposé".by!char()) {}
  and it'll work. But the default would be a slice containing the   
 grapheme, because this is the right way to represent a Unicode  
 character.

I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.

I'm glad we agree on that now.

It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee defined standard where there are 10 ways to do everything, I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).

Actually, I don't think Unicode was so badly designed. It's just that nobody had an idea of the real scope of the problem they had in hand at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now, can you name one other standard that evolved enough to get 6 major versions? I'm surprised it's not worse given all that it must support.

That said, I'm sure if someone could redesign Unicode by breaking backward-compatibility we'd have something simpler. You could probably get rid of pre-combined characters and reduce the number of normalization forms. But would you be able to get rid of normalization entirely? I don't think so. Reinventing Unicode is probably not worth it.
 I'm not opposed to that on principle. I'm a little uneasy about having  
 so many types representing a string however. Some other raw comments:
 
 I agree that things would be more coherent if char[], wchar[], and  
 dchar[] behaved like other arrays, but I can't really see a  
 justification for those types to be in the language if there's nothing  
 special about them (why not a library type?).

I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility. However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.

Indeed, the change would probably be too radical for D2.

I think we agree that the default type should behave as a Unicode string, not an array of characters. I understand your opposition to conflating arrays of char with strings, and I agree with you to a certain extent that it could have been done better. But we can't really change the type of string literals, can we? The only thing we can change (I hope) at this point is how iterating on strings works.

Walter said earlier that he opposes changing foreach's default element type to dchar for char[] and wchar[] (as Andrei did for ranges) on the grounds that it would silently break D1 compatibility. This is a valid point in my opinion.

I think you're right when you say that not treating char[] as an array of characters breaks, to a certain extent, C compatibility. Another valid point.

That said, I want to emphasize that iterating by grapheme, contrary to iterating by dchar, does not break any code *silently*. The compiler will complain loudly that you're comparing a string to a char, so you'll have to change your code somewhere if you want things to compile. You'll have to look at the code and decide what to do.

One more thing: NSString in Cocoa is in essence the same thing as I'm proposing here: an array of UTF-16 code units, but with string behaviour. It supports by-code-unit indexing, but appending, comparing, searching for substrings, etc. all behave correctly as a Unicode string. Again, I agree that it's probably not the best design, but I can tell you it works well in practice. In fact, NSString doesn't even expose the concept of grapheme, it just uses them internally, and you're pretty much limited to the built-in operations. I think what we have here in concept is much better... even if it somewhat conflates code-unit arrays and strings.
 Or you could make a grapheme a string_t. ;-)

I'm a little uneasy having a range return itself as its element type. For all intents and purposes, a grapheme is a string of one 'element', so it could potentially be a string_t. It does seem daunting to have so many types, but at the same time, types convey relationships at compile time that can make coding impossible to get wrong, or make things actually possible when having a single type doesn't. I'll give you an example from a previous life: [...] I feel that making extra types when the relationship between them is important is worth the possible repetition of functionality. Catching bugs during compilation is soooo much better than experiencing them during runtime.

I can understand the utility of a separate type in your DateTime example, but in this case I fail to see any advantage. I mean, a grapheme is a slice of a string, can have multiple code points (like a string), can be appended the same way as a string, can be composed or decomposed using canonical normalization or compatibility normalization (like a string), and should be sorted, uppercased, and lowercased according to Unicode rules (like a string). Basically, a grapheme is just a string that happens to contain only one grapheme. What would a custom type do differently than a string? Also, grapheme == "a" is easy to understand because both are strings. But if a grapheme is a separate type, what would a grapheme literal look like? So in the end I don't think a grapheme needs a specific type, at least not for general purpose text processing. If I split a string on whitespace, do I get a range where elements are of type "word"? No, just sliced strings. That said, I'm much less concerned by the type used to represent a grapheme than by the Unicode correctness. I'm not opposed to a separate type, I just don't really see the point. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 15 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/15/11 4:45 PM, Michel Fortin wrote:
 On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer"
 <schveiguy yahoo.com> said:

 On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin
 <michel.fortin michelf.com> wrote:

 On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"
 <schveiguy yahoo.com> said:

 I'm not suggesting we impose it, just that we make it the default.
 If you want to iterate by dchar, wchar, or char, just write:
 foreach (dchar c; "exposé") {}
 foreach (wchar c; "exposé") {}
 foreach (char c; "exposé") {}
 // or
 foreach (dchar c; "exposé".by!dchar()) {}
 foreach (wchar c; "exposé".by!wchar()) {}
 foreach (char c; "exposé".by!char()) {}
 and it'll work. But the default would be a slice containing the
 grapheme, because this is the right way to represent a Unicode
 character.

I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.

I'm glad we agree on that now.

It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee defined standard where there are 10 ways to do everything, I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).

Actually, I don't think Unicode was so badly designed. It's just that nobody had an idea of the real scope of the problem they had in hand at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now, can you name one other standard that evolved enough to get 6 major versions? I'm surprised it's not worse given all that it must support. That said, I'm sure if someone could redesign Unicode by breaking backward-compatibility we'd have something simpler. You could probably get rid of pre-combined characters and reduce the number of normalization forms. But would you be able to get rid of normalization entirely? I don't think so. Reinventing Unicode is probably not worth it.
 I'm not opposed to that on principle. I'm a little uneasy about
 having so many types representing a string however. Some other raw
 comments:

 I agree that things would be more coherent if char[], wchar[], and
 dchar[] behaved like other arrays, but I can't really see a
 justification for those types to be in the language if there's
 nothing special about them (why not a library type?).

I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility. However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.

Indeed, the change would probably be too radical for D2. I think we agree that the default type should behave as a Unicode string, not an array of characters. I understand your opposition to conflating arrays of char with strings, and I agree with you to a certain extent that it could have been done better. But we can't really change the type of string literals, can we? The only thing we can change (I hope) at this point is how iterating on strings works. Walter said earlier that he opposes changing foreach's default element type to dchar for char[] and wchar[] (as Andrei did for ranges) on the grounds that it would silently break D1 compatibility. This is a valid point in my opinion. I think you're right when you say that not treating char[] as an array of characters breaks, to a certain extent, C compatibility. Another valid point. That said, I want to emphasize that iterating by grapheme, contrary to iterating by dchar, does not break any code *silently*. The compiler will complain loudly that you're comparing a string to a char, so you'll have to change your code somewhere if you want things to compile. You'll have to look at the code and decide what to do. One more thing: NSString in Cocoa is in essence the same thing as I'm proposing here: an array of UTF-16 code units, but with string behaviour. It supports by-code-unit indexing, but appending, comparing, searching for substrings, etc. all behave correctly as a Unicode string. Again, I agree that it's probably not the best design, but I can tell you it works well in practice. In fact, NSString doesn't even expose the concept of grapheme, it just uses them internally, and you're pretty much limited to the built-in operations. I think what we have here in concept is much better... even if it somewhat conflates code-unit arrays and strings.

I'm unclear on where this is converging to. At this point the commitment of the language and its standard library to (a) UTF array representation and (b) code points conceptualization is quite strong. Changing that would be quite difficult and disruptive, and the benefits are virtually nonexistent for most of D's user base.

It may be more realistic to consider using what we have as back-end for grapheme-oriented processing. For example:

	struct Grapheme(Char) if (isSomeChar!Char)
	{
	    private const Char[] rep;
	    ...
	}

	auto byGrapheme(S)(S s) if (isSomeString!S)
	{
	    ...
	}

	string s = "Hello";
	foreach (g; byGrapheme(s))
	{
	    ...
	}

Andrei
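As a rough illustration of grapheme processing layered on top of ordinary strings, here is a sketch that cuts a string into per-grapheme slices. It assumes a later Phobos in which std.uni provides graphemeStride (added after this thread); graphemeSlices is a hypothetical helper. Each element is a slice of the original string, so nothing is copied or normalized.

```d
import std.uni : graphemeStride;

/// Hypothetical helper: one slice of `s` per grapheme, in order.
string[] graphemeSlices(string s)
{
    string[] slices;
    size_t i = 0;
    while (i < s.length)
    {
        // Number of code units occupied by the grapheme starting at i.
        immutable n = graphemeStride(s, i);
        slices ~= s[i .. i + n];
        i += n;
    }
    return slices;
}

void main()
{
    auto parts = graphemeSlices("expose\u0301");  // decomposed é at the end
    assert(parts.length == 6);
    assert(parts[0] == "e");
    assert(parts[5] == "e\u0301");   // base letter and accent stay together
}
```

The built-in string type is untouched here; the grapheme view is something the caller opts into, which is the direction suggested above.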
Jan 15 2011
next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday 15 January 2011 19:25:47 Jonathan M Davis wrote:
 On Saturday 15 January 2011 15:59:27 Andrei Alexandrescu wrote:
 On 1/15/11 4:45 PM, Michel Fortin wrote:
 On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer"
 <schveiguy yahoo.com> said:
 On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin
 <michel.fortin michelf.com> wrote:
 On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"
 <schveiguy yahoo.com> said:
 I'm not suggesting we impose it, just that we make it the default.
 If you want to iterate by dchar, wchar, or char, just write:
 foreach (dchar c; "exposé") {}
 foreach (wchar c; "exposé") {}
 foreach (char c; "exposé") {}
 // or
 foreach (dchar c; "exposé".by!dchar()) {}
 foreach (wchar c; "exposé".by!wchar()) {}
 foreach (char c; "exposé".by!char()) {}
 and it'll work. But the default would be a slice containing the
 grapheme, because this is the right way to represent a Unicode
 character.

I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.

I'm glad we agree on that now.

It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee defined standard where there are 10 ways to do everything, I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).

Actually, I don't think Unicode was so badly designed. It's just that nobody had an idea of the real scope of the problem they had in hand at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now, can you name one other standard that evolved enough to get 6 major versions? I'm surprised it's not worse given all that it must support.

That said, I'm sure if someone could redesign Unicode by breaking backward-compatibility we'd have something simpler. You could probably get rid of pre-combined characters and reduce the number of normalization forms. But would you be able to get rid of normalization entirely? I don't think so. Reinventing Unicode is probably not worth it.
 I'm not opposed to that on principle. I'm a little uneasy about
 having so many types representing a string however. Some other raw
 comments:

 I agree that things would be more coherent if char[], wchar[], and
 dchar[] behaved like other arrays, but I can't really see a
 justification for those types to be in the language if there's
 nothing special about them (why not a library type?).

I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility.

However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.

Indeed, the change would probably be too radical for D2.

I think we agree that the default type should behave as a Unicode string, not an array of characters. I understand your opposition to conflating arrays of char with strings, and I agree with you to a certain extent that it could have been done better. But we can't really change the type of string literals, can we? The only thing we can change (I hope) at this point is how iterating on strings works.

 Walter said earlier that he opposes changing foreach's default element
 type to dchar for char[] and wchar[] (as Andrei did for ranges) on the
 ground that it would silently break D1 compatibility. This is a valid
 point in my opinion.

 I think you're right when you say that not treating char[] as an array
 of characters breaks, to a certain extent, C compatibility. Another
 valid point.

 That said, I want to emphasize that iterating by grapheme, contrary to
 iterating by dchar, does not break any code *silently*. The compiler
 will complain loudly that you're comparing a string to a char, so
 you'll have to change your code somewhere if you want things to
 compile. You'll have to look at the code and decide what to do.

 One more thing:

 NSString in Cocoa is in essence the same thing as I'm proposing here:
 an array of UTF-16 code units, but with string behaviour. It supports
 by-code-unit indexing, but appending, comparing, searching for
 substrings, etc. all behave correctly as a Unicode string. Again, I
 agree that it's probably not the best design, but I can tell you it
 works well in practice. In fact, NSString doesn't even expose the
 concept of grapheme, it just uses them internally, and you're pretty
 much limited to the built-in operations. I think what we have here in
 concept is much better... even if it somewhat conflates code-unit
 arrays and strings.

I'm unclear on where this is converging to. At this point the commitment of the language and its standard library to (a) UTF array representation and (b) code points conceptualization is quite strong. Changing that would be quite difficult and disruptive, and the benefits are virtually nonexistent for most of D's user base.

It may be more realistic to consider using what we have as back-end for grapheme-oriented processing. For example:

struct Grapheme(Char) if (isSomeChar!Char)
{
    private const Char[] rep;
    ...
}

auto byGrapheme(S)(S s) if (isSomeString!S)
{
    ...
}

string s = "Hello";
foreach (g; byGrapheme(s))
{
    ...
}

Considering that strings are already dealt with specially in order to have an element of dchar, I wouldn't think that it would be all that disruptive to make it so that they had an element type of Grapheme instead. Wouldn't that then fix all of std.algorithm and the like without really disrupting anything?

The issue of foreach remains, but without being willing to change what foreach defaults to, you can't really fix it - though I'd suggest that we at least make it a warning to iterate over strings without specifying the type. And if foreach were made to understand Grapheme like it understands dchar, then you could do

foreach(Grapheme g; str) { ... }

and have the compiler warn about

foreach(g; str) { ... }

and tell you to use Grapheme if you want to be comparing actual characters.

Regardless, by making strings ranges of Grapheme rather than dchar, I would think that we would solve most of the problem. At minimum, we'd have pretty much the same problems that we have right now with char and wchar arrays, but we'd get rid of a whole class of Unicode problems. So, nothing would be worse, but some of it would be better.

I suppose that the one major omission though is that string comparisons would be by code unit, not graphemes, which would be a problem. == could be made to use graphemes instead, but then you couldn't compare them by code units or code points unless you cast to ubyte[], ushort[], or uint[]... It would still probably be worth making == use graphemes though.

- Jonathan M Davis
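
To make the comparison problem concrete, here is a minimal sketch. It assumes a present-day Phobos in which std.uni provides byGrapheme and normalize (neither existed when this thread was written), and the helper name sameText is hypothetical. It shows that == on two strings that render identically can be false, and that comparing after normalization is one possible way to get a character-level answer.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

/// Hypothetical helper: compares two strings after normalizing both to NFC,
/// so that canonically equivalent spellings of the same text compare equal.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    string precomposed = "\u00E9";   // 'é' as a single code point
    string decomposed  = "e\u0301";  // 'e' followed by COMBINING ACUTE ACCENT

    assert(precomposed.length == 2);               // UTF-8 code units
    assert(decomposed.length == 3);
    assert(precomposed.walkLength == 1);           // code points (dchar)
    assert(decomposed.walkLength == 2);
    assert(decomposed.byGrapheme.walkLength == 1); // graphemes

    assert(precomposed != decomposed);             // built-in == compares code units
    assert(sameText(precomposed, decomposed));     // equal once normalized
}
```

Normalizing on every comparison allocates and is not free, which is the efficiency cost debated in this thread; the sketch only illustrates the semantics, not a recommended default.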
Jan 15 2011
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-15 18:59:27 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 I'm unclear on where this is converging to. At this point the 
 commitment of the language and its standard library to (a) UTF array 
 representation and (b) code points conceptualization is quite strong. 
 Changing that would be quite difficult and disruptive, and the benefits 
 are virtually nonexistent for most of D's user base.

There's still a disagreement about whether a string or a code unit array should be the default string representation, and whether iterating on a code unit array should give you code unit or grapheme elements. Of those who participated in the discussion, I don't think anyone is disputing the idea that a grapheme element is better than a dchar element for iterating over a string.
 It may be more realistic to consider using what we have as back-end for 
 grapheme-oriented processing.
 For example:
 
 struct Grapheme(Char) if (isSomeChar!Char)
 {
      private const Char[] rep;
      ...
 }
 
 auto byGrapheme(S)(S s) if (isSomeString!S)
 {
     ...
 }
 
 string s = "Hello";
 foreach (g; byGrapheme(s))
 {
      ...
 }

No doubt it's easier to implement it that way. The problem is that in most cases it won't be used. How many people really know what is a grapheme?

Of those, how many will forget to use byGrapheme at one time or another? And so in most programs string manipulation will misbehave in the presence of combining characters or unnormalized strings.

If you want to help D programmers write correct code when it comes to Unicode manipulation, you need to help them iterate on real characters (graphemes), and you need the algorithms to apply to real characters (graphemes), not the approximation of a Unicode character that is a code point.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 15 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/15/11 10:45 PM, Michel Fortin wrote:
 On 2011-01-15 18:59:27 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 I'm unclear on where this is converging to. At this point the
 commitment of the language and its standard library to (a) UTF array
 representation and (b) code points conceptualization is quite strong.
 Changing that would be quite difficult and disruptive, and the
 benefits are virtually nonexistent for most of D's user base.

There's still a disagreement about whether a string or a code unit array should be the default string representation, and whether iterating on a code unit array should give you code unit or grapheme elements. Of those who participated in the discussion, I don't think anyone is disputing the idea that a grapheme element is better than a dchar element for iterating over a string.

Disagreement as that might be, a simple fact that needs to be taken into account is that as of right now all of Phobos uses UTF arrays for string representation and dchar as element type.

Besides, for one I do dispute the idea that a grapheme element is better than a dchar element for iterating over a string. The grapheme has the attractiveness of being theoretically clean but at the same time is woefully inefficient and helps languages that few D users need to work with. At least that's my perception, and we need some serious numbers instead of convincing rhetoric to make a big decision.

It's all a matter of picking one's trade-offs. Clearly ASCII is out as no serious amount of non-English text can be trafficked without diacritics. So switching to UTF makes a lot of sense, and that's what D did.

When I introduced std.range and std.algorithm, they'd handle char[] and wchar[] no differently than any other array. A lot of algorithms simply did the wrong thing by default, so I attempted to fix that situation by defining byDchar(). So instead of passing some string str to an algorithm, one would pass byDchar(str).

A couple of weeks went by in testing that state of affairs, and before late I figured that I need to insert byDchar() virtually _everywhere_. There were a couple of algorithms (e.g. Boyer-Moore) that happened to work with arrays for subtle reasons (needless to say, they won't work with graphemes at all). But by and large the situation was that the simple and intuitive code was wrong and that the correct code necessitated inserting byDchar().

So my next decision, which understandably some of the people who didn't go through the experiment may find unintuitive, was to make byDchar() the default. This cleaned up a lot of crap in std itself and saved a lot of crap in the yet-unwritten client code.

I think it's reasonable to understand why I'm happy with the current state of affairs. It is better than anything we've had before and better than everything else I've tried.

Now, thanks to the effort people have spent in this group (thank you!), I have an understanding of the grapheme issue. I guarantee that grapheme-level iteration will have a high cost incurred to it: efficiency and changes in std. The languages that need composing characters for producing meaningful text are few and far between, so it makes sense to confine support for them to libraries that are not the default, unless we find ways to not disrupt everyone else.
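
The effect of making byDchar() the default can be sketched as follows. This assumes current Phobos range primitives, which decode narrow strings to dchar, plus std.utf.byCodeUnit, a later addition that is not part of this discussion; the helper name codePoints is hypothetical.

```d
import std.range : ElementType, front, walkLength;
import std.utf : byCodeUnit;

// Range primitives decode narrow strings: the element type is dchar,
// even though indexing the array yields a single UTF-8 code unit.
static assert(is(ElementType!string == dchar));
static assert(is(typeof("x"[0]) == immutable(char)));

/// Hypothetical helper: number of code points in a UTF-8 string.
size_t codePoints(string s)
{
    return s.walkLength;
}

unittest
{
    string s = "h\u00E9llo";              // "héllo": 'é' takes two code units
    assert(s.length == 6);                // array length counts code units
    assert(codePoints(s) == 5);           // range iteration counts code points
    assert(s.front == 'h');               // front is a decoded dchar
    assert(s.byCodeUnit.walkLength == 6); // explicit opt-out of decoding
}
```

Note that a code point count is still not a grapheme count, which is the gap the rest of the thread argues about.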
 It may be more realistic to consider using what we have as back-end
 for grapheme-oriented processing.
 For example:

 struct Grapheme(Char) if (isSomeChar!Char)
 {
 private const Char[] rep;
 ...
 }

 auto byGrapheme(S)(S s) if (isSomeString!S)
 {
 ...
 }

 string s = "Hello";
 foreach (g; byGrapheme(s))
 {
 ...
 }

No doubt it's easier to implement it that way. The problem is that in most cases it won't be used. How many people really know what is a grapheme?

How many people really should care?
 Of those, how many will forget to use byGrapheme at one time
 or another? And so in most programs string manipulation will misbehave
 in the presence of combining characters or unnormalized strings.

But most strings don't contain combining characters or unnormalized strings.
 If you want to help D programmers write correct code when it comes to
 Unicode manipulation, you need to help them iterate on real characters
 (graphemes), and you need the algorithms to apply to real characters
 (graphemes), not the approximation of a Unicode character that is a code
 point.

I don't think the situation is as clean cut, as grave, and as urgent as you say.

Andrei
Jan 16 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/15/11 10:45 PM, Michel Fortin wrote:
 No doubt it's easier to implement it that way. The problem is that in
 most cases it won't be used. How many people really know what is a
 grapheme?

How many people really should care?

I think the only people who should *not* care are those who have validated that the input does not contain any combining code point. If you know the input *can't* contain combining code points, then it's safe to ignore them.

If we don't make correct Unicode handling the default, someday someone is going to ask a developer to fix a problem where his system doesn't handle some text correctly. Later that day, he'll come to the realization that almost none of his D code and none of the D libraries he uses handle Unicode correctly, and he'll say: can't fix this. His peer working on a similar Objective-C program will have a good laugh.

Sure, correct Unicode handling is slower and more complicated to implement, but at least you know you'll get the right results.
 Of those, how many will forget to use byGrapheme at one time
 or another? And so in most programs string manipulation will misbehave
 in the presence of combining characters or unnormalized strings.

But most strings don't contain combining characters or unnormalized strings.

I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow? A few years ago, many Unicode symbols didn't even show up correctly on Windows. Today, we have Unicode domain names and people start putting funny symbols in them (for instance: <http://◉.ws>). I haven't seen it yet, but we'll surely see combining characters in domain names soon enough (if only as a way to make fun of programs that can't handle Unicode correctly). Well, let me be the first to make fun of such programs: <http://☺̭̏.michelf.com/>. Also, not all combining characters are marks meant to be used by some foreign languages. Some are used for mathematics for instance. Or you could use 20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay indicating some kind of prohibition.
 If you want to help D programmers write correct code when it comes to
 Unicode manipulation, you need to help them iterate on real characters
 (graphemes), and you need the algorithms to apply to real characters
 (graphemes), not the approximation of a Unicode character that is a code
 point.

I don't think the situation is as clean cut, as grave, and as urgent as you say.

I agree it's probably not as clean cut as I say (I'm trying to keep complicated things simple here), but it's something important to decide early because the cost of changing it increases as more code is written.

Quoting the first part of the same post (out of order):
 Disagreement as that might be, a simple fact that needs to be taken 
 into account is that as of right now all of Phobos uses UTF arrays for 
 string representation and dchar as element type.
 
 Besides, for one I do dispute the idea that a grapheme element is 
 better than a dchar element for iterating over a string. The grapheme 
 has the attractiveness of being theoretically clean but at the same 
 time is woefully inefficient and helps languages that few D users need 
 to work with. At least that's my perception, and we need some serious 
 numbers instead of convincing rhetoric to make a big decision.

You'll no doubt get more performance from a grapheme-aware specialized algorithm working directly on code points than by iterating on graphemes returned as string slices. But both will give *correct* results. Implementing a specialized algorithm of this kind becomes an optimization, and it's likely you'll want an optimized version for most string algorithms. I'd like to have some numbers too about performance, but I have none at this time.
 It's all a matter of picking one's trade-offs. Clearly ASCII is out as 
 no serious amount of non-English text can be trafficked without 
 diacritics. So switching to UTF makes a lot of sense, and that's what D 
 did.
 
 When I introduced std.range and std.algorithm, they'd handle char[] and 
 wchar[] no differently than any other array. A lot of algorithms simply 
 did the wrong thing by default, so I attempted to fix that situation by 
 defining byDchar(). So instead of passing some string str to an 
 algorithm, one would pass byDchar(str).
 
 A couple of weeks went by in testing that state of affairs, and before 
 late I figured that I need to insert byDchar() virtually _everywhere_. 
 There were a couple of algorithms (e.g. Boyer-Moore) that happened to 
 work with arrays for subtle reasons (needless to say, they won't work 
 with graphemes at all). But by and large the situation was that the 
 simple and intuitive code was wrong and that the correct code 
 necessitated inserting byDchar().
 
 So my next decision, which understandably some of the people who didn't 
 go through the experiment may find unintuitive, was to make byDchar() 
 the default. This cleaned up a lot of crap in std itself and saved a 
 lot of crap in the yet-unwritten client code.

But were your algorithms *correct* in the first place? I'd argue that by making byDchar the default you've not saved yourself from any crap because dchar isn't the right layer of abstraction.
 I think it's reasonable to understand why I'm happy with the current 
 state of affairs. It is better than anything we've had before and 
 better than everything else I've tried.

It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point characters and can't imagine yourself having to deal with them on a semi-frequent basis. Other people won't be so happy with this state of affairs, but they'll probably notice only after most of their code has been written unaware of the problem.
 Now, thanks to the effort people have spent in this group (thank you!), 
 I have an understanding of the grapheme issue. I guarantee that 
 grapheme-level iteration will have a high cost incurred to it: 
 efficiency and changes in std. The languages that need composing 
 characters for producing meaningful text are few and far between, so it 
 makes sense to confine support for them to libraries that are not the 
 default, unless we find ways to not disrupt everyone else.

We all are more aware of the problem now, that's a good thing. :-)

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 16 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 1/15/11 10:45 PM, Michel Fortin wrote:
 No doubt it's easier to implement it that way. The problem is that in
 most cases it won't be used. How many people really know what is a
 grapheme?

How many people really should care?

I think the only people who should *not* care are those who have validated that the input does not contain any combining code point. If you know the input *can't* contain combining code points, then it's safe to ignore them.

I agree. Now let me ask again: how many people really should care?
 If we don't make correct Unicode handling the default, someday someone
 is going to ask a developer to fix a problem where his system doesn't
 handle some text correctly. Later that day, he'll come to the
 realization that almost none of his D code and none of the D libraries
 he uses handle Unicode correctly, and he'll say: can't fix this. His peer
 working on a similar Objective-C program will have a good laugh.

 Sure, correct Unicode handling is slower and more complicated to
 implement, but at least you know you'll get the right results.

I love the increased precision, but again I'm not sure how many people ever manipulate text with combining characters. Meanwhile they'll complain that D is slower than other languages.
 Of those, how many will forget to use byGrapheme at one time
 or another? And so in most programs string manipulation will misbehave
 in the presence of combining characters or unnormalized strings.

But most strings don't contain combining characters or unnormalized strings.

I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?

I don't think languages will acquire more diacritics soon. I do hope, of course, that D applications gain more usage in the Arabic, Hebrew etc. world.
 A few years ago, many Unicode symbols didn't even show up correctly on
 Windows. Today, we have Unicode domain names and people start putting
 funny symbols in them (for instance: <http://◉.ws>). I haven't seen it
 yet, but we'll surely see combining characters in domain names soon
 enough (if only as a way to make fun of programs that can't handle
 Unicode correctly). Well, let me be the first to make fun of such
 programs: <http://☺̭̏.michelf.com/>.

Would you bet the language on that?
 Also, not all combining characters are marks meant to be used by some
 foreign languages. Some are used for mathematics for instance. Or you
 could use 20E0 COMBINING ENCLOSING CIRCLE BACKSLASH as an overlay
 indicating some kind of prohibition.


 If you want to help D programmers write correct code when it comes to
 Unicode manipulation, you need to help them iterate on real characters
 (graphemes), and you need the algorithms to apply to real characters
 (graphemes), not the approximation of a Unicode character that is a code
 point.

I don't think the situation is as clean cut, as grave, and as urgent as you say.

I agree it's probably not as clean cut as I say (I'm trying to keep complicated things simple here), but it's something important to decide early because the cost of changing it increases as more code is written.

Agreed.
 Quoting the first part of the same post (out of order):

 Disagreement as that might be, a simple fact that needs to be taken
 into account is that as of right now all of Phobos uses UTF arrays for
 string representation and dchar as element type.

 Besides, for one I do dispute the idea that a grapheme element is
 better than a dchar element for iterating over a string. The grapheme
 has the attractiveness of being theoretically clean but at the same
 time is woefully inefficient and helps languages that few D users need
 to work with. At least that's my perception, and we need some serious
 numbers instead of convincing rhetoric to make a big decision.

You'll no doubt get more performance from a grapheme-aware specialized algorithm working directly on code points than by iterating on graphemes returned as string slices. But both will give *correct* results. Implementing a specialized algorithm of this kind becomes an optimization, and it's likely you'll want an optimized version for most string algorithms. I'd like to have some numbers too about performance, but I have none at this time.

I spent a fair amount of time comparing ASCII vs. Unicode code speed. The fact of the matter is that the overhead is measurable and often high. Also it occurs at a very core level. For starters, the grapheme itself is larger and has one extra indirection. I am confident the marginal overhead for graphemes would be considerable.
 It's all a matter of picking one's trade-offs. Clearly ASCII is out as
 no serious amount of non-English text can be trafficked without
 diacritics. So switching to UTF makes a lot of sense, and that's what
 D did.

 When I introduced std.range and std.algorithm, they'd handle char[]
 and wchar[] no differently than any other array. A lot of algorithms
 simply did the wrong thing by default, so I attempted to fix that
 situation by defining byDchar(). So instead of passing some string str
 to an algorithm, one would pass byDchar(str).

 A couple of weeks went by in testing that state of affairs, and before
 late I figured that I need to insert byDchar() virtually _everywhere_.
 There were a couple of algorithms (e.g. Boyer-Moore) that happened to
 work with arrays for subtle reasons (needless to say, they won't work
 with graphemes at all). But by and large the situation was that the
 simple and intuitive code was wrong and that the correct code
 necessitated inserting byDchar().

 So my next decision, which understandably some of the people who
 didn't go through the experiment may find unintuitive, was to make
 byDchar() the default. This cleaned up a lot of crap in std itself and
 saved a lot of crap in the yet-unwritten client code.

But were your algorithms *correct* in the first place? I'd argue that by making byDchar the default you've not saved yourself from any crap because dchar isn't the right layer of abstraction.

It was correct for all but a couple languages. Again: most of today's languages don't ever need combining characters.
 I think it's reasonable to understand why I'm happy with the current
 state of affairs. It is better than anything we've had before and
 better than everything else I've tried.

It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point characters and can't imagine yourself having to deal with them on a semi-frequent basis.

Do you, and can you?
 Other people won't be so happy with this state of affairs, but
 they'll probably notice only after most of their code has been written
 unaware of the problem.

They can't be unaware and write said code.
 Now, thanks to the effort people have spent in this group (thank
 you!), I have an understanding of the grapheme issue. I guarantee that
 grapheme-level iteration will have a high cost incurred to it:
 efficiency and changes in std. The languages that need composing
 characters for producing meaningful text are few and far between, so
 it makes sense to confine support for them to libraries that are not
 the default, unless we find ways to not disrupt everyone else.

We all are more aware of the problem now, that's a good thing. :-)

All I wish is it's not blown out of proportion. It fares rather low on my list of library issues that D has right now.

Andrei
Jan 16 2011
next sibling parent reply Daniel Gibson <metalcaedes gmail.com> writes:
Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:
 On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 But most strings don't contain combining characters or unnormalized
 strings.

I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?

I don't think languages will acquire more diacritics soon. I do hope, of course, that D applications gain more usage in the Arabic, Hebrew etc. world.

So why does D use Unicode anyway? If you don't care about not-often used languages anyway, you could have used UCS-2 like Java. Or plain 8-bit ISO-8859-* (the user can decide which encoding he wants/needs). You could as well say "we don't need to use dchar to represent a proper code point, wchar is enough for most use cases and has less overhead anyway".
 I think it's reasonable to understand why I'm happy with the current
 state of affairs. It is better than anything we've had before and
 better than everything else I've tried.

It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point characters and can't imagine yourself having to deal with them on a semi-frequent basis.

Do you, and can you?
 Other people won't be so happy with this state of affairs, but
 they'll probably notice only after most of their code has been written
 unaware of the problem.

They can't be unaware and write said code.

Fun fact: Germany recently introduced a new ID card and some of the software that was developed for this and is used in some record sections fucks up when a name contains diacritics. I think especially when you're handling names (and much software does, I think) it's crucial to have proper support for all kinds of chars. Of course many programmers are not aware that Umlaute and ß working doesn't mean that all other kinds of strange characters work as well.

Cheers,
- Daniel
Jan 16 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/16/11 6:42 PM, Daniel Gibson wrote:
 Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:
 On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 But most strings don't contain combining characters or unnormalized
 strings.

I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?

I don't think languages will acquire more diacritics soon. I do hope, of course, that D applications gain more usage in the Arabic, Hebrew etc. world.

So why does D use Unicode anyway? If you don't care about not-often used languages anyway, you could have used UCS-2 like Java. Or plain 8-bit ISO-8859-* (the user can decide which encoding he wants/needs). You could as well say "we don't need to use dchar to represent a proper code point, wchar is enough for most use cases and has less overhead anyway".

I consider UTF8 superior to all of the above.
 I think it's reasonable to understand why I'm happy with the current
 state of affairs. It is better than anything we've had before and
 better than everything else I've tried.

It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point characters and can't imagine yourself having to deal with them on a semi-frequent basis.

Do you, and can you?
 Other people won't be so happy with this state of affairs, but
 they'll probably notice only after most of their code has been written
 unaware of the problem.

They can't be unaware and write said code.

Fun fact: Germany recently introduced a new ID card and some of the software that was developed for this and is used in some record sections fucks up when a name contains diacritics. I think especially when you're handling names (and much software does, I think) it's crucial to have proper support for all kinds of chars. Of course many programmers are not aware that Umlaute and ß working doesn't mean that all other kinds of strange characters work as well.

Cheers,
- Daniel

I think German text works well with dchar.

Andrei
Jan 16 2011
parent reply Daniel Gibson <metalcaedes gmail.com> writes:
Am 17.01.2011 03:45, schrieb Andrei Alexandrescu:
 On 1/16/11 6:42 PM, Daniel Gibson wrote:
 Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:
 On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 But most strings don't contain combining characters or unnormalized
 strings.

I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?

I don't think languages will acquire more diacritics soon. I do hope, of course, that D applications gain more usage in the Arabic, Hebrew etc. world.

So why does D use Unicode anyway? If you don't care about not-often used languages anyway, you could have used UCS-2 like Java. Or plain 8-bit ISO-8859-* (the user can decide which encoding he wants/needs). You could as well say "we don't need to use dchar to represent a proper code point, wchar is enough for most use cases and has less overhead anyway".

I consider UTF8 superior to all of the above.

Really? UTF32 - maybe. But IMHO even when not considering graphemes and such UTF8 sucks hard in comparison to those because one code point consists of 1-4 code units (even in German 1-2 code units).
 I think it's reasonable to understand why I'm happy with the current
 state of affairs. It is better than anything we've had before and
 better than everything else I've tried.

It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point characters and can't imagine yourself having to deal with them on a semi-frequent basis.

Do you, and can you?
 Other people won't be so happy with this state of affairs, but
 they'll probably notice only after most of their code has been written
 unaware of the problem.

They can't be unaware and write said code.

Fun fact: Germany recently introduced a new ID card and some of the software that was developed for this and is used in some record sections fucks up when a name contains diacritics. I think especially when you're handling names (and much software does, I think) it's crucial to have proper support for all kinds of chars. Of course many programmers are not aware that, if Umlaute and ß works it doesn't mean that all other kinds of strange characters work as well. Cheers, - Daniel

I think German text works well with dchar.

Yes, but even in Germany there are people whose names contain "strange" characters ;) Is it common to have programs that deal with text in a specific language but not with names?

I do understand your resistance to support Unicode properly - it's a lot of trouble and makes things inefficient (more inefficient than UTF8/16 already are because of that code point != code unit thing).

Another thing is that due to bad support from fonts or console/GUI technology it may happen (quite often) that one grapheme is *not* displayed as a single character, thus messing up formatting anyway (Still you probably should cut a string within a grapheme).

So here's what I think can be done (and, at least the first two points, especially the first, should be done):

1. Mention the Grapheme and Digraph situation in string related documentation (std.string and maybe string-related stuff in std.algorithm like Splitter) to make sure people who use Phobos are aware of the problem. Then at least they can't say that nobody told them when their Objective-C using colleagues are laughing at their broken unicode-support ;)

2. Maybe add some functions that *do* deal with this. Like "bool isPartOfGrapheme(dchar c)" or "bool isDigraph(dchar c)" so people can check themselves, if they just split their string within a grapheme or something.

3. Include a proper Unicode-string type/module, if somebody has the time and knowledge to develop one. spir already started something like that AFAIK and Steven Schveighoffer also is even working on a complete string type - maybe these efforts could be combined?

I guess default strings will stay mostly the way they are (but please add an ASCII type or allow ubyte[] asciiStr = "asdf";).
Having an additional type in Phobos that works correctly in all cases (e.g. Arabic, Hebrew, Japanese, ..) would be really great, though.

 UniString uStr = new UniString("sdfüñẫ");
 UniString uStr2 = uStr[3..$]; // "üñẫ"
 UniGraph ug = uStr[5]; // 'ẫ'
 size_t i = uStr2.length; // 3

something like that maybe (of course plus a lot of other stuff like proper comparison for different encodings of the same char like a modified icmp() discussed before).
But something like

 size_t len = uniLen("sdfüñẫ"); // 6
 string s = uniSlice(str, 3, str.length); // == str.uniSlice(3, str.length);

etc may be just as good.

(I hope this all made sense)
 Andrei

Cheers, - Daniel
Jan 16 2011
parent Daniel Gibson <metalcaedes gmail.com> writes:
Am 17.01.2011 04:38, schrieb Daniel Gibson:
 Am 17.01.2011 03:45, schrieb Andrei Alexandrescu:
 On 1/16/11 6:42 PM, Daniel Gibson wrote:
 Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:
 On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 But most strings don't contain combining characters or unnormalized
 strings.

I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?

I don't think languages will acquire more diacritics soon. I do hope, of course, that D applications gain more usage in the Arabic, Hebrew etc. world.

So why does D use unicode anyway? If you don't care about not-often used languages anyway, you could have used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide which encoding he wants/needs). You could as well say "we don't need to use dchar to represent a proper code point, wchar is enough for most use cases and has fewer overhead anyway".

I consider UTF8 superior to all of the above.

Really? UTF32 - maybe. But IMHO even when not considering graphemes and such UTF8 sucks hard in comparison to those because one code point consists of 1-4 code units (even in German 1-2 code units).
 I think it's reasonable to understand why I'm happy with the current
 state of affairs. It is better than anything we've had before and
 better than everything else I've tried.

It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point character and can't imagine yourself having to deal with them on a semi-frequent basis.

Do you, and can you?
 Other people won't be so happy with this state of affairs, but
 they'll probably notice only after most of their code has been written
 unaware of the problem.

They can't be unaware and write said code.

Fun fact: Germany recently introduced a new ID card and some of the software that was developed for this and is used in some record sections fucks up when a name contains diacritics. I think especially when you're handling names (and much software does, I think) it's crucial to have proper support for all kinds of chars. Of course many programmers are not aware that, if Umlaute and ß works it doesn't mean that all other kinds of strange characters work as well. Cheers, - Daniel

I think German text works well with dchar.

Yes, but even in Germany there are people whose names contain "strange" characters ;) Is it common to have programs that deal with text in a specific language but not with names? I do understand your resistance to support Unicode properly - it's a lot of trouble and makes things inefficient (more inefficient than UTF8/16 already are because of that code point != code unit thing). Another thing is that due to bad support from fonts or console/GUI technology it may happen (quite often) that one grapheme is *not* displayed as a single character, thus messing up formatting anyway (Still you probably should cut a string within a grapheme).

I meant you should *not* cut a string within a grapheme.
 So here's what I think can be done (and, at least the first two points,
 especially the first, should be done):

 1. Mention the Grapheme and Digraph situation in string related documentation
 (std.string and maybe string-related stuff in std.algorithm like Splitter) to
 make sure people who use Phobos are aware of the problem. Then at least they
 can't say that nobody told them when their Objective-C using colleagues are
 laughing at their broken unicode-support ;)

 2. Maybe add some functions that *do* deal with this.
 Like "bool isPartOfGrapheme(dchar c)" or "bool isDigraph(dchar c)" so people can
 check themselves, if they just split their string within a grapheme or something.

 3. Include a proper Unicode-string type/module, if somebody has the time and
 knowledge to develop one. spir already started something like that AFAIK and
 Steven Schveighoffer also is even working on a complete string type - maybe
 these efforts could be combined?
 I guess default strings will stay mostly the way they are (but please add an
 ASCII type or allow ubyte[] asciiStr = "asdf";).
 Having an additional type in Phobos that works correctly in all cases (e.g.
 Arabic, Hebrew, Japanese, ..) would be really great, though.

 UniString uStr = new UniString("sdfüñẫ");
 UniString uStr2 = uStr[3..$]; // "üñẫ"
 UniGraph ug = uStr[5]; // 'ẫ'
 size_t i = uStr2.length; // 3

of course I forgot: string s = uStr2.toString(); dstring s2 = uStr2.toDString(); to convert it back to a "normal" string
 something like that maybe (of course plus a lot of other stuff like proper
 comparison for different encodings of the same char like a modified icmp()
 discussed before).
 But something like
 size_t len = uniLen("sdfüñẫ"); // 6
 string s = uniSlice(str, 3, str.length); // == str.uniSlice(3, str.length);
 etc may be just as good.

 (I hope this all made sense)

 Andrei

Cheers, - Daniel
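
For readers following along, the free-function variant suggested above (uniLen / uniSlice) can be sketched roughly as follows. This is only a sketch under assumptions: it relies on std.uni.byGrapheme and std.uni.graphemeStride from present-day Phobos, which did not exist when this thread was written, and graphemeOffset is a hypothetical helper introduced here for illustration. It assumes valid UTF-8 input and does no normalization.

```d
import std.range : walkLength;
import std.uni : byGrapheme, graphemeStride;

/// Number of graphemes (user-perceived characters) in s.
size_t uniLen(string s)
{
    return s.byGrapheme.walkLength;
}

/// Hypothetical helper: code-unit offset at which the n-th grapheme starts.
/// Clamps to s.length when s holds fewer than n graphemes.
size_t graphemeOffset(string s, size_t n)
{
    size_t offset = 0;
    for (size_t i = 0; i < n && offset < s.length; ++i)
        offset += graphemeStride(s, offset); // code units in this grapheme
    return offset;
}

/// Slice covering graphemes [begin, end); returns a slice of s, no copy.
string uniSlice(string s, size_t begin, size_t end)
{
    assert(begin <= end);
    return s[graphemeOffset(s, begin) .. graphemeOffset(s, end)];
}

unittest
{
    // 'a' followed by two combining marks forms one grapheme.
    string s = "sdf\u00FC\u00F1a\u0302\u0303";
    assert(uniLen(s) == 6);
    assert(uniSlice(s, 3, uniLen(s)) == "\u00FC\u00F1a\u0302\u0303");
}
```

Both functions walk the string from the start, so each call is linear in the length of the string; a dedicated string type could cache offsets instead.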

Jan 16 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday 16 January 2011 18:45:26 Andrei Alexandrescu wrote:
 On 1/16/11 6:42 PM, Daniel Gibson wrote:
 Am 17.01.2011 00:58, schrieb Andrei Alexandrescu:
 On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 But most strings don't contain combining characters or unnormalized
 strings.

I think we should expect combining marks to be used more and more as our OS text system and fonts start supporting them better. Them being rare might be true today, but what do you know about tomorrow?

I don't think languages will acquire more diacritics soon. I do hope, of course, that D applications gain more usage in the Arabic, Hebrew etc. world.

So why does D use unicode anyway? If you don't care about not-often used languages anyway, you could have used UCS-2 like java. Or plain 8bit ISO-8859-* (the user can decide which encoding he wants/needs). You could as well say "we don't need to use dchar to represent a proper code point, wchar is enough for most use cases and has fewer overhead anyway".

I consider UTF8 superior to all of the above.
 I think it's reasonable to understand why I'm happy with the current
 state of affairs. It is better than anything we've had before and
 better than everything else I've tried.

It is indeed easy to understand why you're happy with the current state of affairs: you never had to deal with multi-code-point character and can't imagine yourself having to deal with them on a semi-frequent basis.

Do you, and can you?
 Other people won't be so happy with this state of affairs, but
 they'll probably notice only after most of their code has been written
 unaware of the problem.

They can't be unaware and write said code.

Fun fact: Germany recently introduced a new ID card and some of the software that was developed for this and is used in some record sections fucks up when a name contains diacritics. I think especially when you're handling names (and much software does, I think) it's crucial to have proper support for all kinds of chars. Of course many programmers are not aware that, if Umlaute and ß works it doesn't mean that all other kinds of strange characters work as well.

 Cheers,
 - Daniel

I think German text works well with dchar.

I think that whether dchar will be enough will depend primarily on where the unicode is coming from and what the programmer is doing with it. There's plenty which will just work regardless of whether code points are pre-combined or not, and there's other stuff which will have subtle bugs if they're not pre-combined.

For the most part, Western languages should have pre-combined characters, but whether a program sees them in combined form or not will depend on where the text comes from. If it comes from a file, then it all depends on the program which wrote the file. If it comes from the console, then it depends on what that console does. If it comes from a socket or pipe or whatnot, then it depends on whatever program is sending the data.

So, the question becomes what the norm is? Are unicode characters normally pre-combined or left as separate code points? The majority of English text will be fine regardless, since English only uses accented characters and the like when including foreign words, but most any other European language will have accented characters and then it's an open question. If it's more likely that a D program will receive pre-combined characters than not, then many programs will likely be safe treating a code point as a character. But if the odds are high that a D program will receive characters which are not yet combined, then certain sets of text will invariably result in bugs in your average D program.

I don't think that there's much question that from a performance standpoint and from the standpoint of trying to avoid breaking TDPL and a lot of pre-existing code, we should continue to treat a code point - a dchar - as an abstract character. Moving to graphemes could really harm performance - and there _are_ plenty of programs that couldn't care less about unicode.

However, it's quite clear that in a number of circumstances, that's going to result in buggy code. The question then is whether it's okay to take a performance hit just to correctly handle unicode. And I expect that a _lot_ of people are going to say no to that.

D already does better at handling unicode than many other languages, so it's definitely a step up as it is. The cost for handling unicode completely correctly is quite high from the sounds of it - all of a sudden you're effectively (if not literally) dealing with arrays of arrays instead of arrays. So, I think that it's a viable option to say that the default path that D will take is the _mostly_ correct but still reasonably efficient path, and then - through 3rd party libraries or possibly even with a module in Phobos - we'll provide a means to handle unicode 100% correctly for those who really care.

At minimum, we need the tools to handle unicode correctly, but if we can't handle it both correctly and efficiently, then I'm afraid that it's just not going to be reasonable to handle it correctly - especially if we can handle it _almost_ correctly and still be efficient.

Regardless, the real question is how likely a D program is to deal with unicode which is not pre-combined. If the odds are relatively low in the general case, then sticking to dchar should be fine. But if the odds are relatively high, then not going to graphemes could mean that there will be a _lot_ of buggy D programs out there.

- Jonathan M Davis
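
To make the pre-combined versus not-yet-combined distinction concrete, here is a small sketch. It assumes std.uni.normalize from present-day Phobos (not available when this thread was written); sameText is a hypothetical helper name used only for illustration.

```d
import std.uni : NFC, normalize;
import std.utf : count;

/// Hypothetical helper: compare two strings after bringing both to NFC,
/// so that pre-combined and decomposed spellings compare equal.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    string precomposed = "\u00E9";  // é as a single code point
    string decomposed  = "e\u0301"; // e followed by COMBINING ACUTE ACCENT

    // Treating a code point as a character gives different answers here.
    assert(count(precomposed) == 1);
    assert(count(decomposed) == 2);
    assert(precomposed != decomposed);

    // After normalization the two spellings are the same text.
    assert(sameText(precomposed, decomposed));
}
```

Which of the two spellings a program actually receives depends on whatever produced the text, which is the open question raised above.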
Jan 16 2011
prev sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-16 18:58:54 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 
 On 1/15/11 10:45 PM, Michel Fortin wrote:
 No doubt it's easier to implement it that way. The problem is that in
 most cases it won't be used. How many people really know what is a
 grapheme?

How many people really should care?

I think the only people who should *not* care are those who have validated that the input does not contain any combining code point. If you know the input *can't* contain combining code points, then it's safe to ignore them.

I agree. Now let me ask again: how many people really should care?

As I said: all those people who are not validating the inputs to make sure they don't contain combining code points. As far as I know, no one is doing that, so that means everybody should use algorithms capable of handling multi-code-point graphemes. If someone indeed is doing this validation, he'll probably also be smart enough to make his algorithms to work with dchars.

That said, no one should really have to care but those who implement the string manipulation functions. The idea behind making the grapheme the element type is to make it easier to write grapheme-aware string manipulation functions, even if you don't know about graphemes. But the reality is probably more mixed than that.

- - -

I gave some thought about all this, and came to an interesting realization that made me refine the proposal. The new proposal is disruptive perhaps as much as the first, but in a different way.

But first, let's state a few facts to reframe the current discussion:

Fact 1: most people don't know Unicode very well
Fact 2: most people are confused by code units, code points, graphemes, and what is a 'character'
Fact 3: most people won't bother with all this, they'll just use the basic language facilities and assume everything works correctly if it works correctly for them

Now, let's define two goals:

Goal 1: make most people's string operations work correctly
Goal 2: make most people's string operations work fast

To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not sure we agree on this, but let's continue.

From the above 3 facts, we can deduce that a user won't want to bother with using byDchar, byGrapheme, or byWhatever when using algorithms. You were annoyed by having to write byDchar everywhere, so changed the element type to always be dchar and you don't have to write byDchar anymore. That's understandable and perfectly reasonable.

The problem is of course that it doesn't give you correct results. Most of the time what you really want is to use graphemes, dchar just happens to be a good approximation of that that works most of the time.

Iterating by grapheme is somewhat problematic, and it degrades performance. Same for comparing graphemes for normalized equivalence. That's all true. I'm not too sure what we can do about that. It can be optimized, but it's very understandable that some people won't be satisfied by the performance and will want to avoid graphemes.

Speaking of optimization, I do understand that iterating by grapheme using the range interface won't give you the best performance. It's certainly convenient as it enables the reuse of existing algorithms with graphemes, but more specialized algorithms and interfaces might be more suited.

One observation I made with having dchar as the default element type is that not all algorithms really need to deal with dchar. If I'm searching for code point 'a' in a UTF-8 string, decoding code units into code points is a waste of time. Why? Because the only way to represent code point 'a' is by having code unit 'a'. And guess what? Almost the same optimization can apply to graphemes: if you're searching for 'a' in a grapheme-aware manner in a UTF-8 string, all you have to do is search for the UTF-8 code unit 'a', then check if the 'a' code unit is followed by a combining mark code point to confirm it is really a 'a', not a composed grapheme. Iterating the string by code unit is enough for these cases, and it'd increase performance by a lot.

So making dchar the default type is no doubt convenient because it abstracts things enough so that generic algorithms can work with strings, but it has a performance penalty that you don't always need. I made an example using UTF-8, it applies even more to UTF-16. And it applies to grapheme-aware manipulations too.

This penalty with generic algorithms comes from the fact that they take a predicate of the form "a == 'a'" or "a == b", which is ill-suited for strings because you always need to fully decode the string (by dchar or by graphemes) for the purpose of calling the predicate. Given that comparing characters for something else than equality or them being part of a set is very rarely something you do, generic algorithms miss a big optimization opportunity here.

- - -

So here's what I think we should do:

Todo 1: disallow generic algorithms on naked strings: string-specific Unicode-aware algorithms should be used instead; they can share the same name if their usage is similar

Todo 2: to use a generic algorithm with a string, you must dress the string using one of toDchar, toGrapheme, toCodeUnits; this way your intentions are clear

Todo 3: string-specific algorithms can be implemented as simple wrappers for generic algorithms with the string dressed correctly for the task, or they can implement more sophisticated algorithms to increase performance

There are two major benefits to this approach:

Benefit 1: if indeed you really don't want the performance penalty that comes with checking for composed graphemes, you can bypass it at some specific places in your code using byDchar, or you can disable it altogether by modifying the string-specific algorithms and recompiling Phobos.

Benefit 2: we don't have to rush to implementing graphemes in the Unicode-aware algorithms. Just make sure the interface for string-specific algorithms *can* accept graphemes, and we can roll out support for them at a later time once we have a decent implementation.

Also, all this is leaving the question open as to what to do when someone uses the string as a range. In my opinion, it should either iterate on code units (because the string is actually an array, and because that's what foreach does) or simply disallow iteration (asking that you dress the string first using toCodeUnit, toDchar, or toGrapheme).

Do you like that more?

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
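
The code-unit search described above (find the 'a' code unit, then look only at what follows it) might be sketched like this. It is a sketch under assumptions rather than a proposed Phobos function: findBareAscii is a hypothetical name, the needle is restricted to ASCII, the input is assumed to be valid UTF-8, and std.uni.isMark is used as an approximation of "combining mark" (full grapheme segmentation has more rules than this).

```d
import std.uni : isMark;
import std.utf : decode;

/// Hypothetical helper: code-unit index of the first `needle` that is not
/// followed by a combining mark, or -1 if there is none.
ptrdiff_t findBareAscii(string haystack, char needle)
{
    assert(needle < 0x80, "this sketch only handles ASCII needles");
    foreach (i; 0 .. haystack.length)
    {
        // Plain code-unit comparison: in UTF-8 an ASCII byte can never be
        // part of a multi-byte sequence, so no decoding is needed here.
        if (haystack[i] != needle)
            continue;

        size_t next = i + 1;
        if (next == haystack.length)
            return cast(ptrdiff_t) i; // last code unit, nothing follows

        // Decode just the one code point after the match.
        dchar following = decode(haystack, next);
        if (!isMark(following))
            return cast(ptrdiff_t) i;
    }
    return -1;
}

unittest
{
    // The first 'a' carries a combining grave accent, so it is skipped.
    assert(findBareAscii("a\u0300 a", 'a') == 4);
    assert(findBareAscii("banana", 'a') == 1);
    assert(findBareAscii("xyz", 'a') == -1);
}
```

Only the code point right after each candidate match is decoded; everything else is compared byte by byte.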
Jan 17 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/17/11 10:34 AM, Michel Fortin wrote:
 On 2011-01-16 18:58:54 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 1/16/11 3:20 PM, Michel Fortin wrote:
 On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 1/15/11 10:45 PM, Michel Fortin wrote:
 No doubt it's easier to implement it that way. The problem is that in
 most cases it won't be used. How many people really know what is a
 grapheme?

How many people really should care?

I think the only people who should *not* care are those who have validated that the input does not contain any combining code point. If you know the input *can't* contain combining code points, then it's safe to ignore them.

I agree. Now let me ask again: how many people really should care?

As I said: all those people who are not validating the inputs to make sure they don't contain combining code points.

The question (which I see you keep on dodging :o)) is how much text contains combining code points. I have worked in NLP for years, and still do. I even worked on Arabic text (albeit Romanized). I work with Wikipedia. I use Unicode all the time, but I have yet to have trouble with a combining character. I was just vaguely aware of their existence up until this discussion, but just waved it away and guess what - it worked for me. It does not serve us well to rigidly claim that the only good way of doing anything Unicode is to care about graphemes. Even NSString exposes the UTF16 underlying encoding and provides dedicated functions for grapheme-based processing. For one thing, if you care about the width of a word in printed text (one of the case where graphemes are important), you need font information. And - surprise! - some fonts do NOT support combining characters and print signs next to one another instead of juxtaposing them, so the "wrong" method of counting characters is more informative.
 As far as I know, no one
 is doing that, so that means everybody should use algorithms capable of
 handling multi-code-point graphemes. If someone indeed is doing this
 validation, he'll probably also be smart enough to make his algorithms
 to work with dchars.

I am not sure everybody should use graphemes.
 That said, no one should really have to care but those who implement the
 string manipulation functions. The idea behind making the grapheme the
 element type is to make it easier to write grapheme-aware string
 manipulation functions, even if you don't know about graphemes. But the
 reality is probably more mixed than that.

The reality is indeed more mixed. Inevitably at some point the API needs to answer the question: "what is the first character of this string?" Transparency is not possible. You break all string code out there.
 - - -

 I gave some thought about all this, and came to an interesting
 realizations that made me refine the proposal. The new proposal is
 disruptive perhaps as much as the first, but in a different way.

 But first, let's state a few facts to reframe the current discussion:

 Fact 1: most people don't know Unicode very well
 Fact 2: most people are confused by code units, code points, graphemes,
 and what is a 'character'
 Fact 3: most people won't bother with all this, they'll just use the
 basic language facilities and assume everything work correctly if it it
 works correctly for them

Nice :o).
 Now, let's define two goals:

 Goal 1: make most people's string operations work correctly
 Goal 2: make most people's string operations work fast

Goal 3: don't break all existing code Goal 4: make most people's string-based code easy to write and understand
 To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not
 sure we agree on this, but let's continue.

I think we disagree about what "most" means. For you it means "people who don't understand Unicode well but deal with combining characters anyway". For me it's "the largest percentage of D users across various writing systems".
  From the above 3 facts, we can deduce that a user won't want to bother
 to using byDchar, byGrapheme, or byWhatever when using algorithms. You
 were annoyed by having to write byDchar everywhere, so changed the
 element type to always be dchar and you don't have to write byDchar
 anymore. That's understandable and perfectly reasonable.

 The problem is of course that it doesn't give you correct results. Most
 of the time what you really want is to use graphemes, dchar just happen
 to be a good approximation of that that works most of the time.

Again, it's a matter of tradeoffs. I chose dchar because char was plain _wrong_ most of the time, not because char was a pretty darn good approximation that worked for most people most of the time. The fact remains that dchar _is_ a pretty darn good approximation that also has pretty good darn speed. So I'd say that I _still_ want to use dchar most of the time. Committing to graphemes would complicate APIs for _everyone_ and would make things slower for _everyone_ for the sake of combining characters that _never_ occur in _most_ people's text. This is bad design, pure and simple. A good design is to cater for the majority and provide dedicated APIs for the few.
 Iterating by grapheme is somewhat problematic, and it degrades
 performance.

Yes.
 Same for comparing graphemes for normalized equivalence.

Yes, although I think you can optimize code such that comparing two strings wholesale only has a few more comparisons on the critical path. That would be still slower, but not as slow as iterating by grapheme in a naive implementation.
 That's all true. I'm not too sure what we can do about that. It can be
 optimized, but it's very understandable that some people won't be
 satisfied by the performance and will want to avoid graphemes.

I agree.
 Speaking of optimization, I do understand that iterating by grapheme
 using the range interface won't give you the best performance. It's
 certainly convenient as it enables the reuse of existing algorithms with
 graphemes, but more specialized algorithms and interfaces might be more
 suited.

Even the specialized algorithms will be significantly slower.
 One observation I made with having dchar as the default element type is
 that not all algorithms really need to deal with dchar. If I'm searching
 for code point 'a' in a UTF-8 string, decoding code units into code
 points is a waste of time. Why? because the only way to represent code
 point 'a' is by having code point 'a'.

Right. That's why many algorithms in std are specialized for such cases.
 And guess what? The almost same
 optimization can apply to graphemes: if you're searching for 'a' in a
 grapheme-aware manner in a UTF-8 string, all you have to do is search
 for the UTF-8 code unit 'a', then check if the 'a' code unit is followed
 by a combining mark code point to confirm it is really a 'a', not a
 composed grapheme. Iterating the string by code unit is enough for these
 cases, and it'd increase performance by a lot.

Unfortunately it all breaks as soon as you go beyond one code point. You can't search efficiently, you can't compare efficiently. Boyer-Moore and friends are out. I'm not saying that we shouldn't implement the correct operations! I'm just not convinced they should be the default.
 So making dchar the default type is no doubt convenient because it
 abstracts things enough so that generic algorithms can work with
 strings, but it has a performance penalty that you don't always need. I
 made an example using UTF-8, it applies even more to UTF-16. And it
 applies to grapheme-aware manipulations too.

It is true that UTF manipulation incurs overhead. The tradeoff has many dimensions: UTF-16 is bulkier and less cache friendly, ASCII is not sufficient for most people, the UTF decoding overhead is not that high... it's difficult to find the sweetest spot.
 This penalty with generic algorithms comes from the fact that they take
 a predicate of the form "a == 'a'" or "a == b", which is ill-suited for
 strings because you always need to fully decode the string (by dchar or
 by graphemes) for the purpose of calling the predicate. Given that
 comparing characters for something else than equality or them being part
 of a set is very rarely something you do, generic algorithms miss a big
 optimization opportunity here.

How can we improve that? You can't argue for an inefficient scheme just because what we have isn't as efficient as it could possibly be.
 - - -

 So here's what I think we should do:

 Todo 1: disallow generic algorithms on naked strings: string-specific
 Unicode-aware algorithms should be used instead; they can share the same
 name if their usage is similar

I don't understand this. We already do this, and by "Unicode-aware" we understand using dchar throughout. This is transparent to client code.
 Todo 2: to use a generic algorithm with a strings, you must dress the
 string using one of toDchar, toGrapheme, toCodeUnits; this way your
 intentions are clear

Breaks a lot of existing code. Won't fly with Walter unless it solves world hunger. Nevertheless I have to say I like it; the #1 thing I'd change about the built-in strings is that they implicitly are two things at the same time. Asking for representation should be explicit.
 Todo 3: string-specific algorithms can implemented as simple wrappers
 for generic algorithms with the string dressed correctly for the task,
 or they can implement more sophisticated algorithms to increase performance

One thing I like about the current scheme is that all bidirectional-range algorithms work out of the box with all strings, and lend themselves to optimization whenever you want to. This will have trouble passing Walter's wanking test. Mine too; every time I need to write a bunch of forwarding functions, that's a signal something went wrong somewhere. Remember MFC? :o)
 There's two major benefits to this approach:

 Benefit 1: if indeed you really don't want the performance penalty that
 comes with checking for composed graphemes, you can bypass it at some
 specific places in your code using byDchar, or you can disable it
 altogether by modifying the string-specific algorithms and recompiling
 Phobos.

 Benefit 2: we don't have to rush to implementing graphemes in the
 Unicode-aware algorithms. Just make sure the interface for
 string-specific algorithms *can* accept graphemes, and we can roll out
 support for them at a later time once we have a decent implementation.

I'm not seeing the drawbacks. Hurts everyone for the sake of a few, breaks existent code, makes all string processing a mess, would-be users will throw their hands in the air seeing the simplest examples, but we'll have the satisfaction of high-five-ing one another telling ourselves that we did the right thing.
 Also, all this is leaving the question open as to what to do when
 someone uses the string as a range. In my opinion, it should either
 iterate on code units (because the string is actually an array, and
 because that's what foreach does) or simply disallow iteration (asking
 that you dress the string first using toCodeUnit, toDchar, or toGrapheme).

 Do you like that more?

This is not about liking. I like doing the right thing as much as you do, and I think Phobos shows that. Clearly doing the right thing through and through is handling combining characters appropriately. The problem is keeping all desiderata in careful balance. Andrei
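
As an illustration of what asking for the representation explicitly could look like, here is a sketch that views the same string at the three levels discussed in this thread. It uses std.utf.byCodeUnit, std.utf.byDchar and std.uni.byGrapheme from present-day Phobos, which postdate this discussion; the toDchar / toGrapheme / toCodeUnits names above are the proposal's own and are not used here. The count* function names are hypothetical.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

/// Length when the string is viewed as raw UTF-8 code units.
size_t countCodeUnits(string s) { return s.byCodeUnit.walkLength; }

/// Length when the string is viewed as code points (dchar).
size_t countCodePoints(string s) { return s.byDchar.walkLength; }

/// Length when the string is viewed as graphemes.
size_t countGraphemes(string s) { return s.byGrapheme.walkLength; }

unittest
{
    // "noël" spelled with a combining diaeresis after the 'e'.
    string s = "noe\u0308l";
    assert(countCodeUnits(s) == 6);
    assert(countCodePoints(s) == 5);
    assert(countGraphemes(s) == 4);
}
```

Each wrapper yields an ordinary range, so a generic algorithm sees whichever level the caller picked.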
Jan 17 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-17 12:33:04 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/17/11 10:34 AM, Michel Fortin wrote:
 As I said: all those people who are not validating the inputs to make
 sure they don't contain combining code points.

The question (which I see you keep on dodging :o)) is how much text contains combining code points.

Not much, right now. The problem is that the answer to this question is likely to change as Unicode support improves in operating systems and applications. Shouldn't we future-proof Phobos?
 It does not serve us well to rigidly claim that the only good way of 
 doing anything Unicode is to care about graphemes.

For the time being we can probably afford it.
 Even NSString exposes the UTF16 underlying encoding and provides 
 dedicated functions for grapheme-based processing. For one thing, if 
 you care about the width of a word in printed text (one of the cases 
 where graphemes are important), you need font information. And - 
 surprise! - some fonts do NOT support combining characters and print 
 signs next to one another instead of juxtaposing them, so the "wrong" 
 method of counting characters is more informative.

Generally what OS X does in those cases is display that character in another font. That said, counting graphemes is never a good way to tell how much space some text will take (unless the application enforces a fixed width per grapheme). It's more useful for telling the number of characters in a text document, similar to a word count.
 That said, no one should really have to care but those who implement the
 string manipulation functions. The idea behind making the grapheme the
 element type is to make it easier to write grapheme-aware string
 manipulation functions, even if you don't know about graphemes. But the
 reality is probably more mixed than that.

The reality is indeed more mixed. Inevitably at some point the API needs to answer the question: "what is the first character of this string?" Transparency is not possible. You break all string code out there.

I'm not sure what you mean by that.
 - - -
 
 I gave some thought to all this, and came to an interesting
 realization that made me refine the proposal. The new proposal is
 disruptive perhaps as much as the first, but in a different way.
 
 But first, let's state a few facts to reframe the current discussion:
 
 Fact 1: most people don't know Unicode very well
 Fact 2: most people are confused by code units, code points, graphemes,
 and what is a 'character'
 Fact 3: most people won't bother with all this, they'll just use the
 basic language facilities and assume everything works correctly if it
 works correctly for them

Nice :o).
 Now, let's define two goals:
 
 Goal 1: make most people's string operations work correctly
 Goal 2: make most people's string operations work fast

Goal 3: don't break all existing code
Goal 4: make most people's string-based code easy to write and understand

Those are worthy goals too.
 To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not
 sure we agree on this, but let's continue.

I think we disagree about what "most" means. For you it means "people who don't understand Unicode well but deal with combining characters anyway". For me it's "the largest percentage of D users across various writing systems".

It's not just D users, it's also for the users of programs written by D users. I can't count how many times I've seen accented characters mishandled on websites and elsewhere, and I probably have an aversion to doing the same thing to people of other cultures and languages. If the operating system supports combining marks, users expect that applications running on it will deal with them correctly too, and they'll (rightfully) blame your application if it doesn't work. Same for websites. I understand that in some situations you don't want to deal with graphemes even if you theoretically should, but I don't think it should be the default.
 One observation I made with having dchar as the default element type is
 that not all algorithms really need to deal with dchar. If I'm searching
 for code point 'a' in a UTF-8 string, decoding code units into code
 points is a waste of time. Why? Because the only way to represent code
 point 'a' is by having code unit 'a'.

Right. That's why many algorithms in std are specialized for such cases.
 And guess what? Almost the same
 optimization can apply to graphemes: if you're searching for 'a' in a
 grapheme-aware manner in a UTF-8 string, all you have to do is search
 for the UTF-8 code unit 'a', then check if the 'a' code unit is followed
 by a combining mark code point to confirm it is really an 'a', not a
 composed grapheme. Iterating the string by code unit is enough for these
 cases, and it'd increase performance by a lot.

Unfortunately it all breaks as soon as you go beyond one code point. You can't search efficiently, you can't compare efficiently. Boyer-Moore and friends are out.

Ok. Say you were searching for the needle "étoilé" in a UTF-8 haystack. I see two ways to extend the optimization described above:

1. search for the easy part "toil", then check its surrounding graphemes to confirm it's really "étoilé"
2. search for a code point matching 'é' or 'e', then confirm that the code points following it form the right graphemes.

Implementing the second one can be done by converting the needle to a regular expression operating at code-unit level. With that you can search efficiently for the needle directly in code units without having to decode and/or normalize the whole haystack.
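
To make the single-character case quoted above concrete, here is a minimal sketch in D of a code-unit search for an ASCII needle. The helper name `indexOfAsciiGrapheme` is hypothetical, and `std.uni.isMark` is only a rough stand-in for "combining mark": real grapheme segmentation has more rules than this, so treat it as an approximation under those assumptions rather than a complete implementation.

```d
import std.uni : isMark;
import std.utf : decode;

/// Hypothetical helper: index of the first ASCII `needle` that stands alone
/// as a grapheme (not followed by a combining mark), or -1 if there is none.
ptrdiff_t indexOfAsciiGrapheme(string haystack, char needle)
{
    assert(needle < 0x80, "this sketch only handles ASCII needles");
    foreach (i, char unit; haystack)
    {
        if (unit != needle)
            continue;                    // raw code-unit comparison, no decoding
        size_t next = i + 1;
        if (next == haystack.length)
            return cast(ptrdiff_t) i;    // nothing follows the candidate
        // Decode only the single code point that follows the candidate.
        immutable dchar following = decode(haystack, next);
        if (!isMark(following))
            return cast(ptrdiff_t) i;    // candidate is not combined with a mark
    }
    return -1;
}

void main()
{
    // The first 'e' carries U+0301 (combining acute accent), the second does not.
    assert(indexOfAsciiGrapheme("cafe\u0301 e", 'e') == 7);
    assert(indexOfAsciiGrapheme("e\u0301", 'e') == -1);
    assert(indexOfAsciiGrapheme("abc", 'b') == 1);
}
```

Only the code point right after a matching unit is ever decoded; the rest of the haystack is scanned as plain bytes.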
 This penalty with generic algorithms comes from the fact that they take
 a predicate of the form "a == 'a'" or "a == b", which is ill-suited for
 strings because you always need to fully decode the string (by dchar or
 by graphemes) for the purpose of calling the predicate. Given that
 comparing characters for something else than equality or them being part
 of a set is very rarely something you do, generic algorithms miss a big
 optimization opportunity here.

How can we improve that? You can't argue for an inefficient scheme just because what we have isn't as efficient as it could possibly be.

You ask what's inefficient about generic algorithms having customizable predicates? You can't implement the above optimization if you can't guarantee the predicate is "==". That said, perhaps we can detect "==" and only apply the optimization then. Being able to specify the predicate doesn't gain you much for strings, because a < 'a' doesn't make much sense. All you need to check for is equality with some value or membership in a given character set, both of which can use the optimization above.
 So here's what I think we should do:
 
 Todo 1: disallow generic algorithms on naked strings: string-specific
 Unicode-aware algorithms should be used instead; they can share the same
 name if their usage is similar

I don't understand this. We already do this, and by "Unicode-aware" we understand using dchar throughout. This is transparent to client code.

That's probably because you haven't understood the intent (I might not have made it very clear either). The problem I see currently is that you rely on dchar being the element type. That should be an implementation detail, not something client code can see or rely on. By making it an implementation detail, you can later make grapheme-aware algorithms the default without changing the API. Since you're the gatekeeper to Phobos, you can make this change conditional on getting an acceptable level of performance out of the grapheme-aware algorithms, or on other factors like the amount of combining characters you encounter in the wild in the next few years.

So the general string functions would implement your compromise (using dchar) but not commit indefinitely to it. Someone who really wants to work in code points can use toDchar, someone who wants to deal with graphemes uses toGraphemes, someone who doesn't care won't choose anything and gets the default behaviour of compromise. All you need to do for this is document it, and try to make sure the string APIs don't force the implementation to work with code points.
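
As an illustration of the three levels a string could be "dressed" at, here is a small sketch. The names toCodeUnit, toDchar and toGrapheme used in this thread are proposals; the sketch instead assumes the `byCodeUnit`, `byDchar` and `byGrapheme` functions found in present-day Phobos (`std.utf` and `std.uni`) as rough analogues, which is an assumption about the reader's toolchain and not part of the proposal.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // 'e' followed by U+0301 (combining acute accent), then '!'
    string s = "e\u0301!";

    assert(s.byCodeUnit.walkLength == 4);  // 1 + 2 + 1 UTF-8 code units
    assert(s.byDchar.walkLength == 3);     // code points: e, U+0301, !
    assert(s.byGrapheme.walkLength == 2);  // graphemes: accented e, then !
}
```

The same text yields three different lengths, which is why making the chosen level explicit documents the caller's intent.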
 Todo 2: to use a generic algorithm with a string, you must dress the
 string using one of toDchar, toGrapheme, toCodeUnits; this way your
 intentions are clear

Breaks a lot of existing code. Won't fly with Walter unless it solves world hunger. Nevertheless I have to say I like it; the #1 thing I'd change about the built-in strings is that they implicitly are two things at the same time. Asking for representation should be explicit.

No, it doesn't break anything. This is just the continuation of what I tried to explain above: if you want to be sure you're working with graphemes or dchar, say it. Also, it said nothing about iteration or foreach, so I'm not sure why it wouldn't fly with Walter. It can stay as it is, except for one thing: you and Walter should really get on the same wavelength regarding ElementType!(char[]) and foreach(c; string). I don't care that much which is the default, but they absolutely need to be the same.
 Todo 3: string-specific algorithms can be implemented as simple wrappers
 for generic algorithms with the string dressed correctly for the task,
 or they can implement more sophisticated algorithms to increase performance

One thing I like about the current scheme is that all bidirectional-range algorithms work out of the box with all strings, and lend themselves to optimization whenever you want to.

I like this as the default behaviour too. I think however that you should restrict the algorithms that work out of the box to those which can also work with graphemes. This way you can change the behaviour in the future and support graphemes by a simple upgrade of Phobos. Algorithms that don't work with graphemes would still work with toDchar. So what doesn't work with graphemes? Predicates such as "a < b" for instance. That's pretty much it.
 This will have trouble passing Walter's wanking test. Mine too; every 
 time I need to write a bunch of forwarding functions, that's a signal 
 something went wrong somewhere. Remember MFC? :o)

The idea is that we write the API as it would apply to graphemes, but we implement it using dchar for the time being. Some function signatures might have to differ a bit.
 Do you like that more?

This is not about liking. I like doing the right thing as much as you do, and I think Phobos shows that. Clearly doing the right thing through and through is handling combining characters appropriately. The problem is keeping all desiderata in careful balance.

Well then, don't you find it balanced enough? I'm not asking that everything be done with graphemes. I'm not even asking that anything be done with graphemes by default. I'm only asking that we keep the API clean enough so we can switch to graphemes by default in the future without having to rewrite all the code everywhere to use byGrapheme. If this isn't the right balance. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 17 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/17/11 2:29 PM, Michel Fortin wrote:
 The problem I see currently is that you rely on dchar being the element
 type. That should be an implementation detail, not something client code
 can see or rely on.

But at some point you must be able to talk about individual characters in a text. It can't be something that client code doesn't see!!!

SuperDuperText txt;
auto c = giveMeTheFirstCharacter(txt);

What is the type of c? That is visible to the client!

Andrei
Jan 17 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-17 15:49:26 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/17/11 2:29 PM, Michel Fortin wrote:
 The problem I see currently is that you rely on dchar being the element
 type. That should be an implementation detail, not something client code
 can see or rely on.

But at some point you must be able to talk about individual characters in a text. It can't be something that client code doesn't see!!!

It seems that it can. NSString only exposes individual UTF-16 code units directly (or semi-directly via an accessor method), even though searching and comparing is grapheme-aware. I'm not saying it's a good design, but it certainly can work in practice.

In any case, I didn't mean to say the client code shouldn't be aware of the characters in a string. I meant that the client shouldn't assume the algorithm works at the same layer as ElementType!(string) for a given string type. Even if ElementType!(string) is dchar, the default function you get if you don't use any of toCodeUnit, toDchar, or toGrapheme can work at the dchar or grapheme level if it makes more sense that way.

In other words, the client says: "I have two strings, compare them!" The client didn't specify if they should be compared by char, wchar, dchar, or by normalized grapheme; so we do what's sensible. That's what I call the 'default' string functions, those you get when you don't ask for anything specific. They should have a signature making them able to work at the grapheme level, even though they might not for practical reasons (performance). This way if it becomes more important or practical to support graphemes, it's easy to evolve to them.
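
To show why "compare them!" is ambiguous, here is a minimal sketch contrasting a plain array comparison with a comparison after canonical normalization. It assumes `std.uni.normalize` from present-day Phobos; nothing here is meant to say which behaviour the 'default' string functions should pick.

```d
import std.uni;

void main()
{
    string precomposed = "\u00E9";  // e with acute accent as one code point
    string decomposed = "e\u0301";  // 'e' followed by U+0301 combining acute accent

    // Comparing the arrays directly compares code units: the strings differ.
    assert(precomposed != decomposed);

    // After normalizing both to NFC the two spellings compare equal.
    assert(normalize!NFC(precomposed) == normalize!NFC(decomposed));
}
```

A caller who only writes `a == b` gets the first answer today; a grapheme-aware default would have to give the second.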
 SuperDuperText txt;
 auto c = giveMeTheFirstCharacter(txt);
 
 What is the type of c? That is visible to the client!

That depends on how you implement the giveMeTheFirstCharacter function. :-)

More seriously, you have four choices:

1. code unit
2. code point
3. grapheme
4. require the client to state explicitly which kind of 'character' he wants; 'character' being an overloaded word, it's reasonable to ask for disambiguation.

You and Walter can't come to understand each other between 1 and 2, regarding foreach and ranges. To keep things consistent with what I said above I'd tend to say 4, but that's weird for something that looks like an array. My second choice goes for 1 when it comes to consistency, and 3 when it comes to correctness, and 2 when it comes to being practical.

Given something is going to be inconsistent either way, I'd say any of the above is acceptable. But please make sure you and Walter agree on the default element type for ranges and foreach.

-- Michel Fortin michel.fortin michelf.com http://michelf.com/
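
As a rough illustration of choices 1 to 3, the sketch below asks for the "first character" of the same text at each level. It assumes present-day Phobos (`std.utf.byCodeUnit`, the auto-decoding `front` for strings, and `std.uni.byGrapheme`); `giveMeTheFirstCharacter` itself stays hypothetical.

```d
import std.range.primitives : front;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    string txt = "e\u0301tat";  // 'e' + U+0301 combining acute accent, then "tat"

    auto unit = txt.byCodeUnit.front;      // choice 1: code unit
    auto point = txt.front;                // choice 2: code point (decoded)
    auto grapheme = txt.byGrapheme.front;  // choice 3: grapheme

    static assert(is(typeof(point) == dchar));
    assert(unit == 'e');
    assert(point == 'e');
    assert(grapheme.length == 2);          // the grapheme spans two code points
    assert(grapheme[0] == 'e' && grapheme[1] == '\u0301');
}
```

Choices 1 and 2 both report a bare 'e' here; only choice 3 keeps the accent attached, and its result is not a single built-in character type.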
Jan 17 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com> said:

 More seriously, you have four choices:
 
 1. code unit
 2. code point
 3. grapheme
 4. require the client to state explicitly which kind of 'character' he 
 wants; 'character' being an overloaded word, it's reasonable to ask for 
 disambiguation.

This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
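
A minimal sketch of such a dual-level range, assuming UTF-8 input. The struct name `Utf8Cursor` is hypothetical; `frontUnit` and `popFrontUnit` follow the names mentioned above, but this is not the XML parser's actual code.

```d
import std.utf : decode, stride;

/// Hypothetical range over a UTF-8 string that can be advanced either by
/// code point or by code unit, so both kinds of step can be interleaved.
struct Utf8Cursor
{
    string data;

    @property bool empty() { return data.length == 0; }

    // Code point level: decoding happens here.
    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }
    void popFront() { data = data[stride(data, 0) .. $]; }

    // Code unit level: no decoding at all. Only safe when the current unit
    // is known to be a complete element, for example an ASCII character.
    @property char frontUnit() { return data[0]; }
    void popFrontUnit() { data = data[1 .. $]; }
}

void main()
{
    auto r = Utf8Cursor("<\u00E9>");
    assert(r.frontUnit == '<');   // ASCII test, nothing is decoded
    r.popFrontUnit();
    assert(r.front == '\u00E9');  // decode one two-unit code point
    r.popFront();
    assert(r.frontUnit == '>');
    r.popFrontUnit();
    assert(r.empty);
}
```

Because both sets of primitives share the same underlying slice, a step taken at one level is immediately visible at the other, which is what a separate wrapper range would not give you.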
Jan 17 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/17/11 9:48 PM, Michel Fortin wrote:
 On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
 said:

 More seriously, you have four choices:

 1. code unit
 2. code point
 3. grapheme
 4. require the client to state explicitly which kind of 'character' he
 wants; 'character' being an overloaded word, it's reasonable to ask
 for disambiguation.

This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience.

Very insightful. Thanks for sharing. Code it up and make a solid proposal! Andrei
Jan 17 2011
next sibling parent reply Steven Wawryk <stevenw acres.com.au> writes:
On 18/01/11 16:46, Andrei Alexandrescu wrote:
Very insightful. Thanks for sharing. Code it up and make a solid proposal! Andrei

How does this differ from Steve Schveighoffer's string_t, subtract the indexing and slicing of code-points, plus a bidirectional grapheme range?
Jan 17 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/18/11 1:58 AM, Steven Wawryk wrote:
How does this differ from Steve Schveighoffer's string_t, subtract the indexing and slicing of code-points, plus a bidirectional grapheme range?

There's no string, only range... Andrei
Jan 18 2011
parent reply Steven Wawryk <stevenw acres.com.au> writes:
On 19/01/11 02:40, Andrei Alexandrescu wrote:
 On 1/18/11 1:58 AM, Steven Wawryk wrote:
How does this differ from Steve Schveighoffer's string_t, subtract the indexing and slicing of code-points, plus a bidirectional grapheme range?

There's no string, only range...

Which is exactly what I asked you about. I understand that you must be very busy, but how do I get you to look at the actual technical content of something? Is there something in the way I phrase things that makes you dismiss my introductory motivation without looking into the content? I don't mean this as a criticism. I really want to know because I'm considering a proposal on a different topic but wasn't sure it's worth it as there seems to be a barrier to getting things considered.
Jan 18 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/18/11 6:00 PM, Steven Wawryk wrote:
 On 19/01/11 02:40, Andrei Alexandrescu wrote:
 On 1/18/11 1:58 AM, Steven Wawryk wrote:
How does this differ from Steve Schveighoffer's string_t, subtract the indexing and slicing of code-points, plus a bidirectional grapheme range?

There's no string, only range...

Which is exactly what I asked you about. I understand that you must be very busy, but how do I get you to look at the actual technical content of something? Is there something in the way I phrase things that makes you dismiss my introductory motivation without looking into the content? I don't mean this as a criticism. I really want to know because I'm considering a proposal on a different topic but wasn't sure it's worth it as there seems to be a barrier to getting things considered.

One simple fact is that I'm not the only person who needs to look at a design. If you want to propose something for inclusion in Phobos, please put the code in good shape, document it properly, and make a submission in this newsgroup following the Boost model. I get one vote and everyone else gets a vote. Looking back at our exchanges in search for a perceived dismissive attitude on my part (apologies if it seems that way - it was unintentional), I infer your annoyance stems from my answer to this:
 How does this differ from Steve Schveighoffer's string_t,
 subtract the indexing and slicing of code-points, plus a
 bidirectional grapheme range?



I happen to have discussed at length my beef with Steve's proposal. Now in one sentence you change the proposed design on the fly without fleshing out the consequences, add to it again without substantiation, and presumably expect me to come with a salient analysis of the result. I don't think it's fair to characterize my answer to that as dismissive, nor to pressure me into expanding on it.

Finally, let me say again what I have already said a few times: in order to experiment with grapheme-based processing, we need a byGrapheme range. There is no need for a new string class. We need a range over the existing string types. That would allow us to play with graphemes, assess their efficiency and ubiquity, and would ultimately put us in a better position when it comes to deciding whether it makes sense to make grapheme a character type or the default character type.

Andrei
Jan 18 2011
parent reply Steven Wawryk <stevenw acres.com.au> writes:
On 19/01/11 11:37, Andrei Alexandrescu wrote:
 On 1/18/11 6:00 PM, Steven Wawryk wrote:
 Which is exactly what I asked you about. I understand that you must be
 very busy, but how do I get you to look at the actual technical content
 of something? Is there something in the way I phrase things that makes you
 dismiss my introductory motivation without looking into the content?

 I don't mean this as a criticism. I really want to know because I'm
 considering a proposal on a different topic but wasn't sure it's worth
 it as there seems to be a barrier to getting things considered.

One simple fact is that I'm not the only person who needs to look at a design. If you want to propose something for inclusion in Phobos, please put the code in good shape, document it properly, and make a submission in this newsgroup following the Boost model. I get one vote and everyone else gets a vote.

Ok, thanks for this suggestion. But if developing a proposal as concrete code is a lot of work that may be rejected, is there a way to sound out the idea first before deciding to commit to developing it?
 Looking back at our exchanges in search for a perceived dismissive
 attitude on my part (apologies if it seems that way - it was
 unintentional), I infer your annoyance stems from my answer to this:

 How does this differ from Steve Schveighoffer's string_t,
 subtract the indexing and slicing of code-points, plus a
 bidirectional grapheme range?




No, this was just a summary. Here is the post that you answered dismissively: news://news.digitalmars.com:119/ih030g$1ok1$1 digitalmars.com
 In the interest of moving this on, would it become acceptable to you if:

 1. indexing and slicing of the code-point range were removed?
 2. any additional ranges are exposed to the user according to decisions
 made about graphemes, etc?
 3. other constructive criticisms were accommodated?

 Steve


 On 15/01/11 03:33, Andrei Alexandrescu wrote:
 On 1/14/11 5:06 AM, Steven Schveighoffer wrote:
 I respectfully disagree. A stream built on fixed-sized units, but with
 variable length elements, where you can determine the start of an
 element in O(1) time given a random index absolutely provides
 random-access. It just doesn't provide length.

I equally respectfully disagree. I think random access is defined as accessing the ith element in O(1) time. That's not the case here. Andrei
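
The two positions quoted here can be illustrated with a short sketch, assuming well-formed UTF-8. The helper name `startOfCodePoint` is hypothetical. It shows the operation that is cheap (finding where the element covering an arbitrary unit index starts); it does not make "give me the i-th code point" any cheaper, which still needs a linear walk.

```d
/// Hypothetical helper: given any index into a well-formed UTF-8 string,
/// return the index where the code point covering that position starts.
size_t startOfCodePoint(string s, size_t i)
{
    // Continuation bytes have the bit pattern 10xxxxxx. A well-formed
    // sequence has at most three of them, so the loop is bounded.
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

void main()
{
    // 'a' (1 unit), U+00E9 (2 units), U+20AC (3 units), 'z' (1 unit)
    string s = "a\u00E9\u20ACz";
    assert(s.length == 7);
    assert(startOfCodePoint(s, 0) == 0);
    assert(startOfCodePoint(s, 2) == 1);  // second unit of U+00E9
    assert(startOfCodePoint(s, 5) == 3);  // last unit of U+20AC
    assert(startOfCodePoint(s, 6) == 6);  // 'z'
}
```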


 I happen to have discussed at length my beef with Steve's proposal. Now
 in one sentence you change the proposed design on the fly without
 fleshing out the consequences, add to it again without substantiation,
 and presumably expect me to come with a salient analysis of the result.
 I don't think it's fair to characterize my answer to that as dismissive,
 nor to pressure me into expanding on it.

Sorry, I could have given more context. But you didn't discuss what I asked, based on the observation that your detailed criticisms of Steve's proposal all related to a single aspect of it. Steve
Jan 18 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/18/11 7:48 PM, Steven Wawryk wrote:
Ok, thanks for this suggestion. But if developing a proposal as concrete code is a lot of work that may be rejected, is there a way to sound out the idea first before deciding to commit to developing it?

This is the best place as far as I know.
 Looking back at our exchanges in search for a perceived dismissive
 attitude on my part (apologies if it seems that way - it was
 unintentional), I infer your annoyance stems from my answer to this:

 How does this differ from Steve Schveighoffer's string_t,
 subtract the indexing and slicing of code-points, plus a
 bidirectional grapheme range?




No, this was just a summary. Here is the post that you answered dismissively: news://news.digitalmars.com:119/ih030g$1ok1$1 digitalmars.com

My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a response. If you found that dismissive, I'd be hard pressed to improve it. To quote myself:
 I believe the proposed scheme:

 1. Changes the language in a major way;

 2. Is highly disruptive;

 3. Improves the status quo in only minor ways.

 I'd be much more willing to improve things by e.g. defining the
representation() function I talked about a bit ago, and other less disruptive
additions.

That took into consideration your amendments.
  >
  > In the interest of moving this on, would it become acceptable to you if:
  >
  > 1. indexing and slicing of the code-point range were removed?
  > 2. any additional ranges are exposed to the user according to decisions
  > made about graphemes, etc?
  > 3. other constructive criticisms were accommodated?
  >
  > Steve
  >
  >
  > On 15/01/11 03:33, Andrei Alexandrescu wrote:
  >> On 1/14/11 5:06 AM, Steven Schveighoffer wrote:
  >>> I respectfully disagree. A stream built on fixed-sized units, but with
  >>> variable length elements, where you can determine the start of an
  >>> element in O(1) time given a random index absolutely provides
  >>> random-access. It just doesn't provide length.
  >>
  >> I equally respectfully disagree. I think random access is defined as
  >> accessing the ith element in O(1) time. That's not the case here.
  >>
  >> Andrei
  >
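The O(1) claim in the exchange quoted above can be illustrated for UTF-8 with a minimal sketch. The helper name codePointStart is hypothetical (it is not from the thread or from Phobos), and valid UTF-8 input is assumed. It finds the start of the code point that contains a given code unit index in at most three backward steps, which is the property Steve describes; it does not find the i-th code point, which is the property Andrei's definition of random access requires.

```d
// Hypothetical helper, assuming valid UTF-8: back up from an arbitrary
// code unit index to the first code unit of the enclosing code point.
size_t codePointStart(string s, size_t i)
{
    assert(i < s.length);
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx, and a code
    // point has at most three of them, so this loop runs at most 3 times.
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

void main()
{
    string s = "été"; // code units: [é é] [t] [é é]
    assert(codePointStart(s, 0) == 0);
    assert(codePointStart(s, 1) == 0); // second unit of the first "é"
    assert(codePointStart(s, 2) == 2); // "t"
    assert(codePointStart(s, 4) == 3); // second unit of the last "é"
}
```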


 I happen to have discussed at length my beef with Steve's proposal. Now
 in one sentence you change the proposed design on the fly without
 fleshing out the consequences, add to it again without substantiation,
 and presumably expect me to come with a salient analysis of the result.
 I don't think it's fair to characterize my answer to that as dismissive,
 nor to pressure me into expanding on it.

Sorry, I could have given more context. But you didn't discuss what I asked, based on the observation that your detailed criticisms of Steve's proposal all related to a single aspect of it.

I really don't know what to add to make my answer more meaningful. Andrei
Jan 18 2011
parent reply Steven Wawryk <stevenw acres.com.au> writes:
On 19/01/11 13:53, Andrei Alexandrescu wrote:
 My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a
 response. If you found that dismissive, I'd be hard pressed to improve
 it. To quote myself:

 I believe the proposed scheme:

 1. Changes the language in a major way;

 2. Is highly disruptive;

 3. Improves the status quo in only minor ways.

 I'd be much more willing to improve things by e.g. defining the
 representation() function I talked about a bit ago, and other less
 disruptive additions.

That took into consideration your amendments.

I don't think that it did. I proposed no language change, nor anything disruptive. The change in status quo I proposed was essentially the same one you encouraged here, about a type that gives the user the choice of what kind of range to be operated on. It appears to me that you were responding to some perception you had about Steve's full proposal (that may have been triggered by something I said in the introduction), not what I actually said in the content. So, I would still be interested to know how to sound out this newsgroup with an idea (before coding commitment) and have the suggestions considered on something more than a superficial level. Is the newsgroup too busy? Should there be people nominated to screen ideas that are worth looking at? Should I use a completely different approach? Your suggestions so far I will take into account, but it still looks like there's a barrier to me.
 Sorry, I could have given more context. But you didn't discuss what I
 asked, based on the observation that your detailed criticisms of Steve's
 proposal all related to a single aspect of it.

I really don't know what to add to make my answer more meaningful. Andrei

Jan 18 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/18/11 9:46 PM, Steven Wawryk wrote:
 On 19/01/11 13:53, Andrei Alexandrescu wrote:
 My response of Sun, 16 Jan 2011 20:58:43 -0600 was a fair attempt at a
 response. If you found that dismissive, I'd be hard pressed to improve
 it. To quote myself:

 I believe the proposed scheme:

 1. Changes the language in a major way;

 2. Is highly disruptive;

 3. Improves the status quo in only minor ways.

 I'd be much more willing to improve things by e.g. defining the
 representation() function I talked about a bit ago, and other less
 disruptive additions.

That took into consideration your amendments.

I don't think that it did. I proposed no language change, nor anything disruptive.

Adding a new string type would be disruptive. Unless I misunderstood, there is still a new string type in Steve's proposal, and one that would be the default one, even after the amendments you mentioned. That is a problem because people write this:

    auto s = "hello";

and the question is, what is the type of s.

 The change in status quo I proposed was essentially the same
 one you encouraged here, about a type that gives the user the choice of
 what kind of range to be operated on. It appears to me that you were
 responding to some perception you had about Steve's full proposal (that
 may have been triggered by something I said in the introduction), not
 what I actually said in the content.

If that's what it is, great. To clarify: no new string type, only a range that iterates one grapheme over existing strings.
 So, I would still be interested to know how to sound out this newsgroup
 with an idea (before coding commitment) and have the suggestions
 considered on something more than a superficial level.

 Is the newsgroup too busy? Should there be people nominated to screen
 ideas that are worth looking at? Should I use a completely different
 approach? Your suggestions so far I will take into account, but it still
 looks like there's a barrier to me.

My perception is that you want to minimize risks before starting to invest work into this. I'm not sure how you can do that. Andrei
Jan 18 2011
prev sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-18 01:16:13 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/17/11 9:48 PM, Michel Fortin wrote:
 On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
 said:
 
 More seriously, you have four choices:
 
 1. code unit
 2. code point
 3. grapheme
 4. require the client to state explicitly which kind of 'character' he
 wants; 'character' being an overloaded word, it's reasonable to ask
 for disambiguation.

This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience.

Very insightful. Thanks for sharing. Code it up and make a solid proposal!

What I use right now is this (see below). I'm not sure what would be a good name for it though. The expectation is that I'll get either an ASCII char or something out of ASCII range if it isn't ASCII.

The abstraction doesn't seem very 'solid' to me, in the sense that I can't see how it'd apply to ranges other than strings, so it's only useful for strings (the character array kind), and it's only useful as a workaround since you made ElementType!(char[]) a dchar. Well, any range returning char, dchar, wchar could map frontUnit to front and popFrontUnit to popFront to keep things working, but it makes the optimization rather pointless. I don't really have an idea where to go from here.

    char frontUnit(string input) {
        assert(input.length > 0);
        return input[0];
    }

    wchar frontUnit(wstring input) {
        assert(input.length > 0);
        return input[0];
    }

    dchar frontUnit(dstring input) {
        assert(input.length > 0);
        return input[0];
    }

    void popFrontUnit(ref string input) {
        assert(input.length > 0);
        input = input[1..$];
    }

    void popFrontUnit(ref wstring input) {
        assert(input.length > 0);
        input = input[1..$];
    }

    void popFrontUnit(ref dstring input) {
        assert(input.length > 0);
        input = input[1..$];
    }

    version (unittest) {
        import std.string : front, popFront;
    }
    unittest {
        string test = "été";
        assert(test.length == 5);

        string test2 = test;
        assert(test2.front == 'é');
        test2.popFront();
        assert(test2.length == 3); // removed "é", which is two UTF-8 code units

        string test3 = test;
        assert(test3.frontUnit == "é"c[0]);
        test3.popFrontUnit();
        assert(test3.length == 4); // removed first half of "é", which is one UTF-8 code unit
    }

-- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 18 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/18/11 7:17 AM, Michel Fortin wrote:
 On 2011-01-18 01:16:13 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 1/17/11 9:48 PM, Michel Fortin wrote:
 On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
 said:

 More seriously, you have four choices:

 1. code unit
 2. code point
 3. grapheme
 4. require the client to state explicitly which kind of 'character' he
 wants; 'character' being an overloaded word, it's reasonable to ask
 for disambiguation.

This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience.

Very insightful. Thanks for sharing. Code it up and make a solid proposal!

What I use right now is this (see below). I'm not sure what would be a good name for it though. The expectation is that I'll get either an ASCII char or something out of ASCII range if it isn't ASCII. The abstraction doesn't seem very 'solid' to me, in the sense that I can't see how it'd apply to ranges other than strings, so it's only useful for strings (the character array kind), and it's only useful as a workaround since you made ElementType!(char[]) a dchar. Well, any range returning char,dchar,wchar could map frontUnit to front and popFrontUnit to popFront to keep things working, but it makes the optimization rather pointless. I don't really have an idea where to go from here.

I was thinking along the lines of:

    struct Grapheme
    {
        private string support_;
        ...
    }

    struct ByGrapheme
    {
        private string iteratee_;
        bool empty();
        Grapheme front();
        void popFront();
        // Additional funs
        dchar frontCodePoint();
        void popFrontCodePoint();
        char frontCodeUnit();
        void popFrontCodeUnit();
        ...
    }

    // helper function
    ByGrapheme byGrapheme(string s);

    // usage
    string s = ...;
    size_t i;
    foreach (g; byGrapheme(s))
    {
        writeln("Grapheme #", i, " is ", g);
    }

We need this range in Phobos.

Andrei
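For readers who want to try the idea, here is one possible runnable take on the interface sketched above. It is only an illustration under stated assumptions: the struct name ByGraphemeSketch is made up, a grapheme is represented as a slice of the underlying string rather than as a dedicated Grapheme type, and boundary detection is delegated to std.uni.graphemeStride and std.utf.stride from present-day Phobos, which postdates this thread.

```d
import std.range.primitives : front;
import std.uni : graphemeStride;
import std.utf : stride;

// Hypothetical sketch: one cursor over a UTF-8 string that can advance
// by grapheme, by code point, or by code unit.
struct ByGraphemeSketch
{
    private string iteratee_;

    bool empty() { return iteratee_.length == 0; }

    // Grapheme level (the range primitives). A grapheme is modelled as a
    // slice here, which is an assumption of this sketch.
    string front() { return iteratee_[0 .. graphemeStride(iteratee_, 0)]; }
    void popFront() { iteratee_ = iteratee_[graphemeStride(iteratee_, 0) .. $]; }

    // Code point level.
    dchar frontCodePoint() { return iteratee_.front; } // decodes one code point
    void popFrontCodePoint() { iteratee_ = iteratee_[stride(iteratee_, 0) .. $]; }

    // Code unit level.
    char frontCodeUnit() { return iteratee_[0]; }
    void popFrontCodeUnit() { iteratee_ = iteratee_[1 .. $]; }
}

void main()
{
    // "e" + U+0301 COMBINING ACUTE ACCENT, then two ASCII characters.
    auto r = ByGraphemeSketch("e\u0301<x");
    assert(r.front == "e\u0301");   // one grapheme, two code points
    r.popFront();
    assert(r.frontCodeUnit == '<'); // ASCII test without decoding
    r.popFrontCodeUnit();
    assert(r.frontCodePoint == 'x');
    r.popFrontCodePoint();
    assert(r.empty);
}
```

Because all three levels share the same underlying slice, steps at different levels can be interleaved freely, which is the property Michel asked for earlier in the thread.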
Jan 18 2011
parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-18 11:38:45 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 1/18/11 7:17 AM, Michel Fortin wrote:
 On 2011-01-18 01:16:13 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:
 
 On 1/17/11 9:48 PM, Michel Fortin wrote:
 On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
 said:
 
 More seriously, you have four choices:
 
 1. code unit
 2. code point
 3. grapheme
 4. require the client to state explicitly which kind of 'character' he
 wants; 'character' being an overloaded word, it's reasonable to ask
 for disambiguation.

This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience.

Very insightful. Thanks for sharing. Code it up and make a solid proposal!

What I use right now is this (see below). I'm not sure what would be a good name for it though. The expectation is that I'll get either an ASCII char or something out of ASCII range if it isn't ASCII. The abstraction doesn't seem very 'solid' to me, in the sense that I can't see how it'd apply to ranges other than strings, so it's only useful for strings (the character array kind), and it's only useful as a workaround since you made ElementType!(char[]) a dchar. Well, any range returning char,dchar,wchar could map frontUnit to front and popFrontUnit to popFront to keep things working, but it makes the optimization rather pointless. I don't really have an idea where to go from here.

I was thinking along the lines of:

    struct Grapheme
    {
        private string support_;
        ...
    }

    struct ByGrapheme
    {
        private string iteratee_;
        bool empty();
        Grapheme front();
        void popFront();
        // Additional funs
        dchar frontCodePoint();
        void popFrontCodePoint();
        char frontCodeUnit();
        void popFrontCodeUnit();
        ...
    }

    // helper function
    ByGrapheme byGrapheme(string s);

    // usage
    string s = ...;
    size_t i;
    foreach (g; byGrapheme(s))
    {
        writeln("Grapheme #", i, " is ", g);
    }

We need this range in Phobos.

Yes, we need a grapheme range. But that's not what my thing was about. It was about shortcutting code point decoding when it isn't necessary while still keeping the ability to decode to code points when iterating on the same range.

For instance, here's a simple made-up example:

    string s = "<hello>";
    if (!s.empty && s.frontUnit == '<')
        s.popFrontUnit(); // skip
    while (!s.empty && s.frontUnit != '>')
        s.popFront(); // do something with each code point
    if (!s.empty && s.frontUnit == '>')
        s.popFrontUnit(); // skip
    assert(s.empty);

Here, since I know I'm testing and skipping for '<', an ASCII character, decoding the code point is wasted time, so I skip that decoding.

The problem is that this optimization can't happen with a range that abstracts things at the code point level. I can do it with strings because strings still allow you to access code units through the indexing operators, but this can't really apply to ranges of code points in general. And parsing with a range of code units would also be a pain, because even if I'm testing for '<' for the first character, sometimes I really need to advance by code point and test for code points.

One thing that might be interesting is benchmarking my XML parser by replacing every instance of frontUnit and popFrontUnit with front and popFront. That won't change the results, but it'd give us an idea of the overhead of unnecessarily decoding code points.

-- Michel Fortin michel.fortin michelf.com http://michelf.com/
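A self-contained version of the pattern described above might look like the following sketch. The function tagName is a hypothetical name introduced here for illustration, the two helpers mirror the frontUnit/popFrontUnit functions posted earlier in the thread (string overloads only), and the import assumes present-day Phobos, where the array range primitives live in std.range.primitives.

```d
import std.range.primitives : empty, front, popFront;

// Helpers mirroring the frontUnit/popFrontUnit functions from the thread.
char frontUnit(string input) { assert(input.length > 0); return input[0]; }
void popFrontUnit(ref string input) { assert(input.length > 0); input = input[1 .. $]; }

// Hypothetical example: collect the code points between '<' and '>'.
dstring tagName(string s)
{
    dstring name;
    if (!s.empty && s.frontUnit == '<')
        s.popFrontUnit();   // ASCII delimiter: compare one code unit, no decoding
    while (!s.empty && s.frontUnit != '>')
    {
        name ~= s.front;    // decode a full code point
        s.popFront();       // advance by a whole code point
    }
    return name;
}

void main()
{
    assert(tagName("<hello>") == "hello"d);
    assert(tagName("<été>") == "été"d);
}
```

The unit-level test against '>' is safe in UTF-8 because every code unit of a multi-byte sequence is 0x80 or above, so it can never compare equal to an ASCII character.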
Jan 18 2011
prev sibling parent =?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:
Michel Fortin wrote:
 On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
 said:

 So perhaps the best interface for strings would be to provide multiple
 range-like interfaces that you can use at the level you want.

That's what I've been thinking. The users can choose whether they want random access or not. A grapheme-aware string can provide random access at a space cost, or no random access for efficient space use.

I see 5 layers in string processing. Layers 1 and 2 are currently handled by D, sometimes in an unclear way; e.g. char[] may be used as an array of code units or an array of code points depending on the type of iteration.

1) Code units: This is what D provides with its string types.
   This layer models RandomAccessRange.

2) Code points: This is what D and Phobos provide, for example with
   foreach(d; stride(s, 1))
   dchar[] models RandomAccessRange at this layer.
   char[] and wchar[] model ForwardRange at this layer.
   (If I understand it correctly, Steven Schveighoffer is trying to provide a pseudo-RandomAccessRange to char[] and wchar[] with his string type.)

3) Graphemes: This is what the string type that spir is working on provides. There could be at least two types:
   3a) RandomAccessGraphemeRange: Has random access but the data type is large
   3b) ForwardGraphemeRange: space-efficient but does not provide random access
   I think the programmers would be happy to be able to choose.

4) Letters: Uses either 3a or 3b. This is the layer where the idea of a writing system enters the picture: lower/upper case transformations and sorting happen at this layer. (I have a library that tries to handle this layer but is ignorant of graphemes; I am waiting for spir's string type. ;))
   4a) Models RandomAccessRange if based on a RandomAccessGraphemeRange
   4b) Models ForwardRange if based on a ForwardGraphemeRange

5) Text: Collection of Letters. This is where a name like "ali & tim" is correctly capitalized as "ALİ & TIM" because the text consists of two separate writing systems. (The same library that I mentioned in 4 tries to handle this layer as well.)

Ali
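The first three layers can be observed directly with a small sketch. It assumes present-day Phobos (std.utf.byCodeUnit and std.uni.byGrapheme were added after this thread); the function name layerCounts is hypothetical.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Hypothetical helper: element counts of the same string at layers 1 to 3.
size_t[3] layerCounts(string s)
{
    return [
        s.byCodeUnit.walkLength, // layer 1: UTF-8 code units
        s.walkLength,            // layer 2: code points (strings decode when iterated)
        s.byGrapheme.walkLength, // layer 3: graphemes
    ];
}

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT.
    auto counts = layerCounts("e\u0301");
    assert(counts[0] == 3); // 1 unit for 'e' + 2 units for U+0301
    assert(counts[1] == 2); // two code points
    assert(counts[2] == 1); // one user-perceived character
}
```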
Jan 18 2011
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-15 22:25:47 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:

 The issue of foreach remains, but without being willing to change what 
 foreach defaults to, you can't really fix it - though I'd suggest that 
 we at least make it a warning to iterate over strings without 
 specifying the type. And if foreach were made to understand Grapheme 
 like it understands dchar, then you could do
 
 foreach(Grapheme g; str) { ... }
 
 and have the compiler warn about
 
 foreach(g; str) { ... }
 
 and tell you to use Grapheme if you want to be comparing actual characters.

Walter's argument against changing this for foreach was that it'd *silently* break compatibility with existing D1 code. Changing the default to a grapheme makes this argument obsolete: since a grapheme is essentially a string, you can't compare it with char or wchar or dchar directly, so it'll break at compile time with an error and you'll have to decide what to do. So Walter would have to find another argument to defend the status quo. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
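The compile-time breakage described above can be checked with a short sketch. It assumes, as the post does, that a grapheme is represented as a string slice; the variable names are illustrative only.

```d
void main()
{
    string grapheme = "e\u0301"; // a grapheme modelled as a slice (assumption)
    dchar codePoint = 'e';
    char unit = 'e';

    // Comparing a slice with a single character is rejected by the compiler,
    // so code written against char or dchar elements fails loudly.
    static assert(!__traits(compiles, grapheme == unit));
    static assert(!__traits(compiles, grapheme == codePoint));

    // Comparing a slice with another slice is fine.
    static assert(__traits(compiles, grapheme == "e"));

    // By contrast, dchar and char compare without complaint, which is why
    // switching the element type between them can change behaviour silently.
    static assert(__traits(compiles, codePoint == unit));

    assert(grapheme != "e"); // "e" plus a combining accent is not plain "e"
}
```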
Jan 15 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/15/11 10:47 PM, Michel Fortin wrote:
 On 2011-01-15 22:25:47 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:

 The issue of foreach remains, but without being willing to change what
 foreach defaults to, you can't really fix it - though I'd suggest that
 we at least make it a warning to iterate over strings without
 specifying the type. And if foreach were made to understand Grapheme
 like it understands dchar, then you could do

 foreach(Grapheme g; str) { ... }

 and have the compiler warn about

 foreach(g; str) { ... }

 and tell you to use Grapheme if you want to be comparing actual
 characters.

Walter's argument against changing this for foreach was that it'd *silently* break compatibility with existing D1 code. Changing the default to a grapheme makes this argument obsolete: since a grapheme is essentially a string, you can't compare it with char or wchar or dchar directly, so it'll break at compile time with an error and you'll have to decide what to do. So Walter would have to find another argument to defend the status quo.

I think it's poor abstraction to represent a Grapheme as a string. It should be its own type. Andrei
Jan 16 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/15/11 9:25 PM, Jonathan M Davis wrote:
 Considering that strings are already dealt with specially in order to have an
 element of dchar, I wouldn't think that it would be all that disruptive to make
 it so that they had an element type of Grapheme instead. Wouldn't that then fix
 all of std.algorithm and the like without really disrupting anything?

It would make everything related a lot (a TON) slower, and it would break all client code that uses dchar as the element type, or is otherwise unprepared to use Graphemes explicitly. There is no question there will be disruption. Andrei
Jan 16 2011
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
 On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 1/15/11 9:25 PM, Jonathan M Davis wrote:
 Considering that strings are already dealt with specially in order to
 have an
 element of dchar, I wouldn't think that it would be all that
 disruptive to make
 it so that they had an element type of Grapheme instead. Wouldn't
 that then fix
 all of std.algorithm and the like without really disrupting anything?

It would make everything related a lot (a TON) slower, and it would break all client code that uses dchar as the element type, or is otherwise unprepared to use Graphemes explicitly. There is no question there will be disruption.

I would have agreed with you last week. Now I understand that using dchar is just as useless for unicode as using char.

This is one extreme. Char only works for English. Dchar works for most languages. It won't work for a few. That doesn't make it useless for languages that work with it.
 Will it be slower? Perhaps. A TON slower? Probably not.

It will be a ton slower.
 But it will be correct. Correct and slow is better than incorrect and
 fast. If I showed you a shortest-path algorithm that ran in O(V) time,
 but didn't always find the shortest path, would you call it a success?

The comparison doesn't apply.
 We need to get some real numbers together. I'll see what I can create
 for a type, but someone else needs to supply the input :) I'm on short
 supply of unicode data, and any attempts I've made to create some result
 in failure. I have one example of one composed character in this thread
 that I can cling to, but in order to supply some real numbers, we need a
 large amount of data.

I very much appreciate that you're doing actual work on this. Andrei
Jan 17 2011
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
 We need to get some real numbers together. I'll see what I can create
 for a type, but someone else needs to supply the input :) I'm on short
 supply of unicode data, and any attempts I've made to create some result
 in failure. I have one example of one composed character in this thread
 that I can cling to, but in order to supply some real numbers, we need a
 large amount of data.

Oh, one more thing. You don't need a lot of Unicode text containing combining characters to write benchmarks. (You do need it for testing purposes.) Most text won't contain combining characters anyway, so after you implement graphemes, just benchmark them on regular text. Andrei
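A benchmark along the lines suggested above could start from the following sketch. It is not a measurement from the thread: it assumes present-day Phobos (std.uni.byGrapheme and std.datetime.stopwatch.benchmark), the sample text and repetition counts are arbitrary, and the helper names are hypothetical. It compares counting code points with counting graphemes on plain ASCII text that contains no combining characters.

```d
import std.array : replicate;
import std.datetime.stopwatch : benchmark;
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;

// Hypothetical helpers: the two iteration strategies being compared.
size_t countCodePoints(string text) { return text.walkLength; }
size_t countGraphemes(string text) { return text.byGrapheme.walkLength; }

void main()
{
    // Regular text without combining characters (arbitrary sample).
    auto text = replicate("The quick brown fox jumps over the lazy dog. ", 1_000);

    size_t points, graphemes;
    void byPoint() { points = countCodePoints(text); }
    void byGraph() { graphemes = countGraphemes(text); }

    // Run each strategy 20 times; the result holds one Duration per function.
    auto times = benchmark!(byPoint, byGraph)(20);

    assert(points == graphemes); // no combining characters in the sample
    writefln("by code point: %s, by grapheme: %s", times[0], times[1]);
}
```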
Jan 17 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday 15 January 2011 15:59:27 Andrei Alexandrescu wrote:
 On 1/15/11 4:45 PM, Michel Fortin wrote:
 On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer"
 <schveiguy yahoo.com> said:
 On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin
 <michel.fortin michelf.com> wrote:
 On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"
 <schveiguy yahoo.com> said:
 I'm not suggesting we impose it, just that we make it the default.
 If you want to iterate by dchar, wchar, or char, just write:
 foreach (dchar c; "exposé") {}
 foreach (wchar c; "exposé") {}
 foreach (char c; "exposé") {}
 // or
 foreach (dchar c; "exposé".by!dchar()) {}
 foreach (wchar c; "exposé".by!wchar()) {}
 foreach (char c; "exposé".by!char()) {}
 and it'll work. But the default would be a slice containing the
 grapheme, because this is the right way to represent a Unicode
 character.

I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.

I'm glad we agree on that now.

It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee defined standard where there are 10 ways to do everything, I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).

Actually, I don't think Unicode was so badly designed. It's just that nobody had an idea of the real scope of the problem they had in hand at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now, can you name one other standard that evolved enough to get 6 major versions? I'm surprised it's not worse given all that it must support.

That said, I'm sure if someone could redesign Unicode by breaking backward-compatibility we'd have something simpler. You could probably get rid of pre-combined characters and reduce the number of normalization forms. But would you be able to get rid of normalization entirely? I don't think so. Reinventing Unicode is probably not worth it.


 I'm not opposed to that on principle. I'm a little uneasy about
 having so many types representing a string however. Some other raw
 comments:
 I agree that things would be more coherent if char[], wchar[], and
 dchar[] behaved like other arrays, but I can't really see a
 justification for those types to be in the language if there's
 nothing special about them (why not a library type?).

I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility.

However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.

Indeed, the change would probably be too radical for D2.

I think we agree that the default type should behave as a Unicode string, not an array of characters. I understand your opposition to conflating arrays of char with strings, and I agree with you to a certain extent that it could have been done better. But we can't really change the type of string literals, can we? The only thing we can change (I hope) at this point is how iterating on strings works.

Walter said earlier that he opposes changing foreach's default element type to dchar for char[] and wchar[] (as Andrei did for ranges) on the ground that it would silently break D1 compatibility. This is a valid point in my opinion.

I think you're right when you say that not treating char[] as an array of characters breaks, to a certain extent, C compatibility. Another valid point.

That said, I want to emphasize that iterating by grapheme, contrary to iterating by dchar, does not break any code *silently*. The compiler will complain loudly that you're comparing a string to a char, so you'll have to change your code somewhere if you want things to compile. You'll have to look at the code and decide what to do.

One more thing:

NSString in Cocoa is in essence the same thing as I'm proposing here: an array of UTF-16 code units, but with string behaviour. It supports by-code-unit indexing, but appending, comparing, searching for substrings, etc. all behave correctly as a Unicode string. Again, I agree that it's probably not the best design, but I can tell you it works well in practice. In fact, NSString doesn't even expose the concept of grapheme, it just uses them internally, and you're pretty much limited to the built-in operations. I think what we have here in concept is much better... even if it somewhat conflates code-unit arrays and strings.

I'm unclear on where this is converging to. At this point the commitment of the language and its standard library to (a) UTF array representation and (b) code points conceptualization is quite strong. Changing that would be quite difficult and disruptive, and the benefits are virtually nonexistent for most of D's user base.

It may be more realistic to consider using what we have as back-end for grapheme-oriented processing. For example:

    struct Grapheme(Char) if (isSomeChar!Char)
    {
        private const Char[] rep;
        ...
    }

    auto byGrapheme(S)(S s) if (isSomeString!S)
    {
        ...
    }

    string s = "Hello";
    foreach (g; byGrapheme(s))
    {
        ...
    }

Considering that strings are already dealt with specially in order to have an element of dchar, I wouldn't think that it would be all that disruptive to make it so that they had an element type of Grapheme instead. Wouldn't that then fix all of std.algorithm and the like without really disrupting anything?

The issue of foreach remains, but without being willing to change what foreach defaults to, you can't really fix it - though I'd suggest that we at least make it a warning to iterate over strings without specifying the type. And if foreach were made to understand Grapheme like it understands dchar, then you could do

    foreach(Grapheme g; str) { ... }

and have the compiler warn about

    foreach(g; str) { ... }

and tell you to use Grapheme if you want to be comparing actual characters.

Regardless, by making strings ranges of Grapheme rather than dchar, I would think that we would solve most of the problem. At minimum, we'd have pretty much the same problems that we have right now with char and wchar arrays, but we'd get rid of a whole class of unicode problems. So, nothing would be worse, but some of it would be better.

- Jonathan M Davis
Jan 15 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
And how would 3rd party libraries handle Graphemes? And C modules? I
think making these Graphemes the default would make quite a mess,
since you would have to convert back and forth between char[] and
Grapheme[] all the time (right?).
Jan 16 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/15/11 9:25 PM, Jonathan M Davis wrote:
 Considering that strings are already dealt with specially in order to  
 have an
 element of dchar, I wouldn't think that it would be all that  
 disruptive to make
 it so that they had an element type of Grapheme instead. Wouldn't that  
 then fix
 all of std.algorithm and the like without really disrupting anything?

It would make everything related a lot (a TON) slower, and it would break all client code that uses dchar as the element type, or is otherwise unprepared to use Graphemes explicitly. There is no question there will be disruption.

I would have agreed with you last week. Now I understand that using dchar is just as useless for unicode as using char. Will it be slower? Perhaps. A TON slower? Probably not. But it will be correct. Correct and slow is better than incorrect and fast. If I showed you a shortest-path algorithm that ran in O(V) time, but didn't always find the shortest path, would you call it a success? We need to get some real numbers together. I'll see what I can create for a type, but someone else needs to supply the input :) I'm on short supply of unicode data, and any attempts I've made to create some result in failure. I have one example of one composed character in this thread that I can cling to, but in order to supply some real numbers, we need a large amount of data. -Steve
Jan 17 2011
prev sibling next sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Mon, 17 Jan 2011 07:44:17 -0500, Steven Schveighoffer wrote:

 We need to get some real numbers together.  I'll see what I can create
 for a type, but someone else needs to supply the input :)  I'm on short
 supply of unicode data, and any attempts I've made to create some result
 in failure.  I have one example of one composed character in this thread
 that I can cling to, but in order to supply some real numbers, we need a
 large amount of data.

Googling "unicode sample document" turned up a few examples. This one looks promising: http://www.humancomp.org/unichtm/unichtm.htm -Lars
Jan 17 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 17 Jan 2011 10:00:57 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
 We need to get some real numbers together. I'll see what I can create
 for a type, but someone else needs to supply the input :) I'm on short
 supply of unicode data, and any attempts I've made to create some result
 in failure. I have one example of one composed character in this thread
 that I can cling to, but in order to supply some real numbers, we need a
 large amount of data.

Oh, one more thing. You don't need a lot of Unicode text containing combining characters to write benchmarks. (You do need it for testing purposes.) Most text won't contain combining characters anyway, so after you implement graphemes, just benchmark them on regular text.

True, benchmarking doesn't apply with combining characters because we have nothing to compare it to. The current scheme fails on it anyways, so it by default would be the best solution. -Steve
Jan 17 2011
prev sibling parent =?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:
Thanks to all who have contributed, I am also following this thread with 
great interest. :)

Michel Fortin wrote:
 I mean, a grapheme is a slice of a string, can have multiple code points
 (like a string), can be appended the same way as a string, can be
 composed or decomposed using canonical normalization or compatibility
 normalization (like a string), and should be sorted, uppercased, and
 lowercased according to Unicode rules (like a string). Basically, a
 grapheme is just a string that happens to contain only one grapheme.

I would like to stress the fact that Unicode knows nothing about sorting, uppercasing, or lowercasing. Those operations are tied to the alphabet (or writing system) that a certain grapheme happens to belong to at a given time. For example, we cannot uppercase the letter i without knowing what alphabet we are dealing with. Two possibilities: I and İ (I dot above). It is the same issue with sorting. Ali
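The point about 'i' can be sketched as follows. `upperFor` and its `turkic` flag are hypothetical names for this illustration, not a Phobos API; `std.uni.toUpper` is the real function, and it takes no alphabet or locale argument.

```d
import std.stdio : writeln;
import std.uni : toUpper;

/// Hypothetical helper: uppercases one letter for a caller-chosen alphabet.
/// The `turkic` flag stands in for whatever writing-system information
/// the caller has; it is an assumption of this sketch.
dchar upperFor(dchar c, bool turkic)
{
    if (turkic && c == 'i')
        return '\u0130'; // LATIN CAPITAL LETTER I WITH DOT ABOVE
    if (turkic && c == '\u0131')
        return 'I';      // dotless i uppercases to plain I
    return toUpper(c);   // alphabet-agnostic default mapping
}

void main()
{
    writeln(upperFor('i', false)); // I
    writeln(upperFor('i', true));  // İ
}
```

The same input code point yields two different results, so the choice cannot be made from the character alone.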
Jan 17 2011
prev sibling next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"spir" <denis.spir gmail.com> wrote in message 
news:mailman.624.1295013588.4748.digitalmars-d puremagic.com...
 If it does not display properly, either set your terminal to UTF* or use a 
 more unicode-aware font (eg DejaVu series).

How to do that on the Windows (XP) command prompt, for anyone who doesn't know:

Step 1: Right-click title bar, "Properties", "Font" tab, set font to "Lucida Console" (It'll look weird at first, but you get used to it.)

Step 2 (I had to google this step):

For just the current terminal session: Run "chcp 65001". (I.e. "CHange Code Page") Also, you can run "chcp" to just see what codepage you're already set to.

To make it work permanently: Put "chcp 65001" into the registry key "HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun"
Jan 14 2011
parent reply "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:igq9u6$1bqu$1 digitalmars.com...
 Step 2 (I had to google this step):

 For just the current terminal session: Run "chcp 65001". (I.e. "CHange Code 
 Page") Also, you can run "chcp" to just see what codepage you're already 
 set to.

 To make it work permanently: Put "chcp 65001" into the registry key 
 "HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun"

Forget that step 2, that causes "Active code page: 65001" to be sent to stdout *every* time system() is invoked. We shouldn't be relying on that. *This* is what should be done (and this really should be done in all D command line apps - or better yet, put into the runtime):

import std.stdio;

version(Windows)
{
    import std.c.windows.windows;
    extern(Windows) export BOOL SetConsoleOutputCP(UINT);
}

void main()
{
    version(Windows) SetConsoleOutputCP(65001);

    writeln("HuG says: Fukken Über Death Terminal");
}

See also: http://d.puremagic.com/issues/show_bug.cgi?id=1448
Jan 14 2011
parent "Nick Sabalausky" <a a.a> writes:
"Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message 
news:mailman.631.1295038817.4748.digitalmars-d puremagic.com...
On 1/14/11, Nick Sabalausky <a a.a> wrote:
 import std.stdio;

 version(Windows)
 {
     import std.c.windows.windows;
     extern(Windows) export BOOL SetConsoleOutputCP(UINT);
 }

 void main()
 {
     version(Windows) SetConsoleOutputCP(65001);

     writeln("HuG says: Fukken Über Death Terminal");
 }

Does that work for you? I get back: HuG says: Fukken Über Death Terminal

Yea, it works for me (XP Pro SP2 32-bit), and my "chcp" is 437, not 65001. The NG or copy-paste might have messed it up. Try with a code-point escape sequence:

import std.stdio;

version(Windows)
{
    import std.c.windows.windows;
    extern(Windows) export BOOL SetConsoleOutputCP(UINT);
}

void main()
{
    version(Windows) SetConsoleOutputCP(65001);

    writeln("HuG says: Fukken \u00DCber Death Terminal");
}
Jan 14 2011
prev sibling parent "Joel C. Salomon" <joelcsalomon gmail.com> writes:
On 01/14/2011 09:34 AM, Steven Schveighoffer wrote:
 Is it common to have multiple modifiers on a single character?  The
 problem I see with using decomposed canonical form for strings is that
 we would have to return a dchar[] for each 'element', which severely
 complicates code that, for instance, only expects to handle English.

Hebrew:
• Almost every letter in a printed Hebrew bible has at least one of:
  ‣ vowel marker (the Hebrew alphabet is otherwise consonantal) and
  ‣ a /dagesh/ dot, indicating the difference between /b/ & /v/, or between /mm/ and /m/;
• almost every word has at least one letter with a cantillation mark in addition to the above; and
• other marks too complicated & off-topic to explain.

Vietnamese uses Latin letters with accents playing multiple roles, so there are often two or three accent marks on a single letter; e.g., the name of the creator of pdfTeX is spelled “Hàn Thế Thành”, with two accents on the “e”.

I’m sure there are others.

-Joel
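The Vietnamese "ế" makes a compact test case for stacked marks. The sketch below assumes a Phobos with `std.uni.normalize` and `std.uni.byGrapheme`; the function names `codePoints` and `graphemes` are invented for the example.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : NFD, byGrapheme, normalize;

/// Hypothetical helper: number of code points after canonical decomposition.
size_t codePoints(string s)
{
    return normalize!NFD(s).walkLength;
}

/// Hypothetical helper: number of grapheme clusters.
size_t graphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // U+1EBF LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE
    string e = "\u1EBF";
    // NFD yields 'e' + U+0302 (circumflex) + U+0301 (acute).
    assert(normalize!NFD(e) == "e\u0302\u0301");
    writeln(codePoints(e)); // 3
    writeln(graphemes(e));  // 1
}
```

One base letter carrying two modifiers is therefore three elements at the dchar level and one element at the grapheme level.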
Jan 23 2011
prev sibling parent spir <denis.spir gmail.com> writes:
On 01/19/2011 08:43 AM, Ali Çehreli wrote:
 Michel Fortin wrote:
  > On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
  > said:

  > So perhaps the best interface for strings would be to provide multiple
  > range-like interfaces that you can use at the level you want.

 That's what I've been thinking. The users can choose whether they want
 random access or not. A grapheme-aware string can provide random access
 at a space cost, or no random access for efficient space use.

 I see 5 layers in string processing. Layers 1 and 2 are currently
 handled by D, sometimes in an unclear way. e.g. char[] may be used as an
 array of code units or an array of code points depending on the type of
 iteration.

This is a very good and helpful summary. But you do not list all relevant aspects of the question, I guess. Defining which codes belong to a given grapheme (what I call "piling") is necessary for true O(1) random-access, but not only. More importantly, all operations involving equality comparison (find, count, replace,...) require normalisation --in addition to piling. A few notes:
 1) Code units: This is what D provides with its string types

 This layer models RandomAccessRange

This level is a pure implementation artifact that simply cannot make any sense (from the user's and thus the programmer's point of view). Any kind of text manipulation (slice, find, replace...) may lead to random incorrectness, except when source texts can be guaranteed to hold plain ASCII (which may be hard to prove). Conversely, pieces of text only passed around by an app do not require any more costly representation, in terms of time (decoding) or space. In addition, concat works provided all pieces share the same encoding (ASCII being a subset of most historic charsets and of UTF-8).
 2) Code points: This is what D and Phobos provide for example with
 foreach(d; stride(s, 1))

 dchar[] models RandomAccessRange at this layer

 char[] and wchar[] model ForwardRange at this layer

 (If I understand it correctly, Steven Schveighoffer is trying to provide
 a pseudo-RandomAccessRange to char[] and wchar[] with his string type.)

This level is also a kind of implementation artifact, compared to historic charsets, but actually based on a real fact of natural languages: they hold composite characters that can thus be coded by combining lower-level codes which represent "scripting marks" (base & combining ones). For this reason, this level can have some sense. My latest guess is that apps that consider text as a study object (read linguistic apps), instead of a means, may regularly need to operate at this level, in addition to the next one. Normalisation can be applied at this level --and is necessary for the above kind of use case. But using it for operations requiring compare will typically also require "piling", that is the next level, if only to determine what is to be compared.
 3) Graphemes: This is what the string type that spir is working on provides.
 There could be at least two types:

This is the meaningful level for, probably, nearly all applications.
 3a) RandomAccessGraphemeRange: Has random access but the data type is large

I guess this is Text's approach? Text is "flash fast" indeed for any operation benefiting from random-access. But not only: since it normalises its input, it should be far faster for any operation using compare (rough evaluations suggest a speed ratio of 1 to 2 orders of magnitude). The cost is high in terms of space, which in turn certainly reduces its speed gain in the general case, because of cache (miss) effects. (Thank you Michel for making this clear.)
 3b) ForwardGraphemeRange: space-efficient but does not provide random
 access

Is it what Andrei expects, namely a Grapheme type with a corresponding ByGrapheme iterator IIUC? Time efficiency of operations?

3) metadata RandomAccessGraphemeRange

Michel Fortin suggested (off list) an alternative approach to Text: instead of actually "piling" at construction time, just store metadata upon grapheme bounds. The core benefit is indeed to keep "normal" text storage (meaning *char[], for modification): would this point please Andrei better? I let you evaluate various consequences of this change (mostly positive, I guess). The same metadata principle could certainly be used for further optimisations, but this is another story. I'm motivated to implement this variant, looks like the best of both worlds to me. (support welcome ;-)
 I think the programmers would be happy to be able to choose.

 4) Letters: Uses either 3a or 3b. This is the layer where the idea of a
 writing system enters the picture: lower/upper case transformations and
 sorting happen at this layer. (I have a library that tries to handle
 this layer but is ignorant of graphemes; I am waiting for spir's string
 type. ;))

 4a) Models RandomAccessRange if based on a RandomAccessGraphemeRange

 4b) Models ForwardRange if based on a ForwardGraphemeRange

I do not understand what this level means. For me, letters are, precisely, archetypical true characters, meaning level 3. [Note: "grapheme", used by Unicode to denote the common sense of "character", is simply wrong: "sh" and "ti" are graphemes in english (for the same phoneme /ʃ/), not characters; and tab, §, or © are probably not considered graphemes by linguists, while they are characters. This is the reason why I try to avoid this term and use "character", like ICU's doc, to avoid even more confusion.]
 5) Text: Collection of Letters. This is where a name like "ali & tim" is
 correctly capitalized as "ALİ & TIM" because the text consists of two
 separate writing systems. (The same library that I mentioned in 4 tries
 to handle this layer as well.)

This is an immensely complicated field. Note that it has nothing to do with text & character representation issues: whatever the character set, one has to confront problems like uppercase of 'i', 'ss' vs 'ß', definition of "letter" or "character", matching, sorting order... Text does not even try to address natural language issues. Instead it deals only, but hopefully clearly & correctly, with restoring simple and safe representation for client apps.
 Ali

Denis _________________ vita es estrany spir.wikidot.com
Jan 19 2011
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 15 Jan 2011 14:51:47 -0500, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 I feel like you might be exaggerating, but maybe I'm completely wrong on  
 this, I'm not well-versed in unicode, or even languages that require  
 unicode.  The clear benefit I see is that with a string type which  
 normalizes to canonical code points, you can use this in any algorithm  
 without having it be unicode-aware for *most languages*.  At least, that  
 is how I see it.  I'm looking at it as a code-reuse proposition.

 It's like calendars.  There are quite a few different calendars in  
 different cultures.  But most people use a Gregorian calendar.  So we  
 have three options:

 a) Use a Gregorian calendar, and leave the other calendars to a 3rd  
 party library
 b) Use a complicated calendar system where Gregorian calendars are  
 treated with equal respect to all other calendars, none are the default.
 c) Use a Gregorian calendar by default, but include the other calendars  
 as a separate module for those who wish to use them.

 I'm looking at my proposal as more of a c) solution.

 Can you show how normalization causes subtle bugs?

I see from Michel's post how normalization automatically can be bad. I also see that it can be wasteful. So I've shifted my position. Now I agree that we need a full unicode-compliant string type as the default. See my reply to Michel for more info on my revised proposal. -Steve
Jan 15 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/11/2011 02:30 PM, Steven Schveighoffer wrote:
 On Mon, 10 Jan 2011 22:57:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 I've been thinking on how to better deal with Unicode strings.
 Currently strings are formally bidirectional ranges with a
 surreptitious random access interface. The random access interface
 accesses the support of the string, which is understood to hold data
 in a variable-encoded format. For as long as the programmer
 understands this relationship, code for string manipulation can be
 written with relative ease. However, there is still room for writing
 wrong code that looks legit.

 Sometimes the best way to tackle a hairy reality is to invite it to
 the negotiation table and offer it promotion to first-class
 abstraction status. Along that vein I was thinking of defining a new
 range: VLERange, i.e. Variable Length Encoding Range. Such a range
 would have the power somewhere in between bidirectional and random
 access.

 The primitives offered would include empty, access to front and back,
 popFront and popBack (just like BidirectionalRange), and in addition
 properties typical of random access ranges: indexing, slicing, and
 length. Note that the result of the indexing operator is not the same
 as the element type of the range, as it only represents the unit of
 encoding.

 In addition to these (and connecting the two), a VLERange would offer
 two additional primitives:

 1. size_t stepSize(size_t offset) gives the length of the step needed
 to skip to the next element.

 2. size_t backstepSize(size_t offset) gives the size of the _backward_
 step that goes to the previous element.

 In both cases, offset is assumed to be at the beginning of a logical
 element of the range.

 I suspect that a lot of functions in std.string can be written without
 Unicode-specific knowledge just by relying on such an interface.
 Moreover, algorithms can be generalized to other structures that use
 variable-length encoding, such as those used in data compression. (In
 that case, the support would be a bit array and the encoded type would
 be ubyte.)

 Writing to such ranges is not addressed by this design. Ideas are
 welcome.

 Adding VLERange would legitimize strings and would clarify their
 handling, at the cost of adding one additional concept that needs to
 be minded. Is the trade-off worthwhile?
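 The two extra primitives in the quoted proposal can be sketched for UTF-8 with functions that exist in std.utf (`stride` and `strideBack`). The wrapper name `Utf8VLE` and its `support` field are made up for this sketch; no such type exists in Phobos, and the sketch covers only the two new primitives, not the full range interface.

```d
import std.stdio : writeln;
import std.utf : stride, strideBack;

/// Hypothetical VLERange-like wrapper over a UTF-8 string.
struct Utf8VLE
{
    string support; // the underlying code units

    /// Code units to skip to reach the next element from `offset`.
    size_t stepSize(size_t offset)
    {
        return stride(support, offset);
    }

    /// Code units to step back from `offset` to the previous element.
    size_t backstepSize(size_t offset)
    {
        return strideBack(support, offset);
    }
}

void main()
{
    // 'a' (1 unit), U+00E9 (2 units), 'z' (1 unit): 4 code units total.
    auto r = Utf8VLE("a\u00E9z");
    writeln(r.stepSize(1));     // 2
    writeln(r.backstepSize(3)); // 2
}
```

 As in the proposal, both calls assume `offset` sits at the start of an encoded element; passing an offset in the middle of a sequence is outside what this sketch handles.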

While this makes it possible to write algorithms that only accept VLERanges, I don't think it solves the major problem with strings -- they are treated as arrays by the compiler. I'd also rather see an indexing operation return the element type, and have a separate function to get the encoding unit. This makes more sense for generic code IMO. I noticed you never commented on my proposed string type... That reminds me, I should update with suggested changes and re-post it.

People interested in solving the general problem with Unicode strings may have a look at https://bitbucket.org/denispir/denispir-d. All constructive feedback welcome. (This will be asked for review in a short while. The main / client interface module is Text.d. A (long) presentation of the issues, reasons, solution can be found in the text called "U missing level of abstraction") Denis _________________ vita es estrany spir.wikidot.com
Jan 11 2011
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 11 Jan 2011 11:54:08 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/11/11 5:30 AM, Steven Schveighoffer wrote:
 While this makes it possible to write algorithms that only accept
 VLERanges, I don't think it solves the major problem with strings --
 they are treated as arrays by the compiler.

Except when they're not - foreach with dchar...

This solitary difference is a very thin argument -- foreach(d; byDchar(str)) would be just as good without requiring compiler help.
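A library-only version of that iteration might look like the sketch below. It assumes a Phobos that provides `std.utf.byDchar` and `std.utf.byCodeUnit` (both were added after this thread); the two counting functions are invented names for the example.

```d
import std.stdio : writeln;
import std.utf : byCodeUnit, byDchar;

/// Hypothetical helper: counts decoded code points via a library range.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar d; s.byDchar) // decoding done by the range, not the compiler
        ++n;
    return n;
}

/// Hypothetical helper: counts raw UTF-8 code units.
size_t countCodeUnits(string s)
{
    return s.byCodeUnit.length;
}

void main()
{
    string s = "expos\u00E9"; // precomposed U+00E9 at the end
    writeln(countCodePoints(s)); // 6
    writeln(countCodeUnits(s));  // 7
}
```

The caller picks the level explicitly at the call site, which is the property being argued for here.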
 I'd also rather see an indexing operation return the element type, and
 have a separate function to get the encoding unit. This makes more sense
 for generic code IMO.

But that's neither here nor there. That would return the logical element at a physical position. I am very doubtful that much generic code could work without knowing they are in fact dealing with a variable-length encoding.

It depends on the function, and the way the indexing is implemented.
 I noticed you never commented on my proposed string type...

 That reminds me, I should update with suggested changes and re-post it.

To be frank, I think it didn't mark a visible improvement. It solved some problems and brought others. There was disagreement over the offered primitives and their semantics.

It is supposed to be simple, and provide the expected interface, without causing any undue performance degradation. That is, I should be able to do all the things with a replacement string type that I can with a char array today, as efficiently as I can today, except I should have to work to get at the code-units. The huge benefit is that I can say "I'm dealing with this as an array" when I know it's safe. The disagreement will never be fully solved, as there is just as much disagreement about the current state of affairs ;) e.g. should foreach default to using dchar?
 That being said, it's good you are doing this work. In the best case,  
 you could bring a compelling abstraction to the table. In the worst,  
 you'll become as happy about D's strings as I am :o).

I don't think I'll ever be 'happy' with the way strings sit in phobos currently. I typically deal in ASCII (i.e. code units), and phobos works very hard to prevent that. -Steve
Jan 11 2011
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 15 Jan 2011 15:46:11 -0500, foobar <foo bar.com> wrote:


 I'd like to have full Unicode support. I think it is a good thing for D  
 to have in order to expand in the world. As an alternative, I'd settle  
 for loud errors that make absolutely clear to the non-Unicode expert  
 programmer that D simply does NOT support e.g. Normalization.

 As Spir already said, Unicode is something few understand and even its  
 own official docs do not explain such issues properly. We should not  
 confuse users even further with incomplete support.

Well said, I've changed my mind. Thanks for explaining. -Steve
Jan 15 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 15 Jan 2011 17:19:48 -0500, foobar <foo bar.com> wrote:

 I like Michel's proposed semantics and I also agree with you that it  
 should be a distinct string type and not break consistency of regular  
 arrays.

 Regarding your last point: Do you mean that a grapheme would be a  
 sub-type of string? (a specialization where the string represents a  
 single element)? If so, then it sounds good to me.

A grapheme would be its own specialized type. I'd probably remove the range primitives to really differentiate it. Unfortunately, due to the inability to statically check this, the invariant would have to be a runtime check. Most likely this check would be disabled in release mode. This can cause problems, and I can see why it is attractive to use strings to implement graphemes, but that also has its problems. With grapheme being its own type, we are providing a way to optimize functions, and allow further restrictions on function parameters. At the end of the day, perhaps grapheme *should* just be a string. We'll have to see how this breaks in practice, either way. -Steve
Jan 17 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 15 Jan 2011 17:45:37 -0500, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:

 On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin   
 <michel.fortin michelf.com> wrote:

 On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"   
 <schveiguy yahoo.com> said:

 I'm not suggesting we impose it, just that we make it the default.  
 If   you want to iterate by dchar, wchar, or char, just write:
  	foreach (dchar c; "exposé") {}
 	foreach (wchar c; "exposé") {}
 	foreach (char c; "exposé") {}
 	// or
 	foreach (dchar c; "exposé".by!dchar()) {}
 	foreach (wchar c; "exposé".by!wchar()) {}
 	foreach (char c; "exposé".by!char()) {}
  and it'll work. But the default would be a slice containing the    
 grapheme, because this is the right way to represent a Unicode   
 character.

but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.


it's used. It seems like it's a typical committee defined standard where there are 10 ways to do everything, I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).

Actually, I don't think Unicode was so badly designed. It's just that nobody had an idea of the real scope of the problem they had in hand at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now, can you name one other standard that evolved enough to get 6 major versions? I'm surprised it's not worse given all that it must support.

I didn't read the standard, all I understand about unicode is from this NG ;) What I meant was the ability to do things more than one way seems like a committee-designed standard. Usually with one of those, you have one party who "absolutely needs" one way of doing things (most likely because all their code is based on it), and other parties who want it a different way. When compromises occur, the end result is, you have a standard that's unnecessarily difficult to implement.
 Indeed, the change would probably be too radical for D2.

 I think we agree that the default type should behave as a Unicode  
 string, not an array of characters. I understand your opposition to  
 conflating arrays of char with strings, and I agree with you to a  
 certain extent that it could have been done better. But we can't really  
 change the type of string literals, can we. The only thing we can change  
 (I hope) at this point is how iterating on strings work.

I was hoping to change string literal types. If we don't do that, we have a half-ass solution. I don't think it's going to be impossible, because string, wstring, dstring are all aliases. In fact, with my current proposed type, this already works:

mystring s = "hello";

But this doesn't:

auto s = "hello"; // still typed as immutable(char)[]

This isn't so bad, just require one to specify the type, right? Well, it fails miserably here:

foo(mystring s) {...}

foo("hello"); // fails to match.

In order to have a string type, string literals have to be typed as that type.
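The three cases can be checked with a small stand-in type. `MyString` below is a hypothetical placeholder with a single constructor, not the actual proposed type; the sketch only demonstrates how D treats literals for a user-defined struct.

```d
import std.stdio : writeln;

/// Hypothetical stand-in for a library string type.
struct MyString
{
    immutable(char)[] data;
    this(string s) { data = s; }
}

size_t foo(MyString s) { return s.data.length; }

void main()
{
    // Initialization with a declared type calls the constructor.
    MyString s = "hello";

    // Type inference still gives the built-in array type.
    auto t = "hello";
    static assert(is(typeof(t) == string));

    // No implicit conversion at a call site: foo("hello") does not compile.
    static assert(!__traits(compiles, foo("hello")));

    // The caller has to wrap the literal explicitly.
    writeln(foo(MyString("hello"))); // 5
}
```

Without compiler support for typing the literal itself, every call site that passes a literal needs the explicit wrapper.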
 Walter said earlier that he oppose changing foreach's default element  
 type to dchar for char[] and wchar[] (as Andrei did for ranges) on the  
 ground that it would silently break D1 compatibility. This is a valid  
 point in my opinion.

 I think you're right when you say that not treating char[] as an array  
 of character breaks, to a certain extent, C compatibility. Another valid  
 point.

 That said, I want to emphasize that iterating by grapheme, contrary to  
 iterating by dchar, does not break any code *silently*. The compiler  
 will complain loudly that you're comparing a string to a char, so you'll  
 have to change your code somewhere if you want things to compile. You'll  
 have to look at the code and decide what to do.

Changing iteration and not indexing is not going to fix the mess we have right now.
 One more thing:

 NSString in Cocoa is in essence the same thing as I'm proposing here: an  
 array of UTF-16 code units, but with string behaviour. It supports  
 by-code-unit indexing, but appending, comparing, searching for  
 substrings, etc. all behave correctly as a Unicode string. Again, I  
 agree that it's probably not the best design, but I can tell you it  
 works well in practice. In fact, NSString doesn't even expose the  
 concept of grapheme, it just uses them internally, and you're pretty  
 much limited to the built-in operation. I think what we have here in  
 concept is much better... even if it somewhat conflates code-unit arrays  
 and strings.

But is NSString typed the *exact same* as an array, or is it a wrapper for an array? Looking at the docs, it appears it is not.
 Or you could make a grapheme a string_t. ;-)

For all intents and purposes, a grapheme is a string of one 'element', so it could potentially be a string_t. It does seem daunting to have so many types, but at the same time, types convey relationships at compile time that can make coding impossible to get wrong, or make things actually possible when having a single type doesn't. I'll give you an example from a previous life: [...] I feel that making extra types when the relationship between them is important is worth the possible repetition of functionality. Catching bugs during compilation is soooo much better than experiencing them during runtime.

I can understand the utility of a separate type in your DateTime example, but in this case I fail to see any advantage. I mean, a grapheme is a slice of a string, can have multiple code points (like a string), can be appended the same way as a string, can be composed or decomposed using canonical normalization or compatibility normalization (like a string), and should be sorted, uppercased, and lowercased according to Unicode rules (like a string). Basically, a grapheme is just a string that happens to contain only one grapheme. What would a custom type do differently than a string?

A grapheme type would not be a range, it would be an element of the string range. You could not append to it (otherwise, that makes it into a string). In all other respects, it should act similar to a string (as you say, printing, upper-casing, comparison, etc.)
 Also, grapheme == "a" is easy to understand because both are strings.  
 But if a grapheme is a separate type, what would a grapheme literal look  
 like?

A grapheme should be comparable to a string literal. It should be assignable to a string literal. The drawback is we would need a runtime check to ensure the string literal was actually one grapheme. Some compiler help in this regard would be useful, but I'm not sure how the mechanics would work (you couldn't exactly type a literal differently based on its contents). Another possibility is to come up with a different syntax to denote grapheme literals.
 So in the end I don't think a grapheme needs a specific type, at least  
 not for general purpose text processing. If I split a string on  
 whitespace, do I get a range where elements are of type "word"? No, just  
 sliced strings.

It is not clear that using a separate type is the "right answer." It may be that an element of a string should be a string. This does work in other languages that don't have a concept of a character. An extra type however, allows us to have more concrete positions to work with.
 That said, I'm much less concerned by the type used to represent a  
 grapheme than by the Unicode correctness. I'm not opposed to a separate  
 type, I just don't really see the point.

I will try to explain better by making an actual candidate type. -Steve
Jan 17 2011
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 17 Jan 2011 10:14:19 -0500, spir <denis.spir gmail.com> wrote:

 On 01/15/2011 08:51 PM, Steven Schveighoffer wrote:
 Moreover, even if you ignore Hebrew as a tiny insignificant minority
 you cannot do the same for Arabic which has over one *billion* people
 that use that language.

I hope that the medium type works 'good enough' for those languages, with the high level type needed for advanced usages. At a minimum, comparison and substring should work for all languages.

Hello Steven, How does an application know that a given text, which supposedly is written in a given natural language (as for instance indicated by an html header) does not also hold terms from other languages? There are various occasions for this: quotations, use of foreign words, pointers...

A side-issue is raised by precomposed codes for composite characters. For most languages of the world, I guess (but unsure), all "official" characters have single-code representations. Good, but unfortunately this is not enforced by the standard (instead, the decomposed form can sensibly be considered the base form, but this is another topic). So even if one knows for sure that all characters of all texts an app will ever deal with can be mapped to single codes, to be safe one would have to normalise to NFC anyway (Normalisation Form Composed). Then, where is the actual gain? In fact, it is a loss because NFC is more costly than NFD (Decomposed) -- actually, the standard NFC algo first decomposes to NFD to initially get a unique representation that can then be more easily (re)composed via simple mappings.

For further information:
Unicode's normalisation algos: http://unicode.org/reports/tr15/
list of technical reports: http://unicode.org/reports/
(Unicode's technical reports are far more readable than the standard itself, but unfortunately often refer to it.)

I'll reply to this to save you the trouble. I have reversed my position since writing a lot of these posts. In summary, I think strings should default to an element type of a grapheme, which should be implemented via a slice of the original data. Updated string type forthcoming. -Steve
Jan 17 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 11 Jan 2011 18:00:30 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/11/11 11:21 AM, Steven Schveighoffer wrote:

 It is supposed to be simple, and provide the expected interface, without
 causing any undue performance degradation. That is, I should be able to
 do all the things with a replacement string type that I can with a char
 array today, as efficiently as I can today, except I should have to work
 to get at the code-units. The huge benefit is that I can say "I'm
 dealing with this as an array" when I know it's safe

Unfinished sentence?

Sorry, I forgot '.' :)
 Anyway, for my money you just described what we have now.

All except the 'expected interface' part. The string type should deal with dchars exclusively, since that's what it is a range of. char[] gives you char's back when you index it. Anyone who doesn't use ASCII will be confused by this. Also, I expect to be able to use a char[] as an array, which Phobos doesn't let me in some cases (e.g. sorting ASCII character array).
 The disagreement will never be fully solved, as there is just as much
 disagreement about the current state of affairs ;) e.g. should foreach
 default to using dchar?

I disagree about the disagreement being unsolvable. I'm not rigid; if I saw a terrific abstraction in your string, I'd be all for it. It just shuffles some issues about, and although I agree it does one thing or two better than char[], at the end of the day it doesn't carry its weight.

I see it as having two vast improvements:

1. If we replace char[] with a specific type for string, then char[] can be considered a true array by phobos, and phobos can now deal with a char[] array without the need to cast.
2. It protects the casual user from incorrectly using a string by making the default the correct API.

Those to me are very important.
 I don't think I'll ever be 'happy' with the way strings sit in phobos
 currently. I typically deal in ASCII (i.e. code units), and phobos works
 very hard to prevent that.

I wonder if we could and should extend some of the functions in std.string to work with ubyte[]. I did add a function called representation() that I didn't document yet. Essentially representation gives you the ubyte[], ushort[], or uint[] underneath a string, with the same qualifiers. Whenever you want an algorithm to work on ASCII in earnest, you can pass representation(s) to it instead of s.
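A minimal sketch of the representation() idea described above, assuming the function as it exists in present-day std.string (the helper names countAscii and sortedAscii are invented for this illustration). It shows the code units being handled as plain bytes, including the "sort an ASCII character array" case mentioned earlier in the thread:

```d
import std.algorithm.sorting : sort;
import std.stdio : writeln;
import std.string : representation;

/// Counts the code units of `s` that lie in the ASCII range.
/// (Hypothetical helper, not part of Phobos.)
size_t countAscii(string s)
{
    size_t n = 0;
    // representation() yields immutable(ubyte)[] for a string, so the
    // loop sees raw UTF-8 code units and no decoding takes place.
    foreach (ubyte b; s.representation)
        if (b < 0x80)
            ++n;
    return n;
}

/// Sorts the code units of an ASCII-only string.
/// (Hypothetical helper; the result is meaningless for non-ASCII input.)
char[] sortedAscii(string s)
{
    char[] letters = s.dup;
    // The ubyte[] view aliases `letters`, so sorting it sorts in place.
    sort(letters.representation);
    return letters;
}

void main()
{
    writeln(countAscii("f\u00FCnf")); // the two bytes of U+00FC are not counted
    writeln(sortedAscii("dcba"));
}
```

The caller has to opt in to the byte view explicitly, which is the trade-off under discussion: the default type stays a string, and ASCII-only code pays one extra call.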

This, again, fails on point 2 above. A char[] is an array, and allows access to code-units, which is not the correct interface for a string. Supporting ubyte[] doesn't fix that problem. Correct as the default is usually a theme in D...
 If you work a lot with ASCII, an AsciiString abstraction may be a better  
 and more likely to be successful string type. Better yet, you could  
 simply focus on AsciiChar and then define ASCII strings as arrays of  
 AsciiChar.

This seems like the wrong approach. Adding a new type does not fix the problems with the original type. We need to replace the original type or at least how it is treated by the compiler. -Steve
Jan 13 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/13/11 8:52 AM, Steven Schveighoffer wrote:
 I see it as having two vast improvements:

 1. If we replace char[] with a specific type for string, then char[] can
 be considered a true array by phobos, and phobos can now deal with a
 char[] array without the need to cast.
 2. It protects the casual user from incorrectly using a string by making
 the default the correct API.

 Those to me are very important.

Let's take a look:

// Incorrect string code
void fun(string s) {
    foreach (i; 0 .. s.length) {
        writeln("The character in position ", i, " is ", s[i]);
    }
}

// Incorrect string_t code
void fun(string_t!char s) {
    foreach (i; 0 .. s.codeUnits) {
        writeln("The character in position ", i, " is ", s[i]);
    }
}

Both functions are incorrect, albeit in different ways. The only improvement I'm seeing is that the user needs to write codeUnits instead of length, which may make her think twice. Clearly, however, copiously incorrect code can be written with the proposed interface because it tries to hide the reality that underneath a variable-length encoding is being used, but doesn't hide it completely (albeit for good efficiency-related reasons).

You might be looking at my previous version. The new version (recently posted) will throw an exception for that code if a multi-code-unit code-point is found. It also supports this:

foreach(i, d; s)
{
    writeln("The character in position ", i, " is ", d);
}

where i is the index (might not be sequential)
 But wait, there's less. Functions for random-access range throughout  
 Phobos routinely assume fixed-length encoding, i.e. s[i + 1] lies next  
 to s[i]. From a cursory look at string_t, std.range will qualify it as a  
 RandomAccessRange without length. That's an odd beast but does not  
 change the fixed-length encoding assumption. So you'd need to  
 special-case algorithms for string_t, just like right now certain  
 algorithms are specialized for string.

isRandomAccessRange requires hasLength (see here: http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532). This is not a random access range per that definition. But a string isn't a random access range anyways (it's specifically disallowed by std.range per that same reference). The plan is you would *not* have to special case algorithms for string_t as you do currently for char[]. If that's not the case, then we haven't achieved much. Simply put, we are separating out the strange nature of strings from arrays, so the exceptional treatment of them is handled by the type itself, not the functions using it. -Steve
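The range classification being argued about can be checked at compile time. The sketch below assumes the traits as they exist in present-day std.range.primitives, which may differ in detail from the 2011 trunk revision linked above:

```d
import std.range.primitives : hasLength, isBidirectionalRange,
    isRandomAccessRange;

// A narrow string is treated as a bidirectional range of dchar:
// no length and no random access at the range level.
static assert( isBidirectionalRange!string);
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);

// A dstring has fixed-width elements, so it qualifies fully.
static assert(isRandomAccessRange!dstring);

// So does a plain array of bytes, i.e. the code units themselves.
static assert(isRandomAccessRange!(immutable(ubyte)[]));

void main() {}
```

If the program compiles, every claim in the comments holds for the Phobos version in use.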
Jan 13 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 13 Jan 2011 15:51:00 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
 On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:
 Let's take a look:

 // Incorrect string code
 void fun(string s) {
 foreach (i; 0 .. s.length) {
 writeln("The character in position ", i, " is ", s[i]);
 }
 }

 // Incorrect string_t code
 void fun(string_t!char s) {
 foreach (i; 0 .. s.codeUnits) {
 writeln("The character in position ", i, " is ", s[i]);
 }
 }

 Both functions are incorrect, albeit in different ways. The only
 improvement I'm seeing is that the user needs to write codeUnits
 instead of length, which may make her think twice. Clearly, however,
 copiously incorrect code can be written with the proposed interface
 because it tries to hide the reality that underneath a variable-length
 encoding is being used, but doesn't hide it completely (albeit for
 good efficiency-related reasons).

You might be looking at my previous version. The new version (recently posted) will throw an exception for that code if a multi-code-unit code-point is found.

I was looking at your latest. It's code that compiles and runs, but dynamically fails on some inputs. I agree that it's often better to fail noisily instead of silently, but in a manner of speaking the string-based code doesn't fail at all - it correctly iterates the code units of a string. This may sometimes not be what the user expected; most of the time they'd care about the code points.

Iterating the code units is possible by accessing the array data, i.e. you could do:

foreach(i, c; s.data)

if you want the code-units. That is the point of having a separate type. Using string_t tells the library "I'm using this data as a string". Using char[] tells the library "I'm using this data as an array." The difference here is, you have to *specifically* try to access the code units, the default is code-points. All it does really is switch the default.
 It also supports this:

 foreach(i, d; s)
 {
 writeln("The character in position ", i, " is ", d);
 }

 where i is the index (might not be sequential)

Well string supports that too, albeit with the nit that you need to specify dchar.
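For reference, a small sketch of the built-in behaviour mentioned here: foreach over a string with an explicit dchar loop variable decodes code points, and the index is the code-unit offset where each one starts. The function name codePointOffsets is invented for this illustration:

```d
import std.stdio : writeln;

/// Returns the code-unit offset at which each code point of `s` starts.
/// (Hypothetical helper, not part of Phobos.)
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    // Asking for dchar makes foreach decode the UTF-8; `i` is the index
    // of the first code unit of each code point, so values can be skipped.
    foreach (i, dchar d; s)
        offsets ~= i;
    return offsets;
}

void main()
{
    // U+00FC takes two code units in UTF-8, so offset 2 never appears.
    writeln(codePointOffsets("f\u00FCnf")); // prints [0, 1, 3, 4]
}
```

Dropping the dchar annotation makes the same loop iterate code units, which is the "nit" referred to above.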

This is not a small problem.
 isRandomAccessRange requires hasLength (see here:
 http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532).
 This is not a random access range per that definition.

That's an interesting twist. By the way I specified length is required then because I couldn't imagine having random access into something that I can't tell the length of. Apparently I was wrong :o).

Yes, in fact, you could say that specifically defines VLERange ;) But actually, there are two types of VLE ranges, those which can be randomly accessed (where determining the beginning of a code point, given a random index is possible) and those that cannot (where decoding depends on the exact order of the data). Actually, those would not be bi-directional ranges anyways.
 But a string
 isn't a random access range anyways (it's specifically disallowed by
 std.range per that same reference).

It isn't and it isn't supposed to be.

I agree with that assessment, which is why I omitted length. -Steve
Jan 13 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 14 Jan 2011 01:44:19 -0500, Nick Sabalausky <a a.a> wrote:

 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
 news:igoqrm$1n5r$1 digitalmars.com...
 On 1/13/11 10:26 PM, Nick Sabalausky wrote:
 [snip]
 [ 'f', {u with the umlaut}, 'n', 'f' ]

 Or:

 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.

 Note that while some characters exist in pre-combined form (such as the
 {u with the umlaut} above), legend has it there are others that can only
 be represented using a combining character.

 It's also my understanding, though I'm not certain, that sometimes
 multiple
 combining characters can be used together on the same "root" character.

Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?

My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain. FWIW, the Wikipedia article might help, or at least link to other things that might help: http://en.wikipedia.org/wiki/Combining_character

http://en.wikipedia.org/wiki/Unicode_normalization

Linked from that page, the normalization process is probably something we need to look at. Using decomposed canonical form would mean we need more state than just which code-unit we are on, plus it creates more likelihood that a match will be found with part of a grapheme (spir or Michel brought it up earlier). So I think the correct case is to use composed canonical form. This is after just reading that page, so maybe I'm missing something.

Non-composable combinations would be a problem. The string range is formed on the basis that the element type is a dchar. If there are combinations that cannot be composed into a single dchar, then the element type has to be a dchar array (or some other type which contains all the info). The other option is to simply leave them decomposed. Then you risk things like partial matches.

I'm leaning towards a solution like this: While iterating a string, it should output dchars in normalized composed form. But a specialized comparison function should be used when doing things like searches or regex, because it might not be possible to compose two combining characters. The drawback to this is that a dchar might not be able to represent a grapheme (only if it cannot be composed), but I think it's too much of a hit in complexity and performance to make the element type of a string larger than a dchar. Those who wish to work with a more comprehensive string type can use a more complex string type such as the one created by spir.

Does that sound reasonable? -Steve
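To make the composed-versus-decomposed point concrete, here is a minimal sketch using normalize from present-day std.uni (a module that postdates this discussion). The helper canonicallyEqual is an invented name, and this is only one possible shape for the "specialized comparison function" mentioned above:

```d
import std.stdio : writeln;
import std.uni : NFC, NFD, normalize;

/// True when `a` and `b` are canonically equivalent.
/// (Hypothetical helper: it normalizes both sides eagerly, which is simple
/// but allocates; a lazy comparison would be needed for performance.)
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed   = "f\u00FCnf";   // precomposed u-with-diaeresis
    string decomposed = "fu\u0308nf";  // 'u' followed by combining diaeresis

    writeln(composed == decomposed);                 // plain array comparison
    writeln(canonicallyEqual(composed, decomposed)); // normalized comparison
}
```

The plain == compares code units and reports a mismatch, while the normalized comparison treats both spellings as the same text.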
Jan 14 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 14 Jan 2011 08:14:02 -0500, spir <denis.spir gmail.com> wrote:

 On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:

 That's forgetting that most of the time people care about graphemes
 (user-perceived characters), not code points.

I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).

I'm aware of that, and I have no definitive answer to the question. The issue *does* exist --as shown even by trivial examples such as Michel's below, not corner cases. The actual question is _not_ whether code or "grapheme" is the proper level of abstraction. To this, the answer is clear: codes are simply meaningless in 99% cases. (All historic software deals with chars, conceptually, but they happen to be coded with single codes.) (And what about Objective-C? Why did its designers even bother with that?). The question is rather: why do we nearly all happily go on ignoring the issue? My present guess is a combination of factors:

* The issue is masked by the misleading use of "abstract character" in unicode literature. "Abstract" is very correct, but they should have found another term than "character", say "abstract scripting mark". Their deceiving terminological choice lets most programmers believe that codepoints code characters, like in historic charsets. (Even worse: some doc explicitly states that ICU's notion of character matches the programming notion of character.)
* ICU added precomposed codes for a bunch of characters, supposedly for backward compatibility with said charsets. (But where is the gain? We need to decode them anyway...) The consequence is, at the pedagogical level, very bad: most text-producing software (like editors) use such precomposed codes when available for a given character. So that programmers can happily go on believing in the code=character myth. (Note: the gain in space is ridiculous for western text.)
* Most characters that appear in western texts (at least "official" characters of natural languages) have precomposed forms.
* Programmers can very easily be unaware their code is incorrect: how do you even notice it in test output?

* I don't even know how to make a grapheme that is more than one code-unit, let alone more than one code-point :) Every time I try, I get 'invalid utf sequence'. I feel significantly ignorant on this issue, and I'm slowly getting enough knowledge to join the discussion, but being a dumb American who only speaks English, I have a hard time grasping how this shit all works. -Steve
Jan 14 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir gmail.com> wrote:

 On 01/14/2011 02:37 PM, Steven Schveighoffer wrote:
 * I don't even know how to make a grapheme that is more than one
 code-unit, let alone more than one code-point :)  Every time I try, I
 get 'invalid utf sequence'.

 I feel significantly ignorant on this issue, and I'm slowly getting
 enough knowledge to join the discussion, but being a dumb American who
 only speaks English, I have a hard time grasping how this shit all  
 works.

1. See my text at https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction

I can't read that document, it's black background with super-dark-grey text.
 2.
      writeln ("A\u0308\u0330");
 <A + umlaut above + tilde below>
 If it does not display properly, either set your terminal to UTF* or use  
 a more unicode-aware font (eg DejaVu series).

OK, I'll have to remember this so I can use it to test my string type ;)
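A quick way to use that literal as a test case: the sketch below counts the same string at the three levels discussed in this thread. It assumes byGrapheme from present-day std.uni; the three helper names are invented for this illustration:

```d
import std.range.primitives : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helpers, one per level of abstraction.
size_t codeUnitCount(string s)  { return s.length; }                // UTF-8 bytes
size_t codePointCount(string s) { return s.walkLength; }            // decoded dchars
size_t graphemeCount(string s)  { return s.byGrapheme.walkLength; } // user-perceived characters

void main()
{
    string s = "A\u0308\u0330"; // 'A' plus two combining marks

    writeln(codeUnitCount(s));  // 1 byte for 'A', 2 for each mark
    writeln(codePointCount(s));
    writeln(graphemeCount(s));
}
```

The three numbers differ, which is exactly why indexing by code unit or by code point can land inside what a reader sees as a single character.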
 The point is not playing like that with Unicode flexibility. Rather that  
 composite characters are just normal thingies in most languages of the  
 world. Actually, on this point, english is a rare exception (discarding  
 letters imported from foreign languages like french 'à'); to the point  
 of being, I guess, the only western language without any diacritic.

Is it common to have multiple modifiers on a single character? The problem I see with using decomposed canonical form for strings is that we would have to return a dchar[] for each 'element', which severely complicates code that, for instance, only expects to handle English. I was hoping to lazily transform a string into its composed canonical form, allowing the (hopefully rare) exception when a composed character does not exist. My thinking was that this at least gives a useful string representation for 90% of usages, leaving the remaining 10% of usages to find a more complex representation (like your Text type). If we only get like 20% or 30% there by making dchar the element type, then we haven't made it useful enough. Either way, we need a string type that can be compared canonically for things like searches or opEquals. -Steve
Jan 14 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/14/11, Nick Sabalausky <a a.a> wrote:
 import std.stdio;

 version(Windows)
 {
     import std.c.windows.windows;
     extern(Windows) export BOOL SetConsoleOutputCP(UINT);
 }

 void main()
 {
     version(Windows) SetConsoleOutputCP(65001);

     writeln("HuG says: Fukken Über Death Terminal");
 }

Does that work for you? I get back: HuG says: Fukken Ãœber Death Terminal
Jan 14 2011
prev sibling next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/14/11, Nick Sabalausky <a a.a> wrote:
 Try with a code-point escape sequence

Nope, I still get the same results (tried with different fonts, lucida etc.., but I don't think it's a font issue). Maybe I have my settings messed up or something.
Jan 14 2011
parent "Nick Sabalausky" <a a.a> writes:
"Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message 
news:mailman.633.1295044452.4748.digitalmars-d puremagic.com...
 On 1/14/11, Nick Sabalausky <a a.a> wrote:
 Try with a code-point escape sequence

Nope, I still get the same results (tried with different fonts, lucida etc.., but I don't think it's a font issue). Maybe I have my settings messed up or something.

Weird. Which version of windows are you on, and are you using the regular command line or powershell or something else? If you run "chcp 65001" from the cmd line first, does it work then?
Jan 14 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/14/11, Nick Sabalausky <a a.a> wrote:
 Weird. Which version of windows are you on, and are you using the regular
 command line or powershell or something else? If you run "chcp 65001" from
 the cmd line first, does it work then?

Okay, it appears this is an issue with Console2. I'll have to report it to the dev, although he hasn't fixed much of anything for ages already. I'm really contemplating writing my own shell by now. (no Linux jokes now, please. :p) Works fine in cmd.exe, Lucida font without calling 65001 manually. In fact, it works with the 437 code page as well when I comment out SetConsoleOutputCP.
Jan 14 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 1/15/11, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:
 fact, it works with the 437 code page as well when I comment out
 SetConsoleOutputCP.

Woops, let me revise what I've said: If the code has the call to change the codepage, then I'll get back the correct result in console. If it doesn't, I have to switch the codepage manually. I don't know what the problem with Console2 is, but if I change cmd.exe to always use a Lucida font then Console2 will output the correct result (even though I'm using fixedsys in Console2). This is getting too specific and I don't want to hijack the thread. Everything is working fine now. Thx. :)
Jan 14 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 14 Jan 2011 15:54:19 -0500, Gerrit Wichert <gwichert yahoo.com>  
wrote:

 Am 14.01.2011 15:34, schrieb Steven Schveighoffer:
 Is it common to have multiple modifiers on a single character?  The
 problem I see with using decomposed canonical form for strings is that
 we would have to return a dchar[] for each 'element', which severely
 complicates code that, for instance, only expects to handle English.

 I was hoping to lazily transform a string into its composed canonical
 form, allowing the (hopefully rare) exception when a composed
 character does not exist.  My thinking was that this at least gives a
 useful string representation for 90% of usages, leaving the remaining
 10% of usages to find a more complex representation (like your Text
 type).  If we only get like 20% or 30% there by making dchar the
 element type, then we haven't made it useful enough.

be better for a language not to 'translate' by default. If the user wants to convert the codepoints this can be requested on demand. But premature default conversion is a subtle way to lose information that may be important. Imagine we want to write a tool for dealing with the in/output of some other ignorant legacy software. Even if it is only text files, that software may choke on some converted input. So I believe that it is very important that we are able to reproduce strings in exactly the form in which we have read them in.

Actually, this would only lazily *and temporarily* convert the string per grapheme. Essentially, the original is left alone, so no harm there. -Steve.
Jan 15 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:

 On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir gmail.com> wrote:

 The point is not playing like that with Unicode flexibility. Rather  
 that  composite characters are just normal thingies in most languages  
 of the  world. Actually, on this point, english is a rare exception  
 (discarding  letters imported from foreign languages like french 'à');  
 to the point  of being, I guess, the only western language without  
 any diacritic.


Not to my knowledge. But I rarely deal with non-latin texts, there's probably some scripts out there that take advantage of this.
 The  problem I see with using decomposed canonical form for strings is  
 that we  would have to return a dchar[] for each 'element', which  
 severely  complicates code that, for instance, only expects to handle  
 English.

Actually, returning a sliced char[] or wchar[] could also be valid. User-perceived characters are basically a substring of one or more code points. I'm not sure it complicates that much the semantics of the language -- what's complicated about writing str.front == "a" instead of str.front == 'a'? -- although it probably would complicate the generated code and make it a little slower.

Hm... this pushes the normalization outside the type, and into the algorithms (such as find). I was hoping to avoid that. I think I can come up with an algorithm that normalizes into canonical form as it iterates. It just might return part of a grapheme if the grapheme cannot be composed. I do think that we could make a byGrapheme member to aid in this:

foreach(grapheme; s.byGrapheme) // grapheme is a substring that contains one composed grapheme.
 In the case of NSString in Cocoa, you can only access the 'characters'  
 in their UTF-16 form. But everything from comparison to search for  
 substring is done using graphemes. It's like they implemented  
 specialized Unicode-aware algorithms for these functions. There's no  
 genericness about how it handles graphemes.

 I'm not sure yet about what would be the right approach for D.

I hope we can use generic versions, so the type itself handles the conversions. That makes any algorithm using the string range correct.
 I was hoping to lazily transform a string into its composed canonical   
 form, allowing the (hopefully rare) exception when a composed character  
  does not exist.  My thinking was that this at least gives a useful  
 string  representation for 90% of usages, leaving the remaining 10% of  
 usages to  find a more complex representation (like your Text type).   
 If we only get  like 20% or 30% there by making dchar the element type,  
 then we haven't  made it useful enough.
  Either way, we need a string type that can be compared canonically  
 for  things like searches or opEquals.

I wonder if normalized string comparison shouldn't be built directly in the char[] wchar[] and dchar[] types instead.

No, in my vision of how strings should be typed, char[] is an array, not a string. It should be treated like an array of code-units, where two forms that create the same grapheme are considered different.
 Also bring the idea above that iterating on a string would yield  
 graphemes as char[] and this code would work perfectly irrespective of  
 whether you used combining characters:

 	foreach (grapheme; "exposé") {
 		if (grapheme == "é")
 			break;
 	}

 I think a good standard to evaluate our handling of Unicode is to see  
 how easy it is to do things the right way. In the above, foreach would  
 slice the string grapheme by grapheme, and the == operator would perform  
 a normalized comparison. While it works correctly, it's probably not the  
 most efficient way to do thing however.

I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English. I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above). Does this sound reasonable? -Steve
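The "graphemes as slices, compared after normalization" idea from the quoted "exposé" loop can be sketched with present-day std.uni, assuming graphemeStride and normalize from that module. The helpers graphemeSlices and indexOfGrapheme are invented names, and the eager array is only for brevity; a lazy range would fit the proposal better:

```d
import std.stdio : writeln;
import std.uni : NFC, graphemeStride, normalize;

/// Splits `s` into graphemes, each one a slice of the original data.
/// (Hypothetical helper, not part of Phobos.)
string[] graphemeSlices(string s)
{
    string[] parts;
    for (size_t i = 0; i < s.length; )
    {
        // graphemeStride returns the number of code units in the
        // grapheme cluster that starts at index i.
        immutable n = graphemeStride(s, i);
        parts ~= s[i .. i + n];
        i += n;
    }
    return parts;
}

/// Position of the first grapheme canonically equal to `needle`, or -1.
/// (Hypothetical helper; positions count graphemes, not code units.)
ptrdiff_t indexOfGrapheme(string s, string needle)
{
    immutable target = normalize!NFC(needle);
    foreach (idx, g; graphemeSlices(s))
        if (normalize!NFC(g) == target)
            return cast(ptrdiff_t) idx;
    return -1;
}

void main()
{
    // The haystack spells the last letter as 'e' + combining acute,
    // the needle uses the precomposed code point U+00E9.
    writeln(indexOfGrapheme("expose\u0301", "\u00E9"));
}
```

Because both sides are normalized before comparing, the search succeeds regardless of which spelling either string uses, at the cost of a normalization per element.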
Jan 15 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn  
<lutger.blijdestijn gmail.com> wrote:

 Steven Schveighoffer wrote:

 ...
 I think a good standard to evaluate our handling of Unicode is to see
 how easy it is to do things the right way. In the above, foreach would
 slice the string grapheme by grapheme, and the == operator would  
 perform
 a normalized comparison. While it works correctly, it's probably not  
 the
 most efficient way to do thing however.

I think this is a good alternative, but I'd rather not impose this on people like myself who deal mostly with English. I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above). Does this sound reasonable? -Steve

If it's a matter of choosing which is the 'default' range, I'd think proper unicode handling is more reasonable than catering for english / ascii only. Especially since this is already the case in phobos string algorithms.

English and (if I understand correctly) most other languages. Any language which can be built from composable graphemes would work. And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals).

What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII, what I'm proposing is simply not catering to more complex languages such as Hebrew and Arabic. What I'm trying to find is a middle ground where most languages work, and the code is simple and efficient, with possibilities to jump down to lower levels for performance (i.e. switch to char[] when you know ASCII is all you are using) or jump up to full unicode when necessary.

Essentially, we would have three levels of types:

char[], wchar[], dchar[] -- Considered to be arrays in every way.
string_t!T (string, wstring, dstring) -- Specialized string types that do normalization to dchars, but do not handle perfectly all graphemes. Works with any algorithm that deals with bidirectional ranges. This is the default string type, and the type for string literals. Represented internally by a single char[], wchar[] or dchar[] array.
* utfstring_t!T -- specialized string to deal with full unicode, which may perform worse than string_t, but supports everything unicode supports. May require a battery of specialized algorithms.

* - name up for discussion

Also note that phobos currently does *no* normalization as far as I can tell for things like opEquals. Two char[]'s that represent equivalent strings, but not in the same way, will compare as !=.

-Steve
Jan 15 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 15 Jan 2011 13:21:12 -0500, foobar <foo bar.com> wrote:

 Steven Schveighoffer Wrote:

 English and (if I understand correctly) most other languages.  Any
 language which can be built from composable graphemes would work.  And  
 in
 fact, ones that use some graphemes that cannot be composed will also  
 work
 to some degree (for example, opEquals).

 What I'm proposing (or think I'm proposing) is not exactly catering to
 English and ASCII, what I'm proposing is simply not catering to more
 complex languages such as Hebrew and Arabic.  What I'm trying to find  
 is a
 middle ground where most languages work, and the code is simple and
 efficient, with possibilities to jump down to lower levels for  
 performance
 (i.e. switch to char[] when you know ASCII is all you are using) or jump
 up to full unicode when necessary.

 Essentially, we would have three levels of types:

 char[], wchar[], dchar[] -- Considered to be arrays in every way.
 string_t!T (string, wstring, dstring) -- Specialized string types that  
 do
 normalization to dchars, but do not handle perfectly all graphemes.   
 Works
 with any algorithm that deals with bidirectional ranges.  This is the
 default string type, and the type for string literals.  Represented
 internally by a single char[], wchar[] or dchar[] array.
 * utfstring_t!T -- specialized string to deal with full unicode, which  
 may
 perform worse than string_t, but supports everything unicode supports.
 May require a battery of specialized algorithms.

 * - name up for discussion

 Also note that phobos currently does *no* normalization as far as I can
 tell for things like opEquals.  Two char[]'s that represent equivalent
 strings, but not in the same way, will compare as !=.

 -Steve

The above compromise provides zero benefit. The proposed default type string_t is incorrect and will cause bugs. I prefer the standard lib to not provide normalization at all and force me to use a 3rd party lib rather than provide an incomplete implementation that will give me a false sense of correctness and cause very subtle and hard to find bugs.

I feel like you might be exaggerating, but maybe I'm completely wrong on this, I'm not well-versed in unicode, or even languages that require unicode. The clear benefit I see is that with a string type which normalizes to canonical code points, you can use this in any algorithm without having it be unicode-aware for *most languages*. At least, that is how I see it. I'm looking at it as a code-reuse proposition.

It's like calendars. There are quite a few different calendars in different cultures. But most people use a Gregorian calendar. So we have three options:

a) Use a Gregorian calendar, and leave the other calendars to a 3rd party library
b) Use a complicated calendar system where Gregorian calendars are treated with equal respect to all other calendars, none are the default.
c) Use a Gregorian calendar by default, but include the other calendars as a separate module for those who wish to use them.

I'm looking at my proposal as more of a c) solution.

Can you show how normalization causes subtle bugs?
 Moreover, even if you ignore Hebrew as a tiny insignificant minority  
 you cannot do the same for Arabic which has over one *billion* people  
 that use that language.

I hope that the medium type works 'good enough' for those languages, with the high level type needed for advanced usages. At a minimum, comparison and substring should work for all languages.
 I firmly believe that in accordance with D's principle that the default  
 behavior should be the correct & safe option, D should have the full  
 unicode type (utfstring_t above) as the default.

 You need only a subset of the functionality because you only use  
 English? For the same reason, you don't want the Unicode overhead? Use  
 an ASCII type instead. In the same vein, a geneticist should use a DNA  
 sequence type and not Unicode text.

Or French, or Spanish, or German, etc... Look, even the lowest level is valid unicode, but if you want to start extracting individual graphemes, you need more machinery. In 99% of cases, I'd think you want to use strings as strings, not as sequences of graphemes, or code-units. -Steve
Jan 15 2011
"Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 15 Jan 2011 13:32:10 -0500, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2011-01-15 11:59:04 -0500, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:

 On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin   
 <michel.fortin michelf.com> wrote:

 Actually, returning a sliced char[] or wchar[] could also be valid.   
 User-perceived characters are basically a substring of one or more  
 code  points. I'm not sure it complicates that much the semantics of  
 the  language -- what's complicated about writing str.front == "a"  
 instead of  str.front == 'a'? -- although it probably would complicate  
 the generated  code and make it a little slower.

algorithms (such as find). I was hoping to avoid that.

Not really. It pushes the normalization to the string comparison operator, as explained later.
 I think I can  come up with an algorithm that normalizes into canonical  
 form as it  iterates.  It just might return part of a grapheme if the  
 grapheme cannot  be composed.

The problem with normalization while iterating is that you lose information about which actual code points are part of the grapheme. If you wanted to count the number of graphemes containing a particular code point, you've lost that information.

Are these common requirements? I thought users mostly care about graphemes, not code points. Asking in the dark here, since I have next to zero experience with unicode strings.
 Moreover, if all you want is to count the number of graphemes,  
 normalizing the character is a waste of time.

This is true. I can see this being a common need.
 I suggested in another post that we implement ranges for decomposing and  
 recomposing on-the-fly a string in its normalized form. That's basically  
 the same thing as you suggest, but it'd have to be explicit to avoid the  
 problem above.

OK, I see your point.
 I wonder if normalized string comparison shouldn't be built directly  
 in  the char[] wchar[] and dchar[] types instead.

not a string. It should be treated like an array of code-units, where two forms that create the same grapheme are considered different.

Well, I agree there's a need for that sometime. But if what you want is just a dumb array of code units, why not use ubyte[], ushort[] and uint[] instead?

Because ubyte[], ushort[] and uint[] do not say that their data is unicode text. The point is, if I want to write a function that takes UTF-8, ubyte[] opens it up to any data, not just UTF-8 data. But if we have a method of iterating code-units as you specify below, then I think we are OK.
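
To illustrate the point about types carrying intent, here is a small sketch. The function name countAsciiUnits is hypothetical; the only claim is that D keeps char[] and ubyte[] distinct, so a function declared for UTF-8 text does not silently accept arbitrary bytes.

```d
// Hypothetical function that only makes sense for UTF-8 text.
size_t countAsciiUnits(const(char)[] text)
{
    size_t n = 0;
    foreach (char c; text)
        if (c < 0x80) ++n;   // code units below 0x80 are ASCII
    return n;
}

unittest
{
    ubyte[] raw = [0x68, 0x69];

    // Raw bytes are rejected at compile time; string data is accepted.
    static assert(!__traits(compiles, countAsciiUnits(raw)));
    static assert( __traits(compiles, countAsciiUnits("hi")));

    // An explicit cast is the caller's promise that the bytes are UTF-8.
    assert(countAsciiUnits(cast(const(char)[]) raw) == 2);
}
```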
 It seems to me that the whole point of having a different type for  
 char[], wchar[], and dchar[] is that you know they are Unicode strings  
 and can treat them as such. And if you treat them as Unicode strings,  
 then perhaps the runtime and the compiler should too, for consistency's  
 sake.

I'd agree with you, but then there's that pesky [] after it indicating it's an array. For consistency's sake, I'd say the compiler should treat T[] as an array of T's.
 Also bring the idea above that iterating on a string would yield   
 graphemes as char[] and this code would work perfectly irrespective  
 of  whether you used combining characters:
  	foreach (grapheme; "exposé") {
 		if (grapheme == "é")
 			break;
 	}
  I think a good standard to evaluate our handling of Unicode is to  
 see  how easy it is to do things the right way. In the above, foreach  
 would  slice the string grapheme by grapheme, and the == operator  
 would perform  a normalized comparison. While it works correctly, it's  
 probably not the  most efficient way to do things, however.

people like myself who deal mostly with English.

I'm not suggesting we impose it, just that we make it the default. If you want to iterate by dchar, wchar, or char, just write:

	foreach (dchar c; "exposé") {}
	foreach (wchar c; "exposé") {}
	foreach (char c; "exposé") {}
	// or
	foreach (dchar c; "exposé".by!dchar()) {}
	foreach (wchar c; "exposé".by!wchar()) {}
	foreach (char c; "exposé".by!char()) {}

and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character.
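
For comparison, a sketch of the explicit iteration forms as they can be written against current Phobos. The by!T() syntax in the post is a proposal; the sketch assumes std.utf's byChar, byWchar and byDchar as the closest existing counterparts, which were added after this thread.

```d
import std.range : walkLength;
import std.utf : byChar, byWchar, byDchar;

unittest
{
    string s = "expos\u00E9"; // "exposé" with a precomposed é (U+00E9)

    size_t units, points;
    foreach (char c; s)  ++units;   // iterate UTF-8 code units
    foreach (dchar c; s) ++points;  // iterate decoded code points
    assert(units == 7);             // é occupies two UTF-8 code units
    assert(points == 6);

    // Range-based equivalents.
    assert(s.byChar.walkLength  == 7);
    assert(s.byWchar.walkLength == 6);
    assert(s.byDchar.walkLength == 6);
}
```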

I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.

What if I modified my proposed string_t type to return T[] as its element type, as you say, and string literals are typed as string_t!(whatever)? In addition, the restrictions I imposed on slicing a code point actually get imposed on slicing a grapheme. That is, it is illegal to substring a string_t in a way that slices through a grapheme (and by deduction, a code point)?

Actually, we would need a grapheme to be its own type, because comparing two char[]'s that don't contain equivalent bits and having them be equal, violates the expectation that char[] is an array.

So the string_t!char would return a grapheme_t!char (names to be discussed) as its element type.
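
A rough sketch of why a separate grapheme type helps. GraphemeSlice is a hypothetical stand-in for the grapheme_t!char mentioned above, and the use of std.uni.normalize (which postdates this thread) is an assumption about how the comparison could be done.

```d
import std.uni : normalize, NFC;

// Hypothetical stand-in for grapheme_t!char.
struct GraphemeSlice
{
    string units; // the code units making up one grapheme

    // Equality here means canonical equivalence, not bitwise equality.
    bool opEquals(const GraphemeSlice rhs) const
    {
        string l = units;
        string r = rhs.units;
        return normalize!NFC(l) == normalize!NFC(r);
    }
}

unittest
{
    auto a = GraphemeSlice("\u00E9");   // precomposed é
    auto b = GraphemeSlice("e\u0301");  // e + combining acute
    assert(a.units != b.units); // the underlying arrays still differ
    assert(a == b);             // the grapheme type treats them as equal
}
```

The char[] arrays keep ordinary array semantics, while the wrapper type is the one place where "equal" means "same grapheme".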
 I think this should be  possible to do with wrapper types or  
 intermediate ranges which have  graphemes as elements (per my  
 suggestion above).

I think it should be the reverse. If you want your code to break when it encounters multi-code-point graphemes then it's your choice, but you should have to make your choice explicit. The default should be to handle strings correctly.

You are probably right. -Steve
Jan 15 2011
"Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 15 Jan 2011 15:31:23 -0500, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2011-01-15 12:39:32 -0500, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:

 On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn   
 <lutger.blijdestijn gmail.com> wrote:

 Steven Schveighoffer wrote:
  ...
 I think a good standard to evaluate our handling of Unicode is to see
 how easy it is to do things the right way. In the above, foreach  
 would
 slice the string grapheme by grapheme, and the == operator would   
 perform
 a normalized comparison. While it works correctly, it's probably  
 not  the
 most efficient way to do things, however.

people like myself who deal mostly with English. I think this should be possible to do with wrapper types or intermediate ranges which have graphemes as elements (per my suggestion above). Does this sound reasonable? -Steve

proper unicode handling is more reasonable than catering for english / ascii only. Especially since this is already the case in phobos string algorithms.

language which can be built from composable graphemes would work. And in fact, ones that use some graphemes that cannot be composed will also work to some degree (for example, opEquals). What I'm proposing (or think I'm proposing) is not exactly catering to English and ASCII, what I'm proposing is simply not catering to more complex languages such as Hebrew and Arabic. What I'm trying to find is a middle ground where most languages work, and the code is simple and efficient, with possibilities to jump down to lower levels for performance (i.e. switch to char[] when you know ASCII is all you are using) or jump up to full unicode when necessary.

Why don't we build a compiler with an optimizer that generates correct code *almost* all of the time? If you are worried about it not producing correct code for a given function, you can just add "pragma(correct_code)" in front of that function to disable the risky optimizations. No harm done, right?

One thing I see very often, often on US web sites but also elsewhere, is that if you enter a name with an accented letter in a form (say Émilie), very often the accented letter gets changed to another semi-random character later in the process. Why? Because somewhere in the process lies an encoding mismatch that no one thought about and no one tested for. At the very least, the form should have rejected those unexpected characters and shown an error when it could.

Now, with proper Unicode handling up to the code point level, this kind of problem probably won't happen as often because the whole stack works with UTF encodings. But are you going to validate all of your inputs to make sure they have no combining code point?

Don't assume that because you're in the United States no one will try to enter characters where you don't expect them. People love to play with Unicode symbols for fun, putting them in their name, signature, or even domain names (✪df.ws). Just wait until they discover they can combine them. ☺̰̎! There is also a variety of combining mathematical symbols with no pre-combined form, such as ≸. Writing in Arabic, Hebrew, Korean, or some other foreign language isn't a prerequisite to use combining characters.
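
As an illustration of the input validation question raised above, a minimal sketch of such a check. The helper name hasCombiningMark is hypothetical, and it assumes std.uni.isMark from current Phobos.

```d
import std.algorithm.searching : any;
import std.uni : isMark;

// Hypothetical input check: true if the text contains any combining mark.
bool hasCombiningMark(string text)
{
    // Iterating a string with range algorithms yields decoded code points.
    return text.any!(c => isMark(c));
}

unittest
{
    assert(!hasCombiningMark("\u00C9milie"));  // precomposed É
    assert( hasCombiningMark("E\u0301milie")); // E + combining acute
}
```

A check like this only detects the marks; it says nothing about whether rejecting them is the right policy, which is the point being argued.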
 Essentially, we would have three levels of types:
  char[], wchar[], dchar[] -- Considered to be arrays in every way.
 string_t!T (string, wstring, dstring) -- Specialized string types that  
 do  normalization to dchars, but do not handle perfectly all graphemes.  
  Works  with any algorithm that deals with bidirectional ranges.  This  
 is the  default string type, and the type for string literals.   
 Represented  internally by a single char[], wchar[] or dchar[] array.
 * utfstring_t!T -- specialized string to deal with full unicode, which  
 may  perform worse than string_t, but supports everything unicode  
 supports.   May require a battery of specialized algorithms.
  * - name up for discussion
  Also note that phobos currently does *no* normalization as far as I  
 can  tell for things like opEquals.  Two char[]'s that represent  
 equivalent  strings, but not in the same way, will compare as !=.

Basically, you're suggesting that the default way should be to handle Unicode *almost* right. And then, if you want to handle thing *really* right you need to be explicit about it by using "utfstring_t"? I understand your motivation, but it sounds backward to me.

You make very good points. I concede that using dchar as the element type is not correct for unicode strings. -Steve
Jan 15 2011
"Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"  
 <schveiguy yahoo.com> said:

 I'm not suggesting we impose it, just that we make it the default. If   
 you want to iterate by dchar, wchar, or char, just write:
  	foreach (dchar c; "exposé") {}
 	foreach (wchar c; "exposé") {}
 	foreach (char c; "exposé") {}
 	// or
 	foreach (dchar c; "exposé".by!dchar()) {}
 	foreach (wchar c; "exposé".by!wchar()) {}
 	foreach (char c; "exposé".by!char()) {}
  and it'll work. But the default would be a slice containing the   
 grapheme, because this is the right way to represent a Unicode  
 character.

I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.

I'm glad we agree on that now.

It's a matter of me slowly wrapping my brain around unicode and how it's used. It seems like it's a typical committee-defined standard where there are 10 ways to do everything; I was trying to weed out the lesser-used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).

I once told a colleague who was on a standards committee that their proposed KLV standard (key length value) was ridiculous. The wise committee had decided that in order to avoid future issues, the length would be encoded as a single byte if < 128, or otherwise as a first byte holding 128 + the length of the length field, followed by the length field itself. This means you could potentially have to parse and process a 127-byte integer!
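
A sketch of a decoder for the length scheme as described in the anecdote. The names DecodedLength and decodeLength are hypothetical, the big-endian byte order of the length field is an assumption (the post does not specify it), and the sketch deliberately refuses length fields longer than 8 bytes instead of handling the full 127-byte case.

```d
struct DecodedLength
{
    ulong value;      // the decoded length
    size_t consumed;  // number of bytes the length field occupied
}

DecodedLength decodeLength(const(ubyte)[] data)
{
    assert(data.length > 0, "empty input");
    immutable first = data[0];
    if (first < 128)
        return DecodedLength(first, 1);   // short form: the byte is the length

    immutable size_t n = first - 128;     // long form: n length bytes follow
    assert(n >= 1 && n <= 8, "sketch only handles lengths that fit in a ulong");
    assert(data.length > n, "truncated length field");

    ulong value = 0;
    foreach (b; data[1 .. 1 + n])
        value = (value << 8) | b;         // assumed big-endian
    return DecodedLength(value, 1 + n);
}

unittest
{
    ubyte[] shortForm = [0x05];
    ubyte[] longForm  = [0x82, 0x01, 0x00, 0xFF];
    assert(decodeLength(shortForm) == DecodedLength(5, 1));
    assert(decodeLength(longForm)  == DecodedLength(256, 3));
}
```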
 What if I modified my proposed string_t type to return T[] as its  
 element  type, as you say, and string literals are typed as  
 string_t!(whatever)?   In addition, the restrictions I imposed on  
 slicing a code point actually  get imposed on slicing a grapheme.  That  
 is, it is illegal to substring a  string_t in a way that slices through  
 a grapheme (and by deduction, a code  point)?

I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string however. Some other raw comments: I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?).

I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility. However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.
 If strings and arrays of code units are distinct, slicing in the middle  
 of a grapheme or in the middle of a code point could throw an error, but  
 for performance reasons it should probably check for that only when  
 array bounds checking is turned on (that would require compiler support  
 however).

Not really, it could use assert, but that throws an assert error instead of a RangeError. Of course, both are errors and will abort the program. I do wish there was a version(noboundscheck) to do this kind of stuff with...
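
A minimal sketch of the assert-based check, restricted to code point boundaries in UTF-8 (grapheme boundaries would need more machinery). The names isCodePointBoundary and checkedSlice are hypothetical.

```d
// True if index i does not fall inside a multi-byte UTF-8 sequence.
bool isCodePointBoundary(const(char)[] s, size_t i)
{
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return i == s.length || (i < s.length && (s[i] & 0xC0) != 0x80);
}

// Hypothetical checked slice: asserts instead of raising a RangeError.
string checkedSlice(string s, size_t from, size_t to)
{
    assert(isCodePointBoundary(s, from) && isCodePointBoundary(s, to),
           "slice would split a code point");
    return s[from .. to];
}

unittest
{
    string s = "expos\u00E9";            // é occupies indices 5 and 6
    assert(isCodePointBoundary(s, 5));
    assert(!isCodePointBoundary(s, 6));  // middle of é
    assert(checkedSlice(s, 0, 5) == "expos");
    assert(checkedSlice(s, 5, 7) == "\u00E9");
}
```

Since asserts are normally compiled out of release builds, the check costs nothing there, which is roughly the behaviour being asked for.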
 Actually, we would need a grapheme to be its own type, because  
 comparing  two char[]'s that don't contain equivalent bits and having  
 them be equal,  violates the expectation that char[] is an array.
  So the string_t!char would return a grapheme_t!char (names to be   
 discussed) as its element type.

Or you could make a grapheme a string_t. ;-)

I'm a little uneasy having a range return itself as its element type. For all intents and purposes, a grapheme is a string of one 'element', so it could potentially be a string_t.

It does seem daunting to have so many types, but at the same time, types convey relationships at compile time that can make coding impossible to get wrong, or make things actually possible when having a single type doesn't.

I'll give you an example from a previous life: Tango had a type called DateTime. This type represented *either* a point in time, or a span of time (depending on how you used it). But I proposed we switch to two distinct types, one for a point in time, one for a span of time. It was argued that both were so similar, why couldn't we just keep one type? The answer is simple -- having them be separate types allows me to express relationships that the compiler enforces. For example, you can add two time spans together, but you can't add two points in time together. Or maybe you want a function to accept a time span (like a sleep operation). If there was only one type, then sleep(DateTime.now()) compiles and sleeps for what, 2011 years? ;)

I feel that making extra types when the relationship between them is important is worth the possible repetition of functionality. Catching bugs during compilation is soooo much better than experiencing them during runtime.

-Steve
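
The point/span distinction can be sketched as follows. TimePoint, TimeSpan and sleepFor are hypothetical names for illustration and are not Tango's actual API.

```d
struct TimeSpan
{
    long ticks;

    // span + span = span
    TimeSpan opBinary(string op)(TimeSpan rhs) const if (op == "+")
    {
        return TimeSpan(ticks + rhs.ticks);
    }
}

struct TimePoint
{
    long ticks;

    // point + span = point
    TimePoint opBinary(string op)(TimeSpan rhs) const if (op == "+")
    {
        return TimePoint(ticks + rhs.ticks);
    }

    // point - point = span
    TimeSpan opBinary(string op)(TimePoint rhs) const if (op == "-")
    {
        return TimeSpan(ticks - rhs.ticks);
    }
}

// Placeholder for a sleep operation; it only accepts a span.
void sleepFor(TimeSpan duration) {}

// The meaningless operations are rejected by the compiler.
static assert(!__traits(compiles, TimePoint(1) + TimePoint(2)));
static assert(!__traits(compiles, sleepFor(TimePoint(2011))));
static assert( __traits(compiles, sleepFor(TimeSpan(5))));
```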
Jan 15 2011
Tomek Sowiński <just ask.me> writes:
Andrei Alexandrescu wrote:

 I've been thinking on how to better deal with Unicode strings. Currently
 strings are formally bidirectional ranges with a surreptitious random
 access interface. The random access interface accesses the support of
 the string, which is understood to hold data in a variable-encoded
 format. For as long as the programmer understands this relationship,
 code for string manipulation can be written with relative ease. However,
 there is still room for writing wrong code that looks legit.

 Sometimes the best way to tackle a hairy reality is to invite it to the
 negotiation table and offer it promotion to first-class abstraction
 status. Along that vein I was thinking of defining a new range:
 VLERange, i.e. Variable Length Encoding Range. Such a range would have
 the power somewhere in between bidirectional and random access.

 The primitives offered would include empty, access to front and back,
 popFront and popBack (just like BidirectionalRange), and in addition
 properties typical of random access ranges: indexing, slicing, and
 length.

For some compressions implementing *back is troublesome if not impossible...
 Note that the result of the indexing operator is not the same as
 the element type of the range, as it only represents the unit of encoding.

It's worth mentioning explicitly -- a VLERange is dually typed. It's important for searching. Statically check if the original and encoded types match and, if so, perform a fast search directly on the encoded elements. I think an important feature of a VLERange should be dropping itself down to an encoded-typed range, so that front and back return raw data.

Dual typing will also affect foreach -- in the general case you'd want to choose whether to decode or not by typing the element.

I can't stop thinking that VLERange is a two-piece bikini making a bare random-access range safe to look at, and that you can take off when partners have confidence, not a limited random-access probing facility to span the void between front and back.
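
A sketch of the "drop down to the encoded type and search there" idea for UTF-8. It assumes std.string.representation and std.algorithm's find from current Phobos; indexOfUnits is a hypothetical helper name. Searching code units directly is sound for UTF-8 because the encoding is self-synchronizing: a match for a valid needle always starts on a code point boundary.

```d
import std.algorithm.searching : find;
import std.string : representation;

// Hypothetical helper: code-unit offset of needle in haystack, or -1.
ptrdiff_t indexOfUnits(string haystack, string needle)
{
    // Both sides are viewed as immutable(ubyte)[]; no decoding takes place.
    auto rest = find(haystack.representation, needle.representation);
    if (rest.length == 0 && needle.length != 0)
        return -1;
    return cast(ptrdiff_t)(haystack.length - rest.length);
}

unittest
{
    // "naïve " occupies 7 code units because ï takes two.
    assert(indexOfUnits("na\u00EFve caf\u00E9", "caf\u00E9") == 7);
    assert(indexOfUnits("abc", "x") == -1);
}
```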
 In addition to these (and connecting the two), a VLERange would offer
 two additional primitives:

 1. size_t stepSize(size_t offset) gives the length of the step needed to
 skip to the next element.

 2. size_t backstepSize(size_t offset) gives the size of the _backward_
 step that goes to the previous element.

 In both cases, offset is assumed to be at the beginning of a logical
 element of the range.

So when I move the spinner in an iPod, I get catapulted in position with the raw data opIndex and from there I try to work my way to the next frame to start playback. Sounds promising.
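
For UTF-8 strings, the two proposed primitives could be sketched on top of std.utf's stride and strideBack. They are written here as free functions taking the string explicitly; in the proposal they would be members of the range, so the signatures are an adaptation.

```d
import std.utf : stride, strideBack;

// Code units to skip from offset to reach the next element.
size_t stepSize(string s, size_t offset)
{
    return stride(s, offset);
}

// Code units to step back from offset to reach the previous element.
size_t backstepSize(string s, size_t offset)
{
    return strideBack(s, offset);
}

unittest
{
    string s = "a\u00E9\u20AC"; // 'a' (1 unit), 'é' (2 units), '€' (3 units)
    assert(stepSize(s, 0) == 1);
    assert(stepSize(s, 1) == 2);
    assert(stepSize(s, 3) == 3);
    assert(backstepSize(s, 6) == 3);
    assert(backstepSize(s, 3) == 2);

    // Walking forward by stepSize visits each logical element once.
    size_t count = 0;
    for (size_t i = 0; i < s.length; i += stepSize(s, i))
        ++count;
    assert(count == 3);
}
```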
 I suspect that a lot of functions in std.string can be written without
 Unicode-specific knowledge just by relying on such an interface.
 Moreover, algorithms can be generalized to other structures that use
 variable-length encoding, such as those used in data compression. (In
 that case, the support would be a bit array and the encoded type would
 be ubyte.)

I agree, acknowledging encoding/compression as a general direction will bring substantial benefits.
 Writing to such ranges is not addressed by this design. Ideas are welcome.

Yeah, we can address outputting later, that's fair.
 Adding VLERange would legitimize strings and would clarify their
 handling, at the cost of adding one additional concept that needs to be
 minded. Is the trade-off worthwhile?

Well, the only way to find out is to try it. My advice: VLERanges originated as a solution to the string problem, so start with a non-string incarnation. Having at least two (one, we know, is string) plugs that fit the same socket will spur confidence in the abstraction.

-- 
Tomek
Jan 11 2011
Steven Wawryk <stevenw acres.com.au> writes:
Sorry if I'm jumping in here without the appropriate background, but I 
don't understand why jumping through these hoops is necessary.  Please 
let me know if I'm missing anything.

Many problems can be solved by another layer of indirection.  Isn't a 
string essentially a bidirectional range of code points built on top of 
a random access range of code units?  It seems to me that each 
abstraction separately already fits within the existing D range 
framework and all the difficulties arise as a consequence of trying to 
lump them into a single abstraction.

Why not choose which of these abstractions is most appropriate in a 
given situation instead of trying to shoe-horn both concepts into a 
single abstraction, and provide for easy conversion between them?  When 
character representation is the primary requirement then make it a 
bidirectional range of code points.  When storage representation and 
random access is required then make it a random access range of code units.
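
The layering suggested above can be sketched as a thin wrapper: the string itself stays a random access array of code units, and a separate view presents it as a bidirectional range of code points. CodePointRange is a hypothetical name; the sketch assumes std.utf's decode, stride and strideBack.

```d
import std.range.primitives : isBidirectionalRange;
import std.utf : decode, stride, strideBack;

// Hypothetical code point view over a UTF-8 code unit array.
struct CodePointRange
{
    string units; // underlying random access storage

    @property bool empty() { return units.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(units, i);
    }

    void popFront() { units = units[stride(units, 0) .. $]; }

    @property dchar back()
    {
        size_t i = units.length - strideBack(units, units.length);
        return decode(units, i);
    }

    void popBack() { units = units[0 .. $ - strideBack(units, units.length)]; }

    @property CodePointRange save() { return this; }
}

static assert(isBidirectionalRange!CodePointRange);

unittest
{
    auto r = CodePointRange("a\u00E9\u20AC");
    assert(r.front == 'a' && r.back == '\u20AC');
    r.popBack();
    assert(r.back == '\u00E9');
    assert(r.units.length == 3); // the storage is still directly indexable
}
```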
Jan 11 2011
Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw acres.com.au> said:

 Sorry if I'm jumping in here without the appropriate background, but I 
 don't understand why jumping through these hoops is necessary.  Please 
 let me know if I'm missing anything.
 
 Many problems can be solved by another layer of indirection.  Isn't a 
 string essentially a bidirectional range of code points built on top of 
 a random access range of code units?

Actually, displaying a UTF-8/UTF-16 string involves a range of glyphs layered over a range of graphemes layered over a range of code points layered over a range of code units.

Glyphs represent the visual characters you can get from a font, they often map one-to-one with graphemes but not always (ligatures for instance). Graphemes are what people generally reason about when they see text (the so called "user-perceived characters"), they often map one-to-one with code points but not always (combining marks for instance). Code points are a list of standardized codes representing various elements of a string, and code units basically encode the code points.

If you're writing an XML, JSON or whatever else parser you'll probably care about code points. If you're advancing the insertion point in a text field or counting the number of user-perceived characters you'll probably want to deal with graphemes. For searching a substring inside a string, or comparing strings you'll probably want to deal with either graphemes or collation elements (collation elements are layered on top of code points). To print a string you'll need to map graphemes to the glyphs from a particular font.

Reducing string operations to code points manipulations will only work as long as all your graphemes, collation elements, or glyphs map one-to-one with code points.
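
The lower three layers can be observed on a single string. This sketch assumes std.uni.byGrapheme from current Phobos (added after this thread); the glyph layer depends on a font and is not shown.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

unittest
{
    // "exposé" written with a combining acute accent on the final e.
    string s = "expose\u0301";

    assert(s.length == 8);                // UTF-8 code units (U+0301 takes two)
    assert(s.walkLength == 7);            // code points
    assert(s.byGrapheme.walkLength == 6); // user-perceived characters
}
```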
 It seems to me that each abstraction separately already fits within the 
 existing D range framework and all the difficulties arise as a 
 consequence of trying to lump them into a single abstraction.

It's true that each of these abstractions can fit within the existing range framework.
 Why not choose which of these abstractions is most appropriate in a 
 given situation instead of trying to shoe-horn both concepts into a 
 single abstraction, and provide for easy conversion between them?  When 
 character representation is the primary requirement then make it a 
 bidirectional range of code points.  When storage representation and 
 random access is required then make it a random access range of code 
 units.

I think you're right. The need for a new concept isn't that great, and it gets complicated really fast.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jan 11 2011
Don <nospam nospam.com> writes:
Michel Fortin wrote:
 On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw acres.com.au> said:
 Why not choose which of these abstractions is most appropriate in a 
 given situation instead of trying to shoe-horn both concepts into a 
 single abstraction, and provide for easy conversion between them?  
 When character representation is the primary requirement then make it 
 a bidirectional range of code points.  When storage representation and 
 random access is required then make it a random access range of code 
 units.

I think you're right. The need for a new concept isn't that great, and it gets complicated really fast.

I think the only problem that we really have, is that "char[]", "dchar[]" implies that code points is always the appropriate level of abstraction.
Jan 12 2011
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/12/11 11:28 AM, Don wrote:
 Michel Fortin wrote:
 On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw acres.com.au> said:
 Why not choose which of these abstractions is most appropriate in a
 given situation instead of trying to shoe-horn both concepts into a
 single abstraction, and provide for easy conversion between them?
 When character representation is the primary requirement then make it
 a bidirectional range of code points. When storage representation and
 random access is required then make it a random access range of code
 units.

I think you're right. The need for a new concept isn't that great, and it gets complicated really fast.

I think the only problem that we really have, is that "char[]", "dchar[]" implies that code points is always the appropriate level of abstraction.

I hope to assuage part of that issue with representation(). Again, it's not documented yet (mainly because of the famous ddoc bug that prevents auto functions from carrying documentation). Here it is:

/**
 * Returns the representation type of a string, which is the same type
 * as the string except the character type is replaced by $(D ubyte),
 * $(D ushort), or $(D uint) depending on the character width.
 *
 * Example:
----
string s = "hello";
static assert(is(typeof(representation(s)) == immutable(ubyte)[]));
----
 */
/*private*/ auto representation(Char)(Char[] s) if (isSomeChar!Char)
{
    // Get representation type
    static if (Char.sizeof == 1) enum t = "ubyte";
    else static if (Char.sizeof == 2) enum t = "ushort";
    else static if (Char.sizeof == 4) enum t = "uint";
    else static assert(false); // can't happen due to isSomeChar!Char

    // Get representation qualifier
    static if (is(Char == immutable)) enum q = "immutable";
    else static if (is(Char == const)) enum q = "const";
    else static if (is(Char == shared)) enum q = "shared";
    else enum q = "";

    // Result type is qualifier(RepType)[]
    static if (q.length)
        return mixin("cast(" ~ q ~ "(" ~ t ~ ")[]) s");
    else
        return mixin("cast(" ~ t ~ "[]) s");
}

Andrei
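
A short usage sketch, assuming the function as it is exposed today as std.string.representation in Phobos (an assumption about current Phobos, not something stated in this post).

```d
import std.string : representation;

unittest
{
    string  s = "hello";
    wstring w = "hello"w;
    char[]  m = "hi".dup;

    // The qualifier is preserved; only the character type is replaced.
    static assert(is(typeof(s.representation) == immutable(ubyte)[]));
    static assert(is(typeof(w.representation) == immutable(ushort)[]));
    static assert(is(typeof(m.representation) == ubyte[]));

    // The result is a view of the same code units, with no decoding.
    assert(s.representation.length == 5);
    assert(s.representation[0] == 0x68); // 'h'
}
```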
Jan 12 2011
Ali Çehreli <acehreli yahoo.com> writes:
spir wrote:
 On 01/12/2011 08:28 PM, Don wrote:
 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

I'd like to know when it happens that codepoint is the appropriate level of abstraction.

When on a document that describes code points... :)
 * If pieces of text are not manipulated, meaning just used in the
 application, or just transferred via the application as is (from file /
 input / literal to any kind of output), then any kind of encoding just
 works. One can even concatenate, provided all pieces use the same
 encoding. --> _lower_ level than codepoint is OK.
 * But any kind of manipulation (indexing, slicing, compare,

Compare according to which alphabet's ordering? Surely not Unicode's... (I may be alone in this, but ordering is tied to an alphabet (or writing system), not locale.)

I try to solve that issue with my trileri library:

http://code.google.com/p/trileri/source/browse/#svn%2Ftrunk%2Ftr

Warning: the code is in Turkish and is not aware of the concept of collation at all; it has its own simplistic view of text, where every character is an entity that can be lower/upper cased to a single character.
 search, count,
 replace, not to speak about regex/parsing) requires operating at the
 _higher_ level of characters (in the common sense).

I don't know this about Unicode: should e and ´ (acute accent) be always collated? If so, wouldn't it be impossible to put those two in that order say, in a text book? (Perhaps Unicode defines a way to stop collation.)
 Just like with
 historic character sets in which codes used to represent characters (not
 lower-level thingies as in UCS). Else, one reads, compares, changes
 meaningless bits of text.

 As I see it now, we need 2 types:

I think we need more than 2 types...
 * One plain string similar to good old ones (bytestring would do the
 job, since most unicode is utf8 encoded) for the first kind of use
 above. With optional validity check when it's supposed to be unicode 

Agreed. D gives us three UTF encodings, but I am not sure that there is only one abstraction above that.
 * One hiher-level type abstracting from codepoint (not code unit)
 issues, restoring the necessary properties: (1) each character is one
 element in the sequence (2) each character is always represented the
 same way.

I think VLERange should solve only the variable-length-encoding issue. It should not get into higher abstractions. Ali
Jan 12 2011
prev sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-12 14:57:58 -0500, spir <denis.spir gmail.com> said:

 On 01/12/2011 08:28 PM, Don wrote:
 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

I'd like to know when it happens that codepoint is the appropriate level of abstraction.

I agree with you. I don't see many uses for code points. One of these uses is writing a parser for a format defined in terms of code points (XML for instance). But beyond that, I don't see one.
 * If pieces of text are not manipulated, meaning just used in the 
 application, or just transferred via the application as is (from file / 
 input / literal to any kind of output), then any kind of encoding just 
 works. One can even concatenate, provided all pieces use the same 
 encoding. --> _lower_ level than codepoint is OK.
 * But any of manipulation (indexing, slicing, compare, search, count, 
 replace, not to speak about regex/parsing) requires operating at the 
 _higher_ level of characters (in the common sense). Just like with 
 historic character sets in which codes used to represent characters 
 (not lower-level thingies as in UCS). Else, one reads, compares, 
 changes meaningless bits of text.

Very true. In the same way that code points can span multiple code units, user-perceived characters (graphemes) can span multiple code points.

A funny exercise to make a fool of an algorithm working only with code points would be to replace the word "fortune" in a text containing the word "fortuné". If the last "é" is expressed as two code points, as "e" followed by a combining acute accent (this: é), replacing occurrences of "fortune" by "expose" would also replace "fortuné" with "exposé" because the combining acute accent remains as the code point following the word. Quite amusing, but it doesn't really make sense that it works like that.

In the case of "é", we're lucky enough to also have a pre-combined character to encode it as a single code point, so encountering "é" written as two code points is quite rare. But not all combinations of marks and characters can be represented as a single code point. The correct thing to do is to treat "é" (single code point) and "é" ("e" + combining acute accent) as equivalent.

-- Michel Fortin michel.fortin michelf.com http://michelf.com/
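The "fortune" exercise can be reproduced in a few lines. This is only a sketch: it relies on std.uni.normalize, which was added to Phobos after this thread, and the helper names naiveReplace and canonicallyEqual are made up for illustration.

```d
import std.array : replace;
import std.uni : normalize, NFC;

// Replacement that only sees code points (hypothetical helper name).
string naiveReplace(string text, string from, string to)
{
    return text.replace(from, to);
}

// True when the two strings are canonically equivalent, checked by
// comparing their NFC normal forms (hypothetical helper name).
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    // "fortuné" spelled as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string text = "fortune\u0301";

    // The bare "fortune" prefix matches, and the orphaned accent then
    // attaches to the last letter of the replacement.
    assert(naiveReplace(text, "fortune", "expose") == "expose\u0301");

    // The two spellings of é differ as code point sequences...
    assert("\u00E9" != "e\u0301");
    // ...but are equal once both are normalised.
    assert(canonicallyEqual("\u00E9", "e\u0301"));
}
```

Normalising fixes the comparison, but not the replace: avoiding the false match also requires checking that the match ends on a grapheme boundary.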
Jan 12 2011
next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-12 19:45:36 -0500, Michel Fortin <michel.fortin michelf.com> said:

 A funny exercise to make a fool of an algorithm working only with code 
 points would be to replace the word "fortune" in a text containing the 
 word "fortuné". If the last "é" is expressed as two code points, as "e" 
 followed by a combining acute accent (this: é), replacing occurrences 
 of "fortune" by "expose" would also replace "fortuné" with "exposé" 
 because the combining acute accent remains as the code point following 
 the word. Quite amusing, but it doesn't really make sense that it works 
 like that.
 
 In the case of "é", we're lucky enough to also have a pre-combined 
 character to encode it as a single code point, so encountering "é" 
 written as two code points is quite rare. But not all combinations of 
 marks and characters can be represented as a single code point. The 
 correct thing to do is to treat "é" (single code point) and "é" ("e" + 
 combining acute accent) as equivalent.

Crap, I meant to send this as UTF-8 with combining characters in it, but my news client converted everything to ISO-8859-1. I'm not sure it'll work, but here's my second attempt at posting real combining marks:

Single code point: é
e with combining mark: é
t with combining mark: t̂
t with two combining marks: t̂̃

-- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 12 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday 13 January 2011 01:49:31 spir wrote:
 On 01/13/2011 01:45 AM, Michel Fortin wrote:
 On 2011-01-12 14:57:58 -0500, spir <denis.spir gmail.com> said:
 On 01/12/2011 08:28 PM, Don wrote:
 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

I'd like to know when it happens that codepoint is the appropriate level of abstraction.

I agree with you. I don't see many use for code points. One of these uses is writing a parser for a format defined in term of code points (XML for instance). But beyond that, I don't see one.

Actually, I had once a real use case for codepoint beeing the proper level of abstraction: a linguistic app of which one operational func counts occurrences of "scripting marks" like 'a' & '¨' in "ä". hope you see what I mean. Once the text is properly NFD decomposed, each of those marks in coded as a codepoint. (But if it's not decomposed, then most of those marks are probably hidden by precomposed codes coding characters like "ä".) So that even such an app benefits from a higher-level type basically operating on normalised (NFD) characters.

There's also the question of efficiency. On the whole, string operations can be very expensive - particularly when you're doing a lot of them. The fact that D's arrays are so powerful may reduce the problem in D, but in general, if you're doing a lot with strings, it can get costly, performance-wise.

The question then is what is the cost of actually having strings abstracted to the point that they really are ranges of characters rather than code units or code points or whatever? If the cost is large enough, then dealing with strings as arrays as they currently are and having the occasional unicode issue could very well be worth it. As it is, there are plenty of people who don't want to have to care about unicode in the first place, since the programs that they write only deal with ASCII characters. The fact that D makes it so easy to deal with unicode code points is a definite improvement, but taking the abstraction to the point that you're definitely dealing with characters rather than code units or code points could be too costly.

Now, if it can be done efficiently, then having unicode dealt with properly without the programmer having to worry about it would be a big boon. As it is, D's handling of unicode is a big boon, even if it doesn't deal with graphemes and the like.

So, I think that we definitely should have an abstraction for unicode which uses characters as the elements in the range and doesn't have to care about the underlying encoding of the characters (except perhaps picking whether char, wchar, or dchar is used internally, and therefore how much space it requires). However, I'm not at all convinced that such an abstraction can be done efficiently enough to make it the default way of handling strings.

- Jonathan M Davis
Jan 13 2011
prev sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-13 06:48:46 -0500, spir <denis.spir gmail.com> said:

 Note that D's stdlib currently provides no means to do this, not even 
 on the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode 
 library) (good luck ;-). But even ICU, as well as supposed 
 unicode-aware typse or librarys for any language, would give you an 
 abstraction producing correct results for Michel's example. For 
 instance, Python3 code fails as miserably as any other. AFAIK, D is the 
 first and only language having such a tool (Text.d at 
 https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).

D is not the first language dealing correctly with Unicode strings in this manner. Objective-C's NSString class search and compare methods deal with characters with combining marks correctly. If you want to compare code points, you can do so explicitly using the NSLiteralSearch option, but the default is to compare the canonical version (at the grapheme level).

<http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/SearchingStrings.html%23//apple_ref/doc/uid/20000149-CJBBGBAI>

In Cocoa, string sorting and case-insensitive comparison are also dependent on the user's locale settings, although you can also specify your own locale if the user's locale is not what you want.

-- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 13 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-13 14:11:44 -0500, spir <denis.spir gmail.com> said:

 In Cocoa, string sorting and case-insensitive comparition is also 
 dependent on the user's locale settings, although you can also specify 
 your own locale if the user's locale is not what you want. See kde 
 trying to invent a, hum, "natural", way of sorting file names...)

Mac OS has sorted file names in a "natural" way for a very long time (since Mac OS 8, I believe). By natural, I mean that numbers inside the file name are sorted in numeric order while the rest is sorted character by character. For instance "My File 2" will go before "My File 10" in file listings because "2" is less than "10". There's an option in NSString comparison methods to use this ordering, but it's not the default.

-- Michel Fortin michel.fortin michelf.com http://michelf.com/
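The ordering described above can be sketched as a three-way comparison. This is a simplified illustration under stated assumptions: it handles ASCII digits only, compares everything else code unit by code unit, and is not how Mac OS or NSString implements it. naturalCmp is a made-up name.

```d
import std.algorithm.sorting : sort;
import std.ascii : isDigit;

// "Natural" comparison: runs of ASCII digits compare by numeric value,
// everything else compares code unit by code unit.
// Returns a negative value, zero, or a positive value.
int naturalCmp(string a, string b)
{
    size_t i = 0, j = 0;
    while (i < a.length && j < b.length)
    {
        if (isDigit(a[i]) && isDigit(b[j]))
        {
            // Skip leading zeros, then find the end of each digit run.
            while (i < a.length && a[i] == '0') ++i;
            while (j < b.length && b[j] == '0') ++j;
            size_t ie = i, je = j;
            while (ie < a.length && isDigit(a[ie])) ++ie;
            while (je < b.length && isDigit(b[je])) ++je;

            // More significant digits means a larger number.
            if (ie - i != je - j)
                return (ie - i < je - j) ? -1 : 1;

            // Same digit count: digit-by-digit order is numeric order.
            while (i < ie)
            {
                if (a[i] != b[j])
                    return a[i] < b[j] ? -1 : 1;
                ++i;
                ++j;
            }
        }
        else
        {
            if (a[i] != b[j])
                return a[i] < b[j] ? -1 : 1;
            ++i;
            ++j;
        }
    }
    // The string with characters left over sorts last.
    if (i < a.length) return 1;
    if (j < b.length) return -1;
    return 0;
}

void main()
{
    auto names = ["My File 10", "My File 2", "My File 1"];
    sort!((x, y) => naturalCmp(x, y) < 0)(names);
    assert(names == ["My File 1", "My File 2", "My File 10"]);
}
```

A plain string comparison would put "My File 10" before "My File 2", because '1' is less than '2'.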
Jan 13 2011
parent "Nick Sabalausky" <a a.a> writes:
"Michel Fortin" <michel.fortin michelf.com> wrote in message 
news:igo5v2$gq2$1 digitalmars.com...
 On 2011-01-13 14:11:44 -0500, spir <denis.spir gmail.com> said:

 In Cocoa, string sorting and case-insensitive comparition is also 
 dependent on the user's locale settings, although you can also specify 
 your own locale if the user's locale is not what you want. See kde trying 
 to invent a, hum, "natural", way of sorting file names...)

Mac OS sorts file names in a "natural" way since a very long time (since Mac OS 8 I believe). By natural, I mean that numbers inside the file name are sorted in numeric order while the rest is sorted character by character. For instance "My File 2" will go before "My File 10" in file listings because "2" is less than "10".

XP's explorer does that too. It's a very nice feature.
Jan 13 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/12/2011 08:28 PM, Don wrote:
 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

I'd like to know when it happens that codepoint is the appropriate level of abstraction.

* If pieces of text are not manipulated, meaning just used in the application, or just transferred via the application as is (from file / input / literal to any kind of output), then any kind of encoding just works. One can even concatenate, provided all pieces use the same encoding. --> _lower_ level than codepoint is OK.
* But any kind of manipulation (indexing, slicing, compare, search, count, replace, not to speak about regex/parsing) requires operating at the _higher_ level of characters (in the common sense). Just like with historic character sets in which codes used to represent characters (not lower-level thingies as in UCS). Otherwise, one reads, compares, changes meaningless bits of text.

As I see it now, we need 2 types:
* One plain string similar to good old ones (bytestring would do the job, since most unicode is utf8 encoded) for the first kind of use above. With optional validity check when it's supposed to be unicode text.
* One higher-level type abstracting from codepoint (not code unit) issues, restoring the necessary properties: (1) each character is one element in the sequence (2) each character is always represented the same way.

Denis
_________________
vita es estrany
spir.wikidot.com
Jan 12 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/13/2011 01:45 AM, Michel Fortin wrote:
 On 2011-01-12 14:57:58 -0500, spir <denis.spir gmail.com> said:

 On 01/12/2011 08:28 PM, Don wrote:
 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

I'd like to know when it happens that codepoint is the appropriate level of abstraction.

I agree with you. I don't see many use for code points. One of these uses is writing a parser for a format defined in term of code points (XML for instance). But beyond that, I don't see one.

Actually, I once had a real use case for codepoint being the proper level of abstraction: a linguistic app of which one operational func counts occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you see what I mean. Once the text is properly NFD decomposed, each of those marks is coded as a codepoint. (But if it's not decomposed, then most of those marks are probably hidden by precomposed codes coding characters like "ä".) So even such an app benefits from a higher-level type basically operating on normalised (NFD) characters.
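A minimal sketch of such a mark counter, assuming the std.uni.normalize and std.uni.isMark functions from a later Phobos release (neither existed when this was posted). The names countMarks and countCodePoint are illustrative only. Note that NFD yields the combining diaeresis U+0308, not the spacing '¨' (U+00A8).

```d
import std.uni : isMark, normalize, NFD;

// Count combining marks (general category M) once the text is in NFD.
size_t countMarks(string text)
{
    size_t n = 0;
    foreach (dchar c; normalize!NFD(text))   // foreach over dchar decodes UTF-8
        if (isMark(c))
            ++n;
    return n;
}

// Count occurrences of one specific code point after NFD decomposition.
size_t countCodePoint(string text, dchar wanted)
{
    size_t n = 0;
    foreach (dchar c; normalize!NFD(text))
        if (c == wanted)
            ++n;
    return n;
}

void main()
{
    // Precomposed ä (U+00E4) decomposes to 'a' + U+0308 COMBINING DIAERESIS.
    assert(countCodePoint("\u00E4", 'a') == 1);
    assert(countCodePoint("\u00E4", '\u0308') == 1);

    // One precomposed ä and one already-decomposed ä: two marks in total.
    assert(countMarks("\u00E4 a\u0308") == 2);
}
```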
 * If pieces of text are not manipulated, meaning just used in the
 application, or just transferred via the application as is (from file
 / input / literal to any kind of output), then any kind of encoding
 just works. One can even concatenate, provided all pieces use the same
 encoding. --> _lower_ level than codepoint is OK.
 * But any of manipulation (indexing, slicing, compare, search, count,
 replace, not to speak about regex/parsing) requires operating at the
 _higher_ level of characters (in the common sense). Just like with
 historic character sets in which codes used to represent characters
 (not lower-level thingies as in UCS). Else, one reads, compares,
 changes meaningless bits of text.

Very true. In the same way that code points can span on multiple code units, user-perceived characters (graphemes) can span on multiple code points. A funny exercise to make a fool of an algorithm working only with code points would be to replace the word "fortune" in a text containing the word "fortuné". If the last "é" is expressed as two code points, as "e" followed by a combining acute accent (this: é), replacing occurrences of "fortune" by "expose" would also replace "fortuné" with "exposé" because the combining acute accent remains as the code point following the word. Quite amusing, but it doesn't really make sense that it works like that. In the case of "é", we're lucky enough to also have a pre-combined character to encode it as a single code point, so encountering "é" written as two code points is quite rare. But not all combinations of marks and characters can be represented as a single code point. The correct thing to do is to treat "é" (single code point) and "é" ("e" + combining acute accent) as equivalent.

You'll find another example in the introduction of the text at https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction

About your last remark, this is precisely one of the two abstractions my Text type provides: it groups together in "piles" codes that belong to the same "true" character (grapheme) like "é". So the resulting text representation is a sequence of "piles", each representing a character. Consequence: indexing, slicing, etc. work sensibly (and even other operations are faster, for they do not need to perform that "piling" again & again).

In addition to that, the string is first NFD-normalised, thus each character can have one & only one representation. Consequence: search, count, replace, etc., and compare (*) work as expected. In your case:

    // 2 forms of "é"
    assert(Text("\u00E9") == Text("\u0065\u0301"));

Denis

(*) According to UCS coding, not language-specific idiosyncrasies. More generally, Text abstracts from lower-level issues _introduced_ by UCS, Unicode's character set. It does not cope with script-, language-, culture-, domain-, app-specific needs such as custom text sorting rules. Some base routines for such operations are provided by Text's brother lib DUnicode (access to some code properties, safe concat, casefolded compare, NF* normalisation).
_________________
vita es estrany
spir.wikidot.com
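The "normalise, then pile" idea can be approximated with facilities that entered Phobos after this thread (std.uni.normalize and std.uni.byGrapheme). The sketch below is not the Text type from the post, and toPiles is a hypothetical helper name. It only shows that indexing by pile gives one element per user-perceived character.

```d
import std.array : array;
import std.range : walkLength;
import std.uni : byGrapheme, Grapheme, normalize, NFD;

// Hypothetical helper: normalise to NFD, then group the code points
// into graphemes ("piles"), computed once for the whole text.
Grapheme[] toPiles(string s)
{
    return normalize!NFD(s).byGrapheme.array;
}

void main()
{
    string s = "expos\u00E9s";        // "exposés" with precomposed é

    assert(s.length == 8);            // UTF-8 code units
    assert(s.walkLength == 7);        // code points

    auto piles = toPiles(s);
    assert(piles.length == 7);        // user-perceived characters

    // After NFD, pile 5 holds 'e' plus the combining acute accent.
    assert(piles[5].length == 2);
    assert(piles[5][0] == 'e');
    assert(piles[5][1] == '\u0301');

    // Either spelling of é ends up as a single pile.
    assert(toPiles("\u00E9").length == 1);
    assert(toPiles("e\u0301").length == 1);
}
```

Building the array costs one pass up front, which is the trade-off discussed in the thread: later indexing and slicing by character are then plain array operations.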
Jan 13 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/13/2011 01:51 AM, Michel Fortin wrote:
 On 2011-01-12 19:45:36 -0500, Michel Fortin <michel.fortin michelf.com>
 said:

 A funny exercise to make a fool of an algorithm working only with code
 points would be to replace the word "fortune" in a text containing the
 word "fortuné". If the last "é" is expressed as two code points, as
 "e" followed by a combining acute accent (this: é), replacing
 occurrences of "fortune" by "expose" would also replace "fortuné" with
 "exposé" because the combining acute accent remains as the code point
 following the word. Quite amusing, but it doesn't really make sense
 that it works like that.

 In the case of "é", we're lucky enough to also have a pre-combined
 character to encode it as a single code point, so encountering "é"
 written as two code points is quite rare. But not all combinations of
 marks and characters can be represented as a single code point. The
 correct thing to do is to treat "é" (single code point) and "é" ("e" +
 combining acute accent) as equivalent.

Crap, I meant to send this as UTF-8 with combining characters in it, but my news client converted everything to ISO-8859-1. I'm not sure it'll work, but here's my second attempt at posting real combining marks: Single code point: é e with combining mark: é t with combining mark: t̂ t with two combining marks: t̂̃

Works :-) But your first post worked for me as well: for instance <<"é" ("e" + combining acute accent)>> was displayed "é" as a single accented letter. I guess maybe your email client did not convert into iso-8859-1 on sending, but on reading (mine is set for utf-8).

Denis
_________________
vita es estrany
spir.wikidot.com
Jan 13 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/13/2011 11:16 AM, Jonathan M Davis wrote:
 On Thursday 13 January 2011 01:49:31 spir wrote:
 On 01/13/2011 01:45 AM, Michel Fortin wrote:
 On 2011-01-12 14:57:58 -0500, spir<denis.spir gmail.com>  said:
 On 01/12/2011 08:28 PM, Don wrote:
 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

I'd like to know when it happens that codepoint is the appropriate level of abstraction.

I agree with you. I don't see many use for code points. One of these uses is writing a parser for a format defined in term of code points (XML for instance). But beyond that, I don't see one.

Actually, I had once a real use case for codepoint beeing the proper level of abstraction: a linguistic app of which one operational func counts occurrences of "scripting marks" like 'a'& '¨' in "ä". hope you see what I mean. Once the text is properly NFD decomposed, each of those marks in coded as a codepoint. (But if it's not decomposed, then most of those marks are probably hidden by precomposed codes coding characters like "ä".) So that even such an app benefits from a higher-level type basically operating on normalised (NFD) characters.

There's also the question of efficiency. On the whole, string operations can be very expensive - particularly when you're doing a lot of them. The fact that D's arrays are so powerful may reduce the problem in D, but in general, if you're doing a lot with strings, it can get costly, performance-wise.

D's arrays (even dchar[] & dstring) do not allow having correct results when dealing with UCS/Unicode text in the general case. See Michel's example (and several ones I posted on this list, and the text at https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction for a very lengthy explanation). You and some other people seem to still mistake Unicode's low level issue of codepoint vs code unit, with the higher-level issue of codes _not_ representing characters in the common sense ("graphemes").

The above pointed text was written precisely to introduce this issue because obviously no-one wants to face it... (E.g. each time I evoke it on this list it is ignored, except by Michel, but the same is true everywhere else, including on the Unicode mailing list!). The core of the problem is the misleading term "abstract character" which deceivingly lets programmers believe that a codepoint codes a character, like in historic character sets -- which is *wrong*. No Unicode document AFAIK explains this. This is a case of unsaid lie. Compared to legacy charsets, dealing with Unicode actually requires *2* levels of abstraction... (one to decode codepoints from code units, one to construct characters from codepoints)

Note that D's stdlib currently provides no means to do this, not even on the fly. You'd have to interface with e.g. ICU (a C/C++/Java Unicode library) (good luck ;-). But even ICU, as well as supposedly unicode-aware types or libraries for any language, would not give you an abstraction producing correct results for Michel's example. For instance, Python3 code fails as miserably as any other. AFAIK, D is the first and only language having such a tool (Text.d at https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).
 The question then is what is the cost of actually having strings abstracted to
 the point that they really are ranges of characters rather than code units or
 code points or whatever? If the cost is large enough, then dealing with strings
 as arrays as they currently are and having the occasional unicode issue could
 very well be worth it. As it is, there are plenty of people who don't want to
 have to care about unicode in the first place, since the programs that they write
 only deal with ASCII characters. The fact that D makes it so easy to deal with
 unicode code points is a definite improvement, but taking the abstraction to the
 point that you're definitely dealing with characters rather than code units or
 code points could be too costly.

When _manipulating_ text (indexing, search, changing), you have the choice between:
* On the fly abstraction (composing characters on the fly, and/or normalising them), for each operation for each piece of text (including parameters, including literals).
* Use of a type that constructs this abstraction once only for each piece of text.
Note that a single count operation is forced to construct this abstraction on the fly for the whole text... (and for the searched snippet). Also note that optimisation is probably easier in the second case, for the abstraction operation is then standard.
 Now, if it can be done efficiently, then having unicode dealt with properly
 without the programmer having to worry about it would be a big boon. As it is,
 D's handling of unicode is a big boon, even if it doesn't deal with graphemes
 and the like.

It has a cost at initial Text construction time. Currently, on my very slow computer, 1MB source text requires ~ 500 ms (decoding + decomposition + ordering + "piling" codes into characters). Decoding only using D's builtin std.utf.decode takes about 100 ms. The bottleneck is piling: 70% of the time on average, on a test case mixing texts from a dozen natural languages. We would be very glad to get the community's help in optimising this phase :-) (We have progressed very much already in terms of speed, but now reach the limits of our competence.)
 So, I think that we definitely should have an abstraction for unicode which uses
 characters as the elements in the range and doesn't have to care about the
 underlying encoding of the characters (except perhaps picking whether char,
 wchar, or dchar is use internally, and therefore how much space it requires).
 However, I'm not at all convinced that such an abstraction can be done efficiently
 enough to make it the default way of handling strings.

If you only have ASCII, or if you don't manipulate text at all, then as said in a previous post any string representation works fine (whatever the encoding it possibly uses under the hood). D's builtin char/dchar/wchar and string/dstring/wstring are very nice and well done, but they are not necessary in such a use case. Actually, as shown by Steven's repeated complaints, they rather get in the way when dealing with non-unicode source data (IIUC, by assuming string elements are utf codes).

And they do not even try to solve the real issues one necessarily meets when manipulating unicode texts, which are due to UCS's coding format. Thus my previous statement: the level of codepoints is nearly never the proper level of abstraction.
 - Jonathan M Davis

Denis _________________ vita es estrany spir.wikidot.com
Jan 13 2011
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday 13 January 2011 03:48:46 spir wrote:
 On 01/13/2011 11:16 AM, Jonathan M Davis wrote:
 On Thursday 13 January 2011 01:49:31 spir wrote:
 On 01/13/2011 01:45 AM, Michel Fortin wrote:
 On 2011-01-12 14:57:58 -0500, spir<denis.spir gmail.com>  said:
 On 01/12/2011 08:28 PM, Don wrote:
 I think the only problem that we really have, is that "char[]",
 "dchar[]" implies that code points is always the appropriate level of
 abstraction.

I'd like to know when it happens that codepoint is the appropriate level of abstraction.

I agree with you. I don't see many use for code points. One of these uses is writing a parser for a format defined in term of code points (XML for instance). But beyond that, I don't see one.

Actually, I had once a real use case for codepoint beeing the proper level of abstraction: a linguistic app of which one operational func counts occurrences of "scripting marks" like 'a'& '¨' in "ä". hope you see what I mean. Once the text is properly NFD decomposed, each of those marks in coded as a codepoint. (But if it's not decomposed, then most of those marks are probably hidden by precomposed codes coding characters like "ä".) So that even such an app benefits from a higher-level type basically operating on normalised (NFD) characters.

There's also the question of efficiency. On the whole, string operations can be very expensive - particularly when you're doing a lot of them. The fact that D's arrays are so powerful may reduce the problem in D, but in general, if you're doing a lot with strings, it can get costly, performance-wise.

D's arrays (even dchar[] & dstring) do not allow having correct results when dealing with UCS/Unicode text in the general case. See Michel's example (and several ones I posted on this list, and the text at https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction for a very lengthy explanation).
 You and some other people seem to still mistake Unicode's low level
 issue of codepoint vs code unit, with the higher-level issue of codes
 _not_ representing characters in the commmon sense ("graphemes").
 The above pointed text was written precisely to introduce to this issue
 because obviously no-one wants to face it... (Eg each time I evoke it on
 this list it is ignored, except by Michel, but the same is true
 everywhere else, including on the Unicode mailing list!). The core of
 the problem is the misleading term "abstract character" which
 deceivingly lets programmers believe that a codepoints codes a
 character, like in historic character sets -- which is *wrong*. No
 Unicode document AFAIK explains this. This is a case of unsaid lie.
 Compared to legacy charsets, dealing with Unicode actually requires *2*
 levels of abstraction... (one to decode codepoints from code units, one
 to construct characters from codepoints)
 Note that D's stdlib currently provides no means to do this, not even on
 the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode
 library) (good luck ;-). But even ICU, as well as supposed unicode-aware
 typse or librarys for any language, would give you an abstraction
 producing correct results for Michel's example. For instance, Python3
 code fails as miserably as any other. AFAIK, D is the first and only
 language having such a tool (Text.d at
 https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).
 The question then is what is the cost of actually having strings
 abstracted to the point that they really are ranges of characters rather
 than code units or code points or whatever? If the cost is large enough,
 then dealing with strings as arrays as they currently are and having the
 occasional unicode issue could very well be worth it. As it is, there
 are plenty of people who don't want to have to care about unicode in the
 first place, since the programs that they write only deal with ASCII
 characters. The fact that D makes it so easy to deal with unicode code
 points is a definite improvement, but taking the abstraction to the
 point that you're definitely dealing with characters rather than code
 units or code points could be too costly.

When _manipulating_ text (indexing, search, changing), you have the choice between: * On the fly abstraction (composing characters on the fly, and/or normalising them), for each operation for each piece of text (including parameters, including literals). * Use of a type that constructs this abstraction once only for each piece of text. Note that a single count operation is forced to construct this abstraction on the fly for the whole text... (and for the searched snippet). Also note that optimisation is probably easier is the second case, for the abstraction operation is then standard.
 Now, if it can be done efficiently, then having unicode dealt with
 properly without the programmer having to worry about it would be a big
 boon. As it is, D's handling of unicode is a big boon, even if it
 doesn't deal with graphemes and the like.

It has a cost at intial Text construction time. Currently, on my very slow computer, 1MB source text requires ~ 500 ms (decoding + decomposition + ordering + "piling" codes into characters). Decoding only using D's builtin std.utf.decode takes about 100 ms. The bottle neck is piling: 70% of the time in average, on a test case melting texts from a dozen natural languages. We would be very glad to get the community's help in optimising this phase :-) (We have progressed very much already in terms of speed, but now reach limits of our competences.)
 So, I think that we definitely should have an abstraction for unicode
 which uses characters as the elements in the range and doesn't have to
 care about the underlying encoding of the characters (except perhaps
 picking whether char, wchar, or dchar is use internally, and therefore
 how much space it requires). However, I'm not at all convinced that such
 an abstraction can be done efficiently enough to make it the default way
 of handling strings.

If you only have ASCII, or if you don't manipulate text at all, then, as said in a previous post, any string representation works fine (whatever encoding it possibly uses under the hood). D's builtin char/dchar/wchar and string/dstring/wstring are very nice and well done, but they are not necessary in such a use case. Actually, as shown by Steven's repeated complaints, they rather get in the way when dealing with non-unicode source data (IIUC, by assuming string elements are utf codes). And they do not even try to solve the real issues one necessarily meets when manipulating unicode texts, which are due to UCS's coding format. Thus my previous statement: the level of codepoints is nearly never the proper level of abstraction.

I wasn't saying that code points are guaranteed to be characters. I was saying that in most cases they are, so if efficiency is an issue, then having properly abstract characters could be too costly. However, having a range type which properly abstracts characters and deals with whatever graphemes and normalization and whatnot that it has to would be a very good thing to have. The real question is whether it can be made efficient enough to even consider using it normally instead of just when you know that you're really going to need it.

The fact that you're seeing such a large drop in performance with your Text type definitely would support the idea that it could be just plain too expensive to use such a type in the average case. Even something like a 20% drop in performance could be devastating if you're dealing with code which does a lot of string processing. Regardless though, there will obviously be cases where you'll need something like your Text type if you want to process unicode correctly.

However, regardless of what the best way to handle unicode is in general, I think that it's painfully clear that your average programmer doesn't know much about unicode. Even understanding the nuances between char, wchar, and dchar is more than your average programmer seems to understand at first. The idea that a char wouldn't be guaranteed to be an actual character is not something that many programmers take to immediately. It's quite foreign to how chars are typically dealt with in other languages, and many programmers never worry about unicode at all, only dealing with ASCII. So, not only is unicode a rather disgusting problem, but it's not one that your average programmer begins to grasp as far as I've seen. Unless the issue is abstracted away completely, it takes a fair bit of explaining to understand how to deal with unicode properly.

- Jonathan M Davis
Jan 13 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-13 07:10:09 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:

 However, regardless of what the best way to handle unicode is in 
 general, I think that it's painfully clear that your average programmer 
 doesn't know much about unicode. Even understanding the nuances between 
 char, wchar, and dchar is more than your average programmer seems to 
 understand at first. The idea that a char wouldn't be guaranteed to be 
 an actual character is not something that many programmers take to
 immediately. It's quite foreign to how chars are typically dealt with
 in other languages, and many programmers never worry about unicode at
 all, only dealing with ASCII. So, not only is unicode a rather
 disgusting problem, but it's not one that your average programmer
 begins to grasp as far as I've seen. Unless the issue is abstracted
 away completely, it takes a fair bit of explaining to understand how to
 deal with unicode properly.

What's nice about Cocoa's way of handling strings is that even programmers not bothering about it get things right most of the time. Strings are compared in their canonical form (graphemes), unless you request a literal comparison; and they are sorted and compared case-insensitively according to the user's locale, unless you specify your own locale settings. Its only major pitfall is that indexing is done on UTF-16 code units.

The cost for this correctness is a small performance penalty, but I think it's the right path to take. For when performance or access to code points is important, the programmer should still be able to go down one layer and play with code points directly.

That said, we need to make sure the performance drop is minimal. I somewhat doubt that spir's approach of storing strings as an array of piles of characters is the right approach for most usage scenarios, but this area would need a little more research. spir's approach is certainly the ultimate step in correctness as it allows O(1) indexing of graphemes, but personally I'd favor not having indexing and just doing on-the-fly decoding at the grapheme level when performing various string operations.

-- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 13 2011
parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-01-13 15:39:14 -0500, "Nick Sabalausky" <a a.a> said:

 "Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message
 news:mailman.604.1294932704.4748.digitalmars-d puremagic.com...
 OT: Spir, do you know if I can change the syntax highlighting settings
 on bitbucket? I can't see anything with these gray on dark-gray
 colors: http://i.imgur.com/SmLk1.jpg

I'm getting the same problem too.

I bypassed the problem by fetching the files from the repository. But I agree it's very annoying. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 13 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/13/2011 01:10 PM, Jonathan M Davis wrote:
 I wasn't saying that code points are guaranteed to be characters. I was saying
 that in most cases they are, so if efficiency is an issue, then having properly
 abstract characters could be too costly.

The problem is then: how does a library or application programmer know, for sure, that all true characters (graphemes) from all source texts their software will ever deal with are coded with a single codepoint? If you cope with ASCII only, now & forever, then you know that. If you do not manipulate text at all, then the question vanishes. Else, you cannot know, I guess.

The problem is partially masked because most of us currently process only western language sources, for whose scripts there exist precomposed codes for every _predefined_ character, and text-producing software (like editors) usually uses precomposed codes when available. Hope I'm clear. (I hope this use of precomposed codes will change, because the gain in space for western langs is ridiculous and the cost in processing is, by contrast, significant.) In the future, all of this may change, so that the issue would more often be obvious to many programmers dealing with international text.

Note that even now nothing prevents a user (including a programmer in source code!), even less a text-producing software, from using decomposed coding (the right choice imo). And there are true characters for which no precomposed code is defined, and you can "invent" as many fancy characters as you like. All of this is valid unicode and must be properly dealt with.
 However, having a range type which
 properly abstracts characters and deals with whatever graphemes and
 normalization and whatnot that it has to would be a very good thing
 to have. The real question is whether it can be made efficient enough to
 even consider using it normally instead of just when you know that
 you're really going to need it.

On ranges: we initially planned to expose a range interface in our type for iteration, instead of opApply, for better integration with the coming D2 style and algorithms. But we had to drop it due to a few range bugs exposed in a previous thread (search for "range usability" IIRC).
 The fact that you're seeing such a large drop in performance with your Text
 type definitely would support the idea that it could be just plain too
 expensive to use such a type in the average case. Even something like a 20%
 drop in performance could be devastating if you're dealing with code which
 does a lot of string processing. Regardless though, there will obviously be
 cases where you'll need something like your Text type if you want to process
 unicode correctly.

The question of efficiency is not as you present it. If you cannot guarantee that every character is coded by a single code (in all pieces of text, including params and literals), then you *must* construct an abstraction at the level of true characters --and probably even normalise them. You have the choice of doing it on the fly for _every_ operation, or using a tool like the type Text. In the latter case, not only is everything far simpler for client code, but the abstraction is constructed only once (and forever ;-).

In the first case, the cost is the same (or rather higher, because optimisation can probably be more efficient for a single standard case than for various operation cases), but _multiplied_ by the number of operations you need to perform on each piece of text. Thus, for a given operation, you get the slowest possible run: for instance, indexing is O(k*n), where k is the cost of "piling" a single char and n the char count...

In the second case, the efficiency issue arises only once, initially, for each piece of text. Then, every operation is as fast as possible: indexing is indeed O(1). But: this O(1) is slightly slower than with historic charsets, because characters are now represented by mini code arrays instead of single codes. The same point applies even more to every operation involving comparisons (search, count, replace). We cannot solve this: it is due to UCS's coding scheme.
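The two strategies can be contrasted in a minimal sketch. It does not use the Text type discussed here; it assumes a later Phobos release that ships `std.uni.byGrapheme` and `std.uni.Grapheme`, and the helper names `nthOnTheFly` and `pile` are hypothetical, chosen only for illustration.

```d
import std.array : array;
import std.range : drop;
import std.uni : Grapheme, byGrapheme;

// On the fly: re-segment the text from the start on every lookup,
// so each index operation costs O(n).
Grapheme nthOnTheFly(string text, size_t n)
{
    return text.byGrapheme.drop(n).front;
}

// Built once: segment the whole text up front ("piling"),
// after which indexing the resulting array is O(1).
Grapheme[] pile(string text)
{
    return text.byGrapheme.array;
}

void main()
{
    // 'a', then 'e' + U+0302 COMBINING CIRCUMFLEX ACCENT, then 'z'
    string source = "ae\u0302z";

    Grapheme[] piled = pile(source);
    assert(piled.length == 3);      // three user-perceived characters
    assert(piled[1].length == 2);   // the middle one holds two code points
    assert(piled[1][0] == 'e');

    Grapheme g = nthOnTheFly(source, 2);
    assert(g.length == 1 && g[0] == 'z');
}
```

The up-front array trades memory and construction time for cheap indexing afterwards, which is the trade-off argued over in this thread.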
 However, regardless of what the best way to handle unicode is in general, I
 think that it's painfully clear that your average programmer doesn't know much
 about unicode.

True. Even those who think they are informed. Because Unicode's docs not only all ignore the problem, but contribute to creating it by using the deceiving term "abstract character" (and often worse, "character" alone) to denote what a codepoint codes. All articles I have ever read _about_ Unicode by third parties simply follow suit. Raising this issue on the unicode mailing list usually results in plain silence.
 Even understanding the nuances between char, wchar, and dchar is
 more than your average programmer seems to understand at first. The idea that
 a char wouldn't be guaranteed to be an actual character is not something that
 many programmers take to immediately. It's quite foreign to how chars are
 typically dealt with in other languages, and many programmers never worry
 about unicode at all, only dealing with ASCII.

(average programmer? ;-) It is not so much foreign to "how chars are typically dealt with in other languages" as to how characters were coded in historic charsets. Other languages ignore the issue, and thus run incorrectly with universal text, the same way D's builtin tools do. About ASCII, note that the only kind of source it is able to encode is plain English text, without any fancy bits in it. A single non-breaking space, "≥", "×" (product, U+00D7), or a letter imported from a foreign language as in "à la" is enough to leave ASCII; same for "αβγ", not to mention "©" & "®"...
 So, not only is unicode a rather disgusting
 problem, but it's not one that your average programmer begins to grasp as far
 as I've seen. Unless the issue is abstracted away completely, it takes a fair
 bit of explaining to understand how to deal with unicode properly.

Please have a look at https://bitbucket.org/denispir/denispir-d/src/a005424f60f3, read https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction, and try https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/Text.d
Any feedback welcome (esp. on reformulating the text concisely ;-)
 - Jonathan M Davis

Denis _________________ vita es estrany spir.wikidot.com
Jan 13 2011
prev sibling next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
OT: Spir, do you know if I can change the syntax highlighting settings
on bitbucket? I can't see anything with these gray on dark-gray
colors: http://i.imgur.com/SmLk1.jpg
Jan 13 2011
parent "Nick Sabalausky" <a a.a> writes:
"Andrej Mitrovic" <andrej.mitrovich gmail.com> wrote in message 
news:mailman.604.1294932704.4748.digitalmars-d puremagic.com...
 OT: Spir, do you know if I can change the syntax highlighting settings
 on bitbucket? I can't see anything with these gray on dark-gray
 colors: http://i.imgur.com/SmLk1.jpg

I'm getting the same problem too.
Jan 13 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/13/2011 02:47 PM, Michel Fortin wrote:
 On 2011-01-13 06:48:46 -0500, spir <denis.spir gmail.com> said:

 Note that D's stdlib currently provides no means to do this, not even
 on the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode
 library) (good luck ;-). But even ICU, as well as supposed
 unicode-aware types or libraries for any language, would give you an
 abstraction producing correct results for Michel's example. For
 instance, Python3 code fails as miserably as any other. AFAIK, D is
 the first and only language having such a tool (Text.d at
 https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).

D is not the first language dealing correctly with Unicode strings in this manner. Objective-C's NSString class search and compare methods deal with characters with combining marks correctly. If you want to compare code points, you can do so explicitly using the NSLiteralSearch option, but the default is to compare the canonical version (at the grapheme level). <http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/SearchingStrings.html%23//apple_ref/doc/uid/20000149-CJBBGBAI>

Thank you very much for this information (I feel less lonely ;-). I'll have a look at this NSString class ASAP, looks like it does The-Right-Thing as default (an Apple product...)
 In
 Cocoa, string sorting and case-insensitive comparison is also dependent
 on the user's locale settings, although you can also specify your own
 locale if the user's locale is not what you want.

On this point, I'm more doubtful. (Locale settings do not guarantee anything about the right way of sorting for a given domain, a given app, a given use case. There is an infinity of potential choices. But maybe it's the right default? See KDE trying to invent a, hum, "natural" way of sorting file names...) Denis _________________ vita es estrany spir.wikidot.com
Jan 13 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/13/2011 11:00 PM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:ignon1$2p4k$1 digitalmars.com...
 This may sometimes not be what the user expected; most of the time they'd
 care about the code points.

I dunno, spir has successfully convinced me that most of the time it's graphemes the user cares about, not code points. Using code points is just as misleading as using UTF-16 code units.

You are right in that those 2 issues are really analogous. In practice, once universal text is truly and commonly used, I guess problems with codes-do-not-represent-characters may become far more obvious; and also far more serious, because (logical) errors can easily pass by unseen. [In fact, how can a programmer even know, for instance, that a search routine missed its target or returned a false positive, when dealing with characters from unknown languages? Indeed, there are test data sets, but they are useless if the tools one uses just ignore the issues.]

Using a 16-bit representation, and thus ignoring a fair number of codepoints, is maybe less problematic, because there is rather little chance of randomly meeting characters outside the BMP (Basic Multilingual Plane, the part of UCS whose codepoints are < 0x10000). Outside the BMP are scripting systems of less commonly studied archeological languages, and various sets of images such as alchemical symbols, playing cards or domino tiles. I doubt they'll ever be commonly used, except in specialised apps whose programmers know perfectly well what they deal with. A list of UCS blocks with pointers to detailed content can be found here: http://www.fileformat.info/info/unicode/block/index.htm Blocks above the BMP start with the line: Linear B Syllabary U+10000 U+1007F (88) Denis _________________ vita es estrany spir.wikidot.com
Jan 13 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:

 That's forgetting that most of the time people care about graphemes
 (user-perceived characters), not code points.

I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes is not what people care about (although it might be what they should care about).

I'm aware of that, and I have no definitive answer to the question. The issue *does* exist --as shown even by trivial examples such as Michel's below, not corner cases. The actual question is _not_ whether code or "grapheme" is the proper level of abstraction. To this, the answer is clear: codes are simply meaningless in 99% of cases. (All historic software deals with chars, conceptually, but they happen to be coded with single codes.) (And what about Objective-C? Why did its designers even bother with that?) The question is rather: why do we nearly all happily go on ignoring the issue? My present guess is a combination of factors:
* The issue is masked by the misleading use of "abstract character" in unicode literature. "Abstract" is very correct, but they should have found a term other than "character", say "abstract scripting mark". Their deceiving terminological choice lets most programmers believe that codepoints code characters, like in historic charsets. (Even worse: some docs explicitly state that ICU's notion of character matches the programming notion of character.)
* UCS added precomposed codes for a bunch of characters, supposedly for backward compatibility with said charsets. (But where is the gain? We need to decode them anyway...) The consequence is, at the pedagogical level, very bad: most text-producing software (like editors) uses such precomposed codes when available for a given character. So programmers can happily go on believing in the code=character myth. (Note: the gain in space is ridiculous for western text.)
* Most characters that appear in western texts (at least "official" characters of natural languages) have precomposed forms.
* Programmers can very easily be unaware that their code is incorrect: how do you even notice it in test output?
Thus, practically, programmers can (1) simply not know about the issue, (2) have code that really works in typical use cases for their software, or (3) not notice that their code runs incorrectly.
There is also an intermediate situation between (2) & (3), similar to old problems with previous ASCII-only apps: they work wrongly when used in a non-English environment, but what can users do, concretely? Most often, they just have to cope with incorrectness, reinterpret outputs differently, and/or find workarounds by cheating with the interface.

The responsibility of designers of tools for programmers is, imo, important. We should first make the issue clear (very difficult, it's a ubiquitous myth to break down), and then propose services that run correctly in situations where said issue is relevant, here manipulation of universal text, even if not very efficient at the start.

On my side, and about D, I wish that most D programmers (1) are aware of the problem, (2) understand its whys & hows, (3) know there is a correct solution. Then, (4) whether they actually use it is their choice (and I don't care whether or not they do).
 It also supports this:

 foreach(i, d; s)
 {
 writeln("The character in position ", i, " is ", d);
 }

 where i is the index (might not be sequential)

Well string supports that too, albeit with the nit that you need to specify dchar.

Except it breaks with combining characters. For instance, take the string "t̃", which is two code points -- 't' followed by combining tilde (U+0303) -- and you'll get the following output: The character in position 0 is t The character in position 1 is ̃ (Note that the tilde becomes combined with the preceding space character.) The conception of character that normal people have does not match the notion of code points when combining characters enters the equation.
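The mismatch can be checked with a short sketch. It assumes a later Phobos release that ships `std.uni.byGrapheme` (not part of the standard library discussed in this thread); the helper names `countCodePoints` and `countGraphemes` are hypothetical.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Number of elements a `foreach (dchar c; s)` loop visits: code points.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

// Number of user-perceived characters: grapheme clusters.
size_t countGraphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string s = "t\u0303"; // 't' followed by U+0303 COMBINING TILDE

    assert(s.length == 3);            // UTF-8 code units
    assert(countCodePoints(s) == 2);  // what the dchar loop iterates over
    assert(countGraphemes(s) == 1);   // what a reader perceives
}
```

The same string thus has three different "lengths", one per level of abstraction.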

This might be a good time to see whether we need to address graphemes systematically. Could you please post a few links that would educate me and others in the mysteries of combining characters?

Beware! far too long text. https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction (the directory above contains the current rough implementation of Text, plus a bit of its brother package DUnicode)
 Thanks,

 Andrei

Denis _________________ vita es estrany spir.wikidot.com
Jan 14 2011
prev sibling next sibling parent reply spir <denis.spir gmail.com> writes:
On 01/14/2011 07:26 AM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:igoj6s$17r6$1 digitalmars.com...
 I'm not so sure about that. What do you base this assessment on? Denis
 wrote a library that according to him does grapheme-related stuff nobody
 else does. So apparently graphemes is not what people care about (although
 it might be what they should care about).

It's what they want, they just don't know it. Graphemes are what many people *think* code points are.
 This might be a good time to see whether we need to address graphemes
 systematically. Could you please post a few links that would educate me
 and others in the mysteries of combining characters?

Maybe someone else has a link to an explanation (I don't), but it's basically just this:

If anyone finds a pointer to such an explanation, bravo, and thank you. (You will certainly not find it in Unicode literature, for instance.) Nick's explanation below is good and concise. (Just 2 notes added.)
 Three levels of abstraction from lowest to highest:
 - Code Unit (ie, encoding)
 - Code Point (ie, what Unicode assigns distinct numbers to)
 - Grapheme (ie, what we think of as a "character")

 A code-point can be made up of one or more code-units. Likewise, a grapheme
 can be made up of one or more code-points.

 There are (at least) two types of code points:

 - Regular ones, such as letters, digits, and punctuation.

 - "Combining Characters", such as accent marks (or if you're familiar with
 Japanese, the little things in the upper-right corner that change an "s" to
 a "z" or an "h" to a "p". Or like German's umlaut - the two dots above a
 vowel). Ie, things that are not characters in their own right, but merely
 modify other characters. These can often (always?) be thought of as being
 like overlays.

You can also say there are 2 kinds of characters: simple like "u" & composite like "ü" or "ṵ̈̈". The former are coded with a single (base) code, the latter with one (rarely more) base codes and an arbitrary number of combining codes. For a majority of _common_ characters made of 2 or 3 codes (western language letters, Korean Hangul syllables,...), precombined codes have been added to the set. Thus, they can be coded with a single code like simple characters. [Also note, to keep things from being too simple ;-), some (few) combining codes called "prepend" come _before_ the base in the raw code sequence...]
 If a code point representing a "combining character" exists in a string,
 then instead of being displayed as a character it merely modifies whatever
 code-point came before it.

 So, for instance, if you want to store the German word for five (in all
 lower-case), there are two ways to do it:

 [ 'f', {u with the umlaut}, 'n', 'f' ]

 Or:

 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

Note: the second form is the base form for Unicode. There are reasons to have chosen it (see my text), and why UCS does not and simply cannot propose precomposed codes for all possible composite characters.
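A minimal sketch of the two spellings of "fünf", assuming a later Phobos release that provides `std.uni.normalize` with the `NFC` and `NFD` forms; the helper name `sameText` is hypothetical.

```d
import std.uni : NFC, NFD, normalize;

// Canonical equivalence: bring both sides to the same normal form
// before comparing code units.
bool sameText(string a, string b)
{
    return normalize!NFD(a) == normalize!NFD(b);
}

void main()
{
    string precomposed = "f\u00FCnf";   // U+00FC, u with diaeresis
    string decomposed  = "fu\u0308nf";  // 'u' + U+0308 COMBINING DIAERESIS

    assert(precomposed != decomposed);          // raw comparison says "different"
    assert(sameText(precomposed, decomposed));  // normalised comparison agrees

    // The two forms convert into each other under NFC / NFD.
    assert(normalize!NFC(decomposed) == precomposed);
    assert(normalize!NFD(precomposed) == decomposed);
}
```

A raw `==` on strings compares code units, which is why the two spellings only compare equal after normalisation.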
 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.

 Note that while some characters exist in pre-combined form (such as the {u
 with the umlaut} above), legend has it there are others that can only be
 represented using a combining character.

 It's also my understanding, though I'm not certain, that sometimes multiple
 combining characters can be used together on the same "root" character.

There is no logical limit, only practical such as how to display 3 diacritics above the same base? You can invent a script for a mythical folk's language if you like :-) Also, some examples of real language characters (Hebrew, IIRC) in Unicode test data sets hold up to 8 codes.
 Caveat: There may very well be further complications that I'm not aware of.
 Heck, knowing Unicode, there probably are.

Denis _________________ vita es estrany spir.wikidot.com
Jan 14 2011
next sibling parent "Nick Sabalausky" <a a.a> writes:
"spir" <denis.spir gmail.com> wrote in message 
news:mailman.619.1295012086.4748.digitalmars-d puremagic.com...
 If anyone finds a pointer to such an explanation, bravo, and thank you. 
 (You will certainly not find it in Unicode literature, for instance.)
 Nick's explanation below is good and concise. (Just 2 notes added.)

Yea, most Unicode explanations seem to talk all about "code-units vs code-points" and then they'll just have a brief note like "There's also other things like digraphs and combining codes." And that'll be all they mention. You're right about the Unicode literature. It's the usual standards-body documentation, same as W3C: "Instead of only some people understanding how this works, let's encode the documentation in legalese (and have twenty only-slightly-different versions) to make sure that nobody understands how it works."
 You can also say there are 2 kinds of characters: simple like "u" & 
 composite "ü" or "ṵ̈̈". The former are coded with a single (base) code, 
 the latter with one (rarely more) base codes and an arbitrary number of 
 combining codes.

Couple questions about the "more than one base codes": - Do you know an example offhand? - Does that mean like a ligature where the base codes form a single glyph, or does it mean that the combining code either spans or operates over multiple glyphs? Or can it go either way?
 For a majority of _common_ characters made of 2 or 3 codes (western 
 language letters, korean Hangul syllables,...), precombined codes have 
 been added to the set. Thus, they can be coded with a single code like 
 simple characters.

Out of curiosity, how do decomposed Hangul characters work? (Or do you know?) Not actually knowing any Korean, my understanding is that they're a set of 1 to 4 phonetic glyphs that are then combined into one glyph. So, is it like a series of base codes that automatically combine, or are there combining characters involved?
 [Also note, to avoid things be too simple ;-), some (few) combining codes 
 called "prepend" come _before_ the base in raw code sequence...]

Fun!
Jan 14 2011
prev sibling parent spir <denis.spir gmail.com> writes:
On 01/14/2011 08:20 PM, Nick Sabalausky wrote:
 "spir"<denis.spir gmail.com>  wrote in message
 news:mailman.619.1295012086.4748.digitalmars-d puremagic.com...
 If anyone finds a pointer to such an explanation, bravo, and thank you.
 (You will certainly not find it in Unicode literature, for instance.)
 Nick's explanation below is good and concise. (Just 2 notes added.)

Yea, most Unicode explanations seem to talk all about "code-units vs code-points" and then they'll just have a brief note like "There's also other things like digraphs and combining codes." And that'll be all they mention. You're right about the Unicode literature. It's the usual standards-body documentation, same as W3C: "Instead of only some people understanding how this works, let's encode the documentation in legalese (and have twenty only-slightly-different versions) to make sure that nobody understands how it works."

If anyone is interested, ICU's documentation is far more readable (and intended for programmers). ICU is *the* reference library for dealing with unicode (an IBM open source product, with C/C++/Java interfaces), used by many other products in the background.
ICU: http://site.icu-project.org/
user guide: http://userguide.icu-project.org/
section about text segmentation: http://userguide.icu-project.org/boundaryanalysis

Note that, just like Unicode, they consider forming graphemes (grouping codes into character representations) a simple particular case of text segmentation, which they call "boundary analysis" (but they have the nice idea to use "character" instead of "grapheme").

The only mention I found in ICU's doc of the issue we have discussed here at length is (at http://userguide.icu-project.org/strings):

"Handling Lengths, Indexes, and Offsets in Strings
The length of a string and all indexes and offsets related to the string are always counted in terms of UChar code units, not in terms of UChar32 code points. (This is the same as in common C library functions that use char * strings with multi-byte encodings.)
Often, a user thinks of a "character" as a complete unit in a language, like an 'Ä', while it may be represented with multiple Unicode code points including a base character and combining marks. (See the Unicode standard for details.) This often requires users to index and pass strings (UnicodeString or UChar *) with multiple code units or code points. It cannot be done with single-integer character types. Indexing of such "characters" is done with the BreakIterator class (in C: ubrk_ functions).
Even with such "higher-level" indexing functions, the actual index values will be expressed in terms of UChar code units. When more than one code unit is used at a time, the index value changes by more than one at a time. [...]"

(ICU's UChar is like D's wchar.)
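The same "indexes move by more than one code unit" behaviour can be sketched in D rather than with ICU's BreakIterator. This assumes a later Phobos release that provides `std.uni.graphemeStride`; the helper name `graphemeOffsets` is hypothetical.

```d
import std.uni : graphemeStride;

// Code-unit offsets at which each grapheme of `s` starts.
// The index advances by the size of a whole grapheme, not by 1.
size_t[] graphemeOffsets(string s)
{
    size_t[] offsets;
    for (size_t i = 0; i < s.length; i += graphemeStride(s, i))
        offsets ~= i;
    return offsets;
}

void main()
{
    // 'A' + U+0308 COMBINING DIAERESIS (1 + 2 UTF-8 code units), then 'b'
    assert(graphemeOffsets("A\u0308b") == [0, 3]);
    // Plain ASCII: every grapheme is a single code unit
    assert(graphemeOffsets("abc") == [0, 1, 2]);
}
```

This is close in spirit to the stepSize primitive proposed at the top of the thread, only applied at the grapheme level instead of the code point level.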
 You can also say there are 2 kinds of characters: simple like "u"&
 composite "ü" or "ṵ̈̈". The former are coded with a single (base) code,
 the latter with one (rarely more) base codes and an arbitrary number of
 combining codes.

Couple questions about the "more than one base codes": - Do you know an example offhand?

No. I know this only from it being mentioned in documentation. Unless we consider (see below) L jamo as base codes.
 - Does that mean like a ligature where the base codes form a single glyph,
 or does it mean that the combining code either spans or operates over
 multiple glyphs? Or can it go either way?

IIRC, examples like ij in Dutch are only considered "compatibility equivalent" to the corresponding ligatures, just like e.g. "ss" for "ß" in German. Meaning they should not be considered equal by default (this would be an additional feature, and language- and app-dependent). Unlike base "e" + combining "^", which really == "ê".
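The difference between the two kinds of equivalence can be sketched with Unicode normalisation forms, assuming a later Phobos release that provides `std.uni.normalize` with `NFC` and `NFKC`. The helper names are hypothetical, and the example uses only the ĳ ligature (U+0133), whose compatibility decomposition is "ij".

```d
import std.uni : NFC, NFKC, normalize;

// Equal under canonical equivalence only.
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

// Equal once compatibility mappings are applied as well.
bool compatibilityEqual(string a, string b)
{
    return normalize!NFKC(a) == normalize!NFKC(b);
}

void main()
{
    // base 'e' + U+0302 COMBINING CIRCUMFLEX ACCENT vs precomposed U+00EA
    assert(canonicallyEqual("e\u0302", "\u00EA"));

    // U+0133 LATIN SMALL LIGATURE IJ vs the two letters "ij"
    assert(!canonicallyEqual("\u0133", "ij"));
    assert(compatibilityEqual("\u0133", "ij"));
}
```

Whether an application should treat compatibility-equivalent text as equal remains a design choice, as noted above.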
 For a majority of _common_ characters made of 2 or 3 codes (western
 language letters, korean Hangul syllables,...), precombined codes have
 been added to the set. Thus, they can be coded with a single code like
 simple characters.

Out of curiosity, how do decomposed Hangul characters work? (Or do you know?) Not actually knowing any Korean, my understanding is that they're a set of 1 to 4 phonetic glyphs that are then combined into one glyph. So, is it like a series of base codes that automatically combine, or are there combining characters involved?

I know nothing about the Korean language except what I studied about its scripting system for Unicode algorithms (but one can also code said algorithms blindly). See http://en.wikipedia.org/wiki/Hangul and, about Hangul in Unicode, http://en.wikipedia.org/wiki/Korean_language_and_computers.

What I understand (beware, it's just wild deductions) is that there are 3 kinds of "jamo" scripting marks (noted L, V, T) that can combine into syllabic "graphemes", respectively in first, median and last place. These marks indeed somehow correspond to vocalic or consonantal phonemes. In unicode, in addition to such jamo, which are simple marks (like base letters and diacritics in latin-based languages), there are precombined codes for LV and LVT combinations (like for "ä" or "û"). We could thus think that Hangul syllables are limited to 3 jamo.

But: according to Unicode's official "grapheme cluster boundary" algorithm (read: how to group codepoints into characters) (http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries), codes for L jamo can also be followed by _and_ should be combined with other L, LV or LVT codes. Similarly, LV or V should be combined with V or VT, and LVT or T with T. (Seems logical.) So, I do not know how complicated a Hangul syllable can be in practice or in theory. If there can be, in practice, whole syllables following schemes other than L / LV / LVT, then this is another example of whole characters of a real language that cannot be coded by a single codepoint. Denis _________________ vita es estrany spir.wikidot.com
Jan 16 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/14/2011 07:33 AM, Andrei Alexandrescu wrote:
 Thanks. One further question is: in the above example with
 u-with-umlaut, there is one code point that corresponds to the entire
 combination. Are there combinations that do not have a unique code point?

See my previous follow-up to Nick's explanation. But the answer is yes, not only for usual characters, but also because a user is, theoretically and practically, totally free to combine base and combining codes, even to invent characters. The only limit is that fonts will not know how to display improbable combinations. (See also my presentation text, which shows an example of dots below and above Greek letters.) Denis _________________ vita es estrany spir.wikidot.com
Jan 14 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/14/2011 07:44 AM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:igoqrm$1n5r$1 digitalmars.com...
 On 1/13/11 10:26 PM, Nick Sabalausky wrote:
 [snip]
 [ 'f', {u with the umlaut}, 'n', 'f' ]

 Or:

 [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

 Those *both* get rendered exactly the same, and both represent the same
 four-letter sequence. In the second example, the 'u' and the {umlaut
 combining character} combine to form one grapheme. The f's and n's just
 happen to be single-code-point graphemes.

 Note that while some characters exist in pre-combined form (such as the
 {u with the umlaut} above), legend has it there are others that can only be
 represented using a combining character.

 It's also my understanding, though I'm not certain, that sometimes multiple
 combining characters can be used together on the same "root" character.

Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?

My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain.

The problem is then whether a font knows how to display it. My usual fonts (DejaVu series, pretty good with Unicode) show: 7̈ meaning they do not know how to combine digits with diacritics (they do it well with other rather strange combinations). But: one of the relevant advantages of decomposed forms is that when fonts don't know the character, they can still show at least the component marks, here '7' & '¨'. Which is better than nothing for a user who knows the script. If I try to display, for instance, a _precomposed_ syllable from a language my font does not know, I will instead get either a little square with the code point written inside in tiny digits, or a placeholder like an inverse-video "?". denis _________________ vita es estrany spir.wikidot.com
Jan 14 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/14/2011 01:52 PM, Daniel Gibson wrote:
 Am 14.01.2011 07:26, schrieb Nick Sabalausky:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org> wrote in message
 news:igoj6s$17r6$1 digitalmars.com...
 I'm not so sure about that. What do you base this assessment on? Denis
 wrote a library that according to him does grapheme-related stuff nobody
 else does. So apparently graphemes is not what people care about
 (although
 it might be what they should care about).

It's what they want, they just don't know it. Graphemes are what many people *think* code points are.

Agreed. Up until spir mentioned graphemes in this newsgroup I always thought that one Unicode code point == one character on the screen. I guess in the majority of use cases you want to operate on user perceived characters.

That's what makes sense for the user in 99.9% of cases, thus that's what makes sense for the programmer, thus that's what makes sense for the language/type/lib designer. denis _________________ vita es estrany spir.wikidot.com
Jan 14 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/14/2011 02:37 PM, Steven Schveighoffer wrote:
 * I don't even know how to make a grapheme that is more than one
 code-unit, let alone more than one code-point :)  Every time I try, I
 get 'invalid utf sequence'.

 I feel significantly ignorant on this issue, and I'm slowly getting
 enough knowledge to join the discussion, but being a dumb American who
 only speaks English, I have a hard time grasping how this shit all works.

1. See my text at https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction
2. writeln ("A\u0308\u0330"); <A + umlaut above + tilde below>

If it does not display properly, either set your terminal to a UTF encoding or use a more Unicode-aware font (eg DejaVu series).

The point is not to play like that with Unicode's flexibility, but rather that composite characters are just normal things in most languages of the world. Actually, on this point, English is a rare exception (discarding letters imported from foreign languages like French 'à'), to the point of being, I guess, the only western language without any diacritic.

Denis _________________ vita es estrany spir.wikidot.com
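To see what that literal actually contains, here is a minimal sketch. codePointNames is a hypothetical helper invented for this illustration, and the grapheme check assumes std.uni.byGrapheme from a recent Phobos.

```d
import std.format : format;
import std.range : walkLength;
import std.uni : byGrapheme;

// Lists the code points of s in the usual U+XXXX notation.
string[] codePointNames(string s)
{
    string[] names;
    foreach (dchar c; s)            // foreach with dchar decodes UTF-8
        names ~= format("U+%04X", cast(uint) c);
    return names;
}

unittest
{
    string s = "A\u0308\u0330";
    // U+0308 is COMBINING DIAERESIS, U+0330 is COMBINING TILDE BELOW.
    assert(codePointNames(s) == ["U+0041", "U+0308", "U+0330"]);
    assert(s.length == 5);                   // UTF-8 code units
    assert(s.byGrapheme.walkLength == 1);    // one user-perceived character
}
```

So a grapheme of more than one code point needs nothing more exotic than a base letter followed by combining marks in a valid UTF-8 literal.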
Jan 14 2011
prev sibling next sibling parent Gerrit Wichert <gwichert yahoo.com> writes:
Am 14.01.2011 15:34, schrieb Steven Schveighoffer:
 Is it common to have multiple modifiers on a single character?  The
 problem I see with using decomposed canonical form for strings is that
 we would have to return a dchar[] for each 'element', which severely
 complicates code that, for instance, only expects to handle English.

 I was hoping to lazily transform a string into its composed canonical
 form, allowing the (hopefully rare) exception when a composed
 character does not exist.  My thinking was that this at least gives a
 useful string representation for 90% of usages, leaving the remaining
 10% of usages to find a more complex representation (like your Text
 type).  If we only get like 20% or 30% there by making dchar the
 element type, then we haven't made it useful enough.

be better for a language not to 'translate' by default. If the user wants to convert the code points, this can be requested on demand. But premature default conversion is a subtle way to lose information that may be important. Imagine we want to write a tool for dealing with the input/output of some other ignorant legacy software. Even if it is only text files, that software may choke on some converted input. So I believe it is very important that we are able to reproduce strings in exactly the form in which we read them in. Gerrit
Jan 14 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday 15 January 2011 20:45:53 Michel Fortin wrote:
 On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisProg gmx.com> said:
 On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
 I have my idea.

 I think a good idea is to improve upon Andrei's first idea --
 which was to treat char[], wchar[], and dchar[] all as ranges of dchar
 elements -- by changing the element type to be the same as the string.
 For instance, iterating on a char[] would give you slices of char[],
 each having one grapheme.

 The second component would be to make the string equality operator (==)
 for strings compare them in their normalized form, so that ("e" with
 combining acute accent) == (pre-combined "é"). I think this would make
 D support for Unicode much more intuitive.

 This implies some semantic changes, mainly that everywhere you write a
 "character" you must use double-quotes (string "a") instead of single
 quote (code point 'a'), but from the user's point of view that's pretty
 much all there is to change.

 There'll still be plenty of room for specialized algorithms, but their
 purpose would be limited to optimization. Correctness would be taken
 care of by the basic range interface, and foreach should follow suit
 and iterate by grapheme by default.

 I wrote this example (or something similar) earlier in this thread:
 	foreach (grapheme; "exposé")
 		if (grapheme == "é")
 			break;

 In this example, even if one of these two strings use the pre-combined
 form of "é" and the other uses a combining acute accent, the equality
 would still hold since foreach iterates on full graphemes and ==
 compares using normalization.
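The semantics proposed in the quoted example (iterate by grapheme slice, compare in normalized form) can be approximated in library code. This is only a sketch under assumptions: it relies on std.uni.graphemeStride and std.uni.normalize from a recent Phobos, and graphemeSlices, sameGrapheme and containsGrapheme are hypothetical names, not anything proposed in the thread.

```d
import std.uni : graphemeStride, normalize, NFC;

// Splits s into slices of the original string, one grapheme each.
string[] graphemeSlices(string s)
{
    string[] slices;
    for (size_t pos = 0; pos < s.length; )
    {
        immutable len = graphemeStride(s, pos); // length in code units
        slices ~= s[pos .. pos + len];
        pos += len;
    }
    return slices;
}

// Equality after canonical normalization, standing in for the proposed ==.
bool sameGrapheme(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

// The foreach/break loop from the quoted post, written as a search.
bool containsGrapheme(string text, string target)
{
    foreach (g; graphemeSlices(text))
        if (sameGrapheme(g, target))
            return true;
    return false;
}

unittest
{
    string decomposed = "expose\u0301";      // 'e' + combining acute accent
    assert(graphemeSlices(decomposed).length == 6);
    assert("e\u0301" != "\u00E9");           // built-in == compares code units
    assert(containsGrapheme(decomposed, "\u00E9"));
}
```

Here each element is a slice of the original string, which is the "element type same as the string" design; a separate grapheme type would wrap the same slice.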

I think that that would cause definite problems. Having the element type of the range be the same type as the range seems like it could cause a lot of problems in std.algorithm and the like, and it's _definitely_ going to confuse programmers. I'd expect it to be highly bug-prone. They _need_ to be separate types.

I remember that someone already complained about this issue because he had a tree of ranges, and Andrei said he would take a look at this problem eventually. Perhaps now would be a good time.
 Now, given that dchar can't actually work completely as an element
 type, you'd either need the string type to be a new type or the element
 type to be a new type. So, either the string type has char[], wchar[],
 or dchar[] for its element type, or char[], wchar[], and dchar[] have
 something like uchar as their element type, where uchar is a struct
 which contains a char[], wchar[], or dchar[]
 which holds a single grapheme.

Having a new type for grapheme would work too. My preference still goes to reusing the string type because it makes the semantics simpler to understand, especially when comparing graphemes with literals.

If a character literal actually became a grapheme instead of a dchar, then that would likely solve that issue. But I fear that the semantics of having a range be its own element type actually make understanding it _harder_, not simpler. Being forced to compare string literals against what should be a character would definitely confuse programmers. Making a new character or grapheme type which represented a grapheme would be _far_ simpler to understand IMO. However, making it work really well would likely require that the compiler know about the grapheme type like it knows about dchar. - Jonathan M Davis
Jan 15 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/15/2011 05:59 PM, Steven Schveighoffer wrote:
 I think this is a good alternative, but I'd rather not impose this on
 people like myself who deal mostly with English.  I think this should be
 possible to do with wrapper types or intermediate ranges which have
 graphemes as elements (per my suggestion above).

I am unsure now about the question of a text's (apparent) natural language in relation to Unicode issues. For instance English, precisely, seems to often include foreign words literally (or is it a kind of pedantry from highly educated people?). In fact, users are free to include whatever characters they like, as long as their text-composition interface allows it. All main OSes, I guess, now have at least one standard way to type in characters (or code points) that are not directly accessible on keyboards, and applications sometimes offer another. Some kinds of users love to play with such flexibility. So, maybe, the right question is not the one of natural language but of text-composition means. I guess that as soon as a human user may have freely typed or edited a text, we cannot guarantee much about its actual content, what do you think? The case of historic ASCII-only text is relevant, indeed, but will fast become less so. And how does an application writer recognise such texts without iterating over the whole content? (The encoding is UTF-8 compatible.) Denis _________________ vita es estrany spir.wikidot.com
Jan 17 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/15/2011 08:51 PM, Steven Schveighoffer wrote:
 Moreover, even if you ignore Hebrew as a tiny insignificant minority
 you cannot do the same for Arabic which has over one *billion* people
 that use that language.

I hope that the medium type works 'good enough' for those languages, with the high level type needed for advanced usages. At a minimum, comparison and substring should work for all languages.

Hello Steven,

How does an application know that a given text, which supposedly is written in a given natural language (as for instance indicated by an HTML header), does not also hold terms from other languages? There are various occasions for this: quotations, use of foreign words, pointers...

A side issue is raised by precomposed codes for composite characters. For most languages of the world, I guess (but am unsure), all "official" characters have single-code representations. Good, but unfortunately this is not enforced by the standard (instead, the decomposed form can sensibly be considered the base form, but this is another topic). So even if one knows for sure that all characters of all texts an app will ever deal with can be mapped to single codes, to be safe one would have to normalise to NFC anyway (Normalization Form C, i.e. composed). Then, where is the actual gain? In fact, it is a loss because NFC is more costly than NFD (Normalization Form D, i.e. decomposed): the standard NFC algorithm first decomposes to NFD to get a unique representation that can then be more easily (re)composed via simple mappings.

For further information:
Unicode's normalisation algorithms: http://unicode.org/reports/tr15/
list of technical reports: http://unicode.org/reports/
(Unicode's technical reports are far more readable than the standard itself, but unfortunately often refer to it.)

Denis _________________ vita es estrany spir.wikidot.com
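A small sketch of the NFC/NFD point, assuming std.uni.normalize from a recent Phobos; nfcCodePoints and nfdCodePoints are hypothetical helper names. It shows that NFC yields one code point only when a precomposed code exists, so normalising to NFC does not by itself guarantee single-code characters.

```d
import std.range : walkLength;
import std.uni : normalize, NFC, NFD;

// Code point count of s after canonical composition (NFC).
size_t nfcCodePoints(string s)
{
    return normalize!NFC(s).walkLength;
}

// Code point count of s after canonical decomposition (NFD).
size_t nfdCodePoints(string s)
{
    return normalize!NFD(s).walkLength;
}

unittest
{
    // 'A' + combining diaeresis has a precomposed form, U+00C4.
    assert(nfcCodePoints("A\u0308") == 1);
    assert(nfdCodePoints("\u00C4") == 2);

    // '7' + combining diaeresis has no precomposed form:
    // even in NFC it remains two code points for one character.
    assert(nfcCodePoints("7\u0308") == 2);
}
```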
Jan 17 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/15/2011 09:46 PM, foobar wrote:
 I'd like to have full Unicode support. I think it is a good thing for D to
 have in order to expand in the world. As an alternative, I'd settle for loud
 errors that make absolutely clear to the non-Unicode expert programmer that D
 simply does NOT support e.g. Normalization.

 As Spir already said, Unicode is something few understand and even its own
 official docs do not explain such issues properly. We should not confuse users
 even further with incomplete support.

In a few days, D will have an external library able to deal with those issues, hopefully correctly and clearly for client programmers. Possibly, its design is not the best possible approach (especially for efficiency: Michel made me doubt that, and my competence in this field is close to nothing). But it has the merit of existing and providing a clear example of the correct semantics. Let us use it as a base for experimentation. Then, everything can be redesigned from scratch if we realise I was initially completely wrong. In any case, it would certainly be a far easier and faster job to do now, after having explored the issues at length, and with a reference implementation at hand. Denis _________________ vita es estrany spir.wikidot.com
Jan 17 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/15/2011 11:45 PM, Michel Fortin wrote:
 That said, I'm sure if someone could redesign Unicode by breaking
 backward-compatibility we'd have something simpler. You could probably
 get rid of pre-combined characters and reduce the number of
 normalization forms. But would you be able to get rid of normalization
 entirely? I don't think so. Reinventing Unicode is probably not worth it.

I think like you about pre-composed characters: they bring no real gain (even for easing passage from historic charsets, since texts must be decoded anyway, so mapping to single or multiple codes costs nothing). But they complicate the design by proposing 2 parallel representation schemes (one character <--> one "code pile" versus one character <--> one precomposed code). And they put much weight on the back of software (and programmers) for correct indexing/slicing and comparison, search, count, etc. This is where normalisation forms enter the game.

My choice would be:
* decomposed form only
* ordering imposed by the standard at text-composition time
==> no normalisation, because everything is normalised from the start.

What remains is only what I call "piling". But we cannot easily get rid of it without separators in the standard UTF encodings.

I had the idea of UTF-33 ;-): an alternative, freely agreed-upon encoding that just says (in addition to UTF-32) that the content is already normalised (NFD decomposed and ordered), either so produced initially or already processed. So that software can happily read texts in and only think about piling if needed. UTF-33+ would add "grapheme" separators (a costly solution in terms of space) to get rid of piling. The aim indeed being to avoid stupidly doing the same job multiple times on the same text.

Denis _________________ vita es estrany spir.wikidot.com
Jan 17 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/14/2011 04:50 PM, Michel Fortin wrote:
 This might be a good time to see whether we need to address graphemes
 systematically. Could you please post a few links that would educate
 me and others in the mysteries of combining characters?

As usual, Wikipedia offers a good summary and a couple of references. Here's the part about combining characters: <http://en.wikipedia.org/wiki/Combining_character>.

There's basically four ranges of code points which are combining:
- Combining Diacritical Marks (0300–036F)
- Combining Diacritical Marks Supplement (1DC0–1DFF)
- Combining Diacritical Marks for Symbols (20D0–20FF)
- Combining Half Marks (FE20–FE2F)

A code point followed by one or more code points in these ranges is conceptually a single character (a grapheme).
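The rule of thumb above can be written down directly. This is a deliberately naive sketch: it checks only the four blocks listed, ignores everything else in the Unicode segmentation rules (Hangul jamo, prepended marks, and so on), and both function names are hypothetical.

```d
// True if c lies in one of the four combining blocks listed above.
bool inCombiningBlock(dchar c)
{
    return (c >= 0x0300 && c <= 0x036F)   // Combining Diacritical Marks
        || (c >= 0x1DC0 && c <= 0x1DFF)   // ... Supplement
        || (c >= 0x20D0 && c <= 0x20FF)   // ... for Symbols
        || (c >= 0xFE20 && c <= 0xFE2F);  // Combining Half Marks
}

// Naive character count: every code point outside those blocks starts
// a new character; code points inside them attach to the previous one.
size_t naiveCharacterCount(string s)
{
    size_t count;
    foreach (dchar c; s)
        if (!inCombiningBlock(c))
            ++count;
    return count;
}

unittest
{
    assert(inCombiningBlock('\u0308'));
    assert(!inCombiningBlock('u'));
    // Both spellings of the four-letter example count as 4 characters.
    assert(naiveCharacterCount("f\u00FCnf") == 4);
    assert(naiveCharacterCount("fu\u0308nf") == 4);
}
```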

Unfortunately, things are complicated by _prepend_ combining marks that occur in a code sequence _before_ the base mark. The Unicode algorithm is described here: http://unicode.org/reports/tr29/ section 3 (humanly readable ;-). See especially the first table in section 3.1. Denis _________________ vita es estrany spir.wikidot.com
Jan 17 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/15/2011 12:21 AM, Michel Fortin wrote:
 Also, it'd really help this discussion to have some hard numbers about
 the cost of decoding graphemes.

Text has a perf module that provides such numbers (on different stages of Text object construction), but the measured algorithms are not yet stabilised, so said numbers regularly change (in the right direction ;-). You can try the current version at https://bitbucket.org/denispir/denispir-d/src (the perf module is called chrono.d).

For information, recently, the cost of full text construction (decoding, normalisation (both decomposition & ordering), piling) was about 5 times that of decoding alone, the heavy part (~ 70%) being piling. But Stephan just informed me about a new gain in piling that I have not yet tested. This performance places our library in between Windows native tools and ICU in terms of speed, which is imo rather good for a brand new tool written in a still unstable language.

I have carefully read your arguments that Text's approach of systematically "piling" and normalising source texts is not the right one from an efficiency point of view, even for strict use cases of universal text manipulation (because the relative space cost would indirectly cause time cost due to cache effects). Instead, you state we should "pile" and/or normalise on the fly. But I am, similarly to you, rather doubtful on this point without any numbers available. So, let us produce some benchmark results on both approaches if you like.

Denis _________________ vita es estrany spir.wikidot.com
Jan 17 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/17/2011 01:44 PM, Steven Schveighoffer wrote:
 On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 1/15/11 9:25 PM, Jonathan M Davis wrote:
 Considering that strings are already dealt with specially in order to have an
 element of dchar, I wouldn't think that it would be all that disruptive to make
 it so that they had an element type of Grapheme instead. Wouldn't that then fix
 all of std.algorithm and the like without really disrupting anything?

It would make everything related a lot (a TON) slower, and it would break all client code that uses dchar as the element type, or is otherwise unprepared to use Graphemes explicitly. There is no question there will be disruption.

I would have agreed with you last week. Now I understand that using dchar is just as useless for unicode as using char. Will it be slower? Perhaps. A TON slower? Probably not. But it will be correct. Correct and slow is better than incorrect and fast. If I showed you a shortest-path algorithm that ran in O(V) time, but didn't always find the shortest path, would you call it a success? We need to get some real numbers together. I'll see what I can create for a type, but someone else needs to supply the input :) I'm on short supply of unicode data, and any attempts I've made to create some result in failure. I have one example of one composed character in this thread that I can cling to, but in order to supply some real numbers, we need a large amount of data. -Steve

Hello Steve & Andrei,

I see 2 questions:
(1) whether we should provide Unicode correctness by default or not, and the related points of level of abstraction & normalisation;
(2) what is the best way to implement such correctness?

Let us put aside (1) for a while; anyway, nothing prevents us from experimenting while waiting for an agreement, and such experiments would in fact feed the debate with real facts instead of "airy" ideas.

It seems there are 2 opposite approaches to Unicode correctness. Mine was to build a type that systematically abstracts away UCS-created issues (that real whole characters are coded by mini-arrays of codes I call "code piles", that those piles have variable lengths, _and_ that characters may even have several representations). Then, in my wild guesses, every text manipulation method should obviously be "flash fast", actually faster than any on-the-fly algorithm by several orders of magnitude. But Michel made me doubt that point.

The other approach is precisely to provide the needed abstraction ("piling" and normalisation) on the fly, as proposed by Michel, and as Objective-C does, IIUC. This way seems to me closer to a kind of redesign of Steven's new String type and/or Andrei's VLERange.

As you say, we need real timing numbers to decide. I think we should measure at least 2 routines:
* indexing (or better, iteration?), which only requires "piling"
* counting occurrences of a given character or slice, which requires both piling and normalisation

I do not feel like implementing such routines for the on-the-fly version, and have no time for this in the coming days; but if anyone volunteers, feel free to rip code and data from Text's current implementation if it may help.

As source text, we can use the one at https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt (already my source for perf measures). Its only merit is to be a text (about Unicode!) in twelve rather different languages.

[My intuitive guess is that Michel is wrong by orders of magnitude, but again I know next to nothing about code performance.]

Denis _________________ vita es estrany spir.wikidot.com
Jan 17 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/17/2011 05:34 PM, Michel Fortin wrote:
 As I said: all those people who are not validating the inputs to make
 sure they don't contain combining code points. As far as I know, no one
 is doing that, so that means everybody should use algorithms capable of
 handling multi-code-point graphemes. If someone indeed is doing this
 validation, he'll probably also be smart enough to make his algorithms
 to work with dchars.

Actually, there are at least 2 special cases:
* apps that only deal with pre-Unicode source material
* apps that only deal with source material "mechanically" generated by text-producing software which itself guarantees single-code-only graphemes

Denis _________________ vita es estrany spir.wikidot.com
Jan 17 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/17/2011 04:00 PM, Andrei Alexandrescu wrote:
 On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
 We need to get some real numbers together. I'll see what I can create
 for a type, but someone else needs to supply the input :) I'm on short
 supply of unicode data, and any attempts I've made to create some result
 in failure. I have one example of one composed character in this thread
 that I can cling to, but in order to supply some real numbers, we need a
 large amount of data.

Oh, one more thing. You don't need a lot of Unicode text containing combining characters to write benchmarks. (You do need it for testing purposes.) Most text won't contain combining characters anyway, so after you implement graphemes, just benchmark them on regular text.

Correct. For this reason, we do not use the same source at all for correctness and performance testing. It is impossible to define a typical or representative source (who judges?). But at the very minimum, source texts for perf measurement should mix languages as diverse as possible, including some material from the ones known to be problematic and/or atypical (English, Korean, Hebrew...). The following (ripped and composed from ICU data sets) is just that: https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt

Content: 12 natural languages
34767 bytes = UTF-8 code units
--> 20133 code points
--> 22033 normal codes (NFD decomposed)
--> 19205 piles = true characters

Denis _________________ vita es estrany spir.wikidot.com
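A rough sketch of how such counts, and a first on-the-fly timing, might be gathered for any source text. It is not the Text library's perf module: it assumes std.uni and std.datetime.stopwatch from a recent Phobos, TextStats and measure are hypothetical names, and a single pass like this is far too crude to settle the performance question.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFD;

struct TextStats
{
    size_t codeUnits;       // UTF-8 bytes
    size_t codePoints;
    size_t nfdCodePoints;   // code points after NFD decomposition
    size_t graphemes;       // "piles"
    Duration decodeTime;    // time to walk the text by code point
    Duration graphemeTime;  // time to walk the text by grapheme
}

TextStats measure(string text)
{
    TextStats st;
    st.codeUnits = text.length;

    auto sw = StopWatch(AutoStart.yes);
    st.codePoints = text.walkLength;          // decodes every code point
    st.decodeTime = sw.peek();

    st.nfdCodePoints = normalize!NFD(text).walkLength;

    sw.reset();                               // keeps running, from zero
    st.graphemes = text.byGrapheme.walkLength;
    st.graphemeTime = sw.peek();
    return st;
}

unittest
{
    auto st = measure("fu\u0308nf");
    assert(st.codeUnits == 6 && st.codePoints == 5);
    assert(st.nfdCodePoints == 5 && st.graphemes == 4);
}
```

For real numbers one would read a file such as the unicode.txt above (for instance with std.file.readText) and repeat each measurement many times.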
Jan 17 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/18/2011 04:48 AM, Michel Fortin wrote:
 On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin michelf.com>
 said:

 More seriously, you have four choices:

 1. code unit
 2. code point
 3. grapheme
 4. require the client to state explicitly which kind of 'character' he
 wants; 'character' being an overloaded word, it's reasonable to ask
 for disambiguation.

This makes me think of what I did with my XML parser after you made code points the element type for strings. Basically, the parser now uses 'front' and 'popFront' whenever it needs to get the next code point, but most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I had to add) when testing for or skipping an ASCII character is sufficient. This way I avoid a lot of unnecessary decoding of code points. For this to work, the same range must let you skip either a unit or a code point. If I were using a separate range with a call to toDchar or toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't have helped much because the new range would essentially become a new slice independent of the original, so you can't interleave "I want to advance by one unit" with "I want to advance by one code point". So perhaps the best interface for strings would be to provide multiple range-like interfaces that you can use at the level you want. I'm not sure if this is a good idea, but I thought I should at least share my experience.

This looks like a very interesting approach. And clear. I guess range synchronisation would be based on an internal lowest-level (code unit) index. Then, you also need internal validity-checking and/or offsetting routines when a higher-level range is used after a lower-level one has been used. (I mean eg to ensure start-of-code-point after a code unit popFront, or throw an error.) Also, how to avoid duplicating many operational functions (eg finding a given slice) for each level? Denis _________________ vita es estrany spir.wikidot.com
Jan 18 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/18/2011 06:48 AM, Jonathan M Davis wrote:
 On Monday 17 January 2011 15:13:42 spir wrote:
 See range bug evoked above. opApply is the only workaround AFAIK.
 Also, ranges cannot yet provide indexed iteration like
 	foreach(i, char ; text) {...}

While it would be nice at times to be able to have an index with foreach when using ranges, I would point out that it's trivial to just declare a variable which you increment each iteration, so it's easy to get an index even when using foreach with ranges. Certainly, I wouldn't consider the lack of index with foreach and ranges a good reason to use opApply instead of ranges. There may be other reasons which make it worthwhile, but it's so trivial to get an index that the loss of range abilities (particularly the ability to use such ranges with std.algorithm) dwarfs it in importance.

You are right. I fully agree, in fact. On the other hand, think of the expectations of users of a library providing iteration on "naturally" sequential things. The point is that D makes indexed iteration available elsewhere. Denis _________________ vita es estrany spir.wikidot.com
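For what it is worth, indexed iteration over a range can be obtained without opApply. The sketch below assumes a recent Phobos, where std.range.enumerate pairs each element with a running index and std.uni.byGrapheme supplies the graphemes; numberedGraphemes is a hypothetical helper.

```d
import std.array : array;
import std.conv : to;
import std.range : enumerate;
import std.uni : byGrapheme;

// Returns one "index: grapheme" line per user-perceived character.
string[] numberedGraphemes(string s)
{
    string[] lines;
    foreach (i, g; s.byGrapheme.enumerate)
    {
        // g[] is a range over the grapheme's code points;
        // collect it into a dchar[] and transcode back to UTF-8.
        lines ~= i.to!string ~ ": " ~ g[].array.to!string;
    }
    return lines;
}

unittest
{
    assert(numberedGraphemes("fu\u0308nf")
        == ["0: f", "1: u\u0308", "2: n", "3: f"]);
}
```

The index here counts graphemes, not code units, so it cannot be used to slice the underlying string directly.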
Jan 18 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 01/18/2011 06:14 PM, Michel Fortin wrote:

On 2011-01-18 11:38:45 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:
 I was thinking along the lines of:

 struct Grapheme
 {
     private string support_;
     ...
 }

 struct ByGrapheme
 {
     private string iteratee_;
     bool empty();
     Grapheme front();
     void popFront();
     // Additional funs
     dchar frontCodePoint();
     void popFrontCodePoint();
     char frontCodeUnit();
     void popFrontCodeUnit();
     ...
 }

 // helper function
 ByGrapheme byGrapheme(string s);

 // usage
 string s = ...;
 size_t i;
 foreach (g; byGrapheme(s))
 {
     writeln("Grapheme #", i, " is ", g);
     ++i; // advance the counter, which the loop otherwise never updates
 }

 We need this range in Phobos.

Yes, we need a grapheme range. But that's not what my thing was about. It was about shortcutting code point decoding when it isn't necessary while still keeping the ability to decode to code points when iterating on the same range. For instance, here's a simple made up example:

	string s = "<hello>";
	if (!s.empty && s.frontUnit == '<')
		s.popFrontUnit(); // skip
	while (!s.empty && s.frontUnit != '>')
		s.popFront(); // do something with each code point
	if (!s.empty && s.frontUnit == '>')
		s.popFrontUnit(); // skip
	assert(s.empty);

Here, since I know I'm testing and skipping for '<', an ASCII character, decoding the code point is wasted time, so I skip that decoding. The problem is that this optimization can't happen with a range that abstracts things at the code point level. I can do it with strings because strings still allow you to access code units through the indexing operators, but this can't really apply to ranges of code points in general. And parsing with a range of code units would also be a pain, because even if I'm testing for '<' for the first character, sometimes I really need to advance by code point and test for code points.
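One way to picture a range that can advance at either level is sketched below. frontUnit and popFrontUnit come from the post, but this particular struct, its name UnitAndPointRange and the parseTag helper are assumptions made for illustration; the actual XML parser code is not shown in the thread. Decoding uses std.utf.decode.

```d
import std.utf : decode;

// A string wrapper that can step by code unit or by code point.
struct UnitAndPointRange
{
    string s;

    @property bool empty() { return s.length == 0; }

    // Code unit level: no decoding at all.
    @property char frontUnit() { return s[0]; }
    void popFrontUnit() { s = s[1 .. $]; }

    // Code point level: decode one UTF-8 sequence.
    @property dchar front()
    {
        size_t i = 0;
        return decode(s, i);
    }
    void popFront()
    {
        size_t i = 0;
        decode(s, i);      // i now holds the sequence length in code units
        s = s[i .. $];
    }
}

// The "<hello>" example: ASCII delimiters are tested and skipped as
// code units, while the content is consumed code point by code point.
dstring parseTag(string input)
{
    auto r = UnitAndPointRange(input);
    dstring content;
    if (!r.empty && r.frontUnit == '<')
        r.popFrontUnit();
    while (!r.empty && r.frontUnit != '>')
    {
        content ~= r.front;
        r.popFront();
    }
    if (!r.empty && r.frontUnit == '>')
        r.popFrontUnit();
    assert(r.empty);
    return content;
}
```

Because both levels share the same underlying slice, unit steps and code point steps can be interleaved freely; the caller is responsible for using popFrontUnit only where the front unit is known to be ASCII.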

This means a single string type that exposes various _synchronised_ range levels (code unit, code point, grapheme), doesn't it? As opposed to Andrei's approach of ranges being structures external to string types, IIUC, which thus move on independently?
 One thing that might be interesting is benchmarking my XML parser by
 replacing every instance of frontUnit and popFrontUnit with front and
 popFront. That won't change the results, but it'd give us an idea of
 the overhead of unnecessarily decoding code points.

Yes, would you have time to do it? I would be interested in such perf measurements. (--> your idea about a Text variant, for which I would like to know whether it's still worth decoding systematically.) Denis _________________ vita es estrany spir.wikidot.com
Jan 18 2011
prev sibling parent spir <denis.spir gmail.com> writes:
On 01/18/2011 06:11 AM, Ali Çehreli wrote:
 Thanks to all who have contributed, I am also following this thread with
 great interest. :)

 Michel Fortin wrote:
  > I mean, a grapheme is a slice of a string, can have multiple code points
  > (like a string), can be appended the same way as a string, can be
  > composed or decomposed using canonical normalization or compatibility
  > normalization (like a string), and should be sorted, uppercased, and
  > lowercased according to Unicode rules (like a string). Basically, a
  > grapheme is just a string that happens to contain only one grapheme.

 I would like to stress the fact that Unicode knows nothing about
 sorting, uppercasing, or lowercasing.

 Those operations are tied to the alphabet (or writing system) that a
 certain grapheme happens to belong to at a given time. For example, we
 cannot uppercase the letter i without knowing what alphabet we are
 dealing with. Two possibilities: I and İ (I dot above).

 It is the same issue with sorting.
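The point about the letter i can be illustrated with a minimal sketch. It assumes a hypothetical helper, upperFor, that takes a flag for Turkish rules; only std.uni.toUpper is real Phobos API, and its single-character overload has no language parameter, so it cannot produce İ from i by itself.

```d
import std.uni : toUpper;

// Hypothetical helper (not part of Phobos): uppercases one letter,
// applying the Turkish dotted-i rule only when asked to.
dchar upperFor(dchar c, bool turkish)
{
    if (turkish && c == 'i')
        return '\u0130'; // İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
    return toUpper(c);   // default, language-independent mapping
}

unittest
{
    assert(upperFor('i', false) == 'I');
    assert(upperFor('i', true) == '\u0130');
    assert(upperFor('a', true) == 'A'); // other letters are unaffected
}
```

A real implementation would take a language tag rather than a bool; the flag here only shows that the caller has to supply information the text itself does not carry.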

This is true and false ;-) You are right, indeed, that issues like sorting are language-specific and, moreover, use-case-specific. The Turkish case is a good example. For another one, in French I do not even know whether there is an official rule! Anyway, whatever the answer, even famous newspapers and official documents have used different rules. Most of them drop accents on uppercase letters, possibly because of computer limitations; there is a recent move (back) toward accented uppercase. This is very annoying: "Hélène" has 2 consistent uppercase versions, both in use. Conversely, how is software supposed to guess the lowercase version of "HELENE"?

As for Unicode, it does still define norms for casing and for so-called collation (comparison, for sorting) algorithms. I don't know much more; I have never applied them personally, for reasons like the ones above. The full list of its technical docs can be found at http://unicode.org/reports/. See in particular http://unicode.org/reports/tr10/ for collation. (Unfortunately, case mapping is now part of the core standard document, so it's hard to find.)

Denis _________________ vita es estrany spir.wikidot.com
Jan 19 2011