www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Why the hell doesn't foreach decode strings

reply "Martin Nowak" <dawg dawgfoto.de> writes:
It just took me over one hour to find out the unthinkable.
foreach(c; str) will deduce c to immutable(char) and doesn't care about  
unicode.
Now there is so many unicode transcoding happening in the language that it  
starts to get annoying,
but the most basic string iteration doesn't support it by default?
Oct 20 2011
next sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 20/10/11 8:37 PM, Martin Nowak wrote:
 It just took me over one hour to find out the unthinkable.
 foreach(c; str) will deduce c to immutable(char) and doesn't care about
 unicode.
 Now there is so many unicode transcoding happening in the language that
 it starts to get annoying,
 but the most basic string iteration doesn't support it by default?

D has got itself into a tricky situation in this regard. Doing it either way introduces an unintuitive mess. The way it is now, you get the problem that you just described where foreach is unaware of Unicode. If you changed it to loop as Unicode, then indices won't match up: immutable(int)[] a = ... foreach (x, i; a) assert(x == a[i]); // ok immutable(char)[] b = ... foreach (x, i; b) assert(x == b[i]); // not necessarily! Also, the loop won't necessarily iterate b.length times. There's inconsistencies all over the place. The whole mess is caused by conflating the idea of an array with a variable length encoding that happens to use an array for storage. I don't believe there is any clean and tidy way to fix the problem without breaking compatibility.
Oct 20 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/20/2011 2:49 PM, Peter Alexander wrote:
 The whole mess is caused by conflating the idea of an array with a variable
 length encoding that happens to use an array for storage. I don't believe there
 is any clean and tidy way to fix the problem without breaking compatibility.

There is no 'fixing' it, even to break compatibility. Sometimes you want to look at an array of utf8 as 8 bit characters, and sometimes as 20 bit dchars. Someone will be dissatisfied no matter what. There is no way to program strings in D without being aware of UTF-8 encoding.
Oct 20 2011
next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
Walter Bright wrote:
 Sometimes you want to look at an array of utf8 as 8 bit characters,
 and sometimes as 20 bit dchars.

Well, they could always cast it to a ubyte[]. But on the other hand, you can just ask for dchar now. So yeah.
Oct 20 2011
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/20/2011 7:37 PM, Jonathan M Davis wrote:
 True, but if the default were dchar, then the common case would be have fewer
 bugs

Is that really the common case? It's certainly the *slow* case. Common string operations like searching, copying, etc., do not require decoding.
 (still allowing you to explicitly use char or wchar when you want to). At
 minimum, I think that it would be a good idea to implement
 http://d.puremagic.com/issues/show_bug.cgi?id=6652 and make it a warning not
 to explicitly give the type with foreach for arrays of char or wchar. It would
 catch bugs without changing the behavior of any existing code, and it still
 allows you to iterate over either code units or code points.

I like the type deduction feature of foreach, and don't think it should be removed for strings. Currently, it's consistent - T[] gets an element type of T. I want to reiterate that there's no way to program strings in D without being cognizant of them being a multibyte representation. D is both a high level and a low level language, and you can pick which to use, but you still gotta pick.
Oct 20 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/20/2011 8:58 PM, Jonathan M Davis wrote:
 And why would you iterate over a string with foreach without decoding it
 unless you specifically need to operate on code units (which I would expect to
 be _very_ rare)? Sure, copying doesn't require decoding, but searching sure
 does

No, it doesn't. If I'm searching for a dchar, I'll be searching for a substring in the UTF-8 string. It's far, FAR more efficient to search as a substring rather than decoding while searching. Even more, 99.9999% of searches involve an ascii search string. It is simply not necessary to decode the searched string, as encoded chars cannot be ascii. For example: foreach (c; somestring) if (c == '+') found it! gains absolutely nothing by decoding somestring.
 (unless you're specifically looking for a code unit rather than a code
 point, which would not be normal). Most anything which needs to operate on the
 characters of a string needs to decode them. And iterating over them to do
 much of anything would require decoding, since otherwise you're operating on
 code units, and how often does anyone do that unless they're specifically
 messing around with character encodings?

What you write sounds intuitively correct, but in my experience writing Unicode processing code, it simply isn't true. One rarely needs to decode.
 However, in most cases, that is _not_
 what the programmer actually wants. They want to iterate over characters, not
 pieces of characters. So, the default at this point is _wrong_ in the common
 case.

This is simply not my experience when working with Unicode. Performance takes a big hit when one structures an algorithm to require decoding/encoding. Doing the algorithm using substrings is a huge win. Take a look at dmd's lexer, it handles Unicode correctly and avoids doing decoding as much as possible.
Oct 20 2011
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/21/2011 2:51 AM, Martin Nowak wrote:
 You have a good point here. I would have immediately thrown out the loop AFTER
 profiling.
 What hits me here is that I had an incorrect program with built-in unicode
aware
 strings.
 This is counterintuitive to correct unicode handling throughout the std
library,
 and even more to the complementary operation of appending any char type to
strings.

I understand the issue, but I don't think it's resolvable. It's a lot like the signed/unsigned issue. Java got rid of it by simply not having any unsigned types.
Oct 21 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/21/11 1:38 PM, Walter Bright wrote:
 On 10/21/2011 2:51 AM, Martin Nowak wrote:
 You have a good point here. I would have immediately thrown out the
 loop AFTER
 profiling.
 What hits me here is that I had an incorrect program with built-in
 unicode aware
 strings.
 This is counterintuitive to correct unicode handling throughout the
 std library,
 and even more to the complementary operation of appending any char
 type to strings.

I understand the issue, but I don't think it's resolvable.

It is resolvable, just not without breaking compatibility. Latching on the notion that a problem is unsolvable is highly nocive because it sets up the mind for failing to not only look for solutions, but also to see and understand them when they're in the open. I said this a number of times, and I repeat: if we had the luxury of doing it over again, I'd disable random access and .length for char[], wchar[], and their qualified versions. For those types I would add a property .rep that yields respectively ubyte[], ushort[], and the appropriately-qualified variants. This would shed the remaining awkwardnesses from a generally very elegant approach to string handling. The loop issue would be trivial to solve. foreach (x; s) would iterate one dchar at a time, whereas foreach (x; s.rep) would iterate one ubyte or ushort at a time. There would be the ability to iterate foreach (ref x; s.rep) but not foreach (ref x; s). Andrei
Oct 21 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 Latching on the notion that a problem is unsolvable is highly nocive because
it sets 
 up the mind for failing to not only look for solutions, but also to see 
 and understand them when they're in the open.

Right. Experimental psychology has confirmed this some decades ago :-)
 I said this a number of times, and I repeat:

Maybe I have missed your precedent explanations of this idea. Or to me this time it seems more clear and focused.
 if we had the luxury of doing it over again,

Unfortunately D3 language can't break too much backward compatibility :-|
 I'd disable random access and .length for char[], 
 wchar[], and their qualified versions. For those types I would add a 
 property .rep that yields respectively ubyte[], ushort[], and the 
 appropriately-qualified variants.

Good. But the need to know the length of a variable-sized string is common. So I presume in such cases you have to use: somestring.walkLength() This is not too much bad, but it looks a bit long, and it requires an import. So maybe a shorter named property function in the object module is needed: somestring.wlength What about slices? I do need to take slices of variable-length strings too. Bye, bearophile
Oct 21 2011
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2011-10-21 20:38, Walter Bright wrote:
 On 10/21/2011 2:51 AM, Martin Nowak wrote:
 You have a good point here. I would have immediately thrown out the
 loop AFTER
 profiling.
 What hits me here is that I had an incorrect program with built-in
 unicode aware
 strings.
 This is counterintuitive to correct unicode handling throughout the
 std library,
 and even more to the complementary operation of appending any char
 type to strings.

I understand the issue, but I don't think it's resolvable. It's a lot like the signed/unsigned issue. Java got rid of it by simply not having any unsigned types.

Can't we implement a new string type that people can choose to use if they want. It will hide all the Unicode details that has been brought up by this thread. -- /Jacob Carlborg
Oct 22 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 10/22/2011 02:14 PM, Jacob Carlborg wrote:
 On 2011-10-21 20:38, Walter Bright wrote:
 On 10/21/2011 2:51 AM, Martin Nowak wrote:
 You have a good point here. I would have immediately thrown out the
 loop AFTER
 profiling.
 What hits me here is that I had an incorrect program with built-in
 unicode aware
 strings.
 This is counterintuitive to correct unicode handling throughout the
 std library,
 and even more to the complementary operation of appending any char
 type to strings.

I understand the issue, but I don't think it's resolvable. It's a lot like the signed/unsigned issue. Java got rid of it by simply not having any unsigned types.

Can't we implement a new string type that people can choose to use if they want. It will hide all the Unicode details that has been brought up by this thread.

Having multiple standard string types is bad. Furthermore, it is hard to meaningfully hide all the Unicode details. Not even immutable(dchar)[] necessarily encodes one character as one code unit.
Oct 22 2011
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/20/2011 9:06 PM, Jonathan M Davis wrote:
 It's this very problem that leads some people to argue that string should be
 its own type which holds an array of code units (which can be accessed when
 needed) rather than doing what we do now where we try and treat a string as
 both an array of chars and a range of dchars. The result is schizophrenic.

Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings.
Oct 20 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/21/2011 4:14 AM, Steven Schveighoffer wrote:
 Making such a string type would be terribly inefficient. It would make D
 completely uncompetitive for processing strings.

I don't think it would. Do you have any proof to support this?

I've done string processing code, and done a lot of profiling of them. Every cycle is critical, and decoding adds a *lot* of cycles.
Oct 21 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/21/11 1:39 PM, Walter Bright wrote:
 On 10/21/2011 4:14 AM, Steven Schveighoffer wrote:
 Making such a string type would be terribly inefficient. It would make D
 completely uncompetitive for processing strings.

I don't think it would. Do you have any proof to support this?

I've done string processing code, and done a lot of profiling of them. Every cycle is critical, and decoding adds a *lot* of cycles.

The key here is to allow people to easily either decode strings or not, without defaulting to an error-prone choice. Andrei
Oct 21 2011
prev sibling parent reply travert phare.normalesup.org (Christophe Travert) writes:
Walter Bright , dans le message (digitalmars.D:147161), a ťcrit†:
 On 10/20/2011 9:06 PM, Jonathan M Davis wrote:
 It's this very problem that leads some people to argue that string should be
 its own type which holds an array of code units (which can be accessed when
 needed) rather than doing what we do now where we try and treat a string as
 both an array of chars and a range of dchars. The result is schizophrenic.

Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings.

I definitely agree with you, but I have a piece of news for you : The whole phobos alreaday treats strings as a dchar ranges, and IS inefficient for processing strings. The fact is: char[] is not char[] in phobos. It is Range!dchar. This is aweful and schizophrenic, and also inefficient. The purpose was to allow people to use strings without knowing anything about unicode. That is why Jonathan proposed having a specific string structure to manipulate strings without having to worry about unicode, just like how strings are currently manipulated in phobos, and letting char[] be char[], and be as efficient as they should be. I was not there when it was decided to treat strings as dchar ranges, but now it is done. The only thing I can do is use ubyte[] instead of char[] so that phobos treat them as proper arrays, and propose optimized overloads for various phobos algorithm to make them as efficient as they should be (which I didn't find the time to do yet). -- Christophe
Oct 28 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/28/11 4:41 AM, Christophe Travert wrote:
 I definitely agree with you, but I have a piece of news for you :
 The whole phobos alreaday treats strings as a dchar ranges, and IS
 inefficient for processing strings.

 The fact is: char[] is not char[] in phobos. It is Range!dchar. This is
 aweful and schizophrenic, and also inefficient. The purpose was to allow
 people to use strings without knowing anything about unicode.

It was to allow people who know Unicode to use strings without too much arcana.
 That is
 why Jonathan proposed having a specific string structure to manipulate
 strings without having to worry about unicode, just like how strings are
 currently manipulated in phobos, and letting char[] be char[], and be as
 efficient as they should be.

No need. Use http://www.digitalmars.com/d/2.0/phobos/std_string.html#representation
 I was not there when it was decided to treat strings as dchar ranges,
 but now it is done. The only thing I can do is use ubyte[] instead of
 char[] so that phobos treat them as proper arrays, and propose optimized
 overloads for various phobos algorithm to make them as efficient as they
 should be (which I didn't find the time to do yet).

What would you have done if you were there? What is your design of choice? Andrei
Oct 28 2011
prev sibling next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-10-21 03:58:50 +0000, Jonathan M Davis <jmdavisProg gmx.com> said:

 Sure, if you _know_ that you're dealing with a string with only ASCII, it's
 faster to just iterate over chars

It works for non-ASCII too. You're probably missing an interesting property of UTF encodings: if you want to search for a substring in a well-formed UTF sequence, you do not need to decode the bigger string, comparing the UTF-x code units of the substring with the UTF-x code units of the bigger string is plenty enough. Similarly, if you're searching for the 'Í' code point in an UTF-8 string, the most efficient way is to search the string for the two-byte UTF-8 sequence you would use to encode 'Í' (in other words, convert 'Í' to a string). Decoding the whole string is a wasteful process.
 Sure, if you _know_ that you're dealing with a string with only ASCII, it's
 faster to just iterate over chars, but then you can explicitly give the type
 of the foreach variable as char, but normally what people care about is
 iterating over characters, not pieces of characters.

If you want to iterate over what people consider characters, then you need to take into account combining marks that form multi-code-point graphemes. (You'll probably want to deal with unicode normalization too.) Treating code points as if they were characters is a misconception in the same way as treating UTF-16 code units as character is: both works most of the time but also fail in a number of cases.
 So, I would expect the
 case where people _want_ to iterate over chars to be rare. In most cases,
 iterating over a string as chars is a bug - one which in many cases won't be
 quickly caught, because the programmer is English speaking and uses almost
 exclusively ASCII for whatever testing that they do.

That's a real problem. But is treating everything as dchar the only solution to that problem?
 Defaulting to the
 guaranteed correct handling of characters and special casing when it's
 possible to write code more efficiently than that is definitely the way to go
 about it, and it's how Phobos generally does it.

Iterating on dchar is not guarantied to be correct, it only has significantly more chances of being correct.
 The fact that foreach doesn't
 is incongruous with how strings are handled in most other cases.

You could also argue that ranges are doing things the wrong way.
 I like the type deduction feature of foreach, and don't think it should be
 removed for strings. Currently, it's consistent - T[] gets an element type
 of T.

Sure, the type deduction of foreach is great, and it's completely consistent that iterating over an array of chars would iterate over chars rather than dchars when you don't give the type. However, in most cases, that is _not_ what the programmer actually wants. They want to iterate over characters, not pieces of characters.

I note that you keep confusing characters with code units.
 I want to reiterate that there's no way to program strings in D without
 being cognizant of them being a multibyte representation. D is both a high
 level and a low level language, and you can pick which to use, but you
 still gotta pick.

I fully agree that programmers need to properly understand unicode to use strings in D properly. However, the problem is that the default handling of strings with foreach is _not_ what programmers are going to normally want, so the default will cause bugs.

That said I wouldn't expect most programmers understand Unicode. Giving them dchars by default won't eliminate bugs related to multi-code-point characters, but it'll likely eliminate bugs relating to multi-code-unit sequences. That could be a good start. I'd say choosing dchar is a practical compromise between the "characters by default" and "the type of the array by default", but it is neither of those ideals. How is that pragmatic trade-off going to fare a few years in the future? I'm a little skeptical that this is the ideal solution. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Oct 20 2011
prev sibling parent Norbert Nemec <Norbert Nemec-online.de> writes:
On 21.10.2011 06:06, Jonathan M Davis wrote:
 It's this very problem that leads some people to argue that string should be
 its own type which holds an array of code units (which can be accessed when
 needed) rather than doing what we do now where we try and treat a string as
 both an array of chars and a range of dchars. The result is schizophrenic.

Indeed - expressing strings as arrays of characters will always fall short of the unicode concept in some way. A true unicode-compliant languages have to handle strings as opaque objects that do not have any encoding. There is a number of operations that can be done with these objects (concatenation, comparison, searching, etc.). Any kind of defined memory representation can only be obtained by an explicit encoding operation. Python3, for example, did a fundamental step by introducing this fundamental distinction. At first it seems silly, having to think about encodings so often when writing trivial code. After a short while, the strict conceptual separation between unencoded "strings" and encoded "arrays of something" really helps avoiding ugly problems. Sure, for a performance-critical language, the issue becomes a lot trickier. I still think it is possible and ultimately the only way to solve tricky problems that will otherwise always crop up somewhere.
Oct 24 2011
prev sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 21/10/11 3:26 AM, Walter Bright wrote:
 On 10/20/2011 2:49 PM, Peter Alexander wrote:
 The whole mess is caused by conflating the idea of an array with a
 variable
 length encoding that happens to use an array for storage. I don't
 believe there
 is any clean and tidy way to fix the problem without breaking
 compatibility.

There is no 'fixing' it, even to break compatibility. Sometimes you want to look at an array of utf8 as 8 bit characters, and sometimes as 20 bit dchars. Someone will be dissatisfied no matter what.

Then separate those ways of viewing strings. Here's one solution that I believe would satisfy everyone: 1. Remove the string, wstring and dstring aliases. An array of char should be an array of char, i.e. the same as array of byte. Same for arrays of wchar and dchar. This way, arrays of T have no subtle differences for certain kinds of T. 2. Add string, wstring and dstring structs with the following interface: a. foreach should iterate as dchar. b. property front() would be dchar. c. property length() would not exist. d. property buffer() returns the underlying immutable array of char, wchar etc. e. Remove opIndex and co. What this does: - Makes all array types consistent and intuitive. - Makes looping over strings do the expected thing. - Provides an interface to the underlying 8-bit chars for those that want it. Of course, people will still need to understand UTF-8. I don't think that's a problem. It's unreasonable to expect the language to do the thinking for you. The problem is that we have people that *do* understand UTF-8 (like the OP), but *don't* understand D's strings.
Oct 21 2011
parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 21/10/11 11:03 PM, so wrote:
 On Fri, 21 Oct 2011 21:38:39 +0300, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 In another post in this thread, Walter said in reference to post on
 essentially this idea: "Making such a string type would be terribly
 inefficient. It would make D completely uncompetitive for processing
 strings."
 Now, whether that's true is debatable, but that's his stance on the idea.

 - Jonathan M Davis

Well he is right, the reason built-in strings everywhere and rarely (so rare that i have never seen one yet) someone comes up with a better alternative. IMO people are spoiled by dynamic languages and fail to see the strengths of D strings. Someone coming from C/C++ this is heaven, both for performance and flexibility.

Which operations do you believe would be less efficient?
Oct 22 2011
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.
Oct 22 2011
next sibling parent reply kennytm <kennytm gmail.com> writes:
Walter Bright <newshound2 digitalmars.com> wrote:
 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.

You can std.algorithm.find to do searching, not foreach. The former can decide whichever efficient method to use.
Oct 22 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 10/22/2011 09:37 PM, kennytm wrote:
 Walter Bright<newshound2 digitalmars.com>  wrote:
 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.

You can std.algorithm.find to do searching, not foreach. The former can decide whichever efficient method to use.

Afaics the current std.algorithm.find implementation decodes its arguments.
Oct 22 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/22/11 3:05 PM, Timon Gehr wrote:
 On 10/22/2011 09:37 PM, kennytm wrote:
 Walter Bright<newshound2 digitalmars.com> wrote:
 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.

You can std.algorithm.find to do searching, not foreach. The former can decide whichever efficient method to use.

Afaics the current std.algorithm.find implementation decodes its arguments.

That can be easily fixed. Currently single-element find does decoding but substring find avoids it if possible: https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm.d#L2819 Andrei
Oct 22 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 10/22/2011 10:42 PM, Andrei Alexandrescu wrote:
 On 10/22/11 3:05 PM, Timon Gehr wrote:
 On 10/22/2011 09:37 PM, kennytm wrote:
 Walter Bright<newshound2 digitalmars.com> wrote:
 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.

You can std.algorithm.find to do searching, not foreach. The former can decide whichever efficient method to use.

Afaics the current std.algorithm.find implementation decodes its arguments.

That can be easily fixed. Currently single-element find does decoding but substring find avoids it if possible: https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm.d#L2819

Ok, I actually did not see that, thanks. However this is usually still not the most efficient implementation in case the first argument is string and the other is wstring/dstring.
Oct 22 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/22/11 5:32 PM, Timon Gehr wrote:
 Ok, I actually did not see that, thanks. However this is usually still
 not the most efficient implementation in case the first argument is
 string and the other is wstring/dstring.

I understand. For some reason I'm not seeing the URL of your pull request fixing that :o). Andrei
Oct 22 2011
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 24.10.2011 23:41, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
 <simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.

Searching that does not do decoding is fundamentally incorrect. That is, if you want to find a substring in a string, you cannot just compare chars.

Assuming both string are valid UTF-8, you can. Continuation bytes can never be confused with the first byte of a code point, and the first byte always identifies how many continuation bytes there should be.

As others have pointed out in the past to me (and I thought as you did once), the same characters can be encoded in *different ways*. They must be normalized to accurately compare.

Assuming language support stays on stage of "codepoint is a character" it's totaly expected to ignore modifiers and compare identically normalized UTF without decoding. Yes, it risks to hit certain issues.
 Plus, a combining character (such as an umlaut or accent) is part of a
 character, but may be a separate code point. If that's on the last
 character in the word such as fiancé, then searching for fiance will
 result in a match without proper decoding!

Now if you are going to do real characters... If source/needle are normalized you still can avoid lots of work by searching without decoding. All you need to decode is one codepoint on each successful match to see if there is a modifier at end of matched portion. But it depends on how you want to match if it's case-insensitive search it will be a lot more complicated, but anyway it boils down to this: 1) do inexact search, get likely match ( false positives are OK, negatives not) no decoding here 2) once found check it (or parts of it) with proper decoding There are cultural subtleties, that complicate these steps if you take them into account, but it's doable. Or if fiancé uses a
 precomposed é, it won't match. So two valid representations of the word
 either match or they don't. It's just a complete mess without proper
 unicode decoding.

It's a complete mess even with proper decoding ;)
 -Steve

-- Dmitry Olshansky
Oct 24 2011
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 10/24/2011 1:52 PM, Steven Schveighoffer wrote:
 Call me ludicrous, but is this really what we want to push on someone as a
 "unicode-aware" language?

There are different levels of Unicode support. D operates at the most basic level (forgot what it is called) where only recognition of the encodings is necessary.
Oct 24 2011
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 10/24/2011 10:52 PM, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 On 24.10.2011 23:41, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
 <simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.

Searching that does not do decoding is fundamentally incorrect. That is, if you want to find a substring in a string, you cannot just compare chars.

Assuming both string are valid UTF-8, you can. Continuation bytes can never be confused with the first byte of a code point, and the first byte always identifies how many continuation bytes there should be.

As others have pointed out in the past to me (and I thought as you did once), the same characters can be encoded in *different ways*. They must be normalized to accurately compare.

Assuming language support stays on stage of "codepoint is a character" it's totaly expected to ignore modifiers and compare identically normalized UTF without decoding. Yes, it risks to hit certain issues.

Again, the "risk" is that it fails to achieve the goal you ask of it! D-language: Here, use this search algorithm, it works most of the time, but may not work correctly in some cases. If you run into one of those cases, you'll have to run a specialized search algorithm for strings. User: How do I know I hit one of those cases? D-language: You'll have to run the specialized version to find out. User: Why wouldn't I just run the specialized version first? D-language: Well, because it's slower! User: But don't I have to use both algorithms to make sure I find the data? D-language: Only if you "care" about accuracy! Call me ludicrous, but is this really what we want to push on someone as a "unicode-aware" language?
 Plus, a combining character (such as an umlaut or accent) is part of a
 character, but may be a separate code point. If that's on the last
 character in the word such as fiancé, then searching for fiance will
 result in a match without proper decoding!

Now if you are going to do real characters... If source/needle are normalized you still can avoid lots of work by searching without decoding. All you need to decode is one codepoint on each successful match to see if there is a modifier at end of matched portion. But it depends on how you want to match if it's case-insensitive search it will be a lot more complicated, but anyway it boils down to this: 1) do inexact search, get likely match ( false positives are OK, negatives not) no decoding here 2) once found check it (or parts of it) with proper decoding There are cultural subtleties, that complicate these steps if you take them into account, but it's doable.

I agree with you that simple searches using only byte (or dchar) comparison does not work, and can be optimized based on several factors. The easiest thing is to find a code unit sequence that only has one valid form, then search for that without decoding. Then when found, decode the characters around it. Or if that isn't possible, create all the un-normalized forms for one grapheme (based on how likely it is to occur), and search for one of those in the undecoded stream. This can all be built into a specialized string type. There's actually some really interesting problems to solve in this space I think. I've created a basic string type that has lamented in my unfinished pile of stuff to do. I think it can be done in a way that is close to as efficient as arrays for the most common operations (slicing, indexing, etc.), but is *correct* before it is efficient. You should always be able to drop into "array mode" and deal with the code-units.
 Or if fiancé uses a
 precomposed é, it won't match. So two valid representations of the word
 either match or they don't. It's just a complete mess without proper
 unicode decoding.

It's a complete mess even with proper decoding ;)

Yes, all the more reason to solve the problem correctly so the hapless unicode novice user doesn't have to! -Steve

The answer is probably to get proper unicode normalization into Phobos. The efficient versions can then specify that they are guaranteed to work perfectly accurate on normalized input only.
Oct 24 2011
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 25.10.2011 0:52, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 On 24.10.2011 23:41, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
 <simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.

Searching that does not do decoding is fundamentally incorrect. That is, if you want to find a substring in a string, you cannot just compare chars.

Assuming both string are valid UTF-8, you can. Continuation bytes can never be confused with the first byte of a code point, and the first byte always identifies how many continuation bytes there should be.

As others have pointed out in the past to me (and I thought as you did once), the same characters can be encoded in *different ways*. They must be normalized to accurately compare.

Assuming language support stays on stage of "codepoint is a character" it's totaly expected to ignore modifiers and compare identically normalized UTF without decoding. Yes, it risks to hit certain issues.

Again, the "risk" is that it fails to achieve the goal you ask of it!

Last time I checked, the legion of "everything is ASCII" zealots was pretty large ;) So the "goal" is a blury line, meaning that "search for a substring on in this chunk of unicode text" can mean very different things, and be correct in it's own sense on about 3 different levels. There is no default way, so I'm not sure we have to embed all of this complexity into language right now. Phobos is were we must first implement this, once it works we may look at revisiting our pick of default comparison method, etc.
 D-language: Here, use this search algorithm, it works most of the time,
 but may not work correctly in some cases. If you run into one of those
 cases, you'll have to run a specialized search algorithm for strings.
 User: How do I know I hit one of those cases?
 D-language: You'll have to run the specialized version to find out.
 User: Why wouldn't I just run the specialized version first?
 D-language: Well, because it's slower!
 User: But don't I have to use both algorithms to make sure I find the data?
 D-language: Only if you "care" about accuracy!

 Call me ludicrous, but is this really what we want to push on someone as
 a "unicode-aware" language?

No, but we might want to fix a string search to do a little more - namely check if it skewed a graphem (assuming normalizations match). BTW adding normalization to std.uni is another thing to do right now. That should be a good enough start, and looking at unicode standard, things are rather fluid there meaning that "unicode-aware" language should be sufixed by version number :)
 Plus, a combining character (such as an umlaut or accent) is part of a
 character, but may be a separate code point. If that's on the last
 character in the word such as fiancé, then searching for fiance will
 result in a match without proper decoding!

Now if you are going to do real characters... If source/needle are normalized you still can avoid lots of work by searching without decoding. All you need to decode is one codepoint on each successful match to see if there is a modifier at end of matched portion. But it depends on how you want to match if it's case-insensitive search it will be a lot more complicated, but anyway it boils down to this: 1) do inexact search, get likely match ( false positives are OK, negatives not) no decoding here 2) once found check it (or parts of it) with proper decoding There are cultural subtleties, that complicate these steps if you take them into account, but it's doable.

I agree with you that simple searches using only byte (or dchar) comparison does not work, and can be optimized based on several factors. The easiest thing is to find a code unit sequence that only has one valid form, then search for that without decoding. Then when found, decode the characters around it. Or if that isn't possible, create all the un-normalized forms for one grapheme (based on how likely it is to occur), and search for one of those in the undecoded stream.

Yes, it's all revolves around find all of variations of a substring. If user is willing to spend some time upfront, this could be done in one pass e.g. using the trick I employed in new std.regex. For a rough idea see ShiftOr string search, that is easy & fast inexact search: http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060 There are even better variations of this stuff in the wild.
 This can all be built into a specialized string type. There's actually
 some really interesting problems to solve in this space I think. I've
 created a basic string type that has lamented in my unfinished pile of
 stuff to do. I think it can be done in a way that is close to as
 efficient as arrays for the most common operations (slicing, indexing,
 etc.), but is *correct* before it is efficient. You should always be
 able to drop into "array mode" and deal with the code-units.

Here is where we probably differ, to me there is no simple safe fast string type. Why? Because if you need anything "not pedantic and safe above all" you work with implicit data type of e.g. "a chunk of UTF-8 bytes that are normalized in NFC", ... I'd rather see them as sealed array with a bunch of meta info, and string functions to specialize on them in order to do some cool speed hacks. Of course, drilling down to row data should be possible for the user as well.
 Or if fiancé uses a
 precomposed é, it won't match. So two valid representations of the word
 either match or they don't. It's just a complete mess without proper
 unicode decoding.

It's a complete mess even with proper decoding ;)

Yes, all the more reason to solve the problem correctly so the hapless unicode novice user doesn't have to!

Hapless novice unicode user don't stand a chance. I'm serious. One needs to know a lot of these "basics" to e.g. know why indexing by "true" character could be so slow, about encodings, normalizations etc. Yet we can save a metric ton of special cased crap from ever hitting his eyes. -- Dmitry Olshansky
Oct 24 2011
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 26.10.2011 16:12, Steven Schveighoffer wrote:
<snip>
 I agree the string type needs to be fleshed out before the language
 adopts it. I think we could legitimately define a string type that
 auto-assigns from it's respective char[] array type. Once it's shown
 that the type is nearly as fast as a char[] array, it can be used as the
 default type for string literals (the only real place where the language
 gets in the way).

That's it.
 D-language: Here, use this search algorithm, it works most of the time,
 but may not work correctly in some cases. If you run into one of those
 cases, you'll have to run a specialized search algorithm for strings.
 User: How do I know I hit one of those cases?
 D-language: You'll have to run the specialized version to find out.
 User: Why wouldn't I just run the specialized version first?
 D-language: Well, because it's slower!
 User: But don't I have to use both algorithms to make sure I find the
 data?
 D-language: Only if you "care" about accuracy!

 Call me ludicrous, but is this really what we want to push on someone as
 a "unicode-aware" language?

No, but we might want to fix a string search to do a little more - namely check if it skewed a graphem (assuming normalizations match).

That's a big assumption. Valid unicode is valid unicode, even if it's not normalized.
 BTW adding normalization to std.uni is another thing to do right now.
 That should be a good enough start, and looking at unicode standard,
 things are rather fluid there meaning that "unicode-aware" language
 should be sufixed by version number :)

I agree we need normalization and it is not necessary to involve the compiler in this. I'd suggest a two to three phase approach: 1. Leave phobos' schizo view of "char arrays are not arrays" for now, and build the necessary framework to get a string type that actually works. 2. Remove schizo view. 3. (optional) make compiler use library-defined string type for string literals.

Something like that, I may land a hand here, as I plan to merge some unicode stuff between std.regex and std.uni. As extras probably striding by grapheme and support for various degree of case-insensitive string comparison.
 Plus, a combining character (such as an umlaut or accent) is part of a
 character, but may be a separate code point. If that's on the last
 character in the word such as fiancé, then searching for fiance will
 result in a match without proper decoding!

Now if you are going to do real characters... If source/needle are normalized you still can avoid lots of work by searching without decoding. All you need to decode is one codepoint on each successful match to see if there is a modifier at end of matched portion. But it depends on how you want to match if it's case-insensitive search it will be a lot more complicated, but anyway it boils down to this: 1) do inexact search, get likely match ( false positives are OK, negatives not) no decoding here 2) once found check it (or parts of it) with proper decoding There are cultural subtleties, that complicate these steps if you take them into account, but it's doable.

I agree with you that simple searches using only byte (or dchar) comparison does not work, and can be optimized based on several factors. The easiest thing is to find a code unit sequence that only has one valid form, then search for that without decoding. Then when found, decode the characters around it. Or if that isn't possible, create all the un-normalized forms for one grapheme (based on how likely it is to occur), and search for one of those in the undecoded stream.

Yes, it's all revolves around find all of variations of a substring. If user is willing to spend some time upfront, this could be done in one pass e.g. using the trick I employed in new std.regex. For a rough idea see ShiftOr string search, that is easy & fast inexact search: http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060 There are even better variations of this stuff in the wild.

This sounds good. I'll have a look at it when I have time.
 This can all be built into a specialized string type. There's actually
 some really interesting problems to solve in this space I think. I've
 created a basic string type that has lamented in my unfinished pile of
 stuff to do. I think it can be done in a way that is close to as
 efficient as arrays for the most common operations (slicing, indexing,
 etc.), but is *correct* before it is efficient. You should always be
 able to drop into "array mode" and deal with the code-units.

Here is where we probably differ, to me there is no simple safe fast string type. Why? Because if you need anything "not pedantic and safe above all" you work with implicit data type of e.g. "a chunk of UTF-8 bytes that are normalized in NFC", ...

Specialized types which dictate normalization could be created. We do not have to always assume the worst if we are optimizing. But in general, when you read a utf-8 file, it could have anything in it. It may not even be normalized.

There should be a way of e.g. auto my_string = normalize(readText("file.txt"), Normalize.NFKC); //my_string has typed normalized UTF-8 or somesuch Yes, that should keep general non-assuming case. Though it still can do some clever things, since as W3C points out >90% of text on web is already normalized in NFC (I can search for a proof link, if hard pressed).
 I'd rather see them as sealed array with a bunch of meta info, and
 string functions to specialize on them in order to do some cool speed
 hacks. Of course, drilling down to row data should be possible for the
 user as well.

This is exactly what I am envisioning ;) Except I'd build the meta-info into the type.

Yeah, I meant that the type caries this info. Looks like we are more in agreement then I suspected ;)
 Or if fiancé uses a
 precomposed é, it won't match. So two valid representations of the
 word
 either match or they don't. It's just a complete mess without proper
 unicode decoding.

It's a complete mess even with proper decoding ;)

Yes, all the more reason to solve the problem correctly so the hapless unicode novice user doesn't have to!

Hapless novice unicode user don't stand a chance. I'm serious. One needs to know a lot of these "basics" to e.g. know why indexing by "true" character could be so slow, about encodings, normalizations etc. Yet we can save a metric ton of special cased crap from ever hitting his eyes.

I disagree. 99% of the time I use strings, I don't care about indexes. I just want to deal with whole strings and substrings. I rarely use arbitrary indexes, and when I do, I'm assuming ascii data. The hapless novice unicode user doesn't need to know unicode to do: writefln("hello, world!"); Or to do: if(find(str, "xyz")) {...} Or to even use regex!

To print chars you don't need to know unicode, you OS needs to :) Though that wasn't string processing. However find/match is indeed a good example. You are probably right, if we cover enough of common operations most people might not have to "look inside". -- Dmitry Olshansky
Oct 26 2011
prev sibling parent travert phare.normalesup.org (Christophe) writes:
Dmitry Olshansky , dans le message (digitalmars.D:147415), a ťcrit†:
 Assuming language support stays on stage of "codepoint is a character" 
 it's totaly expected to ignore modifiers and compare identically 
 normalized UTF without decoding. Yes, it risks to hit certain issues.

string being seen as range of codepoint (dchar) is already aweful enough. Now seeing strings as range of displayable caracters just do not make sense. Unicode is too complicated to allow doing this for a general purpose string manipulation. All the transformations to displayable characters can only be done when displaying characters ! Just like fiancť is hidden is you write fiance' (with the approriate unicode character to have the ' placed over the 'e'). You can hide any word by using delete characters. You have to make asumption on the input, and you have to put limitations to the algorithm because in any case, you can have unexpected behavior. And I can assure you there is less unexpected behavior if you treat strings as dchar range or even char[], than if you treat them as displayable characters.
 It's a complete mess even with proper decoding ;)

Sure, that's why we better not decode.
Oct 28 2011
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/24/2011 7:02 AM, Steven Schveighoffer wrote:
 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2 digitalmars.com>
 wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.

Searching that does not do decoding is fundamentally incorrect. That is, if you want to find a substring in a string, you cannot just compare chars.

Sure you can. A Unicode character is a string, a Unicode string is a string of those strings. So, searching for a Unicode character is searching for a substring. Doing this character by character is standard, rote stuff. It's how grep works, for example.
Oct 24 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 What if the source character is encoded differently than the search
 string?  This is basic unicode stuff.  See my example with fiancť.

The more I think about it, the more I think it should work like this: just like we assume they contain well-formed UTF sequences, char[], wchar[], and dchar[] should also be assumed to contain **normalized** unicode strings. Which normalization form to use? no idea. Just pick a sensible one in the four. <http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms> Once we know all strings are normalized in the same way, we can then compare two strings bitwise to check if they're the same. And we can check for a substring in the same manner except we need to insert a simple check after the match to verify that it isn't surrounded by combining marks applying to the first or last character. And it'll also deeply simplify any proper Unicode string handling code we'll add in the future. - - - That said, I fear that forcing a specific normalization might be problematic. You don't always want to have to normalize everything... <http://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normalization_differences> So perhaps we could simplify things a bit more: don't pick a standard normalization form. Just assume that both strings being used are in the same normalization form. Comparison will work, searching for substring in the way specified above will work, and other functions could document which normalization form they accept. Problem solved... somewhat. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Oct 24 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-10-26 11:50:32 +0000, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 It's even easier than this:
 
 a) you want to do a proper string comparison not knowing what state the
 unicode strings are in, use the full-fledged decode-when-needed string
 type, and its associated str.find method.
 b) you know they are both the same normalized form and want to optimize,
 use std.algorithm.find(haystack.asArray, needle.asArray).

Well, treating the string as an array of dchar doesn't work in the general case, even with strings normalized the same way your fiancť example can break. So should never treat them as plain arrays unless I'm sure I have no combining marks in the string. I'm not opposed to having a new string type being developed, but I'm skeptical about its inclusion in the language. We already have three string types which can be assumed to contain valid UTF sequences. I think the first thing to do is not to develop a new string type, but to develop the normalization and grapheme splitting algorithms, and those to find a substring, using the existing char[], wchar[] and dchar[] types. Then write a program with proper handling of Unicode using those and hand-optimize it. If that proves to be a pain (it might well be), write a new string type, rewrite the program using it, do some benchmarks and then we'll know if it's a good idea, and will be able to quantify the drawbacks. But right now, all this arguing for or against a new string type is stacking hypothesis against other hypothesis, it won't lead anywhere. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Oct 26 2011
parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-10-26 14:20:54 +0000, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 If you have the same normalization, you would at least find all the vali d
 instances.  You'd have to eliminate false positives, by decoding the nex t
 dchar after the match to see if it's a combining character.  This would  be
 a simple function to write.
 
 All I'm saying is, doing a search on the array should be available.

And all I'm saying is that such a standalone function that eliminates the false positive should be available outside of a string type too. You'll likely need it for your string type anyway, but there's no reason you should't be able to use it standalone on a char[]/wchar[]/dchar[]. I think most string algorithms should be able to work with normalized character arrays; a new string type should just be a shell around these that makes things easier to the user.
 I have a half-implemented string type which does just this.  I need to
 finish it.

That should be interesting. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Oct 26 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
 <simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 Plus, a combining character (such as an umlaut or accent) is part of a
 character, but may be a separate code point.

If this is correct (and it is), then decoding to dchar is simply not enough. You seem to advocate decoding to graphemes, which is a whole different matter.

I am advocating that. And it's a matter of perception. D can say "we only support code-point decoding" and what that means to a user is, "we don't support language as you know it." Sure it's a part of unicode, but it takes that extra piece to make it actually usable to people who require unicode. Even in English, fiancé has an accent. To say D supports unicode, but then won't do a simple search on a file which contains a certain *valid* encoding of that word is disingenuous to say the least.

Why doesn't that simple search work? foreach (line; stdin.byLine()) { if (line.canFind("fiancé")) { writeln("There it is."); } }
 D needs a fully unicode-aware string type. I advocate D should use it as
 the default string type, but it needs one whether it's the default or
 not in order to say it supports unicode.

How do you define "supports Unicode"? For my money, the main sin of (w)string is that it offers [] and .length with potentially confusing semantics, so if I could I'd curb, not expand, its interface. Andrei
Oct 29 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/31/11 8:20 AM, Steven Schveighoffer wrote:
 Why doesn't that simple search work?

 foreach (line; stdin.byLine()) {
 if (line.canFind("fiancé")) {
 writeln("There it is.");
 }
 }

I think Jonathan answered that quite well, nothing else to add...

I see this as a decoding primitive issue that can be resolved. Andrei
Oct 31 2011
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2011-10-20 23:49, Peter Alexander wrote:
 On 20/10/11 8:37 PM, Martin Nowak wrote:
 It just took me over one hour to find out the unthinkable.
 foreach(c; str) will deduce c to immutable(char) and doesn't care about
 unicode.
 Now there is so many unicode transcoding happening in the language that
 it starts to get annoying,
 but the most basic string iteration doesn't support it by default?

D has got itself into a tricky situation in this regard. Doing it either way introduces an unintuitive mess. The way it is now, you get the problem that you just described where foreach is unaware of Unicode. If you changed it to loop as Unicode, then indices won't match up: immutable(int)[] a = ... foreach (x, i; a) assert(x == a[i]); // ok immutable(char)[] b = ... foreach (x, i; b) assert(x == b[i]); // not necessarily!

The index could skip certain indexes to make that assert pass. But that would be confusing as well. -- /Jacob Carlborg
Oct 20 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, October 20, 2011 19:26:43 Walter Bright wrote:
 On 10/20/2011 2:49 PM, Peter Alexander wrote:
 The whole mess is caused by conflating the idea of an array with a
 variable length encoding that happens to use an array for storage. I
 don't believe there is any clean and tidy way to fix the problem
 without breaking compatibility.

look at an array of utf8 as 8 bit characters, and sometimes as 20 bit dchars. Someone will be dissatisfied no matter what. There is no way to program strings in D without being aware of UTF-8 encoding.

True, but if the default were dchar, then the common case would be have fewer bugs (still allowing you to explicitly use char or wchar when you want to). At minimum, I think that it would be a good idea to implement http://d.puremagic.com/issues/show_bug.cgi?id=6652 and make it a warning not to explicitly give the type with foreach for arrays of char or wchar. It would catch bugs without changing the behavior of any existing code, and it still allows you to iterate over either code units or code points. - Jonathan M Davis
Oct 20 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, October 20, 2011 20:37:40 Walter Bright wrote:
 On 10/20/2011 7:37 PM, Jonathan M Davis wrote:
 True, but if the default were dchar, then the common case would be have
 fewer bugs

Is that really the common case? It's certainly the *slow* case. Common string operations like searching, copying, etc., do not require decoding.

And why would you iterate over a string with foreach without decoding it unless you specifically need to operate on code units (which I would expect to be _very_ rare)? Sure, copying doesn't require decoding, but searching sure does (unless you're specifically looking for a code unit rather than a code point, which would not be normal). Most anything which needs to operate on the characters of a string needs to decode them. And iterating over them to do much of anything would require decoding, since otherwise you're operating on code units, and how often does anyone do that unless they're specifically messing around with character encodings? Sure, if you _know_ that you're dealing with a string with only ASCII, it's faster to just iterate over chars, but then you can explicitly give the type of the foreach variable as char, but normally what people care about is iterating over characters, not pieces of characters. So, I would expect the case where people _want_ to iterate over chars to be rare. In most cases, iterating over a string as chars is a bug - one which in many cases won't be quickly caught, because the programmer is English speaking and uses almost exclusively ASCII for whatever testing that they do. The default for string handling really should be to treat them as ranges of dchar but still make it easy for them to be treated as arrays of code units when necessary. There's plenty of code in Phobos which is able to special case strings and make operating on them more efficient when it's not necessary to operate on them as ranges of dchar or when decoding the string explicitly with functions such as stride. But the default is still to operate on them as ranges of dchar, because that is what is normally correct. Defaulting to the guaranteed correct handling of characters and special casing when it's possible to write code more efficiently than that is definitely the way to go about it, and it's how Phobos generally does it. The fact that foreach doesn't is incongruous with how strings are handled in most other cases.
 (still allowing you to explicitly use char or wchar when you want to).
 At
 minimum, I think that it would be a good idea to implement
 http://d.puremagic.com/issues/show_bug.cgi?id=6652 and make it a warning
 not to explicitly give the type with foreach for arrays of char or
 wchar. It would catch bugs without changing the behavior of any
 existing code, and it still allows you to iterate over either code
 units or code points.

I like the type deduction feature of foreach, and don't think it should be removed for strings. Currently, it's consistent - T[] gets an element type of T.

Sure, the type deduction of foreach is great, and it's completely consistent that iterating over an array of chars would iterate over chars rather than dchars when you don't give the type. However, in most cases, that is _not_ what the programmer actually wants. They want to iterate over characters, not pieces of characters. So, the default at this point is _wrong_ in the common case. As such, I'm very leery of any code which uses foreach over a string without specifying the iteration type. And in fact, unless the code is clearly intended to operate on code units, I would expect a comment indicating that the use of char instead of dchar was intentional, or I'd still consider it likely that it's a bug and a mistake on the programmer's part (likely due to a misunderstanding of unicode and how D deals with it).
 I want to reiterate that there's no way to program strings in D without
 being cognizant of them being a multibyte representation. D is both a high
 level and a low level language, and you can pick which to use, but you
 still gotta pick.

I fully agree that programmers need to properly understand unicode to use strings in D properly. However, the problem is that the default handling of strings with foreach is _not_ what programmers are going to normally want, so the default will cause bugs. If strings defaulted to iterating as ranges of dchar, or if programmers had to say what type they wanted to iterate over when dealing with strings (or at least got a warning if they didn't), then there would be fewer bugs. Pretty much every time that the use of strings with foreach comes up on this list, most everyone agrees that it's a wart in the language that the default is to iterate over chars rather than dchars. Not everyone agrees on the best way to fix the problem, but most everyone agrees that it _is_ a problem. - Jonathan M Davis
Oct 20 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
Actually, I'd have to say that having foreach default to iterating over char 
for string is a bit like if it defaulted to iterating over byte for int[]. 
Sure, it would work in some cases, but in the general case, it would be very 
wrong.

Yes, it's consistent with how arrays are normally iterated over to have 
foreach iterate over char for char[], which is why it's arguably not a good 
idea to make it default to dchar. But it's _wrong_ in most cases, so at least 
giving a warning when the programmer doesn't give an iteration type for 
foreach with an array of char or wchar would be a big help.

It's this very problem that leads some people to argue that string should be 
its own type which holds an array of code units (which can be accessed when 
needed) rather than doing what we do now where we try and treat a string as 
both an array of chars and a range of dchars. The result is schizophrenic.

- Jonathan M Davis
Oct 20 2011
prev sibling next sibling parent so <so so.so> writes:
On Thu, 20 Oct 2011 22:37:56 +0300, Martin Nowak <dawg dawgfoto.de> wrote:

 It just took me over one hour to find out the unthinkable.
 foreach(c; str) will deduce c to immutable(char) and doesn't care about  
 unicode.
 Now there is so many unicode transcoding happening in the language that  
 it starts to get annoying,
 but the most basic string iteration doesn't support it by default?

I really can't believe why people expect that. By definition string is an array of chars, and this way it is consistent. All we need is: foreach(c, whatever(str)) instead of: foreach(c, str)
Oct 21 2011
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Fri, 21 Oct 2011 06:39:56 +0200, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 10/20/2011 8:58 PM, Jonathan M Davis wrote:
 And why would you iterate over a string with foreach without decoding it
 unless you specifically need to operate on code units (which I would  
 expect to
 be _very_ rare)? Sure, copying doesn't require decoding, but searching  
 sure
 does

No, it doesn't. If I'm searching for a dchar, I'll be searching for a substring in the UTF-8 string. It's far, FAR more efficient to search as a substring rather than decoding while searching. Even more, 99.9999% of searches involve an ascii search string. It is simply not necessary to decode the searched string, as encoded chars cannot be ascii. For example: foreach (c; somestring) if (c == '+') found it! gains absolutely nothing by decoding somestring.
 (unless you're specifically looking for a code unit rather than a code
 point, which would not be normal). Most anything which needs to operate  
 on the
 characters of a string needs to decode them. And iterating over them to  
 do
 much of anything would require decoding, since otherwise you're  
 operating on
 code units, and how often does anyone do that unless they're  
 specifically
 messing around with character encodings?

What you write sounds intuitively correct, but in my experience writing Unicode processing code, it simply isn't true. One rarely needs to decode.
 However, in most cases, that is _not_
 what the programmer actually wants. They want to iterate over  
 characters, not
 pieces of characters. So, the default at this point is _wrong_ in the  
 common
 case.

This is simply not my experience when working with Unicode. Performance takes a big hit when one structures an algorithm to require decoding/encoding. Doing the algorithm using substrings is a huge win. Take a look at dmd's lexer, it handles Unicode correctly and avoids doing decoding as much as possible.

AFTER profiling. What hits me here is that I had an incorrect program with built-in unicode aware strings. This is counterintuitive to correct unicode handling throughout the std library, and even more to the complementary operation of appending any char type to strings. martin
Oct 21 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 21 Oct 2011 00:43:20 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 10/20/2011 9:06 PM, Jonathan M Davis wrote:
 It's this very problem that leads some people to argue that string  
 should be
 its own type which holds an array of code units (which can be accessed  
 when
 needed) rather than doing what we do now where we try and treat a  
 string as
 both an array of chars and a range of dchars. The result is  
 schizophrenic.

Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings.

I don't think it would. Do you have any proof to support this? -Steve
Oct 21 2011
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Friday, October 21, 2011 11:11 Peter Alexander wrote:
 On 21/10/11 3:26 AM, Walter Bright wrote:
 On 10/20/2011 2:49 PM, Peter Alexander wrote:
 The whole mess is caused by conflating the idea of an array with a
 variable
 length encoding that happens to use an array for storage. I don't
 believe there
 is any clean and tidy way to fix the problem without breaking
 compatibility.

There is no 'fixing' it, even to break compatibility. Sometimes you want to look at an array of utf8 as 8 bit characters, and sometimes as 20 bit dchars. Someone will be dissatisfied no matter what.

Then separate those ways of viewing strings. Here's one solution that I believe would satisfy everyone: 1. Remove the string, wstring and dstring aliases. An array of char should be an array of char, i.e. the same as array of byte. Same for arrays of wchar and dchar. This way, arrays of T have no subtle differences for certain kinds of T. 2. Add string, wstring and dstring structs with the following interface: a. foreach should iterate as dchar. b. property front() would be dchar. c. property length() would not exist. d. property buffer() returns the underlying immutable array of char, wchar etc. e. Remove opIndex and co. What this does: - Makes all array types consistent and intuitive. - Makes looping over strings do the expected thing. - Provides an interface to the underlying 8-bit chars for those that want it. Of course, people will still need to understand UTF-8. I don't think that's a problem. It's unreasonable to expect the language to do the thinking for you. The problem is that we have people that *do* understand UTF-8 (like the OP), but *don't* understand D's strings.

In another post in this thread, Walter said in reference to post on essentially this idea: "Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings." Now, whether that's true is debatable, but that's his stance on the idea. - Jonathan M Davis
Oct 21 2011
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 10/20/2011 09:37 PM, Martin Nowak wrote:
 It just took me over one hour to find out the unthinkable.
 foreach(c; str) will deduce c to immutable(char) and doesn't care about
 unicode.
 Now there is so many unicode transcoding happening in the language

In the standard library. Not in the language.
 that it starts to get annoying,
 but the most basic string iteration doesn't support it by default?

I actually like the current behaviour.
Oct 21 2011
prev sibling next sibling parent so <so so.so> writes:
On Fri, 21 Oct 2011 21:11:14 +0300, Peter Alexander  
<peter.alexander.au gmail.com> wrote:

 Of course, people will still need to understand UTF-8. I don't think  
 that's a problem. It's unreasonable to expect the language to do the  
 thinking for you. The problem is that we have people that *do*  
 understand UTF-8 (like the OP), but *don't* understand D's strings.

Indeed, if one knows/understands how unicode works AND knows the structure of D strings, the current scheme is the best of both worlds. foreach(e; string) // chars foreach(e; byUTFxxx(string)) // ...
Oct 21 2011
prev sibling next sibling parent reply so <so so.so> writes:
On Fri, 21 Oct 2011 21:38:39 +0300, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:

 In another post in this thread, Walter said in reference to post on
 essentially this idea: "Making such a string type would be terribly
 inefficient. It would make D completely uncompetitive for processing  
 strings."
 Now, whether that's true is debatable, but that's his stance on the idea.

 - Jonathan M Davis

Well he is right, the reason built-in strings everywhere and rarely (so rare that i have never seen one yet) someone comes up with a better alternative. IMO people are spoiled by dynamic languages and fail to see the strengths of D strings. Someone coming from C/C++ this is heaven, both for performance and flexibility.
Oct 21 2011
parent bearophile <bearophileHUGS lycos.com> writes:
so:

 IMO people are spoiled by dynamic languages

We should aim to something *better* than dynamic languages, where possible. If you have to do certain things (even "slow" ones), there's no point in making them harder than necessary in D. Python is a good language, but it's not perfect, and in a new language I'd like something even better. D contains several small things that are better than Python (an many things that are worse than Python). Bye, bearophile
Oct 21 2011
prev sibling next sibling parent so <so so.so> writes:
On Sat, 22 Oct 2011 02:28:28 +0300, bearophile <bearophileHUGS lycos.com>  
wrote:

 so:

 IMO people are spoiled by dynamic languages

We should aim to something *better* than dynamic languages, where possible. If you have to do certain things (even "slow" ones), there's no point in making them harder than necessary in D. Python is a good language, but it's not perfect, and in a new language I'd like something even better. D contains several small things that are better than Python (an many things that are worse than Python). Bye, bearophile

With spoiled i didn't mean those languages do it better in general (expressiveness, maybe), what i meant is that they just work, without any effort from the programmer. It may be all right for them and you could say it is what every language should do but in reality this is not the case for D like languages. Not with the expense of efficiency on a low level task. We could aim all we like but we can't beat dynamic languages that easily, if you don't care reaching or beating C (on every aspect, system access, performance...) you don't have much of a limit do you? :)
Oct 21 2011
prev sibling next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Sat, 22 Oct 2011 01:28:28 +0200, bearophile <bearophileHUGS lycos.com=
  =

wrote:
 so:

 IMO people are spoiled by dynamic languages

We should aim to something *better* than dynamic languages, where =

 possible. If you have to do certain things (even "slow" ones), there's=

 no point in making them harder than necessary in D.

 Python is a good language, but it's not perfect, and in a new language=

 I'd like something even better. D contains several small things that a=

 better than Python (an many things that are worse than Python).

 Bye,
 bearophile

Well the first thing I tried out was: #!/usr/bin/env python for c in "f#a# =E2=88=9E": print c Which I still didn't get to run after: - reading SyntaxError: Non-ASCII character '\xe2' in file ./run.py on = = line 1, but no encoding declared; see = http://www.python.org/peps/pep-0263.html for details - adding a BOM to the script - remove the shebang which collides with the BOM - writing unicode("f#a# =E2=88=9E") so I don't get an encoding error martin
Oct 21 2011
prev sibling next sibling parent reply Alvaro <alvaro_segura gmail.com> writes:
El 20/10/2011 21:37, Martin Nowak escribiů:
 It just took me over one hour to find out the unthinkable.
 foreach(c; str) will deduce c to immutable(char) and doesn't care about
 unicode.
 Now there is so many unicode transcoding happening in the language that
 it starts to get annoying,
 but the most basic string iteration doesn't support it by default?

Maybe I didn't fully get your point, but, you do know that you can do the following, right? string str = "—andķ"; foreach(dchar c; str) ... and it decodes full unicode characters just fine. Or maybe you are just talking about the auto type inference, just to make sure... OTOH, as others say, it's not rare to iterate on 8-bit units if you're dealing with ASCII or if you are parsing and looking for operators and separators (which are normally 8-bit). Then you can leave the rest untouched or extract the parts in between without caring if they are 1 byte per character or several. (e.g. parsing XML, or JSON, or CSV, or INI, or conf, or D source, etc.)
Oct 22 2011
parent Daniel Gibson <metalcaedes gmail.com> writes:
Am 22.10.2011 12:55, schrieb Alvaro:
 El 20/10/2011 21:37, Martin Nowak escribiů:
 It just took me over one hour to find out the unthinkable.
 foreach(c; str) will deduce c to immutable(char) and doesn't care about
 unicode.
 Now there is so many unicode transcoding happening in the language that
 it starts to get annoying,
 but the most basic string iteration doesn't support it by default?

Maybe I didn't fully get your point, but, you do know that you can do the following, right? string str = "—andķ"; foreach(dchar c; str) ...

One visible Unicode character can consists of several dchars. (This is called "Grapheme") Cheers, - Daniel
Oct 22 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 21 Oct 2011 14:39:58 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 10/21/2011 4:14 AM, Steven Schveighoffer wrote:
 Making such a string type would be terribly inefficient. It would make  
 D
 completely uncompetitive for processing strings.

I don't think it would. Do you have any proof to support this?

I've done string processing code, and done a lot of profiling of them. Every cycle is critical, and decoding adds a *lot* of cycles.

What I mean is, default to a well-built string type, and let people who want to deal with arrays of code-units deal with arrays of code-units. This schizophrenic view phobos has of char[] arrays as not being arrays is horrendous to work with. For my usage, I almost never iterate over string characters or graphemes, I just pass strings. -Steve
Oct 24 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.

Searching that does not do decoding is fundamentally incorrect. That is, if you want to find a substring in a string, you cannot just compare chars. But if you want to do the fundamentally incorrect thing (perhaps because you as the programmer know the limits of what the data may contain), then you should be able to do that for efficiency. But to default to something incorrect and then claim it's for efficiency is almost laughable. If I gave you an O(lg(n)) shortest path algorithm that did not always find the shortest path, would you agree it was an improvment over dijkstra's? -Steve
Oct 24 2011
prev sibling next sibling parent "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright  
 <newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.

Searching that does not do decoding is fundamentally incorrect. That is, if you want to find a substring in a string, you cannot just compare chars.

Assuming both string are valid UTF-8, you can. Continuation bytes can never be confused with the first byte of a code point, and the first byte always identifies how many continuation bytes there should be. -- Simen
Oct 24 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
<simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer  =

 <schveiguy yahoo.com> wrote:

 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright  =


 <newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, woul=



 be less efficient if decoding was done.

Searching that does not do decoding is fundamentally incorrect. That=


 is, if you want to find a substring in a string, you cannot just  =


 compare chars.

Assuming both string are valid UTF-8, you can. Continuation bytes can =

 never
 be confused with the first byte of a code point, and the first byte  =

 always
 identifies how many continuation bytes there should be.

As others have pointed out in the past to me (and I thought as you did once), the same characters can be encoded in *different ways*. They mus= t be normalized to accurately compare. Plus, a combining character (such as an umlaut or accent) is part of a character, but may be a separate code point. If that's on the last character in the word such as fianc=C3=A9, then searching for fiance wil= l result in a match without proper decoding! Or if fianc=C3=A9 uses a precomposed =C3=A9, it won't match. So two valid representations of the= word either match or they don't. It's just a complete mess without proper unicode decoding. -Steve
Oct 24 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky  =

<dmitry.olsh gmail.com> wrote:

 On 24.10.2011 23:41, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
 <simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.

Searching that does not do decoding is fundamentally incorrect. Tha=




 is, if you want to find a substring in a string, you cannot just
 compare chars.

Assuming both string are valid UTF-8, you can. Continuation bytes ca=



 never
 be confused with the first byte of a code point, and the first byte
 always
 identifies how many continuation bytes there should be.

As others have pointed out in the past to me (and I thought as you di=


 once), the same characters can be encoded in *different ways*. They m=


 be normalized to accurately compare.

Assuming language support stays on stage of "codepoint is a character"=

 it's totaly expected to ignore modifiers and compare identically  =

 normalized UTF without decoding. Yes, it risks to hit certain issues.

Again, the "risk" is that it fails to achieve the goal you ask of it! D-language: Here, use this search algorithm, it works most of the time, = = but may not work correctly in some cases. If you run into one of those = = cases, you'll have to run a specialized search algorithm for strings. User: How do I know I hit one of those cases? D-language: You'll have to run the specialized version to find out. User: Why wouldn't I just run the specialized version first? D-language: Well, because it's slower! User: But don't I have to use both algorithms to make sure I find the da= ta? D-language: Only if you "care" about accuracy! Call me ludicrous, but is this really what we want to push on someone as= a = "unicode-aware" language?
 Plus, a combining character (such as an umlaut or accent) is part of =


 character, but may be a separate code point. If that's on the last
 character in the word such as fianc=C3=A9, then searching for fiance =


 result in a match without proper decoding!

Now if you are going to do real characters... If source/needle are =

 normalized you still can avoid lots of work by searching without  =

 decoding. All you need to decode is one codepoint on each successful  =

 match to see if there is a modifier at end of matched portion.
 But it depends on how you want to match if it's case-insensitive searc=

 it will be a lot more complicated, but anyway it boils down to this:
 1) do inexact search, get likely match ( false positives are OK,  =

 negatives not) no decoding here
 2) once found check it (or parts of it) with proper decoding

 There are cultural subtleties, that complicate these steps if you take=

 them into account, but it's doable.

I agree with you that simple searches using only byte (or dchar) = comparison does not work, and can be optimized based on several factors.= = The easiest thing is to find a code unit sequence that only has one vali= d = form, then search for that without decoding. Then when found, decode th= e = characters around it. Or if that isn't possible, create all the = un-normalized forms for one grapheme (based on how likely it is to occur= ), = and search for one of those in the undecoded stream. This can all be built into a specialized string type. There's actually = = some really interesting problems to solve in this space I think. I've = created a basic string type that has lamented in my unfinished pile of = stuff to do. I think it can be done in a way that is close to as = efficient as arrays for the most common operations (slicing, indexing, = etc.), but is *correct* before it is efficient. You should always be ab= le = to drop into "array mode" and deal with the code-units.
 Or if fianc=C3=A9 uses a
 precomposed =C3=A9, it won't match. So two valid representations of t=


 either match or they don't. It's just a complete mess without proper
 unicode decoding.

It's a complete mess even with proper decoding ;)

Yes, all the more reason to solve the problem correctly so the hapless = unicode novice user doesn't have to! -Steve
Oct 24 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 24 Oct 2011 17:27:46 -0400, Walter Bright  =

<newshound2 digitalmars.com> wrote:

 On 10/24/2011 7:02 AM, Steven Schveighoffer wrote:
 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright  =


 <newshound2 digitalmars.com>
 wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, woul=



 be less
 efficient if decoding was done.

Searching that does not do decoding is fundamentally incorrect. That =


 is, if you
 want to find a substring in a string, you cannot just compare chars.

Sure you can. A Unicode character is a string, a Unicode string is a =

 string of those strings. So, searching for a Unicode character is  =

 searching for a substring.

What if the source character is encoded differently than the search = string? This is basic unicode stuff. See my example with fianc=C3=A9. -Steve
Oct 24 2011
prev sibling next sibling parent "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 Plus, a combining character (such as an umlaut or accent) is part of a
 character, but may be a separate code point.

If this is correct (and it is), then decoding to dchar is simply not enough. You seem to advocate decoding to graphemes, which is a whole different matter. -- Simen
Oct 24 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 24 Oct 2011 19:58:59 -0400, Michel Fortin  =

<michel.fortin michelf.com> wrote:

 On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer"  =

 <schveiguy yahoo.com> said:

 What if the source character is encoded differently than the search
 string?  This is basic unicode stuff.  See my example with fianc=C3=A9=


 The more I think about it, the more I think it should work like this: =

 just like we assume they contain well-formed UTF sequences, char[],  =

 wchar[], and dchar[] should also be assumed to contain **normalized** =

 unicode strings. Which normalization form to use? no idea. Just pick a=

 sensible one in the four.
 <http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms>

 Once we know all strings are normalized in the same way, we can then  =

 compare two strings bitwise to check if they're the same. And we can  =

 check for a substring in the same manner except we need to insert a  =

 simple check after the match to verify that it isn't surrounded by  =

 combining marks applying to the first or last character. And it'll als=

 deeply simplify any proper Unicode string handling code we'll add in t=

 future.

  - - -

 That said, I fear that forcing a specific normalization might be  =

 problematic. You don't always want to have to normalize everything...
 <http://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normal=

 So perhaps we could simplify things a bit more: don't pick a standard =

 normalization form. Just assume that both strings being used are in th=

 same normalization form. Comparison will work, searching for substring=

 in the way specified above will work, and other functions could docume=

 which normalization form they accept. Problem solved... somewhat.

It's even easier than this: a) you want to do a proper string comparison not knowing what state the = = unicode strings are in, use the full-fledged decode-when-needed string = type, and its associated str.find method. b) you know they are both the same normalized form and want to optimize,= = use std.algorithm.find(haystack.asArray, needle.asArray). -Steve
Oct 26 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 24 Oct 2011 18:25:25 -0400, Dmitry Olshansky  =

<dmitry.olsh gmail.com> wrote:

 On 25.10.2011 0:52, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 On 24.10.2011 23:41, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
 <simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.

Searching that does not do decoding is fundamentally incorrect. T=






 is, if you want to find a substring in a string, you cannot just
 compare chars.

Assuming both string are valid UTF-8, you can. Continuation bytes =





 never
 be confused with the first byte of a code point, and the first byt=





 always
 identifies how many continuation bytes there should be.

As others have pointed out in the past to me (and I thought as you =




 once), the same characters can be encoded in *different ways*. They=




 must
 be normalized to accurately compare.

Assuming language support stays on stage of "codepoint is a characte=



 it's totaly expected to ignore modifiers and compare identically
 normalized UTF without decoding. Yes, it risks to hit certain issues=



 Again, the "risk" is that it fails to achieve the goal you ask of it!=



Last time I checked, the legion of "everything is ASCII" zealots was =

 pretty large ;) So the "goal" is a blury line, meaning that "search fo=

 a substring on in this chunk of unicode text" can mean very different =

 things, and be correct in it's own sense on about 3 different levels.
 There is no default way, so I'm not sure we have to embed all of this =

 complexity into language right now. Phobos is were we must first  =

 implement this, once it works we may look at revisiting our pick of  =

 default comparison method, etc.

If we say D obeys the subset of unicode that only works on ascii = characters, well, then I think we do not support unicode at all ;) I agree the string type needs to be fleshed out before the language adop= ts = it. I think we could legitimately define a string type that auto-assign= s = from it's respective char[] array type. Once it's shown that the type = is = nearly as fast as a char[] array, it can be used as the default type for= = string literals (the only real place where the language gets in the way)= .
 D-language: Here, use this search algorithm, it works most of the tim=


 but may not work correctly in some cases. If you run into one of thos=


 cases, you'll have to run a specialized search algorithm for strings.=


 User: How do I know I hit one of those cases?
 D-language: You'll have to run the specialized version to find out.
 User: Why wouldn't I just run the specialized version first?
 D-language: Well, because it's slower!
 User: But don't I have to use both algorithms to make sure I find the=


 data?
 D-language: Only if you "care" about accuracy!

 Call me ludicrous, but is this really what we want to push on someone=


 a "unicode-aware" language?

No, but we might want to fix a string search to do a little more - =

 namely check if it skewed a graphem (assuming normalizations match).

That's a big assumption. Valid unicode is valid unicode, even if it's n= ot = normalized.
 BTW adding normalization to std.uni is another thing to do right now. =

 That should be a good enough start, and looking at unicode standard,  =

 things are rather fluid there meaning that "unicode-aware" language  =

 should be sufixed by version number :)

I agree we need normalization and it is not necessary to involve the = compiler in this. I'd suggest a two to three phase approach: 1. Leave phobos' schizo view of "char arrays are not arrays" for now, an= d = build the necessary framework to get a string type that actually works. 2. Remove schizo view. 3. (optional) make compiler use library-defined string type for string = literals.
 Plus, a combining character (such as an umlaut or accent) is part o=




 character, but may be a separate code point. If that's on the last
 character in the word such as fianc=C3=A9, then searching for fianc=




 result in a match without proper decoding!

Now if you are going to do real characters... If source/needle are normalized you still can avoid lots of work by searching without decoding. All you need to decode is one codepoint on each successful=



 match to see if there is a modifier at end of matched portion.
 But it depends on how you want to match if it's case-insensitive
 search it will be a lot more complicated, but anyway it boils down t=



 this:
 1) do inexact search, get likely match ( false positives are OK,
 negatives not) no decoding here
 2) once found check it (or parts of it) with proper decoding

 There are cultural subtleties, that complicate these steps if you ta=



 them into account, but it's doable.

I agree with you that simple searches using only byte (or dchar) comparison does not work, and can be optimized based on several facto=


 The easiest thing is to find a code unit sequence that only has one
 valid form, then search for that without decoding. Then when found,
 decode the characters around it. Or if that isn't possible, create al=


 the un-normalized forms for one grapheme (based on how likely it is t=


 occur), and search for one of those in the undecoded stream.

Yes, it's all revolves around find all of variations of a substring. If user is willing to spend some time upfront, this could be done in o=

 pass e.g. using the trick I employed in new std.regex.
 For a rough idea see ShiftOr string search,
 that is easy & fast inexact search:
 http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060

 There are even better variations of this stuff in the wild.

This sounds good. I'll have a look at it when I have time.
 This can all be built into a specialized string type. There's actuall=


 some really interesting problems to solve in this space I think. I've=


 created a basic string type that has lamented in my unfinished pile o=


 stuff to do. I think it can be done in a way that is close to as
 efficient as arrays for the most common operations (slicing, indexing=


 etc.), but is *correct* before it is efficient. You should always be
 able to drop into "array mode" and deal with the code-units.

Here is where we probably differ, to me there is no simple safe fast =

 string type. Why? Because if you need anything "not pedantic and safe =

 above all" you work with implicit data type of e.g. "a chunk of UTF-8 =

 bytes that are normalized in NFC", ...

Specialized types which dictate normalization could be created. We do n= ot = have to always assume the worst if we are optimizing. But in general, when you read a utf-8 file, it could have anything in it= . = It may not even be normalized.
 I'd rather see them as sealed array with a bunch of meta info, and  =

 string functions to specialize on them in order to do some cool speed =

 hacks. Of course, drilling down to row data should be possible for the=

 user as well.

This is exactly what I am envisioning ;) Except I'd build the meta-info= = into the type.
 Or if fianc=C3=A9 uses a
 precomposed =C3=A9, it won't match. So two valid representations of=




 word
 either match or they don't. It's just a complete mess without prope=




 unicode decoding.

It's a complete mess even with proper decoding ;)

Yes, all the more reason to solve the problem correctly so the haples=


 unicode novice user doesn't have to!

Hapless novice unicode user don't stand a chance. I'm serious. One needs to know a lot of these "basics" to e.g. know wh=

 indexing by "true" character could be so slow, about encodings,  =

 normalizations etc. Yet we can save a metric ton of special cased crap=

 from ever hitting his eyes.

I disagree. 99% of the time I use strings, I don't care about indexes. = I = just want to deal with whole strings and substrings. I rarely use = arbitrary indexes, and when I do, I'm assuming ascii data. The hapless novice unicode user doesn't need to know unicode to do: writefln("hello, world!"); Or to do: if(find(str, "xyz")) {...} Or to even use regex! -Steve
Oct 26 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas  =

<simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer  =

 <schveiguy yahoo.com> wrote:

 Plus, a combining character (such as an umlaut or accent) is part of =


 character, but may be a separate code point.

If this is correct (and it is), then decoding to dchar is simply not =

 enough.
 You seem to advocate decoding to graphemes, which is a whole different=

 matter.

I am advocating that. And it's a matter of perception. D can say "we = only support code-point decoding" and what that means to a user is, "we = = don't support language as you know it." Sure it's a part of unicode, bu= t = it takes that extra piece to make it actually usable to people who requi= re = unicode. Even in English, fianc=C3=A9 has an accent. To say D supports unicode, = but = then won't do a simple search on a file which contains a certain *valid*= = encoding of that word is disingenuous to say the least. D needs a fully unicode-aware string type. I advocate D should use it a= s = the default string type, but it needs one whether it's the default or no= t = in order to say it supports unicode. -Steve
Oct 26 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 26 Oct 2011 10:04:08 -0400, Michel Fortin  =

<michel.fortin michelf.com> wrote:

 On 2011-10-26 11:50:32 +0000, "Steven Schveighoffer"  =

 <schveiguy yahoo.com> said:

 It's even easier than this:
  a) you want to do a proper string comparison not knowing what state =


 unicode strings are in, use the full-fledged decode-when-needed strin=


 type, and its associated str.find method.
 b) you know they are both the same normalized form and want to optimi=


 use std.algorithm.find(haystack.asArray, needle.asArray).

Well, treating the string as an array of dchar doesn't work in the =

 general case, even with strings normalized the same way your fianc=C3=A9=

 example can break. So should never treat them as plain arrays unless I=

 sure I have no combining marks in the string.

If you have the same normalization, you would at least find all the vali= d = instances. You'd have to eliminate false positives, by decoding the nex= t = dchar after the match to see if it's a combining character. This would = be = a simple function to write. All I'm saying is, doing a search on the array should be available.
 I'm not opposed to having a new string type being developed, but I'm  =

 skeptical about its inclusion in the language. We already have three  =

 string types which can be assumed to contain valid UTF sequences. I  =

 think the first thing to do is not to develop a new string type, but t=

 develop the normalization and grapheme splitting algorithms, and those=

 to find a substring, using the existing char[], wchar[] and dchar[]  =

 types.  Then write a program with proper handling of Unicode using tho=

 and hand-optimize it. If that proves to be a pain (it might well be), =

 write a new string type, rewrite the program using it, do some  =

 benchmarks and then we'll know if it's a good idea, and will be able t=

 quantify the drawbacks.

 But right now, all this arguing for or against a new string type is  =

 stacking hypothesis against other hypothesis, it won't lead anywhere.

I have a half-implemented string type which does just this. I need to = finish it. There are some language barriers that need to be sorted out = = (such as how to deal with tail-const for arbitrary types). I think a user who no longer posts here (spir) had some full = implementation of unicode strings using classes, though I don't see that= = being comparable in performance. -Steve
Oct 26 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday, October 29, 2011 09:42:54 Andrei Alexandrescu wrote:
 On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
=20
 <simen.kjaras gmail.com> wrote:
 On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
=20
 <schveiguy yahoo.com> wrote:
 Plus, a combining character (such as an umlaut or accent) is part=




 a
 character, but may be a separate code point.

If this is correct (and it is), then decoding to dchar is simply n=



 enough.
 You seem to advocate decoding to graphemes, which is a whole diffe=



 matter.

I am advocating that. And it's a matter of perception. D can say "w=


 only support code-point decoding" and what that means to a user is,=


 don't support language as you know it." Sure it's a part of unicode=


 it takes that extra piece to make it actually usable to people who
 require unicode.
=20
 Even in English, fianc=C3=A9 has an accent. To say D supports unico=


 then won't do a simple search on a file which contains a certain *v=


 encoding of that word is disingenuous to say the least.

Why doesn't that simple search work? =20 foreach (line; stdin.byLine()) { if (line.canFind("fianc=C3=A9")) { writeln("There it is."); } }

If the strings aren't normalized the same way, then it might not find f= ianc=C3=A9. If=20 they _are_ normalized the same way and fianc=C3=A9 is in there except t= hat the =C3=A9 is=20 actually modified by another code point after it (e.g. a subscript of 2= - not=20 exactly likely in this case but certainly possible), then that string w= ould =20 be found when it shouldn't be. The bigger problem though, I think, is w= hen=20 you're searching for a string which is the same without the modifiers -= which=20 would be fiance in this case - since then if the modfiying code points = are=20 after, then find will think that it found the string that you were look= ing for=20 when it didn't. Once you're dealing with modifying code points, in the general case, yo= u=20 _must_ operate on the grapheme level to ensure that you find exactly wh= at=20 you're looking for and only what you're looking for. If we assume that = all=20 strings are normalized the same way and pick the right normalization fo= r it=20 (and provide a function to normalize strings that way of course), then = we=20 could probably make that work 100% of the time (assuming that there's a= =20 normalized form with all of the modifying code points being _before_ th= e code=20 point that we modify and that no modifying code point can be a characte= r on=20 its own), but I'd have to study up on it more to be sure. Regardless, while searching for fianc=C3=A9 has a decent chance of succ= ess=20 (especially if programs generall favor using single code points instea= d of=20 multiple code points wherever possible), it's still a risky proposition= =20 without at least doing unicode normalization if not outright using a ra= nge of=20 graphemes rather than code points. - Jonathan M Davis
Oct 29 2011
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 29 Oct 2011 10:42:54 -0400, Andrei Alexandrescu  =

<SeeWebsiteForEmail erdani.org> wrote:

 On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
 <simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 Plus, a combining character (such as an umlaut or accent) is part o=




 character, but may be a separate code point.

If this is correct (and it is), then decoding to dchar is simply not=



 enough.
 You seem to advocate decoding to graphemes, which is a whole differe=



 matter.

I am advocating that. And it's a matter of perception. D can say "we only support code-point decoding" and what that means to a user is, "=


 don't support language as you know it." Sure it's a part of unicode, =


 it takes that extra piece to make it actually usable to people who
 require unicode.

 Even in English, fianc=C3=A9 has an accent. To say D supports unicode=


 then won't do a simple search on a file which contains a certain *val=


 encoding of that word is disingenuous to say the least.

Why doesn't that simple search work? foreach (line; stdin.byLine()) { if (line.canFind("fianc=C3=A9")) { writeln("There it is."); } }

I think Jonathan answered that quite well, nothing else to add...
 D needs a fully unicode-aware string type. I advocate D should use it=


 the default string type, but it needs one whether it's the default or=


 not in order to say it supports unicode.

How do you define "supports Unicode"? For my money, the main sin of =

 (w)string is that it offers [] and .length with potentially confusing =

 semantics, so if I could I'd curb, not expand, its interface.

LOL, I'm so used to programming that I'm trying to figure out what the = meaning of sin(string) (as in sine) means :) I think there are two problems with [] and .length. First, that they = imply "get nth character" and "number of characters" respectively, and = second, that many times they *actually are* those things. So I agree with you the proposed string type needs to curb that interfac= e, = while giving us a fully character/grapheme aware interface (which is = currently lacking). I made an early attempt at doing this, and I will eventually get around = to = finishing it. I was in the middle of creating an algorithm to = as-efficiently-as-possible delineate a grapheme, when I got sidetracked = by = other things :) There are still lingering issues with the language whic= h = makes this a less-than-ideal replacement (arrays currently enjoy a lot o= f = "extra features" that custom types do not). -Steve
Oct 31 2011