digitalmars.D - Why the hell doesn't foreach decode strings

Martin Nowak (6/6) Oct 20 2011 It just took me over one hour to find out the unthinkable.

Peter Alexander (18/24) Oct 20 2011 D has got itself into a tricky situation in this regard. Doing it either...

Walter Bright (5/8) Oct 20 2011 There is no 'fixing' it, even to break compatibility. Sometimes you want...

Adam D. Ruppe (4/6) Oct 20 2011 Well, they could always cast it to a ubyte[].
Jonathan M Davis (9/20) Oct 20 2011 True, but if the default were dchar, then the common case would be have ...

Walter Bright (8/16) Oct 20 2011 Is that really the common case? It's certainly the *slow* case. Common s...

Jonathan M Davis (52/74) Oct 20 2011 And why would you iterate over a string with foreach without decoding it...

Walter Bright (18/32) Oct 20 2011 No, it doesn't. If I'm searching for a dchar, I'll be searching for a su...

Martin Nowak (11/52) Oct 21 2011 You have a good point here. I would have immediately thrown out the loop...

Walter Bright (3/9) Oct 21 2011 I understand the issue, but I don't think it's resolvable. It's a lot li...

Andrei Alexandrescu (17/29) Oct 21 2011 It is resolvable, just not without breaking compatibility. Latching on

bearophile (11/20) Oct 21 2011 Maybe I have missed your precedent explanations of this idea. Or to me t...

Jacob Carlborg (6/20) Oct 22 2011 Can't we implement a new string type that people can choose to use if

Timon Gehr (4/23) Oct 22 2011 Having multiple standard string types is bad. Furthermore, it is hard to...

Michel Fortin (35/70) Oct 20 2011 It works for non-ASCII too. You're probably missing an interesting

Jonathan M Davis (14/14) Oct 20 2011 Actually, I'd have to say that having foreach default to iterating over ...

Walter Bright (3/7) Oct 20 2011 Making such a string type would be terribly inefficient. It would make D...

Steven Schveighoffer (4/15) Oct 21 2011 I don't think it would. Do you have any proof to support this?

Walter Bright (3/6) Oct 21 2011 I've done string processing code, and done a lot of profiling of them. E...

Andrei Alexandrescu (4/11) Oct 21 2011 The key here is to allow people to easily either decode strings or not,
Steven Schveighoffer (9/17) Oct 24 2011 What I mean is, default to a well-built string type, and let people who ...

travert phare.normalesup.org (Christophe Travert) (18/26) Oct 28 2011 I definitely agree with you, but I have a piece of news for you :

Andrei Alexandrescu (7/23) Oct 28 2011 It was to allow people who know Unicode to use strings without too much

Norbert Nemec (16/20) Oct 24 2011 Indeed - expressing strings as arrays of characters will always fall

Peter Alexander (23/33) Oct 21 2011 Then separate those ways of viewing strings.

Jonathan M Davis (6/48) Oct 21 2011 In another post in this thread, Walter said in reference to post on

so (8/14) Oct 21 2011 Well he is right, the reason built-in strings everywhere and rarely (so ...

bearophile (5/6) Oct 21 2011 We should aim to something *better* than dynamic languages, where possib...

so (11/21) Oct 21 2011 With spoiled i didn't mean those languages do it better in general
Martin Nowak (18/29) Oct 21 2011 =

Peter Alexander (2/17) Oct 22 2011 Which operations do you believe would be less efficient?

Walter Bright (3/4) Oct 22 2011 All of the ones that don't require decoding, such as searching, would be...

kennytm (3/8) Oct 22 2011 You can std.algorithm.find to do searching, not foreach. The former can

Timon Gehr (2/10) Oct 22 2011 Afaics the current std.algorithm.find implementation decodes its argumen...

Andrei Alexandrescu (5/16) Oct 22 2011 That can be easily fixed. Currently single-element find does decoding

Timon Gehr (4/22) Oct 22 2011 Ok, I actually did not see that, thanks. However this is usually still

Andrei Alexandrescu (4/7) Oct 22 2011 I understand. For some reason I'm not seeing the URL of your pull

Steven Schveighoffer (11/15) Oct 24 2011 Searching that does not do decoding is fundamentally incorrect. That is...

Simen Kjaeraas (7/17) Oct 24 2011 Assuming both string are valid UTF-8, you can. Continuation bytes can ne...

Steven Schveighoffer (19/38) Oct 24 2011 d =

Dmitry Olshansky (19/54) Oct 24 2011 Assuming language support stays on stage of "codepoint is a character"

Steven Schveighoffer (51/105) Oct 24 2011 t

Walter Bright (4/6) Oct 24 2011 There are different levels of Unicode support. D operates at the most ba...
Timon Gehr (4/95) Oct 24 2011 The answer is probably to get proper unicode normalization into Phobos.
Dmitry Olshansky (37/127) Oct 24 2011 Last time I checked, the legion of "everything is ASCII" zealots was

Steven Schveighoffer (80/218) Oct 26 2011 hat

Dmitry Olshansky (21/146) Oct 26 2011 That's it.

travert phare.normalesup.org (Christophe) (14/18) Oct 28 2011 string being seen as range of codepoint (dchar) is already aweful

Simen Kjaeraas (8/10) Oct 24 2011 If this is correct (and it is), then decoding to dchar is simply not

Steven Schveighoffer (23/31) Oct 26 2011 a

Andrei Alexandrescu (11/34) Oct 29 2011 Why doesn't that simple search work?

Jonathan M Davis (53/86) Oct 29 2011 ot
Steven Schveighoffer (31/67) Oct 31 2011 f a

Andrei Alexandrescu (3/11) Oct 31 2011 I see this as a decoding primitive issue that can be resolved.

Walter Bright (5/14) Oct 24 2011 Sure you can. A Unicode character is a string, a Unicode string is a str...

Steven Schveighoffer (7/25) Oct 24 2011 d =

Michel Fortin (30/32) Oct 24 2011 The more I think about it, the more I think it should work like this:

Steven Schveighoffer (22/48) Oct 26 2011 .

Michel Fortin (23/30) Oct 26 2011 Well, treating the string as an array of dchar doesn't work in the

Steven Schveighoffer (29/54) Oct 26 2011 the

Michel Fortin (15/23) Oct 26 2011 And all I'm saying is that such a standalone function that eliminates

so (6/10) Oct 21 2011 Indeed, if one knows/understands how unicode works AND knows the structu...

Jacob Carlborg (5/23) Oct 20 2011 The index could skip certain indexes to make that assert pass. But that

so (7/13) Oct 21 2011 I really can't believe why people expect that.
Timon Gehr (3/9) Oct 21 2011 I actually like the current behaviour.
Alvaro (14/20) Oct 22 2011 Maybe I didn't fully get your point, but, you do know that you can do

Daniel Gibson (5/17) Oct 22 2011 One visible Unicode character can consists of several dchars.

"Martin Nowak" <dawg dawgfoto.de> writes:

It just took me over one hour to find out the unthinkable.
foreach(c; str) will deduce c to immutable(char) and doesn't care about  
unicode.
Now there is so many unicode transcoding happening in the language that it  
starts to get annoying,
but the most basic string iteration doesn't support it by default?

Oct 20 2011

Peter Alexander <peter.alexander.au gmail.com> writes:

On 20/10/11 8:37 PM, Martin Nowak wrote:
 It just took me over one hour to find out the unthinkable.
 foreach(c; str) will deduce c to immutable(char) and doesn't care about
 unicode.
 Now there is so many unicode transcoding happening in the language that
 it starts to get annoying,
 but the most basic string iteration doesn't support it by default?

D has got itself into a tricky situation in this regard. Doing it either 
way introduces an unintuitive mess.

The way it is now, you get the problem that you just described where 
foreach is unaware of Unicode.

If you changed it to loop as Unicode, then indices won't match up:

immutable(int)[] a = ...
foreach (x, i; a)
     assert(x == a[i]); // ok

immutable(char)[] b = ...
foreach (x, i; b)
     assert(x == b[i]); // not necessarily!

Also, the loop won't necessarily iterate b.length times. There's 
inconsistencies all over the place.

The whole mess is caused by conflating the idea of an array with a 
variable length encoding that happens to use an array for storage. I 
don't believe there is any clean and tidy way to fix the problem without 
breaking compatibility.

Oct 20 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 10/20/2011 2:49 PM, Peter Alexander wrote:
 The whole mess is caused by conflating the idea of an array with a variable
 length encoding that happens to use an array for storage. I don't believe there
 is any clean and tidy way to fix the problem without breaking compatibility.

There is no 'fixing' it, even to break compatibility. Sometimes you want to
look 
at an array of utf8 as 8 bit characters, and sometimes as 20 bit dchars.
Someone 
will be dissatisfied no matter what.

There is no way to program strings in D without being aware of UTF-8 encoding.

Oct 20 2011

Adam D. Ruppe <destructionator gmail.com> writes:

Walter Bright wrote:
 Sometimes you want to look at an array of utf8 as 8 bit characters,
 and sometimes as 20 bit dchars.

Well, they could always cast it to a ubyte[].

But on the other hand, you can just ask for dchar now.

So yeah.

Oct 20 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, October 20, 2011 19:26:43 Walter Bright wrote:
 On 10/20/2011 2:49 PM, Peter Alexander wrote:
 The whole mess is caused by conflating the idea of an array with a
 variable length encoding that happens to use an array for storage. I
 don't believe there is any clean and tidy way to fix the problem
 without breaking compatibility.

 There is no 'fixing' it, even to break compatibility. Sometimes you want to
 look at an array of utf8 as 8 bit characters, and sometimes as 20 bit
 dchars. Someone will be dissatisfied no matter what.
 
 There is no way to program strings in D without being aware of UTF-8
 encoding.

True, but if the default were dchar, then the common case would be have fewer 
bugs (still allowing you to explicitly use char or wchar when you want to). At 
minimum, I think that it would be a good idea to implement 
http://d.puremagic.com/issues/show_bug.cgi?id=6652 and make it a warning not 
to explicitly give the type with foreach for arrays of char or wchar. It would 
catch bugs without changing the behavior of any existing code, and it still 
allows you to iterate over either code units or code points.

- Jonathan M Davis

Oct 20 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 10/20/2011 7:37 PM, Jonathan M Davis wrote:
 True, but if the default were dchar, then the common case would be have fewer
 bugs

Is that really the common case? It's certainly the *slow* case. Common string 
operations like searching, copying, etc., do not require decoding.

 (still allowing you to explicitly use char or wchar when you want to). At
 minimum, I think that it would be a good idea to implement
 http://d.puremagic.com/issues/show_bug.cgi?id=6652 and make it a warning not
 to explicitly give the type with foreach for arrays of char or wchar. It would
 catch bugs without changing the behavior of any existing code, and it still
 allows you to iterate over either code units or code points.

I like the type deduction feature of foreach, and don't think it should be 
removed for strings. Currently, it's consistent - T[] gets an element type of T.

I want to reiterate that there's no way to program strings in D without being 
cognizant of them being a multibyte representation. D is both a high level and
a 
low level language, and you can pick which to use, but you still gotta pick.

Oct 20 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, October 20, 2011 20:37:40 Walter Bright wrote:
 On 10/20/2011 7:37 PM, Jonathan M Davis wrote:
 True, but if the default were dchar, then the common case would be have
 fewer bugs

 
 Is that really the common case? It's certainly the *slow* case. Common
 string operations like searching, copying, etc., do not require decoding.

And why would you iterate over a string with foreach without decoding it 
unless you specifically need to operate on code units (which I would expect to 
be _very_ rare)? Sure, copying doesn't require decoding, but searching sure 
does (unless you're specifically looking for a code unit rather than a code 
point, which would not be normal). Most anything which needs to operate on the 
characters of a string needs to decode them. And iterating over them to do 
much of anything would require decoding, since otherwise you're operating on 
code units, and how often does anyone do that unless they're specifically 
messing around with character encodings?

Sure, if you _know_ that you're dealing with a string with only ASCII, it's 
faster to just iterate over chars, but then you can explicitly give the type 
of the foreach variable as char, but normally what people care about is 
iterating over characters, not pieces of characters. So, I would expect the 
case where people _want_ to iterate over chars to be rare. In most cases, 
iterating over a string as chars is a bug - one which in many cases won't be 
quickly caught, because the programmer is English speaking and uses almost 
exclusively ASCII for whatever testing that they do.

The default for string handling really should be to treat them as ranges of 
dchar but still make it easy for them to be treated as arrays of code units 
when necessary. There's plenty of code in Phobos which is able to special case 
strings and make operating on them more efficient when it's not necessary to 
operate on them as ranges of dchar or when decoding the string explicitly with 
functions such as stride. But the default is still to operate on them as 
ranges of dchar, because that is what is normally correct. Defaulting to the 
guaranteed correct handling of characters and special casing when it's 
possible to write code more efficiently than that is definitely the way to go 
about it, and it's how Phobos generally does it. The fact that foreach doesn't 
is incongruous with how strings are handled in most other cases.

 (still allowing you to explicitly use char or wchar when you want to).
 At
 minimum, I think that it would be a good idea to implement
 http://d.puremagic.com/issues/show_bug.cgi?id=6652 and make it a warning
 not to explicitly give the type with foreach for arrays of char or
 wchar. It would catch bugs without changing the behavior of any
 existing code, and it still allows you to iterate over either code
 units or code points.

 
 I like the type deduction feature of foreach, and don't think it should be
 removed for strings. Currently, it's consistent - T[] gets an element type
 of T.

Sure, the type deduction of foreach is great, and it's completely consistent 
that iterating over an array of chars would iterate over chars rather than 
dchars when you don't give the type. However, in most cases, that is _not_ 
what the programmer actually wants. They want to iterate over characters, not 
pieces of characters. So, the default at this point is _wrong_ in the common 
case. As such, I'm very leery of any code which uses foreach over a string 
without specifying the iteration type. And in fact, unless the code is clearly 
intended to operate on code units, I would expect a comment indicating that 
the use of char instead of dchar was intentional, or I'd still consider it 
likely that it's a bug and a mistake on the programmer's part (likely due to a 
misunderstanding of unicode and how D deals with it).

 I want to reiterate that there's no way to program strings in D without
 being cognizant of them being a multibyte representation. D is both a high
 level and a low level language, and you can pick which to use, but you
 still gotta pick.

I fully agree that programmers need to properly understand unicode to use 
strings in D properly. However, the problem is that the default handling of 
strings with foreach is _not_ what programmers are going to normally want, so 
the default will cause bugs. If strings defaulted to iterating as ranges of 
dchar, or if programmers had to say what type they wanted to iterate over when 
dealing with strings (or at least got a warning if they didn't), then there 
would be fewer bugs. Pretty much every time that the use of strings with 
foreach comes up on this list, most everyone agrees that it's a wart in the 
language that the default is to iterate over chars rather than dchars. Not 
everyone agrees on the best way to fix the problem, but most everyone agrees 
that it _is_ a problem.

- Jonathan M Davis

Oct 20 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 10/20/2011 8:58 PM, Jonathan M Davis wrote:
 And why would you iterate over a string with foreach without decoding it
 unless you specifically need to operate on code units (which I would expect to
 be _very_ rare)? Sure, copying doesn't require decoding, but searching sure
 does

No, it doesn't. If I'm searching for a dchar, I'll be searching for a substring 
in the UTF-8 string. It's far, FAR more efficient to search as a substring 
rather than decoding while searching.

Even more, 99.9999% of searches involve an ascii search string. It is simply
not 
necessary to decode the searched string, as encoded chars cannot be ascii. For 
example:

    foreach (c; somestring)
          if (c == '+')
		found it!

gains absolutely nothing by decoding somestring.


 (unless you're specifically looking for a code unit rather than a code
 point, which would not be normal). Most anything which needs to operate on the
 characters of a string needs to decode them. And iterating over them to do
 much of anything would require decoding, since otherwise you're operating on
 code units, and how often does anyone do that unless they're specifically
 messing around with character encodings?

What you write sounds intuitively correct, but in my experience writing Unicode 
processing code, it simply isn't true. One rarely needs to decode.


 However, in most cases, that is _not_
 what the programmer actually wants. They want to iterate over characters, not
 pieces of characters. So, the default at this point is _wrong_ in the common
 case.

This is simply not my experience when working with Unicode. Performance takes a 
big hit when one structures an algorithm to require decoding/encoding. Doing
the 
algorithm using substrings is a huge win.

Take a look at dmd's lexer, it handles Unicode correctly and avoids doing 
decoding as much as possible.

Oct 20 2011

"Martin Nowak" <dawg dawgfoto.de> writes:

On Fri, 21 Oct 2011 06:39:56 +0200, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 10/20/2011 8:58 PM, Jonathan M Davis wrote:
 And why would you iterate over a string with foreach without decoding it
 unless you specifically need to operate on code units (which I would  
 expect to
 be _very_ rare)? Sure, copying doesn't require decoding, but searching  
 sure
 does

 No, it doesn't. If I'm searching for a dchar, I'll be searching for a  
 substring in the UTF-8 string. It's far, FAR more efficient to search as  
 a substring rather than decoding while searching.

 Even more, 99.9999% of searches involve an ascii search string. It is  
 simply not necessary to decode the searched string, as encoded chars  
 cannot be ascii. For example:

     foreach (c; somestring)
           if (c == '+')
 		found it!

 gains absolutely nothing by decoding somestring.


 (unless you're specifically looking for a code unit rather than a code
 point, which would not be normal). Most anything which needs to operate  
 on the
 characters of a string needs to decode them. And iterating over them to  
 do
 much of anything would require decoding, since otherwise you're  
 operating on
 code units, and how often does anyone do that unless they're  
 specifically
 messing around with character encodings?

 What you write sounds intuitively correct, but in my experience writing  
 Unicode processing code, it simply isn't true. One rarely needs to  
 decode.


 However, in most cases, that is _not_
 what the programmer actually wants. They want to iterate over  
 characters, not
 pieces of characters. So, the default at this point is _wrong_ in the  
 common
 case.

 This is simply not my experience when working with Unicode. Performance  
 takes a big hit when one structures an algorithm to require  
 decoding/encoding. Doing the algorithm using substrings is a huge win.

 Take a look at dmd's lexer, it handles Unicode correctly and avoids  
 doing decoding as much as possible.

You have a good point here. I would have immediately thrown out the loop  
AFTER profiling.
What hits me here is that I had an incorrect program with built-in unicode  
aware strings.
This is counterintuitive to correct unicode handling throughout the std  
library,
and even more to the complementary operation of appending any char type to  
strings.

martin

Oct 21 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 10/21/2011 2:51 AM, Martin Nowak wrote:
 You have a good point here. I would have immediately thrown out the loop AFTER
 profiling.
 What hits me here is that I had an incorrect program with built-in unicode
aware
 strings.
 This is counterintuitive to correct unicode handling throughout the std
library,
 and even more to the complementary operation of appending any char type to
strings.

I understand the issue, but I don't think it's resolvable. It's a lot like the 
signed/unsigned issue. Java got rid of it by simply not having any unsigned
types.

Oct 21 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/21/11 1:38 PM, Walter Bright wrote:
 On 10/21/2011 2:51 AM, Martin Nowak wrote:
 You have a good point here. I would have immediately thrown out the
 loop AFTER
 profiling.
 What hits me here is that I had an incorrect program with built-in
 unicode aware
 strings.
 This is counterintuitive to correct unicode handling throughout the
 std library,
 and even more to the complementary operation of appending any char
 type to strings.

 I understand the issue, but I don't think it's resolvable.

It is resolvable, just not without breaking compatibility. Latching on 
the notion that a problem is unsolvable is highly nocive because it sets 
up the mind for failing to not only look for solutions, but also to see 
and understand them when they're in the open.

I said this a number of times, and I repeat: if we had the luxury of 
doing it over again, I'd disable random access and .length for char[], 
wchar[], and their qualified versions. For those types I would add a 
property .rep that yields respectively ubyte[], ushort[], and the 
appropriately-qualified variants.

This would shed the remaining awkwardnesses from a generally very 
elegant approach to string handling.

The loop issue would be trivial to solve. foreach (x; s) would iterate 
one dchar at a time, whereas foreach (x; s.rep) would iterate one ubyte 
or ushort at a time. There would be the ability to iterate foreach (ref 
x; s.rep) but not foreach (ref x; s).


Andrei

Oct 21 2011

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:

 Latching on the notion that a problem is unsolvable is highly nocive because
it sets 
 up the mind for failing to not only look for solutions, but also to see 
 and understand them when they're in the open.

Right. Experimental psychology has confirmed this some decades ago :-)


 I said this a number of times, and I repeat:

Maybe I have missed your precedent explanations of this idea. Or to me this
time it seems more clear and focused.


 if we had the luxury of doing it over again,

Unfortunately D3 language can't break too much backward compatibility :-|


 I'd disable random access and .length for char[], 
 wchar[], and their qualified versions. For those types I would add a 
 property .rep that yields respectively ubyte[], ushort[], and the 
 appropriately-qualified variants.

Good. But the need to know the length of a variable-sized string is common. So
I presume in such cases you have to use:
somestring.walkLength()
This is not too much bad, but it looks a bit long, and it requires an import.
So maybe a shorter named property function in the object module is needed:
somestring.wlength

What about slices? I do need to take slices of variable-length strings too.

Bye,
bearophile

Oct 21 2011

Jacob Carlborg <doob me.com> writes:

On 2011-10-21 20:38, Walter Bright wrote:
 On 10/21/2011 2:51 AM, Martin Nowak wrote:
 You have a good point here. I would have immediately thrown out the
 loop AFTER
 profiling.
 What hits me here is that I had an incorrect program with built-in
 unicode aware
 strings.
 This is counterintuitive to correct unicode handling throughout the
 std library,
 and even more to the complementary operation of appending any char
 type to strings.

 I understand the issue, but I don't think it's resolvable. It's a lot
 like the signed/unsigned issue. Java got rid of it by simply not having
 any unsigned types.

Can't we implement a new string type that people can choose to use if 
they want. It will hide all the Unicode details that has been brought up 
by this thread.

-- 
/Jacob Carlborg

Oct 22 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 10/22/2011 02:14 PM, Jacob Carlborg wrote:
 On 2011-10-21 20:38, Walter Bright wrote:
 On 10/21/2011 2:51 AM, Martin Nowak wrote:
 You have a good point here. I would have immediately thrown out the
 loop AFTER
 profiling.
 What hits me here is that I had an incorrect program with built-in
 unicode aware
 strings.
 This is counterintuitive to correct unicode handling throughout the
 std library,
 and even more to the complementary operation of appending any char
 type to strings.

 I understand the issue, but I don't think it's resolvable. It's a lot
 like the signed/unsigned issue. Java got rid of it by simply not having
 any unsigned types.

 Can't we implement a new string type that people can choose to use if
 they want. It will hide all the Unicode details that has been brought up
 by this thread.

Having multiple standard string types is bad. Furthermore, it is hard to 
meaningfully hide all the Unicode details. Not even immutable(dchar)[] 
necessarily encodes one character as one code unit.

Oct 22 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-10-21 03:58:50 +0000, Jonathan M Davis <jmdavisProg gmx.com> said:

 Sure, if you _know_ that you're dealing with a string with only ASCII, it's
 faster to just iterate over chars

It works for non-ASCII too. You're probably missing an interesting 
property of UTF encodings: if you want to search for a substring in a 
well-formed UTF sequence, you do not need to decode the bigger string, 
comparing the UTF-x code units of the substring with the UTF-x code 
units of the bigger string is plenty enough.

Similarly, if you're searching for the '�' code point in an UTF-8 
string, the most efficient way is to search the string for the two-byte 
UTF-8 sequence you would use to encode '�' (in other words, convert '�' 
to a string). Decoding the whole string is a wasteful process.


 Sure, if you _know_ that you're dealing with a string with only ASCII, it's
 faster to just iterate over chars, but then you can explicitly give the type
 of the foreach variable as char, but normally what people care about is
 iterating over characters, not pieces of characters.

If you want to iterate over what people consider characters, then you 
need to take into account combining marks that form multi-code-point 
graphemes. (You'll probably want to deal with unicode normalization 
too.) Treating code points as if they were characters is a 
misconception in the same way as treating UTF-16 code units as 
character is: both works most of the time but also fail in a number of 
cases.


 So, I would expect the
 case where people _want_ to iterate over chars to be rare. In most cases,
 iterating over a string as chars is a bug - one which in many cases won't be
 quickly caught, because the programmer is English speaking and uses almost
 exclusively ASCII for whatever testing that they do.

That's a real problem. But is treating everything as dchar the only 
solution to that problem?


 Defaulting to the
 guaranteed correct handling of characters and special casing when it's
 possible to write code more efficiently than that is definitely the way to go
 about it, and it's how Phobos generally does it.

Iterating on dchar is not guarantied to be correct, it only has 
significantly more chances of being correct.


 The fact that foreach doesn't
 is incongruous with how strings are handled in most other cases.

You could also argue that ranges are doing things the wrong way.


 I like the type deduction feature of foreach, and don't think it should be
 removed for strings. Currently, it's consistent - T[] gets an element type
 of T.

 
 Sure, the type deduction of foreach is great, and it's completely consistent
 that iterating over an array of chars would iterate over chars rather than
 dchars when you don't give the type. However, in most cases, that is _not_
 what the programmer actually wants. They want to iterate over characters, not
 pieces of characters.

I note that you keep confusing characters with code units.

 I want to reiterate that there's no way to program strings in D without
 being cognizant of them being a multibyte representation. D is both a high
 level and a low level language, and you can pick which to use, but you
 still gotta pick.

 
 I fully agree that programmers need to properly understand unicode to use
 strings in D properly. However, the problem is that the default handling of
 strings with foreach is _not_ what programmers are going to normally want, so
 the default will cause bugs.

That said I wouldn't expect most programmers understand Unicode. Giving 
them dchars by default won't eliminate bugs related to multi-code-point 
characters, but it'll likely eliminate bugs relating to multi-code-unit 
sequences. That could be a good start. I'd say choosing dchar is a 
practical compromise between the "characters by default" and "the type 
of the array by default", but it is neither of those ideals. How is 
that pragmatic trade-off going to fare a few years in the future? I'm a 
little skeptical that this is the ideal solution.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Oct 20 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

Actually, I'd have to say that having foreach default to iterating over char 
for string is a bit like if it defaulted to iterating over byte for int[]. 
Sure, it would work in some cases, but in the general case, it would be very 
wrong.

Yes, it's consistent with how arrays are normally iterated over to have 
foreach iterate over char for char[], which is why it's arguably not a good 
idea to make it default to dchar. But it's _wrong_ in most cases, so at least 
giving a warning when the programmer doesn't give an iteration type for 
foreach with an array of char or wchar would be a big help.

It's this very problem that leads some people to argue that string should be 
its own type which holds an array of code units (which can be accessed when 
needed) rather than doing what we do now where we try and treat a string as 
both an array of chars and a range of dchars. The result is schizophrenic.

- Jonathan M Davis

Oct 20 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 10/20/2011 9:06 PM, Jonathan M Davis wrote:
 It's this very problem that leads some people to argue that string should be
 its own type which holds an array of code units (which can be accessed when
 needed) rather than doing what we do now where we try and treat a string as
 both an array of chars and a range of dchars. The result is schizophrenic.

Making such a string type would be terribly inefficient. It would make D 
completely uncompetitive for processing strings.

Oct 20 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 21 Oct 2011 00:43:20 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 10/20/2011 9:06 PM, Jonathan M Davis wrote:
 It's this very problem that leads some people to argue that string  
 should be
 its own type which holds an array of code units (which can be accessed  
 when
 needed) rather than doing what we do now where we try and treat a  
 string as
 both an array of chars and a range of dchars. The result is  
 schizophrenic.

 Making such a string type would be terribly inefficient. It would make D  
 completely uncompetitive for processing strings.

I don't think it would.  Do you have any proof to support this?

-Steve

Oct 21 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 10/21/2011 4:14 AM, Steven Schveighoffer wrote:
 Making such a string type would be terribly inefficient. It would make D
 completely uncompetitive for processing strings.

 I don't think it would. Do you have any proof to support this?

I've done string processing code, and done a lot of profiling of them. Every 
cycle is critical, and decoding adds a *lot* of cycles.

Oct 21 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/21/11 1:39 PM, Walter Bright wrote:
 On 10/21/2011 4:14 AM, Steven Schveighoffer wrote:
 Making such a string type would be terribly inefficient. It would make D
 completely uncompetitive for processing strings.

 I don't think it would. Do you have any proof to support this?

 I've done string processing code, and done a lot of profiling of them.
 Every cycle is critical, and decoding adds a *lot* of cycles.

The key here is to allow people to easily either decode strings or not, 
without defaulting to an error-prone choice.

Andrei

Oct 21 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 21 Oct 2011 14:39:58 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 10/21/2011 4:14 AM, Steven Schveighoffer wrote:
 Making such a string type would be terribly inefficient. It would make  
 D
 completely uncompetitive for processing strings.

 I don't think it would. Do you have any proof to support this?

 I've done string processing code, and done a lot of profiling of them.  
 Every cycle is critical, and decoding adds a *lot* of cycles.

What I mean is, default to a well-built string type, and let people who  
want to deal with arrays of code-units deal with arrays of code-units.   
This schizophrenic view phobos has of char[] arrays as not being arrays is  
horrendous to work with.

For my usage, I almost never iterate over string characters or graphemes,  
I just pass strings.

-Steve

Oct 24 2011

travert phare.normalesup.org (Christophe Travert) writes:

Walter Bright , dans le message (digitalmars.D:147161), a �crit�:
 On 10/20/2011 9:06 PM, Jonathan M Davis wrote:
 It's this very problem that leads some people to argue that string should be
 its own type which holds an array of code units (which can be accessed when
 needed) rather than doing what we do now where we try and treat a string as
 both an array of chars and a range of dchars. The result is schizophrenic.

 
 Making such a string type would be terribly inefficient. It would make D 
 completely uncompetitive for processing strings.

I definitely agree with you, but I have a piece of news for you :
The whole phobos alreaday treats strings as a dchar ranges, and IS 
inefficient for processing strings.

The fact is: char[] is not char[] in phobos. It is Range!dchar. This is 
aweful and schizophrenic, and also inefficient. The purpose was to allow 
people to use strings without knowing anything about unicode. That is 
why Jonathan proposed having a specific string structure to manipulate 
strings without having to worry about unicode, just like how strings are 
currently manipulated in phobos, and letting char[] be char[], and be as 
efficient as they should be.

I was not there when it was decided to treat strings as dchar ranges, 
but now it is done. The only thing I can do is use ubyte[] instead of 
char[] so that phobos treat them as proper arrays, and propose optimized 
overloads for various phobos algorithm to make them as efficient as they 
should be (which I didn't find the time to do yet).

-- 
Christophe

Oct 28 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/28/11 4:41 AM, Christophe Travert wrote:
 I definitely agree with you, but I have a piece of news for you :
 The whole phobos alreaday treats strings as a dchar ranges, and IS
 inefficient for processing strings.

 The fact is: char[] is not char[] in phobos. It is Range!dchar. This is
 aweful and schizophrenic, and also inefficient. The purpose was to allow
 people to use strings without knowing anything about unicode.

It was to allow people who know Unicode to use strings without too much 
arcana.

 That is
 why Jonathan proposed having a specific string structure to manipulate
 strings without having to worry about unicode, just like how strings are
 currently manipulated in phobos, and letting char[] be char[], and be as
 efficient as they should be.

No need. Use 
http://www.digitalmars.com/d/2.0/phobos/std_string.html#representation

 I was not there when it was decided to treat strings as dchar ranges,
 but now it is done. The only thing I can do is use ubyte[] instead of
 char[] so that phobos treat them as proper arrays, and propose optimized
 overloads for various phobos algorithm to make them as efficient as they
 should be (which I didn't find the time to do yet).

What would you have done if you were there? What is your design of choice?


Andrei

Oct 28 2011

Norbert Nemec <Norbert Nemec-online.de> writes:

On 21.10.2011 06:06, Jonathan M Davis wrote:
 It's this very problem that leads some people to argue that string should be
 its own type which holds an array of code units (which can be accessed when
 needed) rather than doing what we do now where we try and treat a string as
 both an array of chars and a range of dchars. The result is schizophrenic.

Indeed - expressing strings as arrays of characters will always fall 
short of the unicode concept in some way. A true unicode-compliant 
languages have to handle strings as opaque objects that do not have any 
encoding. There is a number of operations that can be done with these 
objects (concatenation, comparison, searching, etc.). Any kind of 
defined memory representation can only be obtained by an explicit 
encoding operation.

Python3, for example, did a fundamental step by introducing this 
fundamental distinction. At first it seems silly, having to think about 
encodings so often when writing trivial code. After a short while, the 
strict conceptual separation between unencoded "strings" and encoded 
"arrays of something" really helps avoiding ugly problems.

Sure, for a performance-critical language, the issue becomes a lot 
trickier. I still think it is possible and ultimately the only way to 
solve tricky problems that will otherwise always crop up somewhere.

Oct 24 2011

Peter Alexander <peter.alexander.au gmail.com> writes:

On 21/10/11 3:26 AM, Walter Bright wrote:
 On 10/20/2011 2:49 PM, Peter Alexander wrote:
 The whole mess is caused by conflating the idea of an array with a
 variable
 length encoding that happens to use an array for storage. I don't
 believe there
 is any clean and tidy way to fix the problem without breaking
 compatibility.

 There is no 'fixing' it, even to break compatibility. Sometimes you want
 to look at an array of utf8 as 8 bit characters, and sometimes as 20 bit
 dchars. Someone will be dissatisfied no matter what.

Then separate those ways of viewing strings.

Here's one solution that I believe would satisfy everyone:

1. Remove the string, wstring and dstring aliases. An array of char 
should be an array of char, i.e. the same as array of byte. Same for 
arrays of wchar and dchar. This way, arrays of T have no subtle 
differences for certain kinds of T.

2. Add string, wstring and dstring structs with the following interface:

  a. foreach should iterate as dchar.
  b.  property front() would be dchar.
  c.  property length() would not exist.
  d.  property buffer() returns the underlying immutable array of char, 
wchar etc.
  e. Remove opIndex and co.

What this does:
- Makes all array types consistent and intuitive.
- Makes looping over strings do the expected thing.
- Provides an interface to the underlying 8-bit chars for those that 
want it.


Of course, people will still need to understand UTF-8. I don't think 
that's a problem. It's unreasonable to expect the language to do the 
thinking for you. The problem is that we have people that *do* 
understand UTF-8 (like the OP), but *don't* understand D's strings.

Oct 21 2011

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Friday, October 21, 2011 11:11 Peter Alexander wrote:
 On 21/10/11 3:26 AM, Walter Bright wrote:
 On 10/20/2011 2:49 PM, Peter Alexander wrote:
 The whole mess is caused by conflating the idea of an array with a
 variable
 length encoding that happens to use an array for storage. I don't
 believe there
 is any clean and tidy way to fix the problem without breaking
 compatibility.

 
 There is no 'fixing' it, even to break compatibility. Sometimes you want
 to look at an array of utf8 as 8 bit characters, and sometimes as 20 bit
 dchars. Someone will be dissatisfied no matter what.

 
 Then separate those ways of viewing strings.
 
 Here's one solution that I believe would satisfy everyone:
 
 1. Remove the string, wstring and dstring aliases. An array of char
 should be an array of char, i.e. the same as array of byte. Same for
 arrays of wchar and dchar. This way, arrays of T have no subtle
 differences for certain kinds of T.
 
 2. Add string, wstring and dstring structs with the following interface:
 
 a. foreach should iterate as dchar.
 b.  property front() would be dchar.
 c.  property length() would not exist.
 d.  property buffer() returns the underlying immutable array of char,
 wchar etc.
 e. Remove opIndex and co.
 
 What this does:
 - Makes all array types consistent and intuitive.
 - Makes looping over strings do the expected thing.
 - Provides an interface to the underlying 8-bit chars for those that
 want it.
 
 
 Of course, people will still need to understand UTF-8. I don't think
 that's a problem. It's unreasonable to expect the language to do the
 thinking for you. The problem is that we have people that *do*
 understand UTF-8 (like the OP), but *don't* understand D's strings.

In another post in this thread, Walter said in reference to post on 
essentially this idea: "Making such a string type would be terribly 
inefficient. It would make D completely uncompetitive for processing strings." 
Now, whether that's true is debatable, but that's his stance on the idea.

- Jonathan M Davis

Oct 21 2011

so <so so.so> writes:

On Fri, 21 Oct 2011 21:38:39 +0300, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:

 In another post in this thread, Walter said in reference to post on
 essentially this idea: "Making such a string type would be terribly
 inefficient. It would make D completely uncompetitive for processing  
 strings."
 Now, whether that's true is debatable, but that's his stance on the idea.

 - Jonathan M Davis

Well he is right, the reason built-in strings everywhere and rarely (so  
rare that i have never seen one yet) someone comes up with a better  
alternative.
IMO people are spoiled by dynamic languages and fail to see the strengths  
of D strings. Someone coming from C/C++ this is heaven, both for  
performance and flexibility.

Oct 21 2011

bearophile <bearophileHUGS lycos.com> writes:

so:

 IMO people are spoiled by dynamic languages

We should aim to something *better* than dynamic languages, where possible. If
you have to do certain things (even "slow" ones), there's no point in making
them harder than necessary in D.

Python is a good language, but it's not perfect, and in a new language I'd like
something even better. D contains several small things that are better than
Python (an many things that are worse than Python).

Bye,
bearophile

Oct 21 2011

so <so so.so> writes:

On Sat, 22 Oct 2011 02:28:28 +0300, bearophile <bearophileHUGS lycos.com>  
wrote:

 so:

 IMO people are spoiled by dynamic languages

 We should aim to something *better* than dynamic languages, where  
 possible. If you have to do certain things (even "slow" ones), there's  
 no point in making them harder than necessary in D.

 Python is a good language, but it's not perfect, and in a new language  
 I'd like something even better. D contains several small things that are  
 better than Python (an many things that are worse than Python).

 Bye,
 bearophile

With spoiled i didn't mean those languages do it better in general  
(expressiveness, maybe), what i meant is that they just work, without any  
effort from the programmer.
It may be all right for them and you could say it is what every language  
should do but in reality this is not the case for D like languages.
Not with the expense of efficiency on a low level task.

We could aim all we like but we can't beat dynamic languages that easily,  
if you don't care reaching or beating C (on every aspect, system access,  
performance...) you don't have much of a limit do you? :)

Oct 21 2011

"Martin Nowak" <dawg dawgfoto.de> writes:

On Sat, 22 Oct 2011 01:28:28 +0200, bearophile <bearophileHUGS lycos.com=
  =

wrote:

 so:

 IMO people are spoiled by dynamic languages

 We should aim to something *better* than dynamic languages, where  =

 possible. If you have to do certain things (even "slow" ones), there's=

  =

 no point in making them harder than necessary in D.

 Python is a good language, but it's not perfect, and in a new language=

  =

 I'd like something even better. D contains several small things that a=

re  =

 better than Python (an many things that are worse than Python).

 Bye,
 bearophile

Well the first thing I tried out was:


for c in "f#a
     print c

Which I still didn't get to run after:

  - reading SyntaxError: Non-ASCII character '\xe2' in file ./run.py on =
 =

line 1, but no encoding declared; see  =

http://www.python.org/peps/pep-0263.html for details
  - adding a BOM to the script
  - remove the shebang which collides with the BOM
  - writing unicode("f#a

martin

Oct 21 2011

Peter Alexander <peter.alexander.au gmail.com> writes:

On 21/10/11 11:03 PM, so wrote:
 On Fri, 21 Oct 2011 21:38:39 +0300, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 In another post in this thread, Walter said in reference to post on
 essentially this idea: "Making such a string type would be terribly
 inefficient. It would make D completely uncompetitive for processing
 strings."
 Now, whether that's true is debatable, but that's his stance on the idea.

 - Jonathan M Davis

 Well he is right, the reason built-in strings everywhere and rarely (so
 rare that i have never seen one yet) someone comes up with a better
 alternative.
 IMO people are spoiled by dynamic languages and fail to see the
 strengths of D strings. Someone coming from C/C++ this is heaven, both
 for performance and flexibility.

Which operations do you believe would be less efficient?

Oct 22 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

All of the ones that don't require decoding, such as searching, would be less 
efficient if decoding was done.

Oct 22 2011

kennytm <kennytm gmail.com> writes:

Walter Bright <newshound2 digitalmars.com> wrote:
 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

 
 All of the ones that don't require decoding, such as searching, would be
 less efficient if decoding was done.

You can std.algorithm.find to do searching, not foreach. The former can
decide whichever efficient method to use.

Oct 22 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 10/22/2011 09:37 PM, kennytm wrote:
 Walter Bright<newshound2 digitalmars.com>  wrote:
 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

 All of the ones that don't require decoding, such as searching, would be
 less efficient if decoding was done.

 You can std.algorithm.find to do searching, not foreach. The former can
 decide whichever efficient method to use.

Afaics the current std.algorithm.find implementation decodes its arguments.

Oct 22 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/22/11 3:05 PM, Timon Gehr wrote:
 On 10/22/2011 09:37 PM, kennytm wrote:
 Walter Bright<newshound2 digitalmars.com> wrote:
 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

 All of the ones that don't require decoding, such as searching, would be
 less efficient if decoding was done.

 You can std.algorithm.find to do searching, not foreach. The former can
 decide whichever efficient method to use.

 Afaics the current std.algorithm.find implementation decodes its arguments.

That can be easily fixed. Currently single-element find does decoding 
but substring find avoids it if possible:

https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm.d#L2819


Andrei

Oct 22 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 10/22/2011 10:42 PM, Andrei Alexandrescu wrote:
 On 10/22/11 3:05 PM, Timon Gehr wrote:
 On 10/22/2011 09:37 PM, kennytm wrote:
 Walter Bright<newshound2 digitalmars.com> wrote:
 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

 All of the ones that don't require decoding, such as searching,
 would be
 less efficient if decoding was done.

 You can std.algorithm.find to do searching, not foreach. The former can
 decide whichever efficient method to use.

 Afaics the current std.algorithm.find implementation decodes its
 arguments.

 That can be easily fixed. Currently single-element find does decoding
 but substring find avoids it if possible:

 https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm.d#L2819

Ok, I actually did not see that, thanks. However this is usually still 
not the most efficient implementation in case the first argument is 
string and the other is wstring/dstring.

Oct 22 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/22/11 5:32 PM, Timon Gehr wrote:
 Ok, I actually did not see that, thanks. However this is usually still
 not the most efficient implementation in case the first argument is
 string and the other is wstring/dstring.

I understand. For some reason I'm not seeing the URL of your pull 
request fixing that :o).

Andrei

Oct 22 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

 All of the ones that don't require decoding, such as searching, would be  
 less efficient if decoding was done.

Searching that does not do decoding is fundamentally incorrect.  That is,  
if you want to find a substring in a string, you cannot just compare chars.

But if you want to do the fundamentally incorrect thing (perhaps because  
you as the programmer know the limits of what the data may contain), then  
you should be able to do that for efficiency.  But to default to something  
incorrect and then claim it's for efficiency is almost laughable.

If I gave you an O(lg(n)) shortest path algorithm that did not always find  
the shortest path, would you agree it was an improvment over dijkstra's?

-Steve

Oct 24 2011

"Simen Kjaeraas" <simen.kjaras gmail.com> writes:

On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright  
 <newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

 All of the ones that don't require decoding, such as searching, would  
 be less efficient if decoding was done.

 Searching that does not do decoding is fundamentally incorrect.  That  
 is, if you want to find a substring in a string, you cannot just compare  
 chars.

Assuming both string are valid UTF-8, you can. Continuation bytes can never
be confused with the first byte of a code point, and the first byte always
identifies how many continuation bytes there should be.

-- 
   Simen

Oct 24 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
<simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer  =

 <schveiguy yahoo.com> wrote:

 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright  =


 <newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

 All of the ones that don't require decoding, such as searching, woul=



d  =

 be less efficient if decoding was done.

 Searching that does not do decoding is fundamentally incorrect.  That=


  =

 is, if you want to find a substring in a string, you cannot just  =


 compare chars.

 Assuming both string are valid UTF-8, you can. Continuation bytes can =

 =

 never
 be confused with the first byte of a code point, and the first byte  =

 always
 identifies how many continuation bytes there should be.

As others have pointed out in the past to me (and I thought as you did
once), the same characters can be encoded in *different ways*.  They mus=
t
be normalized to accurately compare.

Plus, a combining character (such as an umlaut or accent) is part of a
character, but may be a separate code point.  If that's on the last
character in the word such as fianc=C3=A9, then searching for fiance wil=
l
result in a match without proper decoding!  Or if fianc=C3=A9 uses a
precomposed =C3=A9, it won't match.  So two valid representations of the=
 word
either match or they don't.  It's just a complete mess without proper
unicode decoding.

-Steve

Oct 24 2011

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 24.10.2011 23:41, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
 <simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

 All of the ones that don't require decoding, such as searching,
 would be less efficient if decoding was done.

 Searching that does not do decoding is fundamentally incorrect. That
 is, if you want to find a substring in a string, you cannot just
 compare chars.

 Assuming both string are valid UTF-8, you can. Continuation bytes can
 never
 be confused with the first byte of a code point, and the first byte
 always
 identifies how many continuation bytes there should be.

 As others have pointed out in the past to me (and I thought as you did
 once), the same characters can be encoded in *different ways*. They must
 be normalized to accurately compare.

Assuming language support stays on stage of "codepoint is a character" 
it's totaly expected to ignore modifiers and compare identically 
normalized UTF without decoding. Yes, it risks to hit certain issues.

 Plus, a combining character (such as an umlaut or accent) is part of a
 character, but may be a separate code point. If that's on the last
 character in the word such as fiancé, then searching for fiance will
 result in a match without proper decoding!

Now if you are going to do real characters... If source/needle are 
normalized you still can avoid lots of work by searching without 
decoding. All you need to decode is one codepoint on each successful 
match to see if there is a modifier at end of matched portion.
But it depends on how you want to match if it's case-insensitive search 
it will be a lot more complicated, but anyway it boils down to this:
1) do inexact search, get likely match ( false positives are OK, 
negatives not) no decoding here
2) once found check it (or parts of it) with proper decoding

There are cultural subtleties, that complicate these steps if you take 
them into account, but it's doable.

Or if fiancé uses a
 precomposed é, it won't match. So two valid representations of the word
 either match or they don't. It's just a complete mess without proper
 unicode decoding.

It's a complete mess even with proper decoding ;)


 -Steve


-- 
Dmitry Olshansky

Oct 24 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky  =

<dmitry.olsh gmail.com> wrote:

 On 24.10.2011 23:41, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
 <simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

 All of the ones that don't require decoding, such as searching,
 would be less efficient if decoding was done.

 Searching that does not do decoding is fundamentally incorrect. Tha=




t
 is, if you want to find a substring in a string, you cannot just
 compare chars.

 Assuming both string are valid UTF-8, you can. Continuation bytes ca=



n
 never
 be confused with the first byte of a code point, and the first byte
 always
 identifies how many continuation bytes there should be.

 As others have pointed out in the past to me (and I thought as you di=


d
 once), the same characters can be encoded in *different ways*. They m=


ust
 be normalized to accurately compare.

 Assuming language support stays on stage of "codepoint is a character"=

  =

 it's totaly expected to ignore modifiers and compare identically  =

 normalized UTF without decoding. Yes, it risks to hit certain issues.

Again, the "risk" is that it fails to achieve the goal you ask of it!

D-language: Here, use this search algorithm, it works most of the time, =
 =

but may not work correctly in some cases.  If you run into one of those =
 =

cases, you'll have to run a specialized search algorithm for strings.
User: How do I know I hit one of those cases?
D-language: You'll have to run the specialized version to find out.
User: Why wouldn't I just run the specialized version first?
D-language: Well, because it's slower!
User: But don't I have to use both algorithms to make sure I find the da=
ta?
D-language: Only if you "care" about accuracy!

Call me ludicrous, but is this really what we want to push on someone as=
 a  =

"unicode-aware" language?

 Plus, a combining character (such as an umlaut or accent) is part of =


a
 character, but may be a separate code point. If that's on the last
 character in the word such as fianc=C3=A9, then searching for fiance =


will
 result in a match without proper decoding!

 Now if you are going to do real characters... If source/needle are  =

 normalized you still can avoid lots of work by searching without  =

 decoding. All you need to decode is one codepoint on each successful  =

 match to see if there is a modifier at end of matched portion.
 But it depends on how you want to match if it's case-insensitive searc=

h  =

 it will be a lot more complicated, but anyway it boils down to this:
 1) do inexact search, get likely match ( false positives are OK,  =

 negatives not) no decoding here
 2) once found check it (or parts of it) with proper decoding

 There are cultural subtleties, that complicate these steps if you take=

  =

 them into account, but it's doable.

I agree with you that simple searches using only byte (or dchar)  =

comparison does not work, and can be optimized based on several factors.=
   =

The easiest thing is to find a code unit sequence that only has one vali=
d  =

form, then search for that without decoding.  Then when found, decode th=
e  =

characters around it.  Or if that isn't possible, create all the  =

un-normalized forms for one grapheme (based on how likely it is to occur=
),  =

and search for one of those in the undecoded stream.

This can all be built into a specialized string type.  There's actually =
 =

some really interesting problems to solve in this space I think.  I've  =

created a basic string type that has lamented in my unfinished pile of  =

stuff to do.  I think it can be done in a way that is close to as  =

efficient as arrays for the most common operations (slicing, indexing,  =

etc.), but is *correct* before it is efficient.  You should always be ab=
le  =

to drop into "array mode" and deal with the code-units.

 Or if fianc=C3=A9 uses a
 precomposed =C3=A9, it won't match. So two valid representations of t=


he word
 either match or they don't. It's just a complete mess without proper
 unicode decoding.

 It's a complete mess even with proper decoding ;)

Yes, all the more reason to solve the problem correctly so the hapless  =

unicode novice user doesn't have to!

-Steve

Oct 24 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 10/24/2011 1:52 PM, Steven Schveighoffer wrote:
 Call me ludicrous, but is this really what we want to push on someone as a
 "unicode-aware" language?

There are different levels of Unicode support. D operates at the most basic 
level (forgot what it is called) where only recognition of the encodings is 
necessary.

Oct 24 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 10/24/2011 10:52 PM, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 On 24.10.2011 23:41, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
 <simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

 All of the ones that don't require decoding, such as searching,
 would be less efficient if decoding was done.

 Searching that does not do decoding is fundamentally incorrect. That
 is, if you want to find a substring in a string, you cannot just
 compare chars.

 Assuming both string are valid UTF-8, you can. Continuation bytes can
 never
 be confused with the first byte of a code point, and the first byte
 always
 identifies how many continuation bytes there should be.

 As others have pointed out in the past to me (and I thought as you did
 once), the same characters can be encoded in *different ways*. They must
 be normalized to accurately compare.

 Assuming language support stays on stage of "codepoint is a character"
 it's totaly expected to ignore modifiers and compare identically
 normalized UTF without decoding. Yes, it risks to hit certain issues.

 Again, the "risk" is that it fails to achieve the goal you ask of it!

 D-language: Here, use this search algorithm, it works most of the time,
 but may not work correctly in some cases. If you run into one of those
 cases, you'll have to run a specialized search algorithm for strings.
 User: How do I know I hit one of those cases?
 D-language: You'll have to run the specialized version to find out.
 User: Why wouldn't I just run the specialized version first?
 D-language: Well, because it's slower!
 User: But don't I have to use both algorithms to make sure I find the data?
 D-language: Only if you "care" about accuracy!

 Call me ludicrous, but is this really what we want to push on someone as
 a "unicode-aware" language?

 Plus, a combining character (such as an umlaut or accent) is part of a
 character, but may be a separate code point. If that's on the last
 character in the word such as fiancé, then searching for fiance will
 result in a match without proper decoding!

 Now if you are going to do real characters... If source/needle are
 normalized you still can avoid lots of work by searching without
 decoding. All you need to decode is one codepoint on each successful
 match to see if there is a modifier at end of matched portion.
 But it depends on how you want to match if it's case-insensitive
 search it will be a lot more complicated, but anyway it boils down to
 this:
 1) do inexact search, get likely match ( false positives are OK,
 negatives not) no decoding here
 2) once found check it (or parts of it) with proper decoding

 There are cultural subtleties, that complicate these steps if you take
 them into account, but it's doable.

 I agree with you that simple searches using only byte (or dchar)
 comparison does not work, and can be optimized based on several factors.
 The easiest thing is to find a code unit sequence that only has one
 valid form, then search for that without decoding. Then when found,
 decode the characters around it. Or if that isn't possible, create all
 the un-normalized forms for one grapheme (based on how likely it is to
 occur), and search for one of those in the undecoded stream.

 This can all be built into a specialized string type. There's actually
 some really interesting problems to solve in this space I think. I've
 created a basic string type that has lamented in my unfinished pile of
 stuff to do. I think it can be done in a way that is close to as
 efficient as arrays for the most common operations (slicing, indexing,
 etc.), but is *correct* before it is efficient. You should always be
 able to drop into "array mode" and deal with the code-units.

 Or if fiancé uses a
 precomposed é, it won't match. So two valid representations of the word
 either match or they don't. It's just a complete mess without proper
 unicode decoding.

 It's a complete mess even with proper decoding ;)

 Yes, all the more reason to solve the problem correctly so the hapless
 unicode novice user doesn't have to!

 -Steve

The answer is probably to get proper unicode normalization into Phobos. 
The efficient versions can then specify that they are guaranteed to work 
perfectly accurate on normalized input only.

Oct 24 2011

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 25.10.2011 0:52, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 On 24.10.2011 23:41, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
 <simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

 All of the ones that don't require decoding, such as searching,
 would be less efficient if decoding was done.

 Searching that does not do decoding is fundamentally incorrect. That
 is, if you want to find a substring in a string, you cannot just
 compare chars.

 Assuming both string are valid UTF-8, you can. Continuation bytes can
 never
 be confused with the first byte of a code point, and the first byte
 always
 identifies how many continuation bytes there should be.

 As others have pointed out in the past to me (and I thought as you did
 once), the same characters can be encoded in *different ways*. They must
 be normalized to accurately compare.

 Assuming language support stays on stage of "codepoint is a character"
 it's totaly expected to ignore modifiers and compare identically
 normalized UTF without decoding. Yes, it risks to hit certain issues.

 Again, the "risk" is that it fails to achieve the goal you ask of it!

Last time I checked, the legion of "everything is ASCII" zealots was 
pretty large ;) So the "goal" is a blury line, meaning that "search for 
a substring on in this chunk of unicode text" can mean very different 
things, and be correct in it's own sense on about 3 different levels.
There is no default way, so I'm not sure we have to embed all of this 
complexity into language right now. Phobos is were we must first 
implement this, once it works we may look at revisiting our pick of 
default comparison method, etc.

 D-language: Here, use this search algorithm, it works most of the time,
 but may not work correctly in some cases. If you run into one of those
 cases, you'll have to run a specialized search algorithm for strings.
 User: How do I know I hit one of those cases?
 D-language: You'll have to run the specialized version to find out.
 User: Why wouldn't I just run the specialized version first?
 D-language: Well, because it's slower!
 User: But don't I have to use both algorithms to make sure I find the data?
 D-language: Only if you "care" about accuracy!

 Call me ludicrous, but is this really what we want to push on someone as
 a "unicode-aware" language?

No, but we might want to fix a string search to do a little more - 
namely check if it skewed a graphem (assuming normalizations match). BTW 
adding normalization to std.uni is another thing to do right now. That 
should be a good enough start, and looking at unicode standard, things 
are rather fluid there meaning that "unicode-aware" language should be 
sufixed by version number :)

 Plus, a combining character (such as an umlaut or accent) is part of a
 character, but may be a separate code point. If that's on the last
 character in the word such as fiancé, then searching for fiance will
 result in a match without proper decoding!

 Now if you are going to do real characters... If source/needle are
 normalized you still can avoid lots of work by searching without
 decoding. All you need to decode is one codepoint on each successful
 match to see if there is a modifier at end of matched portion.
 But it depends on how you want to match if it's case-insensitive
 search it will be a lot more complicated, but anyway it boils down to
 this:
 1) do inexact search, get likely match ( false positives are OK,
 negatives not) no decoding here
 2) once found check it (or parts of it) with proper decoding

 There are cultural subtleties, that complicate these steps if you take
 them into account, but it's doable.

 I agree with you that simple searches using only byte (or dchar)
 comparison does not work, and can be optimized based on several factors.
 The easiest thing is to find a code unit sequence that only has one
 valid form, then search for that without decoding. Then when found,
 decode the characters around it. Or if that isn't possible, create all
 the un-normalized forms for one grapheme (based on how likely it is to
 occur), and search for one of those in the undecoded stream.

Yes, it's all revolves around find all of variations of a substring.
If user is willing to spend some time upfront, this could be done in one 
pass e.g. using the trick I employed in new std.regex.
For a rough idea see ShiftOr string search,
that is easy & fast inexact search:
http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060

There are even better variations of this stuff in the wild.

 This can all be built into a specialized string type. There's actually
 some really interesting problems to solve in this space I think. I've
 created a basic string type that has lamented in my unfinished pile of
 stuff to do. I think it can be done in a way that is close to as
 efficient as arrays for the most common operations (slicing, indexing,
 etc.), but is *correct* before it is efficient. You should always be
 able to drop into "array mode" and deal with the code-units.

Here is where we probably differ, to me there is no simple safe fast 
string type. Why? Because if you need anything "not pedantic and safe 
above all" you work with implicit data type of e.g. "a chunk of UTF-8 
bytes that are normalized in NFC", ...
I'd rather see them as sealed array with a bunch of meta info, and 
string functions to specialize on them in order to do some cool speed 
hacks. Of course, drilling down to row data should be possible for the 
user as well.

 Or if fiancé uses a
 precomposed é, it won't match. So two valid representations of the word
 either match or they don't. It's just a complete mess without proper
 unicode decoding.

 It's a complete mess even with proper decoding ;)

 Yes, all the more reason to solve the problem correctly so the hapless
 unicode novice user doesn't have to!


Hapless novice unicode user don't stand a chance.
I'm serious. One needs to know a lot of these "basics" to e.g. know why 
indexing by "true" character could be so slow, about encodings, 
normalizations etc. Yet we can save a metric ton of special cased crap 
from ever hitting his eyes.


-- 
Dmitry Olshansky

Oct 24 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 24 Oct 2011 18:25:25 -0400, Dmitry Olshansky  =

<dmitry.olsh gmail.com> wrote:

 On 25.10.2011 0:52, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 16:18:57 -0400, Dmitry Olshansky
 <dmitry.olsh gmail.com> wrote:

 On 24.10.2011 23:41, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 11:58:15 -0400, Simen Kjaeraas
 <simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

 All of the ones that don't require decoding, such as searching,
 would be less efficient if decoding was done.

 Searching that does not do decoding is fundamentally incorrect. T=






hat
 is, if you want to find a substring in a string, you cannot just
 compare chars.

 Assuming both string are valid UTF-8, you can. Continuation bytes =





can
 never
 be confused with the first byte of a code point, and the first byt=





e
 always
 identifies how many continuation bytes there should be.

 As others have pointed out in the past to me (and I thought as you =




did
 once), the same characters can be encoded in *different ways*. They=




  =

 must
 be normalized to accurately compare.

 Assuming language support stays on stage of "codepoint is a characte=



r"
 it's totaly expected to ignore modifiers and compare identically
 normalized UTF without decoding. Yes, it risks to hit certain issues=



.
 Again, the "risk" is that it fails to achieve the goal you ask of it!=



 Last time I checked, the legion of "everything is ASCII" zealots was  =

 pretty large ;) So the "goal" is a blury line, meaning that "search fo=

r  =

 a substring on in this chunk of unicode text" can mean very different =

 =

 things, and be correct in it's own sense on about 3 different levels.
 There is no default way, so I'm not sure we have to embed all of this =

 =

 complexity into language right now. Phobos is were we must first  =

 implement this, once it works we may look at revisiting our pick of  =

 default comparison method, etc.

If we say D obeys the subset of unicode that only works on ascii  =

characters, well, then I think we do not support unicode at all ;)

I agree the string type needs to be fleshed out before the language adop=
ts  =

it.  I think we could legitimately define a string type that auto-assign=
s  =

 from it's respective char[] array type.  Once it's shown that the type =
is  =

nearly as fast as a char[] array, it can be used as the default type for=
  =

string literals (the only real place where the language gets in the way)=
.

 D-language: Here, use this search algorithm, it works most of the tim=


e,
 but may not work correctly in some cases. If you run into one of thos=


e
 cases, you'll have to run a specialized search algorithm for strings.=


 User: How do I know I hit one of those cases?
 D-language: You'll have to run the specialized version to find out.
 User: Why wouldn't I just run the specialized version first?
 D-language: Well, because it's slower!
 User: But don't I have to use both algorithms to make sure I find the=


  =

 data?
 D-language: Only if you "care" about accuracy!

 Call me ludicrous, but is this really what we want to push on someone=


 as
 a "unicode-aware" language?

 No, but we might want to fix a string search to do a little more -  =

 namely check if it skewed a graphem (assuming normalizations match).

That's a big assumption.  Valid unicode is valid unicode, even if it's n=
ot  =

normalized.

 BTW adding normalization to std.uni is another thing to do right now. =

 =

 That should be a good enough start, and looking at unicode standard,  =

 things are rather fluid there meaning that "unicode-aware" language  =

 should be sufixed by version number :)

I agree we need normalization and it is not necessary to involve the  =

compiler in this.  I'd suggest a two to three phase approach:

1. Leave phobos' schizo view of "char arrays are not arrays" for now, an=
d  =

build the necessary framework to get a string type that actually works.
2. Remove schizo view.
3. (optional) make compiler use library-defined string type for string  =

literals.

 Plus, a combining character (such as an umlaut or accent) is part o=




f a
 character, but may be a separate code point. If that's on the last
 character in the word such as fianc=C3=A9, then searching for fianc=




e will
 result in a match without proper decoding!

 Now if you are going to do real characters... If source/needle are
 normalized you still can avoid lots of work by searching without
 decoding. All you need to decode is one codepoint on each successful=



 match to see if there is a modifier at end of matched portion.
 But it depends on how you want to match if it's case-insensitive
 search it will be a lot more complicated, but anyway it boils down t=



o
 this:
 1) do inexact search, get likely match ( false positives are OK,
 negatives not) no decoding here
 2) once found check it (or parts of it) with proper decoding

 There are cultural subtleties, that complicate these steps if you ta=



ke
 them into account, but it's doable.

 I agree with you that simple searches using only byte (or dchar)
 comparison does not work, and can be optimized based on several facto=


rs.
 The easiest thing is to find a code unit sequence that only has one
 valid form, then search for that without decoding. Then when found,
 decode the characters around it. Or if that isn't possible, create al=


l
 the un-normalized forms for one grapheme (based on how likely it is t=


o
 occur), and search for one of those in the undecoded stream.

 Yes, it's all revolves around find all of variations of a substring.
 If user is willing to spend some time upfront, this could be done in o=

ne  =

 pass e.g. using the trick I employed in new std.regex.
 For a rough idea see ShiftOr string search,
 that is easy & fast inexact search:
 http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060

 There are even better variations of this stuff in the wild.

This sounds good.  I'll have a look at it when I have time.

 This can all be built into a specialized string type. There's actuall=


y
 some really interesting problems to solve in this space I think. I've=


 created a basic string type that has lamented in my unfinished pile o=


f
 stuff to do. I think it can be done in a way that is close to as
 efficient as arrays for the most common operations (slicing, indexing=


,
 etc.), but is *correct* before it is efficient. You should always be
 able to drop into "array mode" and deal with the code-units.

 Here is where we probably differ, to me there is no simple safe fast  =

 string type. Why? Because if you need anything "not pedantic and safe =

 =

 above all" you work with implicit data type of e.g. "a chunk of UTF-8 =

 =

 bytes that are normalized in NFC", ...

Specialized types which dictate normalization could be created.  We do n=
ot  =

have to always assume the worst if we are optimizing.

But in general, when you read a utf-8 file, it could have anything in it=
.   =

It may not even be normalized.

 I'd rather see them as sealed array with a bunch of meta info, and  =

 string functions to specialize on them in order to do some cool speed =

 =

 hacks. Of course, drilling down to row data should be possible for the=

  =

 user as well.

This is exactly what I am envisioning ;)  Except I'd build the meta-info=
  =

into the type.

 Or if fianc=C3=A9 uses a
 precomposed =C3=A9, it won't match. So two valid representations of=




 the  =

 word
 either match or they don't. It's just a complete mess without prope=




r
 unicode decoding.

 It's a complete mess even with proper decoding ;)

 Yes, all the more reason to solve the problem correctly so the haples=


s
 unicode novice user doesn't have to!


 Hapless novice unicode user don't stand a chance.
 I'm serious. One needs to know a lot of these "basics" to e.g. know wh=

y  =

 indexing by "true" character could be so slow, about encodings,  =

 normalizations etc. Yet we can save a metric ton of special cased crap=

  =

 from ever hitting his eyes.

I disagree.  99% of the time I use strings, I don't care about indexes. =
 I  =

just want to deal with whole strings and substrings.  I rarely use  =

arbitrary indexes, and when I do, I'm assuming ascii data.

The hapless novice unicode user doesn't need to know unicode to do:

writefln("hello, world!");

Or to do:

if(find(str, "xyz")) {...}

Or to even use regex!

-Steve

Oct 26 2011

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 26.10.2011 16:12, Steven Schveighoffer wrote:
<snip>
 I agree the string type needs to be fleshed out before the language
 adopts it. I think we could legitimately define a string type that
 auto-assigns from it's respective char[] array type. Once it's shown
 that the type is nearly as fast as a char[] array, it can be used as the
 default type for string literals (the only real place where the language
 gets in the way).

That's it.

 D-language: Here, use this search algorithm, it works most of the time,
 but may not work correctly in some cases. If you run into one of those
 cases, you'll have to run a specialized search algorithm for strings.
 User: How do I know I hit one of those cases?
 D-language: You'll have to run the specialized version to find out.
 User: Why wouldn't I just run the specialized version first?
 D-language: Well, because it's slower!
 User: But don't I have to use both algorithms to make sure I find the
 data?
 D-language: Only if you "care" about accuracy!

 Call me ludicrous, but is this really what we want to push on someone as
 a "unicode-aware" language?

 No, but we might want to fix a string search to do a little more -
 namely check if it skewed a graphem (assuming normalizations match).

 That's a big assumption. Valid unicode is valid unicode, even if it's
 not normalized.

 BTW adding normalization to std.uni is another thing to do right now.
 That should be a good enough start, and looking at unicode standard,
 things are rather fluid there meaning that "unicode-aware" language
 should be sufixed by version number :)

 I agree we need normalization and it is not necessary to involve the
 compiler in this. I'd suggest a two to three phase approach:

 1. Leave phobos' schizo view of "char arrays are not arrays" for now,
 and build the necessary framework to get a string type that actually works.
 2. Remove schizo view.
 3. (optional) make compiler use library-defined string type for string
 literals.

Something like that, I may land a hand here, as I plan to merge some 
unicode stuff between std.regex and std.uni. As extras probably striding 
by grapheme and support for various degree of case-insensitive string 
comparison.

 Plus, a combining character (such as an umlaut or accent) is part of a
 character, but may be a separate code point. If that's on the last
 character in the word such as fiancé, then searching for fiance will
 result in a match without proper decoding!

 Now if you are going to do real characters... If source/needle are
 normalized you still can avoid lots of work by searching without
 decoding. All you need to decode is one codepoint on each successful
 match to see if there is a modifier at end of matched portion.
 But it depends on how you want to match if it's case-insensitive
 search it will be a lot more complicated, but anyway it boils down to
 this:
 1) do inexact search, get likely match ( false positives are OK,
 negatives not) no decoding here
 2) once found check it (or parts of it) with proper decoding

 There are cultural subtleties, that complicate these steps if you take
 them into account, but it's doable.

 I agree with you that simple searches using only byte (or dchar)
 comparison does not work, and can be optimized based on several factors.
 The easiest thing is to find a code unit sequence that only has one
 valid form, then search for that without decoding. Then when found,
 decode the characters around it. Or if that isn't possible, create all
 the un-normalized forms for one grapheme (based on how likely it is to
 occur), and search for one of those in the undecoded stream.

 Yes, it's all revolves around find all of variations of a substring.
 If user is willing to spend some time upfront, this could be done in
 one pass e.g. using the trick I employed in new std.regex.
 For a rough idea see ShiftOr string search,
 that is easy & fast inexact search:
 http://igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION0060

 There are even better variations of this stuff in the wild.

 This sounds good. I'll have a look at it when I have time.

 This can all be built into a specialized string type. There's actually
 some really interesting problems to solve in this space I think. I've
 created a basic string type that has lamented in my unfinished pile of
 stuff to do. I think it can be done in a way that is close to as
 efficient as arrays for the most common operations (slicing, indexing,
 etc.), but is *correct* before it is efficient. You should always be
 able to drop into "array mode" and deal with the code-units.

 Here is where we probably differ, to me there is no simple safe fast
 string type. Why? Because if you need anything "not pedantic and safe
 above all" you work with implicit data type of e.g. "a chunk of UTF-8
 bytes that are normalized in NFC", ...

 Specialized types which dictate normalization could be created. We do
 not have to always assume the worst if we are optimizing.

 But in general, when you read a utf-8 file, it could have anything in
 it. It may not even be normalized.

There should be a way of e.g.
auto my_string = normalize(readText("file.txt"), Normalize.NFKC);
//my_string has typed normalized UTF-8 or somesuch
Yes, that should keep general non-assuming case. Though it still can do 
some clever things, since as W3C points out >90% of text on web is 
already normalized in NFC (I can search for a proof link, if hard pressed).

 I'd rather see them as sealed array with a bunch of meta info, and
 string functions to specialize on them in order to do some cool speed
 hacks. Of course, drilling down to row data should be possible for the
 user as well.

 This is exactly what I am envisioning ;) Except I'd build the meta-info
 into the type.

Yeah, I meant that the type caries this info.
Looks like we are more in agreement then I suspected ;)

 Or if fiancé uses a
 precomposed é, it won't match. So two valid representations of the
 word
 either match or they don't. It's just a complete mess without proper
 unicode decoding.

 It's a complete mess even with proper decoding ;)

 Yes, all the more reason to solve the problem correctly so the hapless
 unicode novice user doesn't have to!


 Hapless novice unicode user don't stand a chance.
 I'm serious. One needs to know a lot of these "basics" to e.g. know
 why indexing by "true" character could be so slow, about encodings,
 normalizations etc. Yet we can save a metric ton of special cased crap
 from ever hitting his eyes.

 I disagree. 99% of the time I use strings, I don't care about indexes. I
 just want to deal with whole strings and substrings. I rarely use
 arbitrary indexes, and when I do, I'm assuming ascii data.

 The hapless novice unicode user doesn't need to know unicode to do:

 writefln("hello, world!");

 Or to do:

 if(find(str, "xyz")) {...}

 Or to even use regex!

To print chars you don't need to know unicode,  you OS needs to :) 
Though that wasn't string processing. However find/match is indeed a 
good example. You are probably right, if we cover enough of common 
operations most people might not have to "look inside".


-- 
Dmitry Olshansky

Oct 26 2011

travert phare.normalesup.org (Christophe) writes:

Dmitry Olshansky , dans le message (digitalmars.D:147415), a �crit�:
 Assuming language support stays on stage of "codepoint is a character" 
 it's totaly expected to ignore modifiers and compare identically 
 normalized UTF without decoding. Yes, it risks to hit certain issues.

string being seen as range of codepoint (dchar) is already aweful 
enough. Now seeing strings as range of displayable caracters just do not 
make sense. Unicode is too complicated to allow doing this for a general 
purpose string manipulation. All the transformations to displayable 
characters can only be done when displaying characters !

Just like fianc� is hidden is you write fiance' (with the approriate 
unicode character to have the ' placed over the 'e'). You can hide any 
word by using delete characters. You have to make asumption on the 
input, and you have to put limitations to the algorithm because in any 
case, you can have unexpected behavior. And I can assure you there is 
less unexpected behavior if you treat strings as dchar range or even 
char[], than if you treat them as displayable characters.

 It's a complete mess even with proper decoding ;)

Sure, that's why we better not decode.

Oct 28 2011

"Simen Kjaeraas" <simen.kjaras gmail.com> writes:

On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 Plus, a combining character (such as an umlaut or accent) is part of a
 character, but may be a separate code point.

If this is correct (and it is), then decoding to dchar is simply not  
enough.
You seem to advocate decoding to graphemes, which is a whole different  
matter.


-- 
   Simen

Oct 24 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas  =

<simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer  =

 <schveiguy yahoo.com> wrote:

 Plus, a combining character (such as an umlaut or accent) is part of =


a
 character, but may be a separate code point.

 If this is correct (and it is), then decoding to dchar is simply not  =

 enough.
 You seem to advocate decoding to graphemes, which is a whole different=

  =

 matter.

I am advocating that.  And it's a matter of perception.  D can say "we  =

only support code-point decoding" and what that means to a user is, "we =
 =

don't support language as you know it."  Sure it's a part of unicode, bu=
t  =

it takes that extra piece to make it actually usable to people who requi=
re  =

unicode.

Even in English, fianc=C3=A9 has an accent.  To say D supports unicode, =
but  =

then won't do a simple search on a file which contains a certain *valid*=
  =

encoding of that word is disingenuous to say the least.

D needs a fully unicode-aware string type.  I advocate D should use it a=
s  =

the default string type, but it needs one whether it's the default or no=
t  =

in order to say it supports unicode.

-Steve

Oct 26 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
 <simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 Plus, a combining character (such as an umlaut or accent) is part of a
 character, but may be a separate code point.

 If this is correct (and it is), then decoding to dchar is simply not
 enough.
 You seem to advocate decoding to graphemes, which is a whole different
 matter.

 I am advocating that. And it's a matter of perception. D can say "we
 only support code-point decoding" and what that means to a user is, "we
 don't support language as you know it." Sure it's a part of unicode, but
 it takes that extra piece to make it actually usable to people who
 require unicode.

 Even in English, fiancé has an accent. To say D supports unicode, but
 then won't do a simple search on a file which contains a certain *valid*
 encoding of that word is disingenuous to say the least.

Why doesn't that simple search work?

foreach (line; stdin.byLine()) {
     if (line.canFind("fiancé")) {
        writeln("There it is.");
     }
}

 D needs a fully unicode-aware string type. I advocate D should use it as
 the default string type, but it needs one whether it's the default or
 not in order to say it supports unicode.

How do you define "supports Unicode"? For my money, the main sin of 
(w)string is that it offers [] and .length with potentially confusing 
semantics, so if I could I'd curb, not expand, its interface.


Andrei

Oct 29 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday, October 29, 2011 09:42:54 Andrei Alexandrescu wrote:
 On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
=20
 <simen.kjaras gmail.com> wrote:
 On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
=20
 <schveiguy yahoo.com> wrote:
 Plus, a combining character (such as an umlaut or accent) is part=




 of
 a
 character, but may be a separate code point.

=20
 If this is correct (and it is), then decoding to dchar is simply n=



ot
 enough.
 You seem to advocate decoding to graphemes, which is a whole diffe=



rent
 matter.

=20
 I am advocating that. And it's a matter of perception. D can say "w=


e
 only support code-point decoding" and what that means to a user is,=


 "we
 don't support language as you know it." Sure it's a part of unicode=


, but
 it takes that extra piece to make it actually usable to people who
 require unicode.
=20
 Even in English, fianc=C3=A9 has an accent. To say D supports unico=


de, but
 then won't do a simple search on a file which contains a certain *v=


alid*
 encoding of that word is disingenuous to say the least.

=20
 Why doesn't that simple search work?
=20
 foreach (line; stdin.byLine()) {
      if (line.canFind("fianc=C3=A9")) {
         writeln("There it is.");
      }
 }

If the strings aren't normalized the same way, then it might not find f=
ianc=C3=A9. If=20
they _are_ normalized the same way and fianc=C3=A9 is in there except t=
hat the =C3=A9 is=20
actually modified by another code point after it (e.g. a subscript of 2=
 - not=20
exactly likely in this case but certainly possible), then that string w=
ould  =20
be found when it shouldn't be. The bigger problem though, I think, is w=
hen=20
you're searching for a string which is the same without the modifiers -=
 which=20
would be fiance in this case - since then if the modfiying code points =
are=20
after, then find will think that it found the string that you were look=
ing for=20
when it didn't.

Once you're dealing with modifying code points, in the general case, yo=
u=20
_must_ operate on the grapheme level to ensure that you find exactly wh=
at=20
you're looking for and only what you're looking for. If we assume that =
all=20
strings are normalized the same way and pick the right normalization fo=
r it=20
(and provide a function to normalize strings that way of course), then =
we=20
could probably make that work 100% of the time (assuming that there's a=
=20
normalized form with all of the modifying code points being _before_ th=
e code=20
point that we modify and that no modifying code point can be a characte=
r on=20
its own), but I'd have to study up on it more to be sure.

Regardless, while searching for fianc=C3=A9 has a decent chance of succ=
ess=20
(especially if programs generall favor using  single code points instea=
d of=20
multiple code points wherever possible), it's still a risky proposition=
=20
without at least doing unicode normalization if not outright using a ra=
nge of=20
graphemes rather than code points.

- Jonathan M Davis

Oct 29 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 29 Oct 2011 10:42:54 -0400, Andrei Alexandrescu  =

<SeeWebsiteForEmail erdani.org> wrote:

 On 10/26/11 7:18 AM, Steven Schveighoffer wrote:
 On Mon, 24 Oct 2011 19:49:43 -0400, Simen Kjaeraas
 <simen.kjaras gmail.com> wrote:

 On Mon, 24 Oct 2011 21:41:57 +0200, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 Plus, a combining character (such as an umlaut or accent) is part o=




f a
 character, but may be a separate code point.

 If this is correct (and it is), then decoding to dchar is simply not=



 enough.
 You seem to advocate decoding to graphemes, which is a whole differe=



nt
 matter.

 I am advocating that. And it's a matter of perception. D can say "we
 only support code-point decoding" and what that means to a user is, "=


we
 don't support language as you know it." Sure it's a part of unicode, =


but
 it takes that extra piece to make it actually usable to people who
 require unicode.

 Even in English, fianc=C3=A9 has an accent. To say D supports unicode=


, but
 then won't do a simple search on a file which contains a certain *val=


id*
 encoding of that word is disingenuous to say the least.

 Why doesn't that simple search work?

 foreach (line; stdin.byLine()) {
      if (line.canFind("fianc=C3=A9")) {
         writeln("There it is.");
      }
 }

I think Jonathan answered that quite well, nothing else to add...

 D needs a fully unicode-aware string type. I advocate D should use it=


 as
 the default string type, but it needs one whether it's the default or=


 not in order to say it supports unicode.

 How do you define "supports Unicode"? For my money, the main sin of  =

 (w)string is that it offers [] and .length with potentially confusing =

 =

 semantics, so if I could I'd curb, not expand, its interface.

LOL, I'm so used to programming that I'm trying to figure out what the  =

meaning of sin(string) (as in sine) means :)

I think there are two problems with [] and .length.  First, that they  =

imply "get nth character" and "number of characters" respectively, and  =

second, that many times they *actually are* those things.

So I agree with you the proposed string type needs to curb that interfac=
e,  =

while giving us a fully character/grapheme aware interface (which is  =

currently lacking).

I made an early attempt at doing this, and I will eventually get around =
to  =

finishing it.  I was in the middle of creating an algorithm to  =

as-efficiently-as-possible delineate a grapheme, when I got sidetracked =
by  =

other things :)  There are still lingering issues with the language whic=
h  =

makes this a less-than-ideal replacement (arrays currently enjoy a lot o=
f  =

"extra features" that custom types do not).

-Steve

Oct 31 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/31/11 8:20 AM, Steven Schveighoffer wrote:
 Why doesn't that simple search work?

 foreach (line; stdin.byLine()) {
 if (line.canFind("fiancé")) {
 writeln("There it is.");
 }
 }

 I think Jonathan answered that quite well, nothing else to add...

I see this as a decoding primitive issue that can be resolved.

Andrei

Oct 31 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 10/24/2011 7:02 AM, Steven Schveighoffer wrote:
 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright <newshound2 digitalmars.com>
 wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

 All of the ones that don't require decoding, such as searching, would be less
 efficient if decoding was done.

 Searching that does not do decoding is fundamentally incorrect. That is, if you
 want to find a substring in a string, you cannot just compare chars.

Sure you can. A Unicode character is a string, a Unicode string is a string of 
those strings. So, searching for a Unicode character is searching for a
substring.

Doing this character by character is standard, rote stuff. It's how grep works, 
for example.

Oct 24 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 24 Oct 2011 17:27:46 -0400, Walter Bright  =

<newshound2 digitalmars.com> wrote:

 On 10/24/2011 7:02 AM, Steven Schveighoffer wrote:
 On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright  =


 <newshound2 digitalmars.com>
 wrote:

 On 10/22/2011 2:21 AM, Peter Alexander wrote:
 Which operations do you believe would be less efficient?

 All of the ones that don't require decoding, such as searching, woul=



d  =

 be less
 efficient if decoding was done.

 Searching that does not do decoding is fundamentally incorrect. That =


 =

 is, if you
 want to find a substring in a string, you cannot just compare chars.

 Sure you can. A Unicode character is a string, a Unicode string is a  =

 string of those strings. So, searching for a Unicode character is  =

 searching for a substring.

What if the source character is encoded differently than the search  =

string?  This is basic unicode stuff.  See my example with fianc=C3=A9.

-Steve

Oct 24 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer"
<schveiguy yahoo.com> said:

What if the source character is encoded differently than the search
string? This is basic unicode stuff. See my example with fianc�.

The more I think about it, the more I think it should work like this:
just like we assume they contain well-formed UTF sequences, char[],
wchar[], and dchar[] should also be assumed to contain **normalized**
unicode strings. Which normalization form to use? no idea. Just pick a
sensible one in the four.
<http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms>

Once we know all strings are normalized in the same way, we can then
compare two strings bitwise to check if they're the same. And we can
check for a substring in the same manner except we need to insert a
simple check after the match to verify that it isn't surrounded by
combining marks applying to the first or last character. And it'll also
deeply simplify any proper Unicode string handling code we'll add in
the future.

- - -

That said, I fear that forcing a specific normalization might be
problematic. You don't always want to have to normalize everything...
<http://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normalization_differences>

perhaps we could simplify things a bit more: don't pick a standard
normalization form. Just assume that both strings being used are in the
same normalization form. Comparison will work, searching for substring
in the way specified above will work, and other functions could
document which normalization form they accept. Problem solved...
somewhat.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Oct 24 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 24 Oct 2011 19:58:59 -0400, Michel Fortin  =

<michel.fortin michelf.com> wrote:

 On 2011-10-24 21:47:15 +0000, "Steven Schveighoffer"  =

 <schveiguy yahoo.com> said:

 What if the source character is encoded differently than the search
 string?  This is basic unicode stuff.  See my example with fianc=C3=A9=


.
 The more I think about it, the more I think it should work like this: =

 =

 just like we assume they contain well-formed UTF sequences, char[],  =

 wchar[], and dchar[] should also be assumed to contain **normalized** =

 =

 unicode strings. Which normalization form to use? no idea. Just pick a=

  =

 sensible one in the four.
 <http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms>

 Once we know all strings are normalized in the same way, we can then  =

 compare two strings bitwise to check if they're the same. And we can  =

 check for a substring in the same manner except we need to insert a  =

 simple check after the match to verify that it isn't surrounded by  =

 combining marks applying to the first or last character. And it'll als=

o  =

 deeply simplify any proper Unicode string handling code we'll add in t=

he  =

 future.

  - - -

 That said, I fear that forcing a specific normalization might be  =

 problematic. You don't always want to have to normalize everything...
 <http://en.wikipedia.org/wiki/Unicode_equivalence#Errors_due_to_normal=

ization_differences>
 So perhaps we could simplify things a bit more: don't pick a standard =

 =

 normalization form. Just assume that both strings being used are in th=

e  =

 same normalization form. Comparison will work, searching for substring=

  =

 in the way specified above will work, and other functions could docume=

nt  =

 which normalization form they accept. Problem solved... somewhat.

It's even easier than this:

a) you want to do a proper string comparison not knowing what state the =
 =

unicode strings are in, use the full-fledged decode-when-needed string  =

type, and its associated str.find method.
b) you know they are both the same normalized form and want to optimize,=
  =

use std.algorithm.find(haystack.asArray, needle.asArray).

-Steve

Oct 26 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-10-26 11:50:32 +0000, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 It's even easier than this:
 
 a) you want to do a proper string comparison not knowing what state the
 unicode strings are in, use the full-fledged decode-when-needed string
 type, and its associated str.find method.
 b) you know they are both the same normalized form and want to optimize,
 use std.algorithm.find(haystack.asArray, needle.asArray).

Well, treating the string as an array of dchar doesn't work in the 
general case, even with strings normalized the same way your fianc� 
example can break. So should never treat them as plain arrays unless 
I'm sure I have no combining marks in the string.

I'm not opposed to having a new string type being developed, but I'm 
skeptical about its inclusion in the language. We already have three 
string types which can be assumed to contain valid UTF sequences. I 
think the first thing to do is not to develop a new string type, but to 
develop the normalization and grapheme splitting algorithms, and those 
to find a substring, using the existing char[], wchar[] and dchar[] 
types.  Then write a program with proper handling of Unicode using 
those and hand-optimize it. If that proves to be a pain (it might well 
be), write a new string type, rewrite the program using it, do some 
benchmarks and then we'll know if it's a good idea, and will be able to 
quantify the drawbacks.

But right now, all this arguing for or against a new string type is 
stacking hypothesis against other hypothesis, it won't lead anywhere.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Oct 26 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Wed, 26 Oct 2011 10:04:08 -0400, Michel Fortin  =

<michel.fortin michelf.com> wrote:

 On 2011-10-26 11:50:32 +0000, "Steven Schveighoffer"  =

 <schveiguy yahoo.com> said:

 It's even easier than this:
  a) you want to do a proper string comparison not knowing what state =


the
 unicode strings are in, use the full-fledged decode-when-needed strin=


g
 type, and its associated str.find method.
 b) you know they are both the same normalized form and want to optimi=


ze,
 use std.algorithm.find(haystack.asArray, needle.asArray).

 Well, treating the string as an array of dchar doesn't work in the  =

 general case, even with strings normalized the same way your fianc=C3=A9=

  =

 example can break. So should never treat them as plain arrays unless I=

'm  =

 sure I have no combining marks in the string.

If you have the same normalization, you would at least find all the vali=
d  =

instances.  You'd have to eliminate false positives, by decoding the nex=
t  =

dchar after the match to see if it's a combining character.  This would =
be  =

a simple function to write.

All I'm saying is, doing a search on the array should be available.

 I'm not opposed to having a new string type being developed, but I'm  =

 skeptical about its inclusion in the language. We already have three  =

 string types which can be assumed to contain valid UTF sequences. I  =

 think the first thing to do is not to develop a new string type, but t=

o  =

 develop the normalization and grapheme splitting algorithms, and those=

  =

 to find a substring, using the existing char[], wchar[] and dchar[]  =

 types.  Then write a program with proper handling of Unicode using tho=

se  =

 and hand-optimize it. If that proves to be a pain (it might well be), =

 =

 write a new string type, rewrite the program using it, do some  =

 benchmarks and then we'll know if it's a good idea, and will be able t=

o  =

 quantify the drawbacks.

 But right now, all this arguing for or against a new string type is  =

 stacking hypothesis against other hypothesis, it won't lead anywhere.

I have a half-implemented string type which does just this.  I need to  =

finish it.  There are some language barriers that need to be sorted out =
 =

(such as how to deal with tail-const for arbitrary types).

I think a user who no longer posts here (spir) had some full  =

implementation of unicode strings using classes, though I don't see that=
  =

being comparable in performance.

-Steve

Oct 26 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-10-26 14:20:54 +0000, "Steven Schveighoffer" 
<schveiguy yahoo.com> said:

 If you have the same normalization, you would at least find all the vali d
 instances.  You'd have to eliminate false positives, by decoding the nex t
 dchar after the match to see if it's a combining character.  This would  be
 a simple function to write.
 
 All I'm saying is, doing a search on the array should be available.

And all I'm saying is that such a standalone function that eliminates 
the false positive should be available outside of a string type too. 
You'll likely need it for your string type anyway, but there's no 
reason you should't be able to use it standalone on a 
char[]/wchar[]/dchar[].

I think most string algorithms should be able to work with normalized 
character arrays; a new string type should just be a shell around these 
that makes things easier to the user.


 I have a half-implemented string type which does just this.  I need to
 finish it.

That should be interesting.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Oct 26 2011

so <so so.so> writes:

On Fri, 21 Oct 2011 21:11:14 +0300, Peter Alexander  
<peter.alexander.au gmail.com> wrote:

 Of course, people will still need to understand UTF-8. I don't think  
 that's a problem. It's unreasonable to expect the language to do the  
 thinking for you. The problem is that we have people that *do*  
 understand UTF-8 (like the OP), but *don't* understand D's strings.

Indeed, if one knows/understands how unicode works AND knows the structure  
of D strings, the current scheme is the best of both worlds.

foreach(e; string) // chars
foreach(e; byUTFxxx(string)) // ...

Oct 21 2011

Jacob Carlborg <doob me.com> writes:

On 2011-10-20 23:49, Peter Alexander wrote:
 On 20/10/11 8:37 PM, Martin Nowak wrote:
 It just took me over one hour to find out the unthinkable.
 foreach(c; str) will deduce c to immutable(char) and doesn't care about
 unicode.
 Now there is so many unicode transcoding happening in the language that
 it starts to get annoying,
 but the most basic string iteration doesn't support it by default?

 D has got itself into a tricky situation in this regard. Doing it either
 way introduces an unintuitive mess.

 The way it is now, you get the problem that you just described where
 foreach is unaware of Unicode.

 If you changed it to loop as Unicode, then indices won't match up:

 immutable(int)[] a = ...
 foreach (x, i; a)
 assert(x == a[i]); // ok

 immutable(char)[] b = ...
 foreach (x, i; b)
 assert(x == b[i]); // not necessarily!

The index could skip certain indexes to make that assert pass. But that 
would be confusing as well.

-- 
/Jacob Carlborg

Oct 20 2011

so <so so.so> writes:

On Thu, 20 Oct 2011 22:37:56 +0300, Martin Nowak <dawg dawgfoto.de> wrote:

 It just took me over one hour to find out the unthinkable.
 foreach(c; str) will deduce c to immutable(char) and doesn't care about  
 unicode.
 Now there is so many unicode transcoding happening in the language that  
 it starts to get annoying,
 but the most basic string iteration doesn't support it by default?

I really can't believe why people expect that.
By definition string is an array of chars, and this way it is consistent.
All we need is:

foreach(c, whatever(str))

instead of:

foreach(c, str)

Oct 21 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 10/20/2011 09:37 PM, Martin Nowak wrote:
 It just took me over one hour to find out the unthinkable.
 foreach(c; str) will deduce c to immutable(char) and doesn't care about
 unicode.
 Now there is so many unicode transcoding happening in the language

In the standard library. Not in the language.

 that it starts to get annoying,
 but the most basic string iteration doesn't support it by default?

I actually like the current behaviour.

Oct 21 2011

Alvaro <alvaro_segura gmail.com> writes:

El 20/10/2011 21:37, Martin Nowak escribi�:
 It just took me over one hour to find out the unthinkable.
 foreach(c; str) will deduce c to immutable(char) and doesn't care about
 unicode.
 Now there is so many unicode transcoding happening in the language that
 it starts to get annoying,
 but the most basic string iteration doesn't support it by default?

Maybe I didn't fully get your point, but, you do know that you can do 
the following, right?

   string str = "�and�";
   foreach(dchar c; str)
   ...

and it decodes full unicode characters just fine. Or maybe you are just 
talking about the auto type inference, just to make sure...

OTOH, as others say, it's not rare to iterate on 8-bit units if you're 
dealing with ASCII or if you are parsing and looking for operators and 
separators (which are normally 8-bit). Then you can leave the rest 
untouched or extract the parts in between without caring if they are 1 
byte per character or several. (e.g. parsing XML, or JSON, or CSV, or 
INI, or conf, or D source, etc.)

Oct 22 2011

Daniel Gibson <metalcaedes gmail.com> writes:

Am 22.10.2011 12:55, schrieb Alvaro:
 El 20/10/2011 21:37, Martin Nowak escribi�:
 It just took me over one hour to find out the unthinkable.
 foreach(c; str) will deduce c to immutable(char) and doesn't care about
 unicode.
 Now there is so many unicode transcoding happening in the language that
 it starts to get annoying,
 but the most basic string iteration doesn't support it by default?

 Maybe I didn't fully get your point, but, you do know that you can do
 the following, right?

 string str = "�and�";
 foreach(dchar c; str)
 ...

One visible Unicode character can consists of several dchars.
(This is called "Grapheme")

Cheers,
- Daniel

Oct 22 2011

D Programming

C/C++ Programming

Other

digitalmars.D - Why the hell doesn't foreach decode strings