digitalmars.D - Proposal for fixing dchar ranges

reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
I proposed this inside the long "major performance problem with
std.array.front," I've also proposed it before, a long time ago.

But seems to be getting no attention buried in that thread, not even
negative attention :)

An idea to fix the whole problems I see with char[] being treated
specially by phobos: introduce an actual string type, with char[] as
backing, that is a dchar range, that actually dictates the rules we want.
Then, make the compiler use this type for literals.

e.g.:

struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
}

Then, a char[] array is simply an array of char[].

points:

1. No more issues with foreach(c; "cassé"), it iterates via dchar
2. No more issues with "cassé"[4], it is a static compiler error.
3. No more awkward ASCII manipulation using ubyte[].
4. No more phobos schizophrenia saying char[] is not an array.
5. No more special casing char[] array templates to fool the compiler.
6. Any other special rules we come up with can be dictated by the library,
and not ignored by the compiler.

Note, std.algorithm.copy(string1, mutablestring) will still decode/encode,
but it's more explicit. It's EXPLICITLY a dchar range. Use
std.algorithm.copy(string1.representation, mutablestring.representation)
will avoid the issues.

I imagine only code that is currently UTF ignorant will break, and that
code is easily 'fixed' by adding the 'representation' qualifier.
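
To make the idea concrete, here is a rough sketch of how the struct could be filled in with today's library. Several things in it are assumptions rather than part of the proposal: the type is called `String` so it does not clash with the existing `string` alias, the constructor takes immutable data (a plain char[] does not convert implicitly to immutable(char)[]), and the range primitives are built on std.utf's decode and stride.

```d
import std.algorithm.comparison : equal;
import std.range.primitives : ElementType, isInputRange;
import std.utf : decode, stride;

// Hypothetical wrapper. It is named `String` here only because `string`
// is already an alias for immutable(char)[] in today's D.
struct String
{
    immutable(char)[] representation;

    // Assumption: takes immutable data, since char[] does not convert
    // implicitly to immutable(char)[].
    this(immutable(char)[] data) { representation = data; }

    // dchar range primitives
    @property bool empty() { return representation.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(representation, i); // decodes one code point
    }

    void popFront()
    {
        // stride gives the number of code units in the first code point
        representation = representation[stride(representation, 0) .. $];
    }
}

static assert(isInputRange!String);
static assert(is(ElementType!String == dchar));

void main()
{
    auto s = String("casse\u0301"); // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    // Range iteration yields code points: c, a, s, s, e, U+0301.
    assert(equal(s, "casse\u0301"d));

    // The raw code units stay reachable through the public member.
    assert(s.representation.length == 7);

    // s[4] does not compile, because String defines no opIndex.
    static assert(!__traits(compiles, s[4]));
}
```

Code that wants the old byte-wise behaviour would go through `s.representation`, which is an ordinary array of code units.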

-Steve
Mar 10 2014
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 09:35:44 -0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 Then, a char[] array is simply an array of char[].

An array of char even. -Steve
Mar 10 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
 I proposed this inside the long "major performance problem with 
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not 
 even negative attention :)

 An idea to fix the whole problems I see with char[] being 
 treated specially by phobos: introduce an actual string type, 
 with char[] as backing, that is a dchar range, that actually 
 dictates the rules we want. Then, make the compiler use this 
 type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via 
 dchar
 2. No more issues with "cassé"[4], it is a static compiler 
 error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the 
 compiler.
 6. Any other special rules we come up with can be dictated by 
 the library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still 
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar 
 range. Use std.algorithm.copy(string1.representation, 
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, 
 and that code is easily 'fixed' by adding the 'representation' 
 qualifier.

 -Steve

It will break any code that slices stored char[] strings directly which may or may not be breaking UTF depending on how indices are calculated. Also adding one more runtime dependency into language but there are so many that it probably does not matter.
Mar 10 2014
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Mar 10, 2014 at 09:35:44AM -0400, Steven Schveighoffer wrote:
[...]
 An idea to fix the whole problems I see with char[] being treated
 specially by phobos: introduce an actual string type, with char[] as
 backing, that is a dchar range, that actually dictates the rules we
 want. Then, make the compiler use this type for literals.
 
 e.g.:
 
 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }
 
 Then, a char[] array is simply an array of char[].
 
 points:
 
 1. No more issues with foreach(c; "cassé"), it iterates via dchar
 2. No more issues with "cassé"[4], it is a static compiler error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the compiler.
 6. Any other special rules we come up with can be dictated by the
 library, and not ignored by the compiler.

I like this idea. Special-casing char[] in templates was a bad idea. It makes Phobos code needlessly complex, and the inconsistent treatment of char[] sometimes as an array of char and sometimes not causes silly issues like foreach defaulting to char but range iteration defaulting to dchar. Enclosing it in a struct means we can enforce string rules separately from the fact that it's a char array.
 Note, std.algorithm.copy(string1, mutablestring) will still
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar
 range. Use std.algorithm.copy(string1.representation,
 mutablestring.representation) will avoid the issues.
 
 I imagine only code that is currently UTF ignorant will break, and
 that code is easily 'fixed' by adding the 'representation'
 qualifier.

The only concern I have is the current use of char[] and const(char)[] as mutable strings, and the current implicit conversion from string to const(char)[]. We would need similar wrappers for char[] and const(char)[], and string and mutablestring must be implicitly convertible to conststring, otherwise a LOT of existing code will break in a major way. Plus, these wrappers should also expose the same dchar range API with .representation giving a way to get at the raw code units.

T

-- 
It is the quality rather than the quantity that matters. -- Lucius Annaeus Seneca
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 10:48:26 -0400, Dicebot <public dicebot.lv> wrote:

 On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:
 I proposed this inside the long "major performance problem with
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not even
 negative attention :)

 An idea to fix the whole problems I see with char[] being treated
 specially by phobos: introduce an actual string type, with char[] as
 backing, that is a dchar range, that actually dictates the rules we
 want. Then, make the compiler use this type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via dchar
 2. No more issues with "cassé"[4], it is a static compiler error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the compiler.
 6. Any other special rules we come up with can be dictated by the
 library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar range.
 Use std.algorithm.copy(string1.representation,
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, and that
 code is easily 'fixed' by adding the 'representation' qualifier.

 It will break any code that slices stored char[] strings directly which
 may or may not be breaking UTF depending on how indices are calculated.

That is already broken. What I'm looking to do is remove the cruft and "WTF" factor of the current state of affairs (an array that's not an array).

Originally (in that long ago proposal) I had proposed to check for and disallow invalid slicing during runtime. In fact, it could be added if desired with the type defined by the library.

 Also adding one more runtime dependency into language but there are so
 many that it probably does not matter.

alias string = immutable(char)[];

There isn't much extra dependency one must add to revert to the original behavior. In fact, one nice thing about this proposal is the compiler changes can be done and tested before any real meddling with the string type is done.

-Steve
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 10:54:50 -0400, H. S. Teoh <hsteoh quickfur.ath.cx>  
wrote:


 The only concern I have is the current use of char[] and const(char)[]
 as mutable strings, and the current implicit conversion from string to
 const(char)[]. We would need similar wrappers for char[] and
 const(char)[], and string and mutablestring must be implicitly
 convertible to conststring, otherwise a LOT of existing code will break
 in a major way.

I agree that is a limitation of the proposal. It's more of a language-wide problem that one cannot make a struct that can be tail-const-ified. One idea to begin with is to weakly bind to immutable(char)[] using alias this. That way, existing code devolves to current behavior. Then you pick off the primitives you want by defining them in the struct itself.
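
A small sketch of that alias this arrangement, under the same assumptions as before: the name `String`, the helper `countBytes`, and the use of std.utf's decode and stride are all illustrative and not part of the proposal.

```d
import std.utf : decode, stride;

// Hypothetical wrapper; `String` and its members are illustrative names.
struct String
{
    immutable(char)[] representation;
    alias representation this; // anything not defined below falls back to the array

    // Primitives picked off by the struct: iteration is by code point.
    @property bool empty() { return representation.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(representation, i);
    }

    void popFront()
    {
        representation = representation[stride(representation, 0) .. $];
    }
}

// Stand-in for existing code that expects a plain slice of code units.
size_t countBytes(const(char)[] s) { return s.length; }

void main()
{
    auto s = String("h\u00E9llo");

    // Implicit conversion through alias this, including to const(char)[].
    assert(countBytes(s) == 6);

    // Slicing is not defined by the struct, so it devolves to the array.
    assert(s[1 .. 3] == "\u00E9");

    // front is defined by the struct, so it decodes.
    assert(s.front == 'h');
}
```

Anything the struct does not define keeps its current array behaviour, which is what lets the individual primitives be overridden one at a time.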
 Plus, these wrappers should also expose the same dchar
 range API with .representation giving a way to get at the raw code
 units.

It already does that, representation is a public member. -Steve
Mar 10 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Monday, 10 March 2014 at 15:01:54 UTC, Steven Schveighoffer 
wrote:
 That is already broken. What I'm looking to do is remove the 
 cruft and "WTF" factor of the current state of affairs (an 
 array that's not an array).

 Originally (in that long ago proposal) I had proposed to check 
 for and disallow invalid slicing during runtime. In fact, it 
 could be added if desired with the type defined by the library.

Broken as in "you are not supposed to do it in user code"? Yes. Broken as in "does the wrong thing" - no. If your index is properly calculated, it is no different from casting to ubyte[] and then slicing. I am pretty sure even Phobos does it here and there.
Mar 10 2014
prev sibling next sibling parent "Boyd" <gaboonviper gmx.net> writes:
I personally love this idea, though I think it probably 
introduces too much silent breaking changes for it to be 
universally acceptable by D users.

Perhaps naming it 'String', and deprecating 'string' would make 
it more acceptable?

------------
On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
 I proposed this inside the long "major performance problem with 
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not 
 even negative attention :)

 An idea to fix the whole problems I see with char[] being 
 treated specially by phobos: introduce an actual string type, 
 with char[] as backing, that is a dchar range, that actually 
 dictates the rules we want. Then, make the compiler use this 
 type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via 
 dchar
 2. No more issues with "cassé"[4], it is a static compiler 
 error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the 
 compiler.
 6. Any other special rules we come up with can be dictated by 
 the library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still 
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar 
 range. Use std.algorithm.copy(string1.representation, 
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, 
 and that code is easily 'fixed' by adding the 'representation' 
 qualifier.

 -Steve

Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 11:11:50 -0400, Boyd <gaboonviper gmx.net> wrote:

 I personally love this idea, though I think it probably introduces too  
 much silent breaking changes for it to be universally acceptable by D  
 users.

What silent breaking changes? -Steve
Mar 10 2014
prev sibling next sibling parent "Boyd" <gaboonviper gmx.net> writes:
Utf8 aware slicing for strings would be an issue.

----------
On Monday, 10 March 2014 at 15:13:26 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 11:11:50 -0400, Boyd <gaboonviper gmx.net> 
 wrote:

 I personally love this idea, though I think it probably 
 introduces too much silent breaking changes for it to be 
 universally acceptable by D users.

What silent breaking changes? -Steve

Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 11:20:49 -0400, Boyd <gaboonviper gmx.net> wrote:

 Utf8 aware slicing for strings would be an issue.

I'm not proposing to add this. -Steve
Mar 10 2014
prev sibling next sibling parent "Boyd" <gaboonviper gmx.net> writes:
Ok, then you just destroyed my sole hypothetical objection to 
this.
-----------
On Monday, 10 March 2014 at 15:22:41 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 11:20:49 -0400, Boyd <gaboonviper gmx.net> 
 wrote:

 Utf8 aware slicing for strings would be an issue.

I'm not proposing to add this. -Steve

Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 11:11:23 -0400, Dicebot <public dicebot.lv> wrote:

 On Monday, 10 March 2014 at 15:01:54 UTC, Steven Schveighoffer wrote:
 That is already broken. What I'm looking to do is remove the cruft and  
 "WTF" factor of the current state of affairs (an array that's not an  
 array).

 Originally (in that long ago proposal) I had proposed to check for and  
 disallow invalid slicing during runtime. In fact, it could be added if  
 desired with the type defined by the library.

 Broken as in "you are not supposed to do it in user code"? Yes. Broken as in "does the wrong thing" - no. If your index is properly calculated, it is no different from casting to ubyte[] and then slicing. I am pretty sure even Phobos does it here and there.

If the idea to ensure the user cannot slice a code point was added, you would still be able to slice via str.representation[a..b], or even str.ptr[a..b] if you were so sure of the length you didn't want it to be checked ;)

The idea behind the proposal is to make it fully backwards compatible with existing code, except for randomly accessing a char, and probably .length. Slicing would still work as it does now, but could be adjusted later.

It will break existing code. To fix those breaks, you would need to use the char[] array directly via the representation member, or rethink your code to be UTF-correct.

Basically, instead of pretending an array isn't an array, create a new mostly-compatible type that behaves as we want it to behave in all circumstances, not just when you use phobos algorithms.

The breaks may be trivial to work around, and might seem annoying. However, they may be actual UTF bugs that make your code more correct when you fix them.

The biggest problem right now is the lack of the ability to implicitly cast to tail-const with a custom struct. We can keep an alias-this link for those cases until we can fix that in the compiler.

-Steve
Mar 10 2014
prev sibling next sibling parent "Brad Anderson" <eco gnuk.net> writes:
On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
 I proposed this inside the long "major performance problem with 
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not 
 even negative attention :)

 An idea to fix the whole problems I see with char[] being 
 treated specially by phobos: introduce an actual string type, 
 with char[] as backing, that is a dchar range, that actually 
 dictates the rules we want. Then, make the compiler use this 
 type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via 
 dchar
 2. No more issues with "cassé"[4], it is a static compiler 
 error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the 
 compiler.
 6. Any other special rules we come up with can be dictated by 
 the library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still 
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar 
 range. Use std.algorithm.copy(string1.representation, 
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, 
 and that code is easily 'fixed' by adding the 'representation' 
 qualifier.

 -Steve

Generally I think it's a good idea. Going a bit further you could also enable Short String Optimization but you'd have to encapsulate the backing array. It seems like this would be an even bigger breaking change than Walter's proposal though (right or wrong, slicing strings is very common).
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:

 It seems like this would be an even bigger breaking change than Walter's  
 proposal though (right or wrong, slicing strings is very common).

You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve
Mar 10 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
 I proposed this inside the long "major performance problem with 
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not 
 even negative attention :)

 An idea to fix the whole problems I see with char[] being 
 treated specially by phobos: introduce an actual string type, 
 with char[] as backing, that is a dchar range, that actually 
 dictates the rules we want. Then, make the compiler use this 
 type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via 
 dchar
 2. No more issues with "cassé"[4], it is a static compiler 
 error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the 
 compiler.
 6. Any other special rules we come up with can be dictated by 
 the library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still 
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar 
 range. Use std.algorithm.copy(string1.representation, 
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, 
 and that code is easily 'fixed' by adding the 'representation' 
 qualifier.

 -Steve

I know warnings are disliked, but couldn't we make the slicing and indexing work as currently but issue a warning*? It's not ideal but it does mean we get backwards compatibility.

In my mind this is an important enough improvement to justify a little unpleasantness. We can't afford the breakage but we also should definitely act on this.

*Alternatively, they could just be deprecated from the get-go.
Mar 10 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson 
 <eco gnuk.net> wrote:

 It seems like this would be an even bigger breaking change 
 than Walter's proposal though (right or wrong, slicing strings 
 is very common).

You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve

How is slicing any better than indexing?
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 14:01:45 -0400, John Colvin  
<john.loughran.colvin gmail.com> wrote:

 On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer wrote:
 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:

 It seems like this would be an even bigger breaking change than  
 Walter's proposal though (right or wrong, slicing strings is very  
 common).

You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve

How is slicing any better than indexing?

Because one can slice out a multi-code-unit code point, one cannot access it via index. Strings would be horribly crippled without slicing. Without indexing, they are fine. A possibility is to allow index, but actually decode the code point at that index (error on invalid index). That might actually be the correct mechanism. -Steve
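
The difference can be seen with the existing std.utf functions. This is only an illustration on a plain string, not the proposed type; an index operator that decodes would presumably behave like the decode call below, but that is an assumption.

```d
import std.exception : assertThrown;
import std.utf : UTFException, decode, stride;

void main()
{
    string s = "cass\u00E9!"; // the precomposed é occupies code units 4 and 5

    // Slicing can take out the whole multi-code-unit code point...
    assert(s[4 .. 4 + stride(s, 4)] == "\u00E9");

    // ...while plain indexing only ever yields a single code unit.
    static assert(is(typeof(s[4]) == immutable(char)));
    assert(s[4] == 0xC3);

    // Decoding at an index returns the full code point and advances the index.
    size_t i = 4;
    assert(decode(s, i) == '\u00E9');
    assert(i == 6);

    // An index that lands in the middle of a sequence is rejected.
    size_t bad = 5;
    assertThrown!UTFException(decode(s, bad));
}
```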
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 13:59:53 -0400, John Colvin
<john.loughran.colvin gmail.com> wrote:

 On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:
 I proposed this inside the long "major performance problem with
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not even
 negative attention :)

 An idea to fix the whole problems I see with char[] being treated
 specially by phobos: introduce an actual string type, with char[] as
 backing, that is a dchar range, that actually dictates the rules we
 want. Then, make the compiler use this type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via dchar
 2. No more issues with "cassé"[4], it is a static compiler error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the compiler.
 6. Any other special rules we come up with can be dictated by the
 library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar range.
 Use std.algorithm.copy(string1.representation,
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, and that
 code is easily 'fixed' by adding the 'representation' qualifier.

 -Steve

 I know warnings are disliked, but couldn't we make the slicing and
 indexing work as currently but issue a warning*? It's not ideal but it
 does mean we get backwards compatibility.

As I mentioned elsewhere (but repeating here for viewers), I was not planning on disabling slicing.

Indexing is rarely a feature one needs or should use, especially with encoded strings.

-Steve
Mar 10 2014
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated specially by
 phobos: introduce an actual string type, with char[] as backing, that is a dchar
 range, that actually dictates the rules we want. Then, make the compiler use
 this type for literals.

Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.
Mar 10 2014
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/10/2014 11:54 AM, Steven Schveighoffer wrote:
 BTW, this escaped my view the first time reading your post, but I am NOT
 proposing a string *class*.

Right, but here I used the term "class" to be more generic as in being a user defined type, i.e. struct or class. I should have been more clear.
Mar 10 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/10/2014 1:36 PM, Steven Schveighoffer wrote:
 What strings are already is a user-defined type,

No, they are not.
 but with horrible enforcement.

With no enforcement, and that is by design. Keep in mind that D is a systems programming language, and that means unfettered access to strings.
Mar 10 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/10/2014 2:09 PM, Steven Schveighoffer wrote:
 What in my proposal makes you think you don't have unfettered access? The
 underlying immutable(char)[] representation is accessible. In fact, you would
 have more access, since phobos functions would then work with a char[] like it's
 a proper array.

You divide the D world into two camps - those that use 'struct string', and those that use immutable(char)[] strings.
 I imagine only code that is currently UTF ignorant will break,

This also makes it a non-starter.
Mar 10 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/10/2014 3:26 PM, Steven Schveighoffer wrote:
 On Mon, 10 Mar 2014 17:52:05 -0400, Walter Bright <newshound2 digitalmars.com>
 wrote:
 This also makes it a non-starter.

 You're the guardian of changes to the language, clearly holding a veto on any proposals. But this doesn't come across as very open-minded, especially from someone who wanted to do something that would change the fundamental treatment of strings last week.

I deserve that criticism. On the other hand, I've pretty much given up on fixing std.array.front() because of that. In the last couple days, we also wound up annoying a valuable client with some minor breakage with std.json, reiterating how important it is to not break code if we can at all avoid it.

 IMO, breaking incorrect code is a good idea, and worth at least exploring.

Breaking broken code, yes.
Mar 10 2014
prev sibling parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
On 3/10/14, 3:30 PM, Walter Bright wrote:
 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated
 specially by
 phobos: introduce an actual string type, with char[] as backing, that
 is a dchar
 range, that actually dictates the rules we want. Then, make the
 compiler use
 this type for literals.

Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.

You can also look at Erlang, where strings are just lists of numbers. Eventually they realized it was a huge mistake and introduced another type, a binary string, which is much more efficient and works as expected. I think making strings behave like arrays is a design mistake.
Mar 12 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/12/14, 6:24 AM, Ary Borenszweig wrote:
 On 3/10/14, 3:30 PM, Walter Bright wrote:
 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated
 specially by
 phobos: introduce an actual string type, with char[] as backing, that
 is a dchar
 range, that actually dictates the rules we want. Then, make the
 compiler use
 this type for literals.

Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.

You can also look at Erlang, where strings are just lists of numbers. Eventually they realized it was a huge mistake and introduced another type, a binary string, which is much more efficient and works as expected. I think making strings behave like arrays is a design mistake.

Erlang's mistake was different from what you believe was D's mistake. There is no comparison to be drawn. Andrei
Mar 12 2014
parent reply Ary Borenszweig <ary esperanto.org.ar> writes:
On 3/12/14, 1:53 PM, Andrei Alexandrescu wrote:
 On 3/12/14, 6:24 AM, Ary Borenszweig wrote:
 On 3/10/14, 3:30 PM, Walter Bright wrote:
 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated
 specially by
 phobos: introduce an actual string type, with char[] as backing, that
 is a dchar
 range, that actually dictates the rules we want. Then, make the
 compiler use
 this type for literals.

Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.

You can also look at Erlang, where strings are just lists of numbers. Eventually they realized it was a huge mistake and introduced another type, a binary string, which is much more efficient and works as expected. I think making strings behave like arrays is a design mistake.

Erlang's mistake was different from what you believe was D's mistake. There is no comparison to be drawn. Andrei

What's D's mistake then?
Mar 12 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/12/14, 10:29 AM, Ary Borenszweig wrote:
 What's D's mistake then?

I don't think we made a mistake with D's strings. They could have been done better if we made all iteration requests explicit. Andrei
Mar 12 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 14:30:07 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated  
 specially by
 phobos: introduce an actual string type, with char[] as backing, that  
 is a dchar
 range, that actually dictates the rules we want. Then, make the  
 compiler use
 this type for literals.

Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.

I wholly agree, they should be an array type. But what they are now is worse. -Steve
Mar 10 2014
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Mon, 10 Mar 2014 11:30:07 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated
 specially by phobos: introduce an actual string type, with char[]
 as backing, that is a dchar range, that actually dictates the rules
 we want. Then, make the compiler use this type for literals.

Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.

Question: which type T doesn't have slicing, has an ElementType of dchar, has typeof(T[0]).sizeof == 4, ElementEncodingType!T == char and still satisfies isArray?

It's a string. Would you call that 'an array type'?

    writeln(isArray!string);   //true
    writeln(hasSlicing!string); //false
    writeln(ElementType!string.stringof); //dchar
    writeln(ElementEncodingType!string.stringof); //char

I wouldn't call that an array. Part of the problem is that you want string to be arrays (fixed size elements, direct indexing) and Andrei doesn't want them to be arrays (operating on code points => not fixed size => not arrays).
Mar 10 2014
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 14:30:07 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated  
 specially by
 phobos: introduce an actual string type, with char[] as backing, that  
 is a dchar
 range, that actually dictates the rules we want. Then, make the  
 compiler use
 this type for literals.

Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.

BTW, this escaped my view the first time reading your post, but I am NOT proposing a string *class*. In fact, I'm not proposing we change anything technical about strings, the code generated should be basically identical. What I'm proposing is to encapsulate what you can and can't do with a string in the type itself, instead of making the standard library flip over backwards to treat it as something else when the compiler treats it as a simple array of char. -Steve
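
As a rough illustration of what such an encapsulating type could look like, here is a minimal sketch, assuming a current Phobos with `std.utf.decode` and `std.utf.stride`. The name `String` is hypothetical (chosen only to avoid clashing with the built-in `string` alias), and the exact member set is an assumption; the thread itself only specifies a `representation` field plus dchar range primitives, with slicing kept and indexing and `.length` removed.

```d
import std.range.primitives : ElementType, isInputRange, isRandomAccessRange;
import std.utf : decode, stride;

// Hypothetical wrapper type; not part of Phobos or the language.
struct String
{
    immutable(char)[] representation; // raw UTF-8 code units, always reachable

    this(immutable(char)[] data) { representation = data; }

    // dchar input-range primitives
    @property bool empty() { return representation.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(representation, i); // decode the first code point
    }

    void popFront()
    {
        // stride() gives the number of code units in the first code point
        representation = representation[stride(representation) .. $];
    }

    // Slicing stays, expressed in code unit offsets.
    String opSlice(size_t lo, size_t hi)
    {
        return String(representation[lo .. hi]);
    }
    // Deliberately no opIndex and no length.
}

static assert(isInputRange!String);
static assert(is(ElementType!String == dchar));
static assert(!isRandomAccessRange!String);
static assert(!__traits(compiles, String("abc")[1]));     // indexing rejected
static assert(!__traits(compiles, String("abc").length)); // length rejected

void main()
{
    auto s = String("casse\u0301"); // 'e' followed by a combining accent
    size_t points;
    foreach (dchar c; s) ++points;
    assert(points == 6);                  // six code points
    assert(s.representation.length == 7); // seven code units
    assert(s[0 .. 5].representation == "casse");
}
```

In this sketch the compiler and the library agree on what the type is, because the only operations that exist are the ones the struct defines.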
Mar 10 2014
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Mar 11, 2014 at 12:49:40AM +0000, Meta wrote:
 On Tuesday, 11 March 2014 at 00:02:13 UTC, bearophile wrote:
Walter Bright:

In the last couple days, we also wound up annoying a valuable
client with some minor breakage with std.json, reiterating how
important it is to not break code if we can at all avoid it..

There are still some breaking changes that I'd like to perform in D, like deprecating certain usages of the comma operator, etc.


 That damnable comma operator is one of the worst things that was
 inherited from C. IMO, it has no use outside the header of a for
 loop, and even there it's suspect.

I've always been of the opinion that the comma operator in a for loop should be treated as special syntax, rather than a language-wide operator. The comma operator must die. :P

T

-- 
Public parking: euphemism for paid parking. -- Flora
Mar 11 2014
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Mon, 10 Mar 2014 13:55:00 -0400
schrieb "Steven Schveighoffer" <schveiguy yahoo.com>:

 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net>
 wrote:
 
 It seems like this would be an even bigger breaking change than
 Walter's proposal though (right or wrong, slicing strings is very
 common).

You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve

Unfortunately slicing by code units is probably the most important safety issue with the current implementation: As was mentioned in the other thread:

    size_t index = str.countUntil('a');
    auto slice = str[0..index];

This can be a safety and security issue.

(I realize that this would break lots of code so I'm not sure if we should/can fix it. But I think this was the most important problem mentioned in the other thread.)
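
To make the hazard concrete, here is a small sketch, assuming a Phobos version in which `countUntil` auto-decodes narrow strings (so it counts code points) while `std.string.indexOf` returns a code unit offset. The helper names are made up for illustration, and both helpers assume the searched character is present.

```d
import std.algorithm.searching : countUntil;
import std.string : indexOf;
import std.utf : UTFException, validate;

// Mixes a code point count with a code unit slice: the bug described above.
string prefixByCodePointCount(string str, dchar c)
{
    return str[0 .. str.countUntil(c)];
}

// Uses a code unit offset for a code unit slice: consistent.
string prefixByCodeUnitOffset(string str, dchar c)
{
    return str[0 .. str.indexOf(c)];
}

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try { validate(s); return true; }
    catch (UTFException) { return false; }
}

void main()
{
    string str = "\u00E9a"; // 'é' takes two code units, then 'a'

    auto bad = prefixByCodePointCount(str, 'a');
    assert(bad.length == 1);   // cuts 'é' in half
    assert(!isValidUtf8(bad));

    auto good = prefixByCodeUnitOffset(str, 'a');
    assert(good == "\u00E9");
    assert(isValidUtf8(good));
}
```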
Mar 10 2014
prev sibling next sibling parent "Artem Tarasov" <lomereiter gmail.com> writes:
On Monday, 10 March 2014 at 18:50:28 UTC, Johannes Pfau wrote:
 Question: which type T doesn't have slicing, has an ElementType 
 of
 dchar, has typeof(T[0]).sizeof == 4, ElementEncodingType!T == 
 char and
 still satisfies isArray?

In addition, hasLength!T == false, which totally freaked me out when I first discovered that.
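
For anyone who wants to verify these trait results, the following is a minimal sketch, assuming a Phobos version that still auto-decodes narrow strings. It contrasts what the compiler accepts on a `string` with what the range traits report; nothing here is new API, only `std.traits` and `std.range.primitives`.

```d
import std.range.primitives : ElementEncodingType, ElementType, hasLength,
    hasSlicing, isRandomAccessRange;
import std.traits : isArray;

// What the compiler sees: a plain array of UTF-8 code units.
static assert(isArray!string);
static assert(is(typeof(string.init[0]) == immutable(char)));
static assert(is(typeof(string.init.length) == size_t));

// What Phobos' range traits report: a range of dchar without
// length, slicing or random access.
static assert(is(ElementType!string == dchar));
static assert(is(ElementEncodingType!string == immutable(char)));
static assert(!hasSlicing!string);
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);

void main() {}
```

If the file compiles, every statement above holds for the installed compiler and library.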
Mar 10 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 10 March 2014 at 18:09:51 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 14:01:45 -0400, John Colvin 
 <john.loughran.colvin gmail.com> wrote:

 On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer 
 wrote:
 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson 
 <eco gnuk.net> wrote:

 It seems like this would be an even bigger breaking change 
 than Walter's proposal though (right or wrong, slicing 
 strings is very common).

You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve

How is slicing any better than indexing?

Because one can slice out a multi-code-unit code point, one cannot access it via index. Strings would be horribly crippled without slicing. Without indexing, they are fine. A possibility is to allow index, but actually decode the code point at that index (error on invalid index). That might actually be the correct mechanism. -Steve

In order to be correct, both require exactly the same knowledge: The beginning of a code point, followed by the end of a code point. In the indexing case they just happen to be the same code-point and happen to be one code unit from each other. I don't see how one is any more or less error-prone or fundamentally wrong than the other. I do understand that slicing is more important however.
Mar 10 2014
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Mar 10, 2014 at 07:49:04PM +0100, Johannes Pfau wrote:
 Am Mon, 10 Mar 2014 11:30:07 -0700
 schrieb Walter Bright <newshound2 digitalmars.com>:
 
 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated
 specially by phobos: introduce an actual string type, with char[]
 as backing, that is a dchar range, that actually dictates the
 rules we want. Then, make the compiler use this type for literals.

Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.


I'm on the fence about this one. The nice thing about strings being an array type, is that it is a familiar concept to C coders, and it allows array slicing for extracting substrings, etc., which fits nicely with the C view of strings as character arrays. As a C coder myself, I like it this way too.

But the bad thing about strings being an array type, is that it's a holdover from C, and it allows slicing for extracting substrings -- malformed substrings by permitting slicing a multibyte (multiword) character.

Basically, the nice aspects of strings being arrays only apply when you're dealing with ASCII (or mostly-ASCII) strings. These very same "nice" aspects turn into problems when dealing with anything non-ASCII. The only way the user can get it right using only array operations, is if they understand the whole of Unicode in their head and are willing to reinvent Unicode algorithms every time they slice a string or do some operation on it.

Since D purportedly supports Unicode by default, it shouldn't be this way. D should *actually* support Unicode all the way -- use proper Unicode algorithms for substring extraction, collation, line-breaking, normalization, etc.. Being a systems language, of course, means that D should allow you to get under the hood and do things directly with the raw string representation -- but this shouldn't be the *default* modus operandi. The default should be a properly-encapsulated string type with Unicode algorithms to operate on it (with the option of reaching into the raw representation where necessary).

 Question: which type T doesn't have slicing, has an ElementType of
 dchar, has typeof(T[0]).sizeof == 4, ElementEncodingType!T == char and
 still satisfies isArray?
 
 It's a string. Would you call that 'an array type'?
 
 	writeln(isArray!string);   //true
 	writeln(hasSlicing!string); //false
 	writeln(ElementType!string.stringof); //dchar
 	writeln(ElementEncodingType!string.stringof); //char
 
 I wouldn't call that an array. Part of the problem is that you want
 string to be arrays (fixed size elements, direct indexing) and Andrei
 doesn't want them to be arrays (operating on code points => not fixed
 size => not arrays).

Exactly. What we have right now is a frankensteinian hybrid that's neither fully an array, nor fully a Unicode string type. If we call the current messy AA implementation split between compiler, aaA.d, and object.di a design problem, then I'd call the current state of D strings a design problem too.

This underlying inconsistency is ultimately what leads to the poor performance of strings in std.algorithm. It's precisely because of this that I've given up on using std.algorithm for strings altogether -- std.regex is far better: more flexible, more expressive, and more performant, and specifically designed to operate on strings. Nowadays I only use std.algorithm for non-string ranges (because then the behaviour is actually consistent!!).

T

-- 
MS Windows: 64-bit overhaul of 32-bit extensions and a graphical shell for a 16-bit patch to an 8-bit operating system originally coded for a 4-bit microprocessor, written by a 2-bit company that can't stand 1-bit of competition.
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 15:30:00 -0400, John Colvin  
<john.loughran.colvin gmail.com> wrote:

 On Monday, 10 March 2014 at 18:09:51 UTC, Steven Schveighoffer wrote:
 Because one can slice out a multi-code-unit code point, one cannot  
 access it via index. Strings would be horribly crippled without  
 slicing. Without indexing, they are fine.

 A possibility is to allow index, but actually decode the code point at  
 that index (error on invalid index). That might actually be the correct  
 mechanism.

In order to be correct, both require exactly the same knowledge: The beginning of a code point, followed by the end of a code point. In the indexing case they just happen to be the same code-point and happen to be one code unit from each other. I don't see how one is any more or less error-prone or fundamentally wrong than the other.

Using indexing, you simply cannot get the single code unit that represents a multi-code-unit code point. It doesn't fit in a char. It's guaranteed to fail, whereas slicing will give you access to the all the data in the string. Now, with indexing actually decoding a code point, one can alias a[i] to a[i..$].front(), which means decode the first code point you come to at index i. This means indexing is slow(er), and returns a dchar. I think as a first step, that might be too much to add silently. I'd rather break it first, then add it back later. -Steve
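
The decoding form of indexing described here can be sketched with today's Phobos, assuming auto-decoding `front` for narrow strings. The function name `codePointAt` is hypothetical, and the index is a code unit offset, not a character count.

```d
import std.exception : assertThrown;
import std.range.primitives : front;
import std.utf : UTFException;

// Decode the code point that starts at code unit offset i.
// Throws UTFException if i does not point at the start of a code point.
dchar codePointAt(string s, size_t i)
{
    return s[i .. $].front;
}

void main()
{
    string s = "casse\u0301"; // 'e' followed by U+0301, encoded as CC 81

    assert(codePointAt(s, 4) == 'e');
    assert(codePointAt(s, 5) == '\u0301'); // whole code point, returned as dchar

    // Offset 6 is the second code unit of U+0301, so decoding fails.
    assertThrown!UTFException(codePointAt(s, 6));
}
```

The result type is `dchar` and each call decodes, which is the cost and the silent behaviour change the post is wary of.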
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 14:54:22 -0400, Johannes Pfau <nospam example.com>  
wrote:

 Am Mon, 10 Mar 2014 13:55:00 -0400
 schrieb "Steven Schveighoffer" <schveiguy yahoo.com>:

 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net>
 wrote:

 It seems like this would be an even bigger breaking change than
 Walter's proposal though (right or wrong, slicing strings is very
 common).

You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve

Unfortunately slicing by code units is probably the most important safety issue with the current implementation: As was mentioned in the other thread:

    size_t index = str.countUntil('a');
    auto slice = str[0..index];

This can be a safety and security issue.

(I realize that this would break lots of code so I'm not sure if we should/can fix it. But I think this was the most important problem mentioned in the other thread.)

Slicing can never be a code point based operation. It would be too slow (read linear complexity). What needs to be broken is the expectation that an index is the number of code points or characters in a string. Think of an index as a position that has no real meaning except they are ordered in the stream. Like a set of ordered numbers, not necessarily consecutive. The index 4 may not exist, while 5 does. At this point, my proposal does not fix that particular problem, but I don't think there's any way to fix that "problem" except to train the user who wrote it not to do that. However, it does not leave us in a worse position. -Steve
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 16:06:25 -0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:


 Think of an index as a position that has no real meaning except they are  
 ordered in the stream. Like a set of ordered numbers, not necessarily  
 consecutive. The index 4 may not exist, while 5 does.

I said that wrong, of course it has meaning. What I mean is that if you have two positions, the ordering will indicate where the characters/graphemes/code points occur in the stream, but their value will not be indicative of how far they are apart in terms of characters/graphemes/code points.

In other words, if I have two characters, at position p1 and p2, then

p1 > p2 => p1 comes later in the string than p2
p1 == p2 => p1 and p2 refer to the same character
p1 - p2 => not defined to any particular value.

-Steve
Mar 10 2014
prev sibling next sibling parent "Brad Anderson" <eco gnuk.net> writes:
On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson 
 <eco gnuk.net> wrote:

 It seems like this would be an even bigger breaking change 
 than Walter's proposal though (right or wrong, slicing strings 
 is very common).

You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve

Sorry, I misunderstood. That sounds reasonable.
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 15:01:20 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/10/2014 11:54 AM, Steven Schveighoffer wrote:
 BTW, this escaped my view the first time reading your post, but I am NOT
 proposing a string *class*.

Right, but here I used the term "class" to be more generic as in being a user defined type, i.e. struct or class. I should have been more clear.

Then I don't understand your point. What strings are already is a user-defined type, but with horrible enforcement. i.e. things that shouldn't be allowed are only disallowed if you opt-in using phobos' template constraints. -Steve
Mar 10 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 10 March 2014 at 20:00:07 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 15:30:00 -0400, John Colvin 
 <john.loughran.colvin gmail.com> wrote:

 On Monday, 10 March 2014 at 18:09:51 UTC, Steven Schveighoffer 
 wrote:
 Because one can slice out a multi-code-unit code point, one 
 cannot access it via index. Strings would be horribly 
 crippled without slicing. Without indexing, they are fine.

 A possibility is to allow index, but actually decode the code 
 point at that index (error on invalid index). That might 
 actually be the correct mechanism.

In order to be correct, both require exactly the same knowledge: The beginning of a code point, followed by the end of a code point. In the indexing case they just happen to be the same code-point and happen to be one code unit from each other. I don't see how one is any more or less error-prone or fundamentally wrong than the other.

Using indexing, you simply cannot get the single code unit that represents a multi-code-unit code point. It doesn't fit in a char. It's guaranteed to fail, whereas slicing will give you access to the all the data in the string.

I think I understand your motivation now. Indexing never provides anything that slicing doesn't do more generally.
 Now, with indexing actually decoding a code point, one can 
 alias a[i] to a[i..$].front(), which means decode the first 
 code point you come to at index i. This means indexing is 
 slow(er), and returns a dchar. I think as a first step, that 
 might be too much to add silently. I'd rather break it first, 
 then add it back later.

 -Steve

Of course that i has to be at the beginning of a code-point. Doesn't seem like that useful a feature and potentially very confusing for people who naively expect normal indexing.
Mar 10 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 10 March 2014 at 19:48:34 UTC, H. S. Teoh wrote:
 On Mon, Mar 10, 2014 at 07:49:04PM +0100, Johannes Pfau wrote:
 Am Mon, 10 Mar 2014 11:30:07 -0700
 schrieb Walter Bright <newshound2 digitalmars.com>:
 
 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being 
 treated
 specially by phobos: introduce an actual string type, with 
 char[]
 as backing, that is a dchar range, that actually dictates 
 the
 rules we want. Then, make the compiler use this type for 
 literals.

Proposals to make a string class for D have come up many times. I have a kneejerk dislike for it. It's a really strong feature for D to have strings be an array type, and I'll go to great lengths to keep it that way.


I'm on the fence about this one. The nice thing about strings being an array type, is that it is a familiar concept to C coders, and it allows array slicing for extracting substrings, etc., which fits nicely with the C view of strings as character arrays. As a C coder myself, I like it this way too. But the bad thing about strings being an array type, is that it's a holdover from C, and it allows slicing for extracting substrings -- malformed substrings by permitting slicing a multibyte (multiword) character. Basically, the nice aspects of strings being arrays only apply when you're dealing with ASCII (or mostly-ASCII) strings. These very same "nice" aspects turn into problems when dealing with anything non-ASCII. The only way the user can get it right using only array operations, is if they understand the whole of Unicode in their head and are willing to reinvent Unicode algorithms every time they slice a string or do some operation on it. Since D purportedly supports Unicode by default, it shouldn't be this way. D should *actually* support Unicode all the way -- use proper Unicode algorithms for substring extraction, collation, line-breaking, normalization, etc.. Being a systems language, of course, means that D should allow you to get under the hood and do things directly with the raw string representation -- but this shouldn't be the *default* modus operandi. The default should be a properly-encapsulated string type with Unicode algorithms to operate on it (with the option of reaching into the raw representation where necessary).

You started off on the fence, but you seem pretty convinced by the end!
Mar 10 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 10 March 2014 at 20:52:27 UTC, Walter Bright wrote:
 On 3/10/2014 1:36 PM, Steven Schveighoffer wrote:
 What strings are already is a user-defined type,

No, they are not.
 but with horrible enforcement.

With no enforcement, and that is by design. Keep in mind that D is a systems programming language, and that means unfettered access to strings.

I don't see how this proposal would limit that access. The raw immutable(char)[] is still there, ready to be used just as always. It seems like it fits the D ethos: safe and reasonably fast by default, unsafe and lightning fast on request. (Admittedly a bad wording, sometimes the fastest can still be safe)
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 16:52:27 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/10/2014 1:36 PM, Steven Schveighoffer wrote:
 What strings are already is a user-defined type,

No, they are not.

The functionality added via phobos can hardly be considered extraneous. One would not use strings without the library.
 but with horrible enforcement.

With no enforcement, and that is by design.

The enforcement is opt-in. That is, you have to use phobos' templates in order to use them "properly":

    auto getIt(R)(R r, size_t idx)
    {
        if(idx < r.length)
            return r[idx];
    }

The above compiles fine for strings. However, it does not compile fine if you do:

    auto getIt(R)(R r, size_t idx) if(hasLength!R && isRandomAccessRange!R)

Any other range will fail to compile for the more strict version and the simple implementation without template constraints. In other words, the compiler doesn't believe the same thing phobos does. Shooting one's foot is quite easy.
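
The mismatch can be checked mechanically. Below is a compilable sketch, assuming current Phobos traits; the names `getItLoose` and `getItStrict` are invented here to keep both variants in one file, and an `assert` stands in for the bounds check so that every path returns a value.

```d
import std.range.primitives : hasLength, isRandomAccessRange;

// Unconstrained: relies only on what the compiler allows.
auto getItLoose(R)(R r, size_t idx)
{
    assert(idx < r.length);
    return r[idx];
}

// Constrained: relies on what Phobos' range traits report.
auto getItStrict(R)(R r, size_t idx)
if (hasLength!R && isRandomAccessRange!R)
{
    assert(idx < r.length);
    return r[idx];
}

// The compiler treats string as an array, so the loose version compiles...
static assert( __traits(compiles, getItLoose("abc", 1)));
// ...but Phobos says string has no length or random access.
static assert(!__traits(compiles, getItStrict("abc", 1)));
// An ordinary array satisfies both.
static assert( __traits(compiles, getItStrict([1, 2, 3], 1)));

void main()
{
    assert(getItLoose("abc", 1) == 'b'); // returns a code unit
    assert(getItStrict([1, 2, 3], 1) == 2);
}
```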
 Keep in mind that D is a systems programming language, and that means  
 unfettered access to strings.

Access is fine, with clear intentions. And we do not have unfettered access. I cannot sort a mutable string of ASCII characters without first converting it to ubyte[]. What in my proposal makes you think you don't have unfettered access? The underlying immutable(char)[] representation is accessible. In fact, you would have more access, since phobos functions would then work with a char[] like it's a proper array. -Steve
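
The sorting complaint can be reproduced as follows. This is a minimal sketch, assuming current Phobos, where `std.string.representation` reinterprets a `char[]` as `ubyte[]` over the same memory; the helper name `sortAscii` is made up for the example.

```d
import std.algorithm.sorting : sort;
import std.string : representation;

// Sort a mutable ASCII string in place and return it.
char[] sortAscii(char[] s)
{
    // Phobos does not consider char[] a random-access range,
    // so sorting it directly is rejected at compile time.
    static assert(!__traits(compiles, sort(s)));

    // The ubyte[] view of the same memory is an ordinary array.
    sort(s.representation);
    return s;
}

void main()
{
    assert(sortAscii("hello".dup) == "ehllo");
}
```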
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 16:54:34 -0400, John Colvin   
<john.loughran.colvin gmail.com> wrote:
 Of course that i has to be at the beginning of a code-point. Doesn't  
 seem like that useful a feature and potentially very confusing for  
 people who naively expect normal indexing.

What it would do is remove the confusion of is(typeof(r.front) != typeof(r[0]))

Naivety is to be expected when you have made your C-derived language's default string type an encoded UTF8 array called char[]. It doesn't magically make D programs UTF aware.

I would suggest that a lofty goal is for the string type to be completely safe, and efficient, and only allow raw access via the .representation member. But I don't think, given the current code base, that we can achieve that in one proposal. It has to be gradual. This is a first step.

-Steve
Mar 10 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
 I proposed this inside the long "major performance problem with 
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not 
 even negative attention :)

 An idea to fix the whole problems I see with char[] being 
 treated specially by phobos: introduce an actual string type, 
 with char[] as backing, that is a dchar range, that actually 
 dictates the rules we want. Then, make the compiler use this 
 type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via 
 dchar
 2. No more issues with "cassé"[4], it is a static compiler 
 error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the 
 compiler.
 6. Any other special rules we come up with can be dictated by 
 the library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still 
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar 
 range. Use std.algorithm.copy(string1.representation, 
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, 
 and that code is easily 'fixed' by adding the 'representation' 
 qualifier.

 -Steve

just to check I understand this fully: in this new scheme, what would this do?

    auto s = "cassé".representation;
    foreach(i, c; s) write(i, ':', c, ' ');
    writeln(s);

Currently - without the .representation - I get

    0:c 1:a 2:s 3:s 4:e 5:̠6:` cassé

or, to spell it out a bit more:

    0:c 1:a 2:s 3:s 4:e 5:xCC 6:x81 cassé
Mar 10 2014
prev sibling next sibling parent "Chris Williams" <yoreanon-chrisw yahoo.co.jp> writes:
On Monday, 10 March 2014 at 18:13:14 UTC, Steven Schveighoffer 
wrote:
 Indexing is rarely a feature one needs or should use, 
 especially with encoded strings.

If I was writing something like a chat or terminal window, I would want to be able to jump to chunks of text based on some sort of buffer length, then search for actual character boundaries. Similarly, if I was indexing text, I don't care what the underlying data is just whether any particular set of n-bytes have been seen together among some document. For the latter case, I don't need to be able to interpret the data as text while indexing, but once I perform an actual search and want to jump the user to that line in the file, being able to take a byte offset that I had stored in the index and convert that to a textual position would be good.

I do think that D should have something like

    alias String8 = UTF!char;
    alias String16 = UTF!wchar;
    alias String32 = UTF!dchar;

And that those sit on top of an underlying immutable(xchar)[] buffer, providing variants of things like foreach and length based on code-point or grapheme boundaries. But I don't think there's any value in reinterpreting "string". Not being a struct or an object, it doesn't have the extensibility to be useful for all the variations of access that working with Unicode and the underlying bytes warrants.
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 17:46:23 -0400, John Colvin
<john.loughran.colvin gmail.com> wrote:

 On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:
 I proposed this inside the long "major performance problem with
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not even
 negative attention :)

 An idea to fix the whole problems I see with char[] being treated
 specially by phobos: introduce an actual string type, with char[] as
 backing, that is a dchar range, that actually dictates the rules we
 want. Then, make the compiler use this type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via dchar
 2. No more issues with "cassé"[4], it is a static compiler error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the compiler.
 6. Any other special rules we come up with can be dictated by the
 library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar range.
 Use std.algorithm.copy(string1.representation,
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, and that
 code is easily 'fixed' by adding the 'representation' qualifier.

 -Steve

just to check I understand this fully: in this new scheme, what would this do?

    auto s = "cassé".representation;
    foreach(i, c; s) write(i, ':', c, ' ');
    writeln(s);

Currently - without the .representation - I get

    0:c 1:a 2:s 3:s 4:e 5:̠6:` cassé

or, to spell it out a bit more:

    0:c 1:a 2:s 3:s 4:e 5:xCC 6:x81 cassé

The plan is for foreach on s to iterate by char, and foreach on "cassé" to iterate by dchar.

What this means is the accent will be iterated separately from the e, and likely gets put onto the colon after 5. However, the half code-units that have no meaning anywhere (xCC and x81) would not be iterated.

In your above code, using .representation would be equivalent to what it is now without .representation (i.e. over char), and without .representation would be equivalent to this on today's compiler (except faster):

    foreach(i, dchar c; s)

-Steve
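
Both iteration modes already exist in the language today when the loop variable type is spelled out, so the difference can be shown with a small sketch. The function names are invented for the example; the behaviour shown is the compiler's built-in `foreach` over a `string`, not the proposed type.

```d
// Count UTF-8 code units: no decoding takes place.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s) ++n;
    return n;
}

// Count code points: the compiler decodes UTF-8 on the fly.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s) ++n;
    return n;
}

// With an index, foreach reports the code unit offset at which
// each decoded code point starts.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    foreach (i, dchar c; s) offsets ~= i;
    return offsets;
}

void main()
{
    string s = "casse\u0301"; // 'e' plus combining accent (CC 81)
    assert(countCodeUnits(s) == 7);
    assert(countCodePoints(s) == 6); // accent is its own code point

    // 'é' as a single two-unit code point, then 'a' at offset 2.
    assert(codePointOffsets("\u00E9a") == [0, 2]);
}
```

Note that the offsets skip values, which matches the earlier description of indices as ordered positions that are not necessarily consecutive.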
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 17:52:05 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/10/2014 2:09 PM, Steven Schveighoffer wrote:
 What in my proposal makes you think you don't have unfettered access?  
 The
 underlying immutable(char)[] representation is accessible. In fact, you  
 would
 have more access, since phobos functions would then work with a char[]  
 like it's
 a proper array.

You divide the D world into two camps - those that use 'struct string', and those that use immutable(char)[] strings.

Really? It's not that divisive. However, the situation is certainly better than today's world of those who use 'string' and those who use 'string.representation'. Those who use string.representation would actually get much more use out of it. Those who use string would see no changes.
  > I imagine only code that is currently UTF ignorant will break,

 This also makes it a non-starter.

You're the guardian of changes to the language, clearly holding a veto on any proposals. But this doesn't come across as very open-minded, especially from someone who wanted to do something that would change the fundamental treatment of strings last week. IMO, breaking incorrect code is a good idea, and worth at least exploring. -Steve
Mar 10 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 In the last couple days, we also wound up annoying a valuable 
 client with some minor breakage with std.json, reiterating how 
 important it is to not break code if we can at all avoid it..

There are still some breaking changes that I'd like to perform in D, like deprecating certain usages of the comma operator, etc. Bye, bearophile
Mar 10 2014
prev sibling next sibling parent "Meta" <jared771 gmail.com> writes:
On Tuesday, 11 March 2014 at 00:02:13 UTC, bearophile wrote:
 Walter Bright:

 In the last couple days, we also wound up annoying a valuable 
 client with some minor breakage with std.json, reiterating how 
 important it is to not break code if we can at all avoid it..

There are still some breaking changes that I'd like to perform in D, like deprecating certain usages of the comma operator, etc. Bye, bearophile

That damnable comma operator is one of the worst things that was inherited from C. IMO, it has no use outside the header of a for loop, and even there it's suspect.
Mar 10 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 10 March 2014 at 22:15:34 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 17:46:23 -0400, John Colvin 
 <john.loughran.colvin gmail.com> wrote:

 On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
 wrote:
 I proposed this inside the long "major performance problem 
 with std.array.front," I've also proposed it before, a long 
 time ago.

 But seems to be getting no attention buried in that thread, 
 not even negative attention :)

 An idea to fix the whole problems I see with char[] being 
 treated specially by phobos: introduce an actual string type, 
 with char[] as backing, that is a dchar range, that actually 
 dictates the rules we want. Then, make the compiler use this 
 type for literals.

 e.g.:

 struct string {
   immutable(char)[] representation;
   this(char[] data) { representation = data;}
   ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via 
 dchar
 2. No more issues with "cassé"[4], it is a static compiler 
 error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the 
 compiler.
 6. Any other special rules we come up with can be dictated by 
 the library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still 
 decode/encode, but it's more explicit. It's EXPLICITLY a 
 dchar range. Use std.algorithm.copy(string1.representation, 
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will 
 break, and that code is easily 'fixed' by adding the 
 'representation' qualifier.

 -Steve

just to check I understand this fully: in this new scheme, what would this do?

auto s = "cassé".representation;
foreach(i, c; s) write(i, ':', c, ' ');
writeln(s);

Currently - without the .representation - I get

0:c 1:a 2:s 3:s 4:e 5:̠6:` cassé

or, to spell it out a bit more:

0:c 1:a 2:s 3:s 4:e 5:xCC 6:x81 cassé

The plan is for foreach on s to iterate by char, and foreach on "cassé" to iterate by dchar. What this means is the accent will be iterated separately from the e, and likely gets put onto the colon after 5. However, the half code units that have no meaning on their own (xCC and x81) would not be iterated.

In your above code, using .representation would be equivalent to what it is now without .representation (i.e. over char), and without .representation would be equivalent to this on today's compiler (except faster):

foreach(i, dchar c; s)

-Steve
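
For reference, the two iteration modes can be checked on today's compiler with a minimal sketch like the one below. The helper names codeUnits and codePointOffsets are made up for illustration, and representation here is the existing std.string.representation (which yields immutable(ubyte)[]), not the member of the proposed struct.

```d
import std.string : representation;

// Collects the raw UTF-8 code units of s, i.e. what iterating
// over s.representation yields.
ubyte[] codeUnits(string s)
{
    ubyte[] units;
    foreach (b; s.representation)
        units ~= b;
    return units;
}

// Collects the code-unit offset at which each decoded code point starts.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    foreach (i, dchar c; s) // decodes UTF-8; i is a code-unit index
        offsets ~= i;
    return offsets;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT (bytes 0xCC 0x81)
    string s = "casse\u0301";

    auto units = codeUnits(s);
    assert(units.length == 7);
    assert(units[5] == 0xCC && units[6] == 0x81);

    // Six code points; the accent is its own code point starting at offset 5.
    size_t[] expected = [0, 1, 2, 3, 4, 5];
    assert(codePointOffsets(s) == expected);
}
```

Note that the accent is still a separate element when iterating by dchar; only the split of U+0301 into two meaningless bytes goes away.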

Awesome, let's do this :)
Mar 11 2014
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 10 March 2014 at 21:52:04 UTC, Walter Bright wrote:
 On 3/10/2014 2:09 PM, Steven Schveighoffer wrote:
 What in my proposal makes you think you don't have unfettered 
 access? The
 underlying immutable(char)[] representation is accessible. In 
 fact, you would
 have more access, since phobos functions would then work with 
 a char[] like it's
 a proper array.

You divide the D world into two camps - those that use 'struct string', and those that use immutable(char)[] strings.

I would go so far as to say this is a good thing, as long as the 'struct string' is transparently the default. If you want good unicode support that works in a sane and relatively transparent manner, just write string, use literals as normal etc. If you want a normal array of characters, that behaves sanely and consistently as an array, use char[] with relevant qualifiers.
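
A rough sketch of what such a library type could look like is given below. It is only an illustration under assumptions: the type is named String to avoid clashing with the existing alias, the member list is a guess at the "dchar range primitives" mentioned in the proposal, and nothing here reflects compiler support for literals.

```d
import std.range.primitives : ElementType, isForwardRange, walkLength;
import std.utf : decode, stride;

// Hypothetical string type: a forward range of dchar backed by UTF-8.
struct String
{
    immutable(char)[] representation;

    @property bool empty() { return representation.length == 0; }

    // Decodes the first code point without consuming it.
    @property dchar front()
    {
        size_t i = 0;
        return decode(representation, i);
    }

    // Skips one whole code point (1 to 4 code units).
    void popFront()
    {
        representation = representation[stride(representation, 0) .. $];
    }

    @property String save() { return this; }

    // O(1) slicing; lo and hi are code-unit indexes, not code-point counts.
    String opSlice(size_t lo, size_t hi)
    {
        return String(representation[lo .. hi]);
    }

    // No opIndex is defined, so s[4] does not compile.
}

static assert(isForwardRange!String);
static assert(is(ElementType!String == dchar));

void main()
{
    auto s = String("casse\u0301");
    assert(s.walkLength == 6);               // code points
    assert(s.representation.length == 7);    // code units
    assert(s[0 .. 4].representation == "cass");
    static assert(!__traits(compiles, s[4]));
}
```

Code that wants plain array behaviour would go through .representation, as described above.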
Mar 11 2014
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
Automatic decoding by default itself is a WTF factor. The problem 
with it is it encourages unicode ignorance and pretends to work 
correctly, so it's harder for the developer to discover the 
incorrectness.
Mar 11 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson 
 <eco gnuk.net> wrote:

 It seems like this would be an even bigger breaking change 
 than Walter's proposal though (right or wrong, slicing strings 
 is very common).

You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length. -Steve

It is unacceptable to have slicing which is not O(1) for basic types.
Mar 11 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 11 Mar 2014 09:11:22 -0400, Dicebot <public dicebot.lv> wrote:

 On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer wrote:
 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:

 It seems like this would be an even bigger breaking change than  
 Walter's proposal though (right or wrong, slicing strings is very  
 common).

You're the second person to mention that, I was not planning on disabling string slicing. Just random access to individual chars, and probably .length.

It is unacceptable to have slicing which is not O(1) for basic types.

It would be O(1), work just like it does today. -Steve
Mar 11 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Tuesday, 11 March 2014 at 14:04:38 UTC, Steven Schveighoffer 
wrote:
 It would be O(1), work just like it does today.

 -Steve

Today it works by allowing arbitrary index and not checking if resulting slice is valid UTF-8. Anything that implies decoding is O(n). What exactly do you have in mind for this?
Mar 11 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 11 Mar 2014 10:06:47 -0400, Dicebot <public dicebot.lv> wrote:

 On Tuesday, 11 March 2014 at 14:04:38 UTC, Steven Schveighoffer wrote:
 It would be O(1), work just like it does today.

 -Steve

Today it works by allowing arbitrary index and not checking if resulting slice is valid UTF-8. Anything that implies decoding is O(n). What exactly do you have in mind for this?

Well, a valid improvement would be to throw an exception when the slice didn't start/end on a valid code point. This is easily checkable in O(1) time, but I wouldn't recommend it to begin with, it may have huge performance issues. Typically, one does not arbitrarily slice up via some specific value, they use a function to get an index, and they don't care what the index value actually is.

Alternatively, it could be done via assert, to disable it during release mode. This might be acceptable.

But I would never expect any kind of indexing or slicing to use "number of code points", which clearly requires O(n) decoding to determine its position. That would be disastrous.

-Steve
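
The O(1) check mentioned above can be sketched as follows. This is an illustration only: isCodePointStart and checkedSlice are hypothetical names, and the check relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```d
// True if index i is the start of a code point in s (or the end of s).
// A single byte is inspected, so the check is O(1).
bool isCodePointStart(const(char)[] s, size_t i)
{
    return i == s.length || (s[i] & 0xC0) != 0x80;
}

// Slices by code-unit index; the boundary check is an assert, so it
// disappears in release builds.
inout(char)[] checkedSlice(inout(char)[] s, size_t lo, size_t hi)
{
    assert(isCodePointStart(s, lo) && isCodePointStart(s, hi),
           "slice does not fall on code point boundaries");
    return s[lo .. hi];
}

void main()
{
    string s = "\u00E9x"; // bytes: 0xC3 0xA9 'x'
    assert(isCodePointStart(s, 0));
    assert(!isCodePointStart(s, 1)); // middle of the two-byte sequence
    assert(isCodePointStart(s, 2));
    assert(isCodePointStart(s, 3));  // end of string
    assert(checkedSlice(s, 2, 3) == "x");
}
```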
Mar 11 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Meta:

 That damnable comma operator is one of the worst things that 
 was inherited from C. IMO, it has no use outside the header of 
 a for loop, and even there it's suspect.

The place for the discussion about the comma operator: https://d.puremagic.com/issues/show_bug.cgi?id=2659 Bye, bearophile
Mar 11 2014
prev sibling next sibling parent "Chris Williams" <yoreanon-chrisw yahoo.co.jp> writes:
On Tuesday, 11 March 2014 at 14:16:31 UTC, Steven Schveighoffer 
wrote:
 But I would never expect any kind of indexing or slicing to use 
 "number of code points", which clearly requires O(n) decoding 
 to determine its position. That would be disastrous.

If the indexes put into the slice aren't by code-point, but people need to use proper helper functions to convert a code-point into an index, then we're basically back to where we are today.
Mar 11 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 11 Mar 2014 13:18:46 -0400, Chris Williams  
<yoreanon-chrisw yahoo.co.jp> wrote:

 On Tuesday, 11 March 2014 at 14:16:31 UTC, Steven Schveighoffer wrote:
 But I would never expect any kind of indexing or slicing to use "number  
 of code points", which clearly requires O(n) decoding to determine its  
 position. That would be disastrous.

If the indexes put into the slice aren't by code-point, but people need to use proper helper functions to convert a code-point into an index, then we're basically back to where we are today.

No, where we are today is that in some cases, the language treats a char[] as an array of char, in other cases, it treats a char[] as a bi-directional dchar range.

What I'm proposing is we have a type that defines "This is what a string looks like," and it is consistent across all uses of the string, instead of the schizophrenic view we have now. I would also point out that quite a bit of deception and nonsense is needed to maintain that view, including things like:

assert(!hasLength!(char[]) && __traits(compiles, { char[] x; size_t y = x.length;}));

The documentation for hasLength says "Tests if a given range has the length attribute," which is clearly a lie.

However, I want to define right here, that index is not a number of code points. One does not frequently get code point counts, one gets indexes. It has always been that way, and I'm not planning to change that. That you can't use the index to determine the number of code points that came before it, is not a frequent issue that arises.

e.g., I want to find the first instance of "xyz" in a string, do I care how many code points it has to go through, or what point I have to slice the string to get that?

A previous poster brings up this incorrect code:

auto index = countUntil(str, "xyz");
auto newstr = str[index..$];

But it can easily be done this way also:

auto index = indexOf(str, "xyz");
auto codepts = walkLength(str[0..index]);
auto newstr = str[index..$];

Given how D works, I think it would be very costly and near impossible to somehow make the incorrect slice operation statically rejected. One simply has to be trained what a code point is, and what a code unit is. HOWEVER, for the most part, nobody needs to care. Strings work fine without having to randomly access specific code points or slice based on them. Using indexes works just fine.

-Steve
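
To make the difference between the two snippets concrete, here is a small self-contained check against current Phobos (assuming auto-decoding of narrow strings is still in effect). The sample string is made up; U+00E9 occupies two code units, so the code point count and the code unit index diverge.

```d
import std.algorithm.searching : countUntil;
import std.range.primitives : walkLength;
import std.string : indexOf;

void main()
{
    string str = "\u00E9 xyz"; // U+00E9 takes two UTF-8 code units

    // countUntil walks the auto-decoded range: it counts code points.
    ptrdiff_t cpCount = countUntil(str, "xyz");
    // indexOf returns a code-unit index usable for slicing.
    ptrdiff_t index = indexOf(str, "xyz");

    assert(cpCount == 2);
    assert(index == 3);

    assert(str[index .. $] == "xyz");     // correct slice
    assert(str[cpCount .. $] == " xyz");  // off by one code unit

    // The code point count can be recovered from the index when needed.
    assert(walkLength(str[0 .. index]) == 2);
}
```

Here the wrong slice happens to land on a code point boundary; with other inputs it can cut a multi-byte sequence in half.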
Mar 11 2014
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
On Tue, 11 Mar 2014 14:02:26 -0400,
"Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 A previous poster brings up this incorrect code:
 
 auto index = countUntil(str, "xyz");
 auto newstr = str[index..$];
 
 But it can easily be done this way also:
 
 auto index = indexOf(str, "xyz");
 auto codepts = walkLength(str[0..index]);
 auto newstr = str[index..$];
 
 Given how D works, I think it would be very costly and near
 impossible to somehow make the incorrect slice operation statically
 rejected. One simply has to be trained what a code point is, and what
 a code unit is. HOWEVER, for the most part, nobody needs to care.
 Strings work fine without having to randomly access specific code
 points or slice based on them. Using indexes works just fine.
 
 -Steve

Yes, you can work around the count problem, but then it is not "consistent across all uses of the string". What if the above code was a generic template written for arrays? Then it silently fails for strings and you have to special case it.

I think the problem here is that ranges / algorithms have to work on the same data type as slicing/indexing. If .front returns code units, then indexing/slicing should be done with code units. If it returns code points then slicing has to happen on code points for consistency or it should be disallowed. (Slicing on code units is important - no doubt. But it is error prone and should be explicit in some way: string.sliceCP(a, b) or string.representation[a..b])
Mar 11 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 11 Mar 2014 14:25:10 -0400, Johannes Pfau <nospam example.com>  
wrote:

 Yes, you can work around the count problem, but then it is not
 "consistent across all uses of the string". What if the above code was
 a generic template written for arrays? Then it silently fails for
 strings and you have to special case it.

 I think the problem here is that ranges / algorithms have to work on
 the same data type as slicing/indexing. If .front returns code units,
 then indexing/slicing should be done with code units. If it returns
 code points then slicing has to happen on code points for consistency
 or it should be disallowed. (Slicing on code units is important - no
 doubt. But it is error prone and should be explicit in some way:
 string.sliceCP(a, b) or string.representation[a..b])

I look at it a different way -- indexes are increasing, just not consecutive. If there is a way to say "indexes are not linear", then that would be a good trait to expose. For instance, think of a tree-map, which has keys that may not be consecutive. Should you be able to slice such a container? I'd say yes. But tree[0..5] may not get you the first 5 elements. -Steve
Mar 11 2014
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 11 March 2014 at 18:26:36 UTC, Johannes Pfau wrote:
 I think the problem here is that ranges / algorithms have to 
 work on the same data type as slicing/indexing. If .front 
 returns code units, then indexing/slicing should be done with 
 code units. If it returns code points then slicing has to 
 happen on code points for consistency or it should be 
 disallowed. (Slicing on code units is important - no doubt. But 
 it is error prone and should be explicit in some way:
 string.sliceCP(a, b) or string.representation[a..b])

I think it is important to remember that in terms of ranges/algorithms, strings are neither indexable nor sliceable ranges. The "only way" to slice a string in generic code is to explicitly test that a range is actually a string, and then knowingly call an "internal primitive" that is NOT a part of the range traits. So slicing/indexing *is* already disallowed, in terms of range/algorithms anyways.
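
This can be verified with the range traits directly. The checks below are a minimal sketch against current Phobos and assume auto-decoding of narrow strings is still in place; int[] is included only for contrast.

```d
import std.range.primitives;

// As far as the range traits are concerned, string is a bidirectional
// range of dchar with no length, no slicing and no random access.
static assert(isBidirectionalRange!string);
static assert(is(ElementType!string == dchar));
static assert(!hasLength!string);
static assert(!hasSlicing!string);
static assert(!isRandomAccessRange!string);

// The built-in array operations are still there at the language level.
static assert(__traits(compiles, { string x; size_t n = x.length; auto y = x[0 .. n]; }));

// A non-character array satisfies all of the traits.
static assert(hasLength!(int[]) && hasSlicing!(int[]) && isRandomAccessRange!(int[]));

void main() {}
```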
Mar 12 2014
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 11 March 2014 at 18:02:26 UTC, Steven Schveighoffer 
wrote:
 No, where we are today is that in some cases, the language 
 treats a char[] as an array of char, in other cases, it treats 
 a char[] as a bi-directional dchar range.

 -Steve

I want to mention something I've had trouble with recently, that I haven't seen mentioned yet, but is related: the ambiguity of the "lone char".

By that I mean: when a function accepts 'char' as an argument, it is (IMO) very hard to know if it is actually accepting:
1. An ASCII char in the 0 .. 128 range?
2. A code unit?
3. (heaven forbid) a code point in the 0 .. 256 range packed into a char?

Currently (fortunately? unfortunately?) the choice taken in our algorithms is 3, which is actually the 'safest' solution. So if you write:

find("cassé", cast(char)'é');

It *will* correctly find the 'é', but it *won't* search for it in individual code units.

--------

Another more pernicious case is that of output ranges. "put" is supposed to know how to convert any string/char width into any string/char width. Again, things become funky if you tell "put" to place a string into a sink that accepts a char. Is the sink actually telling you to feed it code units? Or ASCII?
Mar 12 2014
prev sibling next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
The Unicode standard is too complex for general purpose
algorithms to do useful things on D strings. We don't see that
however, since our writing systems are sufficiently well
supported.

As an inspiration I'll leave a string here that contains
combined characters in Korean
(http://decodeunicode.org/hangul_syllables)
and Latin as well as full width characters that span 2
characters in e.g. Latin, Greek or Cyrillic scripts
(http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):

Halfwidth / Ｆｕｌｌｗｉｄｔｈ, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊

(I used the "unfonts" package for the Hangul part)

What I want to say is that for correct Unicode handling we
should either use existing libraries or get a feeling for
what the Unicode standard provides, then form use cases out of it.

For example when we talk about the length of a string we are
actually talking about 4 different things:

  - number of code units
  - number of code points
  - number of user perceived characters
  - display width using a monospace font

The same distinction applies for slicing, depending on use case.
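
The first three of these can be computed with current Phobos, as in the minimal sketch below. The function names are made up for illustration; display width is left out because, as far as I know, Phobos has no function for it.

```d
import std.range.primitives : walkLength;
import std.uni : byGrapheme;

// Number of UTF-8 code units (what .length reports).
size_t codeUnitCount(string s) { return s.length; }

// Number of code points (walks the auto-decoded range, O(n)).
size_t codePointCount(string s) { return s.walkLength; }

// Number of user-perceived characters (grapheme clusters, O(n)).
size_t graphemeCount(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // 'e' + U+0301 COMBINING ACUTE ACCENT form one grapheme.
    string s = "casse\u0301";
    assert(codeUnitCount(s) == 7);
    assert(codePointCount(s) == 6);
    assert(graphemeCount(s) == 5);
}
```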

Related:
  - What normalization do D strings use. Both Linux and
    MacOS X use UTF-8, but the binary representation of non-ASCII
    file names is different.
  - How do we handle sorting strings?

The topic matter is complex, but not difficult (as in rocket science).
If we really want to find a solution, we should form an expert group
and stop talking until we read the latest Unicode specs. They are a
moving target. Don't expect to ever be "done" with full Unicode
support in D.

--
Marco
Mar 17 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
18-Mar-2014 10:21, Marco Leise wrote:
 The Unicode standard is too complex for general purpose
 algorithms to do useful things on D strings. We don't see that
 however, since our writing systems are sufficiently well
 supported.

 As an inspiration I'll leave a string here that contains
 combined characters in Korean
 (http://decodeunicode.org/hangul_syllables)
 and Latin as well as full width characters that span 2
 characters in e.g. Latin, Greek or Cyrillic scripts
 (http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):

 Halfwidth / Ｆｕｌｌｗｉｄｔｈ, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊

 (I used the "unfonts" package for the Hangul part)

 What I want to say is that for correct Unicode handling we
 should either use existing libraries or get a feeling for
 what the Unicode standard provides, then form use cases out of it.

There is ICU and very few other things, like support in OSX frameworks (NSString). Industry in general kinda sucks on this point but desperately wants to improve.
 For example when we talk about the length of a string we are
 actually talking about 4 different things:

    - number of code units
    - number of code points
    - number of user perceived characters
    - display width using a monospace font

 The same distinction applies for slicing, depending on use case.

 Related:
    - What normalization do D strings use. Both Linux and
      MacOS X use UTF-8, but the binary representation of non-ASCII
      file names is different.

There is no single normalization to fix on. D programs may be written for Linux only, for Mac-only or for both. IMO we should just provide ways to normalize strings. (std.uni has 'normalize' for starters).
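
As a small illustration of why this matters, the sketch below uses std.uni.normalize on the two canonically equivalent spellings of é. The variable names are arbitrary.

```d
import std.uni : normalize, NFC, NFD;

void main()
{
    string composed = "\u00E9";    // precomposed U+00E9
    string decomposed = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT

    // Plain == compares code units, so the two spellings differ.
    assert(composed != decomposed);

    // After normalizing both to the same form they compare equal.
    assert(normalize!NFC(decomposed) == composed);
    assert(normalize!NFD(composed) == decomposed);
    assert(normalize!NFC(composed) == normalize!NFC(decomposed));
}
```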
    - How do we handle sorting strings?

Unicode collation algorithm and provide ways to tweak the default one.
 The topic matter is complex, but not difficult (as in rocket science).
 If we really want to find a solution, we should form an expert group
 and stop talking until we read the latest Unicode specs.

Well, I did. You seem motivated, would you like to join the group?
 They are a
 moving target. Don't expect to ever be "done" with full Unicode
 support in D.

The 6.x standard line seems pretty stable to me. There is a level of support that is worth aiming for. After that, ROI drops steadily as the amount of work to specialize for each specific culture rises. At some point we can only talk about opening up ways to specialize.

D (or any library for that matter) won't ever have all possible tinkering that the Unicode standard permits. So I expect D to be "done" with Unicode one day simply by reaching a point of having all universally applicable stuff (and stated defaults) plus having a toolbox to craft your own versions of algorithms. This is the goal of new std.uni.

--
Dmitry Olshansky
Mar 18 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
19-Mar-2014 18:42, Marco Leise wrote:
 On Tue, 18 Mar 2014 23:18:16 +0400,
 Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 Related:
     - What normalization do D strings use. Both Linux and
       MacOS X use UTF-8, but the binary representation of non-ASCII
       file names is different.

There is no single normalization to fix on. D programs may be written for Linux only, for Mac-only or for both.

Normalizations C and D are the non lossy ones and as far as I understood equivalent. So I agree.

Right, the KC & KD ones are really all about fuzzy matching and searching.
 IMO we should just provide ways to normalize strings.
 (std.uni has 'normalize' for starters).

I wondered if anyone will actually read up on normalization prior to touching Unicode strings. I didn't, Andrei didn't and so on... So I expect strA == strB to be common enough, just like floatA == floatB until the news spread.

If that's of any comfort, other languages are even worse here. In C++ you are hopeless without ICU.
 Since == is supposed to
 compare for equivalence, could we hide all those details in
 an opaque string type and offer correct comparison functions?

Well, turns out the Unicode standard ties equivalence to normalization forms. In other words unless both your strings are normalized the same way there is really no point in trying to compare them. As for opaque type - we could have say String!NFC and String!NFD or some-such. It would then make sure the normalization is the right one.
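
A possible shape for such a type is sketched below. It is hypothetical: the name Normalized and its members are not an existing Phobos API, and the sketch simply normalizes once at construction so that == can compare code units.

```d
import std.uni : normalize, NormalizationForm, NFC, NFD;

// Hypothetical wrapper: the normalization form is part of the type,
// so values in different forms cannot be compared by accident.
struct Normalized(NormalizationForm form)
{
    private string data;

    // Normalizes once, on construction.
    this(string s) { data = normalize!form(s); }

    // Both sides are known to be in the same form, so comparing
    // code units is a valid canonical-equivalence test.
    bool opEquals(const Normalized rhs) const { return data == rhs.data; }

    string representation() const { return data; }
}

void main()
{
    auto a = Normalized!NFC("e\u0301");
    auto b = Normalized!NFC("\u00E9");
    assert(a == b);
    assert(a.representation == "\u00E9");

    // Different forms are different types and do not compare.
    static assert(!__traits(compiles, a == Normalized!NFD("\u00E9")));
}
```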
     - How do we handle sorting strings?

Unicode collation algorithm and provide ways to tweak the default one.

I wish I didn't look at the UCA. Jeeeez... But yeah, that's the way to go.

Needless to say I had a nice jaw-dropping moment when I realized what elephant I have missed with our std.uni (somewhere in the middle of the work).
 Big frameworks like Java added a Collate class with predefined
 constants for several languages. That's too much work for us.
 But the API doesn't need to preclude adding those.

Indeed some kind of Collator is in order. On the use side of things it's simply a functor that compares strings. The fact that it's full of tables and the like is well hidden. The only thing above that is caching preprocessed strings, which may be useful for databases and string indexes.
 The topic matter is complex, but not difficult (as in rocket science).
 If we really want to find a solution, we should form an expert group
 and stop talking until we read the latest Unicode specs.

Well, I did. You seem motivated, would you like to join the group?

Yes, I'd like to see a Unicode 6.x approved stamp on D. I didn't know that you already wrote all the simple algorithms for 2.064. Those would have been my candidates to work on, too. Is there anything that can be implemented in a day or two? :)

Cool, consider yourself enlisted :) I reckon word and line breaking algorithms are a piece of cake compared to UCA. Given the power toys of CodepointSet and toTrie it shouldn't be that hard to come up with a prototype. Then we just move precomputed versions of related tries to std/internal/ and that's it, ready for public consumption.
 D (or any library for that matter) won't ever have all possible
 tinkering that Unicode standard permits. So I expect D to be "done" with
 Unicode one day simply by reaching a point of having all universally
 applicable stuff (and stated defaults) plus having a toolbox to craft
 your own versions of algorithms. This is the goal of new std.uni.

Sorting strings is a very basic feature, but as I learned now also highly complex. I expected some kind of tables for download that would suffice, but the rules are pretty detailed. E.g. in German phonebook order, ä/ö/ü has the same order as ae/oe/ue.

This is tailoring, an awful thing that makes cultural differences what they are in Unicode ;) What we need first and foremost is a DUCET-based version (default Unicode collation element tables).

--
Dmitry Olshansky
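
To illustrate what a tailored comparison means in practice, here is a deliberately tiny toy. It is not the UCA and uses no DUCET data: phonebookKey and phonebookLess are made-up helpers that only expand lowercase ä/ö/ü, which is enough to show how the German phonebook rule changes the order compared to a plain code-unit comparison.

```d
import std.algorithm.sorting : sort;
import std.array : replace;

// Toy sort key: expands lowercase umlauts the way German phonebook
// order treats them. Uppercase letters, ß and everything else the
// real UCA handles are ignored here.
string phonebookKey(string s)
{
    return s.replace("\u00E4", "ae")
            .replace("\u00F6", "oe")
            .replace("\u00FC", "ue");
}

// Comparison predicate usable with std.algorithm.sorting.sort.
bool phonebookLess(string a, string b)
{
    return phonebookKey(a) < phonebookKey(b);
}

void main()
{
    auto names = ["Muller", "M\u00FCller", "Mufa"];
    sort!phonebookLess(names);
    // "Müller" sorts as "Mueller", i.e. before "Mufa" and "Muller".
    assert(names == ["M\u00FCller", "Mufa", "Muller"]);

    // A plain code-unit comparison puts "Müller" after "Mufa".
    assert(!("M\u00FCller" < "Mufa"));
}
```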
Mar 19 2014
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Tue, 18 Mar 2014 23:18:16 +0400,
Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 18-Mar-2014 10:21, Marco Leise wrote:
 The Unicode standard is too complex for general purpose
 algorithms to do useful things on D strings. We don't see that
 however, since our writing systems are sufficiently well
 supported.

 As an inspiration I'll leave a string here that contains
 combined characters in Korean
 (http://decodeunicode.org/hangul_syllables)
 and Latin as well as full width characters that span 2
 characters in e.g. Latin, Greek or Cyrillic scripts
 (http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):

 Halfwidth / Ｆｕｌｌｗｉｄｔｈ, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊

 (I used the "unfonts" package for the Hangul part)

 What I want to say is that for correct Unicode handling we
 should either use existing libraries or get a feeling for
 what the Unicode standard provides, then form use cases out of it.

There is ICU and very few other things, like support in OSX frameworks (NSString). Industry in general kinda sucks on this point but desperately wants to improve.
 For example when we talk about the length of a string we are
 actually talking about 4 different things:

    - number of code units
    - number of code points
    - number of user perceived characters
    - display width using a monospace font

 The same distinction applies for slicing, depending on use case.

 Related:
    - What normalization do D strings use. Both Linux and
      MacOS X use UTF-8, but the binary representation of non-ASCII
      file names is different.

There is no single normalization to fix on. D programs may be written for Linux only, for Mac-only or for both.

Normalizations C and D are the non lossy ones and as far as I understood equivalent. So I agree.
 IMO we should just provide ways to normalize strings.
 (std.uni has 'normalize' for starters).

I wondered if anyone will actually read up on normalization prior to touching Unicode strings. I didn't, Andrei didn't and so on... So I expect strA == strB to be common enough, just like floatA == floatB until the news spread.

Since == is supposed to compare for equivalence, could we hide all those details in an opaque string type and offer correct comparison functions?
    - How do we handle sorting strings?

Unicode collation algorithm and provide ways to tweak the default one.

I wish I didn't look at the UCA. Jeeeez... But yeah, that's the way to go. Big frameworks like Java added a Collate class with predefined constants for several languages. That's too much work for us. But the API doesn't need to preclude adding those.
 The topic matter is complex, but not difficult (as in rocket science).
 If we really want to find a solution, we should form an expert group
 and stop talking until we read the latest Unicode specs.

Well, I did. You seem motivated, would you like to join the group?

Yes, I'd like to see a Unicode 6.x approved stamp on D. I didn't know that you already wrote all the simple algorithms for 2.064. Those would have been my candidates to work on, too. Is there anything that can be implemented in a day or two? :)
 They are a
 moving target. Don't expect to ever be "done" with full Unicode
 support in D.

The 6.x standard line seems pretty stable to me. There is a point in providing support that is worth approaching. After that ROI is dropping steadily as the amount of work to specialize for each specific culture rises. At some point we can only talk about opening up ways to specialize.
 D (or any library for that matter) won't ever have all possible
 tinkering that Unicode standard permits. So I expect D to be "done" with
 Unicode one day simply by reaching a point of having all universally
 applicable stuff (and stated defaults) plus having a toolbox to craft
 your own versions of algorithms. This is the goal of new std.uni.

Sorting strings is a very basic feature, but as I learned now also highly complex. I expected some kind of tables for download that would suffice, but the rules are pretty detailed. E.g. in German phonebook order, ä/ö/ü has the same order as ae/oe/ue.

--
Marco
Mar 19 2014
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Thu, 20 Mar 2014 01:55:08 +0400,
Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 Well, turns out the Unicode standard ties equivalence to normalization
 forms. In other words unless both your strings are normalized the same
 way there is really no point in trying to compare them.

 As for opaque type - we could have say String!NFC and String!NFD or
 some-such. It would then make sure the normalization is the right one.

And I thought of going the slow route where normalized and unnormalized strings can coexist and be compared. No NFD or NFC, just UTF8 strings.

Pros:
+ Learning about normalization isn't needed to use strings correctly. And few people do that.
+ Strings don't need to be normalized. Every modification to data is bad, e.g. when said string is fed back to the source. Think about a file name on a file system where a different normalization is a different file.

Cons:
- Comparisons for already normalized strings are unnecessarily slow. Maybe the normalization form (NFC, NFD, mixed) could be stored alongside the string.
 Cool, consider yourself enlisted :)
 I reckon word and line breaking algorithms are piece of cake compared to
 UCA. Given the power toys of CodepointSet and toTrie it shouldn't be
 that hard to come up with prototype. Then we just move precomputed
 versions of related tries to std/internal/ and that's it, ready for
 public consumption.

Would a typical use case be to find the previous/next boundary given a code unit index? E.g. the cursor sits on a word and you want to jump to the start or end of it. Just iterating the words and lines might not be too useful.
 D (or any library for that matter) won't ever have all possible
 tinkering that Unicode standard permits. So I expect D to be "done" with
 Unicode one day simply by reaching a point of having all universally
 applicable stuff (and stated defaults) plus having a toolbox to craft
 your own versions of algorithms. This is the goal of new std.uni.

Sorting strings is a very basic feature, but as I learned now also highly complex. I expected some kind of tables for download that would suffice, but the rules are pretty detailed. E.g. in German phonebook order, ä/ö/ü has the same order as ae/oe/ue.

This is tailoring, an awful thing that makes cultural differences what they are in Unicode ;) What we need first and foremost is a DUCET-based version (default Unicode collation element tables).

Of course.

--
Marco
Mar 19 2014