digitalmars.D - Proposal for fixing dchar ranges

Steven Schveighoffer (35/35) Mar 10 2014 I proposed this inside the long "major performance problem with

Steven Schveighoffer (4/5) Mar 10 2014 An array of char even.
Dicebot (6/41) Mar 10 2014 It will break any code that slices stored char[] strings directly

Steven Schveighoffer (25/69) Mar 10 2014

Dicebot (7/13) Mar 10 2014 Broken as if in "you are not supposed to do it user code"? Yes.

Steven Schveighoffer (20/32) Mar 10 2014 If the idea to ensure the user cannot slice a code point was added, you ...

H. S. Teoh (20/52) Mar 10 2014 I like this idea. Special-casing char[] in templates was a bad idea. It

Steven Schveighoffer (9/18) Mar 10 2014 I agree that is a limitation of the proposal. It's more of a language-wi...

Boyd (8/43) Mar 10 2014 I personally love this idea, though I think it probably

Steven Schveighoffer (3/6) Mar 10 2014 What silent breaking changes?

Boyd (4/11) Mar 10 2014 Utf8 aware slicing for strings would be an issue.

Steven Schveighoffer (3/4) Mar 10 2014 I'm not proposing to add this.

Boyd (5/10) Mar 10 2014 Ok, then you just destroyed my sole hypothetical objection to

Brad Anderson (8/43) Mar 10 2014 Generally I think it's a good idea. Going a bit further you could

Steven Schveighoffer (5/7) Mar 10 2014 You're the second person to mention that, I was not planning on disablin...

John Colvin (3/12) Mar 10 2014 How is slicing any better than indexing?

Steven Schveighoffer (9/22) Mar 10 2014 Because one can slice out a multi-code-unit code point, one cannot acces...

John Colvin (9/34) Mar 10 2014 In order to be correct, both require exactly the same knowledge:

Steven Schveighoffer (12/27) Mar 10 2014 Using indexing, you simply cannot get the single code unit that represen...

John Colvin (7/38) Mar 10 2014 I think I understand your motivation now. Indexing never provides

Steven Schveighoffer (13/16) Mar 10 2014 What it would do is remove the confusion of is(typeof(r.front) !=

Johannes Pfau (10/22) Mar 10 2014 Unfortunately slicing by code units is probably the most important

Steven Schveighoffer (13/35) Mar 10 2014 Slicing can never be a code point based operation. It would be too slow ...

Steven Schveighoffer (12/15) Mar 10 2014 I said that wrong, of course it has meaning. What I mean is that if you ...

Brad Anderson (3/12) Mar 10 2014 Sorry, I misunderstood. That sounds reasonable.
Dicebot (4/13) Mar 11 2014 It is unacceptable to have slicing which is not O(1) for basic

Steven Schveighoffer (3/15) Mar 11 2014 It would be O(1), work just like it does today.

Dicebot (5/7) Mar 11 2014 Today it works by allowing arbitrary index and not checking if

Steven Schveighoffer (13/20) Mar 11 2014 Well, a valid improvement would be to throw an exception when the slice ...

Chris Williams (6/9) Mar 11 2014 If the indexes put into the slice aren't by code-point, but

Steven Schveighoffer (34/41) Mar 11 2014 No, where we are today is that in some cases, the language treats a char...

Johannes Pfau (13/32) Mar 11 2014 Yes, you can workaround the count problem, but then it is not

Steven Schveighoffer (9/20) Mar 11 2014 I look at it a different way -- indexes are increasing, just not
monarch_dodra (10/22) Mar 12 2014 I think it is import to remember that in terms of

monarch_dodra (24/28) Mar 12 2014 I want to mention something I've had trouble with recently, that

John Colvin (9/44) Mar 10 2014 I know warnings are disliked, but couldn't we make the slicing

Steven Schveighoffer (15/59) Mar 10 2014

Chris Williams (24/26) Mar 10 2014 If I was writing something like a chat or terminal window, I

Walter Bright (4/8) Mar 10 2014 Proposals to make a string class for D have come up many times. I have a...

Steven Schveighoffer (5/17) Mar 10 2014 I wholly agree, they should be an array type. But what they are now is
Johannes Pfau (14/24) Mar 10 2014 Question: which type T doesn't have slicing, has an ElementType of

Artem Tarasov (3/8) Mar 10 2014 In addition, hasLength!T == false, which totally freaked me out
H. S. Teoh (38/66) Mar 10 2014 I'm on the fence about this one. The nice thing about strings being an

John Colvin (3/68) Mar 10 2014 You started off on the fence, but you seem pretty convinced by

Steven Schveighoffer (10/22) Mar 10 2014 BTW, this escaped my view the first time reading your post, but I am NOT...

Walter Bright (3/5) Mar 10 2014 Right, but here I used the term "class" to be more generic as in being a...

Steven Schveighoffer (7/12) Mar 10 2014 Then I don't understand your point. What strings are already is a

Walter Bright (5/7) Mar 10 2014 With no enforcement, and that is by design.

John Colvin (6/13) Mar 10 2014 I don't see how this proposal would limit that access. The raw
Steven Schveighoffer (26/33) Mar 10 2014 The functionality added via phobos can hardly be considered extraneous. ...

Walter Bright (4/9) Mar 10 2014 You divide the D world into two camps - those that use 'struct string', ...

Steven Schveighoffer (13/25) Mar 10 2014 Really? It's not that divisive. However, the situation is certainly bett...

Walter Bright (6/14) Mar 10 2014 I deserve that criticism. On the other hand, I've pretty much given up o...

bearophile (5/8) Mar 10 2014 There are still some breaking changed that I'd like to perform in

Meta (4/13) Mar 10 2014 That damnable comma operator is one of the worst things that was

H. S. Teoh (8/20) Mar 11 2014 I've always been of the opinion that the comma operator in a for loop
bearophile (5/8) Mar 11 2014 The place for the discussion about the comma operator:

John Colvin (8/18) Mar 11 2014 I would go so far as to say this is a good thing, as long as the

Ary Borenszweig (5/16) Mar 12 2014 You can also look at Erlang, where strings are just lists of numbers.

Andrei Alexandrescu (4/22) Mar 12 2014 Erlang's mistake was different from what you believe was D's mistake.

Ary Borenszweig (2/27) Mar 12 2014 What's D's mistake then?

Andrei Alexandrescu (4/5) Mar 12 2014 I don't think we made a mistake with D's strings. They could have been

John Colvin (13/48) Mar 10 2014 just to check I understand this fully:

Steven Schveighoffer (25/77) Mar 10 2014

John Colvin (3/82) Mar 11 2014 Awesome, let's do this :)

Kagamin (4/4) Mar 11 2014 Automatic decoding by default itself is a WTF factor. The problem
Marco Leise (37/37) Mar 17 2014 The Unicode standard is too complex for general purpose

Dmitry Olshansky (21/54) Mar 18 2014 There is ICU and very few other things, like support in OSX frameworks

Marco Leise (32/97) Mar 19 2014 ...ｄｔｈ, ᆨᆨᆨᆚᆚ...

Dmitry Olshansky (28/76) Mar 19 2014 If that of any comfort other languages are even worse here. In C++ your

Marco Leise (26/55) Mar 19 2014 And I thought of going the slow route where normalized and

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

I proposed this inside the long "major performance problem with
std.array.front," I've also proposed it before, a long time ago.

But seems to be getting no attention buried in that thread, not even
negative attention :)

An idea to fix the whole problems I see with char[] being treated
specially by phobos: introduce an actual string type, with char[] as
backing, that is a dchar range, that actually dictates the rules we want.
Then, make the compiler use this type for literals.

e.g.:

struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
}

Then, a char[] array is simply an array of char[].

points:

1. No more issues with foreach(c; "cassé"), it iterates via dchar
2. No more issues with "cassé"[4], it is a static compiler error.
3. No more awkward ASCII manipulation using ubyte[].
4. No more phobos schizophrenia saying char[] is not an array.
5. No more special casing char[] array templates to fool the compiler.
6. Any other special rules we come up with can be dictated by the library,
and not ignored by the compiler.

Note, std.algorithm.copy(string1, mutablestring) will still decode/encode,
but it's more explicit. It's EXPLICITLY a dchar range. Use
std.algorithm.copy(string1.representation, mutablestring.representation)
will avoid the issues.

I imagine only code that is currently UTF ignorant will break, and that
code is easily 'fixed' by adding the 'representation' qualifier.

-Steve

Mar 10 2014
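
Editor's sketch, not part of the original post: one way the proposed wrapper might look in today's D. The name `Utf8String` and every member other than `representation` are assumptions made for illustration; the post only outlines the idea and elides the range primitives. The constructor here takes immutable(char)[] because a mutable char[] does not implicitly convert to immutable.

```d
import std.range.primitives : ElementType, isForwardRange, walkLength;
import std.utf : decode, stride;

// Hypothetical stand-in for the proposed `string` struct; the name
// `Utf8String` and every member except `representation` are assumptions.
struct Utf8String
{
    immutable(char)[] representation; // raw UTF-8 code units

    // Takes immutable data: a mutable char[] does not implicitly
    // convert to immutable(char)[].
    this(immutable(char)[] data) { representation = data; }

    // dchar range primitives: iteration is by code point.
    @property bool empty() const { return representation.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(representation, i); // throws UTFException on bad UTF-8
    }

    void popFront()
    {
        // stride() gives the number of code units in the leading code point.
        representation = representation[stride(representation, 0) .. $];
    }

    @property Utf8String save() { return this; }
}

static assert(isForwardRange!Utf8String);
static assert(is(ElementType!Utf8String == dchar));

void main()
{
    auto s = Utf8String("caf\u00E9");          // U+00E9 takes two code units
    assert(s.representation.length == 5);      // counted in code units
    assert(s.walkLength == 4);                 // counted in code points
    static assert(!__traits(compiles, s[0]));  // no indexing on the wrapper
    assert(s.representation[0] == 'c');        // raw access stays explicit
}
```

Because the struct defines no opIndex and no alias this, indexing it is rejected at compile time (point 2 of the list), while `.representation` still gives the plain array.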

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 09:35:44 -0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 Then, a char[] array is simply an array of char[].

An array of char even.

-Steve

Mar 10 2014

"Dicebot" <public dicebot.lv> writes:

On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
 I proposed this inside the long "major performance problem with 
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not 
 even negative attention :)

 An idea to fix the whole problems I see with char[] being 
 treated specially by phobos: introduce an actual string type, 
 with char[] as backing, that is a dchar range, that actually 
 dictates the rules we want. Then, make the compiler use this 
 type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via 
 dchar
 2. No more issues with "cassé"[4], it is a static compiler 
 error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the 
 compiler.
 6. Any other special rules we come up with can be dictated by 
 the library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still 
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar 
 range. Use std.algorithm.copy(string1.representation, 
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, 
 and that code is easily 'fixed' by adding the 'representation' 
 qualifier.

 -Steve

It will break any code that slices stored char[] strings directly 
which may or may not be breaking UTF depending on how indices are 
calculated. Also adding one more runtime dependency into language 
but there are so many that it probably does not matter.

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 10:48:26 -0400, Dicebot <public dicebot.lv> wrote:

 On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:
 I proposed this inside the long "major performance problem with
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not even
 negative attention :)

 An idea to fix the whole problems I see with char[] being treated
 specially by phobos: introduce an actual string type, with char[] as
 backing, that is a dchar range, that actually dictates the rules we
 want. Then, make the compiler use this type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via dchar
 2. No more issues with "cassé"[4], it is a static compiler error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the compiler.
 6. Any other special rules we come up with can be dictated by the
 library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar range.
 Use std.algorithm.copy(string1.representation,
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, and that
 code is easily 'fixed' by adding the 'representation' qualifier.

 It will break any code that slices stored char[] strings directly which
 may or may not be breaking UTF depending on how indices are calculated.

That is already broken. What I'm looking to do is remove the cruft and
"WTF" factor of the current state of affairs (an array that's not an
array).

Originally (in that long ago proposal) I had proposed to check for and
disallow invalid slicing during runtime. In fact, it could be added if
desired with the type defined by the library.

 Also adding one more runtime dependency into language but there are so
 many that it probably does not matter.

alias string = immutable(char)[];

There isn't much extra dependency one must add to revert to the original
behavior. In fact, one nice thing about this proposal is the compiler
changes can be done and tested before any real meddling with the string
type is done.

-Steve

Mar 10 2014

"Dicebot" <public dicebot.lv> writes:

On Monday, 10 March 2014 at 15:01:54 UTC, Steven Schveighoffer 
wrote:
 That is already broken. What I'm looking to do is remove the 
 cruft and "WTF" factor of the current state of affairs (an 
 array that's not an array).

 Originally (in that long ago proposal) I had proposed to check 
 for and disallow invalid slicing during runtime. In fact, it 
 could be added if desired with the type defined by the library.

Broken as in "you are not supposed to do it in user code"? Yes. 
Broken as in "does the wrong thing" - no. If your index is 
properly calculated, it is no different from casting to ubyte[] 
and then slicing. I am pretty sure even Phobos does it here and 
there.

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 11:11:23 -0400, Dicebot <public dicebot.lv> wrote:

 On Monday, 10 March 2014 at 15:01:54 UTC, Steven Schveighoffer wrote:
 That is already broken. What I'm looking to do is remove the cruft and  
 "WTF" factor of the current state of affairs (an array that's not an  
 array).

 Originally (in that long ago proposal) I had proposed to check for and  
 disallow invalid slicing during runtime. In fact, it could be added if  
 desired with the type defined by the library.

 Broken as in "you are not supposed to do it in user code"? Yes. Broken  
 as in "does the wrong thing" - no. If your index is properly calculated,  
 it is no different from casting to ubyte[] and then slicing. I am pretty  
 sure even Phobos does it here and there.

If the idea to ensure the user cannot slice a code point was added, you  
would still be able to slice via str.representation[a..b], or even  
str.ptr[a..b] if you were so sure of the length you didn't want it to be  
checked ;)

The idea behind the proposal is to make it fully backwards compatible with  
existing code, except for randomly accessing a char, and probably .length.  
Slicing would still work as it does now, but could be adjusted later.

It will break existing code. To fix those breaks, you would need to use  
the char[] array directly via the representation member, or rethink your  
code to be UTF-correct. Basically, instead of pretending an array isn't an  
array, create a new mostly-compatible type that behaves as we want it to  
behave in all circumstances, not just when you use phobos algorithms.

The breaks may be trivial to work around, and might seem annoying.  
However, they may be actual UTF bugs that make your code more correct when  
you fix them.

The biggest problem right now is the lack of the ability to implicitly  
cast to tail-const with a custom struct. We can keep an alias-this link  
for those cases until we can fix that in the compiler.

-Steve

Mar 10 2014

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Mon, Mar 10, 2014 at 09:35:44AM -0400, Steven Schveighoffer wrote:
[...]
 An idea to fix the whole problems I see with char[] being treated
 specially by phobos: introduce an actual string type, with char[] as
 backing, that is a dchar range, that actually dictates the rules we
 want. Then, make the compiler use this type for literals.
 
 e.g.:
 
 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }
 
 Then, a char[] array is simply an array of char[].
 
 points:
 
 1. No more issues with foreach(c; "cassé"), it iterates via dchar
 2. No more issues with "cassé"[4], it is a static compiler error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the compiler.
 6. Any other special rules we come up with can be dictated by the
 library, and not ignored by the compiler.

I like this idea. Special-casing char[] in templates was a bad idea. It
makes Phobos code needlessly complex, and the inconsistent treatment of
char[] sometimes as an array of char and sometimes not causes silly
issues like foreach defaulting to char but range iteration defaulting to
dchar. Enclosing it in a struct means we can enforce string rules
separately from the fact that it's a char array.


 Note, std.algorithm.copy(string1, mutablestring) will still
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar
 range. Use std.algorithm.copy(string1.representation,
 mutablestring.representation) will avoid the issues.
 
 I imagine only code that is currently UTF ignorant will break, and
 that code is easily 'fixed' by adding the 'representation'
 qualifier.

[...]

The only concern I have is the current use of char[] and const(char)[]
as mutable strings, and the current implicit conversion from string to
const(char)[]. We would need similar wrappers for char[] and
const(char)[], and string and mutablestring must be implicitly
convertible to conststring, otherwise a LOT of existing code will break
in a major way. Plus, these wrappers should also expose the same dchar
range API with .representation giving a way to get at the raw code
units.


T

-- 
It is the quality rather than the quantity that matters. -- Lucius Annaeus
Seneca

Mar 10 2014
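
Editor's sketch, not part of the thread: the inconsistency described in the post above (foreach defaulting to char while range iteration defaults to dchar) can be observed directly. This assumes a Phobos version in which narrow strings are still auto-decoded by the range primitives.

```d
import std.range.primitives : ElementType, front, walkLength;
import std.traits : Unqual;

void main()
{
    string s = "\u00E9"; // one code point (U+00E9), two UTF-8 code units

    // Language view: the string is an array of code units.
    static assert(is(Unqual!(typeof(s[0])) == char));
    size_t units;
    foreach (c; s)
    {
        static assert(is(Unqual!(typeof(c)) == char));
        ++units;
    }
    assert(units == 2 && s.length == 2);

    // Phobos range view: the same value is a range of decoded dchar.
    static assert(is(ElementType!string == dchar));
    static assert(is(typeof(s.front) == dchar));
    assert(s.walkLength == 1);

    // Asking foreach for dchar makes it decode as well.
    size_t points;
    foreach (dchar c; s)
        ++points;
    assert(points == 1);
}
```

The same variable therefore reports two different element types and two different lengths depending on whether it is treated as an array or as a range, which is the behaviour the proposed struct is meant to make explicit.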

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 10:54:50 -0400, H. S. Teoh <hsteoh quickfur.ath.cx>  
wrote:


 The only concern I have is the current use of char[] and const(char)[]
 as mutable strings, and the current implicit conversion from string to
 const(char)[]. We would need similar wrappers for char[] and
 const(char)[], and string and mutablestring must be implicitly
 convertible to conststring, otherwise a LOT of existing code will break
 in a major way.

I agree that is a limitation of the proposal. It's more of a language-wide  
problem that one cannot make a struct that can be tail-const-ified.

One idea to begin with is to weakly bind to immutable(char)[] using alias  
this. That way, existing code devolves to current behavior. Then you pick  
off the primitives you want by defining them in the struct itself.

 Plus, these wrappers should also expose the same dchar
 range API with .representation giving a way to get at the raw code
 units.

It already does that, representation is a public member.

-Steve

Mar 10 2014

"Boyd" <gaboonviper gmx.net> writes:

I personally love this idea, though I think it probably 
introduces too many silent breaking changes for it to be 
universally acceptable by D users.

Perhaps naming it 'String', and deprecating 'string' would make 
it more acceptable?

------------
On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
 I proposed this inside the long "major performance problem with 
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not 
 even negative attention :)

 An idea to fix the whole problems I see with char[] being 
 treated specially by phobos: introduce an actual string type, 
 with char[] as backing, that is a dchar range, that actually 
 dictates the rules we want. Then, make the compiler use this 
 type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via 
 dchar
 2. No more issues with "cassé"[4], it is a static compiler 
 error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the 
 compiler.
 6. Any other special rules we come up with can be dictated by 
 the library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still 
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar 
 range. Use std.algorithm.copy(string1.representation, 
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, 
 and that code is easily 'fixed' by adding the 'representation' 
 qualifier.

 -Steve

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 11:11:50 -0400, Boyd <gaboonviper gmx.net> wrote:

 I personally love this idea, though I think it probably introduces too  
 many silent breaking changes for it to be universally acceptable by D  
 users.

What silent breaking changes?

-Steve

Mar 10 2014

"Boyd" <gaboonviper gmx.net> writes:

Utf8 aware slicing for strings would be an issue.

----------
On Monday, 10 March 2014 at 15:13:26 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 11:11:50 -0400, Boyd <gaboonviper gmx.net> 
 wrote:

 I personally love this idea, though I think it probably 
 introduces too many silent breaking changes for it to be 
 universally acceptable by D users.

 What silent breaking changes?

 -Steve

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 11:20:49 -0400, Boyd <gaboonviper gmx.net> wrote:

 Utf8 aware slicing for strings would be an issue.

I'm not proposing to add this.

-Steve

Mar 10 2014

"Boyd" <gaboonviper gmx.net> writes:

Ok, then you just destroyed my sole hypothetical objection to 
this.
-----------
On Monday, 10 March 2014 at 15:22:41 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 11:20:49 -0400, Boyd <gaboonviper gmx.net> 
 wrote:

 Utf8 aware slicing for strings would be an issue.

 I'm not proposing to add this.

 -Steve

Mar 10 2014

"Brad Anderson" <eco gnuk.net> writes:

On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
 I proposed this inside the long "major performance problem with 
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not 
 even negative attention :)

 An idea to fix the whole problems I see with char[] being 
 treated specially by phobos: introduce an actual string type, 
 with char[] as backing, that is a dchar range, that actually 
 dictates the rules we want. Then, make the compiler use this 
 type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via 
 dchar
 2. No more issues with "cassé"[4], it is a static compiler 
 error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the 
 compiler.
 6. Any other special rules we come up with can be dictated by 
 the library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still 
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar 
 range. Use std.algorithm.copy(string1.representation, 
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, 
 and that code is easily 'fixed' by adding the 'representation' 
 qualifier.

 -Steve

Generally I think it's a good idea. Going a bit further you could 
also enable Short String Optimization but you'd have to 
encapsulate the backing array.

It seems like this would be an even bigger breaking change than 
Walter's proposal though (right or wrong, slicing strings is very 
common).

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:

 It seems like this would be an even bigger breaking change than Walter's  
 proposal though (right or wrong, slicing strings is very common).

You're the second person to mention that, I was not planning on disabling  
string slicing. Just random access to individual chars, and probably  
.length.

-Steve

Mar 10 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson 
 <eco gnuk.net> wrote:

 It seems like this would be an even bigger breaking change 
 than Walter's proposal though (right or wrong, slicing strings 
 is very common).

 You're the second person to mention that, I was not planning on 
 disabling string slicing. Just random access to individual 
 chars, and probably .length.

 -Steve

How is slicing any better than indexing?

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 14:01:45 -0400, John Colvin  
<john.loughran.colvin gmail.com> wrote:

 On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer wrote:
 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:

 It seems like this would be an even bigger breaking change than  
 Walter's proposal though (right or wrong, slicing strings is very  
 common).

 You're the second person to mention that, I was not planning on  
 disabling string slicing. Just random access to individual chars, and  
 probably .length.

 -Steve

 How is slicing any better than indexing?

Because one can slice out a multi-code-unit code point, one cannot access  
it via index. Strings would be horribly crippled without slicing. Without  
indexing, they are fine.

A possibility is to allow index, but actually decode the code point at  
that index (error on invalid index). That might actually be the correct  
mechanism.

-Steve

Mar 10 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 10 March 2014 at 18:09:51 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 14:01:45 -0400, John Colvin 
 <john.loughran.colvin gmail.com> wrote:

 On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer 
 wrote:
 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson 
 <eco gnuk.net> wrote:

 It seems like this would be an even bigger breaking change 
 than Walter's proposal though (right or wrong, slicing 
 strings is very common).

 You're the second person to mention that, I was not planning 
 on disabling string slicing. Just random access to individual 
 chars, and probably .length.

 -Steve

 How is slicing any better than indexing?

 Because one can slice out a multi-code-unit code point, one 
 cannot access it via index. Strings would be horribly crippled 
 without slicing. Without indexing, they are fine.

 A possibility is to allow index, but actually decode the code 
 point at that index (error on invalid index). That might 
 actually be the correct mechanism.

 -Steve

In order to be correct, both require exactly the same knowledge: 
The beginning of a code point, followed by the end of a code 
point. In the indexing case they just happen to be the same 
code-point and happen to be one code unit from each other. I 
don't see how one is any more or less error-prone or 
fundamentally wrong than the other.

I do understand that slicing is more important however.

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 15:30:00 -0400, John Colvin  
<john.loughran.colvin gmail.com> wrote:

 On Monday, 10 March 2014 at 18:09:51 UTC, Steven Schveighoffer wrote:
 Because one can slice out a multi-code-unit code point, one cannot  
 access it via index. Strings would be horribly crippled without  
 slicing. Without indexing, they are fine.

 A possibility is to allow index, but actually decode the code point at  
 that index (error on invalid index). That might actually be the correct  
 mechanism.

 In order to be correct, both require exactly the same knowledge: The  
 beginning of a code point, followed by the end of a code point. In the  
 indexing case they just happen to be the same code-point and happen to  
 be one code unit from each other. I don't see how one is any more or  
 less error-prone or fundamentally wrong than the other.

Using indexing, you simply cannot get the single code unit that represents  
a multi-code-unit code point. It doesn't fit in a char. It's guaranteed to  
fail, whereas slicing will give you access to all the data in the  
string.

Now, with indexing actually decoding a code point, one can alias a[i] to  
a[i..$].front(), which means decode the first code point you come to at  
index i. This means indexing is slow(er), and returns a dchar. I think as  
a first step, that might be too much to add silently. I'd rather break it  
first, then add it back later.

-Steve

Mar 10 2014
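
Editor's sketch, not part of the thread: what "indexing that decodes" could mean, written with existing Phobos calls. The helper name `codePointAt` is hypothetical; the post only describes a[i] behaving like a[i..$].front.

```d
import std.exception : assertThrown;
import std.range.primitives : front;
import std.utf : decode, UTFException;

// Hypothetical helper name; it mirrors the idea of a[i] meaning a[i .. $].front.
dchar codePointAt(string s, size_t i)
{
    return decode(s, i); // advances the local copy of i past the code point
}

void main()
{
    string s = "\u00E9a"; // code units: 0xC3 0xA9 'a'

    assert(codePointAt(s, 0) == '\u00E9');
    assert(codePointAt(s, 0) == s[0 .. $].front); // same result, spelled as in the post
    assert(codePointAt(s, 2) == 'a');

    // Index 1 lands in the middle of U+00E9, so decoding fails.
    assertThrown!UTFException(codePointAt(s, 1));

    // The result is a dchar, unlike today's s[i], which is a single code unit.
    static assert(is(typeof(codePointAt(s, 0)) == dchar));
}
```

The index is still a code unit offset, so the operation stays O(1) to locate, but it costs a decode and can fail, which matches the "slow(er), and returns a dchar" caveat in the post.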

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 10 March 2014 at 20:00:07 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 15:30:00 -0400, John Colvin 
 <john.loughran.colvin gmail.com> wrote:

 On Monday, 10 March 2014 at 18:09:51 UTC, Steven Schveighoffer 
 wrote:
 Because one can slice out a multi-code-unit code point, one 
 cannot access it via index. Strings would be horribly 
 crippled without slicing. Without indexing, they are fine.

 A possibility is to allow index, but actually decode the code 
 point at that index (error on invalid index). That might 
 actually be the correct mechanism.

 In order to be correct, both require exactly the same 
 knowledge: The beginning of a code point, followed by the end 
 of a code point. In the indexing case they just happen to be 
 the same code-point and happen to be one code unit from each 
 other. I don't see how one is any more or less error-prone or 
 fundamentally wrong than the other.

 Using indexing, you simply cannot get the single code unit that 
 represents a multi-code-unit code point. It doesn't fit in a 
 char. It's guaranteed to fail, whereas slicing will give you 
 access to all the data in the string.

I think I understand your motivation now. Indexing never provides 
anything that slicing doesn't do more generally.

 Now, with indexing actually decoding a code point, one can 
 alias a[i] to a[i..$].front(), which means decode the first 
 code point you come to at index i. This means indexing is 
 slow(er), and returns a dchar. I think as a first step, that 
 might be too much to add silently. I'd rather break it first, 
 then add it back later.

 -Steve

Of course that i has to be at the beginning of a code-point. 
Doesn't seem like that useful a feature and potentially very 
confusing for people who naively expect normal indexing.

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 16:54:34 -0400, John Colvin   
<john.loughran.colvin gmail.com> wrote:
 Of course that i has to be at the beginning of a code-point. Doesn't  
 seem like that useful a feature and potentially very confusing for  
 people who naively expect normal indexing.

What it would do is remove the confusion of is(typeof(r.front) !=   
typeof(r[0]))

Naivety is to be expected when you have made your C-derived language's  
default string type an encoded UTF8 array called char[]. It doesn't  
magically make D programs UTF aware.

I would suggest that a lofty goal is for the string type to be completely  
safe, and efficient, and only allow raw access via the .representation  
member. But I don't think, given the current code base,
that we can achieve that in one proposal. It has to be gradual. This is a  
first step.

-Steve

Mar 10 2014

Johannes Pfau <nospam example.com> writes:

On Mon, 10 Mar 2014 13:55:00 -0400,
"Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net>
 wrote:
 
 It seems like this would be an even bigger breaking change than
 Walter's proposal though (right or wrong, slicing strings is very
 common).

 
 You're the second person to mention that, I was not planning on
 disabling string slicing. Just random access to individual chars, and
 probably .length.
 
 -Steve

Unfortunately slicing by code units is probably the most important
safety issue with the current implementation: As was mentioned in the
other thread:

size_t index = str.countUntil('a');
auto slice = str[0..index];

This can be a safety and security issue. (I realize that this would
break lots of code so I'm not sure if we should/can fix it. But I think
this was the most important problem mentioned in the other thread.)

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 14:54:22 -0400, Johannes Pfau <nospam example.com>  
wrote:

 On Mon, 10 Mar 2014 13:55:00 -0400,
 "Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net>
 wrote:

 It seems like this would be an even bigger breaking change than
 Walter's proposal though (right or wrong, slicing strings is very
 common).

 You're the second person to mention that, I was not planning on
 disabling string slicing. Just random access to individual chars, and
 probably .length.

 -Steve

 Unfortunately slicing by code units is probably the most important
 safety issue with the current implementation: As was mentioned in the
 other thread:

 size_t index = str.countUntil('a');
 auto slice = str[0..index];

 This can be a safety and security issue. (I realize that this would
 break lots of code so I'm not sure if we should/can fix it. But I think
 this was the most important problem mentioned in the other thread.)

Slicing can never be a code point based operation. It would be too slow  
(read linear complexity). What needs to be broken is the expectation that  
an index is the number of code points or characters in a string. Think of  
an index as a position that has no real meaning except they are ordered in  
the stream. Like a set of ordered numbers, not necessarily consecutive.  
The index 4 may not exist, while 5 does.

At this point, my proposal does not fix that particular problem, but I  
don't think there's any way to fix that "problem" except to train the user  
who wrote it not to do that. However, it does not leave us in a worse  
position.

-Steve

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 16:06:25 -0400, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:


 Think of an index as a position that has no real meaning except they are  
 ordered in the stream. Like a set of ordered numbers, not necessarily  
 consecutive. The index 4 may not exist, while 5 does.

I said that wrong, of course it has meaning. What I mean is that if you  
have two positions, the ordering will indicate where the  
characters/graphemes/code points occur in the stream, but their value will  
not be indicative of how far they are apart in terms of  
characters/graphemes/code points.

In other words, if I have two characters, at position p1 and p2, then

p1 > p2 => p1 comes later in the string than p2
p1 == p2 => p1 and p2 refer to the same character
p1 - p2 => not defined to any particular value.
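
To illustrate the ordering argument, here is a small sketch, assuming UTF-8 input and using std.utf.stride to step over whole code points. The function name codePointStarts is made up for this example.

```d
import std.utf : stride;

/// Hypothetical helper: every code-unit index at which a code point starts.
size_t[] codePointStarts(string s)
{
    size_t[] starts;
    for (size_t i = 0; i < s.length; i += stride(s, i))
        starts ~= i;
    return starts;
}

void main()
{
    // 'a' (1 unit), U+00E9 (2 units), U+20AC (3 units), 'b' (1 unit)
    string s = "a\u00E9\u20ACb";
    size_t[] expected = [0, 1, 3, 6];
    auto p = codePointStarts(s);
    assert(p == expected);

    // Positions are ordered: 'b' (p[3]) comes later than U+00E9 (p[1]) ...
    assert(p[3] > p[1]);
    // ... but the difference (5) is not the distance in code points (2).
    assert(p[3] - p[1] == 5);
}
```

Indexes 2, 4 and 5 never appear in the result: they exist as code units but not as valid positions.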

-Steve

Mar 10 2014

"Brad Anderson" <eco gnuk.net> writes:

On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson 
 <eco gnuk.net> wrote:

 It seems like this would be an even bigger breaking change 
 than Walter's proposal though (right or wrong, slicing strings 
 is very common).

 You're the second person to mention that, I was not planning on 
 disabling string slicing. Just random access to individual 
 chars, and probably .length.

 -Steve

Sorry, I misunderstood. That sounds reasonable.

Mar 10 2014

"Dicebot" <public dicebot.lv> writes:

On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson 
 <eco gnuk.net> wrote:

 It seems like this would be an even bigger breaking change 
 than Walter's proposal though (right or wrong, slicing strings 
 is very common).

 You're the second person to mention that, I was not planning on 
 disabling string slicing. Just random access to individual 
 chars, and probably .length.

 -Steve

It is unacceptable to have slicing which is not O(1) for basic 
types.

Mar 11 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 11 Mar 2014 09:11:22 -0400, Dicebot <public dicebot.lv> wrote:

 On Monday, 10 March 2014 at 17:54:49 UTC, Steven Schveighoffer wrote:
 On Mon, 10 Mar 2014 13:06:08 -0400, Brad Anderson <eco gnuk.net> wrote:

 It seems like this would be an even bigger breaking change than  
 Walter's proposal though (right or wrong, slicing strings is very  
 common).

 You're the second person to mention that, I was not planning on  
 disabling string slicing. Just random access to individual chars, and  
 probably .length.

 It is unacceptable to have slicing which is not O(1) for basic types.

It would be O(1), work just like it does today.

-Steve

Mar 11 2014

"Dicebot" <public dicebot.lv> writes:

On Tuesday, 11 March 2014 at 14:04:38 UTC, Steven Schveighoffer 
wrote:
 It would be O(1), work just like it does today.

 -Steve

Today it works by allowing arbitrary index and not checking if 
resulting slice is valid UTF-8. Anything that implies decoding is 
O(n). What exactly do you have in mind for this?

Mar 11 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 11 Mar 2014 10:06:47 -0400, Dicebot <public dicebot.lv> wrote:

 On Tuesday, 11 March 2014 at 14:04:38 UTC, Steven Schveighoffer wrote:
 It would be O(1), work just like it does today.

 -Steve

 Today it works by allowing arbitrary index and not checking if resulting  
 slice is valid UTF-8. Anything that implies decoding is O(n). What  
 exactly do you have in mind for this?

Well, a valid improvement would be to throw an exception when the slice  
didn't start/end on a valid code point. This is easily checkable in O(1)  
time, but I wouldn't recommend it to begin with, it may have huge  
performance issues. Typically, one does not arbitrarily slice up via some  
specific value, they use a function to get an index, and they don't care  
what the index value actually is.

Alternatively, it could be done via assert, to disable it during release  
mode. This might be acceptable.

But I would never expect any kind of indexing or slicing to use "number of  
code points", which clearly requires O(n) decoding to determine its  
position. That would be disastrous.
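
A rough sketch of the O(1) boundary check mentioned above, assuming UTF-8 and relying only on the fact that continuation bytes have the bit pattern 10xxxxxx. Both function names are hypothetical. The check only looks at the two slice boundaries, so it does not validate the contents of the slice.

```d
/// Hypothetical helper: true if `i` does not fall inside a multi-unit sequence.
bool isCodePointBoundary(const(char)[] s, size_t i)
{
    // UTF-8 continuation bytes look like 10xxxxxx; the end of the string
    // always counts as a boundary.
    return i >= s.length || (s[i] & 0xC0) != 0x80;
}

/// Hypothetical checked slice; the asserts are compiled out with -release.
inout(char)[] checkedSlice(inout(char)[] s, size_t lo, size_t hi)
{
    assert(isCodePointBoundary(s, lo), "slice starts inside a code point");
    assert(isCodePointBoundary(s, hi), "slice ends inside a code point");
    return s[lo .. hi];
}

void main()
{
    string s = "cass\u00E9!";          // U+00E9 occupies indexes 4 and 5
    assert(isCodePointBoundary(s, 4));
    assert(!isCodePointBoundary(s, 5));
    assert(checkedSlice(s, 0, 4) == "cass");
    assert(checkedSlice(s, 4, 6) == "\u00E9");
}
```

Using assert rather than an exception matches the second suggestion in the post: the cost disappears in release builds.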

-Steve

Mar 11 2014

"Chris Williams" <yoreanon-chrisw yahoo.co.jp> writes:

On Tuesday, 11 March 2014 at 14:16:31 UTC, Steven Schveighoffer 
wrote:
 But I would never expect any kind of indexing or slicing to use 
 "number of code points", which clearly requires O(n) decoding 
 to determine its position. That would be disastrous.

If the indexes put into the slice aren't by code-point, but 
people need to use proper helper functions to convert a 
code-point into an index, then we're basically back to where we 
are today.

Mar 11 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 11 Mar 2014 13:18:46 -0400, Chris Williams  
<yoreanon-chrisw yahoo.co.jp> wrote:

 On Tuesday, 11 March 2014 at 14:16:31 UTC, Steven Schveighoffer wrote:
 But I would never expect any kind of indexing or slicing to use "number  
 of code points", which clearly requires O(n) decoding to determine its  
 position. That would be disastrous.

 If the indexes put into the slice aren't by code-point, but people need  
 to use proper helper functions to convert a code-point into an index,  
 then we're basically back to where we are today.

No, where we are today is that in some cases, the language treats a char[]  
as an array of char, in other cases, it treats a char[] as a  
bi-directional dchar range.

What I'm proposing is we have a type that defines "This is what a string  
looks like," and it is consistent across all uses of the string, instead  
of the schizophrenic view we have now. I would also point out that quite a  
bit of deception and nonsense is needed to maintain that view, including  
things like assert(!hasLength!(char[]) && __traits(compiles, { char[] x;  
int y = x.length;})). The documentation for hasLength says "Tests if a  
given range has the length attribute," which is clearly a lie.
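
The contradiction can be checked at compile time. This is a sketch assuming a Phobos version where narrow strings are still auto-decoded; size_t is used instead of int for the length, since the int form from the post would not compile on 64-bit targets.

```d
import std.range.primitives : hasLength;

// The range trait claims char[] has no length ...
static assert(!hasLength!(char[]));
// ... while the language happily provides one.
static assert(__traits(compiles, { char[] x; size_t y = x.length; }));
// Other arrays are treated consistently.
static assert(hasLength!(int[]));

void main() {}
```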

However, I want to define right here, that index is not a number of code  
points. One does not frequently get code point counts, one gets indexes.  
It has always been that way, and I'm not planning to change that. That you  
can't use the index to determine the number of code points that came  
before it, is not a frequent issue that arises.

e.g., I want to find the first instance of "xyz" in a string, do I care  
how many code points it has to go through, or what point I have to slice  
the string to get that?

A previous poster brings up this incorrect code:

auto index = countUntil(str, "xyz");
auto newstr = str[index..$];

But it can easily be done this way also:

auto index = indexOf(str, "xyz");
auto codepts = walkLength(str[0..index]);
auto newstr = str[index..$];

Given how D works, I think it would be very costly and near impossible to  
somehow make the incorrect slice operation statically rejected. One simply  
has to be trained what a code point is, and what a code unit is. HOWEVER,  
for the most part, nobody needs to care. Strings work fine without having  
to randomly access specific code points or slice based on them. Using  
indexes works just fine.
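
A small runnable version of the two snippets above, assuming auto-decoding Phobos semantics: countUntil counts range elements (code points for a narrow string), while indexOf returns a code-unit offset. The sample string is an assumption chosen so that the two values differ.

```d
import std.algorithm.searching : countUntil;
import std.range : walkLength;
import std.string : indexOf;

void main()
{
    // U+00E9 takes two UTF-8 code units, so counts and offsets diverge.
    string str = "cass\u00E9 xyz";

    auto count = countUntil(str, "xyz"); // code points before the match
    auto index = indexOf(str, "xyz");    // code units before the match

    assert(count == 6);
    assert(index == 7);

    assert(str[index .. $] == "xyz");   // slicing with the code-unit index works
    assert(str[count .. $] == " xyz");  // slicing with the count is one unit off
    assert(walkLength(str[0 .. index]) == count); // count recovered on demand
}
```

With an ASCII-only string both values coincide, which is why the incorrect version tends to go unnoticed.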

-Steve

Mar 11 2014

Johannes Pfau <nospam example.com> writes:

On Tue, 11 Mar 2014 14:02:26 -0400,
"Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 A previous poster brings up this incorrect code:
 
 auto index = countUntil(str, "xyz");
 auto newstr = str[index..$];
 
 But it can easily be done this way also:
 
 auto index = indexOf(str, "xyz");
 auto codepts = walkLength(str[0..index]);
 auto newstr = str[index..$];
 
 Given how D works, I think it would be very costly and near
 impossible to somehow make the incorrect slice operation statically
 rejected. One simply has to be trained what a code point is, and what
 a code unit is. HOWEVER, for the most part, nobody needs to care.
 Strings work fine without having to randomly access specific code
 points or slice based on them. Using indexes works just fine.
 
 -Steve

Yes, you can work around the count problem, but then it is not
"consistent across all uses of the string". What if the above code was
a generic template written for arrays? Then it silently fails for
strings and you have to special case it.

I think the problem here is that ranges / algorithms have to work on
the same data type as slicing/indexing. If .front returns code units,
then indexing/slicing should be done with code units. If it returns
code points then slicing has to happen on code points for consistency
or it should be disallowed. (Slicing on code units is important - no
doubt. But it is error prone and should be explicit in some way:
string.sliceCP(a, b) or string.representation[a..b])

Mar 11 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Tue, 11 Mar 2014 14:25:10 -0400, Johannes Pfau <nospam example.com>  
wrote:

 Yes, you can work around the count problem, but then it is not
 "consistent across all uses of the string". What if the above code was
 a generic template written for arrays? Then it silently fails for
 strings and you have to special case it.

 I think the problem here is that ranges / algorithms have to work on
 the same data type as slicing/indexing. If .front returns code units,
 then indexing/slicing should be done with code units. If it returns
 code points then slicing has to happen on code points for consistency
 or it should be disallowed. (Slicing on code units is important - no
 doubt. But it is error prone and should be explicit in some way:
 string.sliceCP(a, b) or string.representation[a..b])

I look at it a different way -- indexes are increasing, just not  
consecutive. If there is a way to say "indexes are not linear", then that  
would be a good trait to expose.

For instance, think of a tree-map, which has keys that may not be  
consecutive. Should you be able to slice such a container? I'd say yes.  
But tree[0..5] may not get you the first 5 elements.

-Steve

Mar 11 2014

"monarch_dodra" <monarchdodra gmail.com> writes:

On Tuesday, 11 March 2014 at 18:26:36 UTC, Johannes Pfau wrote:
 I think the problem here is that ranges / algorithms have to 
 work on
 the same data type as slicing/indexing. If .front returns code 
 units,
 then indexing/slicing should be done with code units. If it 
 returns
 code points then slicing has to happen on code points for 
 consistency
 or it should be disallowed. (Slicing on code units is important 
 - no
 doubt. But it is error prone and should be explicit in some way:
 string.sliceCP(a, b) or string.representation[a..b])

I think it is important to remember that in terms of 
ranges/algorithms, strings are neither indexable nor sliceable 
ranges.

The "only way" to generically slice a string in generic code, is 
to explicitly test that a range is actually a string, and then 
knowingly call an "internal primitive" that is NOT a part of the 
range traits.

So slicing/indexing *is* already disallowed, in terms of 
range/algorithms anyways.

Mar 12 2014

"monarch_dodra" <monarchdodra gmail.com> writes:

On Tuesday, 11 March 2014 at 18:02:26 UTC, Steven Schveighoffer 
wrote:
 No, where we are today is that in some cases, the language 
 treats a char[] as an array of char, in other cases, it treats 
 a char[] as a bi-directional dchar range.

 -Steve

I want to mention something I've had trouble with recently, that 
I haven't seen mentioned yet, but is related:

The ambiguity of the "lone char".

By that I mean: When a function accepts 'char' as an argument, it 
is (IMO) very hard to know what it is actually accepting:
1. An ASCII char in the 0 .. 128 range?
2. A code unit?
3. (heaven forbid) a code point in the 0 .. 256 range packed into 
a char?

Currently (fortunately? unfortunately?) the choice taken in 
our algorithms is 3, which is actually the 'safest' solution.

So if you write:
find("cassé", cast(char)'é');

It *will* correctly find the 'é', but it *won't* search for it in 
individual code units.

--------

Another more pernicious case is that of output ranges. "put" is 
supposed to know how to convert any string/char width into any 
string/char width.

Again, things become funky if you tell "put" to place a string, 
into a sink that accepts a char.

Is the sink actually telling you to feed it code units? or ascii?

Mar 12 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
 I proposed this inside the long "major performance problem with 
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not 
 even negative attention :)

 An idea to fix the whole problems I see with char[] being 
 treated specially by phobos: introduce an actual string type, 
 with char[] as backing, that is a dchar range, that actually 
 dictates the rules we want. Then, make the compiler use this 
 type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via 
 dchar
 2. No more issues with "cassé"[4], it is a static compiler 
 error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the 
 compiler.
 6. Any other special rules we come up with can be dictated by 
 the library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still 
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar 
 range. Use std.algorithm.copy(string1.representation, 
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, 
 and that code is easily 'fixed' by adding the 'representation' 
 qualifier.

 -Steve

I know warnings are disliked, but couldn't we make the slicing 
and indexing work as currently but issue a warning*? It's not 
ideal but it does mean we get backwards compatibility.

In my mind this is an important enough improvement to justify a 
little unpleasantness. We can't afford the breakage but we also 
should definitely act on this.


*Alternatively, they could just be deprecated from the get-go.

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 13:59:53 -0400, John Colvin  
<john.loughran.colvin gmail.com> wrote:

 On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:
 I proposed this inside the long "major performance problem with  
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not even  
 negative attention :)

 An idea to fix the whole problems I see with char[] being treated  
 specially by phobos: introduce an actual string type, with char[] as  
 backing, that is a dchar range, that actually dictates the rules we  
 want. Then, make the compiler use this type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(char[] data) { representation = data;}
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char[].

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via dchar
 2. No more issues with "cassé"[4], it is a static compiler error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the compiler.
 6. Any other special rules we come up with can be dictated by the  
 library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still  
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar range.  
 Use std.algorithm.copy(string1.representation,  
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, and that  
 code is easily 'fixed' by adding the 'representation' qualifier.

 -Steve

 I know warnings are disliked, but couldn't we make the slicing and  
 indexing work as currently but issue a warning*? It's not ideal but it  
 does mean we get backwards compatibility.

As I mentioned elsewhere (but repeating here for viewers), I was not  
planning on disabling slicing.

Indexing is rarely a feature one needs or should use, especially with  
encoded strings.

-Steve

Mar 10 2014

"Chris Williams" <yoreanon-chrisw yahoo.co.jp> writes:

On Monday, 10 March 2014 at 18:13:14 UTC, Steven Schveighoffer 
wrote:
 Indexing is rarely a feature one needs or should use, 
 especially with encoded strings.

If I was writing something like a chat or terminal window, I 
would want to be able to jump to chunks of text based on some 
sort of buffer length, then search for actual character 
boundaries. Similarly, if I was indexing text, I don't care what 
the underlying data is just whether any particular set of n-bytes 
have been seen together among some document. For the latter case, 
I don't need to be able to interpret the data as text while 
indexing, but once I perform an actual search and want to jump 
the user to that line in the file, being able to take a byte 
offset that I had stored in the index and convert that to a 
textual position would be good.

I do think that D should have something like

alias String8 = UTF!char;
alias String16 = UTF!wchar;
alias String32 = UTF!dchar;

And that those sit on top of an underlying immutable(xchar)[] 
buffer, providing variants of things like foreach and length 
based on code-point or grapheme boundaries. But I don't think 
there's any value in reinterpretting "string". Not being a struct 
or an object, it doesn't have the extensibility to be useful for 
all the variations of access that working with Unicode and the 
underlying bytes warrants.

Mar 10 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated specially by
 phobos: introduce an actual string type, with char[] as backing, that is a dchar
 range, that actually dictates the rules we want. Then, make the compiler use
 this type for literals.

Proposals to make a string class for D have come up many times. I have a 
kneejerk dislike for it. It's a really strong feature for D to have strings be 
an array type, and I'll go to great lengths to keep it that way.

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 14:30:07 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated  
 specially by
 phobos: introduce an actual string type, with char[] as backing, that  
 is a dchar
 range, that actually dictates the rules we want. Then, make the  
 compiler use
 this type for literals.

 Proposals to make a string class for D have come up many times. I have a  
 kneejerk dislike for it. It's a really strong feature for D to have  
 strings be an array type, and I'll go to great lengths to keep it that  
 way.

I wholly agree, they should be an array type. But what they are now is  
worse.

-Steve

Mar 10 2014

Johannes Pfau <nospam example.com> writes:

On Mon, 10 Mar 2014 11:30:07 -0700,
Walter Bright <newshound2 digitalmars.com> wrote:

 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated
 specially by phobos: introduce an actual string type, with char[]
 as backing, that is a dchar range, that actually dictates the rules
 we want. Then, make the compiler use this type for literals.

 
 Proposals to make a string class for D have come up many times. I
 have a kneejerk dislike for it. It's a really strong feature for D to
 have strings be an array type, and I'll go to great lengths to keep
 it that way.

Question: which type T doesn't have slicing, has an ElementType of
dchar, has typeof(T[0]).sizeof == 4, ElementEncodingType!T == char and
still satisfies isArray?

It's a string. Would you call that 'an array type'?

	writeln(isArray!string);   //true
	writeln(hasSlicing!string); //false
	writeln(ElementType!string.stringof); //dchar
	writeln(ElementEncodingType!string.stringof); //char

I wouldn't call that an array. Part of the problem is that you want
string to be arrays (fixed size elements, direct indexing) and Andrei
doesn't want them to be arrays (operating on code points => not fixed
size => not arrays).
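
The same observations can be stated as compile-time checks. This is a sketch assuming a Phobos version with auto-decoding of narrow strings. Note that ElementEncodingType!string is immutable(char), so Unqual is used to compare against plain char, and that indexing a string value yields a one-byte code unit while ElementType is the four-byte dchar.

```d
import std.range.primitives : ElementEncodingType, ElementType, hasSlicing;
import std.traits : isArray, Unqual;

static assert(isArray!string);                      // it is an array ...
static assert(!hasSlicing!string);                  // ... that ranges cannot slice
static assert(is(ElementType!string == dchar));     // range element: code point
static assert(is(Unqual!(ElementEncodingType!string) == char)); // storage: code unit
static assert(ElementType!string.sizeof == 4);
static assert(typeof(string.init[0]).sizeof == 1);  // indexing yields a code unit

void main() {}
```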

Mar 10 2014

"Artem Tarasov" <lomereiter gmail.com> writes:

On Monday, 10 March 2014 at 18:50:28 UTC, Johannes Pfau wrote:
 Question: which type T doesn't have slicing, has an ElementType 
 of
 dchar, has typeof(T[0]).sizeof == 4, ElementEncodingType!T == 
 char and
 still satisfies isArray?

In addition, hasLength!T == false, which totally freaked me out 
when I first discovered that.

Mar 10 2014

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Mon, Mar 10, 2014 at 07:49:04PM +0100, Johannes Pfau wrote:
 On Mon, 10 Mar 2014 11:30:07 -0700,
 Walter Bright <newshound2 digitalmars.com> wrote:
 
 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated
 specially by phobos: introduce an actual string type, with char[]
 as backing, that is a dchar range, that actually dictates the
 rules we want. Then, make the compiler use this type for literals.

 
 Proposals to make a string class for D have come up many times. I
 have a kneejerk dislike for it. It's a really strong feature for D
 to have strings be an array type, and I'll go to great lengths to
 keep it that way.


I'm on the fence about this one. The nice thing about strings being an
array type, is that it is a familiar concept to C coders, and it allows
array slicing for extracting substrings, etc., which fits nicely with
the C view of strings as character arrays. As a C coder myself, I like
it this way too. But the bad thing about strings being an array type, is
that it's a holdover from C, and it allows slicing for extracting
substrings -- malformed substrings by permitting slicing a multibyte
(multiword) character.

Basically, the nice aspects of strings being arrays only apply when
you're dealing with ASCII (or mostly-ASCII) strings. These very same
"nice" aspects turn into problems when dealing with anything non-ASCII.
The only way the user can get it right using only array operations, is
if they understand the whole of Unicode in their head and are willing to
reinvent Unicode algorithms every time they slice a string or do some
operation on it. Since D purportedly supports Unicode by default, it
shouldn't be this way. D should *actually* support Unicode all the way
-- use proper Unicode algorithms for substring extraction, collation,
line-breaking, normalization, etc.. Being a systems language, of course,
means that D should allow you to get under the hood and do things
directly with the raw string representation -- but this shouldn't be the
*default* modus operandi.  The default should be a properly-encapsulated
string type with Unicode algorithms to operate on it (with the option of
reaching into the raw representation where necessary).


 Question: which type T doesn't have slicing, has an ElementType of
 dchar, has typeof(T[0]).sizeof == 4, ElementEncodingType!T == char and
 still satisfies isArray?
 
 It's a string. Would you call that 'an array type'?
 
 	writeln(isArray!string);   //true
 	writeln(hasSlicing!string); //false
 	writeln(ElementType!string.stringof); //dchar
 	writeln(ElementEncodingType!string.stringof); //char
 
 I wouldn't call that an array. Part of the problem is that you want
 string to be arrays (fixed size elements, direct indexing) and Andrei
 doesn't want them to be arrays (operating on code points => not fixed
 size => not arrays).

Exactly. What we have right now is a frankensteinian hybrid that's
neither fully an array, nor fully a Unicode string type. If we call the
current messy AA implementation split between compiler, aaA.d, and
object.di a design problem, then I'd call the current state of D strings
a design problem too. This underlying inconsistency is ultimately what
leads to the poor performance of strings in std.algorithm.

It's precisely because of this that I've given up on using std.algorithm
for strings altogether -- std.regex is far better: more flexible, more
expressive, and more performant, and specifically designed to operate on
strings. Nowadays I only use std.algorithm for non-string ranges
(because then the behaviour is actually consistent!!).


T

-- 
MS Windows: 64-bit overhaul of 32-bit extensions and a graphical shell for a
16-bit patch to an 8-bit operating system originally coded for a 4-bit
microprocessor, written by a 2-bit company that can't stand 1-bit of
competition.

Mar 10 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 10 March 2014 at 19:48:34 UTC, H. S. Teoh wrote:
 On Mon, Mar 10, 2014 at 07:49:04PM +0100, Johannes Pfau wrote:
 On Mon, 10 Mar 2014 11:30:07 -0700,
 Walter Bright <newshound2 digitalmars.com> wrote:
 
 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being 
 treated
 specially by phobos: introduce an actual string type, with 
 char[]
 as backing, that is a dchar range, that actually dictates 
 the
 rules we want. Then, make the compiler use this type for 
 literals.

 
 Proposals to make a string class for D have come up many 
 times. I
 have a kneejerk dislike for it. It's a really strong feature 
 for D
 to have strings be an array type, and I'll go to great 
 lengths to
 keep it that way.


 I'm on the fence about this one. The nice thing about strings 
 being an
 array type, is that it is a familiar concept to C coders, and 
 it allows
 array slicing for extracting substrings, etc., which fits 
 nicely with
 the C view of strings as character arrays. As a C coder myself, 
 I like
 it this way too. But the bad thing about strings being an array 
 type, is
 that it's a holdover from C, and it allows slicing for 
 extracting
 substrings -- malformed substrings by permitting slicing a 
 multibyte
 (multiword) character.

 Basically, the nice aspects of strings being arrays only apply 
 when
 you're dealing with ASCII (or mostly-ASCII) strings. These very 
 same
 "nice" aspects turn into problems when dealing with anything 
 non-ASCII.
 The only way the user can get it right using only array 
 operations, is
 if they understand the whole of Unicode in their head and are 
 willing to
 reinvent Unicode algorithms every time they slice a string or 
 do some
 operation on it. Since D purportedly supports Unicode by 
 default, it
 shouldn't be this way. D should *actually* support Unicode all 
 the way
 -- use proper Unicode algorithms for substring extraction, 
 collation,
 line-breaking, normalization, etc.. Being a systems language, 
 of course,
 means that D should allow you to get under the hood and do 
 things
 directly with the raw string representation -- but this 
 shouldn't be the
 *default* modus operandi.  The default should be a 
 properly-encapsulated
 string type with Unicode algorithms to operate on it (with the 
 option of
 reaching into the raw representation where necessary).

You started off on the fence, but you seem pretty convinced by 
the end!

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 14:30:07 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated  
 specially by
 phobos: introduce an actual string type, with char[] as backing, that  
 is a dchar
 range, that actually dictates the rules we want. Then, make the  
 compiler use
 this type for literals.

 Proposals to make a string class for D have come up many times. I have a  
 kneejerk dislike for it. It's a really strong feature for D to have  
 strings be an array type, and I'll go to great lengths to keep it that  
 way.

BTW, this escaped my view the first time reading your post, but I am NOT  
proposing a string *class*. In fact, I'm not proposing we change anything  
technical about strings, the code generated should be basically identical.  
What I'm proposing is to encapsulate what you can and can't do with a  
string in the type itself, instead of making the standard library flip  
over backwards to treat it as something else when the compiler treats it  
as a simple array of char.
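
A minimal sketch of what "encapsulating the rules in the type" could look like, for illustration only. The name String and every member except representation are assumptions and not part of the proposal text; decode and stride are the existing std.utf functions. The type is a forward range of dchar, offers no indexing, and leaves the raw code units reachable through representation.

```d
import std.range : ElementType, isForwardRange, isRandomAccessRange;
import std.utf : decode, stride;

// Hypothetical wrapper: same memory layout as a slice of immutable(char),
// but the only range interface it offers is iteration by code point.
struct String
{
    immutable(char)[] representation; // raw UTF-8 code units, always reachable

    @property bool empty() { return representation.length == 0; }

    // Decode the code point at the start of the string.
    @property dchar front()
    {
        size_t index = 0;
        return decode(representation, index);
    }

    // Drop as many code units as the first code point occupies.
    void popFront()
    {
        representation = representation[stride(representation, 0) .. $];
    }

    @property String save() { return this; }
}

static assert(isForwardRange!String);
static assert(is(ElementType!String == dchar));
static assert(!isRandomAccessRange!String);
// No opIndex is defined, so indexing the wrapper is a compile-time error.
static assert(!__traits(compiles, String("abc")[0]));
```

Code that really wants code units would write s.representation[i], which states the intent at the call site.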

-Steve

Mar 10 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 3/10/2014 11:54 AM, Steven Schveighoffer wrote:
 BTW, this escaped my view the first time reading your post, but I am NOT
 proposing a string *class*.

Right, but here I used the term "class" to be more generic as in being a user 
defined type, i.e. struct or class. I should have been more clear.

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 15:01:20 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/10/2014 11:54 AM, Steven Schveighoffer wrote:
 BTW, this escaped my view the first time reading your post, but I am NOT
 proposing a string *class*.

 Right, but here I used the term "class" to be more generic as in being a  
 user defined type, i.e. struct or class. I should have been more clear.

Then I don't understand your point. What strings are already is a  
user-defined type, but with horrible enforcement. i.e. things that  
shouldn't be allowed are only disallowed if you opt-in using phobos'  
template constraints.

-Steve

Mar 10 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 3/10/2014 1:36 PM, Steven Schveighoffer wrote:
 What strings are already is a user-defined type,

No, they are not.

 but with horrible enforcement.

With no enforcement, and that is by design.

Keep in mind that D is a systems programming language, and that means
unfettered access to strings.

Mar 10 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 10 March 2014 at 20:52:27 UTC, Walter Bright wrote:
 On 3/10/2014 1:36 PM, Steven Schveighoffer wrote:
 What strings are already is a user-defined type,

 No, they are not.


 but with horrible enforcement.

 With no enforcement, and that is by design.

 Keep in mind that D is a systems programming language, and that 
 means unfettered access to strings.

I don't see how this proposal would limit that access. The raw 
immutable(char)[] is still there, ready to be used just as always.

It seems like it fits the D ethos: safe and reasonably fast by 
default, unsafe and lightning fast on request. (Admittedly a bad 
wording, sometimes the fastest can still be safe)

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 16:52:27 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/10/2014 1:36 PM, Steven Schveighoffer wrote:
 What strings are already is a user-defined type,

 No, they are not.

The functionality added via phobos can hardly be considered extraneous.  
One would not use strings without the library.

 but with horrible enforcement.

 With no enforcement, and that is by design.

The enforcement is opt-in. That is, you have to use phobos' templates in  
order to use them "properly":

auto getIt(R)(R r, size_t idx)
{
    if(idx < r.length)
       return r[idx];
    assert(0, "index out of range"); // every path must return or halt
}

The above compiles fine for strings. However, it does not compile fine if  
you do:

auto getIt(R)(R r, size_t idx) if(hasLength!R && isRandomAccessRange!R)

For any other range type, the strict version and the simple implementation  
without template constraints agree: either both compile or both fail. Only  
for strings do they disagree. In other words, the compiler doesn't believe  
the same thing phobos does. Shooting oneself in the foot is quite easy.
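
To make the mismatch concrete, here is a small check of what the language and the range traits each report for string. This is a sketch that assumes a Phobos release in which narrow strings are still auto-decoded; the trait names are the existing ones from std.range.

```d
import std.range : ElementType, hasLength, isRandomAccessRange;

// The language's view: a string has a length and is indexable by code unit.
static assert(is(typeof("abc".length) == size_t));
static assert(is(typeof("abc"[0]) == immutable(char)));

// Phobos' view: a string has no usable length, is not random access,
// and its elements are decoded code points.
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);
static assert(is(ElementType!string == dchar));

// For an ordinary array the two views agree.
static assert(hasLength!(int[]));
static assert(isRandomAccessRange!(int[]));
static assert(is(ElementType!(int[]) == int));
```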

 Keep in mind that D is a systems programming language, and that means  
 unfettered access to strings.

Access is fine, with clear intentions. And we do not have unfettered  
access. I cannot sort a mutable string of ASCII characters without first  
converting it to ubyte[].
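
A short illustration of that workaround, assuming current Phobos behaviour. sortAscii is a hypothetical helper name; sort and representation are the existing std.algorithm and std.string functions.

```d
import std.algorithm : sort;
import std.string : representation;

// sort() rejects char[] because Phobos does not treat it as random access.
static assert(!__traits(compiles, sort((char[]).init)));

// Hypothetical helper: sorts ASCII text in place by sorting its code units.
// Only meaningful when every element is a single-byte (ASCII) character.
char[] sortAscii(char[] text)
{
    sort(text.representation); // ubyte[] view over the same memory
    return text;
}
```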

What in my proposal makes you think you don't have unfettered access? The  
underlying immutable(char)[] representation is accessible. In fact, you  
would have more access, since phobos functions would then work with a  
char[] like it's a proper array.

-Steve

Mar 10 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 3/10/2014 2:09 PM, Steven Schveighoffer wrote:
 What in my proposal makes you think you don't have unfettered access? The
 underlying immutable(char)[] representation is accessible. In fact, you would
 have more access, since phobos functions would then work with a char[] like
 it's a proper array.

You divide the D world into two camps - those that use 'struct string', and 
those that use immutable(char)[] strings.

 I imagine only code that is currently UTF ignorant will break,

This also makes it a non-starter.

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 17:52:05 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/10/2014 2:09 PM, Steven Schveighoffer wrote:
 What in my proposal makes you think you don't have unfettered access?  
 The
 underlying immutable(char)[] representation is accessible. In fact, you  
 would
 have more access, since phobos functions would then work with a char[]  
 like it's
 a proper array.

 You divide the D world into two camps - those that use 'struct string',  
 and those that use immutable(char)[] strings.

Really? It's not that divisive. However, the situation is certainly better  
than today's world of those who use 'string' and those who use  
'string.representation'. Those who use string.representation would  
actually get much more use out of it. Those who use string would see no  
changes.

  > I imagine only code that is currently UTF ignorant will break,

 This also makes it a non-starter.

You're the guardian of changes to the language, clearly holding a veto on  
any proposals. But this doesn't come across as very open-minded,  
especially from someone who wanted to do something that would change the  
fundamental treatment of strings last week.

IMO, breaking incorrect code is a good idea, and worth at least exploring.

-Steve

Mar 10 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 3/10/2014 3:26 PM, Steven Schveighoffer wrote:
 On Mon, 10 Mar 2014 17:52:05 -0400, Walter Bright <newshound2 digitalmars.com>
 wrote:
 This also makes it a non-starter.

 You're the guardian of changes to the language, clearly holding a veto on any
 proposals. But this doesn't come across as very open-minded, especially from
 someone who wanted to do something that would change the fundamental treatment
 of strings last week.

I deserve that criticism. On the other hand, I've pretty much given up on
fixing std.array.front() because of that. In the last couple days, we also
wound up annoying a valuable client with some minor breakage with std.json,
reiterating how important it is to not break code if we can at all avoid it.


 IMO, breaking incorrect code is a good idea, and worth at least exploring.

Breaking broken code, yes.

Mar 10 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Walter Bright:

 In the last couple days, we also wound up annoying a valuable 
 client with some minor breakage with std.json, reiterating how 
 important it is to not break code if we can at all avoid it.

There are still some breaking changes that I'd like to perform in 
D, like deprecating certain usages of the comma operator, etc.

Bye,
bearophile

Mar 10 2014

"Meta" <jared771 gmail.com> writes:

On Tuesday, 11 March 2014 at 00:02:13 UTC, bearophile wrote:
 Walter Bright:

 In the last couple days, we also wound up annoying a valuable 
 client with some minor breakage with std.json, reiterating how 
 important it is to not break code if we can at all avoid it.

 There are still some breaking changes that I'd like to perform 
 in D, like deprecating certain usages of the comma operator, 
 etc.

 Bye,
 bearophile

That damnable comma operator is one of the worst things that was 
inherited from C. IMO, it has no use outside the header of a for 
loop, and even there it's suspect.

Mar 10 2014

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Tue, Mar 11, 2014 at 12:49:40AM +0000, Meta wrote:
 On Tuesday, 11 March 2014 at 00:02:13 UTC, bearophile wrote:
Walter Bright:

In the last couple days, we also wound up annoying a valuable
client with some minor breakage with std.json, reiterating how
important it is to not break code if we can at all avoid it.

There are still some breaking changes that I'd like to perform in
D, like deprecating certain usages of the comma operator, etc.


[...]
 That damnable comma operator is one of the worst things that was
 inherited from C. IMO, it has no use outside the header of a for
 loop, and even there it's suspect.

I've always been of the opinion that the comma operator in a for loop
should be treated as special syntax, rather than a language-wide
operator. The comma operator must die. :P


T

-- 
Public parking: euphemism for paid parking. -- Flora

Mar 11 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Meta:

 That damnable comma operator is one of the worst things that 
 was inherited from C. IMO, it has no use outside the header of 
 a for loop, and even there it's suspect.

The place for the discussion about the comma operator:
https://d.puremagic.com/issues/show_bug.cgi?id=2659

Bye,
bearophile

Mar 11 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 10 March 2014 at 21:52:04 UTC, Walter Bright wrote:
 On 3/10/2014 2:09 PM, Steven Schveighoffer wrote:
 What in my proposal makes you think you don't have unfettered 
 access? The
 underlying immutable(char)[] representation is accessible. In 
 fact, you would
 have more access, since phobos functions would then work with 
 a char[] like it's
 a proper array.

 You divide the D world into two camps - those that use 'struct 
 string', and those that use immutable(char)[] strings.

I would go so far as to say this is a good thing, as long as the 
'struct string' is transparently the default.

If you want good unicode support that works in a sane and 
relatively transparent manner, just write string, use literals as 
normal etc.
If you want a normal array of characters, that behaves sanely and 
consistently as an array, use char[] with relevant qualifiers.

Mar 11 2014

Ary Borenszweig <ary esperanto.org.ar> writes:

On 3/10/14, 3:30 PM, Walter Bright wrote:
 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated
 specially by
 phobos: introduce an actual string type, with char[] as backing, that
 is a dchar
 range, that actually dictates the rules we want. Then, make the
 compiler use
 this type for literals.

 Proposals to make a string class for D have come up many times. I have a
 kneejerk dislike for it. It's a really strong feature for D to have
 strings be an array type, and I'll go to great lengths to keep it that way.

You can also look at Erlang, where strings are just lists of numbers. 
Eventually they realized it was a huge mistake and introduced another 
type, a binary string, which is much more efficient and works as expected.

I think making strings behave like arrays is a design mistake.

Mar 12 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 3/12/14, 6:24 AM, Ary Borenszweig wrote:
 On 3/10/14, 3:30 PM, Walter Bright wrote:
 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated
 specially by
 phobos: introduce an actual string type, with char[] as backing, that
 is a dchar
 range, that actually dictates the rules we want. Then, make the
 compiler use
 this type for literals.

 Proposals to make a string class for D have come up many times. I have a
 kneejerk dislike for it. It's a really strong feature for D to have
 strings be an array type, and I'll go to great lengths to keep it that
 way.

 You can also look at Erlang, where strings are just lists of numbers.
 Eventually they realized it was a huge mistake and introduced another
 type, a binary string, which is much more efficient and works as expected.

 I think making strings behave like arrays is a design mistake.

Erlang's mistake was different from what you believe was D's mistake. 
There is no comparison to be drawn.

Andrei

Mar 12 2014

Ary Borenszweig <ary esperanto.org.ar> writes:

On 3/12/14, 1:53 PM, Andrei Alexandrescu wrote:
 On 3/12/14, 6:24 AM, Ary Borenszweig wrote:
 On 3/10/14, 3:30 PM, Walter Bright wrote:
 On 3/10/2014 6:35 AM, Steven Schveighoffer wrote:
 An idea to fix the whole problems I see with char[] being treated
 specially by
 phobos: introduce an actual string type, with char[] as backing, that
 is a dchar
 range, that actually dictates the rules we want. Then, make the
 compiler use
 this type for literals.

 Proposals to make a string class for D have come up many times. I have a
 kneejerk dislike for it. It's a really strong feature for D to have
 strings be an array type, and I'll go to great lengths to keep it that
 way.

 You can also look at Erlang, where strings are just lists of numbers.
 Eventually they realized it was a huge mistake and introduced another
 type, a binary string, which is much more efficient and works as
 expected.

 I think making strings behave like arrays is a design mistake.

 Erlang's mistake was different from what you believe was D's mistake.
 There is no comparison to be drawn.

 Andrei

What's D's mistake then?

Mar 12 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 3/12/14, 10:29 AM, Ary Borenszweig wrote:
 What's D's mistake then?

I don't think we made a mistake with D's strings. They could have been 
done better if we made all iteration requests explicit.

Andrei

Mar 12 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
wrote:
 I proposed this inside the long "major performance problem with 
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not 
 even negative attention :)

 An idea to fix the whole problems I see with char[] being 
 treated specially by phobos: introduce an actual string type, 
 with char[] as backing, that is a dchar range, that actually 
 dictates the rules we want. Then, make the compiler use this 
 type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(immutable(char)[] data) { representation = data; }
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char.

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via 
 dchar
 2. No more issues with "cassé"[4], it is a static compiler 
 error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the 
 compiler.
 6. Any other special rules we come up with can be dictated by 
 the library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still 
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar 
 range. Use std.algorithm.copy(string1.representation, 
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, 
 and that code is easily 'fixed' by adding the 'representation' 
 qualifier.

 -Steve

just to check I understand this fully:

in this new scheme, what would this do?

auto s = "cassé".representation;
foreach(i, c; s) write(i, ':', c, ' ');
writeln(s);

Currently - without the .representation - I get

0:c 1:a 2:s 3:s 4:e 5:̠6:`
cassé

or, to spell it out a bit more:
0:c 1:a 2:s 3:s 4:e 5:xCC 6:x81
cassé

Mar 10 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 10 Mar 2014 17:46:23 -0400, John Colvin  
<john.loughran.colvin gmail.com> wrote:

 On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer wrote:
 I proposed this inside the long "major performance problem with
 std.array.front," I've also proposed it before, a long time ago.

 But seems to be getting no attention buried in that thread, not even
 negative attention :)

 An idea to fix the whole problems I see with char[] being treated
 specially by phobos: introduce an actual string type, with char[] as
 backing, that is a dchar range, that actually dictates the rules we
 want. Then, make the compiler use this type for literals.

 e.g.:

 struct string {
    immutable(char)[] representation;
    this(immutable(char)[] data) { representation = data; }
    ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char.

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via dchar
 2. No more issues with "cassé"[4], it is a static compiler error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the compiler.
 6. Any other special rules we come up with can be dictated by the
 library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still
 decode/encode, but it's more explicit. It's EXPLICITLY a dchar range.
 Use std.algorithm.copy(string1.representation,
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will break, and that
 code is easily 'fixed' by adding the 'representation' qualifier.

 -Steve

 just to check I understand this fully:

 in this new scheme, what would this do?

 auto s = "cassé".representation;
 foreach(i, c; s) write(i, ':', c, ' ');
 writeln(s);

 Currently - without the .representation - I get

 0:c 1:a 2:s 3:s 4:e 5:̠6:`
 cassé

 or, to spell it out a bit more:
 0:c 1:a 2:s 3:s 4:e 5:xCC 6:x81
 cassé

The plan is for foreach on s to iterate by char, and foreach on "cassé"  
to iterate by dchar.

What this means is the accent will be iterated separately from the e, and  
likely gets put onto the colon after 5. However, the individual code units,  
which have no meaning on their own (xCC and x81), would not be iterated.

In your above code, using .representation would be equivalent to what it  
is now without .representation (i.e. over char), and without  
.representation would be equivalent to this on today's compiler (except  
faster):

foreach(i, dchar c; s)
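
For comparison, both iteration modes can already be requested explicitly in today's D by naming the loop variable's type. The sketch below uses hypothetical helper names and counts what each loop visits; "casse\u0301" is the decomposed spelling used in this thread (an e followed by U+0301, which occupies two UTF-8 code units).

```d
// Counts what foreach visits when the loop variable is a code unit.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s) ++n;  // one step per UTF-8 code unit
    return n;
}

// Counts what foreach visits when the loop variable is a code point.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s) ++n; // the compiler decodes on the fly
    return n;
}
```

For "casse\u0301" the first function visits seven code units and the second six code points, since the accent is a code point of its own.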

-Steve

Mar 10 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 10 March 2014 at 22:15:34 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 17:46:23 -0400, John Colvin 
 <john.loughran.colvin gmail.com> wrote:

 On Monday, 10 March 2014 at 13:35:33 UTC, Steven Schveighoffer 
 wrote:
 I proposed this inside the long "major performance problem 
 with std.array.front," I've also proposed it before, a long 
 time ago.

 But seems to be getting no attention buried in that thread, 
 not even negative attention :)

 An idea to fix the whole problems I see with char[] being 
 treated specially by phobos: introduce an actual string type, 
 with char[] as backing, that is a dchar range, that actually 
 dictates the rules we want. Then, make the compiler use this 
 type for literals.

 e.g.:

 struct string {
   immutable(char)[] representation;
   this(char[] data) { representation = data;}
   ... // dchar range primitives
 }

 Then, a char[] array is simply an array of char.

 points:

 1. No more issues with foreach(c; "cassé"), it iterates via 
 dchar
 2. No more issues with "cassé"[4], it is a static compiler 
 error.
 3. No more awkward ASCII manipulation using ubyte[].
 4. No more phobos schizophrenia saying char[] is not an array.
 5. No more special casing char[] array templates to fool the 
 compiler.
 6. Any other special rules we come up with can be dictated by 
 the library, and not ignored by the compiler.

 Note, std.algorithm.copy(string1, mutablestring) will still 
 decode/encode, but it's more explicit. It's EXPLICITLY a 
 dchar range. Use std.algorithm.copy(string1.representation, 
 mutablestring.representation) will avoid the issues.

 I imagine only code that is currently UTF ignorant will 
 break, and that code is easily 'fixed' by adding the 
 'representation' qualifier.

 -Steve

 just to check I understand this fully:

 in this new scheme, what would this do?

 auto s = "cassé".representation;
 foreach(i, c; s) write(i, ':', c, ' ');
 writeln(s);

 Currently - without the .representation - I get

 0:c 1:a 2:s 3:s 4:e 5:̠6:`
 cassé

 or, to spell it out a bit more:
 0:c 1:a 2:s 3:s 4:e 5:xCC 6:x81
 cassé

 The plan is for foreach on s to iterate by char, and foreach on 
 "cassé" to iterate by dchar.

 What this means is the accent will be iterated separately from 
 the e, and likely gets put onto the colon after 5. However, the 
 individual code units, which have no meaning on their own (xCC 
 and x81), would not be iterated.

 In your above code, using .representation would be equivalent 
 to what it is now without .representation (i.e. over char), and 
 without .representation would be equivalent to this on today's 
 compiler (except faster):

 foreach(i, dchar c; s)

 -Steve

Awesome, let's do this :)

Mar 11 2014

"Kagamin" <spam here.lot> writes:

Automatic decoding by default itself is a WTF factor. The problem 
with it is it encourages unicode ignorance and pretends to work 
correctly, so it's harder for the developer to discover the 
incorrectness.

Mar 11 2014

Marco Leise <Marco.Leise gmx.de> writes:

The Unicode standard is too complex for general purpose
algorithms to do useful things on D strings. We don't see that
however, since our writing systems are sufficiently well
supported.

As an inspiration I'll leave a string here that contains
combined characters in Korean
(http://decodeunicode.org/hangul_syllables)
and Latin as well as full width characters that span 2
characters in e.g. Latin, Greek or Cyrillic scripts
(http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):

Halfwidth / Ｆｕｌｌｗｉｄｔｈ, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊

(I used the "unfonts" package for the Hangul part)

What I want to say is that for correct Unicode handling we
should either use existing libraries or get a feeling for
what the Unicode standard provides, then form use cases out of it.

For example when we talk about the length of a string we are
actually talking about 4 different things:

  - number of code units
  - number of code points
  - number of user perceived characters
  - display width using a monospace font

The same distinction applies for slicing, depending on use case.
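
The first three of these lengths can be measured with existing Phobos tools; the sketch below is illustrative and the function names are made up. walkLength (std.range) on a string counts decoded code points, and byGrapheme (std.uni) groups code points into user-perceived characters. Display width is left out because I am not aware of a Phobos function that computes it.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Number of UTF-8 code units: the array length.
size_t codeUnitCount(string s) { return s.length; }

// Number of code points: walking the string decodes it.
size_t codePointCount(string s) { return s.walkLength; }

// Number of user-perceived characters (grapheme clusters).
size_t graphemeCount(string s) { return s.byGrapheme.walkLength; }
```

For "casse\u0301" (e followed by a combining acute accent) the three counts are 7, 6 and 5 respectively.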

Related:
  - What normalization do D strings use. Both Linux and
    MacOS X use UTF-8, but the binary representation of non-ASCII
    file names is different.
  - How do we handle sorting strings?

The topic matter is complex, but not difficult (as in rocket science).
If we really want to find a solution, we should form an expert group
and stop talking until we read the latest Unicode specs. They are a
moving target. Don't expect to ever be "done" with full Unicode
support in D.

-- 
Marco

Mar 17 2014

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

18-Mar-2014 10:21, Marco Leise wrote:
 The Unicode standard is too complex for general purpose
 algorithms to do useful things on D strings. We don't see that
 however, since our writing systems are sufficiently well
 supported.

 As an inspiration I'll leave a string here that contains
 combined characters in Korean
 (http://decodeunicode.org/hangul_syllables)
 and Latin as well as full width characters that span 2
 characters in e.g. Latin, Greek or Cyrillic scripts
 (http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):

 Halfwidth / Ｆｕｌｌｗｉｄｔｈ,
ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊

 (I used the "unfonts" package for the Hangul part)

 What I want to say is that for correct Unicode handling we
 should either use existing libraries or get a feeling for
 what the Unicode standard provides, then form use cases out of it.

There is ICU and very few other things, like support in OSX frameworks 
(NSString). Industry in general kinda sucks on this point but 
desperately wants to improve.

 For example when we talk about the length of a string we are
 actually talking about 4 different things:

    - number of code units
    - number of code points
    - number of user perceived characters
    - display width using a monospace font

 The same distinction applies for slicing, depending on use case.

 Related:
    - What normalization do D strings use. Both Linux and
      MacOS X use UTF-8, but the binary representation of non-ASCII
      file names is different.

There is no single normalization to fix on.
D programs may be written for Linux only, for Mac-only or for both.

IMO we should just provide ways to normalize strings.
(std.uni has 'normalize' for starters).
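
As a small illustration of what normalize buys you: the precomposed and decomposed spellings of the same word differ as arrays but compare equal once both are brought to one form. sameText is a hypothetical helper name; normalize and NFC are the existing std.uni symbols.

```d
import std.uni : normalize, NFC;

// Hypothetical helper: compares two strings after bringing both to NFC.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}
```

With a = "caf\u00E9" and b = "cafe\u0301", a == b is false while sameText(a, b) is true.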

    - How do we handle sorting strings?

Unicode collation algorithm and provide ways to tweak the default one.

 The topic matter is complex, but not difficult (as in rocket science).
 If we really want to find a solution, we should form an expert group
 and stop talking until we read the latest Unicode specs.

Well, I did. You seem motivated, would you like to join the group?

 They are a
 moving target. Don't expect to ever be "done" with full Unicode
 support in D.

The 6.x standard line seems pretty stable to me. There is a point in 
providing support that is worth approaching. After that, the ROI drops 
steadily as the amount of work to specialize for each specific culture 
rises. At some point we can only talk about opening up ways to specialize.

D (or any library for that matter) won't ever have all possible 
tinkering that Unicode standard permits. So I expect D to be "done" with 
Unicode one day simply by reaching a point of having all universally 
applicable stuff (and stated defaults) plus having a toolbox to craft 
your own versions of algorithms. This is the goal of new std.uni.


-- 
Dmitry Olshansky

Mar 18 2014

Marco Leise <Marco.Leise gmx.de> writes:

On Tue, 18 Mar 2014 23:18:16 +0400,
Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 18-Mar-2014 10:21, Marco Leise wrote:
 The Unicode standard is too complex for general purpose
 algorithms to do useful things on D strings. We don't see that
 however, since our writing systems are sufficiently well
 supported.

 As an inspiration I'll leave a string here that contains
 combined characters in Korean
 (http://decodeunicode.org/hangul_syllables)
 and Latin as well as full width characters that span 2
 characters in e.g. Latin, Greek or Cyrillic scripts
 (http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms):

 Halfwidth / Ｆｕｌｌｗｉｄｔｈ, ᆨᆨᆨᆚᆚᅱᅱᅱᅡᅓᅲᄄᆒᄋᆮ, a͢b 9͚ c̹̊

 (I used the "unfonts" package for the Hangul part)

 What I want to say is that for correct Unicode handling we
 should either use existing libraries or get a feeling for
 what the Unicode standard provides, then form use cases out of it.

 There is ICU and very few other things, like support in OSX frameworks
 (NSString). Industry in general kinda sucks on this point but
 desperately wants to improve.

 For example when we talk about the length of a string we are
 actually talking about 4 different things:

    - number of code units
    - number of code points
    - number of user perceived characters
    - display width using a monospace font

 The same distinction applies for slicing, depending on use case.

 Related:
    - What normalization do D strings use. Both Linux and
      MacOS X use UTF-8, but the binary representation of non-ASCII
      file names is different.

 There is no single normalization to fix on.
 D programs may be written for Linux only, for Mac-only or for both.

Normalizations C and D are the non lossy ones and as far as I
understood equivalent. So I agree.

 IMO we should just provide ways to normalize strings.
 (std.uni has 'normalize' for starters).

I wondered if anyone will actually read up on normalization
prior to touching Unicode strings. I didn't, Andrei didn't and
so on...
So I expect strA == strB to be common enough, just like floatA
== floatB until the news spread. Since == is supposed to
compare for equivalence, could we hide all those details in
an opaque string type and offer correct comparison functions?

    - How do we handle sorting strings?

 Unicode collation algorithm and provide ways to tweak the default one.

I wish I didn't look at the UCA. Jeeeez...
But yeah, that's the way to go.
Big frameworks like Java added a Collate class with predefined
constants for several languages. That's too much work for us.
But the API doesn't need to preclude adding those.

 The topic matter is complex, but not difficult (as in rocket science).
 If we really want to find a solution, we should form an expert group
 and stop talking until we read the latest Unicode specs.

 Well, I did. You seem motivated, would you like to join the group?

Yes, I'd like to see a Unicode 6.x approved stamp on D.
I didn't know that you already wrote all the simple algorithms
for 2.064. Those would have been my candidates to work on, too.
Is there anything that can be implemented in a day or two? :)

 They are a
 moving target. Don't expect to ever be "done" with full Unicode
 support in D.

 The 6.x standard line seems pretty stable to me. There is a point in
 providing support that is worth approaching. After that, the ROI drops
 steadily as the amount of work to specialize for each specific culture
 rises. At some point we can only talk about opening up ways to specialize.

 D (or any library for that matter) won't ever have all possible
 tinkering that Unicode standard permits. So I expect D to be "done" with
 Unicode one day simply by reaching a point of having all universally
 applicable stuff (and stated defaults) plus having a toolbox to craft
 your own versions of algorithms. This is the goal of new std.uni.

Sorting strings is a very basic feature, but as I learned now
also highly complex. I expected some kind of tables for
download that would suffice, but the rules are pretty detailed.
E.g. in German phonebook order, ä/ö/ü has the same order as
ae/oe/ue.

-- 
Marco

Mar 19 2014

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

19-Mar-2014 18:42, Marco Leise wrote:
 On Tue, 18 Mar 2014 23:18:16 +0400,
 Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 Related:
     - What normalization do D strings use. Both Linux and
       MacOS X use UTF-8, but the binary representation of non-ASCII
       file names is different.

 There is no single normalization to fix on.
 D programs may be written for Linux only, for Mac-only or for both.

 Normalizations C and D are the non lossy ones and as far as I
 understood equivalent. So I agree.

Right, the KC & KD ones are really all about fuzzy matching and searching.

 IMO we should just provide ways to normalize strings.
 (std.uni has 'normalize' for starters).

 I wondered if anyone will actually read up on normalization
 prior to touching Unicode strings. I didn't, Andrei didn't and
 so on...
 So I expect strA == strB to be common enough, just like floatA
 == floatB until the news spread.

If that is of any comfort, other languages are even worse here. In C++ you 
are hopeless without ICU.

 Since == is supposed to
 compare for equivalence, could we hide all those details in
 an opaque string type and offer correct comparison functions?

Well, turns out the Unicode standard ties equivalence to normalization 
forms. In other words unless both your strings are normalized the same 
way there is really no point in trying to compare them.
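
To illustrate the point, here is a minimal sketch. It assumes std.uni.normalize as it ships in Phobos; canonicallyEqual is a made-up helper name, not an existing function.

```d
import std.uni;

/// Hypothetical helper: compares two strings after bringing both to NFC.
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    string composed   = "caf\u00E9";  // 'é' as a single code point
    string decomposed = "cafe\u0301"; // 'e' followed by a combining acute accent
    assert(composed != decomposed);   // plain == compares code units
    assert(canonicallyEqual(composed, decomposed));
}
```

Normalizing on every comparison is the slow but safe variant; it allocates whenever an input is not already in the requested form.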

As for opaque type - we could have say String!NFC and String!NFD or 
some-such. It would then make sure the normalization is the right one.
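
Roughly what I have in mind, as a sketch only. The String template below is hypothetical (nothing like it exists in Phobos); it just normalizes once on construction via std.uni.normalize and relies on the default field-wise struct comparison.

```d
import std.uni;

/// Hypothetical opaque string type: the payload is always kept in `form`.
struct String(NormalizationForm form)
{
    private string data;

    /// Normalizes once, on the way in.
    this(string raw)
    {
        data = normalize!form(raw);
    }

    /// Read-only access to the normalized payload.
    string toString() const
    {
        return data;
    }
}

unittest
{
    // Both spellings of "café" end up with the same NFC payload,
    // so the default struct comparison sees them as equal.
    assert(String!NFC("caf\u00E9") == String!NFC("cafe\u0301"));
    assert(String!NFD("caf\u00E9").toString == "cafe\u0301");
}
```

String!NFC and String!NFD are distinct types here, so mixing them in one comparison is a compile-time error rather than a silent mismatch.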

     - How do we handle sorting strings?

 Unicode collation algorithm and provide ways to tweak the default one.

 I wish I didn't look at the UCA. Jeeeez...
 But yeah, that's the way to go.

Needless to say, I had a nice jaw-dropping moment when I realized what an 
elephant I had missed with our std.uni (somewhere in the middle of the 
work).

 Big frameworks like Java added a Collate class with predefined
 constants for several languages. That's too much work for us.
 But the API doesn't need to preclude adding those.

Indeed, some kind of Collator is in order. On the use side of things it's 
simply a functor that compares strings. The fact that it's full of 
tables and the like is well hidden. The only thing above that is caching 
preprocessed strings, which may be useful for databases and string indexes.
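
A sketch of the shape such a thing could take. Everything here is hypothetical: the Collator struct is invented for illustration, and its key function is a stand-in (lower-casing via std.uni.toLower) where a real implementation would build a UCA sort key from the collation tables.

```d
import std.algorithm : sort;
import std.uni : toLower;

/// Hypothetical collator: a functor ordering strings by a cached sort key.
struct Collator
{
    private string[string] cache; // original string -> preprocessed key

    /// Stand-in for a real UCA sort key; here just the lower-cased string.
    string key(string s)
    {
        if (auto p = s in cache)
            return *p;
        return cache[s] = s.toLower;
    }

    /// "Less than" under this collation.
    bool opCall(string a, string b)
    {
        return key(a) < key(b);
    }
}

unittest
{
    Collator coll;
    auto words = ["banana", "apple", "Cherry"];
    sort!((a, b) => coll(a, b))(words);
    assert(words == ["apple", "banana", "Cherry"]);
}
```

The caller only sees a comparison predicate; the cache is the "preprocessed strings" part and could be dropped or replaced without touching user code.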

 The topic matter is complex, but not difficult (as in rocket science).
 If we really want to find a solution, we should form an expert group
 and stop talking until we read the latest Unicode specs.

 Well, I did. You seem motivated, would you like to join the group?

 Yes, I'd like to see a Unicode 6.x approved stamp on D.
 I didn't know that you already wrote all the simple algorithms
 for 2.064. Those would have been my candidates to work on, too.
 Is there anything that can be implemented in a day or two? :)

Cool, consider yourself enlisted :)
I reckon word and line breaking algorithms are a piece of cake compared to 
UCA. Given the power toys of CodepointSet and toTrie it shouldn't be 
that hard to come up with a prototype. Then we just move precomputed 
versions of related tries to std/internal/ and that's it, ready for 
public consumption.
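
To give a feel for the building blocks, here is a toy sketch. It is not the real word breaking algorithm from UAX #29: a "word" is simply assumed to be a maximal run of Alphabetic code points, and countWords is a made-up name. Only unicode.Alphabetic and toTrie come from std.uni.

```d
import std.uni : unicode, toTrie;

/// Counts maximal runs of alphabetic code points (a crude notion of "word").
size_t countWords(string text)
{
    // 2-level trie built from a code point set: O(1) membership test.
    auto isLetter = toTrie!2(unicode.Alphabetic);

    size_t words;
    bool inWord;
    foreach (dchar c; text) // decodes UTF-8 on the fly
    {
        immutable letter = isLetter[c];
        if (letter && !inWord)
            ++words;
        inWord = letter;
    }
    return words;
}

unittest
{
    assert(countWords("Grüße an alle, свет!") == 4);
}
```

A real prototype would build one trie per Word_Break property class instead of a single Alphabetic set, and precompute them rather than constructing them per call.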

 D (or any library for that matter) won't ever have all possible
 tinkering that Unicode standard permits. So I expect D to be "done" with
 Unicode one day simply by reaching a point of having all universally
 applicable stuff (and stated defaults) plus having a toolbox to craft
 your own versions of algorithms. This is the goal of new std.uni.

 Sorting strings is a very basic feature, but as I learned now
 also highly complex.  I expected some kind of tables for
 download that would suffice, but the rules are pretty detailed.
 E.g. in German phonebook order, ä/ö/ü has the same order as
 ae/oe/ue.

This is tailoring, an awful thing that makes cultural differences what 
they are in Unicode ;)
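
A toy version of that particular tailoring, just to show the idea. phonebookKey is a hypothetical helper that only expands the three lower-case umlauts; a real tailoring would be expressed as changes to the collation tables, not as string replacement.

```d
import std.array : replace;

/// Hypothetical sort key for German phonebook order: ä/ö/ü sort as ae/oe/ue.
string phonebookKey(string s)
{
    return s.replace("ä", "ae")
            .replace("ö", "oe")
            .replace("ü", "ue");
}

unittest
{
    // Raw code unit comparison puts "Müller" after "Muster" ...
    assert("Müller" > "Muster");
    // ... while the tailored key orders it like "Mueller".
    assert(phonebookKey("Müller") == "Mueller");
    assert(phonebookKey("Müller") < phonebookKey("Muster"));
}
```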

What we need first and foremost is a DUCET-based version (Default Unicode 
Collation Element Table).

-- 
Dmitry Olshansky

Mar 19 2014

Marco Leise <Marco.Leise gmx.de> writes:

On Thu, 20 Mar 2014 01:55:08 +0400,
Dmitry Olshansky <dmitry.olsh gmail.com> wrote:

 Well, turns out the Unicode standard ties equivalence to normalization
 forms. In other words unless both your strings are normalized the same
 way there is really no point in trying to compare them.

 As for opaque type - we could have say String!NFC and String!NFD or
 some-such. It would then make sure the normalization is the right one.

And I thought of going the slow route where normalized and
unnormalized strings can coexist and be compared. No NFD or
NFC, just UTF8 strings.

Pros:
+ Learning about normalization isn't needed to use strings
  correctly. And few people do that.
+ Strings don't need to be normalized. Every modification to
  data is bad, e.g. when said string is fed back to the
  source. Think about a file name on a file system where a
  different normalization is a different file.

Cons:
- Comparisons for already normalized strings are unnecessarily
  slow. Maybe the normalization form (NFC, NFD, mixed) could be
  stored alongside the string.
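
A sketch of how that could look. The Form enum, the TaggedString struct and the equivalent function are all hypothetical names; only normalize and NFC are taken from std.uni.

```d
import std.uni;

/// Hypothetical tag recording what is known about a string's normalization.
enum Form { unknown, nfc, nfd }

/// Hypothetical string that is stored untouched, plus its known form.
struct TaggedString
{
    string data;
    Form form;
}

/// Compares for canonical equivalence without modifying either operand.
bool equivalent(TaggedString a, TaggedString b)
{
    // Fast path: both are known to be in the same form.
    if (a.form != Form.unknown && a.form == b.form)
        return a.data == b.data;
    // Slow path: normalize temporary copies; the originals stay as they are.
    return normalize!NFC(a.data) == normalize!NFC(b.data);
}

unittest
{
    auto composed   = TaggedString("caf\u00E9", Form.nfc);
    auto decomposed = TaggedString("cafe\u0301", Form.nfd);
    assert(equivalent(composed, decomposed));
    assert(composed.data != decomposed.data); // the stored bytes are unchanged
}
```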

 Cool, consider yourself enlisted :)
 I reckon word and line breaking algorithms are a piece of cake compared to
 UCA. Given the power toys of CodepointSet and toTrie it shouldn't be
 that hard to come up with a prototype. Then we just move precomputed
 versions of related tries to std/internal/ and that's it, ready for
 public consumption.

Would a typical use case be to find the previous/next boundary
given a code unit index? E.g. the cursor sits on a word and
you want to jump to the start or end of it. Just iterating the
words and lines might not be too useful.
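
Something like the following is what I mean, as a rough sketch. wordStart and wordEnd are made-up names, a "word" is simplified to a run of alphabetic code points (not the real Unicode word break rules), and the index is assumed to sit on a code point boundary.

```d
import std.uni : isAlpha;
import std.utf : decode, strideBack;

/// Code unit index where the word around `pos` starts.
size_t wordStart(string s, size_t pos)
{
    while (pos > 0)
    {
        // Step back one whole code point and look at it.
        immutable prev = pos - strideBack(s, pos);
        size_t i = prev;
        immutable dchar c = decode(s, i);
        if (!isAlpha(c))
            break;
        pos = prev;
    }
    return pos;
}

/// Code unit index just past the end of the word around `pos`.
size_t wordEnd(string s, size_t pos)
{
    while (pos < s.length)
    {
        size_t next = pos;
        immutable dchar c = decode(s, next); // advances `next` past the code point
        if (!isAlpha(c))
            break;
        pos = next;
    }
    return pos;
}

unittest
{
    string s = "grüße dich";
    // Index 4 is the first code unit of 'ß'.
    assert(s[wordStart(s, 4) .. wordEnd(s, 4)] == "grüße");
}
```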

 D (or any library for that matter) won't ever have all possible
 tinkering that Unicode standard permits. So I expect D to be "done" with
 Unicode one day simply by reaching a point of having all universally
 applicable stuff (and stated defaults) plus having a toolbox to craft
 your own versions of algorithms. This is the goal of new std.uni.

 Sorting strings is a very basic feature, but as I learned now
 also highly complex.  I expected some kind of tables for
 download that would suffice, but the rules are pretty detailed.
 E.g. in German phonebook order, ä/ö/ü has the same order as
 ae/oe/ue.

 This is tailoring, an awful thing that makes cultural differences what
 they are in Unicode ;)

 What we need first and foremost is a DUCET-based version (Default Unicode
 Collation Element Table).

Of course.

-- 
Marco

Mar 19 2014
