
digitalmars.D - string is rarely useful as a function argument

reply Peter Alexander <peter.alexander.au gmail.com> writes:
string is immutable(char)[]

I rarely *ever* need an immutable string. What I usually need is 
const(char)[]. I'd say 99%+ of the time I need only a const string.

This is quite irritating because "string" is the most convenient and 
intuitive thing to type. I often get into situations where I've written 
a function that takes a string, and then I can't call it because all I 
have is a char[]. I could copy the char[] into a new string, but that's 
expensive, and I'd rather I could just call the function.

I think it's telling that most Phobos functions use 'const(char)[]' or 
'in char[]' instead of 'string' for their arguments. The ones that use 
'string' are usually using it unnecessarily and should be fixed to use 
const(char)[].

In an ideal world I'd much prefer if string was an alias for 
const(char)[], but string literals were immutable(char)[]. It would 
require a little more effort when dealing with concurrency, but that's a 
price I would be willing to pay to make the string alias useful in 
function parameters.
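The mismatch can be sketched in a few lines (the function names here are invented for illustration):

```d
import std.stdio : writeln;

// Hypothetical functions, for illustration only.
void takesString(string s) { writeln(s); }        // wants immutable(char)[]
void takesConst(const(char)[] s) { writeln(s); }  // accepts mutable or immutable

void main()
{
    char[] buf = "hello".dup;  // all I have is a char[]
    // takesString(buf);       // compile error: char[] doesn't convert to string
    takesString(buf.idup);     // works, but makes an expensive copy
    takesConst(buf);           // works with no copy
}
```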
Dec 28 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Peter Alexander:

 I often get into situations where I've written 
 a function that takes a string, and then I can't call it because all I 
 have is a char[].

I suggest you show some of these situations.
 I think it's telling that most Phobos functions use 'const(char)[]' or 
 'in char[]' instead of 'string' for their arguments. The ones that use 
 'string' are usually using it unnecessarily and should be fixed to use 
 const(char)[].

What are the Phobos functions that unnecessarily accept a string?

Bye,
bearophile
Dec 28 2011
next sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 28/12/11 12:42 PM, bearophile wrote:
 Peter Alexander:

 I often get into situations where I've written
 a function that takes a string, and then I can't call it because all I
 have is a char[].

I suggest you show some of these situations.

Any time you want to create a string without allocating memory.

char[N] buffer;
// write into buffer
// try to use buffer as string
 I think it's telling that most Phobos functions use 'const(char)[]' or
 'in char[]' instead of 'string' for their arguments. The ones that use
 'string' are usually using it unnecessarily and should be fixed to use
 const(char)[].

What are the Phobos functions that unnecessarily accept a string?

Good question. I can't see any just now, although I have come across some in the past. Perhaps they have already been fixed.
Dec 28 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Peter Alexander:

 Any time you want to create a string without allocating memory.
 
 char[N] buffer;
 // write into buffer
 // try to use buffer as string

I have discussed this topic a bit, two or three times. In one post I even suggested the idea of "scoped immutability", which was not appreciated. Generally, creating immutable data structures is a source of trouble in all languages, and in D it's not a well-solved problem yet. In D today you are sometimes able to rewrite that as:

string foo(in int n) pure {
    auto buffer = new char[n];
    // write into buffer
    return buffer;
}

void bar(string s) {}

void main() {
    string s = foo(5);
    bar(s); // use buffer as string
}

Bye,
bearophile
Dec 28 2011
parent Peter Alexander <peter.alexander.au gmail.com> writes:
On 28/12/11 1:27 PM, bearophile wrote:
 Peter Alexander:

 Any time you want to create a string without allocating memory.

 char[N] buffer;
 // write into buffer
 // try to use buffer as string

 I have discussed this topic a bit, two or three times. In one post I even suggested the idea of "scoped immutability", which was not appreciated. Generally, creating immutable data structures is a source of trouble in all languages, and in D it's not a well-solved problem yet. In D today you are sometimes able to rewrite that as:

 string foo(in int n) pure {
     auto buffer = new char[n];
     // write into buffer
     return buffer;
 }

 void bar(string s) {}

 void main() {
     string s = foo(5);
     bar(s); // use buffer as string
 }

 Bye,
 bearophile

That only works when you allocate memory for the string, which is what I would like to avoid.
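For reference, Phobos also offers std.exception.assumeUnique, which turns a char[] into a string without copying, but it doesn't help here either: it requires a uniquely-owned (in practice, freshly allocated) buffer, so a reusable stack buffer cannot be safely blessed. A minimal sketch:

```d
import std.exception : assumeUnique;

string buildGreeting()
{
    auto buf = new char[](5); // heap allocation is still required
    buf[] = "hello";          // fill the buffer in place
    return assumeUnique(buf); // no copy, but buf must never be touched again
}

void main()
{
    assert(buildGreeting() == "hello");
}
```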
Dec 28 2011
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 5:16 AM, Peter Alexander wrote:
 Any time you want to create a string without allocating memory.

 char[N] buffer;
 // write into buffer
 // try to use buffer as string

Is the buffer ever going to be reused with a different string in it?
Dec 28 2011
parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 28/12/11 5:16 PM, Walter Bright wrote:
 On 12/28/2011 5:16 AM, Peter Alexander wrote:
 Any time you want to create a string without allocating memory.

 char[N] buffer;
 // write into buffer
 // try to use buffer as string

Is the buffer ever going to be reused with a different string in it?

Possibly. I know what argument is coming next: "But if the function you call stores the string you passed in, then it can't rely on seeing a consistent value!"

I know this. These functions should request immutable(char)[] because that's what they need. Functions that don't store the string should use const(char)[].

The question is whether string should alias immutable(char)[] or const(char)[]. In my experience (which is echoed in Phobos), const(char)[] is used much more often than immutable(char)[], so string should alias const(char)[].
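The storing/inspecting distinction can be sketched like this (function names invented for illustration):

```d
// Stores its argument: immutable is genuinely needed so the kept
// reference can't change underneath it later.
string lastSeen;
void remember(string s) { lastSeen = s; }

// Merely inspects its argument: const suffices, and accepts
// char[], const(char)[], and string alike.
size_t countSpaces(const(char)[] s)
{
    size_t n;
    foreach (c; s)
        if (c == ' ')
            ++n;
    return n;
}

void main()
{
    char[] buf = "a b c".dup;
    assert(countSpaces(buf) == 2); // const parameter: no copy needed
    remember(buf.idup);            // storing: hand over an immutable copy
    assert(lastSeen == "a b c");
}
```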
Dec 28 2011
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 07:07 PM, Peter Alexander wrote:
 On 28/12/11 5:16 PM, Walter Bright wrote:
 On 12/28/2011 5:16 AM, Peter Alexander wrote:
 Any time you want to create a string without allocating memory.

 char[N] buffer;
 // write into buffer
 // try to use buffer as string

Is the buffer ever going to be reused with a different string in it?

 Possibly. I know what argument is coming next: "But if the function you call stores the string you passed in, then it can't rely on seeing a consistent value!"

 I know this. These functions should request immutable(char)[] because that's what they need. Functions that don't store the string should use const(char)[].

 The question is whether string should alias immutable(char)[] or const(char)[]. In my experience (which is echoed in Phobos), const(char)[] is used much more often than immutable(char)[], so string should alias const(char)[].

You are approximately saying (paraphrasing): "The question is whether a cow is a cow or an animal. In my experience (which is echoed at the farm down the valley) is that there are more animals than there are cows. So we should call all our animals cows."
Dec 28 2011
parent Peter Alexander <peter.alexander.au gmail.com> writes:
On 28/12/11 6:03 PM, Timon Gehr wrote:
 On 12/28/2011 07:07 PM, Peter Alexander wrote:
 On 28/12/11 5:16 PM, Walter Bright wrote:
 On 12/28/2011 5:16 AM, Peter Alexander wrote:
 Any time you want to create a string without allocating memory.

 char[N] buffer;
 // write into buffer
 // try to use buffer as string

Is the buffer ever going to be reused with a different string in it?

 Possibly. I know what argument is coming next: "But if the function you call stores the string you passed in, then it can't rely on seeing a consistent value!"

 I know this. These functions should request immutable(char)[] because that's what they need. Functions that don't store the string should use const(char)[].

 The question is whether string should alias immutable(char)[] or const(char)[]. In my experience (which is echoed in Phobos), const(char)[] is used much more often than immutable(char)[], so string should alias const(char)[].

 You are approximately saying (paraphrasing): "The question is whether a cow is a cow or an animal. In my experience (which is echoed at the farm down the valley) is that there are more animals than there are cows. So we should call all our animals cows."

No, I'm saying that people talk about animals more often than cows, so it should be easier and more intuitive to say "animal" than it is to say "cow". People can still call things cows if that is what they're talking about.
Dec 28 2011
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 10:07 AM, Peter Alexander wrote:
 On 28/12/11 5:16 PM, Walter Bright wrote:
 On 12/28/2011 5:16 AM, Peter Alexander wrote:
 Any time you want to create a string without allocating memory.

 char[N] buffer;
 // write into buffer
 // try to use buffer as string

Is the buffer ever going to be reused with a different string in it?

 Possibly. I know what argument is coming next: "But if the function you call stores the string you passed in, then it can't rely on seeing a consistent value!"

Exactly.
 I know this. These functions should request immutable(char)[] because that's
 what they need. Functions that don't store the string should use const(char)[].

 The question is whether string should alias immutable(char)[] or const(char)[].
 In my experience (which is echoed in Phobos) is that const(char)[] is used much
 more often than immutable(char)[], so it should alias const(char)[].

If such a change is made, then people will use const string when they mean immutable, and the values underneath are not guaranteed to be consistent.
Dec 28 2011
parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 28/12/11 6:15 PM, Walter Bright wrote:
 On 12/28/2011 10:07 AM, Peter Alexander wrote:
 The question is whether string should alias immutable(char)[] or
 const(char)[].
 In my experience (which is echoed in Phobos) is that const(char)[] is
 used much
 more often than immutable(char)[], so it should alias const(char)[].

If such a change is made, then people will use const string when they mean immutable, and the values underneath are not guaranteed to be consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.
Dec 28 2011
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string when they
 mean immutable, and the values underneath are not guaranteed to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice:

#1: string s;
#2: immutable(char)[] s;

sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.
Dec 28 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string when they
 mean immutable, and the values underneath are not guaranteed to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

 People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice:

 #1: string s;
 #2: immutable(char)[] s;

 sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

Yes. Contrary to the OP, I don't think it's fair to dismiss a valid concern by framing it as a user education issue. It has very often been aired in the olden days of C++, and never in a winning argument. (Right off the bat - auto_ptr.)

Andrei
Dec 28 2011
parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 10:56 AM, Andrei Alexandrescu wrote:
 Yes. Contrary to the OP, I don't think it's fair to dismiss a valid concern by
 framing it as a user education issue. It's has very often been aired in the
 olden days of C++, and never in a winning argument. (Right off the bat -
auto_ptr.)

And as Bruce Eckel discovered, even people who know better will deliberately pick the wrong method, because it's easier, and they justify it to themselves by saying they'll go back and fix it later. And of course that doesn't happen.

Bruce decided there was something fundamentally wrong with a feature when he found himself writing articles exhorting people to do X instead of Y, and then, in his own code, preferring to do the simpler Y.
Dec 28 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string when they
 mean immutable, and the values underneath are not guaranteed to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

 People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice:

 #1: string s;
 #2: immutable(char)[] s;

 sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

Oh, one more thing - one good thing that could come out of this thread is the abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length, and s.rep[i] instead of s[i], would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.

Then people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.

Andrei
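The .rep idea was never adopted; a rough user-level sketch of the shape Andrei describes (the type and all names are hypothetical, not an actual Phobos facility) might look like:

```d
import std.range : walkLength;

// Hypothetical wrapper illustrating the proposal.
struct NarrowString
{
    private immutable(char)[] data;

    // The raw representation is available, but only on explicit
    // request, and as bytes rather than chars.
    @property immutable(ubyte)[] rep() const
    {
        return cast(immutable(ubyte)[]) data;
    }

    // Anything character-like goes through decoding.
    @property size_t length() const
    {
        return data.walkLength; // code points, not code units
    }
}

void main()
{
    auto s = NarrowString("héllo"); // 'é' occupies two UTF-8 code units
    assert(s.rep.length == 6);      // code units (bytes)
    assert(s.length == 5);          // code points
}
```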
Dec 28 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 1:17 PM, Robert Jacques wrote:
 Would slicing, i.e. s[i..j] still be valid?

No, only s.rep[i .. j].
 If so, what would be the
 recommended way of finding i and j?

find, findSplit, etc. from std.algorithm; std.utf functions; etc.

Andrei
Dec 28 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 08:29 PM, Andrei Alexandrescu wrote:
 On 12/28/11 1:17 PM, Robert Jacques wrote:
 Would slicing, i.e. s[i..j] still be valid?

No, only s.rep[i .. j].

That does not do the right thing. It would look more like cast(string)s.rep[i .. j].
 If so, what would be the
 recommended way of finding i and j?

find, findSplit etc. from std.algorithm, std.utf functions etc. Andrei

Dec 28 2011
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 1:18 PM, foobar wrote:
 That's a good idea, though I wonder about its implementation strategy.

Implementation would entail a change in the compiler. Andrei
Dec 28 2011
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 08:30 PM, Andrei Alexandrescu wrote:
 On 12/28/11 1:18 PM, foobar wrote:
 That's a good idea, though I wonder about its implementation strategy.

Implementation would entail a change in the compiler. Andrei

Special-casing char[] and wchar[] in the language would be extremely ugly and inconsistent, and would break nearly every D program. And for me, it would cripple D's strings quite a lot. Why do you think it is worthwhile?
Dec 28 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 1:48 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei Alexandrescu wrote:
 On 12/28/11 1:18 PM, foobar wrote:
 That's a good idea, though I wonder about its implementation strategy.

Implementation would entail a change in the compiler. Andrei

Why? D should be plenty powerful to implement this without modifying the compiler. It sounds like you are suggesting that char[] should behave differently from other T[], which is a very poor idea IMO.

It's an awesome idea, but for an academic debate at best. Andrei
Dec 28 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 4:18 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 21:57:00 UTC, Andrei Alexandrescu wrote:
 On 12/28/11 1:48 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei Alexandrescu
 wrote:
 On 12/28/11 1:18 PM, foobar wrote:
 That's a good idea, though I wonder about its implementation strategy.

Implementation would entail a change in the compiler. Andrei

Why? D should be plenty powerful to implement this without modifying the compiler. Sounds like you suggest that char[] will behave differently than other T[] which is a very poor idea IMO.

It's an awesome idea, but for an academic debate at best. Andrei

I don't follow you. You've suggested a change that I agree with. Adam provided a prototype string library type that accomplishes your specified goals without any changes to the compiler. What are we missing here? If it boils down to changing the compiler or leaving the status quo, I'm voting against the compiler change.

If we have two facilities (string and e.g. String) we've lost. We'd need to slowly change the built-in string type.

I discussed the matter with Walter. He completely disagrees, and sees the idea as a sheer way to complicate stuff for no good reason. He mentions how he frequently uses .length, indexing, and slicing in narrow strings. I know Walter's code, so I know where he's coming from. He understands UTF in and out, and I have zero doubt he actually knows all the essential constants, masks, and ranges by heart. I've seen his code, and indeed it's an amazing feat of minimal, opportunistic, on-demand decoding.

So I know where he's coming from, but I also know next to nobody codes like him. A casual string user almost always writes string code (iteration, indexing) the wrong way and would be tremendously helped by a clean distinction between abstraction and representation.

Nagonna happen.

Andrei
Dec 28 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 8:32 PM, Adam D. Ruppe wrote:
 On Thursday, 29 December 2011 at 04:17:37 UTC, Andrei Alexandrescu wrote:
 If we have two facilities (string and e.g. String) we've lost. We'd need to
 slowly change the built-in string type.

Have you actually tried to do it?

I've seen the damage done in C++ with multiple string types. Being able to convert from one to the other doesn't help much.
Dec 28 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 11:36 PM, Walter Bright wrote:
 On 12/28/2011 8:32 PM, Adam D. Ruppe wrote:
 On Thursday, 29 December 2011 at 04:17:37 UTC, Andrei Alexandrescu wrote:
 If we have two facilities (string and e.g. String) we've lost. We'd
 need to
 slowly change the built-in string type.

Have you actually tried to do it?

I've seen the damage done in C++ with multiple string types. Being able to convert from one to the other doesn't help much.

This. The only solution is to explain to Walter that no other programmer in the world codes UTF like him. Really. I emulate that sometimes (learned from him), but I see code from hundreds of people day in and day out - it's never like his.

Once we convince him, he'll be like "ah, I see what you mean. Requiring .rep is awesome. Let's do it."

Andrei
Dec 28 2011
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, December 29, 2011 07:33:28 Jakob Ovrum wrote:
 I don't think this is a problem you can solve without educating
 people. They will need to know a thing or two about how UTF works
 to know the performance implications of many of the "safe" ways
 to handle UTF strings. Further, for much use of Unicode strings
 in D you can't get away with not knowing anything anyway because
 D only abstracts up to code points, not graphemes. Imagine trying
 to explain to the unknowing programmer what is going on when an
 algorithm function broke his grapheme and he doesn't know the
 first thing about Unicode.
 
 I'm not claiming to be an expert myself, but I believe D offers
 Unicode the right way as it is.

Ultimately, the programmer _does_ need to understand Unicode properly if they're going to write code which is both correct and efficient. However, if the easy way to use strings in D is correct, even if it's not as efficient as we'd like, then at least code will tend to be correct in its use of Unicode. And then, if the programmer wants their string processing to be more efficient, they need to actually learn how Unicode works so that they can code for it more efficiently.

The issue, however, is that it's currently _way_ too easy to use strings completely incorrectly and operate on code units as if they were characters. A _lot_ of programmers will be using string and char[] as if a char were a character, and that's going to create a lot of bugs. Making it harder to operate on a char[] or string as if it were an array of characters would seriously reduce such bugs, and on some level would force people to become better educated about Unicode.

No, it doesn't completely solve the problem, since then we're operating at the code point level rather than the grapheme level, but it's still a _lot_ better than operating on the code unit level, as is likely to happen now.

- Jonathan M Davis
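A short example of the trap being described:

```d
import std.range : walkLength;

void main()
{
    string s = "weiß";          // 'ß' encodes as two UTF-8 code units
    assert(s.length == 5);      // code units, NOT characters
    assert(s.walkLength == 4);  // code points: what a naive reader expects
    assert(s[3] != 'ß');        // indexing yields a code unit, not the 'ß'
}
```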
Dec 28 2011
parent deadalnix <deadalnix gmail.com> writes:
On 29/12/2011 07:48, Jonathan M Davis wrote:
 On Thursday, December 29, 2011 07:33:28 Jakob Ovrum wrote:
 I don't think this is a problem you can solve without educating
 people. They will need to know a thing or two about how UTF works
 to know the performance implications of many of the "safe" ways
 to handle UTF strings. Further, for much use of Unicode strings
 in D you can't get away with not knowing anything anyway because
 D only abstracts up to code points, not graphemes. Imagine trying
 to explain to the unknowing programmer what is going on when an
 algorithm function broke his grapheme and he doesn't know the
 first thing about Unicode.

 I'm not claiming to be an expert myself, but I believe D offers
 Unicode the right way as it is.

 Ultimately, the programmer _does_ need to understand Unicode properly if they're going to write code which is both correct and efficient. However, if the easy way to use strings in D is correct, even if it's not as efficient as we'd like, then at least code will tend to be correct in its use of Unicode. And then, if the programmer wants their string processing to be more efficient, they need to actually learn how Unicode works so that they can code for it more efficiently.

 The issue, however, is that it's currently _way_ too easy to use strings completely incorrectly and operate on code units as if they were characters. A _lot_ of programmers will be using string and char[] as if a char were a character, and that's going to create a lot of bugs. Making it harder to operate on a char[] or string as if it were an array of characters would seriously reduce such bugs, and on some level would force people to become better educated about Unicode.

 No, it doesn't completely solve the problem, since then we're operating at the code point level rather than the grapheme level, but it's still a _lot_ better than operating on the code unit level, as is likely to happen now.

 - Jonathan M Davis

That is the whole point of D, IMO. I don't think we should let an ego question dictate a language decision.
Dec 29 2011
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 10:08 PM, Andrei Alexandrescu wrote:
 The only solution is to explain Walter no other programmer in the world codes
 UTF like him. Really. I emulate that sometimes (learned from him) but I see
code
 from hundreds of people day in and day out - it's never like his.

 Once we convince him, he'll be like "ah, I see what you mean. Requiring .rep is
 awesome. Let's do it."

If that ever happens, I owe you a beer. Maybe two! Maybe it's hubris, but I think D nails what a string type should be. I'm extremely reluctant to mess with its success. It strikes the right balance between aesthetics, efficiency and utility. C++11 and C11 appear to have copied it.
Dec 28 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/29/2011 07:53 AM, Walter Bright wrote:
 On 12/28/2011 10:08 PM, Andrei Alexandrescu wrote:
 The only solution is to explain Walter no other programmer in the
 world codes
 UTF like him. Really. I emulate that sometimes (learned from him) but
 I see code
 from hundreds of people day in and day out - it's never like his.

 Once we convince him, he'll be like "ah, I see what you mean.
 Requiring .rep is
 awesome. Let's do it."

If that ever happens, I owe you a beer. Maybe two! Maybe it's hubris, but I think D nails what a string type should be. I'm extremely reluctant to mess with its success. It strikes the right balance between aesthetics, efficiency and utility.

I fully agree. If I had to design an imperative programming language, this is how its strings would work.
 C++11 and C11 appear to have copied it.

Dec 29 2011
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 10:33 PM, Jakob Ovrum wrote:
 I don't think this is a problem you can solve without educating people. They
 will need to know a thing or two about how UTF works to know the performance
 implications of many of the "safe" ways to handle UTF strings. Further, for
much
 use of Unicode strings in D you can't get away with not knowing anything anyway
 because D only abstracts up to code points, not graphemes. Imagine trying to
 explain to the unknowing programmer what is going on when an algorithm function
 broke his grapheme and he doesn't know the first thing about Unicode.

 I'm not claiming to be an expert myself, but I believe D offers Unicode the
 right way as it is.

I think this goes to the point that, eventually, the language is no longer able to hide the realities of the underlying machine. This happens with floating point (they are NOT mathematical real numbers), integers (they overflow), etc.

Keep in mind that D already has a string type where the code points match the characters: dstring
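A quick sketch of that property of dstring:

```d
void main()
{
    dstring d = "weiß"d;     // 'ß' is a single dchar code unit
    assert(d.length == 4);   // one code unit per code point
    assert(d[3] == 'ß');     // indexing reaches the code point directly
}
```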
Dec 28 2011
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/29/11 12:01 AM, Adam D. Ruppe wrote:
 On Thursday, 29 December 2011 at 05:37:00 UTC, Walter Bright wrote:
 I've seen the damage done in C++ with multiple string types. Being
 able to convert from one to the other doesn't help much.

Note that I'm on your side here re strings, but you're underselling the D language too! These conversions are implicit both ways, and completely free. D structs can wrap other D types perfectly well.

Nah, that still breaks a lotta code, because people parameterize on T[], use isSomeString/isSomeChar, etc. Nagonna.

Andrei
Dec 28 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, December 29, 2011 11:32:52 Sean Kelly wrote:
 Don't we already have String-like support with ranges?  I'm not sure I
 understand the point in having special behavior for char arrays.

To avoid common misuse. It's way too easy to misuse the length property on narrow strings. Programmers shouldn't be using the length property on narrow strings unless they know what they're doing, but it's likely the first thing that any programmer will reach for to get the length of a string, because that's how arrays in general work.

If it weren't legal to simply use the length property of a char[], or to directly slice or index it, then those common misuses would be harder to commit. You could still do them via .rep or .raw or whatever we'd call it, but it would no longer be the path of least resistance.

Yes, Phobos may avoid the issue, because for the most part its developers understand the issues, but many programmers who do not understand them will make mistakes in their own code, mistakes that should arguably be harder to make, simply because the wrong way is the path of least resistance and they don't know any better.

- Jonathan M Davis
Dec 29 2011
prev sibling next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 08:18 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 19:00:53 UTC, Andrei Alexandrescu wrote:
 On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string when they
 mean immutable, and the values underneath are not guaranteed to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

 People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice:

 #1: string s;
 #2: immutable(char)[] s;

 sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

 Oh, one more thing - one good thing that could come out of this thread is the abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length, and s.rep[i] instead of s[i], would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.

 Then people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.

 Andrei

That's a good idea, though I wonder about its implementation strategy. ATM string is simply an alias for a char array; are you suggesting string should be a wrapper struct instead (like the one previously suggested by Steven)? I'm all for making string a properly encapsulated type.

In what way would the proposed change improve encapsulation, and why would it even be desirable for such a basic data structure?
Dec 28 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 08:55 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 19:38:53 UTC, Timon Gehr wrote:
 [snip]
 I'm all for making string a properly encapsulated type.

In what way would the proposed change improve encapsulation, and why would it even be desirable for such a basic data structure?

I'm not sure what you are asking here. Are you asking what the benefits of encapsulation are?

I know the benefits of encapsulation and none of them applies here. The proposed change is nothing but a breaking interface change.
 This topic was discussed to death more than
 once and I'd suggest searching the NG archives for the details. Also, If
 you hadn't already I'd suggest reading about Unicode and its levels of
 abstraction: code point, code units, graphemes, etc...

'char' is a code unit. Therefore that is the level of abstraction the data type char[] provides.
Dec 28 2011
prev sibling next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 08:00 PM, Andrei Alexandrescu wrote:
 On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string when they
 mean immutable, and the values underneath are not guaranteed to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice: #1: string s; #2: immutable(char)[] s; sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.

Why? char and wchar are unicode code units, ubyte/ushort are unsigned integrals. It is clear that char/wchar are a better match.
 Then, people would access the decoding routines on the needed occasions,
 or would consciously use the representation.

 Yum.


 Andrei

Dec 28 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
Apparently my previous post was lost. Apologies if this comes out twice.

On 12/28/2011 09:39 PM, Jonathan M Davis wrote:
 On Wednesday, December 28, 2011 21:25:39 Timon Gehr wrote:
 Why? char and wchar are unicode code units, ubyte/ushort are unsigned
 integrals. It is clear that char/wchar are a better match.

It's an issue of the correct usage being the easy path. As it stands, it's incredibly easy to use narrow strings incorrectly. By forcing any array of char or wchar to use .rep.length instead of .length, the relatively automatic (and generally incorrect) usage of .length on a string wouldn't immediately work. It would force you to work harder to do the wrong thing. Unfortunately, walkLength isn't necessarily any easier than .rep.length, but it does force people to look into why they can't do .length, which will generally better educate them and will hopefully reduce the misuse of narrow strings.

I was educated enough not to make that mistake, because I read the entire language specification before deciding the language was awesome and downloading the compiler. I find it strange that the product should be made less usable because we do not expect users to read the manual. But it is of course a valid point.
 If we make rep ubyte[] and ushort[] for char[] and wchar[] respectively, then
 we reinforce the fact that you shouldn't operate on chars or wchars.

There is nothing wrong with operating at the code unit level. Efficient slicing is very desirable.
 It also
 makes it simple for the compiler to never allow you to use length on char[] or
 wchar[], since it doesn't have to worry about whether you got that char[] or
 wchar[] from a rep property or not.

 Now, I don't know if this is really a good move at this point. If we were to
 really do this right, we'd need to disallow indexing and slicing of the char[]
 and wchar[] as well, which would break that much more code. It also pretty
 quickly makes it look like string should be its own type rather than an array,
 since it's acting less and less like an array.

Exactly. It is acting less and less like an array of code units. But it *is* an array of code units. If the general consensus is that we need a string data type that acts at a different abstraction level by default (with which I'd disagree, but apparently I don't have a popular opinion here), then we need a string type in the standard library to do that. Changing the language so that an array of code units stops behaving like an array of code units is not a solution.
 Not to mention, even the
 correct usage of .rep would become rather irritating (e.g. slicing it when you
 know that the indices that you're dealing with aren't going to cut into any
 code points), because you'd have to cast from ubyte[] to char[] whenever you
 did that.

 So, I think that the general sentiment behind this is a good one, but I don't
 know if the exact idea is ultimately a good one - particularly at this stage
 in the game. If we're going to make a change like this which would break as
 much code as this would, we'd need to be _very_ certain that it's what we want
 to do.

 - Jonathan M Davis

I agree.
Dec 28 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 11:12 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr wrote:
 I was educated enough not to make that mistake, because I read the
 entire language specification before deciding the language was awesome
 and downloading the compiler. I find it strange that the product
 should be made less usable because we do not expect users to read the
 manual. But it is of course a valid point.

That's awfully optimistic to expect people to read the manual.

Well, if the alternative is slowly butchering the language I will be awfully optimistic about it all day long.
 There is nothing wrong with operating at the code unit level.
 Efficient slicing is very desirable.

I agree that it's useful. It is however the incorrect abstraction level when you need a "string" which is by far the common case in user code.

I would not go as far as to call it 'incorrect'.
 i.e. if I need a name variable in a class: codeUnit[] name; // bug!
 string Name; // correct

From a pragmatic viewpoint it does not matter because if string is used like this, then codeUnit[] does exactly the same thing. Nobody forces anyone to index or slice into a string variable when they don't need that functionality. All engineers have to work with leaky abstractions. Why is it such a big deal?
 I expect that most uses of code-unit arrays should be in the standard
 library anyway since it provides the string manipulation routines. It
 all boils down to making the common case trivial and the rare case
 possible.  You can use the underlying data structure (code units) if you
 need it but the default "string" is what people expect when thinking
 about what such a type does (a string of letters). D's already 80% there
 since Phobos already treats strings as bi-directional ranges of
 code-points which is much closer to the mental image of a string of
 letters, so I think this is about bringing the current design to its
 final conclusion.

Well, that mental image is just not the right one when dealing with Unicode.
 Exactly. It is acting less and less like an array of code units. But
 it *is* an array of code units. If the general consensus is that we
 need a string data type that acts at a different abstraction level by
 default (with which I'd disagree, but apparently I don't have a
 popular opinion here), then we need a string type in the standard
 library to do that. Changing the language so that an array of code
 units stops behaving like an array of code units is not a solution.

I agree that we should not break T[] for any T and instead introduce a library type. While I personally believe that such a change will expose hidden bugs (certainly when unaware programmers treat string as ASCII and the product is later on localized), it's a big disturbance in people's code and it's worth considering whether the benefit is worth the costs. Perhaps, some middle ground could be found such that existing code can rely on existing behavior and the new library type will be an opt-in.

What will such a type offer, except that it disallows indexing and slicing?
Dec 28 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/29/2011 07:45 AM, foobar wrote:
 On Wednesday, 28 December 2011 at 22:39:15 UTC, Timon Gehr wrote:
 On 12/28/2011 11:12 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr wrote:
 I was educated enough not to make that mistake, because I read the
 entire language specification before deciding the language was awesome
 and downloading the compiler. I find it strange that the product
 should be made less usable because we do not expect users to read the
 manual. But it is of course a valid point.

That's awfully optimistic to expect people to read the manual.

Well, if the alternative is slowly butchering the language I will be awfully optimistic about it all day long.
 There is nothing wrong with operating at the code unit level.
 Efficient slicing is very desirable.

I agree that it's useful. It is however the incorrect abstraction level when you need a "string" which is by far the common case in user code.

I would not go as far as to call it 'incorrect'.
 i.e. if I need a name variable in a class: codeUnit[] name; // bug!
 string Name; // correct

From a pragmatic viewpoint it does not matter because if string is used like this, then codeUnit[] does exactly the same thing. Nobody forces anyone to index or slice into a string variable when they don't need that functionality. All engineers have to work with leaky abstractions. Why is it such a big deal?
 I expect that most uses of code-unit arrays should be in the standard
 library anyway since it provides the string manipulation routines. It
 all boils down to making the common case trivial and the rare case
 possible. You can use the underlying data structure (code units) if you
 need it but the default "string" is what people expect when thinking
 about what such a type does (a string of letters). D's already 80% there
 since Phobos already treats strings as bi-directional ranges of
 code-points which is much closer to the mental image of a string of
 letters, so I think this is about bringing the current design to its
 final conclusion.

Well, that mental image is just not the right one when dealing with Unicode.
 Exactly. It is acting less and less like an array of code units. But
 it *is* an array of code units. If the general consensus is that we
 need a string data type that acts at a different abstraction level by
 default (with which I'd disagree, but apparently I don't have a
 popular opinion here), then we need a string type in the standard
 library to do that. Changing the language so that an array of code
 units stops behaving like an array of code units is not a solution.

I agree that we should not break T[] for any T and instead introduce a library type. While I personally believe that such a change will expose hidden bugs (certainly when unaware programmers treat string as ASCII and the product is later on localized), it's a big disturbance in people's code and it's worth considering whether the benefit is worth the costs. Perhaps, some middle ground could be found such that existing code can rely on existing behavior and the new library type will be an opt-in.

What will such a type offer, except that it disallows indexing and slicing?

From a pragmatic viewpoint people can also continue programming in C++ instead of investing a lot of effort learning a new language.

I disagree. Pragmatism: "Dealing with things sensibly and realistically in a way that is based on practical rather than theoretical considerations." In practice, programming in D beats the pants off programming in C++.
 The only difference between programming languages is the human interface
 aspect.

No. There is also the aspect of how well it maps to the machine it will run on. An interface always has two sides.
 Anything you can program with D you could also do in assembly
 yet you prefer D because it's more convenient.

I prefer D because it is more productive.
 In that regard, a code-unit array is definitely worse than a string type.

A code-unit array type is a string type, albeit a simple one.
 A programmer can choose to either change his 'naive' mental image or
 change the programming language.  Most will do the latter.

A programmer either does not care how D strings work, or he is happy that they are so simple to work with.
 Computers need to adapt and be human friendly, not vice-versa.

When I meet a computer that adapts itself in order to be human friendly, I'll buy you a cookie.
Dec 29 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, December 28, 2011 21:25:39 Timon Gehr wrote:
 Why? char and wchar are unicode code units, ubyte/ushort are unsigned
 integrals. It is clear that char/wchar are a better match.

It's an issue of the correct usage being the easy path. As it stands, it's incredibly easy to use narrow strings incorrectly. By forcing any array of char or wchar to use .rep.length instead of .length, the relatively automatic (and generally incorrect) usage of .length on a string wouldn't immediately work. It would force you to work harder to do the wrong thing. Unfortunately, walkLength isn't necessarily any easier than .rep.length, but it does force people to look into why they can't do .length, which will generally better educate them and will hopefully reduce the misuse of narrow strings.

If we make rep ubyte[] and ushort[] for char[] and wchar[] respectively, then we reinforce the fact that you shouldn't operate on chars or wchars. It also makes it simple for the compiler to never allow you to use length on char[] or wchar[], since it doesn't have to worry about whether you got that char[] or wchar[] from a rep property or not.

Now, I don't know if this is really a good move at this point. If we were to really do this right, we'd need to disallow indexing and slicing of the char[] and wchar[] as well, which would break that much more code. It also pretty quickly makes it look like string should be its own type rather than an array, since it's acting less and less like an array.

Not to mention, even the correct usage of .rep would become rather irritating (e.g. slicing it when you know that the indices that you're dealing with aren't going to cut into any code points), because you'd have to cast from ubyte[] to char[] whenever you did that.

So, I think that the general sentiment behind this is a good one, but I don't know if the exact idea is ultimately a good one - particularly at this stage in the game. If we're going to make a change like this which would break as much code as this would, we'd need to be _very_ certain that it's what we want to do.

- Jonathan M Davis
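The pitfall described above can be demonstrated with today's Phobos as-is (a small sketch; `walkLength` is the existing std.range primitive that decodes code points):

```d
// .length counts UTF-8 code units; walkLength counts decoded code points.
import std.range : walkLength;

void main()
{
    string s = "über";          // 'ü' occupies 2 UTF-8 code units
    assert(s.length == 5);      // code units: rarely what you mean
    assert(s.walkLength == 4);  // code points: the "character" count
}
```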
Dec 28 2011
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr wrote:
 I was educated enough not to make that mistake, because I read 
 the entire language specification before deciding the language 
 was awesome and downloading the compiler. I find it strange 
 that the product should be made less usable because we do not 
 expect users to read the manual. But it is of course a valid 
 point.

That's awfully optimistic to expect people to read the manual.
 There is nothing wrong with operating at the code unit level. 
 Efficient slicing is very desirable.

I agree that it's useful. It is however the incorrect abstraction level when you need a "string" which is by far the common case in user code. i.e. if I need a name variable in a class: codeUnit[] name; // bug! string Name; // correct I expect that most uses of code-unit arrays should be in the standard library anyway since it provides the string manipulation routines. It all boils down to making the common case trivial and the rare case possible. You can use the underlying data structure (code units) if you need it but the default "string" is what people expect when thinking about what such a type does (a string of letters). D's already 80% there since Phobos already treats strings as bi-directional ranges of code-points which is much closer to the mental image of a string of letters, so I think this is about bringing the current design to its final conclusion.
 Exactly. It is acting less and less like an array of code 
 units. But it *is* an array of code units. If the general 
 consensus is that we need a string data type that acts at a 
 different abstraction level by default (with which I'd 
 disagree, but apparently I don't have a popular opinion here), 
 then we need a string type in the standard library to do that. 
 Changing the language so that an array of code units stops 
 behaving like an array of code units is not a solution.

I agree that we should not break T[] for any T and instead introduce a library type. While I personally believe that such a change will expose hidden bugs (certainly when unaware programmers treat string as ASCII and the product is later on localized), it's a big disturbance in people's code and it's worth a consideration if the benefit worth the costs. Perhaps, some middle ground could be found such that existing code can rely on existing behavior and the new library type will be an opt-in.
Dec 28 2011
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.
 Then, people would access the decoding routines on the needed occasions,
 or would consciously use the representation.

 Robert Jacques:
 Would slicing, i.e. s[i..j] still be valid?

No, only s.rep[i .. j].
 If so, what would be the
 recommended way of finding i and j?

find, findSplit etc. from std.algorithm, std.utf functions etc.

We have discussed this topic some times in past, it's not an easy topic. I agree with the general desires under your ideas Andrei, I suggested something related, time ago. The idea of forbidding s.length, s[i] and s[i..j] for narrow strings seems interesting. (I suggested something different, to keep them but turn them into operations that do the right thing on narrow strings. Some people didn't appreciate the idea because it changes the computational complexity of such operations). But I suggest to step a bit back and look at the situation from a bit more distance, to avoid small patches to D that look like a pirate eyepatch :-) Narrow strings are more memory (and performance) efficient, and sometimes I want to slice them too, and do it correctly (so somestring.rep[i..j] is not enough). So I suggest to give something to perform correct slicing of narrow strings too. Bye, bearophile
Dec 28 2011
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Wednesday, 28 December 2011 at 22:39:15 UTC, Timon Gehr wrote:
 On 12/28/2011 11:12 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr 
 wrote:
 I was educated enough not to make that mistake, because I 
 read the
 entire language specification before deciding the language 
 was awesome
 and downloading the compiler. I find it strange that the 
 product
 should be made less usable because we do not expect users to 
 read the
 manual. But it is of course a valid point.

That's awfully optimistic to expect people to read the manual.

Well, if the alternative is slowly butchering the language I will be awfully optimistic about it all day long.
 There is nothing wrong with operating at the code unit level.
 Efficient slicing is very desirable.

I agree that it's useful. It is however the incorrect abstraction level when you need a "string" which is by far the common case in user code.

I would not go as far as to call it 'incorrect'.
 i.e. if I need a name variable in a class: codeUnit[] name; // 
 bug!
 string Name; // correct

From a pragmatic viewpoint it does not matter because if string is used like this, then codeUnit[] does exactly the same thing. Nobody forces anyone to index or slice into a string variable when they don't need that functionality. All engineers have to work with leaky abstractions. Why is it such a big deal?
 I expect that most uses of code-unit arrays should be in the 
 standard
 library anyway since it provides the string manipulation 
 routines. It
 all boils down to making the common case trivial and the rare 
 case
 possible.  You can use the underlying data structure (code 
 units) if you
 need it but the default "string" is what people expect when 
 thinking
 about what such a type does (a string of letters). D's already 
 80% there
 since Phobos already treats strings as bi-directional ranges of
 code-points which is much closer to the mental image of a 
 string of
 letters, so I think this is about bringing the current design 
 to its
 final conclusion.

Well, that mental image is just not the right one when dealing with Unicode.
 Exactly. It is acting less and less like an array of code 
 units. But
 it *is* an array of code units. If the general consensus is 
 that we
 need a string data type that acts at a different abstraction 
 level by
 default (with which I'd disagree, but apparently I don't have 
 a
 popular opinion here), then we need a string type in the 
 standard
 library to do that. Changing the language so that an array of 
 code
 units stops behaving like an array of code units is not a 
 solution.

I agree that we should not break T[] for any T and instead introduce a library type. While I personally believe that such a change will expose hidden bugs (certainly when unaware programmers treat string as ASCII and the product is later on localized), it's a big disturbance in people's code and it's worth a consideration if the benefit worth the costs. Perhaps, some middle ground could be found such that existing code can rely on existing behavior and the new library type will be an opt-in.

What will such a type offer, except that it disallows indexing and slicing?

From a pragmatic viewpoint people can also continue programming in C++ instead of investing a lot of effort learning a new language. The only difference between programming languages is the human interface aspect. Anything you can program with D you could also do in assembly, yet you prefer D because it's more convenient. In that regard, a code-unit array is definitely worse than a string type. A programmer can choose to either change his 'naive' mental image or change the programming language. Most will do the latter. Computers need to adapt and be human friendly, not vice-versa.
Dec 28 2011
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/29/2011 12:12 AM, Gor Gyolchanyan wrote:
 This is a great idea! In this case the default string will be a
 random-access range, not a bidirectional range. Also, processing
 dstring is faster than string, because no encoding needs to be done.
 Processing power is more expensive than memory. utf-8 is valuable
 only for passing it as an ASCII string (which is not too common) and for
 storing large chunks of it. Both these cases are much less common than
 all the rest of string processing.

dstring consumes 4x the memory, and this can easily cause perf degradations due to thrashing and poor cache locality.
Dec 29 2011
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/29/11 2:04 AM, Vladimir Panteleev wrote:
 I think it would be simpler to just make dstring the default string type.

 dstring is simple and safe. People who want better memory usage can use
 UTF-8 at their own discretion.

memory == time Andrei
Dec 29 2011
prev sibling parent reply Don <nospam nospam.com> writes:
On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.
 Then, people would access the decoding routines on the needed occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

If you change s[i] into s.rep[i], it does the same thing as now. There's no loss of functionality -- it just stops you from accidentally doing the wrong thing. Like .ptr for getting the address of an array. Typically all the ".rep" everywhere would get annoying, so you would write:

ubyte[] u = s.rep;

and use u from then on.

I don't like the name 'rep'. Maybe 'raw' or 'utf'? Apart from that, I think this would be perfect.
Dec 29 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.
 Then, people would access the decoding routines on the needed occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now. There's
 no loss of functionality -- it's just stops you from accidentally doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen... Andrei
Dec 29 2011
parent reply Joshua Reusch <yoschi arkandos.de> writes:
Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.
 Then, people would access the decoding routines on the needed occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now. There's
 no loss of functionality -- it's just stops you from accidentally doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

Maybe it could happen if we

1. make dstring the default string type -- code units and characters would be the same, or
2. forward string.length to std.utf.count and opIndex to std.utf.toUTFindex, so programmers could use slices/indexing/length (no laziness problems), and if they really want code units use .raw/.rep (or better, .utf8/16/32 with std.string.representation/std.utf.toUTF8/16/32).

But generally I liked the idea of just having an alias for strings...
 Andrei

-- Joshua Reusch
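For reference, option 2 above can already be approximated with the existing std.utf.count, which reports the code-point length such a forwarded .length would return (a sketch against the current Phobos names):

```d
// std.utf.count walks the string and counts decoded code points,
// whereas the built-in .length reports code units.
import std.utf : count;

void main()
{
    string s = "naïve";      // 'ï' occupies 2 UTF-8 code units
    assert(s.length == 6);   // current .length: code units
    assert(s.count == 5);    // what a forwarded .length would report
}
```

Note that, as Timon points out elsewhere in the thread, this forwarding would silently turn an O(1) property into an O(n) walk.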
Dec 30 2011
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not
 char/wchar.
 Then, people would access the decoding routines on the needed
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now. There's
 no loss of functionality -- it's just stops you from accidentally doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

Maybe it could happen if we 1. make dstring the default strings type --

Inefficient.
 code units and characters would be the same

Wrong.
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).
 so programmers could use the slices/indexing/length (no laziness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?

1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.

2. They get screwed-up string output, look for the reason, patch up their code with some functions from std.utf and will never make the same mistakes again.

I have *never* seen a user in D.learn complain about it. There might have been some I missed, but it is certainly not a prevalent problem. Also, just because a user can type .rep does not mean he understands Unicode: he is able to make just the same mistakes as before, even more so, as the array he is getting back has the _wrong element type_.
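The two cases above, side by side, as a small sketch (plain D, no assumed APIs):

```d
void main()
{
    // Case 1: pure ASCII -- units and characters coincide, nothing breaks.
    string ascii = "hello";
    assert(ascii.length == 5);

    // Case 2: non-ASCII input -- .length silently stops meaning "characters".
    string greek = "αβγ";        // 3 characters, 6 UTF-8 code units
    assert(greek.length == 6);   // the hidden surprise

    // The decoding foreach is the easy, correct fix once the user notices.
    size_t n;
    foreach (dchar c; greek)
        ++n;
    assert(n == 3);
}
```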
Dec 30 2011
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/30/11 1:55 PM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

What we have now is adequate. The scheme I proposed is optimal. I agree with all of your other remarks. Andrei
Dec 30 2011
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 30/12/2011 20:55, Timon Gehr a écrit :
 On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length
 and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not
 char/wchar.
 Then, people would access the decoding routines on the needed
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now.
 There's
 no loss of functionality -- it's just stops you from accidentally doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

Maybe it could happen if we 1. make dstring the default strings type --

Inefficient.
 code units and characters would be the same

Wrong.
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).
 so programmers could use the slices/indexing/length (no lazyness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?

Atos Origin was hacked because of bad handling of Unicode strings in some of their software. The consequences can be more important than you may think.

Additionally, you make an assumption that is really wrong: that an educated programmer will not make mistakes. C programmers will tell you exactly the same thing when the discussion comes to pointers. But the fact is, we all make mistakes. Many of them! We should resort to unsafe behaviour that relies on programmer capabilities only when needed.

I do understand pointers. I still make mistakes with them, and they sometimes have crazy consequences. And I do not trust anyone who tells me he/she doesn't.

The #1 quality of a programmer is to act like he/she is a moron. Because sometimes we all are morons.
Dec 30 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/30/2011 10:36 PM, deadalnix wrote:
 Le 30/12/2011 20:55, Timon Gehr a écrit :
 On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this
 thread
 is abolition (through however slow a deprecation path) of s.length
 and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length
 and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not
 char/wchar.
 Then, people would access the decoding routines on the needed
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now.
 There's
 no loss of functionality -- it's just stops you from accidentally
 doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

Maybe it could happen if we 1. make dstring the default strings type --

Inefficient.
 code units and characters would be the same

Wrong.
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).
 so programmers could use the slices/indexing/length (no lazyness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?

ATOS origin was hacked because of bad management of unicode in string in some of their software.

And cast(string)s.rep[i..j] would magically fix all those bugs?
 Consequences can be more importants than you may think.

 Additionnaly, you make an asumption that is realy wrong : an educated
 programmer will not make mistake.

I am not. I am just assuming that the proposed change does not help with that.
 C programmers will just tell you
 excactly the same thing is the discution comes to pointers. But the fact
 is, we all do mistakes. Many of them ! We should go into unsafe
 behaviour, that rely on programmer capabilities only when needed.

 I do understand pointers. I do make mistake with them and it does have
 crazy consequences sometime. And I do not trust anyone that say me
 he/she doesn't.

 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.
Dec 30 2011
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 12/30/2011 05:27 PM, Timon Gehr wrote:
 On 12/30/2011 10:36 PM, deadalnix wrote:
 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.

Tsk tsk. Missing the point. I believe what deadalnix is trying to say is this: programmers should try to write correct code, but should never trust themselves to write correct code.

Programs worth writing are complex enough that there is no way any of us can write perfectly correct code on the first draft. There is always going to be some polishing, maybe even /a lot/ of polishing, and perhaps some complete tear-downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test, right?) then I will have a hard time taking you seriously.

That said, it is extremely pleasant to have a language that catches you when you inevitably fall.
Dec 31 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 06:32 PM, Chad J wrote:
 On 12/30/2011 05:27 PM, Timon Gehr wrote:
 On 12/30/2011 10:36 PM, deadalnix wrote:
 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.

Tsk tsk. Missing the point.

Not at all. And I don't take anyone seriously who feels the need to 'Tsk tsk' btw.
 I believe what deadalnix is trying to say is this:
 Programmers should try to write correct code, but should never trust
 themselves to write correct code.

No, programmers should write correct code and then test it thoroughly. 'Trying to' is the wrong way to go about anything. And there is no need to distrust oneself. Anyway, I have a _very hard time_ translating 'acting like a moron' to 'writing correct code'.
 ...

 Programs worth writing are complex enough that there is no way any of us
 can write them perfectly correct code on first draft.  There is always
 going to be some polishing, and maybe even /a lot/ of polishing, and
 perhaps some complete tear downs and rebuilds from time to time.  "Build
 one to throw away; you will anyways."  If you tell me that you can
 always write correct code the first time and you never need to go back
 and fix anything when you do testing (you do test right?) then I will
 have a hard time taking you seriously.

Testing is the main part of my development. Furthermore, I use assertions all over the place.
 That said, it is extremely pleasant to have a language that catches you
 when you inevitably fall.

That is why I also like Haskell.
Dec 31 2011
next sibling parent Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 12/31/2011 01:13 PM, Timon Gehr wrote:
 On 12/31/2011 06:32 PM, Chad J wrote:
 On 12/30/2011 05:27 PM, Timon Gehr wrote:
 On 12/30/2011 10:36 PM, deadalnix wrote:
 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.

Tsk tsk. Missing the point.

Not at all. And I don't take anyone seriously who feels the need to 'Tsk tsk' btw.

Well, you've certainly a right to it. I just take it a little rough when it seems like someone's words are being intentionally misread.
 I believe what deadalnix is trying to say is this:
 Programmers should try to write correct code, but should never trust
 themselves to write correct code.

No, programmers should write correct code and then test it thoroughly. 'Trying to' is the wrong way to go about anything. And there is no need to distrust oneself.

There's a perfect reason to distrust oneself: oneself is a squishy meatbag that makes mistakes. Repeated "trying" with rigor applied will lead to success.
 Anyway, I have a _very hard time_ translating 'acting like a moron' to
 'writing correct code'.
 

I'm pretty sure it's suggestive. If an intelligent or careful person acts like a moron, then they will be forced to assume that they will make mistakes, and therefore take measures to ensure that ALL mistakes are caught and fixed or mitigated. That is how you get from 'acting like a moron' to 'writing correct code'.
 ...

 Programs worth writing are complex enough that there is no way any of us
 can write them perfectly correct code on first draft.  There is always
 going to be some polishing, and maybe even /a lot/ of polishing, and
 perhaps some complete tear downs and rebuilds from time to time.  "Build
 one to throw away; you will anyways."  If you tell me that you can
 always write correct code the first time and you never need to go back
 and fix anything when you do testing (you do test right?) then I will
 have a hard time taking you seriously.

Testing is the main part of my development. Furthermore, I use assertions all over the place.
 That said, it is extremely pleasant to have a language that catches you
 when you inevitably fall.

That is why I also like Haskell.

I hear ya. I feel Haskell is an important language to understand, if not know how to use effectively. I wish I knew how to use it better than I do, but I haven't had too many projects that are amenable to it.
Dec 31 2011
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 31/12/2011 19:13, Timon Gehr a écrit :
 On 12/31/2011 06:32 PM, Chad J wrote:
 On 12/30/2011 05:27 PM, Timon Gehr wrote:
 On 12/30/2011 10:36 PM, deadalnix wrote:
 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.

Programs worth writing are complex enough that there is no way any of us can write them perfectly correct code on first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test right?) then I will have a hard time taking you seriously.

Testing is the main part of my development. Furthermore, I use assertions all over the place.

Well, if you write correct code, you don't need assertions. They will always be true, because your code is correct. Stop wasting your time with that. Remember the #1 quality of a programmer: write correct code. See how stupid this becomes?
Jan 01 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/01/2012 11:36 PM, deadalnix wrote:
 Le 31/12/2011 19:13, Timon Gehr a écrit :
 On 12/31/2011 06:32 PM, Chad J wrote:
 On 12/30/2011 05:27 PM, Timon Gehr wrote:
 On 12/30/2011 10:36 PM, deadalnix wrote:
 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.

Programs worth writing are complex enough that there is no way any of us can write them perfectly correct code on first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test right?) then I will have a hard time taking you seriously.

Testing is the main part of my development. Furthermore, I use assertions all over the place.

Well, if you write correct code, you don't need assertion. They will always be true because your code is correct. Stop wasting your time with that. Remeber the #1 quality of a programmer : write correct code. See how stupid this becomes ?

You miss the point. Testing and assertions are part of how I write correct code.
Jan 01 2012
parent reply deadalnix <deadalnix gmail.com> writes:
Le 01/01/2012 23:46, Timon Gehr a écrit :
 On 01/01/2012 11:36 PM, deadalnix wrote:
 Le 31/12/2011 19:13, Timon Gehr a écrit :
 On 12/31/2011 06:32 PM, Chad J wrote:
 On 12/30/2011 05:27 PM, Timon Gehr wrote:
 On 12/30/2011 10:36 PM, deadalnix wrote:
 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.

Programs worth writing are complex enough that there is no way any of us can write them perfectly correct code on first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test right?) then I will have a hard time taking you seriously.

Testing is the main part of my development. Furthermore, I use assertions all over the place.

Well, if you write correct code, you don't need assertion. They will always be true because your code is correct. Stop wasting your time with that. Remeber the #1 quality of a programmer : write correct code. See how stupid this becomes ?

You miss the point. Testing and assertions are part of how I write correct code.

So, to write correct code, you need to assume you'll write incorrect code. Writing correct code is your goal. Assuming you'll do stupid stuff is a quality required to advance toward this goal. And by saying that you test and assert a lot, you confirm that point.
Jan 04 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/04/2012 07:08 PM, deadalnix wrote:
 Le 01/01/2012 23:46, Timon Gehr a écrit :
 On 01/01/2012 11:36 PM, deadalnix wrote:
 Le 31/12/2011 19:13, Timon Gehr a écrit :
 On 12/31/2011 06:32 PM, Chad J wrote:
 On 12/30/2011 05:27 PM, Timon Gehr wrote:
 On 12/30/2011 10:36 PM, deadalnix wrote:
 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.

Programs worth writing are complex enough that there is no way any of us can write them perfectly correct code on first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test right?) then I will have a hard time taking you seriously.

Testing is the main part of my development. Furthermore, I use assertions all over the place.

Well, if you write correct code, you don't need assertion. They will always be true because your code is correct. Stop wasting your time with that. Remeber the #1 quality of a programmer : write correct code. See how stupid this becomes ?

You miss the point. Testing and assertions are part of how I write correct code.

So, to write correct code, you need to asume you'll write incorrect code. Writing correct code is your goal. Asuming you'll do stupid stuff is a quality required to advance toward this goal.

You are free to believe whatever you want, but I think that strategy you are describing is a recipe for writing buggy code.
 And, saying that you test and assert a lot,

Code for which no tests exist is neither correct nor incorrect. Assertions are a neat way to detect parts of the application whose implementation is incomplete.
 you confirm that point.

No.
Jan 04 2012
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 01/04/2012 11:31 PM, Timon Gehr wrote:
 Code for which no tests exist is neither correct nor incorrect.
 Assertions are a neat way to detect parts of the application whose
 implementation is incomplete.

Another major use of them is the checked documentation of assumptions, mainly in method preconditions.
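[Editor's note: a sketch of the precondition use described above, in D's contract syntax of the era (in/out/body blocks); sqrtFloor is a hypothetical example function, not from the thread.]

```d
// An assertion as checked documentation of an assumption: the `in` block
// documents and enforces the precondition, the `out` block the postcondition.
import std.stdio;

uint sqrtFloor(uint n)
in { assert(n < (1u << 30), "argument out of supported range"); }
out (r) { assert(r * r <= n && n < (r + 1) * (r + 1)); }
body
{
    uint r = 0;
    while ((r + 1) * (r + 1) <= n)
        ++r;
    return r;
}

void main()
{
    writeln(sqrtFloor(10)); // prints 3
}
```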
Jan 04 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/30/2011 11:55 AM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.
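[Editor's note: a minimal sketch of why plain code-unit indexing is safe here. In UTF-8, every byte of a multi-byte sequence is >= 0x80, so an ASCII byte such as '/' can never occur inside the encoding of a non-ASCII character; the search string below is just an illustrative example.]

```d
// Naive byte-wise search for an ASCII character works on UTF-8
// without any decoding.
import std.stdio;

size_t findSlash(const(char)[] s)
{
    foreach (i; 0 .. s.length)   // plain code-unit indexing
        if (s[i] == '/')
            return i;
    return s.length;
}

void main()
{
    // Three 3-byte characters precede the slash, so it sits at index 9.
    writeln(findSlash("日本語/path")); // prints 9
}
```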
Dec 30 2011
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/30/2011 11:01 PM, Walter Bright wrote:
 On 12/30/2011 11:55 AM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8.

You are right, that obviously needs fixing. ☺ Thanks!
 That's not true for any other multibyte encoding, which is why UTF-8 is
 inspired genius.

Dec 30 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/30/11 4:01 PM, Walter Bright wrote:
 On 12/30/2011 11:55 AM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.

It's true for any encoding with the prefix property, such as Huffman. Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked. So yeah, UTF-8 is great. But it is not miraculous. We need .raw. Andrei
Dec 30 2011
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 12:00 AM, Andrei Alexandrescu wrote:
 On 12/30/11 4:01 PM, Walter Bright wrote:
 On 12/30/2011 11:55 AM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.

It's true for any encoding with the prefix property, such as Huffman. Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked. So yeah, UTF-8 is great. But it is not miraculous. We need .raw. Andrei

auto raw(S)(S s) if(isNarrowString!S){
    static if(is(S==string)) return cast(ubyte[])s;
    else static if(is(S==wstring)) return cast(ushort[])s;
}
Dec 30 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/30/11 5:07 PM, Timon Gehr wrote:
 On 12/31/2011 12:00 AM, Andrei Alexandrescu wrote:
 On 12/30/11 4:01 PM, Walter Bright wrote:
 On 12/30/2011 11:55 AM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.

It's true for any encoding with the prefix property, such as Huffman. Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked. So yeah, UTF-8 is great. But it is not miraculous. We need .raw. Andrei

auto raw(S)(S s) if(isNarrowString!S){
    static if(is(S==string)) return cast(ubyte[])s;
    else static if(is(S==wstring)) return cast(ushort[])s;
}

Almost there. https://github.com/D-Programming-Language/phobos/blob/master/std/string.d#L809 Andrei
Dec 30 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 01:03 AM, Andrei Alexandrescu wrote:
 On 12/30/11 5:07 PM, Timon Gehr wrote:
 On 12/31/2011 12:00 AM, Andrei Alexandrescu wrote:
 On 12/30/11 4:01 PM, Walter Bright wrote:
 On 12/30/2011 11:55 AM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.

It's true for any encoding with the prefix property, such as Huffman. Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked. So yeah, UTF-8 is great. But it is not miraculous. We need .raw. Andrei

auto raw(S)(S s) if(isNarrowString!S){
    static if(is(S==string)) return cast(ubyte[])s;
    else static if(is(S==wstring)) return cast(ushort[])s;
}

Almost there. https://github.com/D-Programming-Language/phobos/blob/master/std/string.d#L809 Andrei

alias std.string.representation raw;
Dec 30 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/30/11 6:07 PM, Timon Gehr wrote:
 alias std.string.representation raw;

I meant your implementation is incomplete. But the main point is that the presence of representation/raw is not the issue. The availability of good-for-nothing .length and operator[] is the issue. Putting in place the convention of using .raw is hardly useful within the context. Andrei
Dec 30 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
 On 12/30/11 6:07 PM, Timon Gehr wrote:
 alias std.string.representation raw;

I meant your implementation is incomplete.

It was more a sketch than an implementation. It is not even type safe :o).
 But the main point is that presence of representation/raw is not the
 issue.
 The availability of good-for-nothing .length and operator[] are
 the issue. Putting in place the convention of using .raw is hardly
 useful within the context.

D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but a UTF-8 code unit. Therefore char[] is an array of code units. length gives the number of code units. operator[i] gives the i-th code unit. Nothing wrong or good-for-nothing about that.

.raw would return ubyte[], therefore it would lose all type information. Effectively, what .raw does is a type cast that will let code point data alias with integral data. Consider:

void foo(ubyte[] b)in{assert(b.length);}body{
    b[0]=2; // perfectly fine
}

void main(){
    char[] s = "☺".dup;
    auto b = s.raw;
    foo(b);
    writeln(s); // oops...
}

I fail to understand why that is desirable.
Dec 30 2011
parent reply Don <nospam nospam.com> writes:
On 31.12.2011 01:56, Timon Gehr wrote:
 On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
 On 12/30/11 6:07 PM, Timon Gehr wrote:
 alias std.string.representation raw;

I meant your implementation is incomplete.

It was more a sketch than an implementation. It is not even type safe :o).
 But the main point is that presence of representation/raw is not the
 issue.
 The availability of good-for-nothing .length and operator[] are
 the issue. Putting in place the convention of using .raw is hardly
 useful within the context.

D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but an utf-8 code unit. Therefore char[] is an array of code units.

No, it isn't. That's the problem. char[] is not an array of char. It has an additional invariant: it is a UTF8 string. If you randomly change elements, the invariant is violated. In reality, char[] and wchar[] are compressed forms of dstring.
 .raw would return ubyte[], therefore it
 would lose all type information. Effectively, what .raw does is a type
 cast that will let code point data alias with integral data.

Exactly. It's just a "I know what I'm doing" signal.
Dec 31 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 01:15 PM, Don wrote:
 On 31.12.2011 01:56, Timon Gehr wrote:
 On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
 On 12/30/11 6:07 PM, Timon Gehr wrote:
 alias std.string.representation raw;

I meant your implementation is incomplete.

It was more a sketch than an implementation. It is not even type safe :o).
 But the main point is that presence of representation/raw is not the
 issue.
 The availability of good-for-nothing .length and operator[] are
 the issue. Putting in place the convention of using .raw is hardly
 useful within the context.

D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but an utf-8 code unit. Therefore char[] is an array of code units.

No, it isn't. That's the problem. char[] is not an array of char. It has an additional invariant: it is a UTF8 string. If you randomly change elements, the invariant is violated.

char[] is an array of char and the additional invariant is not enforced by the language.
 In reality, char[] and wchar[] are compressed forms of dstring.

 .raw would return ubyte[], therefore it
 would lose all type information. Effectively, what .raw does is a type
 cast that will let code point data alias with integral data.

Exactly. It's just a "I know what I'm doing" signal.

No, it is a "I don't know what I'm doing" signal: ubyte[] does not carry any sign of an additional invariant, and the aliasing can be used to break the invariant that is commonly assumed for char[]. That was my point.
Dec 31 2011
parent reply Don <nospam nospam.com> writes:
On 31.12.2011 17:13, Timon Gehr wrote:
 On 12/31/2011 01:15 PM, Don wrote:
 On 31.12.2011 01:56, Timon Gehr wrote:
 On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
 On 12/30/11 6:07 PM, Timon Gehr wrote:
 alias std.string.representation raw;

I meant your implementation is incomplete.

It was more a sketch than an implementation. It is not even type safe :o).
 But the main point is that presence of representation/raw is not the
 issue.
 The availability of good-for-nothing .length and operator[] are
 the issue. Putting in place the convention of using .raw is hardly
 useful within the context.

D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but an utf-8 code unit. Therefore char[] is an array of code units.

No, it isn't. That's the problem. char[] is not an array of char. It has an additional invariant: it is a UTF8 string. If you randomly change elements, the invariant is violated.

char[] is an array of char and the additional invariant is not enforced by the language.

No, it isn't an ordinary array. Take concatenation, for example: char[] ~ int will never create an invalid string. You can end up with multiple chars being appended, even from a single append. foreach is different, too. They are a bit magical. There's quite a lot of code in the compiler to make sure that strings remain valid.

The additional invariant is not enforced in the case of slicing; that's the point.
Dec 31 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 01/01/2012 08:10 AM, Don wrote:
 On 31.12.2011 17:13, Timon Gehr wrote:
 On 12/31/2011 01:15 PM, Don wrote:
 On 31.12.2011 01:56, Timon Gehr wrote:
 On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
 On 12/30/11 6:07 PM, Timon Gehr wrote:
 alias std.string.representation raw;

I meant your implementation is incomplete.

It was more a sketch than an implementation. It is not even type safe :o).
 But the main point is that presence of representation/raw is not the
 issue.
 The availability of good-for-nothing .length and operator[] are
 the issue. Putting in place the convention of using .raw is hardly
 useful within the context.

D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but a UTF-8 code unit. Therefore char[] is an array of code units.

No, it isn't. That's the problem. char[] is not an array of char. It has an additional invariant: it is a UTF8 string. If you randomly change elements, the invariant is violated.

char[] is an array of char and the additional invariant is not enforced by the language.

No, it isn't an ordinary array. For example with concatenation. char[] ~ int will never create an invalid string.

Yes it will.

import std.stdio;

void main()
{
    char[] x;
    writeln(x ~ 255);
}
 You can end up with multiple chars being appended, even from a single append.
foreach is different,
 too. They are a bit magical.

Fair enough, but type conversion rules are a bit magical in general.

void main()
{
    auto a = cast(short[])[1, 2, 3];
    auto b = [1, 2, 3];
    auto c = cast(short[])b;
    assert(a != c);
}
 There's quite a lot of code in the compiler to make sure that strings
 remain valid.

At the same time, there are many language features that allow creating invalid strings:

auto a = "\377\252\314";
auto b = x"FF AA CC";
auto c = import("binary");
 The additional invariant is not enforced in the case of slicing; that's
 the point.

Jan 01 2012
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/30/2011 3:00 PM, Andrei Alexandrescu wrote:
 On 12/30/11 4:01 PM, Walter Bright wrote:
 On 12/30/2011 11:55 AM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.

It's true for any encoding with the prefix property, such as Huffman.

Any other multibyte character encoding I've seen standardized for use in C.
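The self-synchronization Walter relies on can be shown with a naive byte scan (a sketch; the string contents are made up):

```d
void main()
{
    string s = "10\u20AC and $5"; // '€' occupies three code units
    size_t i = 0;
    while (i < s.length && s[i] != '$')
        ++i;                      // naive [i] indexing, no stride logic
    // No byte inside a multibyte UTF-8 sequence is < 0x80,
    // so the ASCII '$' can only match the real '$'.
    assert(s[i] == '$');
    assert(s[i + 1] == '5');
}
```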
Dec 30 2011
prev sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-12-30 23:00:49 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 Using .raw is /optimal/ because it states the assumption appropriately. 
 The user knows '$' cannot be in the prefix of any other symbol, so she 
 can state the byte alone is the character. If that were a non-ASCII 
 character, the assumption wouldn't have worked.
 
 So yeah, UTF-8 is great. But it is not miraculous. We need .raw.

After reading most of the thread, it seems to me like you're deconstructing strings as arrays one piece at a time, to the point where instead of arrays we'd basically get a string struct and do things on it. Maybe it's part of a grand scheme, more likely it's one realization after another leading to one change after another… let's see where all this will lead us:

0. in the beginning, strings were char[] arrays
1. arrays are generalized as ranges
2. phobos starts treating char arrays as bidirectional ranges of dchar (instead of random access ranges of char)
3. foreach on char[] should iterate over dchar by default
4. remove .length, random access, and slicing from char arrays
5. replace char[] with a struct { ubyte[] raw; }

Number 1 is great by itself, no debate there. Number 2 is debatable. Numbers 3 and 4 are somewhat required for consistency with number 2. Number 5 is just the logical conclusion of all these changes.

If we want a fundamental change to what strings are in D, perhaps we should start focusing on the broader issue instead of trying to pass piecemeal changes one after the other. For consistency's sake, I think we should either stop after 1 or go all the way to 5. Either we do it fully or we don't do it at all. All those divergent interpretations of strings end up hurting the language. Walter and Andrei ought to find a way to agree with each other.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Dec 30 2011
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 04:30 AM, Jonathan M Davis wrote:
 On Friday, December 30, 2011 20:55:42 Timon Gehr wrote:
 1. They don't notice. Then it is not a problem, because they are
 obviously only using ASCII characters and it is perfectly reasonable to
 assume that code units and characters are the same thing.

The problem is that what's more likely to happen in a lot of cases is that they use it wrong and don't notice, because they're only using ASCII in testing, _but_ they have bugs all over the place, because their code is actually used with unicode in the field.

Then that is the fault of the guy who created the tests. At least that guy should be familiar with the issues; otherwise he is in the wrong position. Software should never be released without thorough testing.
 Yes, diligent programmers will generally find such problems, but with the
 current scheme, it's _so_ easy to use length when you shouldn't, that it's
 pretty much a guarantee that it's going to happen. I'm not sure that Andrei's
 suggestion is the best one at this point, but I sure wouldn't be against it
 being introduced. It wouldn't entirely fix the problem by any means, but
 programmers would then have to work harder at screwing it up and so there
 would be fewer mistakes.

Programmers would then also have to work harder at doing it right and at memorizing special cases, so there is absolutely no net gain.
 Arguably, the first issue with D strings is that we have char. In most
 languages, char is supposed to be a character, so many programmers will code
 with that expectation. If we had something like utf8unit, utf16unit, and
 utf32unit (arguably very bad, albeit descriptive, names) and no char, then it
 would force programmers to become semi-educated about the issues. There's no
 way that that's changing at this point though.

 - Jonathan M Davis

A programmer has to have basic knowledge of the language he is programming in. That includes knowing the meaning of all basic types. If he fails at that, testing should definitely catch that kind of trivial bug.
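The basic fact in question is small (a sketch; std.range.walkLength counts elements of the decoding range, i.e. code points):

```d
import std.range : walkLength;

void main()
{
    string s = "h\u00E9llo";    // 5 characters, 6 code units
    assert(s.length == 6);      // .length counts UTF-8 code units
    assert(s.walkLength == 5);  // walking the range counts code points
}
```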
Dec 30 2011
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/30/2011 7:30 PM, Jonathan M Davis wrote:
 Yes, diligent programmers will generally find such problems, but with the
 current scheme, it's _so_ easy to use length when you shouldn't, that it's
 pretty much a guarantee that it's going to happen.

I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional.

This was definitely not true of older multibyte schemes, like Shift-JIS (shudder), but those schemes ought to be terminated with extreme prejudice. But it definitely will take a long time to live down the bugs and miasma of code that had to deal with them. C and C++ still live with that because of their agenda of backwards compatibility. They still support EBCDIC, after all, which was obsolete even in the 70's. And I still see posts on comp.lang.c++.moderated that say "you shouldn't write string code like that, because it won't work on EBCDIC!" Sheesh!
Dec 30 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/30/11 10:09 PM, Walter Bright wrote:
 On 12/30/2011 7:30 PM, Jonathan M Davis wrote:
 Yes, diligent programmers will generally find such problems, but with the
 current scheme, it's _so_ easy to use length when you shouldn't, that
 it's
 pretty much a guarantee that it's going to happen.

I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional.

The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF16/UCS-2 morass: in a vast majority of the time the programmer may consider UTF16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well that all didn't go over all that well. We need .raw and we must abolish .length and [] for narrow strings. Andrei
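The .raw under discussion is essentially what std.string.representation already provides (a sketch of its behavior):

```d
import std.string : representation;

void main()
{
    string s = "h\u00E9llo";
    immutable(ubyte)[] r = s.representation; // code units, no decoding
    assert(r.length == 6);
    assert(r[1] == 0xC3 && r[2] == 0xA9);    // the two units of 'é'
}
```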
Dec 30 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:
 On 12/30/11 10:09 PM, Walter Bright wrote:
 I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8
 correctly, but it turned out that the naive version that used [i] and
 .length worked correctly. This is typical, not exceptional.

The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF16/UCS-2 morass: in a vast majority of the time the programmer may consider UTF16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well that all didn't go over all that well.

I'm not so sure it's quite the same. Java was designed before there were surrogate pairs, they kinda got the rug pulled out from under them. So, they simply have no decent way to deal with it. There isn't even a notion of a dchar character type. Java was designed with codeunit==codepoint, it is embedded in the design of the language, library, and culture. This is not true of D. It's designed from the ground up to deal properly with UTF. D has very simple language features to deal with it.
 We need .raw and we must abolish .length and [] for narrow strings.

I don't believe that fixes anything and breaks every D project out there. We're chasing phantoms here, and I worry a lot about over-engineering trivia. And, we already have a type to deal with it: dstring
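For reference, both the dstring route and foreach with a dchar loop variable give one element per code point (a minimal sketch):

```d
void main()
{
    string  s = "h\u00E9llo";
    dstring d = "h\u00E9llo"d;
    assert(s.length == 6); // UTF-8 code units
    assert(d.length == 5); // UTF-32: one code unit per code point
    size_t n = 0;
    foreach (dchar c; s)   // the language decodes UTF-8 on the fly
        ++n;
    assert(n == 5);
}
```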
Dec 31 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/31/11 2:04 AM, Walter Bright wrote:
 On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:
 On 12/30/11 10:09 PM, Walter Bright wrote:
 I'm not so sure about that. Timon Gehr's X macro tried to handle
 UTF-8 correctly, but it turned out that the naive version that
 used [i] and .length worked correctly. This is typical, not
 exceptional.

The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF16/UCS-2 morass: in a vast majority of the time the programmer may consider UTF16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well that all didn't go over all that well.

I'm not so sure it's quite the same. Java was designed before there were surrogate pairs, they kinda got the rug pulled out from under them. So, they simply have no decent way to deal with it. There isn't even a notion of a dchar character type. Java was designed with codeunit==codepoint, it is embedded in the design of the language, library, and culture. This is not true of D. It's designed from the ground up to deal properly with UTF.

I disagree. It is designed to make dealing with UTF possible.
 D has very simple language features to deal with
 it.

Disagree. I mean simple they are, no contest. They could and should be much better, make correct code easier to write, and make incorrect code more difficult to write. Claiming we reached perfection there doesn't quite fit.
 We need .raw and we must abolish .length and [] for narrow
 strings.

I don't believe that fixes anything and breaks every D project out there.

I agree. This is the only reason that keeps me from furthering the issue.
 We're chasing phantoms here, and I worry a lot about over-engineering
 trivia.

I disagree. I understand that it seems like trivia to you, but that doesn't make your opinion any less wrong, not to mention provincial through insistence it's applicable beyond a small team of experts. Again: I know no other - I literally mean not one - person who writes string code like you do (and myself after learning it from you); the current system is adequate; the proposed system is perfect - save for breaking backwards compatibility, which makes the discussion moot. But it being moot does not afford me to concede this point. I am right.
 And, we already have a type to deal with it: dstring

No. Andrei
Dec 31 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-12-31 08:56:37 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 12/31/11 2:04 AM, Walter Bright wrote:
 
 We're chasing phantoms here, and I worry a lot about over-engineering
 trivia.

I disagree. I understand that seems trivia to you, but that doesn't make your opinion any less wrong, not to mention provincial through insistence it's applicable beyond a small team of experts. Again: I know no other - I literally mean not one - person who writes string code like you do (and myself after learning it from you); the current system is adequate; the proposed system is perfect - save for breaking backwards compatibility, which makes the discussion moot. But it being moot does not afford me to concede this point. I am right.

Perfect? At one time Java and other frameworks started to use UTF-16 code units as if they were characters, and that turned wrong on them. Now we know that not even code points should be considered characters, thanks to characters spanning multiple code points. You might call it perfect, but for that you have made two assumptions:

1. treating code points as characters is good enough, and
2. the performance penalty of decoding everything is tolerable

Ranges of code points might be perfect for you, but it's a tradeoff that won't work in every situation. The whole concept of generic algorithms working on strings efficiently doesn't work. Applying generic algorithms to strings by treating them as a range of code points is both wasteful (because it forces you to decode everything) and incomplete (because of multi-code-point characters), and it should be avoided. Algorithms working on Unicode strings should be designed with Unicode in mind. And the best way to design efficient Unicode algorithms is to access the array of code units directly, read each character at the level of abstraction required, and know what you're doing.

I'm not against making strings more opaque to encourage people to use the Unicode algorithms from the standard library instead of rolling their own. But I doubt the current approach of using .raw alone will prevent many from doing dumb things. On the other side, I'm sure it'll make it more complicated to write Unicode algorithms, because accessing and especially slicing the raw content of char[] will become tiresome. I'm not convinced it's a net win.

As for Walter being the only one coding by looking at the code units directly, that's not true. All my parser code looks at code units directly and only decodes to code points where necessary (just look at the XML parsing code I posted a while ago to get an idea of how it can apply to ranges). And I don't think it's because I've seen Walter's code before; I think it is because I know how Unicode works and I want to make my parser efficient. I've done the same for a parser in C++ a while ago. I can hardly imagine I'm the only one (with Walter and you). I think this is how efficient algorithms dealing with Unicode should be written.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
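The decode-only-where-necessary style described above looks roughly like this (a sketch; skipToColon is a made-up helper, and std.utf.decode advances the index past one code point):

```d
import std.utf : decode;

// Scan for an ASCII delimiter at the code-unit level, decoding only
// when a non-ASCII lead unit is encountered.
size_t skipToColon(string s)
{
    size_t i = 0;
    while (i < s.length && s[i] != ':')
    {
        if (s[i] < 0x80)
            ++i;          // ASCII: single code unit
        else
            decode(s, i); // multibyte: advance past the code point
    }
    return i;
}

void main()
{
    assert(skipToColon("h\u00E9llo: x") == 6);
}
```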
Dec 31 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/31/11 8:17 CST, Michel Fortin wrote:
 On 2011-12-31 08:56:37 +0000, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 12/31/11 2:04 AM, Walter Bright wrote:

 We're chasing phantoms here, and I worry a lot about over-engineering
 trivia.

I disagree. I understand that seems trivia to you, but that doesn't make your opinion any less wrong, not to mention provincial through insistence it's applicable beyond a small team of experts. Again: I know no other - I literally mean not one - person who writes string code like you do (and myself after learning it from you); the current system is adequate; the proposed system is perfect - save for breaking backwards compatibility, which makes the discussion moot. But it being moot does not afford me to concede this point. I am right.

Perfect?

Sorry, I exaggerated. I meant "a net improvement while keeping simplicity".
 At one time Java and other frameworks started to use UTF-16 as
 if they were characters, that turned wrong on them. Now we know that not
 even code points should be considered characters, thanks to characters
 spanning on multiple code points. You might call it perfect, but for
 that you have made two assumptions:

 1. treating code points as characters is good enough, and
 2. the performance penalty of decoding everything is tolerable

I'm not sure how you concluded I drew such assumptions.
 Ranges of code points might be perfect for you, but it's a tradeoff that
 won't work in every situations.

Ranges can be defined to span logical glyphs that span multiple code points.
 The whole concept of generic algorithms working on strings efficiently
 doesn't work.

Apparently std.algorithm does.
 Applying generic algorithms to strings by treating them as
 a range of code points is both wasteful (because it forces you to decode
 everything) and incomplete (because of multi-code-point characters) and
 it should be avoided.

An algorithm that gains by accessing the encoding can do so - and indeed some do. Spanning multi-code-point characters is a matter of defining the range appropriately; it doesn't break the abstraction.
 Algorithms working on Unicode strings should be
 designed with Unicode in mind. And the best way to design efficient
 Unicode algorithms is to access the array of code units directly and
 read each character at the level of abstraction required and know what
 you're doing.

As I said, that's happening already.
 I'm not against making strings more opaque to encourage people to use
 the Unicode algorithms from the standard library instead of rolling
 their own.

I'd say we're discussing making the two kinds of manipulation (encoded sequence of logical character vs. array of code units) more distinguished from each other. That's a Good Thing(tm).
 But I doubt the current approach of using .raw alone will
 prevent many from doing dumb things.

I agree. But I think it would be a sensible improvement over now, when you get to do a ton of dumb things with much more ease.
 On the other side I'm sure it'll
 make it it more complicated to write Unicode algorithms because
 accessing and especially slicing the raw content of char[] will become
 tiresome. I'm not convinced it's a net win.

Many Unicode algorithms don't need slicing. Those that do carefully mix manipulation of code points with manipulation of representation. It is a net win that the two operations are explicitly distinguished.
 As for Walter being the only one coding by looking at the code units
 directly, that's not true. All my parser code look at code units
 directly and only decode to code points where necessary (just look at
 the XML parsing code I posted a while ago to get an idea to how it can
 apply to ranges). And I don't think it's because I've seen Walter code
 before, I think it is because I know how Unicode works and I want to
 make my parser efficient. I've done the same for a parser in C++ a while
 ago. I can hardly imagine I'm the only one (with Walter and you). I
 think this is how efficient algorithms dealing with Unicode should be
 written.

Congratulations. Andrei
Dec 31 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-12-31 15:03:13 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 12/31/11 8:17 CST, Michel Fortin wrote:
 At one time Java and other frameworks started to use UTF-16 as
 if they were characters, that turned wrong on them. Now we know that not
 even code points should be considered characters, thanks to characters
 spanning on multiple code points. You might call it perfect, but for
 that you have made two assumptions:
 
 1. treating code points as characters is good enough, and
 2. the performance penalty of decoding everything is tolerable

I'm not sure how you concluded I drew such assumptions.

1: Because treating UTF-8 strings as a range of code points encourages people to think so. 2: From things you posted on the newsgroup previously. Sorry I don't have the references, but it'd take too long to dig them back up.
 Ranges of code points might be perfect for you, but it's a tradeoff that
 won't work in every situations.

Ranges can be defined to span logical glyphs that span multiple code points.

I'm talking about the default interpretation, where string ranges are ranges of code points, making that tradeoff the default. And also, I think we can agree that a logical glyph range would be terribly inefficient in practice, although it could be a nice teaching tool.
 The whole concept of generic algorithms working on strings efficiently
 doesn't work.

Apparently std.algorithm does.

First, it doesn't really work. It seems to work fine, but it doesn't handle (yet) characters spanning multiple code points. To handle this case, you could use a logical glyph range, but that'd be quite inefficient. Or you can improve the algorithm working on code points so that it checks for combining characters on the edges, but then is it still a generic algorithm?

Second, it doesn't work efficiently. Sure you can specialize the algorithm so it does not decode all code units when it's not necessary, but then does it still classify as a generic algorithm?

My point is that *generic* algorithms cannot work *efficiently* with Unicode, not that they can't work at all. And even then, for the inefficient generic algorithm to work correctly with all input, the user needs to choose the correct Unicode representation for the problem at hand, which requires some general knowledge of Unicode.

Which is why I'd just discourage generic algorithms for strings.
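The multi-code-point problem is concrete even at the code-point level (a sketch using std.algorithm.equal, which compares the decoded code points):

```d
import std.algorithm : equal;

void main()
{
    string composed = "\u00E9";  // 'é' as one code point
    string combined = "e\u0301"; // 'e' + combining acute accent
    // A reader sees the same character, but a code-point range does not.
    assert(!equal(composed, combined));
}
```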
 I'm not against making strings more opaque to encourage people to use
 the Unicode algorithms from the standard library instead of rolling
 their own.

I'd say we're discussing making the two kinds of manipulation (encoded sequence of logical character vs. array of code units) more distinguished from each other. That's a Good Thing(tm).

It's a good abstraction to show the theory of Unicode. But it's not the way to go if you want efficiency. For efficiency you need, for each element in the string, to use the lowest abstraction required to handle this element, so your algorithm needs to know about the various abstraction layers.

This is the kind of "range" I'd use to create algorithms dealing with Unicode properly:

struct UnicodeRange(U)
{
    U frontUnit() @property;
    dchar frontPoint() @property;
    immutable(U)[] frontGlyph() @property;

    void popFrontUnit();
    void popFrontPoint();
    void popFrontGlyph();

    ...
}

Not really a range per your definition of ranges, but basically it lets you intermix working with units, code points, and glyphs. Add a way to slice at the unit level and a way to know the length at the unit level and it's all I need to make an efficient parser, or any algorithm really.

The problem with .raw is that it creates a separate range for the units. This means you can't look at the frontUnit and then decide to pop the unit and then look at the next, decide you need to decode using frontPoint, then call popPoint and return to looking at the front unit.

Also, I'm not sure the "glyph" part of that range is required most of the time, because most of the time you don't need to decode glyphs to be glyph-aware. But it'd be nice if you wanted to count them, and having it there alongside the rest makes users aware of them.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
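A compilable approximation of that interface for UTF-8, with the glyph level omitted (names follow the post; this is a sketch, not a Phobos type):

```d
import std.utf : decode, stride;

struct UnicodeRange
{
    string s;

    bool empty() { return s.length == 0; }

    char frontUnit() { return s[0]; }

    dchar frontPoint()
    {
        size_t i = 0;
        return decode(s, i); // decode one code point starting at i
    }

    void popFrontUnit() { s = s[1 .. $]; }

    void popFrontPoint() { s = s[stride(s, 0) .. $]; }
}

void main()
{
    auto r = UnicodeRange("h\u00E9llo");
    assert(r.frontUnit == 'h');
    r.popFrontUnit();
    assert(r.frontPoint == '\u00E9'); // decodes two code units
    r.popFrontPoint();                // skips both at once
    assert(r.frontUnit == 'l');
}
```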
Dec 31 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/31/11 10:47 AM, Michel Fortin wrote:
 On 2011-12-31 15:03:13 +0000, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 12/31/11 8:17 CST, Michel Fortin wrote:
 At one time Java and other frameworks started to use UTF-16 as
 if they were characters, that turned wrong on them. Now we know that not
 even code points should be considered characters, thanks to characters
 spanning on multiple code points. You might call it perfect, but for
 that you have made two assumptions:

 1. treating code points as characters is good enough, and
 2. the performance penalty of decoding everything is tolerable

I'm not sure how you concluded I drew such assumptions.

1: Because treating UTF-8 strings as a range of code point encourage people to think so. 2: From things you posted on the newsgroup previously. Sorry I don't have the references, but it'd take too long to dig them back.

That's sort of difficult to refute. Anyhow, I think it's great that algorithms can use types to go down to the representation if needed, and stay up at bidirectional range level otherwise.
 Ranges of code points might be perfect for you, but it's a tradeoff that
 won't work in every situations.

Ranges can be defined to span logical glyphs that span multiple code points.

I'm talking about the default interpretation, where string ranges are ranges of code units, making that tradeoff the default. And also, I think we can agree that a logical glyph range would be terribly inefficient in practice, although it could be a nice teaching tool.

Well people who want that could use byGlyph() or something. If you want glyphs, you gotta pay the price.
 The whole concept of generic algorithms working on strings efficiently
 doesn't work.

Apparently std.algorithm does.

First, it doesn't really work.

Oh yes it does.
 It seems to work fine, but it doesn't
 handle (yet) characters spanning multiple code points.

That's the job of std.range, not std.algorithm.
 To handle this
 case, you could use a logical glyph range, but that'd be quite
 inefficient. Or you can improve the algorithm working on code points so
 that it checks for combining characters on the edges, but then is it
 still a generic algorithm?

 Second, it doesn't work efficiently. Sure you can specialize the
 algorithm so it does not decode all code units when it's not necessary,
 but then does it still classify as a generic algorithm?

 My point is that *generic* algorithms cannot work *efficiently* with
 Unicode, not that they can't work at all. And even then, for the
 inneficient generic algorithm to work correctly with all input, the user
 need to choose the correct Unicode representation to for the problem at
 hand, which requires some general knowledge of Unicode.

 Which is why I'd just discourage generic algorithms for strings.

I think you are in a position that is defensible, but not generous and therefore undesirable. The military equivalent would be defending a fortified landfill drained by a sewer. You don't _want_ to be there. Taking your argument to its ultimate conclusion, we'd give up on genericity for strings and go home.

Strings are a variable-length encoding on top of an array. That is a relatively easy abstraction to model. Currently we don't have a dedicated model for that - we offer the encoded data as a bidirectional range and also the underlying array. Algorithms that work with bidirectional ranges work out of the box. Those that can use the representation gainfully can opportunistically specialize on isSomeString!R.

You contend that that doesn't "work", and I think you're wrong. But to the extent you have a case, an abstraction could be defined for variable-length encodings, and algorithms could be defined to work with that abstraction. I thought several times about that, but couldn't gather enough motivation for the simple reason that the current approach _works_.
 I'm not against making strings more opaque to encourage people to use
 the Unicode algorithms from the standard library instead of rolling
 their own.

I'd say we're discussing making the two kinds of manipulation (encoded sequence of logical character vs. array of code units) more distinguished from each other. That's a Good Thing(tm).

It's a good abstraction to show the theory of Unicode. But it's not the way to go if you want efficiency. For efficiency you need for each element in the string to use the lowest abstraction required to handle this element, so your algorithm needs to know about the various abstraction layers.

Correct.
 This is the kind of "range" I'd use to create algorithms dealing with
 Unicode properly:

 struct UnicodeRange(U)
 {
     U frontUnit() @property;
     dchar frontPoint() @property;
     immutable(U)[] frontGlyph() @property;

     void popFrontUnit();
     void popFrontPoint();
     void popFrontGlyph();

     ...
 }

We already have most of that. For a string s, s[0] is frontUnit, s.front is frontPoint, s = s[1 .. $] is popFrontUnit(), s.popFront() is popFrontPoint. We only need to define the glyph routines. But I think you'd be stopping short. You want generic variable-length encoding, not the above.
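That correspondence, spelled out (a sketch; front/popFront come from std.range and decode narrow strings):

```d
import std.range;

void main()
{
    string s = "h\u00E9llo";
    assert(s[0] == 'h');         // frontUnit
    assert(s.front == 'h');      // frontPoint
    s = s[1 .. $];               // popFrontUnit
    assert(s.front == '\u00E9'); // decoding front sees the whole 'é'
    s.popFront();                // popFrontPoint skips both units
    assert(s.front == 'l');
}
```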
 Not really a range per your definition of ranges, but basically it lets
 you intermix working with units, code points, and glyphs. Add a way to
 slice at the unit level and a way to know the length at the unit level
 and it's all I need to make an efficient parser, or any algorithm really.

Except for the glyphs implementation, we're already there. You are talking about existing capabilities!
 The problem with .raw is that it creates a separate range for the units.

That's the best part about it.
 This means you can't look at the frontUnit and then decide to pop the
 unit and then look at the next, decide you need to decode using
 frontPoint, then call popPoint and return to looking at the front unit.

Of course you can.

while (condition) {
    if (s.raw.front == someFrontUnitThatICareAbout) {
        s.raw.popFront();
        auto c = s.front;
        s.popFront();
    }
}

Now that I wrote it I'm even more enthralled with the coolness of the scheme. You essentially have access to two separate ranges on top of the same fabric.

Andrei
Dec 31 2011
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-12-31 18:56:01 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 12/31/11 10:47 AM, Michel Fortin wrote:
 It seems to work fine, but it doesn't
 handle (yet) characters spanning multiple code points.

That's the job of std.range, not std.algorithm.

As I keep saying, if you handle combining code points at the range level you'll have very inefficient code. But I think you get that.
 To handle this
 case, you could use a logical glyph range, but that'd be quite
 inefficient. Or you can improve the algorithm working on code points so
 that it checks for combining characters on the edges, but then is it
 still a generic algorithm?
 
 Second, it doesn't work efficiently. Sure you can specialize the
 algorithm so it does not decode all code units when it's not necessary,
 but then does it still classify as a generic algorithm?
 
 My point is that *generic* algorithms cannot work *efficiently* with
 Unicode, not that they can't work at all. And even then, for the
 inefficient generic algorithm to work correctly with all input, the user
 needs to choose the correct Unicode representation for the problem at
 hand, which requires some general knowledge of Unicode.
 
 Which is why I'd just discourage generic algorithms for strings.

I think you are in a position that is defensible, but not generous and therefore undesirable. The military equivalent would be defending a fortified landfill drained by a sewer. You don't _want_ to be there.

I don't get the analogy.
 Taking your argument to its ultimate conclusion is that we give up on 
 genericity for strings and go home.

That is more or less what I am saying. Genericity for strings leads to inefficient algorithms, and you don't want inefficient algorithms, at least not without being warned in advance. This is why for instance you give a special name to inefficient (linear) operations in std.container. In the same way, I think generic operations on strings should be disallowed unless you opt-in by explicitly saying on which representation you want the algorithm to perform its task.
 This is the kind of "range" I'd use to create algorithms dealing with
 Unicode properly:
 
 struct UnicodeRange(U)
 {
 U frontUnit()  property;
 dchar frontPoint()  property;
 immutable(U)[] frontGlyph()  property;
 
 void popFrontUnit();
 void popFrontPoint();
 void popFrontGlyph();
 
 ...
 }

We already have most of that. For a string s, s[0] is frontUnit, s.front is frontPoint, s = s[1 .. $] is popFrontUnit(), s.popFront() is popFrontPoint. We only need to define the glyph routines.

Indeed. I came with this concept when writing my XML parser, I defined frontUnit and popFrontUnit and used it all over the place (in conjunction with slicing). And I rarely needed to decode whole code points using front and popFront.
 But I think you'd be stopping short. You want generic variable-length 
 encoding, not the above.

Really? How'd that work?
 Except for the glyphs implementation, we're already there. You are 
 talking about existing capabilities!
 
 The problem with .raw is that it creates a separate range for the units.

That's the best part about it.

Depends. It should create a *linked* range, not a *separate* one, in the sense that if you advance the "raw" range with popFront, it should advance the underlying "code point" range too.
 This means you can't look at the frontUnit and then decide to pop the
 unit and then look at the next, decide you need to decode using
 frontPoint, then call popPoint and return to looking at the front unit.

Of course you can.

while (condition) {
    if (s.raw.front == someFrontUnitThatICareAbout) {
        s.raw.popFront();
        auto c = s.front;
        s.popFront();
    }
}

But will s.raw.popFront() also pop a single unit from s? "raw" would need to be defined as a reinterpret cast of the reference to the char[] to do what I want, something like this:

ref ubyte[] raw(ref char[] s) { return *cast(ubyte[]*)&s; }

The current std.string.representation doesn't do that at all. Also, how does it work with slicing? It can work with raw, but you'll have to cast things everywhere because raw is a ubyte[]:

string s = "éà";
s = cast(typeof(s))s.raw[0..4];
 Now that I wrote it I'm even more enthralled with the coolness of the 
 scheme. You essentially have access to two separate ranges on top of 
 the same fabric.

Glad you like the concept. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Dec 31 2011
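The byte arithmetic in that last slice is easy to check outside of D. A quick sketch in Python (not D, but the UTF-8 facts are identical): "éà" occupies 4 code units in UTF-8, so slicing the raw view at [0..4] keeps both characters, while cutting mid-character leaves invalid UTF-8.

```python
# UTF-8 view of "éà": each character is 2 code units, 4 bytes total.
s = "éà"
raw = s.encode("utf-8")          # the "raw" code-unit view
assert len(raw) == 4             # 4 code units for 2 code points

# Slicing on a code-point boundary round-trips cleanly...
assert raw[0:4].decode("utf-8") == "éà"
assert raw[0:2].decode("utf-8") == "é"

# ...but slicing mid-character leaves a dangling lead byte.
try:
    raw[0:3].decode("utf-8")
except UnicodeDecodeError:
    print("raw[0:3] is not valid UTF-8")
```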
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/31/11 2:44 PM, Michel Fortin wrote:
 But will s.raw.popFront() also pop a single unit from s? "raw" would
 need to be defined as a reinterpret cast of the reference to the char[]
 to do what I want, something like this:

 ref ubyte[] raw(ref char[] s) { return *cast(ubyte[]*)&s; }

 The current std.string.representation doesn't do that at all.

You just found a bug! Andrei
Dec 31 2011
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 07:56 PM, Andrei Alexandrescu wrote:
 On 12/31/11 10:47 AM, Michel Fortin wrote:
 This means you can't look at the frontUnit and then decide to pop the
 unit and then look at the next, decide you need to decode using
 frontPoint, then call popPoint and return to looking at the front unit.

Of course you can.

while (condition) {
    if (s.raw.front == someFrontUnitThatICareAbout) {
        s.raw.popFront();
        auto c = s.front;
        s.popFront();
    }
}

Now that I wrote it I'm even more enthralled with the coolness of the scheme. You essentially have access to two separate ranges on top of the same fabric.

Andrei

There is nothing wrong with the scheme on the conceptual level (except maybe that .raw.popFront() lets you invalidate the code point range). But making built-in arrays behave that way is like fitting a square peg in a round hole. immutable(char)[] is actually what .raw should return, not what it should be called on. It is already the raw representation.
Dec 31 2011
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 03:17 PM, Michel Fortin wrote:
 As for Walter being the only one coding by looking at the code units
 directly, that's not true. All my parser code look at code units
 directly and only decode to code points where necessary (just look at
 the XML parsing code I posted a while ago to get an idea to how it can
 apply to ranges). And I don't think it's because I've seen Walter code
 before, I think it is because I know how Unicode works and I want to
 make my parser efficient. I've done the same for a parser in C++ a while
 ago. I can hardly imagine I'm the only one (with Walter and you). I
 think this is how efficient algorithms dealing with Unicode should be
 written.

+1.
Dec 31 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/31/11 10:47 AM, Sean Kelly wrote:
 I don't know that Unicode expertise is really required here anyway.
 All one has to know is that UTF8 is a multibyte encoding and
 built-in string attributes talk in bytes. Knowing when one wants
 bytes vs characters isn't rocket science. That said, I'm on the fence
 about this change. It breaks consistency for a benefit I'm still
 weighing. With this change, the char type will still be a single
 byte, correct? What happens to foreach on strings?

Clearly this is a what-if debate. The best level of agreement we could ever reach is "well, it would've been nice... sigh". It's possible that we'll define a Rope type in std.container - a heavy-duty string type with small string optimization, interning, the works. That type may use insights we are deriving from this exchange. Andrei
Dec 31 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 08:06 PM, Andrei Alexandrescu wrote:
 On 12/31/11 10:47 AM, Sean Kelly wrote:
 I don't know that Unicode expertise is really required here anyway.
 All one has to know is that UTF8 is a multibyte encoding and
 built-in string attributes talk in bytes. Knowing when one wants
 bytes vs characters isn't rocket science. That said, I'm on the fence
 about this change. It breaks consistency for a benefit I'm still
 weighing. With this change, the char type will still be a single
 byte, correct? What happens to foreach on strings?

Clearly this is a what-if debate. The best level of agreement we could ever reach is "well, it would've been nice... sigh". It's possible that we'll define a Rope type in std.container - a heavy-duty string type with small string optimization, interning, the works. That type may use insights we are deriving from this exchange. Andrei

That would be great.
Dec 31 2011
prev sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-12-31 16:47:40 +0000, Sean Kelly <sean invisibleduck.org> said:

  I don't know that Unicode expertise is really required here anyway. All one
  has to know is that UTF8 is a multibyte encoding and built-in string
  attributes talk in bytes. Knowing when one wants bytes vs characters isn't
  rocket science.

It's not bytes vs. characters, it's code units vs. code points vs. user-perceived characters (grapheme clusters). One character can span multiple code points, and can be represented in various ways depending on which Unicode normalization you pick. But most people don't know that.

If you want to count the number of *characters*, counting code points isn't really it, as you should avoid counting the combining ones. If you want to search for a substring, you need to be sure both strings use the same normalization first, and if not, normalize them appropriately so that equivalent code point combinations are always represented the same.

That said, if you are implementing an XML or JSON parser, since those specs are defined in terms of code points you should probably write your code in terms of code points (hopefully without decoding code points when you don't need to). On the other hand, if you're writing something that processes text (like counting the average number of *characters* per word in a document), then you should be aware of combining characters.

How to pack all this into an easy to use package is most challenging.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Dec 31 2011
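The normalization pitfall described above is easy to demonstrate concretely. A sketch using Python's unicodedata (the same NFC/NFD forms exist in any Unicode library): the same user-perceived character can be one code point or two, and naive substring search fails across forms until both sides are normalized.

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "é")   # precomposed: one code point
nfd = unicodedata.normalize("NFD", "é")   # 'e' + combining acute: two code points
assert len(nfc) == 1 and len(nfd) == 2

# The two are canonically equivalent but not equal as code point sequences.
assert nfc != nfd

# Naive substring search fails across forms...
assert nfc not in "caf" + nfd
# ...and works once both sides use the same normalization.
assert unicodedata.normalize("NFD", nfc) in "caf" + nfd
```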
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 We need .raw and we must abolish .length and [] for narrow strings.

I don't know if we need them, but I agree those things are an improvement over the current state. To replace the disabled slicing I think something like Python's islice() will be useful.

Bye,
bearophile
Dec 31 2011
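For reference, a sketch of what such lazy slicing looks like in Python itself: itertools.islice walks the sequence of code points instead of indexing into it, so a multibyte character can never be cut in half the way a raw code-unit slice could.

```python
from itertools import islice

s = "héllo"  # 'é' is 2 UTF-8 code units, but one code point

# Lazy "slice" over code points: an O(n) walk, no random access needed.
middle = "".join(islice(iter(s), 1, 3))
assert middle == "él"

# Contrast with the code-unit view, where indices drift past 'é'.
assert len(s.encode("utf-8")) == 6  # 6 code units for 5 code points
```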
prev sibling next sibling parent Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Timon Gehr wrote:
 Me too. I think the way we have it now is optimal. The only reason we
 are discussing this is because of fear that uneducated users will write
 code that does not take into account Unicode characters above code point
 0x80.

+1 From D's string docs: "char[] strings are in UTF-8 format. wchar[] strings are in UTF-16 format. dchar[] strings are in UTF-32 format." I would additionally add some clarifications: char[] is an array of 8-bit code units. Unicode code point may take up to 4 chars. wchar[] is an array of 16-bit code units. Unicode code point may take up to 2 wchars. dchar[] is an array of 32-bit code units. Unicode code point always fits into one dchar. Each of these formats may encode any Unicode string. If you need indexing or slicing use: * char[] or string when working with ASCII code points. * wchar[] or wstring when working with Basic Multilingual Plane (BMP) code points. * dchar[] or dstring when working with all possible code points. If you do not need indexing or slicing you may use any of the formats.
Dec 31 2011
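The unit counts above can be verified directly. A sketch in Python (D's char/wchar/dchar correspond to UTF-8/UTF-16/UTF-32 code units): a character outside the BMP needs 4 chars, 2 wchars (a surrogate pair), and exactly 1 dchar.

```python
ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane

utf8_units  = len(ch.encode("utf-8"))            # 1 byte per code unit
utf16_units = len(ch.encode("utf-16-le")) // 2   # 2 bytes per code unit
utf32_units = len(ch.encode("utf-32-le")) // 4   # 4 bytes per code unit

assert utf8_units == 4   # up to 4 chars per code point
assert utf16_units == 2  # up to 2 wchars (a surrogate pair)
assert utf32_units == 1  # always exactly 1 dchar
```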
prev sibling parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 12/30/2011 02:55 PM, Timon Gehr wrote:
 On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length
 and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not
 char/wchar.
 Then, people would access the decoding routines on the needed
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now.
 There's
 no loss of functionality -- it's just stops you from accidentally doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

Maybe it could happen if we 1. make dstring the default strings type --

Inefficient.

But correct (enough).
 code units and characters would be the same

Wrong.

*sigh*, FINE. Code units and /code points/ would be the same.
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).

Inconsistent? How?

Inefficiency is a lot easier to deal with than incorrectness. If something is inefficient, then in the right places I will NOTICE. If something is incorrect, it can hide for years until that one person (or country, in this case) with a different usage pattern than the others uncovers it.
 so programmers could use the slices/indexing/length (no lazyness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.

What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these. Slicing is super useful for script-like coding.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?

1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.

How do you know they are only working with ASCII? They might be /now/. But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code?
 2. They get screwed up string output, look for the reason, patch up
 their code with some functions from std.utf and will never make the same
 mistakes again.
 

Except they don't. Because there are a lot of programmers that will never put in non-ascii strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ascii strings in. This could make some messes.
 
 I have *never* seen an user in D.learn complain about it. They might
 have been some I missed, but it is certainly not a prevalent problem.
 Also, just because an user can type .rep does not mean he understands
 Unicode: He is able to make just the same mistakes as before, even more
 so, as the array he is getting back has the _wrong element type_.
 

You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in. No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herds people into writing correct code to begin with, because they won't otherwise.

...

There's another issue at play here too: efficiency vs correctness as a default. Here's the tradeoff --

Option A: char[i] returns the i'th byte of the string as a (char) type.
Consequences:
(1) Code is efficient and INcorrect.
(2) It requires extra effort to write correct code.
(3) Detecting the incorrect code may take years, as these errors can hide easily.

Option B: char[i] returns the i'th codepoint of the string as a (dchar) type.
Consequences:
(1) Code is INefficient and correct.
(2) It requires extra effort to write efficient code.
(3) Detecting the inefficient code happens in minutes. It is VERY noticeable when your program runs too slowly.

This is how I see it. And I really like my correct code. If it's too slow (and I'll /know/ when it's too slow), then I'll profile->tweak->profile->etc until the slowness goes away. I'm totally digging option B.
Dec 31 2011
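The cost hidden inside Option B can be made concrete. A sketch (hypothetical helper, in Python rather than D): finding the i'th code point in a UTF-8 buffer requires a linear scan that skips continuation bytes, which is exactly why code-point indexing on a narrow string cannot be O(1).

```python
def codepoint_offset(raw: bytes, i: int) -> int:
    """Byte offset of the i'th code point: a linear scan over code units."""
    seen = 0
    for offset, unit in enumerate(raw):
        if (unit & 0xC0) != 0x80:  # not a continuation byte => starts a code point
            if seen == i:
                return offset
            seen += 1
    raise IndexError(i)

raw = "héllo".encode("utf-8")      # h = 1 byte, é = 2 bytes, then l, l, o
assert codepoint_offset(raw, 0) == 0   # 'h'
assert codepoint_offset(raw, 1) == 1   # 'é' starts right after 'h'
assert codepoint_offset(raw, 2) == 3   # 'l' comes after the 2-byte 'é'
```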
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 07:22 PM, Chad J wrote:
 On 12/30/2011 02:55 PM, Timon Gehr wrote:
 On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length
 and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not
 char/wchar.
 Then, people would access the decoding routines on the needed
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now.
 There's
 no loss of functionality -- it's just stops you from accidentally doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

Maybe it could happen if we 1. make dstring the default strings type --

Inefficient.

But correct (enough).
 code units and characters would be the same

Wrong.

*sigh*, FINE. Code units and /code points/ would be the same.

Relax.
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).

Inconsistent? How?

int[]
bool[]
float[]
char[]
 Inefficiency is a lot easier to deal with than incorrectness. If something
 is inefficient, then in the right places I will NOTICE.  If something is
 incorrect, it can hide for years until that one person (or country, in
 this case) with a different usage pattern than the others uncovers it.

 so programmers could use the slices/indexing/length (no lazyness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.

What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these. Slicing is super useful for script-like coding.

Except that the proposal would make slicing strings go away.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen? 1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.

How do you know they are only working with ASCII? They might be /now/. But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code?

Then they obviously need to fix the code, because the requirements have changed. Most of it will already work correctly though, because UTF-8 extends ASCII in a natural way.
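That "natural extension" is a concrete property of the encoding, easy to check in Python: ASCII text encodes to the identical bytes under UTF-8, and every byte of a multi-byte sequence has the high bit set, so ASCII-oriented code never mistakes part of a non-ASCII character for an ASCII one.

```python
# ASCII is a strict subset of UTF-8: same bytes, one code unit per character.
assert "hello, world".encode("utf-8") == b"hello, world"

# All code units of a multi-byte sequence have the high bit set,
# so they can never collide with an ASCII byte (< 0x80).
for unit in "éà€".encode("utf-8"):
    assert unit >= 0x80
```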
 2. They get screwed up string output, look for the reason, patch up
 their code with some functions from std.utf and will never make the same
 mistakes again.

Except they don't. Because there are a lot of programmers that will never put in non-ascii strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ascii strings in. This could make some messes.
 I have *never* seen an user in D.learn complain about it. They might
 have been some I missed, but it is certainly not a prevalent problem.
 Also, just because an user can type .rep does not mean he understands
 Unicode: He is able to make just the same mistakes as before, even more
 so, as the array he is getting back has the _wrong element type_.

You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in. No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herds people into writing correct code to begin with, because they won't otherwise.

There is no way to 'herd people into writing correct code' and UTF-8 is quite easy to deal with.
 ...

 There's another issue at play here too: efficiency vs correctness as a
 default.

 Here's the tradeoff --

 Option A:
 char[i] returns the i'th byte of the string as a (char) type.
 Consequences:
 (1) Code is efficient and INcorrect.

Do you have an example of impactful incorrect code resulting from those semantics?
 (2) It requires extra effort to write correct code.
 (3) Detecting the incorrect code may take years, as these errors can
 hide easily.

None of those is a direct consequence of char[i] returning char. They are the consequence of at least 3 things:

1. char[] is an array of char
2. immutable(char)[] is the default string type
3. the programmer does not know about 1. and/or 2.

I say, 1. is inevitable. You say 3. is inevitable. If we are both right, then 2. is the culprit.
 Option B:
 char[i] returns the i'th codepoint of the string as a (dchar) type.
 Consequences:
 (1) Code is INefficient and correct.

It is awfully optimistic to assume the code will be correct.
 (2) It requires extra effort to write efficient code.
 (3) Detecting the inefficient code happens in minutes.  It is VERY
 noticable when your program runs too slowly.

Except when in testing only small inputs are used, and only 2 years later maintainers throw your program at a larger problem instance and wonder why it does not terminate. Or your program is DOS'd. Polynomial blowup in runtime can be as large a problem in practice as a correctness bug.
 This is how I see it.

 And I really like my correct code.  If it's too slow, and I'll /know/
 when it's too slow, then I'll profile->tweak->profile->etc until the
 slowness goes away.  I'm totally digging option B.

Those kinds of inefficiencies build up and make the whole program run sluggish, and it will possibly be too late when you notice. Option B is not even on the table. This thread is about a breaking interface change and special casing T[] for T in {char, wchar}.
Dec 31 2011
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 12/31/2011 02:02 PM, Timon Gehr wrote:
 On 12/31/2011 07:22 PM, Chad J wrote:
 On 12/30/2011 02:55 PM, Timon Gehr wrote:
 On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this
 thread
 is abolition (through however slow a deprecation path) of
 s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length
 and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not
 char/wchar.
 Then, people would access the decoding routines on the needed
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now.
 There's
 no loss of functionality -- it's just stops you from accidentally
 doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

Maybe it could happen if we 1. make dstring the default strings type --

Inefficient.

But correct (enough).
 code units and characters would be the same

Wrong.

*sigh*, FINE. Code units and /code points/ would be the same.

Relax.

I'll do one better and ultra relax: http://www.youtube.com/watch?v=jimQoWXzc0Q ;)
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).

Inconsistent? How?

int[]
bool[]
float[]
char[]

I'll refer to another limb of this thread when foobar mentioned a mental model of strings as strings of letters. Now, given annoying corner cases, we probably can't get strings of /letters/, but I'd at least like to make it as far as code points. That seems very doable.

I mention this because I find that forwarding string.length and opIndex would be much more consistent with this mental model of strings as strings of unicode code points, which, IMO, is more important than it being binary consistent with the other things.

I'd much rather have char[] behave more like an array of code points than an array of bytes. I don't need an array of bytes. That's ubyte[]; I have that already.
 Inefficiency is a lot easier to deal with than incorrect.  If something
 is inefficient, then in the right places I will NOTICE.  If something is
 incorrect, it can hide for years until that one person (or country, in
 this case) with a different usage pattern than the others uncovers it.

 so programmers could use the slices/indexing/length (no lazyness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.

What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these. Slicing is super useful for script-like coding.

Except that the proposal would make slicing strings go away.

Yeah, Andrei's proposal says that. But I'm speaking of Joshua's:
 so programmers could use the slices/indexing/length ...




I kind-of like either, but I'd prefer Joshua's suggestion.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen? 1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.

How do you know they are only working with ASCII? They might be /now/. But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code?

Then they obviously need to fix the code, because the requirements have changed. Most of it will already work correctly though, because UTF-8 extends ASCII in a natural way.

Or, you know, we could design the language a little differently and make this become mostly a non-problem. That would be cool.
 2. They get screwed up string output, look for the reason, patch up
 their code with some functions from std.utf and will never make the same
 mistakes again.

Except they don't. Because there are a lot of programmers that will never put in non-ascii strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ascii strings in. This could make some messes.
 I have *never* seen an user in D.learn complain about it. They might
 have been some I missed, but it is certainly not a prevalent problem.
 Also, just because an user can type .rep does not mean he understands
 Unicode: He is able to make just the same mistakes as before, even more
 so, as the array he is getting back has the _wrong element type_.

You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in. No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herds people into writing correct code to begin with, because they won't otherwise.

There is no way to 'herd people into writing correct code' and UTF-8 is quite easy to deal with.

Probably not. I played fast and loose with this a lot in my early D code. Then this same conversation happened like ~3 years ago on this newsgroup. Then I learned more about unicode and had a bit of a bitter taste regarding char[] and how it handled indexing. I thought I could just index char[]s willy nilly. But no, I can't. And the compiler won't tell me. It just silently does what I don't want.

Maybe unicode is easy, but we sure as hell aren't born with it, and the language doesn't give beginners ANY red flags about this. I find myself pretty fortified against this issue due to having known about it before anything unpleasant happened, but I don't like the idea of others having to learn the hard way.
 ...

 There's another issue at play here too: efficiency vs correctness as a
 default.

 Here's the tradeoff --

 Option A:
 char[i] returns the i'th byte of the string as a (char) type.
 Consequences:
 (1) Code is efficient and INcorrect.

Do you have an example of impactful incorrect code resulting from those semantics?

Nope. Sorry. I learned about it before it had a chance to bite me. But this is only because I frequent(ed) the newsgroup and had a good throw on my dice roll.
 (2) It requires extra effort to write correct code.
 (3) Detecting the incorrect code may take years, as these errors can
 hide easily.

None of those is a direct consequence of char[i] returning char. They are the consequence of at least 3 things:

1. char[] is an array of char
2. immutable(char)[] is the default string type
3. the programmer does not know about 1. and/or 2.

I say, 1. is inevitable. You say 3. is inevitable. If we are both right, then 2. is the culprit.

I can get behind this. Honestly I'd like the default string type to be intelligent and optimize itself into whichever UTF-N encoding is optimal for content I throw into it. Maybe this means it should lazily expand itself to the narrowest character type that maintains a 1-to-1 ratio between code units and code points so that indexing/slicing remain O(1), or maybe it's a bag of disparate encodings, or maybe someone can think of a better strategy. Just make it /reasonably/ fast and help me with correctness as much as possible. If I need more performance or more unicode pedantics, I'll do my homework then and only then. Of course this is probably never going to happen I'm afraid. Even the problem of making such a (probably) struct work at compile time in templates as if it were a native type... agh, headaches.
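A rough sketch of such an adaptive type. To be clear, nothing like this exists in Phobos; AdaptiveString and everything in it are made up for illustration. It only shows the core trick: store the narrowest element type that keeps a 1-to-1 code unit/code point mapping, so indexing by code point stays O(1).

```d
import std.algorithm.searching : all;

// Hypothetical: picks the narrowest internal encoding with a
// 1-to-1 code unit/code point mapping, so opIndex is O(1).
struct AdaptiveString
{
    private ubyte[]  latin1; // used when all code points < 0x100
    private ushort[] bmp;    // used when all code points < 0x10000
    private uint[]   full;   // everything else

    this(const(dchar)[] text)
    {
        if (text.all!(c => c < 0x100))
            foreach (c; text) latin1 ~= cast(ubyte) c;
        else if (text.all!(c => c < 0x10000))
            foreach (c; text) bmp ~= cast(ushort) c;
        else
            foreach (c; text) full ~= c;
    }

    size_t length() const // exactly one array is non-empty
    {
        return latin1.length + bmp.length + full.length;
    }

    dchar opIndex(size_t i) const // O(1): one code unit per code point
    {
        if (latin1.length) return latin1[i];
        if (bmp.length)    return bmp[i];
        return full[i];
    }
}

void main()
{
    auto s = AdaptiveString("héllo"d);
    assert(s.length == 5);  // 5 code points, whatever the storage
    assert(s[1] == 'é');    // constant-time code point indexing
}
```

Whether the transcoding cost on construction is worth it is exactly the efficiency trade-off being argued about in this thread.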
 Option B:
 char[i] returns the i'th codepoint of the string as a (dchar) type.
 Consequences:
 (1) Code is INefficient and correct.

It is awfully optimistic to assume the code will be correct.
 (2) It requires extra effort to write efficient code.
 (3) Detecting the inefficient code happens in minutes.  It is VERY
 noticeable when your program runs too slowly.

Except when in testing only small inputs are used, and only two years later maintainers throw your program at a larger problem instance and wonder why it does not terminate. Or your program is DoS'd. Polynomial blowup in runtime can be just as large a problem in practice as a correctness bug.

I see what you mean there. I'm still not entirely happy with it though. I don't think these are reasonable requirements. It sounds like forced premature optimization to me. I have found myself in a number of places in different problem domains where optimality-is-correctness. Make it too slow and the program isn't worth writing. I can't imagine doing this for workloads I can't test on or anticipate though: I'd have to operate like NASA and make things 10x more expensive than they need to be. Correctness, on the other hand, can be easily (relatively speaking) obtained by only allowing the user to input data you can handle and then making sure the program can handle it as promised. Test, test, test, etc.
 This is how I see it.

 And I really like my correct code.  If it's too slow, and I'll /know/
 when it's too slow, then I'll profile->tweak->profile->etc until the
 slowness goes away.  I'm totally digging option B.

Those kinds of inefficiencies build up and make the whole program run sluggishly, and it will possibly be too late when you notice.

I get the feeling that the typical divide-and-conquer profiling strategy will find the more expensive operations /at least/ most of the time. Unfortunately, I have only experience to speak from on this matter.
 Option B is not even on the table. This thread is about a breaking
 interface change and special casing T[] for T in {char, wchar}.
 
 

Yeah, I know. I'm referring to what Joshua wrote, because I like option B. Even if it's academic, I'll say I like it anyway, if only for the sake of argument.
Dec 31 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/01/2012 02:34 AM, Chad J wrote:
 On 12/31/2011 02:02 PM, Timon Gehr wrote:
 On 12/31/2011 07:22 PM, Chad J wrote:
 On 12/30/2011 02:55 PM, Timon Gehr wrote:
 On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this
 thread
 is abolition (through however slow a deprecation path) of
 s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length
 and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not
 char/wchar.
 Then, people would access the decoding routines on the needed
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now.
 There's
 no loss of functionality -- it's just stops you from accidentally
 doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...
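For the record, the raw view discussed here is spelled std.string.representation in Phobos (it is mentioned later in this thread). A small sketch of how it behaves; the byte values are just the UTF-8 encoding of 'é':

```d
import std.string : representation;

void main()
{
    string s = "hé";                          // 'é' is two UTF-8 code units
    immutable(ubyte)[] raw = s.representation;
    assert(raw.length == 3);                  // 'h', 0xC3, 0xA9
    assert(raw[1] == 0xC3 && raw[2] == 0xA9); // the encoded 'é'
}
```

Note it hands back ubyte, not char, which is exactly the "right element type" point being argued above.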

Maybe it could happen if we 1. make dstring the default strings type --

Inefficient.

But correct (enough).
 code units and characters would be the same

Wrong.

*sigh*, FINE. Code units and /code points/ would be the same.

Relax.

I'll do one better and ultra relax: http://www.youtube.com/watch?v=jimQoWXzc0Q ;)
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).

Inconsistent? How?

int[]
bool[]
float[]
char[]

I'll refer to another limb of this thread when foobar mentioned a mental model of strings as strings of letters. Now, given annoying corner cases, we probably can't get strings of /letters/, but I'd at least like to make it as far as code points. That seems very doable. I mention this because I find that forwarding string.length and opIndex would be much more consistent with this mental model of strings as strings of unicode code points, which, IMO, is more important than it being binary consistent with the other things. I'd much rather have char[] behave more like an array of code points than an array of bytes. I don't need an array of bytes. That's ubyte[]; I have that already.

char[] is not an array of bytes: it is an array of UTF-8 code units.
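A quick demonstration of that distinction, assuming std.utf.count for the code point count:

```d
import std.utf : count;

void main()
{
    string s = "héllo";
    assert(s.length == 6);   // code units: 'é' takes two bytes in UTF-8
    assert(s.count == 5);    // code points, computed in O(n)
    assert(s[1] == 0xC3);    // the first byte of 'é', not 'é' itself
    foreach (dchar c; s) {}  // foreach can decode code units to code points
}
```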
 Inefficiency is a lot easier to deal with than incorrect.  If something
 is inefficient, then in the right places I will NOTICE.  If something is
 incorrect, it can hide for years until that one person (or country, in
 this case) with a different usage pattern than the others uncovers it.

 so programmers could use the slices/indexing/length (no lazyness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.

What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these. Slicing is super useful for script-like coding.

Except that the proposal would make slicing strings go away.

Yeah, Andrei's proposal says that. But I'm speaking of Joshua's:
 so programmers could use the slices/indexing/length ...




I kind-of like either, but I'd prefer Joshua's suggestion.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?

1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.

How do you know they are only working with ASCII? They might be /now/. But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code?

Then they obviously need to fix the code, because the requirements have changed. Most of it will already work correctly though, because UTF-8 extends ASCII in a natural way.

Or, you know, we could design the language a little differently and make this become mostly a non-problem. That would be cool.

It is imo already mostly a non-problem, but YMMV:

import std.stdio;

void main(){
    string s = readln();
    int nest = 0;
    foreach(x; s){ // iterates by code unit
        if(x=='(') nest++;
        else if(x==')' && --nest<0) goto unbalanced;
    }
    if(!nest){
        writeln("balanced parentheses");
        return;
    }
unbalanced:
    writeln("unbalanced parentheses");
}

That code is UTF aware, even though it does not explicitly deal with UTF. I'd claim it is like this most of the time.
 2. They get screwed up string output, look for the reason, patch up
 their code with some functions from std.utf and will never make the same
 mistakes again.

Except they don't. Because there are a lot of programmers that will never put in non-ascii strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ascii strings in. This could make some messes.
 I have *never* seen a user in D.learn complain about it. There might
 have been some I missed, but it is certainly not a prevalent problem.
 Also, just because a user can type .rep does not mean he understands
 Unicode: He is able to make just the same mistakes as before, even more
 so, as the array he is getting back has the _wrong element type_.

You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in. No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herds people into writing correct code to begin with, because they won't otherwise.

There is no way to 'herd people into writing correct code' and UTF-8 is quite easy to deal with.

Probably not. I played fast and loose with this a lot in my early D code. Then this same conversation happened like ~3 years ago on this newsgroup. Then I learned more about unicode and had a bit of a bitter taste regarding char[] and how it handled indexing. I thought I could just index char[]s willy nilly. But no, I can't. And the compiler won't tell me. It just silently does what I don't want.

How often do you actually need to get, for example, the 10th character of a string? I think it is a very uncommon operation. If the indexing is just part of an iteration that looks once at each char and handles some ASCII characters in certain ways, there is no potential correctness problem. As soon as code talks about non-ascii characters, it has to be UTF aware anyway.
 Maybe unicode is easy, but we sure as hell aren't born with it, and the
 language doesn't give beginners ANY red flags about this.

 I find myself pretty fortified against this issue due to having known
 about it before anything unpleasant happened, but I don't like the idea
 of others having to learn the hard way.

Hm, well. The first thing I looked up when I learned D supports Unicode is how Unicode/UTF work in detail. After that, the semantics of char[] were very clear to me.
 ...

 There's another issue at play here too: efficiency vs correctness as a
 default.

 Here's the tradeoff --

 Option A:
 char[i] returns the i'th byte of the string as a (char) type.
 Consequences:
 (1) Code is efficient and INcorrect.

Do you have an example of impactful incorrect code resulting from those semantics?

Nope. Sorry. I learned about it before it had a chance to bite me. But this is only because I frequent(ed) the newsgroup and had a good throw on my dice roll.

I might be wrong, but I somewhat have the impression we might be chasing phantoms here. I have so far never seen a bug in real world code caused by inadvertent misuse of D string indexing or slicing.
 (2) It requires extra effort to write correct code.
 (3) Detecting the incorrect code may take years, as these errors can
 hide easily.

None of those is a direct consequence of char[i] returning char. They are the consequence of at least 3 things:

1. char[] is an array of char
2. immutable(char)[] is the default string type
3. the programmer does not know about 1. and/or 2.

I say, 1. is inevitable. You say 3. is inevitable. If we are both right, then 2. is the culprit.

I can get behind this. Honestly I'd like the default string type to be intelligent and optimize itself into whichever UTF-N encoding is optimal for content I throw into it. Maybe this means it should lazily expand itself to the narrowest character type that maintains a 1-to-1 ratio between code units and code points so that indexing/slicing remain O(1), or maybe it's a bag of disparate encodings, or maybe someone can think of a better strategy. Just make it /reasonably/ fast and help me with correctness as much as possible. If I need more performance or more unicode pedantics, I'll do my homework then and only then. Of course this is probably never going to happen I'm afraid. Even the problem of making such a (probably) struct work at compile time in templates as if it were a native type... agh, headaches.
 Option B:
 char[i] returns the i'th codepoint of the string as a (dchar) type.
 Consequences:
 (1) Code is INefficient and correct.

It is awfully optimistic to assume the code will be correct.
 (2) It requires extra effort to write efficient code.
 (3) Detecting the inefficient code happens in minutes.  It is VERY
 noticeable when your program runs too slowly.

Except when in testing only small inputs are used, and only two years later maintainers throw your program at a larger problem instance and wonder why it does not terminate. Or your program is DoS'd. Polynomial blowup in runtime can be just as large a problem in practice as a correctness bug.

I see what you mean there. I'm still not entirely happy with it though. I don't think these are reasonable requirements. It sounds like forced premature optimization to me.

It is using a better algorithm that performs faster by a linear factor. I would be very leery of something that looks like a constant-time array indexing operation taking linear time. I think premature optimization is about writing near-optimal, hard-to-debug-and-maintain code that only gains some constant factors in parts of the code that are not performance critical.
 I have found myself in a number of places in different problem domains
 where optimality-is-correctness.  Make it too slow and the program isn't
 worth writing.  I can't imagine doing this for workloads I can't test on
 or anticipate though: I'd have to operate like NASA and make things 10x
 more expensive than they need to be.

 Correctness, on the other hand, can be easily (relatively speaking)
 obtained by only allowing the user to input data you can handle and then
 making sure the program can handle it as promised.  Test, test, test, etc.

 This is how I see it.

 And I really like my correct code.  If it's too slow, and I'll /know/
 when it's too slow, then I'll profile->tweak->profile->etc until the
 slowness goes away.  I'm totally digging option B.

Those kinds of inefficiencies build up and make the whole program run sluggishly, and it will possibly be too late when you notice.

I get the feeling that the typical divide-and-conquer profiling strategy will find the more expensive operations /at least/ most of the time. Unfortunately, I have only experience to speak from on this matter.

Yes, what I meant is that if the inefficiencies are spread out more or less uniformly, then fixing it all up might seem to be too much work and too much risk.
 Option B is not even on the table. This thread is about a breaking
 interface change and special casing T[] for T in {char, wchar}.

Yeah, I know. I'm referring to what Joshua wrote, because I like option B. Even if it's academic, I'll say I like it anyway, if only for the sake of argument.

OK.
Dec 31 2011
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 12/31/2011 09:17 PM, Timon Gehr wrote:
 On 01/01/2012 02:34 AM, Chad J wrote:
 On 12/31/2011 02:02 PM, Timon Gehr wrote:
 On 12/31/2011 07:22 PM, Chad J wrote:
 On 12/30/2011 02:55 PM, Timon Gehr wrote:
 On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Maybe it could happen if we
 1. make dstring the default strings type --

Inefficient.

But correct (enough).
 code units and characters would be the same

Wrong.

*sigh*, FINE. Code units and /code points/ would be the same.

Relax.

I'll do one better and ultra relax: http://www.youtube.com/watch?v=jimQoWXzc0Q ;)
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).

Inconsistent? How?

int[]
bool[]
float[]
char[]

I'll refer to another limb of this thread when foobar mentioned a mental model of strings as strings of letters. Now, given annoying corner cases, we probably can't get strings of /letters/, but I'd at least like to make it as far as code points. That seems very doable. I mention this because I find that forwarding string.length and opIndex would be much more consistent with this mental model of strings as strings of unicode code points, which, IMO, is more important than it being binary consistent with the other things. I'd much rather have char[] behave more like an array of code points than an array of bytes. I don't need an array of bytes. That's ubyte[]; I have that already.

char[] is not an array of bytes: it is an array of UTF-8 code units.

Meh, I'd still prefer it be an array of UTF-8 code /points/ represented by an array of bytes (which are the UTF-8 code units).
 Inefficiency is a lot easier to deal with than incorrect.  If something
 is inefficient, then in the right places I will NOTICE.  If
 something is
 incorrect, it can hide for years until that one person (or country, in
 this case) with a different usage pattern than the others uncovers it.

 so programmers could use the slices/indexing/length (no lazyness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.

What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these. Slicing is super useful for script-like coding.

Except that the proposal would make slicing strings go away.

Yeah, Andrei's proposal says that. But I'm speaking of Joshua's:
 so programmers could use the slices/indexing/length ...




I kind-of like either, but I'd prefer Joshua's suggestion.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?

1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.

How do you know they are only working with ASCII? They might be /now/. But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code?

Then they obviously need to fix the code, because the requirements have changed. Most of it will already work correctly though, because UTF-8 extends ASCII in a natural way.

Or, you know, we could design the language a little differently and make this become mostly a non-problem. That would be cool.

It is imo already mostly a non-problem, but YMMV:

import std.stdio;

void main(){
    string s = readln();
    int nest = 0;
    foreach(x; s){ // iterates by code unit
        if(x=='(') nest++;
        else if(x==')' && --nest<0) goto unbalanced;
    }
    if(!nest){
        writeln("balanced parentheses");
        return;
    }
unbalanced:
    writeln("unbalanced parentheses");
}

That code is UTF aware, even though it does not explicitly deal with UTF. I'd claim it is like this most of the time.

I'm willing to agree with this. I still don't like the possibility that folks encounter corner-cases in that not-most-of-the-time. I'm not going to rage-face too hard if this never changes though. There would be a number of other things more important to fix before this, IMO.
 2. They get screwed up string output, look for the reason, patch up
 their code with some functions from std.utf and will never make the
 same
 mistakes again.

Except they don't. Because there are a lot of programmers that will never put in non-ascii strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ascii strings in. This could make some messes.
 I have *never* seen a user in D.learn complain about it. There might
 have been some I missed, but it is certainly not a prevalent problem.
 Also, just because a user can type .rep does not mean he understands
 Unicode: He is able to make just the same mistakes as before, even more
 so, as the array he is getting back has the _wrong element type_.

You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in. No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herds people into writing correct code to begin with, because they won't otherwise.

There is no way to 'herd people into writing correct code' and UTF-8 is quite easy to deal with.

Probably not. I played fast and loose with this a lot in my early D code. Then this same conversation happened like ~3 years ago on this newsgroup. Then I learned more about unicode and had a bit of a bitter taste regarding char[] and how it handled indexing. I thought I could just index char[]s willy nilly. But no, I can't. And the compiler won't tell me. It just silently does what I don't want.

How often do you actually need to get, for example, the 10th character of a string? I think it is a very uncommon operation. If the indexing is just part of an iteration that looks once at each char and handles some ASCII characters in certain ways, there is no potential correctness problem. As soon as code talks about non-ascii characters, it has to be UTF aware anyway.

If you haven't been educated about unicode or how D handles it, you might write this:

char[] str;
... load str ...
for ( int i = 0; i < str.length; i++ )
{
    font.render(str[i]); // Ewww.
    ...
}

It'd be neat if that gave a compiler error, or just passed code points as dchar's. Maybe a compiler error is best in this light.
 Maybe unicode is easy, but we sure as hell aren't born with it, and the
 language doesn't give beginners ANY red flags about this.

 I find myself pretty fortified against this issue due to having known
 about it before anything unpleasant happened, but I don't like the idea
 of others having to learn the hard way.

Hm, well. The first thing I looked up when I learned D supports Unicode is how Unicode/UTF work in detail. After that, the semantics of char[] were very clear to me.
 ...

 There's another issue at play here too: efficiency vs correctness as a
 default.

 Here's the tradeoff --

 Option A:
 char[i] returns the i'th byte of the string as a (char) type.
 Consequences:
 (1) Code is efficient and INcorrect.

Do you have an example of impactful incorrect code resulting from those semantics?

Nope. Sorry. I learned about it before it had a chance to bite me. But this is only because I frequent(ed) the newsgroup and had a good throw on my dice roll.

I might be wrong, but I somewhat have the impression we might be chasing phantoms here. I have so far never seen a bug in real world code caused by inadvertent misuse of D string indexing or slicing.

Possibly.
 (2) It requires extra effort to write correct code.
 (3) Detecting the incorrect code may take years, as these errors can
 hide easily.

None of those is a direct consequence of char[i] returning char. They are the consequence of at least 3 things:

1. char[] is an array of char
2. immutable(char)[] is the default string type
3. the programmer does not know about 1. and/or 2.

I say, 1. is inevitable. You say 3. is inevitable. If we are both right, then 2. is the culprit.

I can get behind this. Honestly I'd like the default string type to be intelligent and optimize itself into whichever UTF-N encoding is optimal for content I throw into it. Maybe this means it should lazily expand itself to the narrowest character type that maintains a 1-to-1 ratio between code units and code points so that indexing/slicing remain O(1), or maybe it's a bag of disparate encodings, or maybe someone can think of a better strategy. Just make it /reasonably/ fast and help me with correctness as much as possible. If I need more performance or more unicode pedantics, I'll do my homework then and only then. Of course this is probably never going to happen I'm afraid. Even the problem of making such a (probably) struct work at compile time in templates as if it were a native type... agh, headaches.
 Option B:
 char[i] returns the i'th codepoint of the string as a (dchar) type.
 Consequences:
 (1) Code is INefficient and correct.

It is awfully optimistic to assume the code will be correct.
 (2) It requires extra effort to write efficient code.
 (3) Detecting the inefficient code happens in minutes.  It is VERY
 noticeable when your program runs too slowly.

Except when in testing only small inputs are used, and only two years later maintainers throw your program at a larger problem instance and wonder why it does not terminate. Or your program is DoS'd. Polynomial blowup in runtime can be just as large a problem in practice as a correctness bug.

I see what you mean there. I'm still not entirely happy with it though. I don't think these are reasonable requirements. It sounds like forced premature optimization to me.

It is using a better algorithm that performs faster by a linear factor. I would be very leery of something that looks like a constant-time array indexing operation taking linear time. I think premature optimization is about writing near-optimal, hard-to-debug-and-maintain code that only gains some constant factors in parts of the code that are not performance critical.

This wouldn't be the first data structure to require linear time indexing. I mean, linked lists exist. I do feel that heavy-duty optimization puts the onus on the programmer to know what to do. The programming language is responsible for merely making it possible, not for making it the default path. The latter is fairly impossible. Correctness, on the other hand, should involve some hand-holding. It's that notion of the language catching me when I fall. I think the language should (and can) help a lot with program correctness if designed right. D is already really good on these counts, and even helps quite a bit when optimization gets down-and-dirty.
 I have found myself in a number of places in different problem domains
 where optimality-is-correctness.  Make it too slow and the program isn't
 worth writing.  I can't imagine doing this for workloads I can't test on
 or anticipate though: I'd have to operate like NASA and make things 10x
 more expensive than they need to be.

 Correctness, on the other hand, can be easily (relatively speaking)
 obtained by only allowing the user to input data you can handle and then
 making sure the program can handle it as promised.  Test, test, test,
 etc.

 This is how I see it.

 And I really like my correct code.  If it's too slow, and I'll /know/
 when it's too slow, then I'll profile->tweak->profile->etc until the
 slowness goes away.  I'm totally digging option B.

Those kinds of inefficiencies build up and make the whole program run sluggishly, and it will possibly be too late when you notice.

I get the feeling that the typical divide-and-conquer profiling strategy will find the more expensive operations /at least/ most of the time. Unfortunately, I have only experience to speak from on this matter.

Yes, what I meant is that if the inefficiencies are spread out more or less uniformly, then fixing it all up might seem to be too much work and too much risk.

Ah, right. Because code refactoring tends to suck. I get you. This is, of course, still the same reason why I'd never want to have to go through my code and replace all of the "font.render(str[i]);" calls. Yeah, as of a number of years ago it won't happen to me, but it might get someone else.
 Option B is not even on the table. This thread is about a breaking
 interface change and special casing T[] for T in {char, wchar}.

Yeah, I know. I'm referring to what Joshua wrote, because I like option B. Even if it's academic, I'll say I like it anyway, if only for the sake of argument.

OK.

Dec 31 2011
next sibling parent a <a a.com> writes:
 Meh, I'd still prefer it be an array of UTF-8 code /points/ represented
 by an array of bytes (which are the UTF-8 code units).

By saying you want an array of code points you already define the representation. And if you want that, there already is dchar[]. You probably meant a range of code points represented by an array of code units. But such a range can't have opIndex, since opIndex implies a constant-time operation. If you want the nth element of the range, you can use std.range.drop or write your own nth() function.
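A sketch of that nth() on top of std.range.drop; drop steps through a narrow string one code point at a time, so the O(n) cost is explicit rather than hidden behind an index syntax:

```d
import std.range : drop, front;

/// n'th (0-based) code point of s; deliberately O(n).
dchar nth(string s, size_t n)
{
    return s.drop(n).front; // drop n code points, decode the next one
}

void main()
{
    assert("héllo".nth(0) == 'h');
    assert("héllo".nth(1) == 'é');
}
```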
Jan 01 2012
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<  str.length; i++ )
 {
      font.render(str[i]); // Ewww.
      ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?
Jan 01 2012
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 01/01/2012 07:59 AM, Timon Gehr wrote:
 On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<  str.length; i++ )
 {
      font.render(str[i]); // Ewww.
      ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this:

class Font
{
    ...

    /** Render the given code point at the current (x,y) cursor position. */
    void render( dchar c )
    {
        ...
    }
}

(Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.)

I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.
Jan 01 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/01/2012 04:13 PM, Chad J wrote:
 On 01/01/2012 07:59 AM, Timon Gehr wrote:
 On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<   str.length; i++ )
 {
       font.render(str[i]); // Ewww.
       ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.

I think the main issue here is that char implicitly converts to dchar: this is an implicit reinterpret-cast that is nonsensical if the character is outside the ASCII range.
Jan 01 2012
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 01/01/2012 10:39 AM, Timon Gehr wrote:
 On 01/01/2012 04:13 PM, Chad J wrote:
 On 01/01/2012 07:59 AM, Timon Gehr wrote:
 On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<   str.length; i++ )
 {
       font.render(str[i]); // Ewww.
       ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.

I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.

I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual chars and then start manipulating code units that way, and such an 8th-bit check could thwart those manipulations, but I would also counter that such low-level manipulations should be done on ubytes instead. I don't know how much this would help though. Seems like too little, too late. The bigger problem is that a char is being taken from a char[] and thereby loses its context as (potentially) being part of a larger code point.
Jan 01 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/01/2012 08:01 PM, Chad J wrote:
 On 01/01/2012 10:39 AM, Timon Gehr wrote:
 On 01/01/2012 04:13 PM, Chad J wrote:
 On 01/01/2012 07:59 AM, Timon Gehr wrote:
 On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<    str.length; i++ )
 {
        font.render(str[i]); // Ewww.
        ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.

I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.

I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual char's and then start manipulating code units that way, and such an 8th bit check could thwart those manipulations, but I would also counter that such low manipulations should be done on ubyte's instead. I don't know how much this would help though. Seems like too little, too late.

I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to.
 The bigger problem is that a char is being taken from a char[] and
 thereby loses its context as (potentially) being part of a larger
 codepoint.

If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. If it is not set, then it is a single ASCII character.
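The high-bit rule stated here can be checked mechanically; a quick Python sketch (the is_fragment name is mine):

```python
def is_fragment(unit: int) -> bool:
    """True if this UTF-8 code unit is part of a multi-unit code point."""
    return (unit & 0x80) != 0

pi = "π".encode("utf-8")                # 0xCF 0x80: one code point, two units
assert all(is_fragment(u) for u in pi)  # every unit has the high bit set
assert not is_fragment(ord("("))        # an ASCII unit stands alone
```

So a unit with a clear high bit is always a complete ASCII character, and one with the bit set never carries a character on its own.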
Jan 01 2012
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 01/01/2012 02:25 PM, Timon Gehr wrote:
 On 01/01/2012 08:01 PM, Chad J wrote:
 On 01/01/2012 10:39 AM, Timon Gehr wrote:
 On 01/01/2012 04:13 PM, Chad J wrote:
 On 01/01/2012 07:59 AM, Timon Gehr wrote:
 On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<    str.length; i++ )
 {
        font.render(str[i]); // Ewww.
        ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.

I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.

I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual char's and then start manipulating code units that way, and such an 8th bit check could thwart those manipulations, but I would also counter that such low manipulations should be done on ubyte's instead. I don't know how much this would help though. Seems like too little, too late.

I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to;

What of valid transfers of ASCII characters into dchar? Normally this is a widening operation, so I can see how it is permissible.
 The bigger problem is that a char is being taken from a char[] and
 thereby loses its context as (potentially) being part of a larger
 codepoint.

If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. If it is not set, then it is a single ASCII character.

See above. I think that assigning from a char[i] to another char[j] is probably safe. Similarly for slicing. These calculations tend to occur, I suspect, when the text is well-anchored. I believe your balanced parentheses example falls into this category: (repasted for reader convenience)

void main(){
    string s = readln();
    int nest = 0;
    foreach(x;s){ // iterates by code unit
        if(x=='(') nest++;
        else if(x==')' && --nest<0) goto unbalanced;
    }
    if(!nest){ writeln("balanced parentheses"); return; }
unbalanced:
    writeln("unbalanced parentheses");
}

With these observations in hand, I would consider the safety of operations to go like this:

char[i] = char[j];            // (Reasonably) Safe
char[i1..i2] = char[j1..j2];  // (Reasonably) Safe
char = char;                  // Safe
dchar = char;                 // Safe. Widening.
char = char[i];               // Not safe. Should error.
dchar = char[i];              // Not safe. Should error. (Corollary)
dchar = dchar[i];             // Safe.
char = char[i1..i2];          // Nonsensical; already an error.
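The reason the parenthesis scan stays correct under per-unit iteration is that no code unit of a multi-byte sequence can ever equal an ASCII byte. A Python port of the D example (my translation, not from the thread) makes that testable:

```python
def balanced(s: str) -> bool:
    nest = 0
    for u in s.encode("utf-8"):  # iterate by code unit, like foreach(x; s)
        if u == ord("("):
            nest += 1
        elif u == ord(")"):
            nest -= 1
            if nest < 0:
                return False     # the "unbalanced" early exit
    return nest == 0

# Multi-byte characters never produce a unit equal to '(' or ')',
# so scanning by unit is still correct on non-ASCII text:
assert balanced("(héllo (wörld))")
assert not balanced(")(")
```

Any per-unit scan that only compares against ASCII sentinels is "well-anchored" in this sense.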
Jan 01 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/02/2012 12:16 AM, Chad J wrote:
 On 01/01/2012 02:25 PM, Timon Gehr wrote:
 On 01/01/2012 08:01 PM, Chad J wrote:
 On 01/01/2012 10:39 AM, Timon Gehr wrote:
 On 01/01/2012 04:13 PM, Chad J wrote:
 On 01/01/2012 07:59 AM, Timon Gehr wrote:
 On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<     str.length; i++ )
 {
         font.render(str[i]); // Ewww.
         ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.

I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.

I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual char's and then start manipulating code units that way, and such an 8th bit check could thwart those manipulations, but I would also counter that such low manipulations should be done on ubyte's instead. I don't know how much this would help though. Seems like too little, too late.

I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to;

What of valid transfers of ASCII characters into dchar? Normally this is a widening operation, so I can see how it is permissible.
 The bigger problem is that a char is being taken from a char[] and
 thereby loses its context as (potentially) being part of a larger
 codepoint.

If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. If it is not set, then it is a single ASCII character.

See above. I think that assigning from a char[i] to another char[j] is probably safe. Similarly for slicing. These calculations tend to occur, I suspect, when the text is well-anchored. I believe your balanced parentheses example falls into this category: (repasted for reader convenience) void main(){ string s = readln(); int nest = 0; foreach(x;s){ // iterates by code unit if(x=='(') nest++; else if(x==')'&& --nest<0) goto unbalanced; } if(!nest){ writeln("balanced parentheses"); return; } unbalanced: writeln("unbalanced parentheses"); } With these observations in hand, I would consider the safety of operations to go like this: char[i] = char[j]; // (Reasonably) Safe char[i1..i2] = char[j1..j2]; // (Reasonably) Safe char = char; // Safe dchar = char // Safe. Widening. char = char[i]; // Not safe. Should error. dchar = char[i]; // Not safe. Should error. (Corollary) dchar = dchar[i]; // Safe. char = char[i1..i2]; // Nonsensical; already an error.

That is an interesting point of view. Your proposal would therefore be to constrain char to the ASCII range except if it is embedded in an array? It would break the balanced parentheses example.
Jan 01 2012
parent Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 01/01/2012 06:36 PM, Timon Gehr wrote:
 On 01/02/2012 12:16 AM, Chad J wrote:
 On 01/01/2012 02:25 PM, Timon Gehr wrote:
 On 01/01/2012 08:01 PM, Chad J wrote:
 On 01/01/2012 10:39 AM, Timon Gehr wrote:
 On 01/01/2012 04:13 PM, Chad J wrote:
 On 01/01/2012 07:59 AM, Timon Gehr wrote:
 On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<     str.length; i++ )
 {
         font.render(str[i]); // Ewww.
         ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.

I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.

I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual char's and then start manipulating code units that way, and such an 8th bit check could thwart those manipulations, but I would also counter that such low manipulations should be done on ubyte's instead. I don't know how much this would help though. Seems like too little, too late.

I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to;

What of valid transfers of ASCII characters into dchar? Normally this is a widening operation, so I can see how it is permissible.
 The bigger problem is that a char is being taken from a char[] and
 thereby loses its context as (potentially) being part of a larger
 codepoint.

If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. If it is not set, then it is a single ASCII character.

See above. I think that assigning from a char[i] to another char[j] is probably safe. Similarly for slicing. These calculations tend to occur, I suspect, when the text is well-anchored. I believe your balanced parentheses example falls into this category: (repasted for reader convenience) void main(){ string s = readln(); int nest = 0; foreach(x;s){ // iterates by code unit if(x=='(') nest++; else if(x==')'&& --nest<0) goto unbalanced; } if(!nest){ writeln("balanced parentheses"); return; } unbalanced: writeln("unbalanced parentheses"); } With these observations in hand, I would consider the safety of operations to go like this: char[i] = char[j]; // (Reasonably) Safe char[i1..i2] = char[j1..j2]; // (Reasonably) Safe char = char; // Safe dchar = char // Safe. Widening. char = char[i]; // Not safe. Should error. dchar = char[i]; // Not safe. Should error. (Corollary) dchar = dchar[i]; // Safe. char = char[i1..i2]; // Nonsensical; already an error.

That is an interesting point of view. Your proposal would therefore be to constrain char to the ASCII range except if it is embedded in an array? It would break the balanced parentheses example.

I just ran the example and wow, x didn't type-infer to dchar like I expected it to. I thought the comment might be wrong, but no, it is correct: x type-infers to char. I expected it to behave more like the old days before type inference showed up everywhere:

void main(){
    string s = readln();
    int nest = 0;
    foreach(dchar x;s){ // iterates by code POINT; notice the dchar.
        if(x=='(') nest++;
        else if(x==')' && --nest<0) goto unbalanced;
    }
    if(!nest){ writeln("balanced parentheses"); return; }
unbalanced:
    writeln("unbalanced parentheses");
}

This version wouldn't be broken. If the type inference changed, the other version wouldn't be broken either. This could break other things though. Bummer.
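The distinction Chad ran into, foreach(x; s) yielding code units while foreach(dchar x; s) yields code points, looks like this in Python terms (a conceptual sketch of my own):

```python
s = "né"
by_unit = list(s.encode("utf-8"))  # like foreach(char x; s): 3 elements
by_point = list(s)                 # like foreach(dchar x; s): 2 elements

assert len(by_unit) == 3           # 'é' contributes two code units
assert len(by_point) == 2
assert by_point[1] == "é"          # the code point survives intact
```

A scan that compares against non-ASCII characters needs the by-point form; the by-unit form only ever sees fragments of 'é'.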
Jan 01 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, December 30, 2011 20:55:42 Timon Gehr wrote:
 1. They don't notice. Then it is not a problem, because they are
 obviously only using ASCII characters and it is perfectly reasonable to
 assume that code units and characters are the same thing.

The problem is that what's more likely to happen in a lot of cases is that they use it wrong and don't notice, because they're only using ASCII in testing, _but_ they have bugs all over the place, because their code is actually used with unicode in the field. Yes, diligent programmers will generally find such problems, but with the current scheme, it's _so_ easy to use length when you shouldn't, that it's pretty much a guarantee that it's going to happen.

I'm not sure that Andrei's suggestion is the best one at this point, but I sure wouldn't be against it being introduced. It wouldn't entirely fix the problem by any means, but programmers would then have to work harder at screwing it up and so there would be fewer mistakes.

Arguably, the first issue with D strings is that we have char. In most languages, char is supposed to be a character, so many programmers will code with that expectation. If we had something like utf8unit, utf16unit, and utf32unit (arguably very bad, albeit descriptive, names) and no char, then it would force programmers to become semi-educated about the issues. There's no way that that's changing at this point though.

- Jonathan M Davis
Dec 30 2011
prev sibling next sibling parent Brad Anderson <eco gnuk.net> writes:

On Sat, Dec 31, 2011 at 12:09 AM, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 On 12/30/11 10:09 PM, Walter Bright wrote:

 On 12/30/2011 7:30 PM, Jonathan M Davis wrote:

 Yes, diligent programmers will generally find such problems, but with the
 current scheme, it's _so_ easy to use length when you shouldn't, that
 it's
 pretty much a guarantee that it's going to happen.

I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional.

The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF-16/UCS-2 morass: a vast majority of the time, the programmer may consider UTF-16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well, that all didn't go over all that well.

We need .raw and we must abolish .length and [] for narrow strings.

Andrei

I don't know that Phobos would be an appropriate place for it but offering some easy to access string data containing extensive and advanced unicode which users could easily add to their programs' unit tests may help people ensure proper unicode usage. Unicode seems to be one of those things where you either know it really well or you know just enough to get yourself in trouble, so having test data written by unicode experts could be very useful for the rest of us mortals.

I googled around a bit. This Stack Overflow question came up <http://stackoverflow.com/questions/6136800/unicode-test-strings-for-unit-tests> that recommends these:
 - UTF-8 stress test: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
 - Quick Brown Fox in a variety of languages: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt

I didn't see too much beyond those two.

Regards,
Brad A.
Dec 30 2011
prev sibling parent kenji hara <k.hara.pg gmail.com> writes:
2011/12/31 Walter Bright <newshound2 digitalmars.com>:
 On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:
 On 12/30/11 10:09 PM, Walter Bright wrote:
 I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8
 correctly, but it turned out that the naive version that used [i] and
 .length worked correctly. This is typical, not exceptional.

The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF16/UCS-2 morass: in a vast majority of the time the programmer may consider UTF16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well that all didn't go over all that well.

I'm not so sure it's quite the same. Java was designed before there were surrogate pairs; they kinda got the rug pulled out from under them. So, they simply have no decent way to deal with it. There isn't even a notion of a dchar character type. Java was designed with codeunit==codepoint; it is embedded in the design of the language, library, and culture.

This is not true of D. It's designed from the ground up to deal properly with UTF. D has very simple language features to deal with it.
 We need .raw and we must abolish .length and [] for narrow strings.

I don't believe that fixes anything and breaks every D project out there. We're chasing phantoms here, and I worry a lot about over-engineering trivia. And, we already have a type to deal with it: dstring

I fully agree with Walter. No need more wrapper for string. Kenji Hara
Dec 31 2011
prev sibling next sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 12/28/11 13:42, bearophile wrote:
 Peter Alexander:
 
 I often get into situations where I've written 
 a function that takes a string, and then I can't call it because all I 
 have is a char[].

I suggest you to show some of such situations.
 I think it's telling that most Phobos functions use 'const(char)[]' or 
 'in char[]' instead of 'string' for their arguments. The ones that use 
 'string' are usually using it unnecessarily and should be fixed to use 
 const(char)[].

What are the Phobos functions that unnecessarily accept a string?

e.g. things like std.demangle? (which wraps core.demangle, and that one accepts const(char)[]). IIRC e.g. the stdio functions taking file names want strings too; I never investigated whether they really need this, just .iduped the args... In general, a lot of things break when trying to switch to "proper" const(char)[] in apps, usually because the app itself used "string" instead of the const version, but fixing it up often also uncovers lib API issues.

artur
Dec 28 2011
prev sibling next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Wed, 28 Dec 2011 11:00:52 -0800, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string when they
 mean immutable, and the values underneath are not guaranteed to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice:

#1: string s;
#2: immutable(char)[] s;

sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum. Andrei

Would slicing, i.e. s[i..j] still be valid? If so, what would be the recommended way of finding i and j?
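On the question of finding valid i and j: a slice is only meaningful if both indices land on code point boundaries (in D, std.utf.stride reports how many code units the code point at a given index occupies). The boundary rule itself is simple enough to sketch in Python (the function name is mine):

```python
def stride(units: bytes, i: int) -> int:
    """Number of UTF-8 code units in the code point starting at index i."""
    u = units[i]
    if u < 0x80:
        return 1                       # ASCII
    if u >> 5 == 0b110:
        return 2
    if u >> 4 == 0b1110:
        return 3
    if u >> 3 == 0b11110:
        return 4
    raise ValueError("i does not point at a code point boundary")

b = "aπb".encode("utf-8")              # 0x61 0xCF 0x80 0x62
assert stride(b, 0) == 1
assert stride(b, 1) == 2               # a valid slice is b[1 : 1 + stride(b, 1)]
assert b[1:3].decode("utf-8") == "π"
```

Advancing i by stride(b, i) repeatedly visits exactly the code point boundaries, which is what any valid i..j pair must be built from.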
Dec 28 2011
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Wednesday, 28 December 2011 at 19:00:53 UTC, Andrei 
Alexandrescu wrote:
 On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string 
 when they
 mean immutable, and the values underneath are not guaranteed 
 to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice: #1: string s; #2: immutable(char)[] s; sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum. Andrei

That's a good idea which I wonder about its implementation strategy. ATM string is simply an alias of a char array, are you suggesting string should be a wrapper struct instead (like the one previously suggested by Steven)? I'm all for making string a properly encapsulated type.
Dec 28 2011
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei 
Alexandrescu wrote:
 On 12/28/11 1:18 PM, foobar wrote:
 That's a good idea which I wonder about its implementation 
 strategy.

Implementation would entail a change in the compiler. Andrei

Why? D should be plenty powerful to implement this without modifying the compiler. It sounds like you're suggesting that char[] will behave differently than other T[], which is a very poor idea IMO.
Dec 28 2011
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei 
Alexandrescu wrote:
 Implementation would entail a change in the compiler.

I don't think I agree. Wouldn't something like this work?

===
struct string
{
    immutable(char)[] rep;
    alias rep this;

    auto opAssign(immutable(char)[] rhs)
    {
        rep = rhs;
        return this;
    }

    this(immutable(char)[] rhs)
    {
        rep = rhs;
    }

    // disable these here so they aren't passed on to .rep
    @disable void opSlice() { assert(0); }
    @disable size_t length() { assert(0); }
}
===

I did some quick tests and the basics seemed ok:

/* paste impl from above */
import std.string : replace;

void main()
{
    string a = "test"; // works
    a = a.replace("test", "mang"); // works
    // a = a[0..1]; // correctly fails to compile
    assert(0, a); // works
}
Dec 28 2011
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Wednesday, 28 December 2011 at 19:38:53 UTC, Timon Gehr wrote:
[snip]
 I'm all for making string a properly encapsulated type.

In what way would the proposed change improve encapsulation, and why would it even be desirable for such a basic data structure?

I'm not sure what you're asking here. Are you asking what the benefits of encapsulation are? This topic was discussed to death more than once, and I'd suggest searching the NG archives for the details. Also, if you hadn't already, I'd suggest reading about Unicode and its levels of abstraction: code points, code units, graphemes, etc...
Dec 28 2011
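The code unit / code point distinction raised above can be seen directly in D; a minimal sketch using only Phobos (the grapheme level sits above both counts shown here):

```d
import std.range : walkLength;

void main()
{
    string s = "noël";         // "ë" occupies two UTF-8 code units
    assert(s.length == 5);     // code units (bytes)
    assert(s.walkLength == 4); // code points (auto-decoded dchars)
    // Graphemes are a third level: "e" + combining diaeresis would
    // still be one user-perceived character but two code points.
}
```

The gap between the two counts is exactly what the thread is arguing about: `.length` talks about the representation, while range traversal talks about the abstraction.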
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Wednesday, 28 December 2011 at 19:48:28 UTC, Adam D. Ruppe 
wrote:
 On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei 
 Alexandrescu wrote:
 Implementation would entail a change in the compiler.

 I don't think I agree. Wouldn't something like this work?

 ===
 struct string
 {
     immutable(char)[] rep;
     alias rep this;

     auto opAssign(immutable(char)[] rhs)
     {
         rep = rhs;
         return this;
     }

     this(immutable(char)[] rhs)
     {
         rep = rhs;
     }

     // disable these here so it isn't passed on to .rep
     @disable void opSlice() { assert(0); }
     @disable size_t length() { assert(0); }
 }
 ===

 I did some quick tests and the basics seemed ok:

 /* paste impl from above */
 import std.string : replace;

 void main()
 {
     string a = "test"; // works
     a = a.replace("test", "mang"); // works
     // a = a[0..1]; // correctly fails to compile
     assert(0, a); // works
 }

My thinking exactly. Of course we can't put "@disable" right away and should start with "deprecated" to allow for a proper migration period. I'd also like a transition of the string related functions to this type. The previous ones can remain as simple wrappers/aliases/whatever for backwards compatibility.
Dec 28 2011
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Wednesday, 28 December 2011 at 20:01:15 UTC, foobar wrote:
 I'd also like a transition of the string related functions to 
 this type. the previous ones can remain as simple 
 wrappers/aliases/whatever for backwards compatibility.

I actually like strings just the way they are... but if we had to change, I'm sure we can do a good job in the library relatively easily.
Dec 28 2011
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Wednesday, 28 December 2011 at 21:57:00 UTC, Andrei 
Alexandrescu wrote:
 On 12/28/11 1:48 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei 
 Alexandrescu wrote:
 On 12/28/11 1:18 PM, foobar wrote:
 That's a good idea which I wonder about its implementation 
 strategy.

Implementation would entail a change in the compiler. Andrei

Why? D should be plenty powerful to implement this without modifying the compiler. Sounds like you suggest that char[] will behave differently than other T[] which is a very poor idea IMO.

It's an awesome idea, but for an academic debate at best. Andrei

I don't follow you. You've suggested a change that I agree with. Adam provided a prototype string library type that accomplishes your specified goals without any changes to the compiler. What are we missing here? IF it boils down to changing the compiler or leaving the status-quo, I'm voting against the compiler change.
Dec 28 2011
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 29 December 2011 at 04:17:37 UTC, Andrei 
Alexandrescu wrote:
 If we have two facilities (string and e.g. String) we've lost. 
 We'd need to slowly change the built-in string type.

Have you actually tried to do it? Thanks to alias this, the custom string can be used with existing std.string functions and assignments from literals.

I suppose that technically there's two facilities: immutable(char)[] and string, but I don't see what difference that makes at all. string is just an alias. It could be changed to a struct with ease; you can do it in your own private module.

I really think you (you!) are underestimating D's current capabilities. (Again, I do not think this is a good move - I'm with Walter on it - but let's not sell the language short.)
Dec 28 2011
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 29 December 2011 at 05:37:00 UTC, Walter Bright 
wrote:
 I've seen the damage done in C++ with multiple string types. 
 Being able to convert from one to the other doesn't help much.

Note that I'm on your side here re strings, but you're underselling the D language too! These conversions are implicit both ways, and completely free. D structs can wrap other D types perfectly well. Check this out:

string a = "hello";
a = a.replace("h", "j");
assert(a == "jello");

this actually works, today, with a custom string type in the D language. Just define a struct string in your module. alias this does most the magic.

In C++, std::string and char* are very different.

===
#include<string>

void a(const char* str) {}

int main()
{
    std::string me = "lol"; // works
    a(me); // ...but this doesn't work
    return 0;
}
===

But, in D, that *does work*. A struct string can be used on a function that calls for a const(char)[]. It can be used for a function that calls for an immutable(char)[]. It can be used for a function that calls for a struct string. A string struct works exactly the same way as a string alias. Right down to the name!

It's not storeable in a variable typed char[] (or wchar[] nor dchar[]), but neither are D strings today.
Dec 28 2011
prev sibling next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Thursday, 29 December 2011 at 06:08:05 UTC, Andrei 
Alexandrescu wrote:
 On 12/28/11 11:36 PM, Walter Bright wrote:
 On 12/28/2011 8:32 PM, Adam D. Ruppe wrote:
 On Thursday, 29 December 2011 at 04:17:37 UTC, Andrei 
 Alexandrescu wrote:
 If we have two facilities (string and e.g. String) we've 
 lost. We'd
 need to
 slowly change the built-in string type.

Have you actually tried to do it?

I've seen the damage done in C++ with multiple string types. Being able to convert from one to the other doesn't help much.

This. The only solution is to explain to Walter that no other programmer in the world codes UTF like him. Really. I emulate that sometimes (learned from him) but I see code from hundreds of people day in and day out - it's never like his. Once we convince him, he'll be like "ah, I see what you mean. Requiring .rep is awesome. Let's do it."

Andrei

I don't think this is a problem you can solve without educating people. They will need to know a thing or two about how UTF works to know the performance implications of many of the "safe" ways to handle UTF strings.

Further, for much use of Unicode strings in D you can't get away with not knowing anything anyway, because D only abstracts up to code points, not graphemes. Imagine trying to explain to the unknowing programmer what is going on when an algorithm function broke his grapheme and he doesn't know the first thing about Unicode.

I'm not claiming to be an expert myself, but I believe D offers Unicode the right way as it is.
Dec 28 2011
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Wednesday, 28 December 2011 at 19:00:53 UTC, Andrei 
Alexandrescu wrote:
 On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string 
 when they
 mean immutable, and the values underneath are not guaranteed 
 to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice:

#1: string s;
#2: immutable(char)[] s;

sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation.

I think it would be simpler to just make dstring the default string type. dstring is simple and safe. People who want better memory usage can use UTF-8 at their own discretion.
Dec 29 2011
prev sibling next sibling parent Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
This is a great idea! In this case the default string will be a
random-access range, not a bidirectional range. Also, processing
dstring is faster than string, because no decoding needs to be done.
Processing power is more expensive than memory. utf-8 is valuable
only to pass it as an ASCII string (which is not too common) and to
store large chunks of it. Both these cases are much less common than
all the rest of string processing.

+1

On Thu, Dec 29, 2011 at 12:04 PM, Vladimir Panteleev
<vladimir thecybershadow.net> wrote:
 On Wednesday, 28 December 2011 at 19:00:53 UTC, Andrei Alexandrescu wrote:
 On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string when they
 mean immutable, and the values underneath are not guaranteed to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice: #1: string s; #2: immutable(char)[] s; sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation.

I think it would be simpler to just make dstring the default string type. dstring is simple and safe. People who want better memory usage can use UTF-8 at their own discretion.

-- Bye, Gor Gyolchanyan.
Dec 29 2011
prev sibling next sibling parent Derek <ddparnell bigpond.com> writes:
On Thu, 29 Dec 2011 16:36:59 +1100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 I've seen the damage done in C++ with multiple string types. Being able  
 to convert from one to the other doesn't help much.

I'm not quite sure about that last sentence. I suspect that the better way for applications to handle strings of characters would be to internally store and manipulate them as utf-32 (dchar[]) and only when doing I/O use the other utf forms. So converting from the different forms is very helpful.

-- 
Derek Parnell
Melbourne, Australia
Dec 29 2011
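The convert-at-the-I/O-boundary approach described above can be sketched with std.conv.to, which already converts between the UTF forms:

```d
import std.conv : to;

void main()
{
    string  u8  = "héllo";        // UTF-8: 6 code units for 5 code points
    dstring u32 = to!dstring(u8); // decode once for internal processing
    assert(u8.length  == 6);
    assert(u32.length == 5);      // one dchar per code point
    string back = to!string(u32); // re-encode only when doing I/O
    assert(back == u8);
}
```

In UTF-32 form, indexing and length are code-point exact, which is the convenience Derek is after; the cost is the one-time conversion plus the larger in-memory footprint.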
prev sibling next sibling parent Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
What if the string converted itself from utf-8 to utf-32 back and
forth as necessary (utf-8 for storing and utf-32 for processing):

struct String
{
public:
    bool encoded() @property const
    {
        return _encoded;
    }

    bool encoded(bool should) @property
    {
        if(should)
            if(!encoded)
            {
                _utf8 = to!string(_utf32);
                encoded = true;
            }
        else
            if(encoded)
            {
                _utf32 = to!dstring(_utf8);
                encoded = false;
            }
    }

    // Here goes the part where you get to use the string

private:
    bool _encoded;
    union
    {
        string _utf8;
        dstring _utf32;
    }
}

This has a lot of drawbacks and is purely a curiosity: the idea is to
express the encoding of a string as a property of the string, rather
than as a difference between separate string types.

On Thu, Dec 29, 2011 at 1:02 PM, Walter Bright
<newshound2 digitalmars.com> wrote:
 On 12/29/2011 12:12 AM, Gor Gyolchanyan wrote:
 This is a great idea! In this case the default string will be a
 random-access range, not a bidirectional range. Also, processing
 dstring is faster than string, because no decoding needs to be done.
 Processing power is more expensive than memory. utf-8 is valuable
 only to pass it as an ASCII string (which is not too common) and to
 store large chunks of it. Both these cases are much less common than
 all the rest of string processing.

dstring consumes 4x the memory, and this can easily cause perf degradations due to thrashing and poor cache locality.

-- Bye, Gor Gyolchanyan.
Dec 29 2011
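The 4x figure is easy to check; a small sketch for the ASCII case, where the difference is largest:

```d
void main()
{
    string  s = "hello";  // ASCII stays 1 byte per character in UTF-8
    dstring d = "hello"d; // UTF-32 is always 4 bytes per code point
    assert(s.length * char.sizeof  == 5);
    assert(d.length * dchar.sizeof == 20); // the 4x figure
}
```

For non-ASCII text the ratio shrinks (a 3-byte UTF-8 sequence still becomes one 4-byte dchar), but UTF-32 never wins on size.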
prev sibling next sibling parent Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
oops. I accidentally made a recursive call in the setter. scratch
that, it should change the attribute.

On Thu, Dec 29, 2011 at 6:58 PM, Gor Gyolchanyan
<gor.f.gyolchanyan gmail.com> wrote:
 What if the string converted itself from utf-8 to utf-32 back and
 forth as necessary (utf-8 for storing and utf-32 for processing):

 struct String
 {
 public:
     bool encoded() @property const
     {
         return _encoded;
     }

     bool encoded(bool should) @property
     {
         if(should)
             if(!encoded)
             {
                 _utf8 = to!string(_utf32);
                 encoded = true;
             }
         else
             if(encoded)
             {
                 _utf32 = to!dstring(_utf8);
                 encoded = false;
             }
     }

     // Here goes the part where you get to use the string

 private:
     bool _encoded;
     union
     {
         string _utf8;
         dstring _utf32;
     }
 }

 This has a lot of drawbacks and is purely a curiosity: the idea is to
 express the encoding of a string as a property of the string, rather
 than as a difference between separate string types.

 On Thu, Dec 29, 2011 at 1:02 PM, Walter Bright
 <newshound2 digitalmars.com> wrote:
 On 12/29/2011 12:12 AM, Gor Gyolchanyan wrote:
 This is a great idea! In this case the default string will be a
 random-access range, not a bidirectional range. Also, processing
 dstring is faster than string, because no decoding needs to be done.
 Processing power is more expensive than memory. utf-8 is valuable
 only to pass it as an ASCII string (which is not too common) and to
 store large chunks of it. Both these cases are much less common than
 all the rest of string processing.

 dstring consumes 4x the memory, and this can easily cause perf degradations
 due to thrashing and poor cache locality.

 --
 Bye,
 Gor Gyolchanyan.

-- 
Bye,
Gor Gyolchanyan.
Dec 29 2011
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 29 December 2011 at 06:09:17 UTC, Andrei 
Alexandrescu wrote:
 Nah, that still breaks a lotta code because people parameterize 
 on T[], use isSomeString/isSomeChar etc.

/* snip struct string */

import std.traits;

void tem(T)(T t) if(isSomeString!T) {}
void tem2(T : immutable(char)[])(T t) {}

string a = "test";
tem(a); // works
tem2(a); // works

It's the alias this magic again. (btw I also tried renaming struct string to struct STRING, and it still worked, so it wasn't just naming coincidence!)
Dec 29 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
Don't we already have String-like support with ranges?  I'm not sure I understand the point in having special behavior for char arrays.

Sent from my iPhone

On Dec 28, 2011, at 8:17 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 On 12/28/11 4:18 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 21:57:00 UTC, Andrei Alexandrescu wrote:
 On 12/28/11 1:48 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei Alexandrescu
 wrote:
 On 12/28/11 1:18 PM, foobar wrote:
 That's a good idea which I wonder about its implementation strategy.

 Implementation would entail a change in the compiler.

 Andrei

 Why? D should be plenty powerful to implement this without modifying the
 compiler. Sounds like you suggest that char[] will behave differently
 than other T[] which is a very poor idea IMO.

 It's an awesome idea, but for an academic debate at best.

 Andrei

 I don't follow you. You've suggested a change that I agree with. Adam provided a prototype string library type that accomplishes your specified goals without any changes to the compiler. What are we missing here? IF it boils down to changing the compiler or leaving the status-quo, I'm voting against the compiler change.

 If we have two facilities (string and e.g. String) we've lost. We'd need to slowly change the built-in string type.

 I discussed the matter with Walter. He completely disagrees, and sees the i=
 ently uses .length, indexing, and slicing in narrow strings.

 I know Walter's code, so I know where he's coming from. He understands UTF, masks, and ranges by heart. I've seen his code and indeed it's an amazing feat of minimal opportunistic on-demand decoding. So I know where he's coming from, but I also know next to nobody codes like him. A casual string user almost always writes string code (iteration, indexing) the wrong way and would be tremendously helped by a clean distinction between abstraction and representation.

 Nagonna happen.

 Andrei
Dec 29 2011
prev sibling next sibling parent "Regan Heath" <regan netmail.co.nz> writes:
On Thu, 29 Dec 2011 18:36:27 -0000, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not  
 char/wchar.
 Then, people would access the decoding routines on the needed  
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now. There's
 no loss of functionality -- it's just stops you from accidentally doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would  
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

+1 for this idea, however named.

R

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/
Dec 30 2011
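For illustration: what the proposed .rep/.raw property would expose can be approximated today with a cast. This is a sketch of the idea, not an existing Phobos facility:

```d
void main()
{
    string s = "héllo";
    // Roughly what .raw/.rep would return: code units as integers,
    // with no auto-decoding and no char/wchar element type.
    auto raw = cast(immutable(ubyte)[]) s;
    assert(raw.length == 6);                  // "é" is two code units
    assert(raw[1] == 0xC3 && raw[2] == 0xA9); // UTF-8 for U+00E9
}
```

The point of the proposal is precisely that this access would become the explicit, spelled-out path, while plain `s[i]` and `s.length` would no longer silently operate on the representation.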
prev sibling next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Friday, 30 December 2011 at 19:55:45 UTC, Timon Gehr wrote:
 I think the way we have it now is optimal. The only reason we 
 are discussing this is because of fear that uneducated users 
 will write code that does not take into account Unicode 
 characters above code point 0x80. But what is the worst thing 
 that can happen?

 1. They don't notice. Then it is not a problem, because they 
 are obviously only using ASCII characters and it is perfectly 
 reasonable to assume that code units and characters are the 
 same thing.

 2. They get screwed up string output, look for the reason, 
 patch up their code with some functions from std.utf and will 
 never make the same mistakes again.


 I have *never* seen an user in D.learn complain about it. They 
 might have been some I missed, but it is certainly not a 
 prevalent problem. Also, just because an user can type .rep 
 does not mean he understands Unicode: He is able to make just 
 the same mistakes as before, even more so, as the array he is 
 getting back has the _wrong element type_.

I strongly agree with this. It would be nice to have everything be simple, work correctly *and* efficiently at the same time, but I don't believe the proposed changes make a definite improvement. In the end, if you don't want to use the standard library or other UTF-aware string libraries, you'll have to know the basics of UTF to write the correct code. I too wish it was harder to write it incorrectly, but the current solution is simply the best one to appear yet.
Dec 30 2011
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 31 December 2011 at 15:03:13 UTC, Andrei 
Alexandrescu wrote:
 The whole concept of generic algorithms working on strings 
 efficiently
 doesn't work.

Apparently std.algorithm does.

According to my research[1], std.array.replace (which uses std.algorithm under the hood) can be at least 40% faster when there is a match and 70% faster when there isn't one. I don't think this is actually related to UTF, though. [1]: http://dump.thecybershadow.net/5cfb6713ce6628686c6aa8a23b15c99e/test.d
Dec 31 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
I'm not sure I understand what's wrong with length.  Of all the times I get a length in one sizable i18nalized app at work I can think of only one instance where I actually want the character count rather than the byte count. Is there some other reason I'm not aware of that length is undesirable?

Sent from my iPhone

On Dec 30, 2011, at 4:12 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 On 12/30/11 6:07 PM, Timon Gehr wrote:
 alias std.string.representation raw;

 I meant your implementation is incomplete.

 But the main point is that presence of representation/raw is not the issue. Putting in place the convention of using .raw is hardly useful within the context.

 Andrei
Dec 31 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
I don't know that Unicode expertise is really required here anyway.  All one has to know is that UTF8 is a multibyte encoding and built-in string attributes talk in bytes. Knowing when one wants bytes vs characters isn't rocket science. That said, I'm on the fence about this change. It breaks consistency for a benefit I'm still weighing. With this change, the char type will still be a single byte, correct?  What happens to foreach on strings?

Sent from my iPhone

On Dec 31, 2011, at 8:20 AM, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 12/31/2011 03:17 PM, Michel Fortin wrote:

 As for Walter being the only one coding by looking at the code units
 directly, that's not true. All my parser code look at code units
 directly and only decode to code points where necessary (just look at
 the XML parsing code I posted a while ago to get an idea to how it can
 apply to ranges). And I don't think it's because I've seen Walter code
 before, I think it is because I know how Unicode works and I want to
 make my parser efficient. I've done the same for a parser in C++ a while
 ago. I can hardly imagine I'm the only one (with Walter and you). I
 think this is how efficient algorithms dealing with Unicode should be
 written.

 +1.
Dec 31 2011
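As a point of reference for the foreach question above: in today's D, the declared element type selects between code units and decoded code points:

```d
void main()
{
    string s = "é"; // one code point, two UTF-8 code units
    int units, points;
    foreach (char c; s)  ++units;  // element type char: raw code units
    foreach (dchar c; s) ++points; // element type dchar: decoded on the fly
    assert(units == 2);
    assert(points == 1);
}
```

With no element type given, foreach over a string iterates code units, so the choice is already explicit today; the proposal would mainly change which spelling is the default.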
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
Sorry, I was simplifying. The distinction I was trying to make was between generic operations (in my experience the majority) vs. encoding-aware ones.

Sent from my iPhone

On Dec 31, 2011, at 12:48 PM, Michel Fortin <michel.fortin michelf.com> wrote:

 On 2011-12-31 16:47:40 +0000, Sean Kelly <sean invisibleduck.org> said:

 I don't know that Unicode expertise is really required here anyway.  All one has to know is that UTF8 is a multibyte encoding and built-in string attributes talk in bytes. Knowing when one wants bytes vs characters isn't rocket science.

 It's not bytes vs. characters, it's code units vs. code points vs. user perceived characters. A user perceived character can span multiple code points, and can be represented in various ways depending on which Unicode normalization you pick. But most people don't know that.

 If you want to count the number of *characters*, counting code points isn't enough. If you want to search for a substring, you need to be sure both strings use the same normalization first, and if not normalize them appropriately so that equivalent code point combinations are always represented the same.

 That said, if you are implementing an XML or JSON parser, since those specs are defined in term of code points you can work in term of code points (hopefully without decoding code points when you don't need to). On the other hand, if you're writing something that processes text (like counting the average number of *character* per word in a document), then you should be aware of combining characters.

 How to pack all this into an easy to use package is most challenging.

 -- 
 Michel Fortin
 michel.fortin michelf.com
 http://michelf.com/
Dec 31 2011
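The normalization pitfall described above can be sketched with std.uni.normalize (note: this function was added to Phobos after this thread; it is shown here only to illustrate the point):

```d
import std.uni : normalize, NFC;

void main()
{
    string precomposed = "\u00E9";  // é as a single code point
    string combining   = "e\u0301"; // e + combining acute accent
    assert(precomposed != combining);                // bytewise different
    assert(normalize!NFC(combining) == precomposed); // equal after NFC
}
```

A naive substring search comparing code units (or even code points) would miss the match unless both operands were normalized first.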
prev sibling next sibling parent Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
I agree, the string parameters are indeed irritating, but changing the
alias would bring much more pain than it would relieve.

On Wed, Dec 28, 2011 at 4:06 PM, Peter Alexander
<peter.alexander.au gmail.com> wrote:
 string is immutable(char)[]

 I rarely *ever* need an immutable string. What I usually need is
 const(char)[]. I'd say 99%+ of the time I need only a const string.

 This is quite irritating because "string" is the most convenient and
 intuitive thing to type. I often get into situations where I've written a
 function that takes a string, and then I can't call it because all I have is
 a char[]. I could copy the char[] into a new string, but that's expensive,
 and I'd rather I could just call the function.

 I think it's telling that most Phobos functions use 'const(char)[]' or 'in
 char[]' instead of 'string' for their arguments. The ones that use 'string'
 are usually using it unnecessarily and should be fixed to use const(char)[].

 In an ideal world I'd much prefer if string was an alias for const(char)[],
 but string literals were immutable(char)[]. It would require a little more
 effort when dealing with concurrency, but that's a price I would be willing
 to pay to make the string alias useful in function parameters.

-- Bye, Gor Gyolchanyan.
Dec 28 2011
prev sibling next sibling parent mta`chrono <chrono mta-international.net> writes:
I understand your intention. It was one of the main irritations when I
moved to D. Here is a function that unnecessarily uses string.

/**
 * replaces foo by bar within text.
 */
string replace(string text, string foo, string bar)
{
   // ...
}

The function is crap because it can't be called with mutable char[].
Okay, that's true. Therefore you suggested aliasing const(char)[]
instead of immutable(char)[] ???

But I think inout() is your man in this case. If I remember correctly, it
has been fixed recently.

I'm not quite sure if I got your point. So forgive me if I was wrong.
Dec 28 2011
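A minimal sketch of the inout() suggestion above, using a hypothetical firstWord helper (the same signature shape would fix the replace example, which is longer to spell out):

```d
// inout transfers the argument's qualifier to the result, so one
// signature serves char[], const(char)[] and immutable(char)[]:
inout(char)[] firstWord(inout(char)[] text)
{
    foreach (i, c; text)
        if (c == ' ')
            return text[0 .. i];
    return text;
}

void main()
{
    string s = "foo bar";          // immutable(char)[]
    char[] m = "foo bar".dup;      // mutable copy
    assert(firstWord(s) == "foo"); // result typed immutable(char)[]
    assert(firstWord(m) == "foo"); // result typed char[]
}
```

The caller keeps whatever qualifier it started with, so no copy is ever forced just to satisfy the parameter type.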
prev sibling next sibling parent reply =?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:
On 12/28/2011 04:06 AM, Peter Alexander wrote:
 string is immutable(char)[]

 I rarely *ever* need an immutable string. What I usually need is
 const(char)[]. I'd say 99%+ of the time I need only a const string.

 This is quite irritating because "string" is the most convenient and
 intuitive thing to type. I often get into situations where I've written
 a function that takes a string, and then I can't call it because all I
 have is a char[]. I could copy the char[] into a new string, but that's
 expensive, and I'd rather I could just call the function.

 I think it's telling that most Phobos functions use 'const(char)[]' or
 'in char[]' instead of 'string' for their arguments. The ones that use
 'string' are usually using it unnecessarily and should be fixed to use
 const(char)[].

 In an ideal world I'd much prefer if string was an alias for
 const(char)[], but string literals were immutable(char)[]. It would
 require a little more effort when dealing with concurrency, but that's a
 price I would be willing to pay to make the string alias useful in
 function parameters.

Agreed. I've talked about this in D.learn a number of times myself. Ali
Dec 28 2011
parent =?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:
On 12/28/2011 08:00 AM, Ali Çehreli wrote:
 Agreed. I've talked about this in D.learn a number of times myself.

After seeing others' comments that focus more on the alias, I need to clarify: I don't have an opinion on the alias itself. I agree with the subject line that function parameter lists should mostly have const(char)[] instead of string. Ali
Dec 28 2011
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 6:06 AM, Peter Alexander wrote:
 string is immutable(char)[]

 I rarely *ever* need an immutable string. What I usually need is
 const(char)[]. I'd say 99%+ of the time I need only a const string.

 This is quite irritating because "string" is the most convenient and
 intuitive thing to type. I often get into situations where I've written
 a function that takes a string, and then I can't call it because all I
 have is a char[]. I could copy the char[] into a new string, but that's
 expensive, and I'd rather I could just call the function.

 I think it's telling that most Phobos functions use 'const(char)[]' or
 'in char[]' instead of 'string' for their arguments. The ones that use
 'string' are usually using it unnecessarily and should be fixed to use
 const(char)[].

 In an ideal world I'd much prefer if string was an alias for
 const(char)[], but string literals were immutable(char)[]. It would
 require a little more effort when dealing with concurrency, but that's a
 price I would be willing to pay to make the string alias useful in
 function parameters.

I'm afraid you're wrong here. The current setup is very good, and much better than one in which "string" would be an alias for const(char)[].

The problem is escaping. A function that transitorily operates on a string indeed does not care about the origin of the string, but storing a string inside an object is a completely different deal. The setup

class Query
{
    string name;
    ...
}

is safe, minimizes data copying, and never causes surprises to anyone ("I set the name of my query and a little later it's all messed up!").

So immutable(char)[] is the best choice for a correct string abstraction compared against both char[] and const(char)[]. In fact it's in a way good that const(char)[] takes longer to type, because it also carries larger liabilities.

If you want to create a string out of a char[] or const(char)[], use std.conv.to or the unsafe assumeUnique.

Andrei
Dec 28 2011
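The two conversion routes mentioned in the last paragraph differ in cost: to!string copies the data, while assumeUnique transfers ownership in O(1) on the caller's promise that no other mutable reference survives. A small sketch:

```d
import std.exception : assumeUnique;

void main()
{
    char[] buf = "hello".dup;
    // Caller promises no other mutable reference to buf exists;
    // no copy is made, the array is simply retyped:
    string s = assumeUnique(buf);
    assert(s == "hello");
    assert(buf is null); // the ref overload clears the source reference
}
```

Clearing the source is what makes the "unique" promise checkable in practice: after the call, the only remaining reference is the immutable one.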
next sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 28/12/11 4:27 PM, Andrei Alexandrescu wrote:
 The problem is escaping. A function that transitorily operates on a
 string indeed does not care about the origin of the string, but storing
 a string inside an object is a completely different deal. The setup

 class Query
 {
 string name;
 ...
 }

 is safe, minimizes data copying, and never causes surprises to anyone
 ("I set the name of my query and a little later it's all messed up!").

 So immutable(char)[] is the best choice for a correct string abstraction
 compared against both char[] and const(char)[]. In fact it's in a way
 good that const(char)[] takes longer to type, because it also carries
 larger liabilities.

I don't follow your argument. You've said (paraphrasing) "If a function does A then X is best, but if a function does B then Y is best, so Y is best."

If a function needs to store the string then by all means it should use immutable(char)[]. However, this is a much rarer case than functions that simply use the string transitorily, as you put it.

Again, there are very, very few functions in Phobos that accept a string as an argument. The vast majority accept `const(char)[]` or `in char[]`. This speaks volumes about how useful the string alias is.
Dec 28 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 11:42 AM, Peter Alexander wrote:
 On 28/12/11 4:27 PM, Andrei Alexandrescu wrote:
 The problem is escaping. A function that transitorily operates on a
 string indeed does not care about the origin of the string, but storing
 a string inside an object is a completely different deal. The setup

 class Query
 {
 string name;
 ...
 }

 is safe, minimizes data copying, and never causes surprises to anyone
 ("I set the name of my query and a little later it's all messed up!").

 So immutable(char)[] is the best choice for a correct string abstraction
 compared against both char[] and const(char)[]. In fact it's in a way
 good that const(char)[] takes longer to type, because it also carries
 larger liabilities.

I don't follow your argument. You've said (paraphrasing) "If a function does A then X is best, but if a function does B then Y is best, so Y is best."

I'm saying (paraphrasing) "X is modularly bankrupt and unsafe, and Y is modular and safe, so Y is best".
 If a function needs to store the string then by all means it should use
 immutable(char)[]. However, this is a much rarer case than functions
 that simply use the string transitorily as you put it.

Rarity is a secondary concern to modularity and safety.
 Again, there are very, very few functions in Phobos that accept a string
 as an argument. The vast majority accept `const(char)[]` or `in char[]`.
 This speaks volumes about how useful the string alias is.

Phobos consists of many functions and few entity types. Application code is rife with entity types. I kindly suggest you reconsider your position; the current setup is indeed very solid.

Andrei
Dec 28 2011
prev sibling next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 28 December 2011 at 16:27:15 UTC, Andrei 
Alexandrescu wrote:
 So immutable(char)[] is the best choice for a correct string 
 abstraction compared against both char[] and const(char)[]. In 
 fact it's in a way good that const(char)[] takes longer to 
 type, because it also carries larger liabilities.

Also, 'in char[]', which is conceptually much safer, isn't that much longer to type. It would be cool if 'scope' was actually implemented apart from an optimization though.
Dec 28 2011
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, December 28, 2011 10:27:15 Andrei Alexandrescu wrote:
 I'm afraid you're wrong here. The current setup is very good, and much
 better than one in which "string" would be an alias for const(char)[].
 
 The problem is escaping. A function that transitorily operates on a
 string indeed does not care about the origin of the string, but storing
 a string inside an object is a completely different deal. The setup
 
 class Query
 {
      string name;
      ...
 }
 
 is safe, minimizes data copying, and never causes surprises to anyone
 ("I set the name of my query and a little later it's all messed up!").
 
 So immutable(char)[] is the best choice for a correct string abstraction
 compared against both char[] and const(char)[]. In fact it's in a way
 good that const(char)[] takes longer to type, because it also carries
 larger liabilities.
 
 If you want to create a string out of a char[] or const(char)[], use
 std.conv.to or the unsafe assumeUnique.

Agreed. And for a number of functions, taking const(char)[] would be worse, because they would have to dup or idup the string, whereas with immutable(char)[], they can safely slice it without worrying about its value changing.

I think that if we want to make it so that immutable(char)[] isn't forced as much, then we need to make proper use of templates (which also allow you to not force char over wchar or dchar) and inout - and perhaps in some cases, a templated function could allow you to indicate what type of character you want returned. But in general, string is by far the most useful and least likely to cause bugs with slicing. So, I think that string should remain immutable(char)[].

- Jonathan M Davis
Dec 28 2011
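The slice-without-dup point above can be sketched as follows (illustrative; firstWord is a hypothetical helper, not a Phobos function): a function taking string can return a slice of its argument with no copying, because immutability guarantees the slice can never change out from under the caller.

```d
// Illustrative sketch: a function taking string may return a zero-copy
// slice of its argument; immutability makes the slice safe to keep.
import std.string : indexOf;

string firstWord(string s) // hypothetical helper for this example
{
    auto i = s.indexOf(' ');
    return i < 0 ? s : s[0 .. i]; // zero-copy slice of immutable data
}

void main()
{
    string line = "SELECT name FROM users";
    auto w = firstWord(line);
    assert(w == "SELECT");
    // Had the parameter been const(char)[], the returned slice could be
    // a view into mutable data, and a cautious caller would .idup it.
}
```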
parent deadalnix <deadalnix gmail.com> writes:
Le 28/12/2011 21:43, Jonathan M Davis a écrit :
 On Wednesday, December 28, 2011 10:27:15 Andrei Alexandrescu wrote:
 I'm afraid you're wrong here. The current setup is very good, and much
 better than one in which "string" would be an alias for const(char)[].

 The problem is escaping. A function that transitorily operates on a
 string indeed does not care about the origin of the string, but storing
 a string inside an object is a completely different deal. The setup

 class Query
 {
       string name;
       ...
 }

 is safe, minimizes data copying, and never causes surprises to anyone
 ("I set the name of my query and a little later it's all messed up!").

 So immutable(char)[] is the best choice for a correct string abstraction
 compared against both char[] and const(char)[]. In fact it's in a way
 good that const(char)[] takes longer to type, because it also carries
 larger liabilities.

 If you want to create a string out of a char[] or const(char)[], use
 std.conv.to or the unsafe assumeUnique.

Agreed. And for a number of functions, taking const(char)[] would be worse, because they would have to dup or idup the string, whereas with immutable(char)[], they can safely slice it without worrying about its value changing.

Is inout a solution for the standard lib here? The user could idup if a string is needed from a const/mutable char[].
Dec 29 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, December 28, 2011 19:25:15 Jakob Ovrum wrote:
 Also, 'in char[]', which is conceptually much safer, isn't that
 much longer to type.
 
 It would be cool if 'scope' was actually implemented apart from
 an optimization though.

in char[] is _not_ safer than immutable(char)[]. In fact, it's _less_ safe. It's also far more restrictive. Many, many functions return a portion of the string that they are passed in. That slicing would be impossible with scope, and because in char[] makes no guarantees about the elements not changing after the function call, you'd often have to dup or idup it in order to avoid bugs.

immutable(char)[] avoids all of that. You can safely slice it without having to worry about duping it to avoid it changing out from under you.

- Jonathan M Davis
Dec 28 2011
prev sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 28 December 2011 at 20:49:54 UTC, Jonathan M Davis 
wrote:
 On Wednesday, December 28, 2011 19:25:15 Jakob Ovrum wrote:
 Also, 'in char[]', which is conceptually much safer, isn't that
 much longer to type.
 
 It would be cool if 'scope' was actually implemented apart from
 an optimization though.

in char[] is _not_ safer than immutable(char)[].

I didn't say it was. Please read more closely.
Dec 28 2011
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 4:06 AM, Peter Alexander wrote:
 I rarely *ever* need an immutable string. What I usually need is const(char)[].
 I'd say 99%+ of the time I need only a const string.

I have a very different experience with strings. I can't even remember a case where I wanted to modify an existing string (this includes all my C and C++ usage of strings). It's always assemble a string at one place, and then refer to that string ever after (and never modify it).

What immutable strings make possible is treating strings as if they were value types. Nearly every language I know of treats them as immutable except for C and C++.
Dec 28 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 11:11 AM, Walter Bright wrote:
 On 12/28/2011 4:06 AM, Peter Alexander wrote:
 I rarely *ever* need an immutable string. What I usually need is
 const(char)[].
 I'd say 99%+ of the time I need only a const string.

I have a very different experience with strings. I can't even remember a case where I wanted to modify an existing string (this includes all my C and C++ usage of strings). It's always assemble a string at one place, and then refer to that string ever after (and never modify it). What immutable strings make possible is treating strings as if they were value types. Nearly every language I know of treats them as immutable except for C and C++.

I remember the day at Kahili when we figured out that immutable(char)[] would just work as it needs to. It felt pretty awesome.

Andrei
Dec 28 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 06:40 PM, Andrei Alexandrescu wrote:
 On 12/28/11 11:11 AM, Walter Bright wrote:
 On 12/28/2011 4:06 AM, Peter Alexander wrote:
 I rarely *ever* need an immutable string. What I usually need is
 const(char)[].
 I'd say 99%+ of the time I need only a const string.

I have a very different experience with strings. I can't even remember a case where I wanted to modify an existing string (this includes all my C and C++ usage of strings). It's always assemble a string at one place, and then refer to that string ever after (and never modify it). What immutable strings make possible is treating strings as if they were value types. Nearly every language I know of treats them as immutable except for C and C++.

I remember the day at Kahili we figured immutable(char)[] will just work as it needs to. It felt pretty awesome. Andrei

I agree. But I am confused by the fact that you are suggesting it actually does not work as it needs to at other places in this thread.
Dec 28 2011
prev sibling next sibling parent Peter Alexander <peter.alexander.au gmail.com> writes:
On 28/12/11 5:11 PM, Walter Bright wrote:
 On 12/28/2011 4:06 AM, Peter Alexander wrote:
 I rarely *ever* need an immutable string. What I usually need is
 const(char)[].
 I'd say 99%+ of the time I need only a const string.

I have a very different experience with strings. I can't even remember a case where I wanted to modify an existing string (this includes all my C and C++ usage of strings). It's always assemble a string at one place, and then refer to that string ever after (and never modify it).

We can disagree on this, but I think the fact that Phobos rarely uses 'string' and instead uses 'const(char)[]' or 'in char[]' speaks louder than either of our experiences.
 What immutable strings make possible is treating strings as if they were
 value types. Nearly every language I know of treats them as immutable
 except for C and C++.

Yes, and I wouldn't want to remove that. Immutable strings are good, but requiring immutable strings when you don't need them is definitely not good. Phobos knows this, so it doesn't use string, which leads me to question what use the string alias is.
Dec 28 2011
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
Most common to me: buffer reuse. I'll read a line of a file into a buffer, operate on it, then read the next line into the same buffer. If references to the buffer may escape, it's obviously unsafe to cast to immutable.

Sent from my iPhone

On Dec 28, 2011, at 9:11 AM, Walter Bright <newshound2 digitalmars.com> wrote:

 On 12/28/2011 4:06 AM, Peter Alexander wrote:
 I rarely *ever* need an immutable string. What I usually need is const(char)[].
 I'd say 99%+ of the time I need only a const string.

 I have a very different experience with strings. I can't even remember a case where I wanted to modify an existing string (this includes all my C and C++ usage of strings). It's always assemble a string at one place, and then refer to that string ever after (and never modify it).

 What immutable strings make possible is treating strings as if they were value types. Nearly every language I know of treats them as immutable except for C and C++.
Dec 28 2011
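The buffer-reuse hazard Sean describes can be sketched like this (illustrative code; an in-memory array stands in for the file so the example is self-contained): refilling one char[] per line is fine, but a reference cast to immutable would silently change when the buffer is refilled, so keeping a line requires an explicit copy.

```d
// Illustrative sketch of the buffer-reuse pattern: one char[] buffer is
// refilled per "line"; casting it to immutable while references escape
// would break the immutability guarantee, so .idup is required to keep one.
void main()
{
    string[] lines = ["first", "second"]; // stand-in for file input
    char[] buf;
    string kept;

    foreach (line; lines)
    {
        buf.length = line.length;
        buf[] = line[];          // refill the same buffer

        // UNSAFE alternative: kept = cast(string) buf;
        // 'kept' would then silently change on the next refill.

        kept = buf.idup;         // safe: copy before treating as immutable
    }
    assert(kept == "second");
}
```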
prev sibling next sibling parent "Dejan Lekic" <dejan.lekic gmail.com> writes:
Peter, having string as immutable(char)[] was perhaps one of the 
best D2 decisions so far, in my humble opinion. I strongly 
disagree with you on this one.
Dec 28 2011
prev sibling next sibling parent Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
Having a mutable string is also a bad idea because its mutability takes the form of array element manipulation, but a string (except for dstring) is not semantically an array, and mutating its elements isn't safe.

On Wed, Dec 28, 2011 at 9:19 PM, Dejan Lekic <dejan.lekic gmail.com> wrote:
 Peter, having string as immutable(char)[] was perhaps one of the best D2
 decisions so far, in my humble opinion. I strongly disagree with you on this
 one.

-- Bye, Gor Gyolchanyan.
Dec 28 2011
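Gor's point about element mutation can be illustrated concretely (a sketch, not from the original post): char[] indexes UTF-8 code units, not characters, so overwriting a single element can corrupt a multi-byte sequence.

```d
// Illustrative sketch: mutating one element of a char[] can corrupt a
// multi-byte UTF-8 sequence, because char[] indexes code units.
import std.utf : validate, UTFException;
import std.exception : assertThrown;

void main()
{
    char[] s = "\xC3\xA9!".dup; // 'é' encoded as two UTF-8 code units
    assert(s.length == 3);      // 2 bytes for 'é' + 1 byte for '!'

    s[1] = 'x';                 // overwrite the continuation byte
    assertThrown!UTFException(validate(s)); // no longer valid UTF-8
}
```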
prev sibling next sibling parent reply so <so so.so> writes:
On Wed, 28 Dec 2011 14:06:06 +0200, Peter Alexander  
<peter.alexander.au gmail.com> wrote:

 string is immutable(char)[]

 I rarely *ever* need an immutable string. What I usually need is  
 const(char)[]. I'd say 99%+ of the time I need only a const string.

 This is quite irritating because "string" is the most convenient and  
 intuitive thing to type. I often get into situations where I've written  
 a function that takes a string, and then I can't call it because all I  
 have is a char[]. I could copy the char[] into a new string, but that's  
 expensive, and I'd rather I could just call the function.

 I think it's telling that most Phobos functions use 'const(char)[]' or  
 'in char[]' instead of 'string' for their arguments. The ones that use  
 'string' are usually using it unnecessarily and should be fixed to use  
 const(char)[].

 In an ideal world I'd much prefer if string was an alias for  
 const(char)[], but string literals were immutable(char)[]. It would  
 require a little more effort when dealing with concurrency, but that's a  
 price I would be willing to pay to make the string alias useful in  
 function parameters.

As you said, string is not a structure but an alias. Your argument is not against string, but against the functions that accept only string when you think they shouldn't. If you are sure a function could work on your char[] (but it won't accept it), that just shows we need to focus on the function rather than on string, no?
Dec 28 2011
parent mta`chrono <chrono mta-international.net> writes:
There are a lot of people suggesting changes to how string behaves. But remember, D is awesome compared to other languages for not wrapping string in a class or struct.

You can use string/char[] without losing your _nativeness_. Programmers targeting embedded systems are really happy because of this.

By the way, I don't want to blame anyone, but I think we have diverged from the original purpose of this topic: __"string is rarely useful as a function argument"__

I think he points out that choosing the _string_ type for function arguments is _wrong_ in most cases. And there isn't much use of inout in Phobos, as it was broken for a long time.
Dec 30 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, December 29, 2011 17:01:19 deadalnix wrote:
 Le 28/12/2011 21:43, Jonathan M Davis a écrit :
  Agreed. And for a number of functions, taking const(char)[] would be
  worse, because they would have to dup or idup the string, whereas with
  immutable(char)[], they can safely slice it without worrying about its
  value changing.

 Is inout a solution for the standard lib here?

 The user could idup if a string is needed from a const/mutable char[]

In some places, yes. Phobos doesn't use inout as much as it probably should, simply because it was only recently that inout was made to work properly. Regardless, you have to be careful about taking const(char)[], because there's a risk of forcing what could be an unnecessary idup. The best solution to that, however, depends on what exactly the function is doing. If it's simply slicing a portion of the string that's passed in and returning it, then inout is a great solution. On the other hand, if it actually needs an immutable(char)[] internally, then there's a good chance that it should just take a string. It depends on what the function is ultimately doing.

- Jonathan M Davis
Dec 29 2011
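The inout approach Jonathan describes can be sketched as follows (illustrative helper, not Phobos code): a single declaration serves mutable, const, and immutable arguments alike, and the returned slice carries the caller's qualifier, so no dup or idup is forced.

```d
// Illustrative sketch of inout: one slicing function whose result
// preserves the caller's qualifier - char[] in gives char[] out,
// string in gives string out - with no copying.
inout(char)[] firstN(inout(char)[] s, size_t n)
{
    return s[0 .. n]; // zero-copy slice; result qualifier matches s
}

void main()
{
    string s = "hello world";
    char[] m = "hello world".dup;

    string a = firstN(s, 5);   // stays immutable(char)[]
    char[] b = firstN(m, 5);   // stays char[]
    assert(a == "hello" && b == "hello");

    m[0] = 'j';                // mutating m affects b (a slice of m)...
    assert(b == "jello");
    assert(a == "hello");      // ...but never a, which is immutable
}
```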