
digitalmars.D - string is rarely useful as a function argument

reply Peter Alexander <peter.alexander.au gmail.com> writes:
string is immutable(char)[]

I rarely *ever* need an immutable string. What I usually need is 
const(char)[]. I'd say 99%+ of the time I need only a const string.

This is quite irritating because "string" is the most convenient and 
intuitive thing to type. I often get into situations where I've written 
a function that takes a string, and then I can't call it because all I 
have is a char[]. I could copy the char[] into a new string, but that's 
expensive, and I'd rather I could just call the function.

I think it's telling that most Phobos functions use 'const(char)[]' or 
'in char[]' instead of 'string' for their arguments. The ones that use 
'string' are usually using it unnecessarily and should be fixed to use 
const(char)[].

In an ideal world I'd much prefer if string was an alias for 
const(char)[], but string literals were immutable(char)[]. It would 
require a little more effort when dealing with concurrency, but that's a 
price I would be willing to pay to make the string alias useful in 
function parameters.
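The mismatch can be sketched in a few lines (the function names here are invented for illustration):

```d
import std.stdio : writeln;

// Hypothetical functions, for illustration only.
void takesString(string s) { writeln(s); }        // wants immutable(char)[]
void takesConst(const(char)[] s) { writeln(s); }  // accepts mutable or immutable

void main()
{
    char[] buf = "hello".dup;  // all I have is a char[]
    // takesString(buf);       // compile error: char[] doesn't convert to string
    takesString(buf.idup);     // works, but makes an expensive copy
    takesConst(buf);           // works with no copy
}
```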
Dec 28 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Peter Alexander:

 I often get into situations where I've written 
 a function that takes a string, and then I can't call it because all I 
 have is a char[].

I suggest you show some of these situations.
 I think it's telling that most Phobos functions use 'const(char)[]' or 
 'in char[]' instead of 'string' for their arguments. The ones that use 
 'string' are usually using it unnecessarily and should be fixed to use 
 const(char)[].

What are the Phobos functions that unnecessarily accept a string?

Bye,
bearophile
Dec 28 2011
next sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 28/12/11 12:42 PM, bearophile wrote:
 Peter Alexander:

 I often get into situations where I've written
 a function that takes a string, and then I can't call it because all I
 have is a char[].

I suggest you show some of these situations.

Any time you want to create a string without allocating memory.

char[N] buffer;
// write into buffer
// try to use buffer as string
 I think it's telling that most Phobos functions use 'const(char)[]' or
 'in char[]' instead of 'string' for their arguments. The ones that use
 'string' are usually using it unnecessarily and should be fixed to use
 const(char)[].

What are the Phobos functions that unnecessarily accept a string?

Good question. I can't see any just now, although I have come across some in the past. Perhaps they have already been fixed.
Dec 28 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Peter Alexander:

 Any time you want to create a string without allocating memory.
 
 char[N] buffer;
 // write into buffer
 // try to use buffer as string

I have discussed this topic a bit, two or three times. In one post I even suggested the idea of "scoped immutability", which was not appreciated. Generally, creating immutable data structures is a source of trouble in all languages, and in D it's not a well-solved problem yet. In D today you are sometimes able to rewrite that as:

string foo(in int n) pure {
    auto buffer = new char[n];
    // write into buffer
    return buffer;
}

void bar(string s) {}

void main() {
    string s = foo(5);
    bar(s); // use buffer as string
}

Bye,
bearophile
Dec 28 2011
parent Peter Alexander <peter.alexander.au gmail.com> writes:
On 28/12/11 1:27 PM, bearophile wrote:
 Peter Alexander:

 Any time you want to create a string without allocating memory.

 char[N] buffer;
 // write into buffer
 // try to use buffer as string

 I have discussed this topic a bit, two or three times. In one post I even suggested the idea of "scoped immutability", which was not appreciated. Generally, creating immutable data structures is a source of trouble in all languages, and in D it's not a well-solved problem yet. In D today you are sometimes able to rewrite that as:

 string foo(in int n) pure {
     auto buffer = new char[n];
     // write into buffer
     return buffer;
 }

 void bar(string s) {}

 void main() {
     string s = foo(5);
     bar(s); // use buffer as string
 }

 Bye,
 bearophile

That only works when you allocate memory for the string, which is what I would like to avoid.
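For reference, Phobos also offers std.exception.assumeUnique, which turns a char[] into a string without copying, but it doesn't help here either: it requires a uniquely-owned (in practice, freshly allocated) buffer, so a reusable stack buffer cannot be safely blessed. A minimal sketch:

```d
import std.exception : assumeUnique;

string buildGreeting()
{
    auto buf = new char[](5); // heap allocation is still required
    buf[] = "hello";          // fill the buffer in place
    return assumeUnique(buf); // no copy, but buf must never be touched again
}

void main()
{
    assert(buildGreeting() == "hello");
}
```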
Dec 28 2011
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 5:16 AM, Peter Alexander wrote:
 Any time you want to create a string without allocating memory.

 char[N] buffer;
 // write into buffer
 // try to use buffer as string

Is the buffer ever going to be reused with a different string in it?
Dec 28 2011
parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 28/12/11 5:16 PM, Walter Bright wrote:
 On 12/28/2011 5:16 AM, Peter Alexander wrote:
 Any time you want to create a string without allocating memory.

 char[N] buffer;
 // write into buffer
 // try to use buffer as string

Is the buffer ever going to be reused with a different string in it?

Possibly. I know what argument is coming next: "But if the function you call stores the string you passed in, then it can't rely on seeing a consistent value!"

I know this. These functions should request immutable(char)[] because that's what they need. Functions that don't store the string should use const(char)[].

The question is whether string should alias immutable(char)[] or const(char)[]. In my experience (which is echoed in Phobos), const(char)[] is used much more often than immutable(char)[], so string should alias const(char)[].
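The storing/inspecting distinction can be sketched like this (function names invented for illustration):

```d
// Stores its argument: immutable is genuinely needed so the kept
// reference can't change underneath it later.
string lastSeen;
void remember(string s) { lastSeen = s; }

// Merely inspects its argument: const suffices, and accepts
// char[], const(char)[], and string alike.
size_t countSpaces(const(char)[] s)
{
    size_t n;
    foreach (c; s)
        if (c == ' ')
            ++n;
    return n;
}

void main()
{
    char[] buf = "a b c".dup;
    assert(countSpaces(buf) == 2); // const parameter: no copy needed
    remember(buf.idup);            // storing: hand over an immutable copy
    assert(lastSeen == "a b c");
}
```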
Dec 28 2011
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 07:07 PM, Peter Alexander wrote:
 On 28/12/11 5:16 PM, Walter Bright wrote:
 On 12/28/2011 5:16 AM, Peter Alexander wrote:
 Any time you want to create a string without allocating memory.

 char[N] buffer;
 // write into buffer
 // try to use buffer as string

Is the buffer ever going to be reused with a different string in it?

 Possibly. I know what argument is coming next: "But if the function you call stores the string you passed in, then it can't rely on seeing a consistent value!"

 I know this. These functions should request immutable(char)[] because that's what they need. Functions that don't store the string should use const(char)[].

 The question is whether string should alias immutable(char)[] or const(char)[]. In my experience (which is echoed in Phobos), const(char)[] is used much more often than immutable(char)[], so string should alias const(char)[].

You are approximately saying (paraphrasing): "The question is whether a cow is a cow or an animal. In my experience (which is echoed at the farm down the valley) is that there are more animals than there are cows. So we should call all our animals cows."
Dec 28 2011
parent Peter Alexander <peter.alexander.au gmail.com> writes:
On 28/12/11 6:03 PM, Timon Gehr wrote:
 On 12/28/2011 07:07 PM, Peter Alexander wrote:
 On 28/12/11 5:16 PM, Walter Bright wrote:
 On 12/28/2011 5:16 AM, Peter Alexander wrote:
 Any time you want to create a string without allocating memory.

 char[N] buffer;
 // write into buffer
 // try to use buffer as string

Is the buffer ever going to be reused with a different string in it?

 Possibly. I know what argument is coming next: "But if the function you call stores the string you passed in, then it can't rely on seeing a consistent value!"

 I know this. These functions should request immutable(char)[] because that's what they need. Functions that don't store the string should use const(char)[].

 The question is whether string should alias immutable(char)[] or const(char)[]. In my experience (which is echoed in Phobos), const(char)[] is used much more often than immutable(char)[], so string should alias const(char)[].

 You are approximately saying (paraphrasing): "The question is whether a cow is a cow or an animal. In my experience (which is echoed at the farm down the valley) is that there are more animals than there are cows. So we should call all our animals cows."

No, I'm saying that people talk about animals more often than cows, so it should be easier and more intuitive to say "animal" than it is to say "cow". People can still call things cows if that is what they're talking about.
Dec 28 2011
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 10:07 AM, Peter Alexander wrote:
 On 28/12/11 5:16 PM, Walter Bright wrote:
 On 12/28/2011 5:16 AM, Peter Alexander wrote:
 Any time you want to create a string without allocating memory.

 char[N] buffer;
 // write into buffer
 // try to use buffer as string

Is the buffer ever going to be reused with a different string in it?

 Possibly. I know what argument is coming next: "But if the function you call stores the string you passed in, then it can't rely on seeing a consistent value!"

Exactly.
 I know this. These functions should request immutable(char)[] because that's
 what they need. Functions that don't store the string should use const(char)[].

 The question is whether string should alias immutable(char)[] or const(char)[].
 In my experience (which is echoed in Phobos) is that const(char)[] is used much
 more often than immutable(char)[], so it should alias const(char)[].

If such a change is made, then people will use const string when they mean immutable, and the values underneath are not guaranteed to be consistent.
Dec 28 2011
parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 28/12/11 6:15 PM, Walter Bright wrote:
 On 12/28/2011 10:07 AM, Peter Alexander wrote:
 The question is whether string should alias immutable(char)[] or
 const(char)[].
 In my experience (which is echoed in Phobos) is that const(char)[] is
 used much
 more often than immutable(char)[], so it should alias const(char)[].

If such a change is made, then people will use const string when they mean immutable, and the values underneath are not guaranteed to be consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.
Dec 28 2011
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string when they
 mean immutable, and the values underneath are not guaranteed to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice:

#1: string s;
#2: immutable(char)[] s;

sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.
Dec 28 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string when they
 mean immutable, and the values underneath are not guaranteed to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

 People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice:

 #1: string s;
 #2: immutable(char)[] s;

 sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

Yes. Contrary to the OP, I don't think it's fair to dismiss a valid concern by framing it as a user education issue. It has very often been aired in the olden days of C++, and never in a winning argument. (Right off the bat - auto_ptr.)

Andrei
Dec 28 2011
parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 10:56 AM, Andrei Alexandrescu wrote:
 Yes. Contrary to the OP, I don't think it's fair to dismiss a valid concern by
 framing it as a user education issue. It's has very often been aired in the
 olden days of C++, and never in a winning argument. (Right off the bat -
auto_ptr.)

And as Bruce Eckel discovered, even people who know better will deliberately pick the wrong method, because it's easier, and they justify it to themselves by saying they'll go back and fix it later. And of course that doesn't happen.

Bruce decided there was something fundamentally wrong with a feature when he found himself writing articles exhorting people to do X instead of Y, and then, in his own code, preferring to do the simpler Y.
Dec 28 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string when they
 mean immutable, and the values underneath are not guaranteed to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

 People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice:

 #1: string s;
 #2: immutable(char)[] s;

 sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

Oh, one more thing - one good thing that could come out of this thread is the abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length, and s.rep[i] instead of s[i], would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.

Then people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.

Andrei
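The .rep idea was never adopted; a rough user-level sketch of the shape Andrei describes (the type and all names are hypothetical, not an actual Phobos facility) might look like:

```d
import std.range : walkLength;

// Hypothetical wrapper illustrating the proposal.
struct NarrowString
{
    private immutable(char)[] data;

    // The raw representation is available, but only on explicit
    // request, and as bytes rather than chars.
    @property immutable(ubyte)[] rep() const
    {
        return cast(immutable(ubyte)[]) data;
    }

    // Anything character-like goes through decoding.
    @property size_t length() const
    {
        return data.walkLength; // code points, not code units
    }
}

void main()
{
    auto s = NarrowString("héllo"); // 'é' occupies two UTF-8 code units
    assert(s.rep.length == 6);      // code units (bytes)
    assert(s.length == 5);          // code points
}
```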
Dec 28 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 1:17 PM, Robert Jacques wrote:
 Would slicing, i.e. s[i..j] still be valid?

No, only s.rep[i .. j].
 If so, what would be the
 recommended way of finding i and j?

find, findSplit, etc. from std.algorithm; std.utf functions; etc.

Andrei
Dec 28 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 08:29 PM, Andrei Alexandrescu wrote:
 On 12/28/11 1:17 PM, Robert Jacques wrote:
 Would slicing, i.e. s[i..j] still be valid?

No, only s.rep[i .. j].

That does not do the right thing. It would look more like cast(string)s.rep[i .. j].
 If so, what would be the
 recommended way of finding i and j?

find, findSplit etc. from std.algorithm, std.utf functions etc. Andrei

Dec 28 2011
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 1:18 PM, foobar wrote:
 That's a good idea, though I wonder about its implementation strategy.

Implementation would entail a change in the compiler. Andrei
Dec 28 2011
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 08:30 PM, Andrei Alexandrescu wrote:
 On 12/28/11 1:18 PM, foobar wrote:
 That's a good idea, though I wonder about its implementation strategy.

Implementation would entail a change in the compiler. Andrei

Special-casing char[] and wchar[] in the language would be extremely ugly and inconsistent, and would break nearly every D program. And for me, it would cripple D's strings quite a lot. Why do you think it is worthwhile?
Dec 28 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 1:48 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei Alexandrescu wrote:
 On 12/28/11 1:18 PM, foobar wrote:
 That's a good idea, though I wonder about its implementation strategy.

Implementation would entail a change in the compiler. Andrei

Why? D should be plenty powerful to implement this without modifying the compiler. It sounds like you are suggesting that char[] should behave differently from other T[], which is a very poor idea IMO.

It's an awesome idea, but for an academic debate at best. Andrei
Dec 28 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 4:18 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 21:57:00 UTC, Andrei Alexandrescu wrote:
 On 12/28/11 1:48 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei Alexandrescu
 wrote:
 On 12/28/11 1:18 PM, foobar wrote:
 That's a good idea, though I wonder about its implementation strategy.

Implementation would entail a change in the compiler. Andrei

Why? D should be plenty powerful to implement this without modifying the compiler. Sounds like you suggest that char[] will behave differently than other T[] which is a very poor idea IMO.

It's an awesome idea, but for an academic debate at best. Andrei

I don't follow you. You've suggested a change that I agree with. Adam provided a prototype string library type that accomplishes your specified goals without any changes to the compiler. What are we missing here? If it boils down to changing the compiler or leaving the status quo, I'm voting against the compiler change.

If we have two facilities (string and e.g. String) we've lost. We'd need to slowly change the built-in string type.

I discussed the matter with Walter. He completely disagrees, and sees the idea as a sheer way to complicate stuff for no good reason. He mentions how he frequently uses .length, indexing, and slicing in narrow strings. I know Walter's code, so I know where he's coming from. He understands UTF in and out, and I have zero doubt he actually knows all the essential constants, masks, and ranges by heart. I've seen his code, and indeed it's an amazing feat of minimal, opportunistic, on-demand decoding.

So I know where he's coming from, but I also know next to nobody codes like him. A casual string user almost always writes string code (iteration, indexing) the wrong way and would be tremendously helped by a clean distinction between abstraction and representation.

Nagonna happen.

Andrei
Dec 28 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 8:32 PM, Adam D. Ruppe wrote:
 On Thursday, 29 December 2011 at 04:17:37 UTC, Andrei Alexandrescu wrote:
 If we have two facilities (string and e.g. String) we've lost. We'd need to
 slowly change the built-in string type.

Have you actually tried to do it?

I've seen the damage done in C++ with multiple string types. Being able to convert from one to the other doesn't help much.
Dec 28 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 11:36 PM, Walter Bright wrote:
 On 12/28/2011 8:32 PM, Adam D. Ruppe wrote:
 On Thursday, 29 December 2011 at 04:17:37 UTC, Andrei Alexandrescu wrote:
 If we have two facilities (string and e.g. String) we've lost. We'd
 need to
 slowly change the built-in string type.

Have you actually tried to do it?

I've seen the damage done in C++ with multiple string types. Being able to convert from one to the other doesn't help much.

This. The only solution is to explain to Walter that no other programmer in the world codes UTF like him. Really. I emulate that sometimes (learned from him), but I see code from hundreds of people day in and day out - it's never like his.

Once we convince him, he'll be like "ah, I see what you mean. Requiring .rep is awesome. Let's do it."

Andrei
Dec 28 2011
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, December 29, 2011 07:33:28 Jakob Ovrum wrote:
 I don't think this is a problem you can solve without educating
 people. They will need to know a thing or two about how UTF works
 to know the performance implications of many of the "safe" ways
 to handle UTF strings. Further, for much use of Unicode strings
 in D you can't get away with not knowing anything anyway because
 D only abstracts up to code points, not graphemes. Imagine trying
 to explain to the unknowing programmer what is going on when an
 algorithm function broke his grapheme and he doesn't know the
 first thing about Unicode.
 
 I'm not claiming to be an expert myself, but I believe D offers
 Unicode the right way as it is.

Ultimately, the programmer _does_ need to understand Unicode properly if they're going to write code which is both correct and efficient. However, if the easy way to use strings in D is correct, even if it's not as efficient as we'd like, then at least code will tend to be correct in its use of Unicode. And then, if the programmer wants their string processing to be more efficient, they need to actually learn how Unicode works so that they can code for it more efficiently.

The issue, however, is that it's currently _way_ too easy to use strings completely incorrectly and operate on code units as if they were characters. A _lot_ of programmers will be using string and char[] as if a char were a character, and that's going to create a lot of bugs. Making it harder to operate on a char[] or string as if it were an array of characters would seriously reduce such bugs, and on some level would force people to become better educated about Unicode.

No, it doesn't completely solve the problem, since then we're operating at the code point level rather than the grapheme level, but it's still a _lot_ better than operating on the code unit level, as is likely to happen now.

- Jonathan M Davis
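A short example of the trap being described:

```d
import std.range : walkLength;

void main()
{
    string s = "weiß";          // 'ß' encodes as two UTF-8 code units
    assert(s.length == 5);      // code units, NOT characters
    assert(s.walkLength == 4);  // code points: what a naive reader expects
    assert(s[3] != 'ß');        // indexing yields a code unit, not the 'ß'
}
```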
Dec 28 2011
parent deadalnix <deadalnix gmail.com> writes:
On 29/12/2011 07:48, Jonathan M Davis wrote:
 On Thursday, December 29, 2011 07:33:28 Jakob Ovrum wrote:
 I don't think this is a problem you can solve without educating
 people. They will need to know a thing or two about how UTF works
 to know the performance implications of many of the "safe" ways
 to handle UTF strings. Further, for much use of Unicode strings
 in D you can't get away with not knowing anything anyway because
 D only abstracts up to code points, not graphemes. Imagine trying
 to explain to the unknowing programmer what is going on when an
 algorithm function broke his grapheme and he doesn't know the
 first thing about Unicode.

 I'm not claiming to be an expert myself, but I believe D offers
 Unicode the right way as it is.

 Ultimately, the programmer _does_ need to understand Unicode properly if they're going to write code which is both correct and efficient. However, if the easy way to use strings in D is correct, even if it's not as efficient as we'd like, then at least code will tend to be correct in its use of Unicode. And then, if the programmer wants their string processing to be more efficient, they need to actually learn how Unicode works so that they can code for it more efficiently.

 The issue, however, is that it's currently _way_ too easy to use strings completely incorrectly and operate on code units as if they were characters. A _lot_ of programmers will be using string and char[] as if a char were a character, and that's going to create a lot of bugs. Making it harder to operate on a char[] or string as if it were an array of characters would seriously reduce such bugs, and on some level would force people to become better educated about Unicode.

 No, it doesn't completely solve the problem, since then we're operating at the code point level rather than the grapheme level, but it's still a _lot_ better than operating on the code unit level, as is likely to happen now.

 - Jonathan M Davis

That is the whole point of D, IMO. I don't think we should let an ego question dictate a language decision.
Dec 29 2011
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 10:08 PM, Andrei Alexandrescu wrote:
 The only solution is to explain Walter no other programmer in the world codes
 UTF like him. Really. I emulate that sometimes (learned from him) but I see
code
 from hundreds of people day in and day out - it's never like his.

 Once we convince him, he'll be like "ah, I see what you mean. Requiring .rep is
 awesome. Let's do it."

If that ever happens, I owe you a beer. Maybe two! Maybe it's hubris, but I think D nails what a string type should be. I'm extremely reluctant to mess with its success. It strikes the right balance between aesthetics, efficiency and utility. C++11 and C11 appear to have copied it.
Dec 28 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/29/2011 07:53 AM, Walter Bright wrote:
 On 12/28/2011 10:08 PM, Andrei Alexandrescu wrote:
 The only solution is to explain Walter no other programmer in the
 world codes
 UTF like him. Really. I emulate that sometimes (learned from him) but
 I see code
 from hundreds of people day in and day out - it's never like his.

 Once we convince him, he'll be like "ah, I see what you mean.
 Requiring .rep is
 awesome. Let's do it."

If that ever happens, I owe you a beer. Maybe two! Maybe it's hubris, but I think D nails what a string type should be. I'm extremely reluctant to mess with its success. It strikes the right balance between aesthetics, efficiency and utility.

I fully agree. If I had to design an imperative programming language, this is how its strings would work.
 C++11 and C11 appear to have copied it.

Dec 29 2011
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 10:33 PM, Jakob Ovrum wrote:
 I don't think this is a problem you can solve without educating people. They
 will need to know a thing or two about how UTF works to know the performance
 implications of many of the "safe" ways to handle UTF strings. Further, for
much
 use of Unicode strings in D you can't get away with not knowing anything anyway
 because D only abstracts up to code points, not graphemes. Imagine trying to
 explain to the unknowing programmer what is going on when an algorithm function
 broke his grapheme and he doesn't know the first thing about Unicode.

 I'm not claiming to be an expert myself, but I believe D offers Unicode the
 right way as it is.

I think this goes to the point that, eventually, the language is no longer able to hide the realities of the underlying machine. This happens with floating point (they are NOT mathematical real numbers), integers (they overflow), etc.

Keep in mind that D already has a string type where the code points match the characters: dstring
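A quick sketch of that property of dstring:

```d
void main()
{
    dstring d = "weiß"d;     // 'ß' is a single dchar code unit
    assert(d.length == 4);   // one code unit per code point
    assert(d[3] == 'ß');     // indexing reaches the code point directly
}
```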
Dec 28 2011
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/29/11 12:01 AM, Adam D. Ruppe wrote:
 On Thursday, 29 December 2011 at 05:37:00 UTC, Walter Bright wrote:
 I've seen the damage done in C++ with multiple string types. Being
 able to convert from one to the other doesn't help much.

Note that I'm on your side here re strings, but you're underselling the D language too! These conversions are implicit both ways, and completely free. D structs can wrap other D types perfectly well.

Nah, that still breaks a lotta code, because people parameterize on T[], use isSomeString/isSomeChar, etc. Nagonna.

Andrei
Dec 28 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, December 29, 2011 11:32:52 Sean Kelly wrote:
 Don't we already have String-like support with ranges?  I'm not sure I
 understand the point in having special behavior for char arrays.

To avoid common misuse. It's way too easy to misuse the length property on narrow strings. Programmers shouldn't be using the length property on narrow strings unless they know what they're doing, but it's likely the first thing that any programmer will reach for to get the length of a string, because that's how arrays in general work.

If it weren't legal to simply use the length property of a char[], or to directly slice or index it, then those common misuses would be harder to commit. You could still do them via .rep or .raw or whatever we'd call it, but it would no longer be the path of least resistance.

Yes, Phobos may avoid the issue, because for the most part its developers understand the issues, but many programmers who do not understand them will make mistakes in their own code, mistakes that should arguably be harder to make, simply because the wrong way is the path of least resistance and they don't know any better.

- Jonathan M Davis
Dec 29 2011
prev sibling next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 08:18 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 19:00:53 UTC, Andrei Alexandrescu wrote:
 On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string when they
 mean immutable, and the values underneath are not guaranteed to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

 People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice:

 #1: string s;
 #2: immutable(char)[] s;

 sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

 Oh, one more thing - one good thing that could come out of this thread is the abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length, and s.rep[i] instead of s[i], would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.

 Then people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum.

 Andrei

That's a good idea, though I wonder about its implementation strategy. ATM string is simply an alias for a char array; are you suggesting string should be a wrapper struct instead (like the one previously suggested by Steven)? I'm all for making string a properly encapsulated type.

In what way would the proposed change improve encapsulation, and why would it even be desirable for such a basic data structure?
Dec 28 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 08:55 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 19:38:53 UTC, Timon Gehr wrote:
 [snip]
 I'm all for making string a properly encapsulated type.

In what way would the proposed change improve encapsulation, and why would it even be desirable for such a basic data structure?

I'm not sure what you are asking here. Are you asking what the benefits of encapsulation are?

I know the benefits of encapsulation and none of them applies here. The proposed change is nothing but a breaking interface change.
 This topic was discussed to death more than
 once and I'd suggest searching the NG archives for the details. Also, If
 you hadn't already I'd suggest reading about Unicode and its levels of
 abstraction: code point, code units, graphemes, etc...

'char' is a code unit. Therefore that is the level of abstraction the data type char[] provides.
Dec 28 2011
prev sibling next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 08:00 PM, Andrei Alexandrescu wrote:
 On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string when they
 mean immutable, and the values underneath are not guaranteed to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice: #1: string s; #2: immutable(char)[] s; sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.

Why? char and wchar are unicode code units, ubyte/ushort are unsigned integrals. It is clear that char/wchar are a better match.
 Then, people would access the decoding routines on the needed occasions,
 or would consciously use the representation.

 Yum.


 Andrei

Dec 28 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
Apparently my previous post was lost. Apologies if this comes out twice.

On 12/28/2011 09:39 PM, Jonathan M Davis wrote:
 On Wednesday, December 28, 2011 21:25:39 Timon Gehr wrote:
 Why? char and wchar are unicode code units, ubyte/ushort are unsigned
 integrals. It is clear that char/wchar are a better match.

It's an issue of the correct usage being the easy path. As it stands, it's incredibly easy to use narrow strings incorrectly. By forcing any array of char or wchar to use .rep.length instead of .length, the relatively automatic (and generally incorrect) usage of .length on a string wouldn't immediately work. It would force you to work harder to do the wrong thing. Unfortunately, walkLength isn't necessarily any easier than .rep.length, but it does force people to look into why they can't do .length, which will generally better educate them and will hopefully reduce the misuse of narrow strings.

I was educated enough not to make that mistake, because I read the entire language specification before deciding the language was awesome and downloading the compiler. I find it strange that the product should be made less usable because we do not expect users to read the manual. But it is of course a valid point.
 If we make rep ubyte[] and ushort[] for char[] and wchar[] respectively, then
 we reinforce the fact that you shouldn't operate on chars or wchars.

There is nothing wrong with operating at the code unit level. Efficient slicing is very desirable.
 It also
 makes it simple for the compiler to never allow you to use length on char[] or
 wchar[], since it doesn't have to worry about whether you got that char[] or
 wchar[] from a rep property or not.

 Now, I don't know if this is really a good move at this point. If we were to
 really do this right, we'd need to disallow indexing and slicing of the char[]
 and wchar[] as well, which would break that much more code. It also pretty
 quickly makes it look like string should be its own type rather than an array,
 since it's acting less and less like an array.

Exactly. It is acting less and less like an array of code units. But it *is* an array of code units. If the general consensus is that we need a string data type that acts at a different abstraction level by default (with which I'd disagree, but apparently I don't have a popular opinion here), then we need a string type in the standard library to do that. Changing the language so that an array of code units stops behaving like an array of code units is not a solution.
 Not to mention, even the
 correct usage of .rep would become rather irritating (e.g. slicing it when you
 know that the indices that you're dealing with aren't going to cut into any
 code points), because you'd have to cast from ubyte[] to char[] whenever you
 did that.

 So, I think that the general sentiment behind this is a good one, but I don't
 know if the exact idea is ultimately a good one - particularly at this stage
 in the game. If we're going to make a change like this which would break as
 much code as this would, we'd need to be _very_ certain that it's what we want
 to do.

 - Jonathan M Davis

I agree.
Dec 28 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 11:12 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr wrote:
 I was educated enough not to make that mistake, because I read the
 entire language specification before deciding the language was awesome
 and downloading the compiler. I find it strange that the product
 should be made less usable because we do not expect users to read the
 manual. But it is of course a valid point.

That's awfully optimistic to expect people to read the manual.

Well, if the alternative is slowly butchering the language I will be awfully optimistic about it all day long.
 There is nothing wrong with operating at the code unit level.
 Efficient slicing is very desirable.

I agree that it's useful. It is however the incorrect abstraction level when you need a "string" which is by far the common case in user code.

I would not go as far as to call it 'incorrect'.
 i.e. if I need a name variable in a class: codeUnit[] name; // bug!
 string Name; // correct

From a pragmatic viewpoint it does not matter because if string is used like this, then codeUnit[] does exactly the same thing. Nobody forces anyone to index or slice into a string variable when they don't need that functionality. All engineers have to work with leaky abstractions. Why is it such a big deal?
 I expect that most uses of code-unit arrays should be in the standard
 library anyway since it provides the string manipulation routines. It
 all boils down to making the common case trivial and the rare case
 possible.  You can use the underlying data structure (code units) if you
 need it but the default "string" is what people expect when thinking
 about what such a type does (a string of letters). D's already 80% there
 since Phobos already treats strings as bi-directional ranges of
 code-points which is much closer to the mental image of a string of
 letters, so I think this is about bringing the current design to its
 final conclusion.

Well, that mental image is just not the right one when dealing with Unicode.
 Exactly. It is acting less and less like an array of code units. But
 it *is* an array of code units. If the general consensus is that we
 need a string data type that acts at a different abstraction level by
 default (with which I'd disagree, but apparently I don't have a
 popular opinion here), then we need a string type in the standard
 library to do that. Changing the language so that an array of code
 units stops behaving like an array of code units is not a solution.

I agree that we should not break T[] for any T and instead introduce a library type. While I personally believe that such a change will expose hidden bugs (certainly when unaware programmers treat string as ASCII and the product is later on localized), it's a big disturbance in people's code and it's worth considering whether the benefit is worth the costs. Perhaps, some middle ground could be found such that existing code can rely on existing behavior and the new library type will be an opt-in.

What will such a type offer, except that it disallows indexing and slicing?
Dec 28 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/29/2011 07:45 AM, foobar wrote:
 On Wednesday, 28 December 2011 at 22:39:15 UTC, Timon Gehr wrote:
 On 12/28/2011 11:12 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr wrote:
 I was educated enough not to make that mistake, because I read the
 entire language specification before deciding the language was awesome
 and downloading the compiler. I find it strange that the product
 should be made less usable because we do not expect users to read the
 manual. But it is of course a valid point.

That's awfully optimistic to expect people to read the manual.

Well, if the alternative is slowly butchering the language I will be awfully optimistic about it all day long.
 There is nothing wrong with operating at the code unit level.
 Efficient slicing is very desirable.

I agree that it's useful. It is however the incorrect abstraction level when you need a "string" which is by far the common case in user code.

I would not go as far as to call it 'incorrect'.
 i.e. if I need a name variable in a class: codeUnit[] name; // bug!
 string Name; // correct

From a pragmatic viewpoint it does not matter because if string is used like this, then codeUnit[] does exactly the same thing. Nobody forces anyone to index or slice into a string variable when they don't need that functionality. All engineers have to work with leaky abstractions. Why is it such a big deal?
 I expect that most uses of code-unit arrays should be in the standard
 library anyway since it provides the string manipulation routines. It
 all boils down to making the common case trivial and the rare case
 possible. You can use the underlying data structure (code units) if you
 need it but the default "string" is what people expect when thinking
 about what such a type does (a string of letters). D's already 80% there
 since Phobos already treats strings as bi-directional ranges of
 code-points which is much closer to the mental image of a string of
 letters, so I think this is about bringing the current design to its
 final conclusion.

Well, that mental image is just not the right one when dealing with Unicode.
 Exactly. It is acting less and less like an array of code units. But
 it *is* an array of code units. If the general consensus is that we
 need a string data type that acts at a different abstraction level by
 default (with which I'd disagree, but apparently I don't have a
 popular opinion here), then we need a string type in the standard
 library to do that. Changing the language so that an array of code
 units stops behaving like an array of code units is not a solution.

I agree that we should not break T[] for any T and instead introduce a library type. While I personally believe that such a change will expose hidden bugs (certainly when unaware programmers treat string as ASCII and the product is later on localized), it's a big disturbance in people's code and it's worth considering whether the benefit is worth the costs. Perhaps, some middle ground could be found such that existing code can rely on existing behavior and the new library type will be an opt-in.

What will such a type offer, except that it disallows indexing and slicing?

From a pragmatic viewpoint people can also continue programming in C++ instead of investing a lot of effort learning a new language.

I disagree. Pragmatism: "Dealing with things sensibly and realistically in a way that is based on practical rather than theoretical considerations." In practice, programming in D beats the pants off programming in C++.
 The only difference between programming languages is the human interface
 aspect.

No. There is also the aspect of how well it maps to the machine it will run on. An interface always has two sides.
 Anything you can program with D you could also do in assembly
 yet you prefer D because it's more convenient.

I prefer D because it is more productive.
 In that regard, a code-unit array is definitely worse than a string type.

A code-unit array type is a string type, albeit a simple one.
 A programmer can choose to either change his 'naive' mental image or
 change the programming language.  Most will do the latter.

A programmer either does not care how D strings work, or he is happy that they are so simple to work with.
 Computers need to adapt and be human friendly, not vice-versa.

When I meet a computer that adapts itself in order to be human friendly, I'll buy you a cookie.
Dec 29 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, December 28, 2011 21:25:39 Timon Gehr wrote:
 Why? char and wchar are unicode code units, ubyte/ushort are unsigned
 integrals. It is clear that char/wchar are a better match.

It's an issue of the correct usage being the easy path. As it stands, it's incredibly easy to use narrow strings incorrectly. By forcing any array of char or wchar to use .rep.length instead of .length, the relatively automatic (and generally incorrect) usage of .length on a string wouldn't immediately work. It would force you to work harder to do the wrong thing. Unfortunately, walkLength isn't necessarily any easier than .rep.length, but it does force people to look into why they can't do .length, which will generally better educate them and will hopefully reduce the misuse of narrow strings.

If we make rep ubyte[] and ushort[] for char[] and wchar[] respectively, then we reinforce the fact that you shouldn't operate on chars or wchars. It also makes it simple for the compiler to never allow you to use length on char[] or wchar[], since it doesn't have to worry about whether you got that char[] or wchar[] from a rep property or not.

Now, I don't know if this is really a good move at this point. If we were to really do this right, we'd need to disallow indexing and slicing of the char[] and wchar[] as well, which would break that much more code. It also pretty quickly makes it look like string should be its own type rather than an array, since it's acting less and less like an array.

Not to mention, even the correct usage of .rep would become rather irritating (e.g. slicing it when you know that the indices that you're dealing with aren't going to cut into any code points), because you'd have to cast from ubyte[] to char[] whenever you did that.

So, I think that the general sentiment behind this is a good one, but I don't know if the exact idea is ultimately a good one - particularly at this stage in the game. If we're going to make a change like this which would break as much code as this would, we'd need to be _very_ certain that it's what we want to do.

- Jonathan M Davis
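The pitfall described above can be demonstrated with today's Phobos as-is (a small sketch; `walkLength` is the existing std.range primitive that decodes code points):

```d
// .length counts UTF-8 code units; walkLength counts decoded code points.
import std.range : walkLength;

void main()
{
    string s = "über";          // 'ü' occupies 2 UTF-8 code units
    assert(s.length == 5);      // code units: rarely what you mean
    assert(s.walkLength == 4);  // code points: the "character" count
}
```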
Dec 28 2011
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr wrote:
 I was educated enough not to make that mistake, because I read 
 the entire language specification before deciding the language 
 was awesome and downloading the compiler. I find it strange 
 that the product should be made less usable because we do not 
 expect users to read the manual. But it is of course a valid 
 point.

That's awfully optimistic to expect people to read the manual.
 There is nothing wrong with operating at the code unit level. 
 Efficient slicing is very desirable.

I agree that it's useful. It is however the incorrect abstraction level when you need a "string" which is by far the common case in user code. i.e. if I need a name variable in a class: codeUnit[] name; // bug! string Name; // correct I expect that most uses of code-unit arrays should be in the standard library anyway since it provides the string manipulation routines. It all boils down to making the common case trivial and the rare case possible. You can use the underlying data structure (code units) if you need it but the default "string" is what people expect when thinking about what such a type does (a string of letters). D's already 80% there since Phobos already treats strings as bi-directional ranges of code-points which is much closer to the mental image of a string of letters, so I think this is about bringing the current design to its final conclusion.
 Exactly. It is acting less and less like an array of code 
 units. But it *is* an array of code units. If the general 
 consensus is that we need a string data type that acts at a 
 different abstraction level by default (with which I'd 
 disagree, but apparently I don't have a popular opinion here), 
 then we need a string type in the standard library to do that. 
 Changing the language so that an array of code units stops 
 behaving like an array of code units is not a solution.

I agree that we should not break T[] for any T and instead introduce a library type. While I personally believe that such a change will expose hidden bugs (certainly when unaware programmers treat string as ASCII and the product is later on localized), it's a big disturbance in people's code and it's worth a consideration if the benefit worth the costs. Perhaps, some middle ground could be found such that existing code can rely on existing behavior and the new library type will be an opt-in.
Dec 28 2011
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.
 Then, people would access the decoding routines on the needed occasions,
 or would consciously use the representation.

 Robert Jacques:
 Would slicing, i.e. s[i..j] still be valid?

No, only s.rep[i .. j].
 If so, what would be the
 recommended way of finding i and j?

find, findSplit etc. from std.algorithm, std.utf functions etc.

We have discussed this topic some times in past, it's not an easy topic. I agree with the general desires under your ideas Andrei, I suggested something related, time ago. The idea of forbidding s.length, s[i] and s[i..j] for narrow strings seems interesting. (I suggested something different, to keep them but turn them into operations that do the right thing on narrow strings. Some people didn't appreciate the idea because it changes the computational complexity of such operations). But I suggest to step a bit back and look at the situation from a bit more distance, to avoid small patches to D that look like a pirate eyepatch :-) Narrow strings are more memory (and performance) efficient, and sometimes I want to slice them too, and do it correctly (so somestring.rep[i..j] is not enough). So I suggest to give something to perform correct slicing of narrow strings too. Bye, bearophile
Dec 28 2011
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Wednesday, 28 December 2011 at 22:39:15 UTC, Timon Gehr wrote:
 On 12/28/2011 11:12 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr 
 wrote:
 I was educated enough not to make that mistake, because I 
 read the
 entire language specification before deciding the language 
 was awesome
 and downloading the compiler. I find it strange that the 
 product
 should be made less usable because we do not expect users to 
 read the
 manual. But it is of course a valid point.

That's awfully optimistic to expect people to read the manual.

Well, if the alternative is slowly butchering the language I will be awfully optimistic about it all day long.
 There is nothing wrong with operating at the code unit level.
 Efficient slicing is very desirable.

I agree that it's useful. It is however the incorrect abstraction level when you need a "string" which is by far the common case in user code.

I would not go as far as to call it 'incorrect'.
 i.e. if I need a name variable in a class: codeUnit[] name; // 
 bug!
 string Name; // correct

From a pragmatic viewpoint it does not matter because if string is used like this, then codeUnit[] does exactly the same thing. Nobody forces anyone to index or slice into a string variable when they don't need that functionality. All engineers have to work with leaky abstractions. Why is it such a big deal?
 I expect that most uses of code-unit arrays should be in the 
 standard
 library anyway since it provides the string manipulation 
 routines. It
 all boils down to making the common case trivial and the rare 
 case
 possible.  You can use the underlying data structure (code 
 units) if you
 need it but the default "string" is what people expect when 
 thinking
 about what such a type does (a string of letters). D's already 
 80% there
 since Phobos already treats strings as bi-directional ranges of
 code-points which is much closer to the mental image of a 
 string of
 letters, so I think this is about bringing the current design 
 to its
 final conclusion.

Well, that mental image is just not the right one when dealing with Unicode.
 Exactly. It is acting less and less like an array of code 
 units. But
 it *is* an array of code units. If the general consensus is 
 that we
 need a string data type that acts at a different abstraction 
 level by
 default (with which I'd disagree, but apparently I don't have 
 a
 popular opinion here), then we need a string type in the 
 standard
 library to do that. Changing the language so that an array of 
 code
 units stops behaving like an array of code units is not a 
 solution.

I agree that we should not break T[] for any T and instead introduce a library type. While I personally believe that such a change will expose hidden bugs (certainly when unaware programmers treat string as ASCII and the product is later on localized), it's a big disturbance in people's code and it's worth a consideration if the benefit worth the costs. Perhaps, some middle ground could be found such that existing code can rely on existing behavior and the new library type will be an opt-in.

What will such a type offer, except that it disallows indexing and slicing?

From a pragmatic viewpoint people can also continue programming in C++ instead of investing a lot of effort learning a new language. The only difference between programming languages is the human interface aspect. Anything you can program with D you could also do in assembly, yet you prefer D because it's more convenient. In that regard, a code-unit array is definitely worse than a string type. A programmer can choose to either change his 'naive' mental image or change the programming language. Most will do the latter. Computers need to adapt and be human friendly, not vice-versa.
Dec 28 2011
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/29/2011 12:12 AM, Gor Gyolchanyan wrote:
 This is a great idea! In this case the default string will be a
 random-access range, not a bidirectional range. Also, processing
 dstring is faster than string, because no encoding needs to be done.
 Processing power is more expensive than memory. utf-8 is valuable
 only for passing it as an ASCII string (which is not too common) and for
 storing large chunks of it. Both these cases are much less common than
 all the rest of string processing.

dstring consumes 4x the memory, and this can easily cause perf degradations due to thrashing and poor cache locality.
Dec 29 2011
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/29/11 2:04 AM, Vladimir Panteleev wrote:
 I think it would be simpler to just make dstring the default string type.

 dstring is simple and safe. People who want better memory usage can use
 UTF-8 at their own discretion.

memory == time Andrei
Dec 29 2011
prev sibling parent reply Don <nospam nospam.com> writes:
On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.
 Then, people would access the decoding routines on the needed occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

If you change s[i] into s.rep[i], it does the same thing as now. There's no loss of functionality -- it just stops you from accidentally doing the wrong thing. Like .ptr for getting the address of an array. Typically all the ".rep" everywhere would get annoying, so you would write:

ubyte[] u = s.rep;

and use u from then on.

I don't like the name 'rep'. Maybe 'raw' or 'utf'? Apart from that, I think this would be perfect.
Dec 29 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.
 Then, people would access the decoding routines on the needed occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now. There's
 no loss of functionality -- it's just stops you from accidentally doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen... Andrei
Dec 29 2011
parent reply Joshua Reusch <yoschi arkandos.de> writes:
Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.
 Then, people would access the decoding routines on the needed occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now. There's
 no loss of functionality -- it's just stops you from accidentally doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

Maybe it could happen if we

1. make dstring the default string type -- code units and characters would be the same, or
2. forward string.length to std.utf.count and opIndex to std.utf.toUTFindex, so programmers could use slices/indexing/length (no laziness problems), and if they really want code units use .raw/.rep (or better, .utf8/16/32 with std.string.representation/std.utf.toUTF8/16/32).

But generally I liked the idea of just having an alias for strings...
 Andrei

-- Joshua Reusch
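For reference, option 2 above can already be approximated with the existing std.utf.count, which reports the code-point length such a forwarded .length would return (a sketch against the current Phobos names):

```d
// std.utf.count walks the string and counts decoded code points,
// whereas the built-in .length reports code units.
import std.utf : count;

void main()
{
    string s = "naïve";      // 'ï' occupies 2 UTF-8 code units
    assert(s.length == 6);   // current .length: code units
    assert(s.count == 5);    // what a forwarded .length would report
}
```

Note that, as Timon points out elsewhere in the thread, this forwarding would silently turn an O(1) property into an O(n) walk.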
Dec 30 2011
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not
 char/wchar.
 Then, people would access the decoding routines on the needed
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now. There's
 no loss of functionality -- it's just stops you from accidentally doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

Maybe it could happen if we 1. make dstring the default strings type --

Inefficient.
 code units and characters would be the same

Wrong.
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).
 so programmers could use the slices/indexing/length (no laziness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?

1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.

2. They get screwed-up string output, look for the reason, patch up their code with some functions from std.utf and will never make the same mistakes again.

I have *never* seen a user in D.learn complain about it. There might have been some I missed, but it is certainly not a prevalent problem. Also, just because a user can type .rep does not mean he understands Unicode: he is able to make just the same mistakes as before, even more so, as the array he is getting back has the _wrong element type_.
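The two cases above, side by side, as a small sketch (plain D, no assumed APIs):

```d
void main()
{
    // Case 1: pure ASCII -- units and characters coincide, nothing breaks.
    string ascii = "hello";
    assert(ascii.length == 5);

    // Case 2: non-ASCII input -- .length silently stops meaning "characters".
    string greek = "αβγ";        // 3 characters, 6 UTF-8 code units
    assert(greek.length == 6);   // the hidden surprise

    // The decoding foreach is the easy, correct fix once the user notices.
    size_t n;
    foreach (dchar c; greek)
        ++n;
    assert(n == 3);
}
```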
Dec 30 2011
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/30/11 1:55 PM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

What we have now is adequate. The scheme I proposed is optimal. I agree with all of your other remarks. Andrei
Dec 30 2011
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 30/12/2011 20:55, Timon Gehr a écrit :
 On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length
 and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not
 char/wchar.
 Then, people would access the decoding routines on the needed
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now.
 There's
 no loss of functionality -- it's just stops you from accidentally doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

Maybe it could happen if we 1. make dstring the default strings type --

Inefficient.
 code units and characters would be the same

Wrong.
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).
 so programmers could use the slices/indexing/length (no lazyness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?

Atos Origin was hacked because of bad handling of Unicode strings in some of their software. The consequences can be more important than you may think.

Additionally, you make an assumption that is really wrong: that an educated programmer will not make mistakes. C programmers will tell you exactly the same thing when the discussion comes to pointers. But the fact is, we all make mistakes. Many of them! We should resort to unsafe behaviour that relies on programmer capabilities only when needed.

I do understand pointers. I still make mistakes with them, and they sometimes have crazy consequences. And I do not trust anyone who tells me he/she doesn't.

The #1 quality of a programmer is to act like he/she is a moron. Because sometimes we all are morons.
Dec 30 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/30/2011 10:36 PM, deadalnix wrote:
 Le 30/12/2011 20:55, Timon Gehr a écrit :
 On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this
 thread
 is abolition (through however slow a deprecation path) of s.length
 and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length
 and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not
 char/wchar.
 Then, people would access the decoding routines on the needed
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now.
 There's
 no loss of functionality -- it's just stops you from accidentally
 doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

Maybe it could happen if we 1. make dstring the default strings type --

Inefficient.
 code units and characters would be the same

Wrong.
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).
 so programmers could use the slices/indexing/length (no lazyness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?

ATOS origin was hacked because of bad management of unicode in string in some of their software.

And cast(string)s.rep[i..j] would magically fix all those bugs?
 Consequences can be more importants than you may think.

 Additionnaly, you make an asumption that is realy wrong : an educated
 programmer will not make mistake.

I am not. I am just assuming that the proposed change does not help with that.
 C programmers will just tell you
 excactly the same thing is the discution comes to pointers. But the fact
 is, we all do mistakes. Many of them ! We should go into unsafe
 behaviour, that rely on programmer capabilities only when needed.

 I do understand pointers. I do make mistake with them and it does have
 crazy consequences sometime. And I do not trust anyone that say me
 he/she doesn't.

 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.
Dec 30 2011
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 12/30/2011 05:27 PM, Timon Gehr wrote:
 On 12/30/2011 10:36 PM, deadalnix wrote:
 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.

Tsk tsk. Missing the point. I believe what deadalnix is trying to say is this: programmers should try to write correct code, but should never trust themselves to write correct code.

Programs worth writing are complex enough that there is no way any of us can write perfectly correct code on the first draft. There is always going to be some polishing, maybe even /a lot/ of polishing, and perhaps some complete tear-downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test, right?) then I will have a hard time taking you seriously.

That said, it is extremely pleasant to have a language that catches you when you inevitably fall.
Dec 31 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 06:32 PM, Chad J wrote:
 On 12/30/2011 05:27 PM, Timon Gehr wrote:
 On 12/30/2011 10:36 PM, deadalnix wrote:
 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.

Tsk tsk. Missing the point.

Not at all. And I don't take anyone seriously who feels the need to 'Tsk tsk' btw.
 I believe what deadalnix is trying to say is this:
 Programmers should try to write correct code, but should never trust
 themselves to write correct code.

No, programmers should write correct code and then test it thoroughly. 'Trying to' is the wrong way to go about anything. And there is no need to distrust oneself. Anyway, I have a _very hard time_ translating 'acting like a moron' to 'writing correct code'.
 ...

 Programs worth writing are complex enough that there is no way any of us
 can write them perfectly correct code on first draft.  There is always
 going to be some polishing, and maybe even /a lot/ of polishing, and
 perhaps some complete tear downs and rebuilds from time to time.  "Build
 one to throw away; you will anyways."  If you tell me that you can
 always write correct code the first time and you never need to go back
 and fix anything when you do testing (you do test right?) then I will
 have a hard time taking you seriously.

Testing is the main part of my development. Furthermore, I use assertions all over the place.
 That said, it is extremely pleasant to have a language that catches you
 when you inevitably fall.

That is why I also like Haskell.
Dec 31 2011
next sibling parent Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 12/31/2011 01:13 PM, Timon Gehr wrote:
 On 12/31/2011 06:32 PM, Chad J wrote:
 On 12/30/2011 05:27 PM, Timon Gehr wrote:
 On 12/30/2011 10:36 PM, deadalnix wrote:
 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.

Tsk tsk. Missing the point.

Not at all. And I don't take anyone seriously who feels the need to 'Tsk tsk' btw.

Well, you've certainly a right to it. I just take it a little rough when it seems like someone's words are being intentionally misread.
 I believe what deadalnix is trying to say is this:
 Programmers should try to write correct code, but should never trust
 themselves to write correct code.

No, programmers should write correct code and then test it thoroughly. 'Trying to' is the wrong way to go about anything. And there is no need to distrust oneself.

There's a perfect reason to distrust oneself: oneself is a squishy meatbag that makes mistakes. Repeated "trying" with rigor applied will lead to success.
 Anyway, I have a _very hard time_ translating 'acting like a moron' to
 'writing correct code'.
 

I'm pretty sure it's suggestive. If an intelligent or careful person acts like a moron, then they will be forced to assume that they will make mistakes, and therefore take measures to ensure that ALL mistakes are caught and fixed or mitigated. That is how you get from 'acting like a moron' to 'writing correct code'.
 ...

 Programs worth writing are complex enough that there is no way any of us
 can write them perfectly correct code on first draft.  There is always
 going to be some polishing, and maybe even /a lot/ of polishing, and
 perhaps some complete tear downs and rebuilds from time to time.  "Build
 one to throw away; you will anyways."  If you tell me that you can
 always write correct code the first time and you never need to go back
 and fix anything when you do testing (you do test right?) then I will
 have a hard time taking you seriously.

Testing is the main part of my development. Furthermore, I use assertions all over the place.
 That said, it is extremely pleasant to have a language that catches you
 when you inevitably fall.

That is why I also like Haskell.

I hear ya. I feel Haskell is an important language to understand, if not know how to use effectively. I wish I knew how to use it better than I do, but I haven't had too many projects that are amenable to it.
Dec 31 2011
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 31/12/2011 19:13, Timon Gehr a écrit :
 On 12/31/2011 06:32 PM, Chad J wrote:
 On 12/30/2011 05:27 PM, Timon Gehr wrote:
 On 12/30/2011 10:36 PM, deadalnix wrote:
 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.

Programs worth writing are complex enough that there is no way any of us can write them perfectly correct code on first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test right?) then I will have a hard time taking you seriously.

Testing is the main part of my development. Furthermore, I use assertions all over the place.

Well, if you write correct code, you don't need assertions. They will always be true, because your code is correct. Stop wasting your time with that. Remember the #1 quality of a programmer: write correct code. See how stupid this becomes?
Jan 01 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/01/2012 11:36 PM, deadalnix wrote:
 Le 31/12/2011 19:13, Timon Gehr a écrit :
 On 12/31/2011 06:32 PM, Chad J wrote:
 On 12/30/2011 05:27 PM, Timon Gehr wrote:
 On 12/30/2011 10:36 PM, deadalnix wrote:
 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.

Programs worth writing are complex enough that there is no way any of us can write them perfectly correct code on first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test right?) then I will have a hard time taking you seriously.

Testing is the main part of my development. Furthermore, I use assertions all over the place.

Well, if you write correct code, you don't need assertion. They will always be true because your code is correct. Stop wasting your time with that. Remeber the #1 quality of a programmer : write correct code. See how stupid this becomes ?

You miss the point. Testing and assertions are part of how I write correct code.
Jan 01 2012
parent reply deadalnix <deadalnix gmail.com> writes:
Le 01/01/2012 23:46, Timon Gehr a écrit :
 On 01/01/2012 11:36 PM, deadalnix wrote:
 Le 31/12/2011 19:13, Timon Gehr a écrit :
 On 12/31/2011 06:32 PM, Chad J wrote:
 On 12/30/2011 05:27 PM, Timon Gehr wrote:
 On 12/30/2011 10:36 PM, deadalnix wrote:
 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.

Programs worth writing are complex enough that there is no way any of us can write them perfectly correct code on first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test right?) then I will have a hard time taking you seriously.

Testing is the main part of my development. Furthermore, I use assertions all over the place.

Well, if you write correct code, you don't need assertion. They will always be true because your code is correct. Stop wasting your time with that. Remeber the #1 quality of a programmer : write correct code. See how stupid this becomes ?

You miss the point. Testing and assertions are part of how I write correct code.

So, to write correct code, you need to assume you'll write incorrect code. Writing correct code is your goal. Assuming you'll do stupid stuff is a quality required to advance toward this goal. And by saying that you test and assert a lot, you confirm that point.
Jan 04 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/04/2012 07:08 PM, deadalnix wrote:
 Le 01/01/2012 23:46, Timon Gehr a écrit :
 On 01/01/2012 11:36 PM, deadalnix wrote:
 Le 31/12/2011 19:13, Timon Gehr a écrit :
 On 12/31/2011 06:32 PM, Chad J wrote:
 On 12/30/2011 05:27 PM, Timon Gehr wrote:
 On 12/30/2011 10:36 PM, deadalnix wrote:
 The #1 quality of a programmer is to act like he/she is a morron.
 Because sometime we all are morrons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.

Programs worth writing are complex enough that there is no way any of us can write them perfectly correct code on first draft. There is always going to be some polishing, and maybe even /a lot/ of polishing, and perhaps some complete tear downs and rebuilds from time to time. "Build one to throw away; you will anyways." If you tell me that you can always write correct code the first time and you never need to go back and fix anything when you do testing (you do test right?) then I will have a hard time taking you seriously.

Testing is the main part of my development. Furthermore, I use assertions all over the place.

Well, if you write correct code, you don't need assertion. They will always be true because your code is correct. Stop wasting your time with that. Remeber the #1 quality of a programmer : write correct code. See how stupid this becomes ?

You miss the point. Testing and assertions are part of how I write correct code.

So, to write correct code, you need to asume you'll write incorrect code. Writing correct code is your goal. Asuming you'll do stupid stuff is a quality required to advance toward this goal.

You are free to believe whatever you want, but I think that strategy you are describing is a recipe for writing buggy code.
 And, saying that you test and assert a lot,

Code for which no tests exist is neither correct nor incorrect. Assertions are a neat way to detect parts of the application whose implementation is incomplete.
 you confirm that point.

No.
Jan 04 2012
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 01/04/2012 11:31 PM, Timon Gehr wrote:
 Code for which no tests exist is neither correct nor incorrect.
 Assertions are a neat way to detect parts of the application whose
 implementation is incomplete.

Another major use of them is the checked documentation of assumptions, mainly in method preconditions.
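[Editor's note: a sketch of the precondition use described above, in D's contract syntax of the era (in/out/body blocks); sqrtFloor is a hypothetical example function, not from the thread.]

```d
// An assertion as checked documentation of an assumption: the `in` block
// documents and enforces the precondition, the `out` block the postcondition.
import std.stdio;

uint sqrtFloor(uint n)
in { assert(n < (1u << 30), "argument out of supported range"); }
out (r) { assert(r * r <= n && n < (r + 1) * (r + 1)); }
body
{
    uint r = 0;
    while ((r + 1) * (r + 1) <= n)
        ++r;
    return r;
}

void main()
{
    writeln(sqrtFloor(10)); // prints 3
}
```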
Jan 04 2012
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/30/2011 11:55 AM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.
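[Editor's note: a minimal sketch of why plain code-unit indexing is safe here. In UTF-8, every byte of a multi-byte sequence is >= 0x80, so an ASCII byte such as '/' can never occur inside the encoding of a non-ASCII character; the search string below is just an illustrative example.]

```d
// Naive byte-wise search for an ASCII character works on UTF-8
// without any decoding.
import std.stdio;

size_t findSlash(const(char)[] s)
{
    foreach (i; 0 .. s.length)   // plain code-unit indexing
        if (s[i] == '/')
            return i;
    return s.length;
}

void main()
{
    // Three 3-byte characters precede the slash, so it sits at index 9.
    writeln(findSlash("日本語/path")); // prints 9
}
```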
Dec 30 2011
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/30/2011 11:01 PM, Walter Bright wrote:
 On 12/30/2011 11:55 AM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8.

You are right, that obviously needs fixing. ☺ Thanks!
 That's not true for any other multibyte encoding, which is why UTF-8 is
 inspired genius.

Dec 30 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/30/11 4:01 PM, Walter Bright wrote:
 On 12/30/2011 11:55 AM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.

It's true for any encoding with the prefix property, such as Huffman. Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked. So yeah, UTF-8 is great. But it is not miraculous. We need .raw. Andrei
Dec 30 2011
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 12:00 AM, Andrei Alexandrescu wrote:
 On 12/30/11 4:01 PM, Walter Bright wrote:
 On 12/30/2011 11:55 AM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.

It's true for any encoding with the prefix property, such as Huffman. Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked. So yeah, UTF-8 is great. But it is not miraculous. We need .raw. Andrei

auto raw(S)(S s) if(isNarrowString!S){
    static if(is(S==string)) return cast(ubyte[])s;
    else static if(is(S==wstring)) return cast(ushort[])s;
}
Dec 30 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/30/11 5:07 PM, Timon Gehr wrote:
 On 12/31/2011 12:00 AM, Andrei Alexandrescu wrote:
 On 12/30/11 4:01 PM, Walter Bright wrote:
 On 12/30/2011 11:55 AM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.

It's true for any encoding with the prefix property, such as Huffman. Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked. So yeah, UTF-8 is great. But it is not miraculous. We need .raw. Andrei

auto raw(S)(S s) if(isNarrowString!S){
    static if(is(S==string)) return cast(ubyte[])s;
    else static if(is(S==wstring)) return cast(ushort[])s;
}

Almost there. https://github.com/D-Programming-Language/phobos/blob/master/std/string.d#L809 Andrei
Dec 30 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 01:03 AM, Andrei Alexandrescu wrote:
 On 12/30/11 5:07 PM, Timon Gehr wrote:
 On 12/31/2011 12:00 AM, Andrei Alexandrescu wrote:
 On 12/30/11 4:01 PM, Walter Bright wrote:
 On 12/30/2011 11:55 AM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.

It's true for any encoding with the prefix property, such as Huffman. Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked. So yeah, UTF-8 is great. But it is not miraculous. We need .raw. Andrei

auto raw(S)(S s) if(isNarrowString!S){
    static if(is(S==string)) return cast(ubyte[])s;
    else static if(is(S==wstring)) return cast(ushort[])s;
}

Almost there. https://github.com/D-Programming-Language/phobos/blob/master/std/string.d#L809 Andrei

alias std.string.representation raw;
Dec 30 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/30/11 6:07 PM, Timon Gehr wrote:
 alias std.string.representation raw;

I meant your implementation is incomplete. But the main point is that the presence of representation/raw is not the issue. The availability of good-for-nothing .length and operator[] is the issue. Putting in place the convention of using .raw is hardly useful within the context. Andrei
Dec 30 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
 On 12/30/11 6:07 PM, Timon Gehr wrote:
 alias std.string.representation raw;

I meant your implementation is incomplete.

It was more a sketch than an implementation. It is not even type safe :o).
 But the main point is that presence of representation/raw is not the
 issue.
 The availability of good-for-nothing .length and operator[] are
 the issue. Putting in place the convention of using .raw is hardly
 useful within the context.

D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but a UTF-8 code unit. Therefore char[] is an array of code units. length gives the number of code units. operator[i] gives the i-th code unit. Nothing wrong or good-for-nothing about that.

.raw would return ubyte[], therefore it would lose all type information. Effectively, what .raw does is a type cast that will let code point data alias with integral data. Consider:

void foo(ubyte[] b)in{assert(b.length);}body{
    b[0]=2; // perfectly fine
}

void main(){
    char[] s = "☺".dup;
    auto b = s.raw;
    foo(b);
    writeln(s); // oops...
}

I fail to understand why that is desirable.
Dec 30 2011
parent reply Don <nospam nospam.com> writes:
On 31.12.2011 01:56, Timon Gehr wrote:
 On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
 On 12/30/11 6:07 PM, Timon Gehr wrote:
 alias std.string.representation raw;

I meant your implementation is incomplete.

It was more a sketch than an implementation. It is not even type safe :o).
 But the main point is that presence of representation/raw is not the
 issue.
 The availability of good-for-nothing .length and operator[] are
 the issue. Putting in place the convention of using .raw is hardly
 useful within the context.

D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but an utf-8 code unit. Therefore char[] is an array of code units.

No, it isn't. That's the problem. char[] is not an array of char. It has an additional invariant: it is a UTF8 string. If you randomly change elements, the invariant is violated. In reality, char[] and wchar[] are compressed forms of dstring.
 .raw would return ubyte[], therefore it
 would lose all type information. Effectively, what .raw does is a type
 cast that will let code point data alias with integral data.

Exactly. It's just a "I know what I'm doing" signal.
Dec 31 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 01:15 PM, Don wrote:
 On 31.12.2011 01:56, Timon Gehr wrote:
 On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
 On 12/30/11 6:07 PM, Timon Gehr wrote:
 alias std.string.representation raw;

I meant your implementation is incomplete.

It was more a sketch than an implementation. It is not even type safe :o).
 But the main point is that presence of representation/raw is not the
 issue.
 The availability of good-for-nothing .length and operator[] are
 the issue. Putting in place the convention of using .raw is hardly
 useful within the context.

D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but an utf-8 code unit. Therefore char[] is an array of code units.

No, it isn't. That's the problem. char[] is not an array of char. It has an additional invariant: it is a UTF8 string. If you randomly change elements, the invariant is violated.

char[] is an array of char and the additional invariant is not enforced by the language.
 In reality, char[] and wchar[] are compressed forms of dstring.

 .raw would return ubyte[], therefore it
 would lose all type information. Effectively, what .raw does is a type
 cast that will let code point data alias with integral data.

Exactly. It's just a "I know what I'm doing" signal.

No, it is a "I don't know what I'm doing" signal: ubyte[] does not carry any sign of an additional invariant, and the aliasing can be used to break the invariant that is commonly assumed for char[]. That was my point.
Dec 31 2011
parent reply Don <nospam nospam.com> writes:
On 31.12.2011 17:13, Timon Gehr wrote:
 On 12/31/2011 01:15 PM, Don wrote:
 On 31.12.2011 01:56, Timon Gehr wrote:
 On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
 On 12/30/11 6:07 PM, Timon Gehr wrote:
 alias std.string.representation raw;

I meant your implementation is incomplete.

It was more a sketch than an implementation. It is not even type safe :o).
 But the main point is that presence of representation/raw is not the
 issue.
 The availability of good-for-nothing .length and operator[] are
 the issue. Putting in place the convention of using .raw is hardly
 useful within the context.

D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but an utf-8 code unit. Therefore char[] is an array of code units.

No, it isn't. That's the problem. char[] is not an array of char. It has an additional invariant: it is a UTF8 string. If you randomly change elements, the invariant is violated.

char[] is an array of char and the additional invariant is not enforced by the language.

No, it isn't an ordinary array. Take concatenation, for example: char[] ~ int will never create an invalid string. You can end up with multiple chars being appended, even from a single append. foreach is different, too. They are a bit magical. There's quite a lot of code in the compiler to make sure that strings remain valid.

The additional invariant is not enforced in the case of slicing; that's the point.
Dec 31 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 01/01/2012 08:10 AM, Don wrote:
 On 31.12.2011 17:13, Timon Gehr wrote:
 On 12/31/2011 01:15 PM, Don wrote:
 On 31.12.2011 01:56, Timon Gehr wrote:
 On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
 On 12/30/11 6:07 PM, Timon Gehr wrote:
 alias std.string.representation raw;

I meant your implementation is incomplete.

It was more a sketch than an implementation. It is not even type safe :o).
 But the main point is that presence of representation/raw is not the
 issue.
 The availability of good-for-nothing .length and operator[] are
 the issue. Putting in place the convention of using .raw is hardly
 useful within the context.

D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that e.g. char is not a character but a UTF-8 code unit. Therefore char[] is an array of code units.

No, it isn't. That's the problem. char[] is not an array of char. It has an additional invariant: it is a UTF8 string. If you randomly change elements, the invariant is violated.

char[] is an array of char and the additional invariant is not enforced by the language.

No, it isn't an ordinary array. For example with concatenation. char[] ~ int will never create an invalid string.

Yes it will.

import std.stdio;

void main()
{
    char[] x;
    writeln(x ~ 255);
}
 You can end up with multiple chars being appended, even from a single append.
foreach is different,
 too. They are a bit magical.

Fair enough, but type conversion rules are a bit magical in general.

void main()
{
    auto a = cast(short[])[1, 2, 3];
    auto b = [1, 2, 3];
    auto c = cast(short[])b;
    assert(a != c);
}
 There's quite a lot of code in the compiler to make sure that strings
 remain valid.

At the same time, there are many language features that allow creating invalid strings:

auto a = "\377\252\314";
auto b = x"FF AA CC";
auto c = import("binary");
 The additional invariant is not enforced in the case of slicing; that's
 the point.

Jan 01 2012
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 12/30/2011 3:00 PM, Andrei Alexandrescu wrote:
 On 12/30/11 4:01 PM, Walter Bright wrote:
 On 12/30/2011 11:55 AM, Timon Gehr wrote:
 Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8. That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.

It's true for any encoding with the prefix property, such as Huffman.

Any other multibyte character encoding I've seen standardized for use in C.
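The self-synchronization Walter relies on can be shown with a naive byte scan (a sketch; the string contents are made up):

```d
void main()
{
    string s = "10\u20AC and $5"; // '€' occupies three code units
    size_t i = 0;
    while (i < s.length && s[i] != '$')
        ++i;                      // naive [i] indexing, no stride logic
    // No byte inside a multibyte UTF-8 sequence is < 0x80,
    // so the ASCII '$' can only match the real '$'.
    assert(s[i] == '$');
    assert(s[i + 1] == '5');
}
```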
Dec 30 2011
prev sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-12-30 23:00:49 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 Using .raw is /optimal/ because it states the assumption appropriately. 
 The user knows '$' cannot be in the prefix of any other symbol, so she 
 can state the byte alone is the character. If that were a non-ASCII 
 character, the assumption wouldn't have worked.
 
 So yeah, UTF-8 is great. But it is not miraculous. We need .raw.

After reading most of the thread, it seems to me like you're deconstructing strings as arrays one piece at a time, to the point where instead of arrays we'd basically get a string struct and do things on it. Maybe it's part of a grand scheme, more likely it's one realization after another leading to one change after another… let's see where all this will lead us:

0. in the beginning, strings were char[] arrays
1. arrays are generalized as ranges
2. phobos starts treating char arrays as bidirectional ranges of dchar (instead of random access ranges of char)
3. foreach on char[] should iterate over dchar by default
4. remove .length, random access, and slicing from char arrays
5. replace char[] with a struct { ubyte[] raw; }

Number 1 is great by itself, no debate there. Number 2 is debatable. Numbers 3 and 4 are somewhat required for consistency with number 2. Number 5 is just the logical conclusion of all these changes.

If we want a fundamental change to what strings are in D, perhaps we should start focusing on the broader issue instead of trying to pass piecemeal changes one after the other. For consistency's sake, I think we should either stop after 1 or go all the way to 5. Either we do it fully or we don't do it at all. All those divergent interpretations of strings end up hurting the language. Walter and Andrei ought to find a way to agree with each other.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Dec 30 2011
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 04:30 AM, Jonathan M Davis wrote:
 On Friday, December 30, 2011 20:55:42 Timon Gehr wrote:
 1. They don't notice. Then it is not a problem, because they are
 obviously only using ASCII characters and it is perfectly reasonable to
 assume that code units and characters are the same thing.

The problem is that what's more likely to happen in a lot of cases is that they use it wrong and don't notice, because they're only using ASCII in testing, _but_ they have bugs all over the place, because their code is actually used with unicode in the field.

Then that is the fault of the guy who created the tests. At least that guy should be familiar with the issues; otherwise he is in the wrong position. Software should never be released without thorough testing.
 Yes, diligent programmers will generally find such problems, but with the
 current scheme, it's _so_ easy to use length when you shouldn't, that it's
 pretty much a guarantee that it's going to happen. I'm not sure that Andrei's
 suggestion is the best one at this point, but I sure wouldn't be against it
 being introduced. It wouldn't entirely fix the problem by any means, but
 programmers would then have to work harder at screwing it up and so there
 would be fewer mistakes.

Programmers would then also have to work harder at doing it right and at memorizing special cases, so there is absolutely no net gain.
 Arguably, the first issue with D strings is that we have char. In most
 languages, char is supposed to be a character, so many programmers will code
 with that expectation. If we had something like utf8unit, utf16unit, and
 utf32unit (arguably very bad, albeit descriptive, names) and no char, then it
 would force programmers to become semi-educated about the issues. There's no
 way that that's changing at this point though.

 - Jonathan M Davis

A programmer has to have basic knowledge of the language he is programming in. That includes knowing the meaning of all basic types. If he fails at that, testing should definitely catch that kind of trivial bug.
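The basic fact in question is small (a sketch; std.range.walkLength counts elements of the decoding range, i.e. code points):

```d
import std.range : walkLength;

void main()
{
    string s = "h\u00E9llo";    // 5 characters, 6 code units
    assert(s.length == 6);      // .length counts UTF-8 code units
    assert(s.walkLength == 5);  // walking the range counts code points
}
```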
Dec 30 2011
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/30/2011 7:30 PM, Jonathan M Davis wrote:
 Yes, diligent programmers will generally find such problems, but with the
 current scheme, it's _so_ easy to use length when you shouldn't, that it's
 pretty much a guarantee that it's going to happen.

I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional.

This was definitely not true of older multibyte schemes, like Shift-JIS (shudder), but those schemes ought to be terminated with extreme prejudice. But it definitely will take a long time to live down the bugs and miasma of code that had to deal with them. C and C++ still live with that because of their agenda of backwards compatibility. They still support EBCDIC, after all, which was obsolete even in the 70's. And I still see posts on comp.lang.c++.moderated that say "you shouldn't write string code like that, because it won't work on EBCDIC!" Sheesh!
Dec 30 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/30/11 10:09 PM, Walter Bright wrote:
 On 12/30/2011 7:30 PM, Jonathan M Davis wrote:
 Yes, diligent programmers will generally find such problems, but with the
 current scheme, it's _so_ easy to use length when you shouldn't, that
 it's
 pretty much a guarantee that it's going to happen.

I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional.

The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF16/UCS-2 morass: in a vast majority of the time the programmer may consider UTF16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well that all didn't go over all that well. We need .raw and we must abolish .length and [] for narrow strings. Andrei
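The .raw under discussion is essentially what std.string.representation already provides (a sketch of its behavior):

```d
import std.string : representation;

void main()
{
    string s = "h\u00E9llo";
    immutable(ubyte)[] r = s.representation; // code units, no decoding
    assert(r.length == 6);
    assert(r[1] == 0xC3 && r[2] == 0xA9);    // the two units of 'é'
}
```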
Dec 30 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:
 On 12/30/11 10:09 PM, Walter Bright wrote:
 I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8
 correctly, but it turned out that the naive version that used [i] and
 .length worked correctly. This is typical, not exceptional.

The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF16/UCS-2 morass: in a vast majority of the time the programmer may consider UTF16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well that all didn't go over all that well.

I'm not so sure it's quite the same. Java was designed before there were surrogate pairs, they kinda got the rug pulled out from under them. So, they simply have no decent way to deal with it. There isn't even a notion of a dchar character type. Java was designed with codeunit==codepoint, it is embedded in the design of the language, library, and culture. This is not true of D. It's designed from the ground up to deal properly with UTF. D has very simple language features to deal with it.
 We need .raw and we must abolish .length and [] for narrow strings.

I don't believe that fixes anything and breaks every D project out there. We're chasing phantoms here, and I worry a lot about over-engineering trivia. And, we already have a type to deal with it: dstring
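For reference, both the dstring route and foreach with a dchar loop variable give one element per code point (a minimal sketch):

```d
void main()
{
    string  s = "h\u00E9llo";
    dstring d = "h\u00E9llo"d;
    assert(s.length == 6); // UTF-8 code units
    assert(d.length == 5); // UTF-32: one code unit per code point
    size_t n = 0;
    foreach (dchar c; s)   // the language decodes UTF-8 on the fly
        ++n;
    assert(n == 5);
}
```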
Dec 31 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/31/11 2:04 AM, Walter Bright wrote:
 On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:
 On 12/30/11 10:09 PM, Walter Bright wrote:
 I'm not so sure about that. Timon Gehr's X macro tried to handle
 UTF-8 correctly, but it turned out that the naive version that
 used [i] and .length worked correctly. This is typical, not
 exceptional.

The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF16/UCS-2 morass: in a vast majority of the time the programmer may consider UTF16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well that all didn't go over all that well.

I'm not so sure it's quite the same. Java was designed before there were surrogate pairs, they kinda got the rug pulled out from under them. So, they simply have no decent way to deal with it. There isn't even a notion of a dchar character type. Java was designed with codeunit==codepoint, it is embedded in the design of the language, library, and culture. This is not true of D. It's designed from the ground up to deal properly with UTF.

I disagree. It is designed to make dealing with UTF possible.
 D has very simple language features to deal with
 it.

Disagree. I mean simple they are, no contest. They could and should be much better, make correct code easier to write, and make incorrect code more difficult to write. Claiming we reached perfection there doesn't quite fit.
 We need .raw and we must abolish .length and [] for narrow
 strings.

I don't believe that fixes anything and breaks every D project out there.

I agree. This is the only reason that keeps me from furthering the issue.
 We're chasing phantoms here, and I worry a lot about over-engineering
 trivia.

I disagree. I understand that it seems like trivia to you, but that doesn't make your opinion any less wrong, not to mention provincial through insistence it's applicable beyond a small team of experts. Again: I know no other - I literally mean not one - person who writes string code like you do (and myself after learning it from you); the current system is adequate; the proposed system is perfect - save for breaking backwards compatibility, which makes the discussion moot. But it being moot does not afford me to concede this point. I am right.
 And, we already have a type to deal with it: dstring

No. Andrei
Dec 31 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-12-31 08:56:37 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 12/31/11 2:04 AM, Walter Bright wrote:
 
 We're chasing phantoms here, and I worry a lot about over-engineering
 trivia.

I disagree. I understand that seems trivia to you, but that doesn't make your opinion any less wrong, not to mention provincial through insistence it's applicable beyond a small team of experts. Again: I know no other - I literally mean not one - person who writes string code like you do (and myself after learning it from you); the current system is adequate; the proposed system is perfect - save for breaking backwards compatibility, which makes the discussion moot. But it being moot does not afford me to concede this point. I am right.

Perfect? At one time Java and other frameworks started to use UTF-16 code units as if they were characters, and that turned wrong on them. Now we know that not even code points should be considered characters, thanks to characters spanning multiple code points. You might call it perfect, but for that you have made two assumptions:

1. treating code points as characters is good enough, and
2. the performance penalty of decoding everything is tolerable

Ranges of code points might be perfect for you, but it's a tradeoff that won't work in every situation. The whole concept of generic algorithms working on strings efficiently doesn't work. Applying generic algorithms to strings by treating them as a range of code points is both wasteful (because it forces you to decode everything) and incomplete (because of multi-code-point characters), and it should be avoided. Algorithms working on Unicode strings should be designed with Unicode in mind. And the best way to design efficient Unicode algorithms is to access the array of code units directly, read each character at the level of abstraction required, and know what you're doing.

I'm not against making strings more opaque to encourage people to use the Unicode algorithms from the standard library instead of rolling their own. But I doubt the current approach of using .raw alone will prevent many from doing dumb things. On the other side, I'm sure it'll make it more complicated to write Unicode algorithms, because accessing and especially slicing the raw content of char[] will become tiresome. I'm not convinced it's a net win.

As for Walter being the only one coding by looking at the code units directly, that's not true. All my parser code looks at code units directly and only decodes to code points where necessary (just look at the XML parsing code I posted a while ago to get an idea of how it can apply to ranges). And I don't think it's because I've seen Walter's code before; I think it is because I know how Unicode works and I want to make my parser efficient. I've done the same for a parser in C++ a while ago. I can hardly imagine I'm the only one (with Walter and you). I think this is how efficient algorithms dealing with Unicode should be written.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
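The decode-only-where-necessary style described above looks roughly like this (a sketch; skipToColon is a made-up helper, and std.utf.decode advances the index past one code point):

```d
import std.utf : decode;

// Scan for an ASCII delimiter at the code-unit level, decoding only
// when a non-ASCII lead unit is encountered.
size_t skipToColon(string s)
{
    size_t i = 0;
    while (i < s.length && s[i] != ':')
    {
        if (s[i] < 0x80)
            ++i;          // ASCII: single code unit
        else
            decode(s, i); // multibyte: advance past the code point
    }
    return i;
}

void main()
{
    assert(skipToColon("h\u00E9llo: x") == 6);
}
```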
Dec 31 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/31/11 8:17 CST, Michel Fortin wrote:
 On 2011-12-31 08:56:37 +0000, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 12/31/11 2:04 AM, Walter Bright wrote:

 We're chasing phantoms here, and I worry a lot about over-engineering
 trivia.

I disagree. I understand that seems trivia to you, but that doesn't make your opinion any less wrong, not to mention provincial through insistence it's applicable beyond a small team of experts. Again: I know no other - I literally mean not one - person who writes string code like you do (and myself after learning it from you); the current system is adequate; the proposed system is perfect - save for breaking backwards compatibility, which makes the discussion moot. But it being moot does not afford me to concede this point. I am right.

Perfect?

Sorry, I exaggerated. I meant "a net improvement while keeping simplicity".
 At one time Java and other frameworks started to use UTF-16 as
 if they were characters, that turned wrong on them. Now we know that not
 even code points should be considered characters, thanks to characters
 spanning on multiple code points. You might call it perfect, but for
 that you have made two assumptions:

 1. treating code points as characters is good enough, and
 2. the performance penalty of decoding everything is tolerable

I'm not sure how you concluded I drew such assumptions.
 Ranges of code points might be perfect for you, but it's a tradeoff that
 won't work in every situations.

Ranges can be defined to span logical glyphs that span multiple code points.
 The whole concept of generic algorithms working on strings efficiently
 doesn't work.

Apparently std.algorithm does.
 Applying generic algorithms to strings by treating them as
 a range of code points is both wasteful (because it forces you to decode
 everything) and incomplete (because of multi-code-point characters) and
 it should be avoided.

An algorithm that gains by accessing the encoding can do so - and indeed some do. Spanning multi-code-point characters is a matter of defining the range appropriately; it doesn't break the abstraction.
 Algorithms working on Unicode strings should be
 designed with Unicode in mind. And the best way to design efficient
 Unicode algorithms is to access the array of code units directly and
 read each character at the level of abstraction required and know what
 you're doing.

As I said, that's happening already.
 I'm not against making strings more opaque to encourage people to use
 the Unicode algorithms from the standard library instead of rolling
 their own.

I'd say we're discussing making the two kinds of manipulation (encoded sequence of logical character vs. array of code units) more distinguished from each other. That's a Good Thing(tm).
 But I doubt the current approach of using .raw alone will
 prevent many from doing dumb things.

I agree. But I think it would be a sensible improvement over now, when you get to do a ton of dumb things with much more ease.
 On the other side I'm sure it'll
 make it it more complicated to write Unicode algorithms because
 accessing and especially slicing the raw content of char[] will become
 tiresome. I'm not convinced it's a net win.

Many Unicode algorithms don't need slicing. Those that do carefully mix manipulation of code points with manipulation of representation. It is a net win that the two operations are explicitly distinguished.
 As for Walter being the only one coding by looking at the code units
 directly, that's not true. All my parser code look at code units
 directly and only decode to code points where necessary (just look at
 the XML parsing code I posted a while ago to get an idea to how it can
 apply to ranges). And I don't think it's because I've seen Walter code
 before, I think it is because I know how Unicode works and I want to
 make my parser efficient. I've done the same for a parser in C++ a while
 ago. I can hardly imagine I'm the only one (with Walter and you). I
 think this is how efficient algorithms dealing with Unicode should be
 written.

Congratulations. Andrei
Dec 31 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-12-31 15:03:13 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 12/31/11 8:17 CST, Michel Fortin wrote:
 At one time Java and other frameworks started to use UTF-16 as
 if they were characters, that turned wrong on them. Now we know that not
 even code points should be considered characters, thanks to characters
 spanning on multiple code points. You might call it perfect, but for
 that you have made two assumptions:
 
 1. treating code points as characters is good enough, and
 2. the performance penalty of decoding everything is tolerable

I'm not sure how you concluded I drew such assumptions.

1: Because treating UTF-8 strings as a range of code points encourages people to think so. 2: From things you posted on the newsgroup previously. Sorry I don't have the references, but it'd take too long to dig them back up.
 Ranges of code points might be perfect for you, but it's a tradeoff that
 won't work in every situations.

Ranges can be defined to span logical glyphs that span multiple code points.

I'm talking about the default interpretation, where string ranges are ranges of code points, making that tradeoff the default. And also, I think we can agree that a logical glyph range would be terribly inefficient in practice, although it could be a nice teaching tool.
 The whole concept of generic algorithms working on strings efficiently
 doesn't work.

Apparently std.algorithm does.

First, it doesn't really work. It seems to work fine, but it doesn't handle (yet) characters spanning multiple code points. To handle this case, you could use a logical glyph range, but that'd be quite inefficient. Or you can improve the algorithm working on code points so that it checks for combining characters on the edges, but then is it still a generic algorithm?

Second, it doesn't work efficiently. Sure you can specialize the algorithm so it does not decode all code units when it's not necessary, but then does it still classify as a generic algorithm?

My point is that *generic* algorithms cannot work *efficiently* with Unicode, not that they can't work at all. And even then, for the inefficient generic algorithm to work correctly with all input, the user needs to choose the correct Unicode representation for the problem at hand, which requires some general knowledge of Unicode.

Which is why I'd just discourage generic algorithms for strings.
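The multi-code-point problem is concrete even at the code-point level (a sketch using std.algorithm.equal, which compares the decoded code points):

```d
import std.algorithm : equal;

void main()
{
    string composed = "\u00E9";  // 'é' as one code point
    string combined = "e\u0301"; // 'e' + combining acute accent
    // A reader sees the same character, but a code-point range does not.
    assert(!equal(composed, combined));
}
```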
 I'm not against making strings more opaque to encourage people to use
 the Unicode algorithms from the standard library instead of rolling
 their own.

I'd say we're discussing making the two kinds of manipulation (encoded sequence of logical character vs. array of code units) more distinguished from each other. That's a Good Thing(tm).

It's a good abstraction to show the theory of Unicode. But it's not the way to go if you want efficiency. For efficiency you need, for each element in the string, to use the lowest abstraction required to handle this element, so your algorithm needs to know about the various abstraction layers.

This is the kind of "range" I'd use to create algorithms dealing with Unicode properly:

struct UnicodeRange(U)
{
    U frontUnit() @property;
    dchar frontPoint() @property;
    immutable(U)[] frontGlyph() @property;

    void popFrontUnit();
    void popFrontPoint();
    void popFrontGlyph();

    ...
}

Not really a range per your definition of ranges, but basically it lets you intermix working with units, code points, and glyphs. Add a way to slice at the unit level and a way to know the length at the unit level and it's all I need to make an efficient parser, or any algorithm really.

The problem with .raw is that it creates a separate range for the units. This means you can't look at the frontUnit and then decide to pop the unit and then look at the next, decide you need to decode using frontPoint, then call popPoint and return to looking at the front unit.

Also, I'm not sure the "glyph" part of that range is required most of the time, because most of the time you don't need to decode glyphs to be glyph-aware. But it'd be nice if you wanted to count them, and having it there alongside the rest makes users aware of them.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
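A compilable approximation of that interface for UTF-8, with the glyph level omitted (names follow the post; this is a sketch, not a Phobos type):

```d
import std.utf : decode, stride;

struct UnicodeRange
{
    string s;

    bool empty() { return s.length == 0; }

    char frontUnit() { return s[0]; }

    dchar frontPoint()
    {
        size_t i = 0;
        return decode(s, i); // decode one code point starting at i
    }

    void popFrontUnit() { s = s[1 .. $]; }

    void popFrontPoint() { s = s[stride(s, 0) .. $]; }
}

void main()
{
    auto r = UnicodeRange("h\u00E9llo");
    assert(r.frontUnit == 'h');
    r.popFrontUnit();
    assert(r.frontPoint == '\u00E9'); // decodes two code units
    r.popFrontPoint();                // skips both at once
    assert(r.frontUnit == 'l');
}
```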
Dec 31 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/31/11 10:47 AM, Michel Fortin wrote:
 On 2011-12-31 15:03:13 +0000, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> said:

 On 12/31/11 8:17 CST, Michel Fortin wrote:
 At one time Java and other frameworks started to use UTF-16 as
 if they were characters, that turned wrong on them. Now we know that not
 even code points should be considered characters, thanks to characters
 spanning on multiple code points. You might call it perfect, but for
 that you have made two assumptions:

 1. treating code points as characters is good enough, and
 2. the performance penalty of decoding everything is tolerable

I'm not sure how you concluded I drew such assumptions.

1: Because treating UTF-8 strings as a range of code point encourage people to think so. 2: From things you posted on the newsgroup previously. Sorry I don't have the references, but it'd take too long to dig them back.

That's sort of difficult to refute. Anyhow, I think it's great that algorithms can use types to go down to the representation if needed, and stay up at bidirectional range level otherwise.
 Ranges of code points might be perfect for you, but it's a tradeoff that
 won't work in every situations.

Ranges can be defined to span logical glyphs that span multiple code points.

I'm talking about the default interpretation, where string ranges are ranges of code units, making that tradeoff the default. And also, I think we can agree that a logical glyph range would be terribly inefficient in practice, although it could be a nice teaching tool.

Well people who want that could use byGlyph() or something. If you want glyphs, you gotta pay the price.
 The whole concept of generic algorithms working on strings efficiently
 doesn't work.

Apparently std.algorithm does.

First, it doesn't really work.

Oh yes it does.
 It seems to work fine, but it doesn't
 handle (yet) characters spanning multiple code points.

That's the job of std.range, not std.algorithm.
 To handle this
 case, you could use a logical glyph range, but that'd be quite
 inefficient. Or you can improve the algorithm working on code points so
 that it checks for combining characters on the edges, but then is it
 still a generic algorithm?

 Second, it doesn't work efficiently. Sure you can specialize the
 algorithm so it does not decode all code units when it's not necessary,
 but then does it still classify as a generic algorithm?

 My point is that *generic* algorithms cannot work *efficiently* with
 Unicode, not that they can't work at all. And even then, for the
 inneficient generic algorithm to work correctly with all input, the user
 need to choose the correct Unicode representation to for the problem at
 hand, which requires some general knowledge of Unicode.

 Which is why I'd just discourage generic algorithms for strings.

I think you are in a position that is defensible, but not generous and therefore undesirable. The military equivalent would be defending a fortified landfill drained by a sewer. You don't _want_ to be there. Taking your argument to its ultimate conclusion, we'd give up on genericity for strings and go home.

Strings are a variable-length encoding on top of an array. That is a relatively easy abstraction to model. Currently we don't have a dedicated model for that - we offer the encoded data as a bidirectional range and also the underlying array. Algorithms that work with bidirectional ranges work out of the box. Those that can use the representation gainfully can opportunistically specialize on isSomeString!R.

You contend that that doesn't "work", and I think you're wrong. But to the extent you have a case, an abstraction could be defined for variable-length encodings, and algorithms could be defined to work with that abstraction. I thought several times about that, but couldn't gather enough motivation for the simple reason that the current approach _works_.
 I'm not against making strings more opaque to encourage people to use
 the Unicode algorithms from the standard library instead of rolling
 their own.

I'd say we're discussing making the two kinds of manipulation (encoded sequence of logical character vs. array of code units) more distinguished from each other. That's a Good Thing(tm).

It's a good abstraction to show the theory of Unicode. But it's not the way to go if you want efficiency. For efficiency you need for each element in the string to use the lowest abstraction required to handle this element, so your algorithm needs to know about the various abstraction layers.

Correct.
 This is the kind of "range" I'd use to create algorithms dealing with
 Unicode properly:

 struct UnicodeRange(U)
 {
     U frontUnit() @property;
     dchar frontPoint() @property;
     immutable(U)[] frontGlyph() @property;

     void popFrontUnit();
     void popFrontPoint();
     void popFrontGlyph();

     ...
 }

We already have most of that. For a string s, s[0] is frontUnit, s.front is frontPoint, s = s[1 .. $] is popFrontUnit(), s.popFront() is popFrontPoint. We only need to define the glyph routines. But I think you'd be stopping short. You want generic variable-length encoding, not the above.
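That correspondence, spelled out (a sketch; front/popFront come from std.range and decode narrow strings):

```d
import std.range;

void main()
{
    string s = "h\u00E9llo";
    assert(s[0] == 'h');         // frontUnit
    assert(s.front == 'h');      // frontPoint
    s = s[1 .. $];               // popFrontUnit
    assert(s.front == '\u00E9'); // decoding front sees the whole 'é'
    s.popFront();                // popFrontPoint skips both units
    assert(s.front == 'l');
}
```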
 Not really a range per your definition of ranges, but basically it lets
 you intermix working with units, code points, and glyphs. Add a way to
 slice at the unit level and a way to know the length at the unit level
 and it's all I need to make an efficient parser, or any algorithm really.

Except for the glyphs implementation, we're already there. You are talking about existing capabilities!
 The problem with .raw is that it creates a separate range for the units.

That's the best part about it.
 This means you can't look at the frontUnit and then decide to pop the
 unit and then look at the next, decide you need to decode using
 frontPoint, then call popPoint and return to looking at the front unit.

Of course you can.

while (condition) {
    if (s.raw.front == someFrontUnitThatICareAbout) {
        s.raw.popFront();
        auto c = s.front;
        s.popFront();
    }
}

Now that I wrote it I'm even more enthralled with the coolness of the scheme. You essentially have access to two separate ranges on top of the same fabric.

Andrei
Dec 31 2011
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-12-31 18:56:01 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 12/31/11 10:47 AM, Michel Fortin wrote:
 It seems to work fine, but it doesn't
 handle (yet) characters spanning multiple code points.

That's the job of std.range, not std.algorithm.

As I keep saying, if you handle combining code points at the range level you'll have very inefficient code. But I think you get that.
 To handle this
 case, you could use a logical glyph range, but that'd be quite
 inefficient. Or you can improve the algorithm working on code points so
 that it checks for combining characters on the edges, but then is it
 still a generic algorithm?
 
 Second, it doesn't work efficiently. Sure you can specialize the
 algorithm so it does not decode all code units when it's not necessary,
 but then does it still classify as a generic algorithm?
 
 My point is that *generic* algorithms cannot work *efficiently* with
 Unicode, not that they can't work at all. And even then, for the
 inefficient generic algorithm to work correctly with all input, the user
 needs to choose the correct Unicode representation for the problem at
 hand, which requires some general knowledge of Unicode.
 
 Which is why I'd just discourage generic algorithms for strings.

I think you are in a position that is defensible, but not generous and therefore undesirable. The military equivalent would be defending a fortified landfill drained by a sewer. You don't _want_ to be there.

I don't get the analogy.
 Taking your argument to its ultimate conclusion is that we give up on 
 genericity for strings and go home.

That is more or less what I am saying. Genericity for strings leads to inefficient algorithms, and you don't want inefficient algorithms, at least not without being warned in advance. This is why for instance you give a special name to inefficient (linear) operations in std.container. In the same way, I think generic operations on strings should be disallowed unless you opt-in by explicitly saying on which representation you want the algorithm to perform its task.
 This is the kind of "range" I'd use to create algorithms dealing with
 Unicode properly:
 
 struct UnicodeRange(U)
 {
 U frontUnit()  property;
 dchar frontPoint()  property;
 immutable(U)[] frontGlyph()  property;
 
 void popFrontUnit();
 void popFrontPoint();
 void popFrontGlyph();
 
 ...
 }

We already have most of that. For a string s, s[0] is frontUnit, s.front is frontPoint, s = s[1 .. $] is popFrontUnit(), s.popFront() is popFrontPoint. We only need to define the glyph routines.

Indeed. I came with this concept when writing my XML parser, I defined frontUnit and popFrontUnit and used it all over the place (in conjunction with slicing). And I rarely needed to decode whole code points using front and popFront.
 But I think you'd be stopping short. You want generic variable-length 
 encoding, not the above.

Really? How'd that work?
 Except for the glyphs implementation, we're already there. You are 
 talking about existing capabilities!
 
 The problem with .raw is that it creates a separate range for the units.

That's the best part about it.

Depends. It should create a *linked* range, not a *separate* one, in the sense that if you advance the "raw" range with popFront, it should advance the underlying "code point" range too.
 This means you can't look at the frontUnit and then decide to pop the
 unit and then look at the next, decide you need to decode using
 frontPoint, then call popPoint and return to looking at the front unit.

Of course you can.

while (condition) {
    if (s.raw.front == someFrontUnitThatICareAbout) {
        s.raw.popFront();
        auto c = s.front;
        s.popFront();
    }
}

But will s.raw.popFront() also pop a single unit from s? "raw" would need to be defined as a reinterpret cast of the reference to the char[] to do what I want, something like this:

ref ubyte[] raw(ref char[] s) { return *cast(ubyte[]*)&s; }

The current std.string.representation doesn't do that at all. Also, how does it work with slicing? It can work with raw, but you'll have to cast things everywhere because raw is a ubyte[]:

string s = "éà";
s = cast(typeof(s))s.raw[0..4];
 Now that I wrote it I'm even more enthralled with the coolness of the 
 scheme. You essentially have access to two separate ranges on top of 
 the same fabric.

Glad you like the concept. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Dec 31 2011
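The byte arithmetic in that last slice is easy to check outside of D. A quick sketch in Python (not D, but the UTF-8 facts are identical): "éà" occupies 4 code units in UTF-8, so slicing the raw view at [0..4] keeps both characters, while cutting mid-character leaves invalid UTF-8.

```python
# UTF-8 view of "éà": each character is 2 code units, 4 bytes total.
s = "éà"
raw = s.encode("utf-8")          # the "raw" code-unit view
assert len(raw) == 4             # 4 code units for 2 code points

# Slicing on a code-point boundary round-trips cleanly...
assert raw[0:4].decode("utf-8") == "éà"
assert raw[0:2].decode("utf-8") == "é"

# ...but slicing mid-character leaves a dangling lead byte.
try:
    raw[0:3].decode("utf-8")
except UnicodeDecodeError:
    print("raw[0:3] is not valid UTF-8")
```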
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/31/11 2:44 PM, Michel Fortin wrote:
 But will s.raw.popFront() also pop a single unit from s? "raw" would
 need to be defined as a reinterpret cast of the reference to the char[]
 to do what I want, something like this:

 ref ubyte[] raw(ref char[] s) { return *cast(ubyte[]*)&s; }

 The current std.string.representation doesn't do that at all.

You just found a bug! Andrei
Dec 31 2011
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 07:56 PM, Andrei Alexandrescu wrote:
 On 12/31/11 10:47 AM, Michel Fortin wrote:
 This means you can't look at the frontUnit and then decide to pop the
 unit and then look at the next, decide you need to decode using
 frontPoint, then call popPoint and return to looking at the front unit.

Of course you can.

while (condition) {
    if (s.raw.front == someFrontUnitThatICareAbout) {
        s.raw.popFront();
        auto c = s.front;
        s.popFront();
    }
}

Now that I wrote it I'm even more enthralled with the coolness of the scheme. You essentially have access to two separate ranges on top of the same fabric.

Andrei

There is nothing wrong with the scheme on the conceptual level (except maybe that .raw.popFront() lets you invalidate the code point range). But making built-in arrays behave that way is like fitting a square peg in a round hole. immutable(char)[] is actually what .raw should return, not what it should be called on. It is already the raw representation.
Dec 31 2011
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 03:17 PM, Michel Fortin wrote:
 As for Walter being the only one coding by looking at the code units
 directly, that's not true. All my parser code look at code units
 directly and only decode to code points where necessary (just look at
 the XML parsing code I posted a while ago to get an idea to how it can
 apply to ranges). And I don't think it's because I've seen Walter code
 before, I think it is because I know how Unicode works and I want to
 make my parser efficient. I've done the same for a parser in C++ a while
 ago. I can hardly imagine I'm the only one (with Walter and you). I
 think this is how efficient algorithms dealing with Unicode should be
 written.

+1.
Dec 31 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/31/11 10:47 AM, Sean Kelly wrote:
 I don't know that Unicode expertise is really required here anyway.
 All one has to know is that UTF8 is a multibyte encoding and
 built-in string attributes talk in bytes. Knowing when one wants
 bytes vs characters isn't rocket science. That said, I'm on the fence
 about this change. It breaks consistency for a benefit I'm still
 weighing. With this change, the char type will still be a single
 byte, correct? What happens to foreach on strings?

Clearly this is a what-if debate. The best level of agreement we could ever reach is "well, it would've been nice... sigh". It's possible that we'll define a Rope type in std.container - a heavy-duty string type with small string optimization, interning, the works. That type may use insights we are deriving from this exchange. Andrei
Dec 31 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 08:06 PM, Andrei Alexandrescu wrote:
 On 12/31/11 10:47 AM, Sean Kelly wrote:
 I don't know that Unicode expertise is really required here anyway.
 All one has to know is that UTF8 is a multibyte encoding and
 built-in string attributes talk in bytes. Knowing when one wants
 bytes vs characters isn't rocket science. That said, I'm on the fence
 about this change. It breaks consistency for a benefit I'm still
 weighing. With this change, the char type will still be a single
 byte, correct? What happens to foreach on strings?

Clearly this is a what-if debate. The best level of agreement we could ever reach is "well, it would've been nice... sigh". It's possible that we'll define a Rope type in std.container - a heavy-duty string type with small string optimization, interning, the works. That type may use insights we are deriving from this exchange. Andrei

That would be great.
Dec 31 2011
prev sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-12-31 16:47:40 +0000, Sean Kelly <sean invisibleduck.org> said:

  I don't know that Unicode expertise is really required here anyway. All one
  has to know is that UTF8 is a multibyte encoding and built-in string
  attributes talk in bytes. Knowing when one wants bytes vs characters isn't
  rocket science.

It's not bytes vs. characters, it's code units vs. code points vs. user-perceived characters (grapheme clusters). One character can span multiple code points, and can be represented in various ways depending on which Unicode normalization you pick. But most people don't know that.

If you want to count the number of *characters*, counting code points isn't really it, as you should avoid counting the combining ones. If you want to search for a substring, you need to be sure both strings use the same normalization first, and if not, normalize them appropriately so that equivalent code point combinations are always represented the same.

That said, if you are implementing an XML or JSON parser, since those specs are defined in terms of code points you should probably write your code in terms of code points (hopefully without decoding code points when you don't need to). On the other hand, if you're writing something that processes text (like counting the average number of *characters* per word in a document), then you should be aware of combining characters.

How to pack all this into an easy to use package is most challenging.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Dec 31 2011
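The normalization pitfall described above is easy to demonstrate concretely. A sketch using Python's unicodedata (the same NFC/NFD forms exist in any Unicode library): the same user-perceived character can be one code point or two, and naive substring search fails across forms until both sides are normalized.

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "é")   # precomposed: one code point
nfd = unicodedata.normalize("NFD", "é")   # 'e' + combining acute: two code points
assert len(nfc) == 1 and len(nfd) == 2

# The two are canonically equivalent but not equal as code point sequences.
assert nfc != nfd

# Naive substring search fails across forms...
assert nfc not in "caf" + nfd
# ...and works once both sides use the same normalization.
assert unicodedata.normalize("NFD", nfc) in "caf" + nfd
```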
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 We need .raw and we must abolish .length and [] for narrow strings.

I don't know if we need them, but I agree those things are an improvement over the current state. To replace the disabled slicing I think something like Python's islice() will be useful.

Bye,
bearophile
Dec 31 2011
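For reference, a sketch of what such lazy slicing looks like in Python itself: itertools.islice walks the sequence of code points instead of indexing into it, so a multibyte character can never be cut in half the way a raw code-unit slice could.

```python
from itertools import islice

s = "héllo"  # 'é' is 2 UTF-8 code units, but one code point

# Lazy "slice" over code points: an O(n) walk, no random access needed.
middle = "".join(islice(iter(s), 1, 3))
assert middle == "él"

# Contrast with the code-unit view, where indices drift past 'é'.
assert len(s.encode("utf-8")) == 6  # 6 code units for 5 code points
```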
prev sibling next sibling parent Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Timon Gehr wrote:
 Me too. I think the way we have it now is optimal. The only reason we
 are discussing this is because of fear that uneducated users will write
 code that does not take into account Unicode characters above code point
 0x80.

+1 From D's string docs: "char[] strings are in UTF-8 format. wchar[] strings are in UTF-16 format. dchar[] strings are in UTF-32 format." I would additionally add some clarifications: char[] is an array of 8-bit code units. Unicode code point may take up to 4 chars. wchar[] is an array of 16-bit code units. Unicode code point may take up to 2 wchars. dchar[] is an array of 32-bit code units. Unicode code point always fits into one dchar. Each of these formats may encode any Unicode string. If you need indexing or slicing use: * char[] or string when working with ASCII code points. * wchar[] or wstring when working with Basic Multilingual Plane (BMP) code points. * dchar[] or dstring when working with all possible code points. If you do not need indexing or slicing you may use any of the formats.
Dec 31 2011
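The unit counts above can be verified directly. A sketch in Python (D's char/wchar/dchar correspond to UTF-8/UTF-16/UTF-32 code units): a character outside the BMP needs 4 chars, 2 wchars (a surrogate pair), and exactly 1 dchar.

```python
ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane

utf8_units  = len(ch.encode("utf-8"))            # 1 byte per code unit
utf16_units = len(ch.encode("utf-16-le")) // 2   # 2 bytes per code unit
utf32_units = len(ch.encode("utf-32-le")) // 4   # 4 bytes per code unit

assert utf8_units == 4   # up to 4 chars per code point
assert utf16_units == 2  # up to 2 wchars (a surrogate pair)
assert utf32_units == 1  # always exactly 1 dchar
```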
prev sibling parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 12/30/2011 02:55 PM, Timon Gehr wrote:
 On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length
 and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not
 char/wchar.
 Then, people would access the decoding routines on the needed
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now.
 There's
 no loss of functionality -- it's just stops you from accidentally doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

Maybe it could happen if we 1. make dstring the default strings type --

Inefficient.

But correct (enough).
 code units and characters would be the same

Wrong.

*sigh*, FINE. Code units and /code points/ would be the same.
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).

Inconsistent? How?

Inefficiency is a lot easier to deal with than incorrectness. If something is inefficient, then in the right places I will NOTICE. If something is incorrect, it can hide for years until that one person (or country, in this case) with a different usage pattern than the others uncovers it.
 so programmers could use the slices/indexing/length (no lazyness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.

What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these. Slicing is super useful for script-like coding.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?

1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.

How do you know they are only working with ASCII? They might be /now/. But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code?
 2. They get screwed up string output, look for the reason, patch up
 their code with some functions from std.utf and will never make the same
 mistakes again.
 

Except they don't. Because there are a lot of programmers that will never put in non-ascii strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ascii strings in. This could make some messes.
 
 I have *never* seen an user in D.learn complain about it. They might
 have been some I missed, but it is certainly not a prevalent problem.
 Also, just because an user can type .rep does not mean he understands
 Unicode: He is able to make just the same mistakes as before, even more
 so, as the array he is getting back has the _wrong element type_.
 

You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in. No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herds people into writing correct code to begin with, because they won't otherwise.

...

There's another issue at play here too: efficiency vs correctness as a default. Here's the tradeoff --

Option A: char[i] returns the i'th byte of the string as a (char) type.
Consequences:
(1) Code is efficient and INcorrect.
(2) It requires extra effort to write correct code.
(3) Detecting the incorrect code may take years, as these errors can hide easily.

Option B: char[i] returns the i'th codepoint of the string as a (dchar) type.
Consequences:
(1) Code is INefficient and correct.
(2) It requires extra effort to write efficient code.
(3) Detecting the inefficient code happens in minutes. It is VERY noticeable when your program runs too slowly.

This is how I see it. And I really like my correct code. If it's too slow (and I'll /know/ when it's too slow), then I'll profile->tweak->profile->etc until the slowness goes away. I'm totally digging option B.
Dec 31 2011
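The cost hidden inside Option B can be made concrete. A sketch (hypothetical helper, in Python rather than D): finding the i'th code point in a UTF-8 buffer requires a linear scan that skips continuation bytes, which is exactly why code-point indexing on a narrow string cannot be O(1).

```python
def codepoint_offset(raw: bytes, i: int) -> int:
    """Byte offset of the i'th code point: a linear scan over code units."""
    seen = 0
    for offset, unit in enumerate(raw):
        if (unit & 0xC0) != 0x80:  # not a continuation byte => starts a code point
            if seen == i:
                return offset
            seen += 1
    raise IndexError(i)

raw = "héllo".encode("utf-8")      # h = 1 byte, é = 2 bytes, then l, l, o
assert codepoint_offset(raw, 0) == 0   # 'h'
assert codepoint_offset(raw, 1) == 1   # 'é' starts right after 'h'
assert codepoint_offset(raw, 2) == 3   # 'l' comes after the 2-byte 'é'
```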
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 12/31/2011 07:22 PM, Chad J wrote:
 On 12/30/2011 02:55 PM, Timon Gehr wrote:
 On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length
 and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not
 char/wchar.
 Then, people would access the decoding routines on the needed
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now.
 There's
 no loss of functionality -- it's just stops you from accidentally doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

Maybe it could happen if we 1. make dstring the default strings type --

Inefficient.

But correct (enough).
 code units and characters would be the same

Wrong.

*sigh*, FINE. Code units and /code points/ would be the same.

Relax.
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).

Inconsistent? How?

int[]
bool[]
float[]
char[]
 Inefficiency is a lot easier to deal with than incorrectness. If something
 is inefficient, then in the right places I will NOTICE.  If something is
 incorrect, it can hide for years until that one person (or country, in
 this case) with a different usage pattern than the others uncovers it.

 so programmers could use the slices/indexing/length (no lazyness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.

What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these. Slicing is super useful for script-like coding.

Except that the proposal would make slicing strings go away.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen? 1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.

How do you know they are only working with ASCII? They might be /now/. But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code?

Then they obviously need to fix the code, because the requirements have changed. Most of it will already work correctly though, because UTF-8 extends ASCII in a natural way.
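That "natural extension" is a concrete property of the encoding, easy to check in Python: ASCII text encodes to the identical bytes under UTF-8, and every byte of a multi-byte sequence has the high bit set, so ASCII-oriented code never mistakes part of a non-ASCII character for an ASCII one.

```python
# ASCII is a strict subset of UTF-8: same bytes, one code unit per character.
assert "hello, world".encode("utf-8") == b"hello, world"

# All code units of a multi-byte sequence have the high bit set,
# so they can never collide with an ASCII byte (< 0x80).
for unit in "éà€".encode("utf-8"):
    assert unit >= 0x80
```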
 2. They get screwed up string output, look for the reason, patch up
 their code with some functions from std.utf and will never make the same
 mistakes again.

Except they don't. Because there are a lot of programmers that will never put in non-ascii strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ascii strings in. This could make some messes.
 I have *never* seen an user in D.learn complain about it. They might
 have been some I missed, but it is certainly not a prevalent problem.
 Also, just because an user can type .rep does not mean he understands
 Unicode: He is able to make just the same mistakes as before, even more
 so, as the array he is getting back has the _wrong element type_.

You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in. No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herds people into writing correct code to begin with, because they won't otherwise.

There is no way to 'herd people into writing correct code' and UTF-8 is quite easy to deal with.
 ...

 There's another issue at play here too: efficiency vs correctness as a
 default.

 Here's the tradeoff --

 Option A:
 char[i] returns the i'th byte of the string as a (char) type.
 Consequences:
 (1) Code is efficient and INcorrect.

Do you have an example of impactful incorrect code resulting from those semantics?
 (2) It requires extra effort to write correct code.
 (3) Detecting the incorrect code may take years, as these errors can
 hide easily.

None of those is a direct consequence of char[i] returning char. They are the consequence of at least 3 things:

1. char[] is an array of char
2. immutable(char)[] is the default string type
3. the programmer does not know about 1. and/or 2.

I say, 1. is inevitable. You say 3. is inevitable. If we are both right, then 2. is the culprit.
 Option B:
 char[i] returns the i'th codepoint of the string as a (dchar) type.
 Consequences:
 (1) Code is INefficient and correct.

It is awfully optimistic to assume the code will be correct.
 (2) It requires extra effort to write efficient code.
 (3) Detecting the inefficient code happens in minutes.  It is VERY
 noticable when your program runs too slowly.

Except when in testing only small inputs are used, and only 2 years later maintainers throw your program at a larger problem instance and wonder why it does not terminate. Or your program is DOS'd. Polynomial blowup in runtime can be as large a problem in practice as a correctness bug.
 This is how I see it.

 And I really like my correct code.  If it's too slow, and I'll /know/
 when it's too slow, then I'll profile->tweak->profile->etc until the
 slowness goes away.  I'm totally digging option B.

Those kinds of inefficiencies build up and make the whole program run sluggish, and it will possibly be too late when you notice. Option B is not even on the table. This thread is about a breaking interface change and special casing T[] for T in {char, wchar}.
Dec 31 2011
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 12/31/2011 02:02 PM, Timon Gehr wrote:
 On 12/31/2011 07:22 PM, Chad J wrote:
 On 12/30/2011 02:55 PM, Timon Gehr wrote:
 On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this
 thread
 is abolition (through however slow a deprecation path) of
 s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length
 and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not
 char/wchar.
 Then, people would access the decoding routines on the needed
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now.
 There's
 no loss of functionality -- it's just stops you from accidentally
 doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

Maybe it could happen if we 1. make dstring the default strings type --

Inefficient.

But correct (enough).
 code units and characters would be the same

Wrong.

*sigh*, FINE. Code units and /code points/ would be the same.

Relax.

I'll do one better and ultra relax: http://www.youtube.com/watch?v=jimQoWXzc0Q ;)
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).

Inconsistent? How?

int[]
bool[]
float[]
char[]

I'll refer to another limb of this thread when foobar mentioned a mental model of strings as strings of letters. Now, given annoying corner cases, we probably can't get strings of /letters/, but I'd at least like to make it as far as code points. That seems very doable.

I mention this because I find that forwarding string.length and opIndex would be much more consistent with this mental model of strings as strings of unicode code points, which, IMO, is more important than it being binary consistent with the other things.

I'd much rather have char[] behave more like an array of code points than an array of bytes. I don't need an array of bytes. That's ubyte[]; I have that already.
 Inefficiency is a lot easier to deal with than incorrect.  If something
 is inefficient, then in the right places I will NOTICE.  If something is
 incorrect, it can hide for years until that one person (or country, in
 this case) with a different usage pattern than the others uncovers it.

 so programmers could use the slices/indexing/length (no lazyness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.

What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these. Slicing is super useful for script-like coding.

Except that the proposal would make slicing strings go away.

Yeah, Andrei's proposal says that. But I'm speaking of Joshua's:
 so programmers could use the slices/indexing/length ...




I kind-of like either, but I'd prefer Joshua's suggestion.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen? 1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.

How do you know they are only working with ASCII? They might be /now/. But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code?

Then they obviously need to fix the code, because the requirements have changed. Most of it will already work correctly though, because UTF-8 extends ASCII in a natural way.

Or, you know, we could design the language a little differently and make this become mostly a non-problem. That would be cool.
 2. They get screwed up string output, look for the reason, patch up
 their code with some functions from std.utf and will never make the same
 mistakes again.

Except they don't. Because there are a lot of programmers that will never put in non-ascii strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ascii strings in. This could make some messes.
 I have *never* seen an user in D.learn complain about it. They might
 have been some I missed, but it is certainly not a prevalent problem.
 Also, just because an user can type .rep does not mean he understands
 Unicode: He is able to make just the same mistakes as before, even more
 so, as the array he is getting back has the _wrong element type_.

You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in. No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herds people into writing correct code to begin with, because they won't otherwise.

There is no way to 'herd people into writing correct code' and UTF-8 is quite easy to deal with.

Probably not. I played fast and loose with this a lot in my early D code. Then this same conversation happened like ~3 years ago on this newsgroup. Then I learned more about unicode and had a bit of a bitter taste regarding char[] and how it handled indexing. I thought I could just index char[]s willy nilly. But no, I can't. And the compiler won't tell me. It just silently does what I don't want.

Maybe unicode is easy, but we sure as hell aren't born with it, and the language doesn't give beginners ANY red flags about this. I find myself pretty fortified against this issue due to having known about it before anything unpleasant happened, but I don't like the idea of others having to learn the hard way.
 ...

 There's another issue at play here too: efficiency vs correctness as a
 default.

 Here's the tradeoff --

 Option A:
 char[i] returns the i'th byte of the string as a (char) type.
 Consequences:
 (1) Code is efficient and INcorrect.

Do you have an example of impactful incorrect code resulting from those semantics?

Nope. Sorry. I learned about it before it had a chance to bite me. But this is only because I frequent(ed) the newsgroup and had a good throw on my dice roll.
 (2) It requires extra effort to write correct code.
 (3) Detecting the incorrect code may take years, as these errors can
 hide easily.

None of those is a direct consequence of char[i] returning char. They are the consequence of at least 3 things:

1. char[] is an array of char
2. immutable(char)[] is the default string type
3. the programmer does not know about 1. and/or 2.

I say, 1. is inevitable. You say 3. is inevitable. If we are both right, then 2. is the culprit.

I can get behind this. Honestly I'd like the default string type to be intelligent and optimize itself into whichever UTF-N encoding is optimal for content I throw into it. Maybe this means it should lazily expand itself to the narrowest character type that maintains a 1-to-1 ratio between code units and code points so that indexing/slicing remain O(1), or maybe it's a bag of disparate encodings, or maybe someone can think of a better strategy. Just make it /reasonably/ fast and help me with correctness as much as possible. If I need more performance or more unicode pedantics, I'll do my homework then and only then. Of course this is probably never going to happen I'm afraid. Even the problem of making such a (probably) struct work at compile time in templates as if it were a native type... agh, headaches.
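A rough sketch of such an adaptive type. To be clear, nothing like this exists in Phobos; AdaptiveString and everything in it are made up for illustration. It only shows the core trick: store the narrowest element type that keeps a 1-to-1 code unit/code point mapping, so indexing by code point stays O(1).

```d
import std.algorithm.searching : all;

// Hypothetical: picks the narrowest internal encoding with a
// 1-to-1 code unit/code point mapping, so opIndex is O(1).
struct AdaptiveString
{
    private ubyte[]  latin1; // used when all code points < 0x100
    private ushort[] bmp;    // used when all code points < 0x10000
    private uint[]   full;   // everything else

    this(const(dchar)[] text)
    {
        if (text.all!(c => c < 0x100))
            foreach (c; text) latin1 ~= cast(ubyte) c;
        else if (text.all!(c => c < 0x10000))
            foreach (c; text) bmp ~= cast(ushort) c;
        else
            foreach (c; text) full ~= c;
    }

    size_t length() const // exactly one array is non-empty
    {
        return latin1.length + bmp.length + full.length;
    }

    dchar opIndex(size_t i) const // O(1): one code unit per code point
    {
        if (latin1.length) return latin1[i];
        if (bmp.length)    return bmp[i];
        return full[i];
    }
}

void main()
{
    auto s = AdaptiveString("héllo"d);
    assert(s.length == 5);  // 5 code points, whatever the storage
    assert(s[1] == 'é');    // constant-time code point indexing
}
```

Whether the transcoding cost on construction is worth it is exactly the efficiency trade-off being argued about in this thread.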
 Option B:
 char[i] returns the i'th codepoint of the string as a (dchar) type.
 Consequences:
 (1) Code is INefficient and correct.

It is awfully optimistic to assume the code will be correct.
 (2) It requires extra effort to write efficient code.
 (3) Detecting the inefficient code happens in minutes.  It is VERY
 noticeable when your program runs too slowly.

Except when in testing only small inputs are used, and only two years later maintainers throw your program at a larger problem instance and wonder why it does not terminate. Or your program is DoS'd. Polynomial blowup in runtime can be just as large a problem in practice as a correctness bug.

I see what you mean there. I'm still not entirely happy with it though. I don't think these are reasonable requirements. It sounds like forced premature optimization to me. I have found myself in a number of places in different problem domains where optimality-is-correctness. Make it too slow and the program isn't worth writing. I can't imagine doing this for workloads I can't test on or anticipate though: I'd have to operate like NASA and make things 10x more expensive than they need to be. Correctness, on the other hand, can be easily (relatively speaking) obtained by only allowing the user to input data you can handle and then making sure the program can handle it as promised. Test, test, test, etc.
 This is how I see it.

 And I really like my correct code.  If it's too slow, and I'll /know/
 when it's too slow, then I'll profile->tweak->profile->etc until the
 slowness goes away.  I'm totally digging option B.

Those kinds of inefficiencies build up and make the whole program run sluggishly, and it will possibly be too late when you notice.

I get the feeling that the typical divide-and-conquer profiling strategy will find the more expensive operations /at least/ most of the time. Unfortunately, I have only experience to speak from on this matter.
 Option B is not even on the table. This thread is about a breaking
 interface change and special casing T[] for T in {char, wchar}.
 
 

Yeah, I know. I'm referring to what Joshua wrote, because I like option B. Even if it's academic, I'll say I like it anyway, if only for the sake of argument.
Dec 31 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/01/2012 02:34 AM, Chad J wrote:
 On 12/31/2011 02:02 PM, Timon Gehr wrote:
 On 12/31/2011 07:22 PM, Chad J wrote:
 On 12/30/2011 02:55 PM, Timon Gehr wrote:
 On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this
 thread
 is abolition (through however slow a deprecation path) of
 s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length
 and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not
 char/wchar.
 Then, people would access the decoding routines on the needed
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now.
 There's
 no loss of functionality -- it's just stops you from accidentally
 doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...
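For the record, the raw view discussed here is spelled std.string.representation in Phobos (it is mentioned later in this thread). A small sketch of how it behaves; the byte values are just the UTF-8 encoding of 'é':

```d
import std.string : representation;

void main()
{
    string s = "hé";                          // 'é' is two UTF-8 code units
    immutable(ubyte)[] raw = s.representation;
    assert(raw.length == 3);                  // 'h', 0xC3, 0xA9
    assert(raw[1] == 0xC3 && raw[2] == 0xA9); // the encoded 'é'
}
```

Note it hands back ubyte, not char, which is exactly the "right element type" point being argued above.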

Maybe it could happen if we 1. make dstring the default strings type --

Inefficient.

But correct (enough).
 code units and characters would be the same

Wrong.

*sigh*, FINE. Code units and /code points/ would be the same.

Relax.

I'll do one better and ultra relax: http://www.youtube.com/watch?v=jimQoWXzc0Q ;)
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).

Inconsistent? How?

int[]
bool[]
float[]
char[]

I'll refer to another limb of this thread when foobar mentioned a mental model of strings as strings of letters. Now, given annoying corner cases, we probably can't get strings of /letters/, but I'd at least like to make it as far as code points. That seems very doable. I mention this because I find that forwarding string.length and opIndex would be much more consistent with this mental model of strings as strings of unicode code points, which, IMO, is more important than it being binary consistent with the other things. I'd much rather have char[] behave more like an array of code points than an array of bytes. I don't need an array of bytes. That's ubyte[]; I have that already.

char[] is not an array of bytes: it is an array of UTF-8 code units.
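A quick demonstration of that distinction, assuming std.utf.count for the code point count:

```d
import std.utf : count;

void main()
{
    string s = "héllo";
    assert(s.length == 6);   // code units: 'é' takes two bytes in UTF-8
    assert(s.count == 5);    // code points, computed in O(n)
    assert(s[1] == 0xC3);    // the first byte of 'é', not 'é' itself
    foreach (dchar c; s) {}  // foreach can decode code units to code points
}
```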
 Inefficiency is a lot easier to deal with than incorrect.  If something
 is inefficient, then in the right places I will NOTICE.  If something is
 incorrect, it can hide for years until that one person (or country, in
 this case) with a different usage pattern than the others uncovers it.

 so programmers could use the slices/indexing/length (no lazyness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.

What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these. Slicing is super useful for script-like coding.

Except that the proposal would make slicing strings go away.

Yeah, Andrei's proposal says that. But I'm speaking of Joshua's:
 so programmers could use the slices/indexing/length ...




I kind-of like either, but I'd prefer Joshua's suggestion.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?

1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.

How do you know they are only working with ASCII? They might be /now/. But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code?

Then they obviously need to fix the code, because the requirements have changed. Most of it will already work correctly though, because UTF-8 extends ASCII in a natural way.

Or, you know, we could design the language a little differently and make this become mostly a non-problem. That would be cool.

It is imo already mostly a non-problem, but YMMV:

import std.stdio;

void main(){
    string s = readln();
    int nest = 0;
    foreach(x; s){ // iterates by code unit
        if(x=='(') nest++;
        else if(x==')' && --nest<0) goto unbalanced;
    }
    if(!nest){
        writeln("balanced parentheses");
        return;
    }
unbalanced:
    writeln("unbalanced parentheses");
}

That code is UTF aware, even though it does not explicitly deal with UTF. I'd claim it is like this most of the time.
 2. They get screwed up string output, look for the reason, patch up
 their code with some functions from std.utf and will never make the same
 mistakes again.

Except they don't. Because there are a lot of programmers that will never put in non-ascii strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ascii strings in. This could make some messes.
 I have *never* seen a user in D.learn complain about it. There might
 have been some I missed, but it is certainly not a prevalent problem.
 Also, just because a user can type .rep does not mean he understands
 Unicode: He is able to make just the same mistakes as before, even more
 so, as the array he is getting back has the _wrong element type_.

You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in. No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herds people into writing correct code to begin with, because they won't otherwise.

There is no way to 'herd people into writing correct code' and UTF-8 is quite easy to deal with.

Probably not. I played fast and loose with this a lot in my early D code. Then this same conversation happened like ~3 years ago on this newsgroup. Then I learned more about unicode and had a bit of a bitter taste regarding char[] and how it handled indexing. I thought I could just index char[]s willy nilly. But no, I can't. And the compiler won't tell me. It just silently does what I don't want.

How often do you actually need to get, for example, the 10th character of a string? I think it is a very uncommon operation. If the indexing is just part of an iteration that looks once at each char and handles some ASCII characters in certain ways, there is no potential correctness problem. As soon as code talks about non-ascii characters, it has to be UTF aware anyway.
 Maybe unicode is easy, but we sure as hell aren't born with it, and the
 language doesn't give beginners ANY red flags about this.

 I find myself pretty fortified against this issue due to having known
 about it before anything unpleasant happened, but I don't like the idea
 of others having to learn the hard way.

Hm, well. The first thing I looked up when I learned D supports Unicode is how Unicode/UTF work in detail. After that, the semantics of char[] were very clear to me.
 ...

 There's another issue at play here too: efficiency vs correctness as a
 default.

 Here's the tradeoff --

 Option A:
 char[i] returns the i'th byte of the string as a (char) type.
 Consequences:
 (1) Code is efficient and INcorrect.

Do you have an example of impactful incorrect code resulting from those semantics?

Nope. Sorry. I learned about it before it had a chance to bite me. But this is only because I frequent(ed) the newsgroup and had a good throw on my dice roll.

I might be wrong, but I somewhat have the impression we might be chasing phantoms here. I have so far never seen a bug in real world code caused by inadvertent misuse of D string indexing or slicing.
 (2) It requires extra effort to write correct code.
 (3) Detecting the incorrect code may take years, as these errors can
 hide easily.

None of those is a direct consequence of char[i] returning char. They are the consequence of at least 3 things:

1. char[] is an array of char
2. immutable(char)[] is the default string type
3. the programmer does not know about 1. and/or 2.

I say, 1. is inevitable. You say 3. is inevitable. If we are both right, then 2. is the culprit.

I can get behind this. Honestly I'd like the default string type to be intelligent and optimize itself into whichever UTF-N encoding is optimal for content I throw into it. Maybe this means it should lazily expand itself to the narrowest character type that maintains a 1-to-1 ratio between code units and code points so that indexing/slicing remain O(1), or maybe it's a bag of disparate encodings, or maybe someone can think of a better strategy. Just make it /reasonably/ fast and help me with correctness as much as possible. If I need more performance or more unicode pedantics, I'll do my homework then and only then. Of course this is probably never going to happen I'm afraid. Even the problem of making such a (probably) struct work at compile time in templates as if it were a native type... agh, headaches.
 Option B:
 char[i] returns the i'th codepoint of the string as a (dchar) type.
 Consequences:
 (1) Code is INefficient and correct.

It is awfully optimistic to assume the code will be correct.
 (2) It requires extra effort to write efficient code.
 (3) Detecting the inefficient code happens in minutes.  It is VERY
 noticeable when your program runs too slowly.

Except when in testing only small inputs are used, and only two years later maintainers throw your program at a larger problem instance and wonder why it does not terminate. Or your program is DoS'd. Polynomial blowup in runtime can be just as large a problem in practice as a correctness bug.

I see what you mean there. I'm still not entirely happy with it though. I don't think these are reasonable requirements. It sounds like forced premature optimization to me.

It is using a better algorithm that performs faster by a linear factor. I would be very leery of something that looks like a constant-time array indexing operation taking linear time. I think premature optimization is about writing near-optimal, hard-to-debug-and-maintain code that only gains some constant factors in parts of the code that are not performance critical.
 I have found myself in a number of places in different problem domains
 where optimality-is-correctness.  Make it too slow and the program isn't
 worth writing.  I can't imagine doing this for workloads I can't test on
 or anticipate though: I'd have to operate like NASA and make things 10x
 more expensive than they need to be.

 Correctness, on the other hand, can be easily (relatively speaking)
 obtained by only allowing the user to input data you can handle and then
 making sure the program can handle it as promised.  Test, test, test, etc.

 This is how I see it.

 And I really like my correct code.  If it's too slow, and I'll /know/
 when it's too slow, then I'll profile->tweak->profile->etc until the
 slowness goes away.  I'm totally digging option B.

Those kinds of inefficiencies build up and make the whole program run sluggishly, and it will possibly be too late when you notice.

I get the feeling that the typical divide-and-conquer profiling strategy will find the more expensive operations /at least/ most of the time. Unfortunately, I have only experience to speak from on this matter.

Yes, what I meant is that if the inefficiencies are spread out more or less uniformly, then fixing it all up might seem to be too much work and too much risk.
 Option B is not even on the table. This thread is about a breaking
 interface change and special casing T[] for T in {char, wchar}.

Yeah, I know. I'm referring to what Joshua wrote, because I like option B. Even if it's academic, I'll say I like it anyway, if only for the sake of argument.

OK.
Dec 31 2011
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 12/31/2011 09:17 PM, Timon Gehr wrote:
 On 01/01/2012 02:34 AM, Chad J wrote:
 On 12/31/2011 02:02 PM, Timon Gehr wrote:
 On 12/31/2011 07:22 PM, Chad J wrote:
 On 12/30/2011 02:55 PM, Timon Gehr wrote:
 On 12/30/2011 08:33 PM, Joshua Reusch wrote:
 Maybe it could happen if we
 1. make dstring the default strings type --

Inefficient.

But correct (enough).
 code units and characters would be the same

Wrong.

*sigh*, FINE. Code units and /code points/ would be the same.

Relax.

I'll do one better and ultra relax: http://www.youtube.com/watch?v=jimQoWXzc0Q ;)
 or 2. forward string.length to std.utf.count and opIndex to
 std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).

Inconsistent? How?

int[]
bool[]
float[]
char[]

I'll refer to another limb of this thread when foobar mentioned a mental model of strings as strings of letters. Now, given annoying corner cases, we probably can't get strings of /letters/, but I'd at least like to make it as far as code points. That seems very doable. I mention this because I find that forwarding string.length and opIndex would be much more consistent with this mental model of strings as strings of unicode code points, which, IMO, is more important than it being binary consistent with the other things. I'd much rather have char[] behave more like an array of code points than an array of bytes. I don't need an array of bytes. That's ubyte[]; I have that already.

char[] is not an array of bytes: it is an array of UTF-8 code units.

Meh, I'd still prefer it be an array of UTF-8 code /points/ represented by an array of bytes (which are the UTF-8 code units).
 Inefficiency is a lot easier to deal with than incorrect.  If something
 is inefficient, then in the right places I will NOTICE.  If
 something is
 incorrect, it can hide for years until that one person (or country, in
 this case) with a different usage pattern than the others uncovers it.

 so programmers could use the slices/indexing/length (no lazyness
 problems), and if they really want codeunits use .raw/.rep (or better
 .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.

What about people who want to write correct string processing code AND want to use this handy slicing feature? Because I totally want both of these. Slicing is super useful for script-like coding.

Except that the proposal would make slicing strings go away.

Yeah, Andrei's proposal says that. But I'm speaking of Joshua's:
 so programmers could use the slices/indexing/length ...




I kind-of like either, but I'd prefer Joshua's suggestion.
 But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?

1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.

How do you know they are only working with ASCII? They might be /now/. But what if someone else uses the program a couple years later when the original author is no longer maintaining that chunk of code?

Then they obviously need to fix the code, because the requirements have changed. Most of it will already work correctly though, because UTF-8 extends ASCII in a natural way.

Or, you know, we could design the language a little differently and make this become mostly a non-problem. That would be cool.

It is imo already mostly a non-problem, but YMMV:

import std.stdio;

void main(){
    string s = readln();
    int nest = 0;
    foreach(x; s){ // iterates by code unit
        if(x=='(') nest++;
        else if(x==')' && --nest<0) goto unbalanced;
    }
    if(!nest){
        writeln("balanced parentheses");
        return;
    }
unbalanced:
    writeln("unbalanced parentheses");
}

That code is UTF aware, even though it does not explicitly deal with UTF. I'd claim it is like this most of the time.

I'm willing to agree with this. I still don't like the possibility that folks encounter corner-cases in that not-most-of-the-time. I'm not going to rage-face too hard if this never changes though. There would be a number of other things more important to fix before this, IMO.
 2. They get screwed up string output, look for the reason, patch up
 their code with some functions from std.utf and will never make the
 same
 mistakes again.

Except they don't. Because there are a lot of programmers that will never put in non-ascii strings to begin with. But that has nothing to do with whether or not the /users/ or /maintainers/ of that code will put non-ascii strings in. This could make some messes.
 I have *never* seen a user in D.learn complain about it. There might
 have been some I missed, but it is certainly not a prevalent problem.
 Also, just because a user can type .rep does not mean he understands
 Unicode: He is able to make just the same mistakes as before, even more
 so, as the array he is getting back has the _wrong element type_.

You know, here in America (Amurica?) we don't know that other countries exist. I think there is a large population of programmers here that don't even know how to enter non-latin characters, much less would think to include such characters in their test cases. These programmers won't necessarily be found on the internet much, but they will be found in cubicles all around, doing their 9-to-5 and writing mediocre code that the rest of us have to put up with. Their code will pass peer review (their peers are also from America) and continue working just fine until someone from one of those confusing other places decides to type in the characters they feel comfortable typing in. No, there will not be /tests/ for code points greater than 0x80, because there is no one around to write those. I'd feel a little better if D herds people into writing correct code to begin with, because they won't otherwise.

There is no way to 'herd people into writing correct code' and UTF-8 is quite easy to deal with.

Probably not. I played fast and loose with this a lot in my early D code. Then this same conversation happened like ~3 years ago on this newsgroup. Then I learned more about unicode and had a bit of a bitter taste regarding char[] and how it handled indexing. I thought I could just index char[]s willy nilly. But no, I can't. And the compiler won't tell me. It just silently does what I don't want.

How often do you actually need to get, for example, the 10th character of a string? I think it is a very uncommon operation. If the indexing is just part of an iteration that looks once at each char and handles some ASCII characters in certain ways, there is no potential correctness problem. As soon as code talks about non-ascii characters, it has to be UTF aware anyway.

If you haven't been educated about unicode or how D handles it, you might write this:

char[] str;
... load str ...
for ( int i = 0; i < str.length; i++ )
{
    font.render(str[i]); // Ewww.
    ...
}

It'd be neat if that gave a compiler error, or just passed code points as dchar's. Maybe a compiler error is best in this light.
 Maybe unicode is easy, but we sure as hell aren't born with it, and the
 language doesn't give beginners ANY red flags about this.

 I find myself pretty fortified against this issue due to having known
 about it before anything unpleasant happened, but I don't like the idea
 of others having to learn the hard way.

Hm, well. The first thing I looked up when I learned D supports Unicode is how Unicode/UTF work in detail. After that, the semantics of char[] were very clear to me.
 ...

 There's another issue at play here too: efficiency vs correctness as a
 default.

 Here's the tradeoff --

 Option A:
 char[i] returns the i'th byte of the string as a (char) type.
 Consequences:
 (1) Code is efficient and INcorrect.

Do you have an example of impactful incorrect code resulting from those semantics?

Nope. Sorry. I learned about it before it had a chance to bite me. But this is only because I frequent(ed) the newsgroup and had a good throw on my dice roll.

I might be wrong, but I somewhat have the impression we might be chasing phantoms here. I have so far never seen a bug in real world code caused by inadvertent misuse of D string indexing or slicing.

Possibly.
 (2) It requires extra effort to write correct code.
 (3) Detecting the incorrect code may take years, as these errors can
 hide easily.

None of those is a direct consequence of char[i] returning char. They are the consequence of at least 3 things:

1. char[] is an array of char
2. immutable(char)[] is the default string type
3. the programmer does not know about 1. and/or 2.

I say, 1. is inevitable. You say 3. is inevitable. If we are both right, then 2. is the culprit.

I can get behind this. Honestly I'd like the default string type to be intelligent and optimize itself into whichever UTF-N encoding is optimal for content I throw into it. Maybe this means it should lazily expand itself to the narrowest character type that maintains a 1-to-1 ratio between code units and code points so that indexing/slicing remain O(1), or maybe it's a bag of disparate encodings, or maybe someone can think of a better strategy. Just make it /reasonably/ fast and help me with correctness as much as possible. If I need more performance or more unicode pedantics, I'll do my homework then and only then. Of course this is probably never going to happen I'm afraid. Even the problem of making such a (probably) struct work at compile time in templates as if it were a native type... agh, headaches.
 Option B:
 char[i] returns the i'th codepoint of the string as a (dchar) type.
 Consequences:
 (1) Code is INefficient and correct.

It is awfully optimistic to assume the code will be correct.
 (2) It requires extra effort to write efficient code.
 (3) Detecting the inefficient code happens in minutes.  It is VERY
 noticeable when your program runs too slowly.

Except when in testing only small inputs are used, and only two years later maintainers throw your program at a larger problem instance and wonder why it does not terminate. Or your program is DoS'd. Polynomial blowup in runtime can be just as large a problem in practice as a correctness bug.

I see what you mean there. I'm still not entirely happy with it though. I don't think these are reasonable requirements. It sounds like forced premature optimization to me.

It is using a better algorithm that performs faster by a linear factor. I would be very leery of something that looks like a constant-time array indexing operation taking linear time. I think premature optimization is about writing near-optimal, hard-to-debug-and-maintain code that only gains some constant factors in parts of the code that are not performance critical.

This wouldn't be the first data structure to require linear time indexing. I mean, linked lists exist. I do feel that heavy-duty optimization puts the onus on the programmer to know what to do. The programming language is responsible for merely making it possible, not for making it the default path. The latter is fairly impossible. Correctness, on the other hand, should involve some hand-holding. It's that notion of the language catching me when I fall. I think the language should (and can) help a lot with program correctness if designed right. D is already really good on these counts, and even helps quite a bit when optimization gets down-and-dirty.
 I have found myself in a number of places in different problem domains
 where optimality-is-correctness.  Make it too slow and the program isn't
 worth writing.  I can't imagine doing this for workloads I can't test on
 or anticipate though: I'd have to operate like NASA and make things 10x
 more expensive than they need to be.

 Correctness, on the other hand, can be easily (relatively speaking)
 obtained by only allowing the user to input data you can handle and then
 making sure the program can handle it as promised.  Test, test, test,
 etc.

 This is how I see it.

 And I really like my correct code.  If it's too slow, and I'll /know/
 when it's too slow, then I'll profile->tweak->profile->etc until the
 slowness goes away.  I'm totally digging option B.

Those kinds of inefficiencies build up and make the whole program run sluggishly, and it will possibly be too late when you notice.

I get the feeling that the typical divide-and-conquer profiling strategy will find the more expensive operations /at least/ most of the time. Unfortunately, I have only experience to speak from on this matter.

Yes, what I meant is that if the inefficiencies are spread out more or less uniformly, then fixing it all up might seem to be too much work and too much risk.

Ah, right. Because code refactoring tends to suck. I get you. This is, of course, still the same reason why I'd never want to have to go through my code and replace all of the "font.render(str[i]);" calls. Yeah, as of a number of years ago it won't happen to me, but it might get someone else.
 Option B is not even on the table. This thread is about a breaking
 interface change and special casing T[] for T in {char, wchar}.

Yeah, I know. I'm referring to what Joshua wrote, because I like option B. Even if it's academic, I'll say I like it anyway, if only for the sake of argument.

OK.

Dec 31 2011
next sibling parent a <a a.com> writes:
 Meh, I'd still prefer it be an array of UTF-8 code /points/ represented
 by an array of bytes (which are the UTF-8 code units).

By saying you want an array of code points you already define the representation. And if you want that, there already is dchar[]. You probably meant a range of code points represented by an array of code units. But such a range can't have opIndex, since opIndex implies a constant-time operation. If you want the nth element of the range, you can use std.range.drop or write your own nth() function.
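A sketch of that nth() on top of std.range.drop; drop steps through a narrow string one code point at a time, so the O(n) cost is explicit rather than hidden behind an index syntax:

```d
import std.range : drop, front;

/// n'th (0-based) code point of s; deliberately O(n).
dchar nth(string s, size_t n)
{
    return s.drop(n).front; // drop n code points, decode the next one
}

void main()
{
    assert("héllo".nth(0) == 'h');
    assert("héllo".nth(1) == 'é');
}
```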
Jan 01 2012
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<  str.length; i++ )
 {
      font.render(str[i]); // Ewww.
      ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?
Jan 01 2012
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 01/01/2012 07:59 AM, Timon Gehr wrote:
 On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<  str.length; i++ )
 {
      font.render(str[i]); // Ewww.
      ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this:

class Font
{
    ...

    /** Render the given code point at the current (x,y) cursor position. */
    void render( dchar c )
    {
        ...
    }
}

(Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.)

I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.
Jan 01 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/01/2012 04:13 PM, Chad J wrote:
 On 01/01/2012 07:59 AM, Timon Gehr wrote:
 On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<   str.length; i++ )
 {
       font.render(str[i]); // Ewww.
       ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.

I think the main issue here is that char implicitly converts to dchar: this is an implicit reinterpret-cast that is nonsensical if the character is outside the ASCII range.
Jan 01 2012
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 01/01/2012 10:39 AM, Timon Gehr wrote:
 On 01/01/2012 04:13 PM, Chad J wrote:
 On 01/01/2012 07:59 AM, Timon Gehr wrote:
 On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<   str.length; i++ )
 {
       font.render(str[i]); // Ewww.
       ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.

I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.

I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual chars and then start manipulating code units that way, and such an 8th-bit check could thwart those manipulations, but I would also counter that such low-level manipulations should be done on ubytes instead. I don't know how much this would help though. Seems like too little, too late. The bigger problem is that a char is being taken from a char[] and thereby loses its context as (potentially) being part of a larger code point.
Jan 01 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/01/2012 08:01 PM, Chad J wrote:
 On 01/01/2012 10:39 AM, Timon Gehr wrote:
 On 01/01/2012 04:13 PM, Chad J wrote:
 On 01/01/2012 07:59 AM, Timon Gehr wrote:
 On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<    str.length; i++ )
 {
        font.render(str[i]); // Ewww.
        ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.

I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.

I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual char's and then start manipulating code units that way, and such an 8th bit check could thwart those manipulations, but I would also counter that such low manipulations should be done on ubyte's instead. I don't know how much this would help though. Seems like too little, too late.

I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to.
 The bigger problem is that a char is being taken from a char[] and
 thereby loses its context as (potentially) being part of a larger
 codepoint.

If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. If it is not set, then it is a single ASCII character.
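The high-bit rule stated here can be checked mechanically; a quick Python sketch (the is_fragment name is mine):

```python
def is_fragment(unit: int) -> bool:
    """True if this UTF-8 code unit is part of a multi-unit code point."""
    return (unit & 0x80) != 0

pi = "π".encode("utf-8")                # 0xCF 0x80: one code point, two units
assert all(is_fragment(u) for u in pi)  # every unit has the high bit set
assert not is_fragment(ord("("))        # an ASCII unit stands alone
```

So a unit with a clear high bit is always a complete ASCII character, and one with the bit set never carries a character on its own.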
Jan 01 2012
parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 01/01/2012 02:25 PM, Timon Gehr wrote:
 On 01/01/2012 08:01 PM, Chad J wrote:
 On 01/01/2012 10:39 AM, Timon Gehr wrote:
 On 01/01/2012 04:13 PM, Chad J wrote:
 On 01/01/2012 07:59 AM, Timon Gehr wrote:
 On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<    str.length; i++ )
 {
        font.render(str[i]); // Ewww.
        ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.

I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.

I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual char's and then start manipulating code units that way, and such an 8th bit check could thwart those manipulations, but I would also counter that such low manipulations should be done on ubyte's instead. I don't know how much this would help though. Seems like too little, too late.

I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to;

What of valid transfers of ASCII characters into dchar? Normally this is a widening operation, so I can see how it is permissible.
 The bigger problem is that a char is being taken from a char[] and
 thereby loses its context as (potentially) being part of a larger
 codepoint.

If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. If it is not set, then it is a single ASCII character.

See above. I think that assigning from a char[i] to another char[j] is probably safe. Similarly for slicing. These calculations tend to occur, I suspect, when the text is well-anchored. I believe your balanced parentheses example falls into this category: (repasted for reader convenience)

void main(){
    string s = readln();
    int nest = 0;
    foreach(x;s){ // iterates by code unit
        if(x=='(') nest++;
        else if(x==')' && --nest<0) goto unbalanced;
    }
    if(!nest){ writeln("balanced parentheses"); return; }
unbalanced:
    writeln("unbalanced parentheses");
}

With these observations in hand, I would consider the safety of operations to go like this:

char[i] = char[j];            // (Reasonably) Safe
char[i1..i2] = char[j1..j2];  // (Reasonably) Safe
char = char;                  // Safe
dchar = char;                 // Safe. Widening.
char = char[i];               // Not safe. Should error.
dchar = char[i];              // Not safe. Should error. (Corollary)
dchar = dchar[i];             // Safe.
char = char[i1..i2];          // Nonsensical; already an error.
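The reason the parenthesis scan stays correct under per-unit iteration is that no code unit of a multi-byte sequence can ever equal an ASCII byte. A Python port of the D example (my translation, not from the thread) makes that testable:

```python
def balanced(s: str) -> bool:
    nest = 0
    for u in s.encode("utf-8"):  # iterate by code unit, like foreach(x; s)
        if u == ord("("):
            nest += 1
        elif u == ord(")"):
            nest -= 1
            if nest < 0:
                return False     # the "unbalanced" early exit
    return nest == 0

# Multi-byte characters never produce a unit equal to '(' or ')',
# so scanning by unit is still correct on non-ASCII text:
assert balanced("(héllo (wörld))")
assert not balanced(")(")
```

Any per-unit scan that only compares against ASCII sentinels is "well-anchored" in this sense.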
Jan 01 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/02/2012 12:16 AM, Chad J wrote:
 On 01/01/2012 02:25 PM, Timon Gehr wrote:
 On 01/01/2012 08:01 PM, Chad J wrote:
 On 01/01/2012 10:39 AM, Timon Gehr wrote:
 On 01/01/2012 04:13 PM, Chad J wrote:
 On 01/01/2012 07:59 AM, Timon Gehr wrote:
 On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<     str.length; i++ )
 {
         font.render(str[i]); // Ewww.
         ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.

I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.

I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual char's and then start manipulating code units that way, and such an 8th bit check could thwart those manipulations, but I would also counter that such low manipulations should be done on ubyte's instead. I don't know how much this would help though. Seems like too little, too late.

I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to;

What of valid transfers of ASCII characters into dchar? Normally this is a widening operation, so I can see how it is permissible.
 The bigger problem is that a char is being taken from a char[] and
 thereby loses its context as (potentially) being part of a larger
 codepoint.

If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. If it is not set, then it is a single ASCII character.

See above. I think that assigning from a char[i] to another char[j] is probably safe. Similarly for slicing. These calculations tend to occur, I suspect, when the text is well-anchored. I believe your balanced parentheses example falls into this category: (repasted for reader convenience) void main(){ string s = readln(); int nest = 0; foreach(x;s){ // iterates by code unit if(x=='(') nest++; else if(x==')'&& --nest<0) goto unbalanced; } if(!nest){ writeln("balanced parentheses"); return; } unbalanced: writeln("unbalanced parentheses"); } With these observations in hand, I would consider the safety of operations to go like this: char[i] = char[j]; // (Reasonably) Safe char[i1..i2] = char[j1..j2]; // (Reasonably) Safe char = char; // Safe dchar = char // Safe. Widening. char = char[i]; // Not safe. Should error. dchar = char[i]; // Not safe. Should error. (Corollary) dchar = dchar[i]; // Safe. char = char[i1..i2]; // Nonsensical; already an error.

That is an interesting point of view. Your proposal would therefore be to constrain char to the ASCII range except if it is embedded in an array? It would break the balanced parentheses example.
Jan 01 2012
parent Chad J <chadjoan __spam.is.bad__gmail.com> writes:
On 01/01/2012 06:36 PM, Timon Gehr wrote:
 On 01/02/2012 12:16 AM, Chad J wrote:
 On 01/01/2012 02:25 PM, Timon Gehr wrote:
 On 01/01/2012 08:01 PM, Chad J wrote:
 On 01/01/2012 10:39 AM, Timon Gehr wrote:
 On 01/01/2012 04:13 PM, Chad J wrote:
 On 01/01/2012 07:59 AM, Timon Gehr wrote:
 On 01/01/2012 05:53 AM, Chad J wrote:
 If you haven't been educated about unicode or how D handles it, you
 might write this:

 char[] str;
 ... load str ...
 for ( int i = 0; i<     str.length; i++ )
 {
         font.render(str[i]); // Ewww.
         ...
 }

That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this: class Font { ... /** Render the given code point at the current (x,y) cursor position. */ void render( dchar c ) { ... } } (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.) I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.

I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ascii-range.

I agree. Perhaps the compiler should insert a check on the 8th bit in cases like these? I suppose it's possible someone could declare a bunch of individual char's and then start manipulating code units that way, and such an 8th bit check could thwart those manipulations, but I would also counter that such low manipulations should be done on ubyte's instead. I don't know how much this would help though. Seems like too little, too late.

I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to;

What of valid transfers of ASCII characters into dchar? Normally this is a widening operation, so I can see how it is permissible.
 The bigger problem is that a char is being taken from a char[] and
 thereby loses its context as (potentially) being part of a larger
 codepoint.

If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. If it is not set, then it is a single ASCII character.

See above. I think that assigning from a char[i] to another char[j] is probably safe. Similarly for slicing. These calculations tend to occur, I suspect, when the text is well-anchored. I believe your balanced parentheses example falls into this category: (repasted for reader convenience) void main(){ string s = readln(); int nest = 0; foreach(x;s){ // iterates by code unit if(x=='(') nest++; else if(x==')'&& --nest<0) goto unbalanced; } if(!nest){ writeln("balanced parentheses"); return; } unbalanced: writeln("unbalanced parentheses"); } With these observations in hand, I would consider the safety of operations to go like this: char[i] = char[j]; // (Reasonably) Safe char[i1..i2] = char[j1..j2]; // (Reasonably) Safe char = char; // Safe dchar = char // Safe. Widening. char = char[i]; // Not safe. Should error. dchar = char[i]; // Not safe. Should error. (Corollary) dchar = dchar[i]; // Safe. char = char[i1..i2]; // Nonsensical; already an error.

That is an interesting point of view. Your proposal would therefore be to constrain char to the ASCII range except if it is embedded in an array? It would break the balanced parentheses example.

I just ran the example and wow, x didn't type-infer to dchar like I expected it to. I thought the comment might be wrong, but no, it is correct: x type-infers to char. I expected it to behave more like the old days before type inference showed up everywhere:

void main(){
    string s = readln();
    int nest = 0;
    foreach(dchar x;s){ // iterates by code POINT; notice the dchar.
        if(x=='(') nest++;
        else if(x==')' && --nest<0) goto unbalanced;
    }
    if(!nest){ writeln("balanced parentheses"); return; }
unbalanced:
    writeln("unbalanced parentheses");
}

This version wouldn't be broken. If the type inference changed, the other version wouldn't be broken either. This could break other things though. Bummer.
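The distinction Chad ran into, foreach(x; s) yielding code units while foreach(dchar x; s) yields code points, looks like this in Python terms (a conceptual sketch of my own):

```python
s = "né"
by_unit = list(s.encode("utf-8"))  # like foreach(char x; s): 3 elements
by_point = list(s)                 # like foreach(dchar x; s): 2 elements

assert len(by_unit) == 3           # 'é' contributes two code units
assert len(by_point) == 2
assert by_point[1] == "é"          # the code point survives intact
```

A scan that compares against non-ASCII characters needs the by-point form; the by-unit form only ever sees fragments of 'é'.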
Jan 01 2012
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, December 30, 2011 20:55:42 Timon Gehr wrote:
 1. They don't notice. Then it is not a problem, because they are
 obviously only using ASCII characters and it is perfectly reasonable to
 assume that code units and characters are the same thing.

The problem is that what's more likely to happen in a lot of cases is that they use it wrong and don't notice, because they're only using ASCII in testing, _but_ they have bugs all over the place, because their code is actually used with unicode in the field. Yes, diligent programmers will generally find such problems, but with the current scheme, it's _so_ easy to use length when you shouldn't, that it's pretty much a guarantee that it's going to happen.

I'm not sure that Andrei's suggestion is the best one at this point, but I sure wouldn't be against it being introduced. It wouldn't entirely fix the problem by any means, but programmers would then have to work harder at screwing it up and so there would be fewer mistakes.

Arguably, the first issue with D strings is that we have char. In most languages, char is supposed to be a character, so many programmers will code with that expectation. If we had something like utf8unit, utf16unit, and utf32unit (arguably very bad, albeit descriptive, names) and no char, then it would force programmers to become semi-educated about the issues. There's no way that that's changing at this point though.

- Jonathan M Davis
Dec 30 2011
prev sibling next sibling parent Brad Anderson <eco gnuk.net> writes:

On Sat, Dec 31, 2011 at 12:09 AM, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 On 12/30/11 10:09 PM, Walter Bright wrote:

 On 12/30/2011 7:30 PM, Jonathan M Davis wrote:

 Yes, diligent programmers will generally find such problems, but with the
 current scheme, it's _so_ easy to use length when you shouldn't, that
 it's
 pretty much a guarantee that it's going to happen.

I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional.

The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF-16/UCS-2 morass: a vast majority of the time, the programmer may consider UTF-16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well, that all didn't go over all that well.

We need .raw and we must abolish .length and [] for narrow strings.

Andrei

I don't know that Phobos would be an appropriate place for it but offering some easy to access string data containing extensive and advanced unicode which users could easily add to their programs' unit tests may help people ensure proper unicode usage. Unicode seems to be one of those things where you either know it really well or you know just enough to get yourself in trouble, so having test data written by unicode experts could be very useful for the rest of us mortals.

I googled around a bit. This Stack Overflow question came up <http://stackoverflow.com/questions/6136800/unicode-test-strings-for-unit-tests> that recommends these:
 - UTF-8 stress test: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
 - Quick Brown Fox in a variety of languages: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt

I didn't see too much beyond those two.

Regards,
Brad A.
Dec 30 2011
prev sibling parent kenji hara <k.hara.pg gmail.com> writes:
2011/12/31 Walter Bright <newshound2 digitalmars.com>:
 On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:
 On 12/30/11 10:09 PM, Walter Bright wrote:
 I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8
 correctly, but it turned out that the naive version that used [i] and
 .length worked correctly. This is typical, not exceptional.

The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF16/UCS-2 morass: in a vast majority of the time the programmer may consider UTF16 a coding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well that all didn't go over all that well.

I'm not so sure it's quite the same. Java was designed before there were surrogate pairs; they kinda got the rug pulled out from under them. So, they simply have no decent way to deal with it. There isn't even a notion of a dchar character type. Java was designed with codeunit==codepoint; it is embedded in the design of the language, library, and culture.

This is not true of D. It's designed from the ground up to deal properly with UTF. D has very simple language features to deal with it.
 We need .raw and we must abolish .length and [] for narrow strings.

I don't believe that fixes anything and breaks every D project out there. We're chasing phantoms here, and I worry a lot about over-engineering trivia. And, we already have a type to deal with it: dstring

I fully agree with Walter. No need more wrapper for string. Kenji Hara
Dec 31 2011
prev sibling next sibling parent Artur Skawina <art.08.09 gmail.com> writes:
On 12/28/11 13:42, bearophile wrote:
 Peter Alexander:
 
 I often get into situations where I've written 
 a function that takes a string, and then I can't call it because all I 
 have is a char[].

I suggest you to show some of such situations.
 I think it's telling that most Phobos functions use 'const(char)[]' or 
 'in char[]' instead of 'string' for their arguments. The ones that use 
 'string' are usually using it unnecessarily and should be fixed to use 
 const(char)[].

What are the Phobos functions that unnecessarily accept a string?

e.g. things like std.demangle? (which wraps core.demangle, and that one accepts const(char)[]). IIRC e.g. the stdio functions taking file names want strings too; I never investigated whether they really need this, just .iduped the args... In general, a lot of things break when trying to switch to "proper" const(char)[] in apps, usually because the app itself used "string" instead of the const version, but fixing it up often also uncovers lib API issues.

artur
Dec 28 2011
prev sibling next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Wed, 28 Dec 2011 11:00:52 -0800, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
 On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string when they
 mean immutable, and the values underneath are not guaranteed to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice:

#1: string s;
#2: immutable(char)[] s;

sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum. Andrei

Would slicing, i.e. s[i..j] still be valid? If so, what would be the recommended way of finding i and j?
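On the question of finding valid i and j: a slice is only meaningful if both indices land on code point boundaries (in D, std.utf.stride reports how many code units the code point at a given index occupies). The boundary rule itself is simple enough to sketch in Python (the function name is mine):

```python
def stride(units: bytes, i: int) -> int:
    """Number of UTF-8 code units in the code point starting at index i."""
    u = units[i]
    if u < 0x80:
        return 1                       # ASCII
    if u >> 5 == 0b110:
        return 2
    if u >> 4 == 0b1110:
        return 3
    if u >> 3 == 0b11110:
        return 4
    raise ValueError("i does not point at a code point boundary")

b = "aπb".encode("utf-8")              # 0x61 0xCF 0x80 0x62
assert stride(b, 0) == 1
assert stride(b, 1) == 2               # a valid slice is b[1 : 1 + stride(b, 1)]
assert b[1:3].decode("utf-8") == "π"
```

Advancing i by stride(b, i) repeatedly visits exactly the code point boundaries, which is what any valid i..j pair must be built from.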
Dec 28 2011
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Wednesday, 28 December 2011 at 19:00:53 UTC, Andrei 
Alexandrescu wrote:
 On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string 
 when they
 mean immutable, and the values underneath are not guaranteed 
 to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice: #1: string s; #2: immutable(char)[] s; sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation. Yum. Andrei

That's a good idea which I wonder about its implementation strategy. ATM string is simply an alias of a char array, are you suggesting string should be a wrapper struct instead (like the one previously suggested by Steven)? I'm all for making string a properly encapsulated type.
Dec 28 2011
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei 
Alexandrescu wrote:
 On 12/28/11 1:18 PM, foobar wrote:
 That's a good idea which I wonder about its implementation 
 strategy.

Implementation would entail a change in the compiler. Andrei

Why? D should be plenty powerful to implement this without modifying the compiler. It sounds like you're suggesting that char[] will behave differently than other T[], which is a very poor idea IMO.
Dec 28 2011
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei 
Alexandrescu wrote:
 Implementation would entail a change in the compiler.

I don't think I agree. Wouldn't something like this work?

===
struct string
{
    immutable(char)[] rep;
    alias rep this;

    auto opAssign(immutable(char)[] rhs)
    {
        rep = rhs;
        return this;
    }

    this(immutable(char)[] rhs)
    {
        rep = rhs;
    }

    // disable these here so they aren't passed on to .rep
    @disable void opSlice() { assert(0); }
    @disable size_t length() { assert(0); }
}
===

I did some quick tests and the basics seemed ok:

/* paste impl from above */
import std.string : replace;

void main()
{
    string a = "test"; // works
    a = a.replace("test", "mang"); // works
    // a = a[0..1]; // correctly fails to compile
    assert(0, a); // works
}
Dec 28 2011
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Wednesday, 28 December 2011 at 19:38:53 UTC, Timon Gehr wrote:
[snip]
 I'm all for making string a properly encapsulated type.

In what way would the proposed change improve encapsulation, and why would it even be desirable for such a basic data structure?

I'm not sure what you're asking here. Are you asking what the benefits of encapsulation are? This topic was discussed to death more than once, and I'd suggest searching the NG archives for the details. Also, if you hadn't already, I'd suggest reading about Unicode and its levels of abstraction: code points, code units, graphemes, etc...
Dec 28 2011
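The code unit / code point distinction raised above can be seen directly in D; a minimal sketch using only Phobos (the grapheme level sits above both counts shown here):

```d
import std.range : walkLength;

void main()
{
    string s = "noël";         // "ë" occupies two UTF-8 code units
    assert(s.length == 5);     // code units (bytes)
    assert(s.walkLength == 4); // code points (auto-decoded dchars)
    // Graphemes are a third level: "e" + combining diaeresis would
    // still be one user-perceived character but two code points.
}
```

The gap between the two counts is exactly what the thread is arguing about: `.length` talks about the representation, while range traversal talks about the abstraction.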
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Wednesday, 28 December 2011 at 19:48:28 UTC, Adam D. Ruppe 
wrote:
 On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei 
 Alexandrescu wrote:
 Implementation would entail a change in the compiler.

 I don't think I agree. Wouldn't something like this work?

 ===
 struct string
 {
     immutable(char)[] rep;
     alias rep this;

     auto opAssign(immutable(char)[] rhs)
     {
         rep = rhs;
         return this;
     }

     this(immutable(char)[] rhs)
     {
         rep = rhs;
     }

     // disable these here so it isn't passed on to .rep
     @disable void opSlice() { assert(0); }
     @disable size_t length() { assert(0); }
 }
 ===

 I did some quick tests and the basics seemed ok:

 /* paste impl from above */
 import std.string : replace;

 void main()
 {
     string a = "test"; // works
     a = a.replace("test", "mang"); // works
     // a = a[0..1]; // correctly fails to compile
     assert(0, a); // works
 }

My thinking exactly. Of course we can't put "@disable" right away and should start with "deprecated" to allow for a proper migration period. I'd also like a transition of the string related functions to this type. The previous ones can remain as simple wrappers/aliases/whatever for backwards compatibility.
Dec 28 2011
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Wednesday, 28 December 2011 at 20:01:15 UTC, foobar wrote:
 I'd also like a transition of the string related functions to 
 this type. the previous ones can remain as simple 
 wrappers/aliases/whatever for backwards compatibility.

I actually like strings just the way they are... but if we had to change, I'm sure we can do a good job in the library relatively easily.
Dec 28 2011
prev sibling next sibling parent "foobar" <foo bar.com> writes:
On Wednesday, 28 December 2011 at 21:57:00 UTC, Andrei 
Alexandrescu wrote:
 On 12/28/11 1:48 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei 
 Alexandrescu wrote:
 On 12/28/11 1:18 PM, foobar wrote:
 That's a good idea which I wonder about its implementation 
 strategy.

Implementation would entail a change in the compiler. Andrei

Why? D should be plenty powerful to implement this without modifying the compiler. Sounds like you suggest that char[] will behave differently than other T[] which is a very poor idea IMO.

It's an awesome idea, but for an academic debate at best. Andrei

I don't follow you. You've suggested a change that I agree with. Adam provided a prototype string library type that accomplishes your specified goals without any changes to the compiler. What are we missing here? IF it boils down to changing the compiler or leaving the status-quo, I'm voting against the compiler change.
Dec 28 2011
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 29 December 2011 at 04:17:37 UTC, Andrei 
Alexandrescu wrote:
 If we have two facilities (string and e.g. String) we've lost. 
 We'd need to slowly change the built-in string type.

Have you actually tried to do it? Thanks to alias this, the custom string can be used with existing std.string functions and assignments from literals.

I suppose that technically there's two facilities: immutable(char)[] and string, but I don't see what difference that makes at all. string is just an alias. It could be changed to a struct with ease; you can do it in your own private module.

I really think you (you!) are underestimating D's current capabilities. (Again, I do not think this is a good move - I'm with Walter on it - but let's not sell the language short.)
Dec 28 2011
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 29 December 2011 at 05:37:00 UTC, Walter Bright 
wrote:
 I've seen the damage done in C++ with multiple string types. 
 Being able to convert from one to the other doesn't help much.

Note that I'm on your side here re strings, but you're underselling the D language too! These conversions are implicit both ways, and completely free. D structs can wrap other D types perfectly well. Check this out:

string a = "hello";
a = a.replace("h", "j");
assert(a == "jello");

this actually works, today, with a custom string type in the D language. Just define a struct string in your module. alias this does most the magic.

In C++, std::string and char* are very different.

===
#include<string>

void a(const char* str) {}

int main()
{
    std::string me = "lol"; // works
    a(me); // ...but this doesn't work
    return 0;
}
===

But, in D, that *does work*. A struct string can be used on a function that calls for a const(char)[]. It can be used for a function that calls for an immutable(char)[]. It can be used for a function that calls for a struct string. A string struct works exactly the same way as a string alias. Right down to the name!

It's not storeable in a variable typed char[] (or wchar[] nor dchar[]), but neither are D strings today.
Dec 28 2011
prev sibling next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Thursday, 29 December 2011 at 06:08:05 UTC, Andrei 
Alexandrescu wrote:
 On 12/28/11 11:36 PM, Walter Bright wrote:
 On 12/28/2011 8:32 PM, Adam D. Ruppe wrote:
 On Thursday, 29 December 2011 at 04:17:37 UTC, Andrei 
 Alexandrescu wrote:
 If we have two facilities (string and e.g. String) we've 
 lost. We'd
 need to
 slowly change the built-in string type.

Have you actually tried to do it?

I've seen the damage done in C++ with multiple string types. Being able to convert from one to the other doesn't help much.

This. The only solution is to explain to Walter that no other programmer in the world codes UTF like him. Really. I emulate that sometimes (learned from him) but I see code from hundreds of people day in and day out - it's never like his. Once we convince him, he'll be like "ah, I see what you mean. Requiring .rep is awesome. Let's do it."

Andrei

I don't think this is a problem you can solve without educating people. They will need to know a thing or two about how UTF works to know the performance implications of many of the "safe" ways to handle UTF strings.

Further, for much use of Unicode strings in D you can't get away with not knowing anything anyway, because D only abstracts up to code points, not graphemes. Imagine trying to explain to the unknowing programmer what is going on when an algorithm function broke his grapheme and he doesn't know the first thing about Unicode.

I'm not claiming to be an expert myself, but I believe D offers Unicode the right way as it is.
Dec 28 2011
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Wednesday, 28 December 2011 at 19:00:53 UTC, Andrei 
Alexandrescu wrote:
 On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string 
 when they
 mean immutable, and the values underneath are not guaranteed 
 to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice:

#1: string s;
#2: immutable(char)[] s;

sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation.

I think it would be simpler to just make dstring the default string type. dstring is simple and safe. People who want better memory usage can use UTF-8 at their own discretion.
Dec 29 2011
prev sibling next sibling parent Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
This is a great idea! In this case the default string will be a
random-access range, not a bidirectional range. Also, processing
dstring is faster than string, because no decoding needs to be done.
Processing power is more expensive than memory. utf-8 is valuable
only to pass it as an ASCII string (which is not too common) and to
store large chunks of it. Both these cases are much less common than
all the rest of string processing.

+1

On Thu, Dec 29, 2011 at 12:04 PM, Vladimir Panteleev
<vladimir thecybershadow.net> wrote:
 On Wednesday, 28 December 2011 at 19:00:53 UTC, Andrei Alexandrescu wrote:
 On 12/28/11 12:46 PM, Walter Bright wrote:
 On 12/28/2011 10:35 AM, Peter Alexander wrote:
 On 28/12/11 6:15 PM, Walter Bright wrote:
 If such a change is made, then people will use const string when they
 mean immutable, and the values underneath are not guaranteed to be
 consistent.

Then people should learn what const and immutable mean! I don't think it's fair to dismiss my suggestion on the grounds that people don't understand the language.

People do what is convenient, and as endless experience shows, doing the right thing should be easier than doing the wrong thing. If you present people with a choice: #1: string s; #2: immutable(char)[] s; sure as the sun rises, they will type the former, and it will be subtly incorrect if string is const(char)[]. Telling people they should know better and pick #2 instead is a strategy that never works very well - not for programming, nor any other endeavor.

Oh, one more thing - one good thing that could come out of this thread is abolition (through however slow a deprecation path) of s.length and s[i] for narrow strings. Requiring s.rep.length instead of s.length and s.rep[i] instead of s[i] would improve the quality of narrow strings tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people would access the decoding routines on the needed occasions, or would consciously use the representation.

I think it would be simpler to just make dstring the default string type. dstring is simple and safe. People who want better memory usage can use UTF-8 at their own discretion.

-- Bye, Gor Gyolchanyan.
Dec 29 2011
prev sibling next sibling parent Derek <ddparnell bigpond.com> writes:
On Thu, 29 Dec 2011 16:36:59 +1100, Walter Bright  
<newshound2 digitalmars.com> wrote:

 I've seen the damage done in C++ with multiple string types. Being able  
 to convert from one to the other doesn't help much.

I'm not quite sure about that last sentence. I suspect that the better way for applications to handle strings of characters would be to internally store and manipulate them as utf-32 (dchar[]) and only when doing I/O use the other utf forms. So converting from the different forms is very helpful.

-- 
Derek Parnell
Melbourne, Australia
Dec 29 2011
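The convert-at-the-I/O-boundary approach described above can be sketched with std.conv.to, which already converts between the UTF forms:

```d
import std.conv : to;

void main()
{
    string  u8  = "héllo";        // UTF-8: 6 code units for 5 code points
    dstring u32 = to!dstring(u8); // decode once for internal processing
    assert(u8.length  == 6);
    assert(u32.length == 5);      // one dchar per code point
    string back = to!string(u32); // re-encode only when doing I/O
    assert(back == u8);
}
```

In UTF-32 form, indexing and length are code-point exact, which is the convenience Derek is after; the cost is the one-time conversion plus the larger in-memory footprint.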
prev sibling next sibling parent Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
What if the string converted itself from utf-8 to utf-32 back and
forth as necessary (utf-8 for storing and utf-32 for processing):

struct String
{
public:
    bool encoded() @property const
    {
        return _encoded;
    }

    bool encoded(bool should) @property
    {
        if(should)
            if(!encoded)
            {
                _utf8 = to!string(_utf32);
                encoded = true;
            }
        else
            if(encoded)
            {
                _utf32 = to!dstring(_utf8);
                encoded = false;
            }
    }

    // Here goes the part where you get to use the string

private:
    bool _encoded;
    union
    {
        string _utf8;
        dstring _utf32;
    }
}

This has a lot of drawbacks and is purely a curiosity: the idea is to
express the encoding of a string as a property of the string, rather
than as a difference between separate string types.

On Thu, Dec 29, 2011 at 1:02 PM, Walter Bright
<newshound2 digitalmars.com> wrote:
 On 12/29/2011 12:12 AM, Gor Gyolchanyan wrote:
 This is a great idea! In this case the default string will be a
 random-access range, not a bidirectional range. Also, processing
 dstring is faster than string, because no decoding needs to be done.
 Processing power is more expensive than memory. utf-8 is valuable
 only to pass it as an ASCII string (which is not too common) and to
 store large chunks of it. Both these cases are much less common than
 all the rest of string processing.

dstring consumes 4x the memory, and this can easily cause perf degradations due to thrashing and poor cache locality.

-- Bye, Gor Gyolchanyan.
Dec 29 2011
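The 4x figure is easy to check; a small sketch for the ASCII case, where the difference is largest:

```d
void main()
{
    string  s = "hello";  // ASCII stays 1 byte per character in UTF-8
    dstring d = "hello"d; // UTF-32 is always 4 bytes per code point
    assert(s.length * char.sizeof  == 5);
    assert(d.length * dchar.sizeof == 20); // the 4x figure
}
```

For non-ASCII text the ratio shrinks (a 3-byte UTF-8 sequence still becomes one 4-byte dchar), but UTF-32 never wins on size.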
prev sibling next sibling parent Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
oops. I accidentally made a recursive call in the setter. scratch
that, it should change the attribute.

On Thu, Dec 29, 2011 at 6:58 PM, Gor Gyolchanyan
<gor.f.gyolchanyan gmail.com> wrote:
 What if the string converted itself from utf-8 to utf-32 back and
 forth as necessary (utf-8 for storing and utf-32 for processing):

 struct String
 {
 public:
     bool encoded() @property const
     {
         return _encoded;
     }

     bool encoded(bool should) @property
     {
         if(should)
             if(!encoded)
             {
                 _utf8 = to!string(_utf32);
                 encoded = true;
             }
         else
             if(encoded)
             {
                 _utf32 = to!dstring(_utf8);
                 encoded = false;
             }
     }

     // Here goes the part where you get to use the string

 private:
     bool _encoded;
     union
     {
         string _utf8;
         dstring _utf32;
     }
 }

 This has a lot of drawbacks and is purely a curiosity: the idea is to
 express the encoding of a string as a property of the string, rather
 than as a difference between separate string types.

 On Thu, Dec 29, 2011 at 1:02 PM, Walter Bright
 <newshound2 digitalmars.com> wrote:
 On 12/29/2011 12:12 AM, Gor Gyolchanyan wrote:
 This is a great idea! In this case the default string will be a
 random-access range, not a bidirectional range. Also, processing
 dstring is faster than string, because no decoding needs to be done.
 Processing power is more expensive than memory. utf-8 is valuable
 only to pass it as an ASCII string (which is not too common) and to
 store large chunks of it. Both these cases are much less common than
 all the rest of string processing.

 dstring consumes 4x the memory, and this can easily cause perf degradations
 due to thrashing and poor cache locality.

 --
 Bye,
 Gor Gyolchanyan.

-- 
Bye,
Gor Gyolchanyan.
Dec 29 2011
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 29 December 2011 at 06:09:17 UTC, Andrei 
Alexandrescu wrote:
 Nah, that still breaks a lotta code because people parameterize 
 on T[], use isSomeString/isSomeChar etc.

/* snip struct string */

import std.traits;

void tem(T)(T t) if(isSomeString!T) {}
void tem2(T : immutable(char)[])(T t) {}

string a = "test";
tem(a); // works
tem2(a); // works

It's the alias this magic again. (btw I also tried renaming struct string to struct STRING, and it still worked, so it wasn't just naming coincidence!)
Dec 29 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
Don't we already have String-like support with ranges?  I'm not sure I understand the point in having special behavior for char arrays.

Sent from my iPhone

On Dec 28, 2011, at 8:17 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 On 12/28/11 4:18 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 21:57:00 UTC, Andrei Alexandrescu wrote:
 On 12/28/11 1:48 PM, foobar wrote:
 On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei Alexandrescu
 wrote:
 On 12/28/11 1:18 PM, foobar wrote:
 That's a good idea which I wonder about its implementation strategy.

 Implementation would entail a change in the compiler.

 Andrei

 Why? D should be plenty powerful to implement this without modifying the
 compiler. Sounds like you suggest that char[] will behave differently
 than other T[] which is a very poor idea IMO.

 It's an awesome idea, but for an academic debate at best.

 Andrei

 I don't follow you. You've suggested a change that I agree with. Adam provided a prototype string library type that accomplishes your specified goals without any changes to the compiler. What are we missing here? IF it boils down to changing the compiler or leaving the status-quo, I'm voting against the compiler change.

 If we have two facilities (string and e.g. String) we've lost. We'd need to slowly change the built-in string type.

 I discussed the matter with Walter. He completely disagrees, and sees the i=
 ently uses .length, indexing, and slicing in narrow strings.

 I know Walter's code, so I know where he's coming from. He understands UTF, masks, and ranges by heart. I've seen his code and indeed it's an amazing feat of minimal opportunistic on-demand decoding. So I know where he's coming from, but I also know next to nobody codes like him. A casual string user almost always writes string code (iteration, indexing) the wrong way and would be tremendously helped by a clean distinction between abstraction and representation.

 Nagonna happen.

 Andrei
Dec 29 2011
prev sibling next sibling parent "Regan Heath" <regan netmail.co.nz> writes:
On Thu, 29 Dec 2011 18:36:27 -0000, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 12/29/11 12:28 PM, Don wrote:
 On 28.12.2011 20:00, Andrei Alexandrescu wrote:
 Oh, one more thing - one good thing that could come out of this thread
 is abolition (through however slow a deprecation path) of s.length and
 s[i] for narrow strings. Requiring s.rep.length instead of s.length and
 s.rep[i] instead of s[i] would improve the quality of narrow strings
 tremendously. Also, s.rep[i] should return ubyte/ushort, not  
 char/wchar.
 Then, people would access the decoding routines on the needed  
 occasions,
 or would consciously use the representation.

 Yum.

If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

Exactly!
 If you change s[i] into s.rep[i], it does the same thing as now. There's
 no loss of functionality -- it's just stops you from accidentally doing
 the wrong thing. Like .ptr for getting the address of an array.
 Typically all the ".rep" everywhere would get annoying, so you would  
 write:
 ubyte [] u = s.rep;
 and use u from then on.

 I don't like the name 'rep'. Maybe 'raw' or 'utf'?
 Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great. Now I'm twice sorry this will not happen...

+1 for this idea, however named.

R

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/
Dec 30 2011
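For illustration: what the proposed .rep/.raw property would expose can be approximated today with a cast. This is a sketch of the idea, not an existing Phobos facility:

```d
void main()
{
    string s = "héllo";
    // Roughly what .raw/.rep would return: code units as integers,
    // with no auto-decoding and no char/wchar element type.
    auto raw = cast(immutable(ubyte)[]) s;
    assert(raw.length == 6);                  // "é" is two code units
    assert(raw[1] == 0xC3 && raw[2] == 0xA9); // UTF-8 for U+00E9
}
```

The point of the proposal is precisely that this access would become the explicit, spelled-out path, while plain `s[i]` and `s.length` would no longer silently operate on the representation.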
prev sibling next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Friday, 30 December 2011 at 19:55:45 UTC, Timon Gehr wrote:
 I think the way we have it now is optimal. The only reason we 
 are discussing this is because of fear that uneducated users 
 will write code that does not take into account Unicode 
 characters above code point 0x80. But what is the worst thing 
 that can happen?

 1. They don't notice. Then it is not a problem, because they 
 are obviously only using ASCII characters and it is perfectly 
 reasonable to assume that code units and characters are the 
 same thing.

 2. They get screwed up string output, look for the reason, 
 patch up their code with some functions from std.utf and will 
 never make the same mistakes again.


 I have *never* seen an user in D.learn complain about it. They 
 might have been some I missed, but it is certainly not a 
 prevalent problem. Also, just because an user can type .rep 
 does not mean he understands Unicode: He is able to make just 
 the same mistakes as before, even more so, as the array he is 
 getting back has the _wrong element type_.

I strongly agree with this. It would be nice to have everything be simple, work correctly *and* efficiently at the same time, but I don't believe the proposed changes make a definite improvement. In the end, if you don't want to use the standard library or other UTF-aware string libraries, you'll have to know the basics of UTF to write the correct code. I too wish it was harder to write it incorrectly, but the current solution is simply the best one to appear yet.
Dec 30 2011
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 31 December 2011 at 15:03:13 UTC, Andrei 
Alexandrescu wrote:
 The whole concept of generic algorithms working on strings 
 efficiently
 doesn't work.

Apparently std.algorithm does.

According to my research[1], std.array.replace (which uses std.algorithm under the hood) can be at least 40% faster when there is a match and 70% faster when there isn't one. I don't think this is actually related to UTF, though. [1]: http://dump.thecybershadow.net/5cfb6713ce6628686c6aa8a23b15c99e/test.d
Dec 31 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
I'm not sure I understand what's wrong with length.  Of all the times I get a length in one sizable i18nalized app at work I can think of only one instance where I actually want the character count rather than the byte count. Is there some other reason I'm not aware of that length is undesirable?

Sent from my iPhone

On Dec 30, 2011, at 4:12 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 On 12/30/11 6:07 PM, Timon Gehr wrote:
 alias std.string.representation raw;

 I meant your implementation is incomplete.

 But the main point is that presence of representation/raw is not the issue. Putting in place the convention of using .raw is hardly useful within the context.

 Andrei
Dec 31 2011
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
I don't know that Unicode expertise is really required here anyway.  All one has to know is that UTF8 is a multibyte encoding and built-in string attributes talk in bytes. Knowing when one wants bytes vs characters isn't rocket science. That said, I'm on the fence about this change. It breaks consistency for a benefit I'm still weighing. With this change, the char type will still be a single byte, correct?  What happens to foreach on strings?

Sent from my iPhone

On Dec 31, 2011, at 8:20 AM, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 12/31/2011 03:17 PM, Michel Fortin wrote:

 As for Walter being the only one coding by looking at the code units
 directly, that's not true. All my parser code look at code units
 directly and only decode to code points where necessary (just look at
 the XML parsing code I posted a while ago to get an idea to how it can
 apply to ranges). And I don't think it's because I've seen Walter code
 before, I think it is because I know how Unicode works and I want to
 make my parser efficient. I've done the same for a parser in C++ a while
 ago. I can hardly imagine I'm the only one (with Walter and you). I
 think this is how efficient algorithms dealing with Unicode should be
 written.

 +1.
Dec 31 2011
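As a point of reference for the foreach question above: in today's D, the declared element type selects between code units and decoded code points:

```d
void main()
{
    string s = "é"; // one code point, two UTF-8 code units
    int units, points;
    foreach (char c; s)  ++units;  // element type char: raw code units
    foreach (dchar c; s) ++points; // element type dchar: decoded on the fly
    assert(units == 2);
    assert(points == 1);
}
```

With no element type given, foreach over a string iterates code units, so the choice is already explicit today; the proposal would mainly change which spelling is the default.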
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
Sorry, I was simplifying. The distinction I was trying to make was between generic operations (in my experience the majority) vs. encoding-aware ones.

Sent from my iPhone

On Dec 31, 2011, at 12:48 PM, Michel Fortin <michel.fortin michelf.com> wrote:

 On 2011-12-31 16:47:40 +0000, Sean Kelly <sean invisibleduck.org> said:

 I don't know that Unicode expertise is really required here anyway.  All one has to know is that UTF8 is a multibyte encoding and built-in string attributes talk in bytes. Knowing when one wants bytes vs characters isn't rocket science.

 It's not bytes vs. characters, it's code units vs. code points vs. user perceived characters. A user perceived character can span multiple code points, and can be represented in various ways depending on which Unicode normalization you pick. But most people don't know that.

 If you want to count the number of *characters*, counting code points isn't enough. If you want to search for a substring, you need to be sure both strings use the same normalization first, and if not normalize them appropriately so that equivalent code point combinations are always represented the same.

 That said, if you are implementing an XML or JSON parser, since those specs are defined in term of code points you can work in term of code points (hopefully without decoding code points when you don't need to). On the other hand, if you're writing something that processes text (like counting the average number of *character* per word in a document), then you should be aware of combining characters.

 How to pack all this into an easy to use package is most challenging.

 -- 
 Michel Fortin
 michel.fortin michelf.com
 http://michelf.com/
Dec 31 2011
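The normalization pitfall described above can be sketched with std.uni.normalize (note: this function was added to Phobos after this thread; it is shown here only to illustrate the point):

```d
import std.uni : normalize, NFC;

void main()
{
    string precomposed = "\u00E9";  // é as a single code point
    string combining   = "e\u0301"; // e + combining acute accent
    assert(precomposed != combining);                // bytewise different
    assert(normalize!NFC(combining) == precomposed); // equal after NFC
}
```

A naive substring search comparing code units (or even code points) would miss the match unless both operands were normalized first.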
prev sibling next sibling parent Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
I agree, the string parameters are indeed irritating, but changing the
alias would bring much more pain than it would relieve.

On Wed, Dec 28, 2011 at 4:06 PM, Peter Alexander
<peter.alexander.au gmail.com> wrote:
 string is immutable(char)[]

 I rarely *ever* need an immutable string. What I usually need is
 const(char)[]. I'd say 99%+ of the time I need only a const string.

 This is quite irritating because "string" is the most convenient and
 intuitive thing to type. I often get into situations where I've written a
 function that takes a string, and then I can't call it because all I have is
 a char[]. I could copy the char[] into a new string, but that's expensive,
 and I'd rather I could just call the function.

 I think it's telling that most Phobos functions use 'const(char)[]' or 'in
 char[]' instead of 'string' for their arguments. The ones that use 'string'
 are usually using it unnecessarily and should be fixed to use const(char)[].

 In an ideal world I'd much prefer if string was an alias for const(char)[],
 but string literals were immutable(char)[]. It would require a little more
 effort when dealing with concurrency, but that's a price I would be willing
 to pay to make the string alias useful in function parameters.

-- Bye, Gor Gyolchanyan.
Dec 28 2011
prev sibling next sibling parent mta`chrono <chrono mta-international.net> writes:
I understand your intention. It was one of the main irritations when I
moved to D. Here is a function that unnecessarily uses string.

/**
 * replaces foo by bar within text.
 */
string replace(string text, string foo, string bar)
{
   // ...
}

The function is crap because it can't be called with mutable char[].
Okay, that's true. Therefore you suggested aliasing const(char)[]
instead of immutable(char)[] ???

But I think inout() is your man in this case. If I remember correctly, it
has been fixed recently.

I'm not quite sure if I got your point. So forgive me if I was wrong.
Dec 28 2011
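A minimal sketch of the inout() suggestion above, using a hypothetical firstWord helper (the same signature shape would fix the replace example, which is longer to spell out):

```d
// inout transfers the argument's qualifier to the result, so one
// signature serves char[], const(char)[] and immutable(char)[]:
inout(char)[] firstWord(inout(char)[] text)
{
    foreach (i, c; text)
        if (c == ' ')
            return text[0 .. i];
    return text;
}

void main()
{
    string s = "foo bar";          // immutable(char)[]
    char[] m = "foo bar".dup;      // mutable copy
    assert(firstWord(s) == "foo"); // result typed immutable(char)[]
    assert(firstWord(m) == "foo"); // result typed char[]
}
```

The caller keeps whatever qualifier it started with, so no copy is ever forced just to satisfy the parameter type.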
prev sibling next sibling parent reply =?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:
On 12/28/2011 04:06 AM, Peter Alexander wrote:
 string is immutable(char)[]

 I rarely *ever* need an immutable string. What I usually need is
 const(char)[]. I'd say 99%+ of the time I need only a const string.

 This is quite irritating because "string" is the most convenient and
 intuitive thing to type. I often get into situations where I've written
 a function that takes a string, and then I can't call it because all I
 have is a char[]. I could copy the char[] into a new string, but that's
 expensive, and I'd rather I could just call the function.

 I think it's telling that most Phobos functions use 'const(char)[]' or
 'in char[]' instead of 'string' for their arguments. The ones that use
 'string' are usually using it unnecessarily and should be fixed to use
 const(char)[].

 In an ideal world I'd much prefer if string was an alias for
 const(char)[], but string literals were immutable(char)[]. It would
 require a little more effort when dealing with concurrency, but that's a
 price I would be willing to pay to make the string alias useful in
 function parameters.

Agreed. I've talked about this in D.learn a number of times myself. Ali
Dec 28 2011
parent =?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:
On 12/28/2011 08:00 AM, Ali Çehreli wrote:
 Agreed. I've talked about this in D.learn a number of times myself.

After seeing others' comments that focus more on the alias, I need to clarify: I don't have an opinion on the alias itself. I agree with the subject line that function parameter lists should mostly have const(char)[] instead of string. Ali
Dec 28 2011
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 6:06 AM, Peter Alexander wrote:
 string is immutable(char)[]

 I rarely *ever* need an immutable string. What I usually need is
 const(char)[]. I'd say 99%+ of the time I need only a const string.

 This is quite irritating because "string" is the most convenient and
 intuitive thing to type. I often get into situations where I've written
 a function that takes a string, and then I can't call it because all I
 have is a char[]. I could copy the char[] into a new string, but that's
 expensive, and I'd rather I could just call the function.

 I think it's telling that most Phobos functions use 'const(char)[]' or
 'in char[]' instead of 'string' for their arguments. The ones that use
 'string' are usually using it unnecessarily and should be fixed to use
 const(char)[].

 In an ideal world I'd much prefer if string was an alias for
 const(char)[], but string literals were immutable(char)[]. It would
 require a little more effort when dealing with concurrency, but that's a
 price I would be willing to pay to make the string alias useful in
 function parameters.

I'm afraid you're wrong here. The current setup is very good, and much better than one in which "string" would be an alias for const(char)[].

The problem is escaping. A function that transitorily operates on a string indeed does not care about the origin of the string, but storing a string inside an object is a completely different deal. The setup

class Query
{
    string name;
    ...
}

is safe, minimizes data copying, and never causes surprises to anyone ("I set the name of my query and a little later it's all messed up!").

So immutable(char)[] is the best choice for a correct string abstraction compared against both char[] and const(char)[]. In fact it's in a way good that const(char)[] takes longer to type, because it also carries larger liabilities.

If you want to create a string out of a char[] or const(char)[], use std.conv.to or the unsafe assumeUnique.

Andrei
Dec 28 2011
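The two conversion routes mentioned in the last paragraph differ in cost: to!string copies the data, while assumeUnique transfers ownership in O(1) on the caller's promise that no other mutable reference survives. A small sketch:

```d
import std.exception : assumeUnique;

void main()
{
    char[] buf = "hello".dup;
    // Caller promises no other mutable reference to buf exists;
    // no copy is made, the array is simply retyped:
    string s = assumeUnique(buf);
    assert(s == "hello");
    assert(buf is null); // the ref overload clears the source reference
}
```

Clearing the source is what makes the "unique" promise checkable in practice: after the call, the only remaining reference is the immutable one.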
next sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 28/12/11 4:27 PM, Andrei Alexandrescu wrote:
 The problem is escaping. A function that transitorily operates on a
 string indeed does not care about the origin of the string, but storing
 a string inside an object is a completely different deal. The setup

 class Query
 {
 string name;
 ...
 }

 is safe, minimizes data copying, and never causes surprises to anyone
 ("I set the name of my query and a little later it's all messed up!").

 So immutable(char)[] is the best choice for a correct string abstraction
 compared against both char[] and const(char)[]. In fact it's in a way
 good that const(char)[] takes longer to type, because it also carries
 larger liabilities.

I don't follow your argument. You've said (paraphrasing) "If a function does A then X is best, but if a function does B then Y is best, so Y is best."

If a function needs to store the string then by all means it should use immutable(char)[]. However, this is a much rarer case than functions that simply use the string transitorily, as you put it.

Again, there are very, very few functions in Phobos that accept a string as an argument. The vast majority accept `const(char)[]` or `in char[]`. This speaks volumes about how useful the string alias is.
Dec 28 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 11:42 AM, Peter Alexander wrote:
 On 28/12/11 4:27 PM, Andrei Alexandrescu wrote:
 The problem is escaping. A function that transitorily operates on a
 string indeed does not care about the origin of the string, but storing
 a string inside an object is a completely different deal. The setup

 class Query
 {
 string name;
 ...
 }

 is safe, minimizes data copying, and never causes surprises to anyone
 ("I set the name of my query and a little later it's all messed up!").

 So immutable(char)[] is the best choice for a correct string abstraction
 compared against both char[] and const(char)[]. In fact it's in a way
 good that const(char)[] takes longer to type, because it also carries
 larger liabilities.

I don't follow your argument. You've said (paraphrasing) "If a function does A then X is best, but if a function does B then Y is best, so Y is best."

I'm saying (paraphrasing) "X is modularly bankrupt and unsafe, and Y is modular and safe, so Y is best".
 If a function needs to store the string then by all means it should use
 immutable(char)[]. However, this is a much rarer case than functions
 that simply use the string transitorily as you put it.

Rarity is a secondary concern to modularity and safety.
 Again, there are very, very few functions in Phobos that accept a string
 as an argument. The vast majority accept `const(char)[]` or `in char[]`.
 This speaks volumes about how useful the string alias is.

Phobos consists of many functions and few entity types. Application code is rife with entity types. I kindly suggest you reconsider your position; the current setup is indeed very solid.

Andrei
Dec 28 2011
prev sibling next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 28 December 2011 at 16:27:15 UTC, Andrei 
Alexandrescu wrote:
 So immutable(char)[] is the best choice for a correct string 
 abstraction compared against both char[] and const(char)[]. In 
 fact it's in a way good that const(char)[] takes longer to 
 type, because it also carries larger liabilities.

Also, 'in char[]', which is conceptually much safer, isn't that much longer to type. It would be cool if 'scope' was actually implemented apart from an optimization though.
Dec 28 2011
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, December 28, 2011 10:27:15 Andrei Alexandrescu wrote:
 I'm afraid you're wrong here. The current setup is very good, and much
 better than one in which "string" would be an alias for const(char)[].
 
 The problem is escaping. A function that transitorily operates on a
 string indeed does not care about the origin of the string, but storing
 a string inside an object is a completely different deal. The setup
 
 class Query
 {
      string name;
      ...
 }
 
 is safe, minimizes data copying, and never causes surprises to anyone
 ("I set the name of my query and a little later it's all messed up!").
 
 So immutable(char)[] is the best choice for a correct string abstraction
 compared against both char[] and const(char)[]. In fact it's in a way
 good that const(char)[] takes longer to type, because it also carries
 larger liabilities.
 
 If you want to create a string out of a char[] or const(char)[], use
 std.conv.to or the unsafe assumeUnique.

Agreed. And for a number of functions, taking const(char)[] would be worse, because they would have to dup or idup the string, whereas with immutable(char)[], they can safely slice it without worrying about its value changing.

I think that if we want to make it so that immutable(char)[] isn't forced as much, then we need to make proper use of templates (which also allow you to not force char over wchar or dchar) and inout - and perhaps in some cases, a templated function could allow you to indicate what type of character you want returned. But in general, string is by far the most useful and least likely to cause bugs with slicing. So, I think that string should remain immutable(char)[].

- Jonathan M Davis
Dec 28 2011
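The slice-without-dup point above can be sketched as follows (illustrative; firstWord is a hypothetical helper, not a Phobos function): a function taking string can return a slice of its argument with no copying, because immutability guarantees the slice can never change out from under the caller.

```d
// Illustrative sketch: a function taking string may return a zero-copy
// slice of its argument; immutability makes the slice safe to keep.
import std.string : indexOf;

string firstWord(string s) // hypothetical helper for this example
{
    auto i = s.indexOf(' ');
    return i < 0 ? s : s[0 .. i]; // zero-copy slice of immutable data
}

void main()
{
    string line = "SELECT name FROM users";
    auto w = firstWord(line);
    assert(w == "SELECT");
    // Had the parameter been const(char)[], the returned slice could be
    // a view into mutable data, and a cautious caller would .idup it.
}
```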
parent deadalnix <deadalnix gmail.com> writes:
Le 28/12/2011 21:43, Jonathan M Davis a écrit :
 On Wednesday, December 28, 2011 10:27:15 Andrei Alexandrescu wrote:
 I'm afraid you're wrong here. The current setup is very good, and much
 better than one in which "string" would be an alias for const(char)[].

 The problem is escaping. A function that transitorily operates on a
 string indeed does not care about the origin of the string, but storing
 a string inside an object is a completely different deal. The setup

 class Query
 {
       string name;
       ...
 }

 is safe, minimizes data copying, and never causes surprises to anyone
 ("I set the name of my query and a little later it's all messed up!").

 So immutable(char)[] is the best choice for a correct string abstraction
 compared against both char[] and const(char)[]. In fact it's in a way
 good that const(char)[] takes longer to type, because it also carries
 larger liabilities.

 If you want to create a string out of a char[] or const(char)[], use
 std.conv.to or the unsafe assumeUnique.

Agreed. And for a number of functions, taking const(char)[] would be worse, because they would have to dup or idup the string, whereas with immutable(char)[], they can safely slice it without worrying about its value changing.

Is inout a solution for the standard lib here? The user could idup if a string is needed from a const/mutable char[].
Dec 29 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, December 28, 2011 19:25:15 Jakob Ovrum wrote:
 Also, 'in char[]', which is conceptually much safer, isn't that
 much longer to type.
 
 It would be cool if 'scope' was actually implemented apart from
 an optimization though.

in char[] is _not_ safer than immutable(char)[]. In fact, it's _less_ safe. It's also far more restrictive. Many, many functions return a portion of the string that they are passed in. That slicing would be impossible with scope, and because in char[] makes no guarantees about the elements not changing after the function call, you'd often have to dup or idup it in order to avoid bugs.

immutable(char)[] avoids all of that. You can safely slice it without having to worry about duping it to avoid it changing out from under you.

- Jonathan M Davis
Dec 28 2011
prev sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 28 December 2011 at 20:49:54 UTC, Jonathan M Davis 
wrote:
 On Wednesday, December 28, 2011 19:25:15 Jakob Ovrum wrote:
 Also, 'in char[]', which is conceptually much safer, isn't that
 much longer to type.
 
 It would be cool if 'scope' was actually implemented apart from
 an optimization though.

in char[] is _not_ safer than immutable(char)[].

I didn't say it was. Please read more closely.
Dec 28 2011
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 12/28/2011 4:06 AM, Peter Alexander wrote:
 I rarely *ever* need an immutable string. What I usually need is const(char)[].
 I'd say 99%+ of the time I need only a const string.

I have a very different experience with strings. I can't even remember a case where I wanted to modify an existing string (this includes all my C and C++ usage of strings). It's always assemble a string at one place, and then refer to that string ever after (and never modify it).

What immutable strings make possible is treating strings as if they were value types. Nearly every language I know of treats them as immutable except for C and C++.
Dec 28 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 12/28/11 11:11 AM, Walter Bright wrote:
 On 12/28/2011 4:06 AM, Peter Alexander wrote:
 I rarely *ever* need an immutable string. What I usually need is
 const(char)[].
 I'd say 99%+ of the time I need only a const string.

I have a very different experience with strings. I can't even remember a case where I wanted to modify an existing string (this includes all my C and C++ usage of strings). It's always assemble a string at one place, and then refer to that string ever after (and never modify it). What immutable strings make possible is treating strings as if they were value types. Nearly every language I know of treats them as immutable except for C and C++.

I remember the day at Kahili when we figured out that immutable(char)[] would just work as it needs to. It felt pretty awesome.

Andrei
Dec 28 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 12/28/2011 06:40 PM, Andrei Alexandrescu wrote:
 On 12/28/11 11:11 AM, Walter Bright wrote:
 On 12/28/2011 4:06 AM, Peter Alexander wrote:
 I rarely *ever* need an immutable string. What I usually need is
 const(char)[].
 I'd say 99%+ of the time I need only a const string.

I have a very different experience with strings. I can't even remember a case where I wanted to modify an existing string (this includes all my C and C++ usage of strings). It's always assemble a string at one place, and then refer to that string ever after (and never modify it). What immutable strings make possible is treating strings as if they were value types. Nearly every language I know of treats them as immutable except for C and C++.

I remember the day at Kahili we figured immutable(char)[] will just work as it needs to. It felt pretty awesome. Andrei

I agree. But I am confused by the fact that you are suggesting it actually does not work as it needs to at other places in this thread.
Dec 28 2011
prev sibling next sibling parent Peter Alexander <peter.alexander.au gmail.com> writes:
On 28/12/11 5:11 PM, Walter Bright wrote:
 On 12/28/2011 4:06 AM, Peter Alexander wrote:
 I rarely *ever* need an immutable string. What I usually need is
 const(char)[].
 I'd say 99%+ of the time I need only a const string.

I have a very different experience with strings. I can't even remember a case where I wanted to modify an existing string (this includes all my C and C++ usage of strings). It's always assemble a string at one place, and then refer to that string ever after (and never modify it).

We can disagree on this, but I think the fact that Phobos rarely uses 'string' and instead uses 'const(char)[]' or 'in char[]' speaks louder than either of our experiences.
 What immutable strings make possible is treating strings as if they were
 value types. Nearly every language I know of treats them as immutable
 except for C and C++.

Yes, and I wouldn't want to remove that. Immutable strings are good, but requiring immutable strings when you don't need them is definitely not good. Phobos knows this, so it doesn't use string, which leads me to question what use the string alias is.
Dec 28 2011
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
Most common to me: buffer reuse. I'll read a line of a file into a buffer, operate on it, then read the next line into the same buffer. If references to the buffer may escape, it's obviously unsafe to cast to immutable.

Sent from my iPhone

On Dec 28, 2011, at 9:11 AM, Walter Bright <newshound2 digitalmars.com> wrote:

 On 12/28/2011 4:06 AM, Peter Alexander wrote:
 I rarely *ever* need an immutable string. What I usually need is const(char)[].
 I'd say 99%+ of the time I need only a const string.

 I have a very different experience with strings. I can't even remember a case where I wanted to modify an existing string (this includes all my C and C++ usage of strings). It's always assemble a string at one place, and then refer to that string ever after (and never modify it).

 What immutable strings make possible is treating strings as if they were value types. Nearly every language I know of treats them as immutable except for C and C++.
Dec 28 2011
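The buffer-reuse hazard Sean describes can be sketched like this (illustrative code; an in-memory array stands in for the file so the example is self-contained): refilling one char[] per line is fine, but a reference cast to immutable would silently change when the buffer is refilled, so keeping a line requires an explicit copy.

```d
// Illustrative sketch of the buffer-reuse pattern: one char[] buffer is
// refilled per "line"; casting it to immutable while references escape
// would break the immutability guarantee, so .idup is required to keep one.
void main()
{
    string[] lines = ["first", "second"]; // stand-in for file input
    char[] buf;
    string kept;

    foreach (line; lines)
    {
        buf.length = line.length;
        buf[] = line[];          // refill the same buffer

        // UNSAFE alternative: kept = cast(string) buf;
        // 'kept' would then silently change on the next refill.

        kept = buf.idup;         // safe: copy before treating as immutable
    }
    assert(kept == "second");
}
```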
prev sibling next sibling parent "Dejan Lekic" <dejan.lekic gmail.com> writes:
Peter, having string as immutable(char)[] was perhaps one of the 
best D2 decisions so far, in my humble opinion. I strongly 
disagree with you on this one.
Dec 28 2011
prev sibling next sibling parent Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
Having a mutable string is also a bad idea because its mutability takes the form of array element manipulation, but a string (except for dstring) is not semantically an array, and mutating its elements isn't safe.

On Wed, Dec 28, 2011 at 9:19 PM, Dejan Lekic <dejan.lekic gmail.com> wrote:
 Peter, having string as immutable(char)[] was perhaps one of the best D2
 decisions so far, in my humble opinion. I strongly disagree with you on this
 one.

-- Bye, Gor Gyolchanyan.
Dec 28 2011
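Gor's point about element mutation can be illustrated concretely (a sketch, not from the original post): char[] indexes UTF-8 code units, not characters, so overwriting a single element can corrupt a multi-byte sequence.

```d
// Illustrative sketch: mutating one element of a char[] can corrupt a
// multi-byte UTF-8 sequence, because char[] indexes code units.
import std.utf : validate, UTFException;
import std.exception : assertThrown;

void main()
{
    char[] s = "\xC3\xA9!".dup; // 'é' encoded as two UTF-8 code units
    assert(s.length == 3);      // 2 bytes for 'é' + 1 byte for '!'

    s[1] = 'x';                 // overwrite the continuation byte
    assertThrown!UTFException(validate(s)); // no longer valid UTF-8
}
```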
prev sibling next sibling parent reply so <so so.so> writes:
On Wed, 28 Dec 2011 14:06:06 +0200, Peter Alexander  
<peter.alexander.au gmail.com> wrote:

 string is immutable(char)[]

 I rarely *ever* need an immutable string. What I usually need is  
 const(char)[]. I'd say 99%+ of the time I need only a const string.

 This is quite irritating because "string" is the most convenient and  
 intuitive thing to type. I often get into situations where I've written  
 a function that takes a string, and then I can't call it because all I  
 have is a char[]. I could copy the char[] into a new string, but that's  
 expensive, and I'd rather I could just call the function.

 I think it's telling that most Phobos functions use 'const(char)[]' or  
 'in char[]' instead of 'string' for their arguments. The ones that use  
 'string' are usually using it unnecessarily and should be fixed to use  
 const(char)[].

 In an ideal world I'd much prefer if string was an alias for  
 const(char)[], but string literals were immutable(char)[]. It would  
 require a little more effort when dealing with concurrency, but that's a  
 price I would be willing to pay to make the string alias useful in  
 function parameters.

As you said, string is not a structure but an alias. Your argument is not against string, but against the functions that accept only string when you think they shouldn't. If you are sure a function could work on your char[] (but it won't accept it), that just shows we need to focus on the function rather than on string, no?
Dec 28 2011
parent mta`chrono <chrono mta-international.net> writes:
There are a lot of people suggesting changes to how string behaves. But remember, D is awesome compared to other languages for not wrapping string in a class or struct.

You can use string/char[] without losing your _nativeness_. Programmers targeting embedded systems are really happy because of this.

By the way, I don't want to blame anyone, but I think we have diverged from the original purpose of this topic: __"string is rarely useful as a function argument"__

I think he points out that choosing the _string_ type for function arguments is _wrong_ in most cases. And there isn't much use of inout in Phobos, as it was broken for a long time.
Dec 30 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, December 29, 2011 17:01:19 deadalnix wrote:
 Le 28/12/2011 21:43, Jonathan M Davis a écrit :
  Agreed. And for a number of functions, taking const(char)[] would be
  worse, because they would have to dup or idup the string, whereas with
  immutable(char)[], they can safely slice it without worrying about its
  value changing.

 Is inout a solution for the standard lib here?

 The user could idup if a string is needed from a const/mutable char[]

In some places, yes. Phobos doesn't use inout as much as it probably should, simply because it was only recently that inout was made to work properly. Regardless, you have to be careful about taking const(char)[], because there's a risk of forcing what could be an unnecessary idup. The best solution to that, however, depends on what exactly the function is doing. If it's simply slicing a portion of the string that's passed in and returning it, then inout is a great solution. On the other hand, if it actually needs an immutable(char)[] internally, then there's a good chance that it should just take a string. It depends on what the function is ultimately doing.

- Jonathan M Davis
Dec 29 2011
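The inout approach Jonathan describes can be sketched as follows (illustrative helper, not Phobos code): a single declaration serves mutable, const, and immutable arguments alike, and the returned slice carries the caller's qualifier, so no dup or idup is forced.

```d
// Illustrative sketch of inout: one slicing function whose result
// preserves the caller's qualifier - char[] in gives char[] out,
// string in gives string out - with no copying.
inout(char)[] firstN(inout(char)[] s, size_t n)
{
    return s[0 .. n]; // zero-copy slice; result qualifier matches s
}

void main()
{
    string s = "hello world";
    char[] m = "hello world".dup;

    string a = firstN(s, 5);   // stays immutable(char)[]
    char[] b = firstN(m, 5);   // stays char[]
    assert(a == "hello" && b == "hello");

    m[0] = 'j';                // mutating m affects b (a slice of m)...
    assert(b == "jello");
    assert(a == "hello");      // ...but never a, which is immutable
}
```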