digitalmars.D - Major performance problem with std.array.front()

Walter Bright <newshound2 digitalmars.com> writes:
In "Lots of low hanging fruit in Phobos" the issue came up about the automatic 
encoding and decoding of char ranges.

Throughout D's history, there are regular and repeated proposals to redesign
D's view of char[] to pretend it is not UTF-8, but UTF-32. I.e. so D will
automatically generate code to decode and encode on every attempt to index
char[].

I have strongly objected to these proposals on the grounds that:

1. It is a MAJOR performance problem to do this.

2. Very, very few manipulations of strings ever actually need decoded values.

3. D is a systems/native programming language, and systems/native programming 
languages must not hide the underlying representation (I make similar arguments 
about proposals to make ints issue errors on overflow, etc.).

4. Users should choose when decode/encode happens, not the language.

and I have been successful at heading these off. But one slipped by me. See
this in std.array:

    @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
    {
        assert(a.length, "Attempting to fetch the front of an empty array of " ~
               T.stringof);
        size_t i = 0;
        return decode(a, i);
    }

What that means is that if I implement an algorithm that accepts, as input, an 
InputRange of char's, it will ALWAYS try to decode it. This means that even:

    from.copy(to)

will decode 'from', and then re-encode it for 'to'. And it will do it SILENTLY. 
The user won't notice, and he'll just assume that D performance sux. Even if he 
does notice, his options to make his code run faster are poor.

If the user wants decoding, it should be explicit, as in:

     from.decode.copy(encode!to)

The USER should decide where and when the decoding goes. 'decode' should be
just another algorithm.

(Yes, I know that std.algorithm.copy() has some specializations to take care of 
this. But these specializations would have to be written for EVERY algorithm, 
which is thoroughly unreasonable. Furthermore, copy()'s specializations only 
apply if BOTH source and destination are arrays. If just one is, the 
decode/encode penalty applies.)
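
The behaviour described above can be seen from the user's side in a few lines. This is only a minimal sketch, assuming a Phobos in which std.range re-exports front and ElementType for arrays; the sample string is made up for illustration.

```d
import std.range : ElementType, front;

void main()
{
    string s = "\u00E9t\u00E9";   // 3 code points, 5 UTF-8 code units

    // Indexing and .length work on UTF-8 code units.
    static assert(is(typeof(s[0]) == immutable(char)));
    assert(s.length == 5);

    // front, and so every generic range algorithm, sees decoded dchars.
    static assert(is(typeof(s.front) == dchar));
    static assert(is(ElementType!string == dchar));
    assert(s.front == '\u00E9');
}
```

The mismatch between typeof(s[0]) and typeof(s.front) is the point at issue: an algorithm written against the range primitives never sees the chars that the array actually stores.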

Is there any hope of fixing this?
Mar 06 2014
"bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 systems/native programming languages must not hide the 
 underlying representation (I make similar arguments about 
 proposals to make ints issue errors on overflow, etc.).

But it's good to have in Phobos a compiler-intrinsics-based efficient overflow detection on a user-defined struct type that behaves like built-in ints in all other aspects.
 Is there any hope of fixing this?

I don't think we can change that in D2. You can change it in D3. Bye, bearophile
Mar 06 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 6:54 PM, bearophile wrote:
 Walter Bright:

 systems/native programming languages must not hide the underlying
 representation (I make similar arguments about proposals to make ints issue
 errors on overflow, etc.).

But it's good to have in Phobos a compiler-intrinsics-based efficient overflow detection on a user-defined struct type that behaves like built-in ints in all other aspects.

Yes, so that the user selects it, rather than having it wired in everywhere and the user has to figure out how to defeat it.
Mar 06 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 8:01 PM, Adam D. Ruppe wrote:
 BTW you know what would help this? A pragma we can attach to a struct which
 makes it a very thin value type.

I'd rather fix the compiler's codegen than add a pragma.
Mar 06 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 10:12 PM, H. S. Teoh wrote:
 From what I understand, structs are *supposed* to be thin value types. I
 would say that if a struct is under a certain size (determined by the
 compiler), and doesn't have complicated semantics like dtors and stuff
 like that, then it should be treated like a POD (passed in registers,
 etc).

Yes, that's right.
Mar 06 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 5:56 AM, Adam D. Ruppe wrote:
 On Friday, 7 March 2014 at 04:19:16 UTC, Walter Bright wrote:
 I'd rather fix the compiler's codegen than add a pragma.

The codegen isn't broken, the current this pointer behavior is needed for full compatibility with the C ABI. It would be opt in to an ABI tweak that the caller needs to be aware of rather than a traditional optimization where the outside world would never know.

Oh, I see what you mean. But I think it does generate the same code, if you use it the same way. There is no 'get' function for ints; you aren't using it the same way.
Mar 07 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 7:24 AM, Adam D. Ruppe wrote:
 But you can't inline asm function,

I intend to fix that for dmd, but haven't had the time.
 and checking the overflow flag needs asm. (or a compiler intrinsic.)

For that, I was thinking of having the compiler recognize one of the common coding patterns for detecting overflow, and then generating efficient overflow checks. Then documenting the pattern as being specially detected. This means the code will still be successful for compilers that don't detect the pattern, and no language changes would be required.
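
As a rough illustration of the kind of source-level pattern meant here: the post does not say which pattern would be recognized, so the pre-check below is only one commonly used form, and addWouldOverflow is a made-up name. The second half uses druntime's core.checkedint helper adds, which as far as I know was added after this thread.

```d
import core.checkedint : adds;

/// Portable pre-check: true if a + b would not fit in an int.
/// (Hypothetical helper, not part of Phobos or druntime.)
bool addWouldOverflow(int a, int b)
{
    return (b > 0 && a > int.max - b) || (b < 0 && a < int.min - b);
}

void main()
{
    assert(!addWouldOverflow(1, 2));
    assert(addWouldOverflow(int.max, 1));
    assert(addWouldOverflow(int.min, -1));

    // Library route: adds() sets the flag on overflow and never clears it.
    bool overflow = false;
    int sum = adds(int.max, 1, overflow);
    assert(overflow);
}
```

Either way the check is something the user opts into, and the code stays valid on a compiler that does no special pattern detection.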
Mar 07 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 6:54 PM, bearophile wrote:
 Walter Bright:
 Is there any hope of fixing this?

I don't think we can change that in D2. You can change it in D3.

You use ranges a lot. Would it break any of your code?
Mar 06 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 7:22 PM, bearophile wrote:
 One advantage of your change is that this code will work:

 auto s = "hello".dup;
 s.sort();

Yes, I hadn't thought of that. The auto-decoding front() introduces all kinds of asymmetry in how ranges work, and asymmetry is bad as it negatively impacts composability.
Mar 06 2014
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/6/14, 7:55 PM, Walter Bright wrote:
 On 3/6/2014 7:22 PM, bearophile wrote:
 One advantage of your change is that this code will work:

 auto s = "hello".dup;
 s.sort();

Yes, I hadn't thought of that. The auto-decoding front() introduces all kinds of asymmetry in how ranges work, and asymmetry is bad as it negatively impacts composability.

There's no asymmetry, and decoding helps composability as I demonstrated. Andrei
Mar 07 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 11:59 AM, Andrei Alexandrescu wrote:
 On 3/6/14, 7:55 PM, Walter Bright wrote:
 On 3/6/2014 7:22 PM, bearophile wrote:
 One advantage of your change is that this code will work:

 auto s = "hello".dup;
 s.sort();

Yes, I hadn't thought of that. The auto-decoding front() introduces all kinds of asymmetry in how ranges work, and asymmetry is bad as it negatively impacts composability.

There's no asymmetry, and decoding helps composability as I demonstrated.

Here's one asymmetry:
-----------------------------
alias int T;     // compiles
//alias char T;  // fails to compile

struct Input(T) { T front(); bool empty(); void popFront(); }
struct Output(T) { void put(T); }

import std.array;

void copy(F,T)(F f, T t) {
    while (!f.empty) {
        t.put(f.front);
        f.popFront();
    }
}

void main() {
    immutable(T)[] from;
    Output!T to;
    from.copy(to);
}
-------------------------------
Mar 07 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 7:31 PM, H. S. Teoh wrote:
 Whoa. You're not serious about changing this now, are you? Because even
 though I would support such a change, you have to realize the magnitude
 of code breakage that will happen. A lot of code that iterates over
 narrow strings will break, and worse yet, they will break *silently*.
 Calling count() on a narrow string will not return the expected value,
 for example. And existing code that iterates over narrow strings
 expecting dchars to come out of it will suddenly silently convert to
 char, and may pass by unnoticed until somebody runs the program with a
 multibyte character in the input.

I understand this all too well. (Note that we currently have a different silent problem: unnoticed large performance problems.)
 This is very high risk change IMO.

 You're welcome to create a (temporary) Phobos fork that reverts narrow
 string auto-decoding, of course, and people can try it out to see how
 much actual breakage is happening. If you really want to push for this,
 that might be the safest way to test the waters before committing to
 such a major change. Silent breakage is not easy to test for,
 unfortunately. :(

I posted a plan in another message in this thread. It'll be a long process, but I think it's doable.
Mar 06 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 7:59 PM, bearophile wrote:
 Walter Bright:

 I understand this all too well. (Note that we currently have a different
 silent problem: unnoticed large performance problems.)

On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).

This comes up repeatedly as justification for D trying to hide the UTF-8 nature of strings that I discussed upthread. To my mind it's like trying to pretend that floating point doesn't have roundoff issues, integers have infinite range, memory is infinite, etc. That has a place in other languages, but not in a systems/native language.
Mar 06 2014
Shammah Chancellor <anonymous coward.com> writes:
On 2014-03-07 04:17:34 +0000, Walter Bright said:

 On 3/6/2014 7:59 PM, bearophile wrote:
 Walter Bright:
 
 I understand this all too well. (Note that we currently have a different
 silent problem: unnoticed large performance problems.)

On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).

This comes up repeatedly as justification for D trying to hide the UTF-8 nature of strings that I discussed upthread. To my mind it's like trying to pretend that floating point doesn't have roundoff issues, integers have infinite range, memory is infinite, etc. That has a place in other languages, but not in a systems/native language.

Is it possible to add a warning notice when .front() is used on char? I would say fix it now, add a warning, and then remove the warning later. -S.
Mar 07 2014
Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-07 03:59:55 +0000, "bearophile" <bearophileHUGS lycos.com> said:

 Walter Bright:
 
 I understand this all too well. (Note that we currently have a 
 different silent problem: unnoticed large performance problems.)

On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).

The way Phobos works isn't any more correct than dealing with code units. Many graphemes span multiple code points -- because of combined diacritics or character variant modifiers -- and decoding at the code-point level is thus often insufficient for correctness.

The problem with Unicode strings is that the representation you must work with depends on the things you want to do. If you want to count the characters then you need graphemes; if you want to parse XML then you'll need to work with code points (in theory, in practice you might still want direct access to code units for performance reasons); and if you want to slice or copy a string then you need to deal with code units. Because of this multiple-representation-for-different-purpose thing, generic algorithms for arrays don't map very well to strings.

From my experience, I'd suggest these basic operations for a "string range" instead of the regular range interface:

    .empty
    .frontCodeUnit
    .frontCodePoint
    .frontGrapheme
    .popFrontCodeUnit
    .popFrontCodePoint
    .popFrontGrapheme
    .codeUnitLength (aka length)
    .codePointLength (for dchar[] only)
    .codePointLengthLinear
    .graphemeLengthLinear

Someone should be able to mix all the three 'front' and 'pop' function variants above in any code dealing with a string type. In my XML parser for instance I regularly use frontCodeUnit to avoid the decoding penalty when matching the next character with an ASCII one such as '<' or '&'. An API like the one above forces you to be aware of the level you're working on, making bugs and inefficiencies stand out (as long as you're familiar with each representation).

If someone wants to use a generic array/range algorithm with a string, my opinion is that he should have to wrap it in a range type that maps front and popFront to one of the above variants. Having to do that should make it obvious that there's an inefficiency there, as you're using an algorithm that wasn't tailored to work with strings and that more decoding than strictly necessary is being done.

--
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca
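
The three levels named above (code units, code points, graphemes) give three different lengths for the same string. The sketch below is not the API proposed in the post; it uses existing Phobos pieces instead (std.utf.byCodeUnit, the auto-decoding walkLength, and std.uni.byGrapheme), some of which, as far as I know, appeared only in releases after this thread.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";

    assert(s.byCodeUnit.length == 3);        // code units: 1 + 2 bytes of UTF-8
    assert(s.walkLength == 2);               // code points: auto-decoded front/popFront
    assert(s.byGrapheme.walkLength == 1);    // graphemes: what a reader would count
}
```

Only the first of the three is O(1); the other two have to walk the string, which is what the "Linear" suffix in the proposed names is meant to make visible.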
Mar 07 2014
Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-07 14:47:26 +0000, "Kagamin" <spam here.lot> said:

 On Friday, 7 March 2014 at 13:40:31 UTC, Michel Fortin wrote:
 if you want to parse XML then you'll need to work with code points (in 
 theory, in practice you might still want direct access to code units 
 for performance reasons)

AFAIK, xml control characters are all ascii, and what's between them you can slice or dup without consideration, so code units should be more than enough.

If you don't fully check for well-formedness (as XML parsers ought to do according to the XML spec) then sure you can limit yourself to ASCII. You'll let through illegal characters in element and attribute names though.

--
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca
Mar 07 2014
Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/7/2014 8:40 AM, Michel Fortin wrote:
 On 2014-03-07 03:59:55 +0000, "bearophile" <bearophileHUGS lycos.com> said:

 Walter Bright:

 I understand this all too well. (Note that we currently have a
 different silent problem: unnoticed large performance problems.)

On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).

The way Phobos works isn't any more correct than dealing with code units. Many graphemes span multiple code points -- because of combined diacritics or character variant modifiers -- and decoding at the code-point level is thus often insufficient for correctness.

Well, it is *more* correct, as many western languages are more likely in current Phobos to "just work" in most cases. It's just that things still aren't completely correct overall.
  From my experience, I'd suggest these basic operations for a "string
 range" instead of the regular range interface:

 .empty
 .frontCodeUnit
 .frontCodePoint
 .frontGrapheme
 .popFrontCodeUnit
 .popFrontCodePoint
 .popFrontGrapheme
 .codeUnitLength (aka length)
 .codePointLength (for dchar[] only)
 .codePointLengthLinear
 .graphemeLengthLinear

 Someone should be able to mix all the three 'front' and 'pop' function
 variants above in any code dealing with a string type. In my XML parser
 for instance I regularly use frontCodeUnit to avoid the decoding penalty
 when matching the next character with an ASCII one such as '<' or '&'.
 An API like the one above forces you to be aware of the level you're
 working on, making bugs and inefficiencies stand out (as long as you're
 familiar with each representation).

 If someone wants to use a generic array/range algorithm with a string,
 my opinion is that he should have to wrap it in a range type that maps
 front and popFront to one of the above variants. Having to do that should
 make it obvious that there's an inefficiency there, as you're using an
 algorithm that wasn't tailored to work with strings and that more
 decoding than strictly necessary is being done.

I actually like this suggestion quite a bit.
Mar 10 2014
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Mar-2014 07:22, bearophile wrote:
 Walter Bright:

 You use ranges a lot. Would it break any of your code?

I need to try the changes to be sure. But the magnitude of this change is so large that I guess some code will surely break. One advantage of your change is that this code will work:

    auto s = "hello".dup;
    s.sort();

Which it shouldn't unless there is an ascii type or some such. -- Dmitry Olshansky
Mar 07 2014
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 1:56 AM, Dmitry Olshansky wrote:
 07-Mar-2014 07:22, bearophile wrote:
 Walter Bright:

 You use ranges a lot. Would it break any of your code?

I need to try the changes to be sure. But the magnitude of this change is so large that I guess some code will surely break. One advantage of your change is that this code will work:

    auto s = "hello".dup;
    s.sort();

Which it shouldn't unless there is an ascii type or some such.

Correct. This is a win, not a failure, of the current approach. To sort the bytes in "hello" write:

    s.representation.sort();

which is indicative to the human and technically correct.

Andrei
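
A self-contained version of the one-liner above, as a sketch: it assumes std.string.representation and std.algorithm's sort as they exist in current Phobos, and the static asserts only spell out why the explicit step is needed.

```d
import std.algorithm : sort;
import std.range : isRandomAccessRange;
import std.string : representation;

void main()
{
    char[] s = "hello".dup;

    // Under auto-decoding a char[] is not a random-access range,
    // which is why std.algorithm's sort refuses it directly.
    static assert(!isRandomAccessRange!(char[]));

    // The code-unit view is a plain ubyte[] over the same memory,
    // so sorting it reorders s in place.
    static assert(is(typeof(s.representation) == ubyte[]));
    s.representation.sort();
    assert(s == "ehllo");
}
```

For pure ASCII input this gives the expected result; for anything else it reorders individual UTF-8 bytes, which is the reason the thread argues the byte-level intent should be visible in the code.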
Mar 07 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 6:37 PM, Walter Bright wrote:
 Is there any hope of fixing this?

Is there any way we can provide an upgrade path for this? Silent breakage is terrible. Any ideas?
Mar 06 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 7:06 PM, Walter Bright wrote:
 On 3/6/2014 6:37 PM, Walter Bright wrote:
 Is there any hope of fixing this?

Is there any way we can provide an upgrade path for this? Silent breakage is terrible. Any ideas?

Ok, I have a plan. Each step will be separated by at least one version:

1. implement decode() as an algorithm for string types, so one can write:

    string s;
    s.decode.algorithm...

suggest that people start doing that instead of:

    s.algorithm...

2. Emit warning when people use std.array.front(s) with strings.

3. Deprecate std.array.front for strings.

4. Error for std.array.front for strings.

5. Implement new std.array.front for strings that doesn't decode.
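
Step 1 asks for decoding to be spelled out at the call site. No s.decode range adaptor exists under that name; the sketch below substitutes std.utf.byDchar and std.utf.byCodeUnit, which are real Phobos functions but, as far as I know, postdate this thread, to show what an explicit choice looks like.

```d
import std.range : walkLength;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "na\u00EFve";   // U+00EF takes two UTF-8 code units

    // Explicit "don't decode": the elements are chars.
    static assert(is(typeof(s.byCodeUnit.front) : char));
    assert(s.byCodeUnit.walkLength == 6);

    // Explicit "decode": the elements are dchars.
    static assert(is(typeof(s.byDchar.front) == dchar));
    assert(s.byDchar.walkLength == 5);
}
```

With both adaptors written out, the cost of decoding is visible in the pipeline rather than implied by the element type of the array.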
Mar 06 2014
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Mar-2014 07:52, Walter Bright wrote:
 On 3/6/2014 7:06 PM, Walter Bright wrote:
 On 3/6/2014 6:37 PM, Walter Bright wrote:
 Is there any hope of fixing this?

Is there any way we can provide an upgrade path for this? Silent breakage is terrible. Any ideas?

Ok, I have a plan. Each step will be separated by at least one version:

1. implement decode() as an algorithm for string types, so one can write:

    string s;
    s.decode.algorithm...

suggest that people start doing that instead of:

    s.algorithm...

This would also be a great fit in cases where 'decode' is decoding some other encoding.
 2. Emit warning when people use std.array.front(s) with strings.

 3. Deprecate std.array.front for strings.

 4. Error for std.array.front for strings.

This sounds fine to me. I would even prefer to only offer explicit wrappers:

    .raw - ubyte/ushort for UTF-8/UTF-16 etc.
    .decode - dchars as Nick suggests.

Then there is also the horrible ElementEncodingType vs ElementType. I would love to see ElementEncodingType die.
 5. Implement new std.array.front for strings that doesn't decode.

It would make it simple to think that strings are arrays of characters. This illusion was broken (and good thing it was), no point in reestablishing it to save a couple of keystrokes for those "who really know what they are doing". -- Dmitry Olshansky
Mar 07 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 2:11 AM, Dmitry Olshansky wrote:
 Then there is also the horrible ElementEncodingType vs ElementType.
 I would love to see ElementEncodingType die.

I agree. ElementEncodingType is a giant red flag saying we screwed things up.
Mar 07 2014
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 9:24 AM, Vladimir Panteleev wrote:
 5. Implement new std.array.front for strings that doesn't decode.

Until then, how will people use strings with algorithms when they mean to use them per-byte? A .raw property which casts to ubyte[]?

There's no "until then". A current ".representation" property already exists that casts all string types appropriately. Andrei
Mar 07 2014
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Mar-2014 23:11, Andrei Alexandrescu wrote:
 On 3/7/14, 9:24 AM, Vladimir Panteleev wrote:
 5. Implement new std.array.front for strings that doesn't decode.

Until then, how will people use strings with algorithms when they mean to use them per-byte? A .raw property which casts to ubyte[]?

There's no "until then". A current ".representation" property already exists that casts all string types appropriately.

There is however a big glaring failure: std.algorithm is specialized for char[] and wchar[], but not for any RandomAccessRange!char or RandomAccessRange!wchar. So if I, for instance, get a custom slice type (e.g. a ring buffer), then I'm out of luck without both the "auto-magic dchar range" and the special code in std.algorithm that works with chars as code units.

If there is a way to exploit the duality of an RA range of code units being "is a" BD range of code points, we have certainly failed at making it work (first of all by doing a horrible job at generic-ness, as mentioned).

-- Dmitry Olshansky
Mar 07 2014
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 11:28 AM, Dmitry Olshansky wrote:
 07-Mar-2014 23:11, Andrei Alexandrescu wrote:
 On 3/7/14, 9:24 AM, Vladimir Panteleev wrote:
 5. Implement new std.array.front for strings that doesn't decode.

Until then, how will people use strings with algorithms when they mean to use them per-byte? A .raw property which casts to ubyte[]?

There's no "until then". A current ".representation" property already exists that casts all string types appropriately.

There is however a big glaring failure: std.algorithm specialized for char[], wchar[] but not for any RandomAccessRange!char or RandomAccessRange!wchar.

I agree that's an issue. Back in the day when this was a choice I decided to consider only char[] and friends "UTF strings". There was room for more generality but I didn't know of any use cases that would ask for them. It's possible I was wrong, but the option to generalize is still open today. Andrei
Mar 07 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 9:24 AM, Vladimir Panteleev wrote:
 I think .decode should be something more explicit (byCodePoint OSLT), just so
 it's clear that it's not magical and does not solve all problems.

Good point. Perhaps "decodeUTF". "decode" is too generic.
 Until then, how will people use strings with algorithms when they mean to use
 them per-byte?

The way they do it now, i.e. they can't. That's the whole problem.
Mar 07 2014
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Mar-2014 21:34, H. S. Teoh wrote:
 On Fri, Mar 07, 2014 at 05:24:59PM +0000, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 03:52:42 UTC, Walter Bright wrote:
 Ok, I have a plan. Each step will be separated by at least one
 version:

 1. implement decode() as an algorithm for string types, so one can
 write:

     string s;
     s.decode.algorithm...

 suggest that people start doing that instead of:

     s.algorithm...

I think .decode should be something more explicit (byCodePoint OSLT), just so it's clear that it's not magical and does not solve all problems.

+1. I think "byCodePoint" is far more self-documenting and less misleading than "decode".

    string s;
    s.byCodePoint.algorithm...

I'm already starting to like it.

-- Dmitry Olshansky
Mar 07 2014
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/6/14, 7:52 PM, Walter Bright wrote:
 On 3/6/2014 7:06 PM, Walter Bright wrote:
 On 3/6/2014 6:37 PM, Walter Bright wrote:
 Is there any hope of fixing this?

Is there any way we can provide an upgrade path for this? Silent breakage is terrible. Any ideas?

Ok, I have a plan. Each step will be separated by at least one version:

1. implement decode() as an algorithm for string types, so one can write:

    string s;
    s.decode.algorithm...

suggest that people start doing that instead of:

    s.algorithm...

2. Emit warning when people use std.array.front(s) with strings.

3. Deprecate std.array.front for strings.

4. Error for std.array.front for strings.

5. Implement new std.array.front for strings that doesn't decode.

This would kill D. I am not exaggerating. Andrei
Mar 07 2014
"Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 03:52:42 UTC, Walter Bright wrote:
 Ok, I have a plan. Each step will be separated by at least one 
 version:

 1. implement decode() as an algorithm for string types, so one 
 can write:

     string s;
     s.decode.algorithm...

 suggest that people start doing that instead of:

     s.algorithm...

I think .decode should be something more explicit (byCodePoint OSLT), just so it's clear that it's not magical and does not solve all problems.
 2. Emit warning when people use std.array.front(s) with strings.

 3. Deprecate std.array.front for strings.

 4. Error for std.array.front for strings.

 5. Implement new std.array.front for strings that doesn't 
 decode.

Until then, how will people use strings with algorithms when they mean to use them per-byte? A .raw property which casts to ubyte[]?
Mar 07 2014
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Mar 07, 2014 at 05:24:59PM +0000, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 03:52:42 UTC, Walter Bright wrote:
Ok, I have a plan. Each step will be separated by at least one
version:

1. implement decode() as an algorithm for string types, so one can
write:

    string s;
    s.decode.algorithm...

suggest that people start doing that instead of:

    s.algorithm...

I think .decode should be something more explicit (byCodePoint OSLT), just so it's clear that it's not magical and does not solve all problems.

+1. I think "byCodePoint" is far more self-documenting and less misleading than "decode".

    string s;
    s.byCodePoint.algorithm...

I'm already starting to like it.

T

--
It always amuses me that Windows has a Safe Mode during bootup. Does that mean that Windows is normally unsafe?
Mar 07 2014
"Nicholas Londey" <londey gmail.com> writes:
 1. implement decode() as an algorithm for string types,

Decode is an incredibly generic name. What about byGlyph similar to byLine?
Mar 08 2014
"Nicholas Londey" <londey gmail.com> writes:
 This would kill D. I am not exaggerating.

I don't know about kill but it certainly feels awfully similar to the Python 2/3 split over strings and Unicode which still doesn't seem to be resolved.
Mar 08 2014
"Chris" <wendlec tcd.ie> writes:
On Friday, 7 March 2014 at 03:52:42 UTC, Walter Bright wrote:
 Ok, I have a plan. Each step will be separated by at least one 
 version:

 1. implement decode() as an algorithm for string types, so one 
 can write:

     string s;
     s.decode.algorithm...

 suggest that people start doing that instead of:

     s.algorithm...

 2. Emit warning when people use std.array.front(s) with strings.

 3. Deprecate std.array.front for strings.

 4. Error for std.array.front for strings.

 5. Implement new std.array.front for strings that doesn't 
 decode.

What about this:

[as above]
1. implement decode() as an algorithm for string types, so one can write:

    string s;
    s.decode.algorithm...

suggest that people start doing that instead of:

    s.algorithm...

[as above]
2. Emit warning when people use std.array.front(s) with strings.

3. Implement new std.array.front for strings that doesn't decode, but keep the old one either forever(ish) or until way into D3 (3.03).

4. Deprecate std.array.front for strings (see 3.)

5. Error for std.array.front for strings. (see 3)

I know that one of the rules of D is "warnings should eventually become errors", but there is nothing wrong with waiting longer than a few months before something is an error or removed from the library, especially if it would cause loads of code to break (my own too, I suppose). As long as users are aware of it, they can start to make the transition in their own code little by little. In this case they will make the transition rather sooner than later, because nobody wants to suffer constant performance penalties.

So for this particular change I'd suggest to wait patiently until it can finally be deprecated. Is this feasible?
Mar 11 2014
"bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 You use ranges a lot. Would it break any of your code?

I need to try the changes to be sure. But the magnitude of this change is so large that I guess some code will surely break. One advantage of your change is that this code will work:

    auto s = "hello".dup;
    s.sort();

Bye,
bearophile
Mar 06 2014
"bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 But it's good to have in Phobos a compiler-intrinsics-based 
 efficient overflow
 detection on a user-defined struct type that behaves like 
 built-in ints in all
 other aspects.

Yes, so that the user selects it, rather than having it wired in everywhere and the user has to figure out how to defeat it.

I don't think people have ever suggested that. In a recent discussion you seemed against the idea of a special compiler support for that user defined type. Bye, bearophile
Mar 06 2014
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Mar 06, 2014 at 06:59:36PM -0800, Walter Bright wrote:
 On 3/6/2014 6:54 PM, bearophile wrote:
Walter Bright:
Is there any hope of fixing this?

I don't think we can change that in D2. You can change it in D3.

You use ranges a lot. Would it break any of your code?

Whoa. You're not serious about changing this now, are you? Because even though I would support such a change, you have to realize the magnitude of code breakage that will happen. A lot of code that iterates over narrow strings will break, and worse yet, they will break *silently*. Calling count() on a narrow string will not return the expected value, for example. And existing code that iterates over narrow strings expecting dchars to come out of it will suddenly silently convert to char, and may pass by unnoticed until somebody runs the program with a multibyte character in the input.

This is very high risk change IMO.

You're welcome to create a (temporary) Phobos fork that reverts narrow string auto-decoding, of course, and people can try it out to see how much actual breakage is happening. If you really want to push for this, that might be the safest way to test the waters before committing to such a major change. Silent breakage is not easy to test for, unfortunately. :(

T

--
Truth, Sir, is a cow which will give [skeptics] no more milk, and so they are gone to milk the bull. -- Sam. Johnson
Mar 06 2014
"bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 I understand this all too well. (Note that we currently have a 
 different silent problem: unnoticed large performance problems.)

On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage). Bye, bearophile
Mar 06 2014
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 7 March 2014 at 02:57:38 UTC, Walter Bright wrote:
 Yes, so that the user selects it, rather than having it wired 
 in everywhere and the user has to figure out how to defeat it.

BTW you know what would help this? A pragma we can attach to a struct which makes it a very thin value type.

    pragma(thin_struct)
    struct A {
        int a;
        int foo() { return a; }
        static A get() { return A(10); }
    }

    void test() {
        A a = A.get();
        printf("%d", a.foo());
    }

With the pragma, A would be completely indistinguishable from int in all ways.

What do I mean?

    $ dmd -release -O -inline test56 -c

Let's look at A.foo:

    A.foo:
       0:   55                      push   ebp
       1:   8b ec                   mov    ebp,esp
       3:   50                      push   eax
       4:   8b 00                   mov    eax,DWORD PTR [eax] ; waste!
       6:   8b e5                   mov    esp,ebp
       8:   5d                      pop    ebp
       9:   c3                      ret

It is line four that bugs me: the struct is passed as a *pointer*, but its only contents are an int, which could just as well be passed as a value. Let's compare it to an identical function in operation:

    int identity(int a) { return a; }

    00000000 <_D6test568identityFiZi>:
       0:   55                      push   ebp
       1:   8b ec                   mov    ebp,esp
       3:   83 ec 04                sub    esp,0x4
       6:   c9                      leave
       7:   c3                      ret

lol it *still* wastes time, setting up a stack frame for nothing. But we could just as well write asm { naked; ret; } and it would work as expected: the argument is passed in EAX and the return value is expected in EAX. The function doesn't actually have to do anything.

Anywho, the struct could work the same way. Now, I understand that we can't just change this unilaterally since it would break interaction with the C ABI, but we could opt in to some thinner stuff with a pragma.

Ideally, the thin struct would generate this code:

    void A.get() {
        naked { // no need for stack frame here
            mov EAX, 10;
            ret;
        }
    }

return A(10); when A is thin should be equal to return 10;. No need for NRVO, the object is super thin.

    void A.foo() {
        naked { // no locals, no stack frame
            ret; // the last argument (this) is passed in EAX
                 // and the return value goes in EAX
                 // so we don't have to do anything
        }
    }

Without the thin_struct thing, this would minimally look like

    mov EAX, [EAX];
    ret;

Having to load the value from the this pointer. But since it is thin, it is generated identically to an int, like the identity function above, so the value is already in the register!

Then, test:

    void test() {
        naked { // don't need a stack frame here either!
            call A.get;
            // a is now in EAX, the value loaded right up
            call A.foo; // the this is an int and already
                        // where it needs to be, so just go
            // and finally, go ahead and call printf
            push EAX;
            push "%d".ptr;
            call printf;
            ret;
        }
    }

Then, naturally, inlining A.get and A.foo might be possible (though I'd love to write them in assembly myself* and the compiler prolly can't inline them) but call/ret is fairly cheap, especially when compared to push/pop, so just keeping all the relevant stuff right in registers with no need to reference can really help us.

    pragma(thin_struct)
    struct RangedInt {
        int a;
        RangedInt opBinary(string op : "+")(int rhs) {
            asm {
                naked;
                add EAX, [rhs]; // or RDI on 64 bit! Don't even need to touch the stack! **
                jo throw_exception;
                ret;
            }
        }
    }

Might still not be as perfect as intrinsics like bearophile is thinking of... but we'd be getting pretty close. And this kind of thing would be good for other thin wrappers too, we could magically make smart pointers too! (This can't be done now since returning a struct is done via hidden pointer argument instead of by register like a naked pointer).

** i'd kinda love it if we had an all-register calling convention on 32 bit too.... but eh oh well
Mar 06 2014
prev sibling next sibling parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
What about this?:

Anywhere we currently have a front() that decodes, such as your example:

    @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
    {
      assert(a.length, "Attempting to fetch the front of an empty array of " ~
             T.stringof);
      size_t i = 0;
      return decode(a, i);
    }

We rip out that front() entirely. The result is *not* technically a range...yet! We could call it a protorange.

Then we provide two functions:

    auto decode(someStringProtoRange) {...}
    auto raw(someStringProtoRange) {...}

These convert the protoranges into actual ranges by adding the missing front() function. The 'decode' adds a front() which decodes into dchar, while the 'raw' adds a front() which simply returns the raw underlying type.

I imagine the decode/raw would probably also handle any "length" property (if it exists in the protorange) accordingly.

This way, the user is forced to specify "myStringRange.decode" or "myStringRange.raw" as appropriate, otherwise myStringRange can't be used since it isn't technically a range, only a protorange.

(Naturally, ranges of dchar would always have front, since no decoding is ever needed for them anyway. For these ranges, the decode/raw funcs above would simply be no-ops.)
Mar 06 2014
next sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/6/2014 11:11 PM, Nick Sabalausky wrote:
 What about this?:

 Anywhere we currently have a front() that decodes, such as your example:

     property dchar front(T)(T[] a)  safe pure if (isNarrowString!(T[]))
    {
      assert(a.length, "Attempting to fetch the front of an empty array
 of " ~
             T.stringof);
      size_t i = 0;
      return decode(a, i);
    }

We rip out that front() entirely. The result is *not* technically a range...yet! We could call it a protorange. Then we provide two functions: auto decode(someStringProtoRange) {...} auto raw(someStringProtoRange) {...} These convert the protoranges into actual ranges by adding the missing front() function. The 'decode' adds a front() which decodes into dchar, while the 'raw' adds a front() which simply returns the raw underlying type. I imagine the decode/raw would probably also handle any "length" property (if it exists in the protorange) accordingly. This way, the user is forced to specify "myStringRange.decode" or "myStringRange.raw" as appropriate, otherwise myStringRange can't be used since it isn't technically a range, only a protorange. (Naturally, ranges of dchar would always have front, since no decoding is ever needed for them anyway. For these ranges, the decode/raw funcs above would simply be no-ops.)

Of course, I just realized that these front()s can't be added unless there's already a front to be called in the first place... So instead of ripping out the current front() functions entirely, we replace "front" with some sort of "rawFront" which the raw/decode versions of front() can query in order to provide actual decoding/non-decoding ranges.
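A rough sketch of the two adapters, to show the shape of the idea rather than a real design. Everything here is hypothetical: the struct names RawChars and DecodedChars are made up, and the decoding helper is called `decoded` rather than `decode` so it doesn't collide with std.utf.decode. Since Phobos already treats string as a range, this can't show a true protorange; it only shows the two explicit views.

```d
import std.range : walkLength;
import std.utf : decode, stride;

// Hypothetical adapter: front() returns the raw UTF-8 code unit.
struct RawChars
{
    string s;
    @property bool empty() { return s.length == 0; }
    @property char front() { return s[0]; }
    void popFront() { s = s[1 .. $]; }
}

// Hypothetical adapter: front() decodes one code point.
struct DecodedChars
{
    string s;
    @property bool empty() { return s.length == 0; }
    @property dchar front() { size_t i = 0; return decode(s, i); }
    // stride() gives the number of code units in the sequence at index 0
    void popFront() { s = s[stride(s, 0) .. $]; }
}

auto raw(string s)     { return RawChars(s); }
auto decoded(string s) { return DecodedChars(s); }

void main()
{
    string s = "h\u00E9llo"; // U+00E9 is two code units in UTF-8
    assert(s.raw.walkLength == 6);     // user asked for code units
    assert(s.decoded.walkLength == 5); // user asked for code points
}
```

The point of the design is visible at the call site: the caller has to write .raw or .decoded, so the cost of decoding is never paid by accident.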
Mar 06 2014
prev sibling next sibling parent reply "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Friday, 7 March 2014 at 04:11:15 UTC, Nick Sabalausky wrote:
 What about this?:

 Anywhere we currently have a front() that decodes, such as your 
 example:

    property dchar front(T)(T[] a)  safe pure if 
 (isNarrowString!(T[]))
   {
     assert(a.length, "Attempting to fetch the front of an 
 empty array
 of " ~
            T.stringof);
     size_t i = 0;
     return decode(a, i);
   }

We rip out that front() entirely. The result is *not* technically a range...yet! We could call it a protorange. Then we provide two functions: auto decode(someStringProtoRange) {...} auto raw(someStringProtoRange) {...} These convert the protoranges into actual ranges by adding the missing front() function. The 'decode' adds a front() which decodes into dchar, while the 'raw' adds a front() which simply returns the raw underlying type. I imagine the decode/raw would probably also handle any "length" property (if it exists in the protorange) accordingly. This way, the user is forced to specify "myStringRange.decode" or "myStringRange.raw" as appropriate, otherwise myStringRange can't be used since it isn't technically a range, only a protorange. (Naturally, ranges of dchar would always have front, since no decoding is ever needed for them anyway. For these ranges, the decode/raw funcs above would simply be no-ops.)

Strings can be iterated over by code unit, code point, grapheme, grapheme cluster (?), words, sentences, lines, paragraphs, and potentially other things. Therefore, it makes sense to require the same for ranges of dchar, too. Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni.
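The first three levels can be compared on one small input. This is only a sketch; it assumes a Phobos recent enough to ship std.utf.byCodeUnit next to std.uni.byGrapheme, and it relies on the default (auto-decoding) iteration for the code point count.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one user-perceived character
    string s = "e\u0301";

    assert(s.byCodeUnit.walkLength == 3); // 1 + 2 UTF-8 code units
    assert(s.walkLength == 2);            // two code points
    assert(s.byGrapheme.walkLength == 1); // one grapheme
}
```

Three different answers for the "length" of the same string, which is why the iteration level is worth naming explicitly.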
Mar 09 2014
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 6:34 AM, Jakob Ovrum wrote:
 On Sunday, 9 March 2014 at 13:08:05 UTC, Marc Schütz wrote:
 Also, `byCodeUnit` and `byCodePoint` would probably be better names
 than `raw` and `decode`, to match the already existing `byGrapheme` in
 std.uni.

There already is a std.uni.byCodePoint. It is a higher order range that accepts ranges of graphemes and ranges of code points (such as strings).

noice
 `byCodeUnit` is essentially std.string.representation.

Actually not because for reasons that are unclear to me people really want the individual type to be char, not ubyte. Andrei
Mar 09 2014
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/9/2014 1:26 PM, Andrei Alexandrescu wrote:
 On 3/9/14, 6:34 AM, Jakob Ovrum wrote:

 `byCodeUnit` is essentially std.string.representation.

Actually not because for reasons that are unclear to me people really want the individual type to be char, not ubyte.

Probably because char *is* D's type for UTF-8 code units.
Mar 09 2014
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm gmx.net>" wrote:
 Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw`
 and `decode`, to match the already existing `byGrapheme` in std.uni.

I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc.
Mar 09 2014
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/9/2014 6:31 PM, Walter Bright wrote:
 On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm gmx.net>" wrote:
 Also, `byCodeUnit` and `byCodePoint` would probably be better names
 than `raw`
 and `decode`, to match the already existing `byGrapheme` in std.uni.

I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc.

'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else:

    string  str;
    wstring wstr;
    dstring dstr;

    (str|wchar|dchar).byChar  // Always range of char
    (str|wchar|dchar).byWchar // Always range of wchar
    (str|wchar|dchar).byDchar // Always range of dchar

    str.representation  // Range of ubyte
    wstr.representation // Range of ushort
    dstr.representation // Range of uint

    str.byCodeUnit  // Range of char
    wstr.byCodeUnit // Range of wchar
    dstr.byCodeUnit // Range of dchar
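For what it's worth, the table can be checked at compile time. This sketch assumes a Phobos that provides byChar, byWchar, byDchar and byCodeUnit in std.utf (later releases do; whether yours does is an assumption). Elem is a made-up helper alias that strips qualifiers from the element type.

```d
import std.range : ElementType;
import std.string : representation;
import std.traits : Unqual;
import std.utf : byChar, byCodeUnit, byDchar, byWchar;

// Hypothetical helper: element type with const/immutable stripped
alias Elem(R) = Unqual!(ElementType!R);

void main()
{
    string  str  = "a";
    wstring wstr = "a"w;
    dstring dstr = "a"d;

    // by*char: the caller picks the encoding, whatever the source is
    static assert(is(Elem!(typeof(str.byChar))   == char));
    static assert(is(Elem!(typeof(wstr.byChar))  == char));
    static assert(is(Elem!(typeof(dstr.byWchar)) == wchar));
    static assert(is(Elem!(typeof(str.byDchar))  == dchar));

    // representation: same bits, integer element types
    static assert(is(Elem!(typeof(str.representation))  == ubyte));
    static assert(is(Elem!(typeof(wstr.representation)) == ushort));
    static assert(is(Elem!(typeof(dstr.representation)) == uint));

    // byCodeUnit: the source's own code unit type
    static assert(is(Elem!(typeof(str.byCodeUnit))  == char));
    static assert(is(Elem!(typeof(wstr.byCodeUnit)) == wchar));
    static assert(is(Elem!(typeof(dstr.byCodeUnit)) == dchar));
}
```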
Mar 09 2014
next sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/10/2014 12:19 AM, Nick Sabalausky wrote:
 (str|wchar|dchar).byChar  // Always range of char
 (str|wchar|dchar).byWchar // Always range of wchar
 (str|wchar|dchar).byDchar // Always range of dchar

Erm, naturally I meant "(str|wstr|dstr)"
Mar 09 2014
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/9/2014 9:19 PM, Nick Sabalausky wrote:
 On 3/9/2014 6:31 PM, Walter Bright wrote:
 On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm gmx.net>" wrote:
 Also, `byCodeUnit` and `byCodePoint` would probably be better names
 than `raw`
 and `decode`, to match the already existing `byGrapheme` in std.uni.

I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc.

'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else: string str; wstring wstr; dstring dstr; (str|wchar|dchar).byChar // Always range of char (str|wchar|dchar).byWchar // Always range of wchar (str|wchar|dchar).byDchar // Always range of dchar str.representation // Range of ubyte wstr.representation // Range of ushort dstr.representation // Range of uint str.byCodeUnit // Range of char wstr.byCodeUnit // Range of wchar dstr.byCodeUnit // Range of dchar

I don't see much point to the latter 3.
Mar 09 2014
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/10/2014 12:23 AM, Walter Bright wrote:
 On 3/9/2014 9:19 PM, Nick Sabalausky wrote:
 On 3/9/2014 6:31 PM, Walter Bright wrote:
 On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm gmx.net>" wrote:
 Also, `byCodeUnit` and `byCodePoint` would probably be better names
 than `raw`
 and `decode`, to match the already existing `byGrapheme` in std.uni.

I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc.

'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else: string str; wstring wstr; dstring dstr; (str|wchar|dchar).byChar // Always range of char (str|wchar|dchar).byWchar // Always range of wchar (str|wchar|dchar).byDchar // Always range of dchar str.representation // Range of ubyte wstr.representation // Range of ushort dstr.representation // Range of uint str.byCodeUnit // Range of char wstr.byCodeUnit // Range of wchar dstr.byCodeUnit // Range of dchar

I don't see much point to the latter 3.

Do you mean:

1. You don't see the point to iterating by code unit?
2. You don't see the point to 'byCodeUnit' if we have 'representation'?
3. You don't see the point to 'byCodeUnit' if we have 'byChar/byWchar/byDchar'?
4. You don't see the point to having 'byCodeUnit' work on UTF-32 dstrings?

Responses:

1. Iterating by code unit: Useful for tweaking performance anytime decoding is unnecessary. For example, parsing a grammar where the bulk of the keywords and operators are ASCII. (Occasional uses of Unicode, like Unicode whitespace, can of course be handled easily enough by the lexer FSM).

2. 'byCodeUnit' if we have 'representation': This one I have trouble answering since I'm still unclear on the purpose of 'representation' (I wasn't even aware of it until a few days ago.) I've been assuming there's some specific use-case I've overlooked where it's useful to iterate by code unit *while* treating the code units as if they weren't UTF-8/16/32 at all. But since 'representation' is called *on* a string/wstring/dstring, they should already be UTF-8/16/32 anyway, not some other encoding that would necessitate using integer types. Or maybe it's just for working around problems with the auto-verification being too eager (I've run into those)? I admit I don't quite get 'representation'.

3. 'byCodeUnit' if we have 'byChar/byWchar/byDchar': To avoid a "static if" chain every time you want to use code units inside generic code. Also, so in non-generic code you can change your data type without updating instances of 'by*char'.

4. Having 'byCodeUnit' work on UTF-32 dstrings: So generic code working on code units doesn't have to special-case UTF-32.
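A small sketch of points 3 and 4, assuming std.utf.byCodeUnit is available. The function name countSpaces is made up for illustration. One template body serves all three string types, with no static if on the character type and no decoding.

```d
import std.utf : byCodeUnit;

// Hypothetical helper: counts ASCII spaces without decoding.
// Safe for UTF-8/16/32 alike, because an ASCII code unit never
// appears inside a multi-unit sequence.
size_t countSpaces(S)(S s)
{
    size_t n;
    foreach (c; s.byCodeUnit)
        if (c == ' ')
            ++n;
    return n;
}

void main()
{
    assert(countSpaces("a b c") == 2);       // string
    assert(countSpaces("\u00E9 \u00E8"w) == 1); // wstring
    assert(countSpaces("x  y"d) == 2);       // dstring
}
```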
Mar 10 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/10/2014 12:09 AM, Nick Sabalausky wrote:
 On 3/10/2014 12:23 AM, Walter Bright wrote:
 On 3/9/2014 9:19 PM, Nick Sabalausky wrote:
 On 3/9/2014 6:31 PM, Walter Bright wrote:
 On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm gmx.net>" wrote:
 Also, `byCodeUnit` and `byCodePoint` would probably be better names
 than `raw`
 and `decode`, to match the already existing `byGrapheme` in std.uni.

I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc.

'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else: string str; wstring wstr; dstring dstr; (str|wchar|dchar).byChar // Always range of char (str|wchar|dchar).byWchar // Always range of wchar (str|wchar|dchar).byDchar // Always range of dchar str.representation // Range of ubyte wstr.representation // Range of ushort dstr.representation // Range of uint str.byCodeUnit // Range of char wstr.byCodeUnit // Range of wchar dstr.byCodeUnit // Range of dchar

I don't see much point to the latter 3.

Do you mean: 1. You don't see the point to iterating by code unit? 2. You don't see the point to 'byCodeUnit' if we have 'representation'? 3. You don't see the point to 'byCodeUnit' if we have 'byChar/byWchar/byDchar'? 4. You don't see the point to having 'byCodeUnit' work on UTF-32 dstrings?

(3)
 3. 'byCodeUnit' if we have 'byChar/byWchar/byDchar': To avoid a "static if"
 chain every time you want to use code units inside generic code. Also, so in
 non-generic code you can change your data type without updating instances of
 'by*char'.

Just not sure I see a use for that.
Mar 10 2014
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/9/2014 6:34 AM, Jakob Ovrum wrote:
 `byCodeUnit` is essentially std.string.representation.

Not at all. std.string.representation takes a string and casts it to the corresponding ubyte, ushort, uint string. It doesn't work at all with InputRange!char
Mar 09 2014
prev sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Sunday, 9 March 2014 at 13:08:05 UTC, Marc Schütz wrote:
 Also, `byCodeUnit` and `byCodePoint` would probably be better 
 names than `raw` and `decode`, to match the already existing 
 `byGrapheme` in std.uni.

There already is a std.uni.byCodePoint. It is a higher order range that accepts ranges of graphemes and ranges of code points (such as strings). `byCodeUnit` is essentially std.string.representation.
Mar 09 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 I'd rather fix the compiler's codegen than add a pragma.

But a standard common intrinsic to detect the overflow efficiently could be useful. Bye, bearophile
Mar 06 2014
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Mar 06, 2014 at 08:19:18PM -0800, Walter Bright wrote:
 On 3/6/2014 8:01 PM, Adam D. Ruppe wrote:
BTW you know what would help this? A pragma we can attach to a struct
which makes it a very thin value type.

I'd rather fix the compiler's codegen than add a pragma.

From what I understand, structs are *supposed* to be thin value types. I

compiler), and doesn't have complicated semantics like dtors and stuff like that, then it should be treated like a POD (passed in registers, etc). T -- Ruby is essentially Perl minus Wall.
Mar 06 2014
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Mar-2014 06:37, Walter Bright пишет:
 In "Lots of low hanging fruit in Phobos" the issue came up about the
 automatic encoding and decoding of char ranges.

 Throughout D's history, there are regular and repeated proposals to
 redesign D's view of char[] to pretend it is not UTF-8, but UTF-32. I.e.
 so D will automatically generate code to decode and encode on every
 attempt to index char[].

...
 Is there any hope of fixing this?

Where have you been when it was introduced? :) -- Dmitry Olshansky
Mar 07 2014
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 2:27 AM, Dmitry Olshansky wrote:
 Where have you been when it was introduced? :)

It slipped by me. What can I say? I'm not the only committer :-) But after spending non-trivial time suffering as auto-decode blasted my kingdom, I've concluded that it needs to die. Working around it is not easy. I know that auto-decode has negatively impacted your regex, too. Basically, auto-decode is like booking a flight from Seattle to San Francisco with a plane change in Atlanta.
Mar 07 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Mar-2014 14:41, Walter Bright пишет:
 On 3/7/2014 2:27 AM, Dmitry Olshansky wrote:
 Where have you been when it was introduced? :)

It slipped by me. What can I say? I'm not the only committer :-) But after spending non-trivial time suffering as auto-decode blasted my kingdom, I've concluded that it needs to die. Working around it is not easy.

That seems to be the biggest problem, it's an overriding default that is very hard to "turn off" and retain nice and clear generic view of stuff.
 I know that auto-decode has negatively impacted your regex, too.

No, technically, I knew what I was doing and that decode call was explicit. It just turned out to set a bar on the minimum time budget to do X with a string, and it's too high. What really got nasty is multiple re-decoding of the same piece as the engine backtracks to try earlier alternatives. -- Dmitry Olshansky
Mar 07 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 11:43 AM, Dmitry Olshansky wrote:
 No, technically, I knew what I was doing and that decode call was explicit.

Ah right, I misremembered. Thanks for the correction.
Mar 07 2014
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 07 Mar 2014 05:41:18 -0500, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/7/2014 2:27 AM, Dmitry Olshansky wrote:
 Where have you been when it was introduced? :)

It slipped by me. What can I say? I'm not the only committer :-)

No, this is intrinsic in the problem of treating strings as ranges of dchar. This one function is a symptom, not the problem. -Steve
Mar 07 2014
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Friday, 7 March 2014 at 04:01:15 UTC, Adam D. Ruppe wrote:
 BTW you know what would help this? A pragma we can attach to a 
 struct which makes it a very thin value type.

 pragma(thin_struct)
 struct A {
    int a;
    int foo() { return a; }
    static A get() { return A(10); }
 }

 void test() {
     A a = A.get();
     printf("%d", a.foo());
 }

 With the pragma, A would be completely indistinguishable from 
 int in all ways.

 What do I mean?
 $ dmd -release -O -inline test56 -c

 Let's look at A.foo:

 A.foo:
    0:   55                      push   ebp
    1:   8b ec                   mov    ebp,esp
    3:   50                      push   eax
    4:   8b 00                   mov    eax,DWORD PTR [eax] ; 
 waste!
    6:   8b e5                   mov    esp,ebp
    8:   5d                      pop    ebp
    9:   c3                      ret


 It is line four that bugs me: the struct is passed as a 
 *pointer*, but its only contents are an int, which could just 
 as well be passed as a value. Let's compare it to an identical 
 function in operation:

 int identity(int a) { return a; }

 00000000 <_D6test568identityFiZi>:
    0:   55                      push   ebp
    1:   8b ec                   mov    ebp,esp
    3:   83 ec 04                sub    esp,0x4
    6:   c9                      leave
    7:   c3                      ret

 lol it *still* wastes time, setting up a stack frame for 
 nothing. But we could just as well write asm { naked; ret; } 
 and it would work as expected: the argument is passed in EAX 
 and the return value is expected in EAX. The function doesn't 
 actually have to do anything.

    struct A {
        int a;
        //int foo() { return a; }
        static A get() { return A(10); }
    }

    int foo(A a) { return a.a; }

    printf("%d", a.foo());

Now it's passed by value.

Though, I needed checked arithmetic only twice: for cast from long to int and for cast from double to long. If you expect your number type to overflow, you probably chose the wrong type.
Mar 07 2014
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 02:37:11 UTC, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up 
 about the automatic encoding and decoding of char ranges.

 Throughout D's history, there are regular and repeated 
 proposals to redesign D's view of char[] to pretend it is not 
 UTF-8, but UTF-32. I.e. so D will automatically generate code 
 to decode and encode on every attempt to index char[].

I'm glad I'm not the only one who feels this way. Implicit decoding must die.

I strongly believe that implicit decoding of character points in std.range has been a mistake.

- Algorithms such as "countUntil" will count code points. These numbers are useless for slicing, and can introduce hard-to-find bugs.

- In lots of places, I've discovered that Phobos did UTF decoding (thus murdering performance) when it didn't need to. Such cases included format (now fixed), appender (now fixed), startsWith (now fixed - recently), skipOver (still unfixed). These have caused latent bugs in my programs that happened to be fed non-UTF data. There's no reason why D should fail on non-UTF data if it has no reason to decode it in the first place! These failures have only served to identify places in Phobos where redundant decoding was occurring.

Furthermore, it doesn't actually solve anything completely! The only thing it solves is a subset of cases for a subset of languages!

People want to look at a string "character by character". If a Unicode code point is a character in your language and alphabet, I'm really happy for you, but that's not how it is for everyone. Combining marks, complex scripts etc. make this point just a fallacy that in the end will cause programmers to make mistakes that will affect certain users somewhere.

Why do people want to look at individual characters? There are a lot of misconceptions about Unicode, and I think some of that applies here.

- Do you want to split a string by whitespace? Some languages have no notion of whitespace. What do you need it for? Line wrapping? Employ the Unicode line-breaking algorithm instead.

- Do you want to uppercase the first letter of a string? Some languages have no notion of letter case, and some use it for different reasons. Furthermore, even languages with a Latin-based alphabet may not have 1:1 mapping for case, e.g. the German ß letter.

- Do you want to count how wide a string will be in a fixed-width font? Wrong... Combining and control characters, zero-width whitespace, etc. will render this approach futile.

- Do you want to split or flush a stream to a character device at a point so that there's no garbage? I believe this is the case in TDPL's mention of the subject. Again, combining characters or complex scripts will still be broken by this approach.

You need to either go all-out and provide complete implementations of the relevant Unicode algorithms to perform tasks such as the above that will work in all locales, or you need to draw a line somewhere for which languages, alphabets and locales you want to support in your program. D's line is drawn at the point where it considers that code points == characters, however the outcome of this is made clear nowhere in its documentation, and for such an arbitrary decision (from a cultural point of view), it is embedded too deep into the language itself.

With std.ascii, at least, it's clear to the user that the functions there will only work with English or languages using the same alphabet.

This doesn't apply universally. There are still cases like, e.g., regular expression ranges. [a-z] makes sense in English, and [а-я] makes sense in Russian, but I don't think that makes sense for all languages. However, for the most part, I think implicit decoding must be axed, and instead we need implementations of Unicode algorithms and the documentation to instruct users why and how to use them.
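The countUntil point from the first list can be shown in a few lines. A minimal sketch, assuming the auto-decoding behaviour of std.algorithm.searching.countUntil on narrow strings and the code-unit index returned by std.string.indexOf:

```d
import std.algorithm.searching : countUntil;
import std.string : indexOf;

void main()
{
    string s = "h\u00E9llo"; // bytes: 'h', 0xC3, 0xA9, 'l', 'l', 'o'

    auto cp = s.countUntil('l'); // counts code points: 'h' and U+00E9
    auto cu = s.indexOf('l');    // index in code units

    assert(cp == 2);
    assert(cu == 3);

    // The code-unit index slices correctly...
    assert(s[cu .. $] == "llo");
    // ...while the code-point count lands on a UTF-8 continuation byte,
    // i.e. in the middle of the encoded U+00E9.
    assert((s[cp] & 0xC0) == 0x80);
}
```

Nothing fails on pure ASCII input, which is what makes this kind of bug hard to find.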
Mar 07 2014
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/9/2014 11:21 AM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 12:24:11 UTC, ponce wrote:
 - In lots of places, I've discovered that Phobos did UTF decoding
 (thus murdering performance) when it didn't need to. Such cases
 included format (now fixed), appender (now fixed), startsWith (now
 fixed - recently), skipOver (still unfixed). These have caused latent
 bugs in my programs that happened to be fed non-UTF data. There's no
 reason for why D should fail on non-UTF data if it has no reason to
 decode it in the first place! These failures have only served to
 identify places in Phobos where redundant decoding was occurring.

With all due respect, the D string type is exclusively for UTF-8 strings. If it is not valid UTF-8, it should never have been a D string in the first place. In the other cases, ubyte[] is there.

This is an arbitrary self-imposed limitation caused by the choice in how strings are handled in Phobos.

Yea, I've had problems before - completely unnecessary problems that were *not* helpful or indicative of latent bugs - which were a direct result of Phobos being overly pedantic and eager about UTF validation. And yet the implicit UTF validation has never actually *helped* me in any way.
Mar 09 2014
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/10/2014 6:21 AM, ponce wrote:
 On Sunday, 9 March 2014 at 21:14:30 UTC, Nick Sabalausky wrote:
 Yea, I've had problems before - completely unnecessary problems that
 were *not* helpful or indicative of latent bugs - which were a direct
 result of Phobos being overly pedantic and eager about UTF validation.
 And yet the implicit UTF validation has never actually *helped* me in
 any way.

 self-imposed limitation


I find this article very telling about why strings should be converted to UTF-8 as often as possible. http://www.utf8everywhere.org/ I agree 100% with its content; it's impossibly hard to have sane handling of encodings on Windows (even more so in a team) without following the drastic rules the article lays out.

I may have missed it, but I don't see where it says anything about validation or immediate sanitation of invalid sequences. It's mostly "UTF-16 sucks and so does Windows" (not that I'm necessarily disagreeing with it). (ot: Kinda wish they hadn't used such a hard to read font...)
Mar 10 2014
prev sibling next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Friday, 7 March 2014 at 03:32:50 UTC, H. S. Teoh wrote:
 On Thu, Mar 06, 2014 at 06:59:36PM -0800, Walter Bright wrote:
 On 3/6/2014 6:54 PM, bearophile wrote:
Walter Bright:
Is there any hope of fixing this?

I don't think we can change that in D2. You can change it in D3.

You use ranges a lot. Would it break any of your code?

This is very high risk change IMO.

+1 This will be the most disruptive change in D's history...
Mar 07 2014
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 3/7/14, Vladimir Panteleev <vladimir thecybershadow.net> wrote:
 - Do you want to split a string by whitespace?
 - Do you want to uppercase the first letter of a string?
 - Do you want to count how wide a string will be in a fixed-width
 font?
 - Do you want to split or flush a stream to a character device at
 a point so that there's no garbage?

We could later make a page on dlang (or the wiki) describing how to do these common things.
Mar 07 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 03:32:50 UTC, H. S. Teoh wrote:
 Calling count() on a narrow string will not return the expected 
 value, for example.

I would argue that, unless it's been made clear that the program is expected to work only for certain languages, code that relied on this was wrong in the first place.
Mar 07 2014
prev sibling next sibling parent Robert Schadek <realburner gmx.de> writes:
On 03/07/2014 12:56 PM, Vladimir Panteleev wrote:
 I'm glad I'm not the only one who feels this way. Implicit decoding
 must die.

 I strongly believe that implicit decoding of character points in
 std.range has been a mistake.

 - Algorithms such as "countUntil" will count code points. These
 numbers are useless for slicing, and can introduce hard-to-find bugs.

https://github.com/D-Programming-Language/phobos/pull/1952 https://github.com/D-Programming-Language/phobos/pull/1977
Mar 07 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 06 Mar 2014 21:37:13 -0500, Walter Bright  
<newshound2 digitalmars.com> wrote:

 Is there any hope of fixing this?

Yes, make d strings not char arrays, but a library-defined struct with an array as backing.

    auto x = "...";

compiles to =>

    auto x = string(cast(immutable(char)[])"...");

Then define string to be whatever kind of range you want in the library, with whatever functionality you want. Then if you want by-char traversal, explicitly use immutable(char)[] as x's type. And in the string range's members, we can provide whatever access we want.

Note, this also fixes foreach, and many other problems we have. Most likely code that works today will continue to work, since it's much more of a bear to type immutable(char)[] instead of string :)

-Steve
Mar 07 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 13:40:31 UTC, Michel Fortin wrote:
 if you want to parse XML then you'll need to work with code 
 points

Why is this?
Mar 07 2014
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 7 March 2014 at 04:19:16 UTC, Walter Bright wrote:
 I'd rather fix the compiler's codegen than add a pragma.

The codegen isn't broken, the current this pointer behavior is needed for full compatibility with the C ABI. It would be opt in to an ABI tweak that the caller needs to be aware of rather than a traditional optimization where the outside world would never know.
Mar 07 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Friday, 7 March 2014 at 13:56:48 UTC, Adam D. Ruppe wrote:
 On Friday, 7 March 2014 at 04:19:16 UTC, Walter Bright wrote:
 I'd rather fix the compiler's codegen than add a pragma.

The codegen isn't broken, the current this pointer behavior is needed for full compatibility with the C ABI. It would be opt in to an ABI tweak that the caller needs to be aware of rather than a traditional optimization where the outside world would never know.

We don't need C ABI compatibility for stuff that is not extern(C), do we?
Mar 07 2014
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 7 March 2014 at 10:44:46 UTC, Kagamin wrote:
 Now it's passed by value.

That won't work for operator overloading though (which is the really interesting case here).
 Though, I needed checked arithmetic only twice: for cast from 
 long to int and for cast from double to long. If you expect 
 your number type to overflow, you probably chose wrong type.

I very rarely need it too, but it is nice to have in a convenient package that is fairly efficient at the same time.
Mar 07 2014
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 7 March 2014 at 14:04:53 UTC, Dicebot wrote:
 We don't need C ABI compatibility for stuff that is not 
 extern(C), do we?

That's a good point, though personally I'd still like some way to magic it up, even in extern(C). Consider the example of library typedef. If C did:

    typedef void* HANDLE;

and D did

    struct HANDLE { void* foo; alias foo this; }

it is almost the same, but then when you declare

    HANDLE OpenFile(...);

it won't work since the compiler will pass a hidden struct pointer (which is exactly what C would expect if it was a typedef struct { void* } on its side too) instead of expecting the value in the accumulator as it would with the void*.
Mar 07 2014
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Friday, 7 March 2014 at 14:13:54 UTC, Adam D. Ruppe wrote:
 On Friday, 7 March 2014 at 10:44:46 UTC, Kagamin wrote:
 Now it's passed by value.

That won't work for operator overloading though (which is the really interesting case here).

Alternatively for small methods you can rely on inlining, which dereferences the argument. If the method is big, the reference is probably unimportant.
Mar 07 2014
prev sibling next sibling parent "Kagamin" <spam here.lot> writes:
On Friday, 7 March 2014 at 13:40:31 UTC, Michel Fortin wrote:
 if you want to parse XML then you'll need to work with code 
 points (in theory, in practice you might still want direct 
 access to code units for performance reasons)

AFAIK, xml control characters are all ascii, and what's between them you can slice or dup without consideration, so code units should be more than enough.
Mar 07 2014
prev sibling next sibling parent reply "Dicebot" <public dicebot.lv> writes:
I don't like it at all.

1) It is a huge breakage and you have been refusing to do one 
even for more important problems. What is about this sudden 
change of mind?

2) It is regression back to C++ days of 
no-one-cares-about-Unicode pain. Thinking about strings as 
character arrays is so natural and convenient that if 
language/Phobos won't punish you for that, it will be extremely 
widespread.

Rendering correctness is very application-specific but providing 
basic guarantees that string is not completely broken is useful.

Now real problems I see:

1) stuff like readText() returns char[] instead of requiring 
explicit default encoding

2) lack of convenient .raw property which will effectively do 
cast(ubyte[])

3) the fact that std.string always assumes unicode and never 
forwards to std.ascii for 
http://dlang.org/phobos/std_encoding.html#.AsciiString / ubyte[]
Mar 07 2014
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 7:03 AM, Dicebot wrote:
 1) It is a huge breakage and you have been refusing to do one even for more
 important problems. What is about this sudden change of mind?

1. Performance Performance Performance

2. The current behavior is surprising (it sure surprised me, I didn't notice it until I looked at the assembler to figure out why the performance sucked)

3. Weirdnesses like ElementEncodingType

4. Strange behavior differences between char[], char*, and InputRange!char types

5. Funky anomalous issues with writing OutputRange!char (the put(T) must take a dchar)
 2) lack of convenient .raw property which will effectively do cast(ubyte[])

I've done the cast as a workaround, but when working with generic code it turns out the ubyte type becomes viral - you have to use it everywhere. So all over the place you're having casts between ubyte <=> char in unexpected places. You also wind up with ugly ubyte <=> dchar casts, with the commensurate risk that you goofed and have a truncation bug. Essentially, the auto-decode makes trivial code look better, but if you're writing a more comprehensive string processing program, and care about performance, it makes a regular ugly mess of things.
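
A minimal sketch of the behavior and of the ubyte workaround described above, assuming a Phobos release in which narrow strings are still auto-decoded. It only uses std.range.primitives and std.string.representation; the sample string is an illustration, not taken from the thread.

```d
import std.range.primitives : ElementEncodingType, ElementType, front;
import std.string : representation;

void main()
{
    // As a range, a narrow string yields decoded code points...
    static assert(is(ElementType!string == dchar));
    // ...while the stored (encoded) element is still a UTF-8 code unit.
    static assert(is(ElementEncodingType!string == immutable(char)));

    string s = "\u00E9t\u00E9"; // "été": 3 code points, 5 UTF-8 code units
    assert(s.length == 5);

    // front decodes: it consumes two bytes and returns one dchar.
    assert(s.front == '\u00E9');

    // representation reinterprets the same memory as immutable(ubyte)[],
    // so front returns the first byte and no decoding takes place.
    auto raw = s.representation;
    static assert(is(typeof(raw) == immutable(ubyte)[]));
    assert(raw.front == 0xC3);
    assert(raw.length == 5);
}
```

The cost is the one mentioned above: once the data is ubyte, every generic function downstream sees ubyte rather than char.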
Mar 07 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/10/2014 6:47 AM, Dicebot wrote:
 (array literals that allocate, I will never forgive that).

It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature.
Mar 10 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/10/14, 7:07 PM, Steven Schveighoffer wrote:
 On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 3/10/2014 6:47 AM, Dicebot wrote:
 (array literals that allocate, I will never forgive that).

It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature.

I think you forget about this:

    foo(int v, int w)
    {
        auto x = [v, w];
    }

Which cannot pre-allocate.

It actually can, seeing as x is a dead assignment :o).
 That said, I would not mind if this code broke and you had to use
 array(v, w) instead, for the sake of avoiding unnecessary allocations.

Fixing that:

    int[] foo(int v, int w)
    {
        return [v, w];
    }

This one would allocate. But analyses of varying complexity may eliminate a variety of allocation patterns.

Andrei
Mar 10 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/10/14, 8:05 PM, Steven Schveighoffer wrote:
 I think you are missing what I'm saying, I don't want the allocation
 eliminated, but if we eliminate some allocations with [] and not others,
 it will be confusing. The path I'd always hoped we would go in was to
 make all array literals immutable, and make allocation of mutable arrays
 on the heap explicit.

 Adding eliding of some allocations for optimization is good, but I (and
 I think possibly Dicebot) think all array literals should not allocate.

I think so too. But that's irrelevant because arrays do allocate (at least behave as if they did) and that's how the cookie crumbles. D is a wonderful language, and is getting better literally by day. There is a lot more in using it in new and interesting ways, than in brooding about its inevitable imperfections. Andrei
Mar 10 2014
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 9:04 AM, Vladimir Panteleev wrote:
 Ideally, it would be
 clearly visible in the code that you are counting code points.

Yes.
Mar 07 2014
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 6:47 AM, "Marc Schütz" <schuetzm gmx.net>" wrote:
 On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
 2) It is regression back to C++ days of no-one-cares-about-Unicode
 pain. Thinking about strings as character arrays is so natural and
 convenient that if language/Phobos won't punish you for that, it will
 be extremely widespread.

Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else.

Such as giving up on that crappy language that keeps on breaking their code. Andrei
Mar 09 2014
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 7 March 2014 at 14:44:43 UTC, Kagamin wrote:
 Alternatively for small methods you can rely on inlining, which 
 dereferences the argument.

Yeah, that's usually the way to go, inlining can also avoid pushing other arguments to the stack on 32 bit which is a big win too. But you can't inline an asm function, and checking the overflow flag needs asm (or a compiler intrinsic). For the library typedef case too, this means wrapping any function that returns a struct, which is annoying if nothing else.
Mar 07 2014
prev sibling next sibling parent "Sean Kelly" <sean invisibleduck.org> writes:
I'm with Walter on this, and it's why I don't use char ranges. 
Though converting to ubyte feels weird.
Mar 07 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
 I don't like it at all.

 1) It is a huge breakage

Can we look at some example situations that this will break?
 and you have been refusing to do one even for more important 
 problems.

This is a fallacy.
 2) It is regression back to C++ days of 
 no-one-cares-about-Unicode pain. Thinking about strings as 
 character arrays is so natural and convenient that if 
 language/Phobos won't punish you for that, it will be extremely 
 widespread.

Thinking about dstrings as character arrays is less flawed only to a certain extent.
 Now real problems I see:

 1) stuff like readText() returns char[] instead of requiring 
 explicit default encoding

 2) lack of convenient .raw property which will effectively do 
 cast(ubyte[])

 3) the fact that std.string always assumes unicode and never 
 forwards to std.ascii for 
 http://dlang.org/phobos/std_encoding.html#.AsciiString / ubyte[]

I think these are fixable without breaking anything? So why not go for it? The first two sound trivial (.raw can be an UFCS property).
Mar 07 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
 I don't like it at all.

 1) It is a huge breakage

Can we look at some example situations that this will break?

Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string?
 and you have been refusing to do one even for more important 
 problems.

This is a fallacy.

Ok :)
 2) It is regression back to C++ days of 
 no-one-cares-about-Unicode pain. Thinking about strings as 
 character arrays is so natural and convenient that if 
 language/Phobos won't punish you for that, it will be 
 extremely widespread.

Thinking about dstrings as character arrays is less flawed only to a certain extent.

Sure. But I find this extent practical enough to make the difference. It is good compromise between perfectly correct (and very slow) string processing and having your program unusable with anything but basic latin symbol set.
 Now real problems I see:

 1) stuff like readText() returns char[] instead of requiring 
 explicit default encoding

 2) lack of convenient .raw property which will effectively do 
 cast(ubyte[])

 3) the fact that std.string always assumes unicode and never 
 forwards to std.ascii for 
 http://dlang.org/phobos/std_encoding.html#.AsciiString / 
 ubyte[]

I think these are fixable without breaking anything? So why not go for it? The first two sound trivial (.raw can be an UFCS property).

(1) will likely require deprecation (== breakage) of old interface, but yes, those are relatively trivial. It just has not been important enough to me to spend time on pushing it. Still struggling to finish my template argument list proposal :(
Mar 07 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
 On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev 
 wrote:
 Can we look at some example situations that this will break?

Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string?

This is a pretty fragile design in the first place, since we use the same basic type (integers) to count two different things (code units / code points). Code that relies on this behavior would need to be explicitly tested with Unicode data to be sure that it works correctly - otherwise, it will only appear at a glance that it works right if it's only tested with ASCII.

Correct code where these indices never left the equation will not be affected, e.g.:

    auto s = "日本語";
    auto x = s.countUntil("本語"); // was 1, will be 3
    s = s.drop(x);
    assert(s == "本語"); // still OK
 Thinking about dstrings as character arrays is less flawed 
 only to a certain extent.

Sure. But I find this extent practical enough to make the difference. It is good compromise between perfectly correct (and very slow) string processing and having your program unusable with anything but basic latin symbol set.

I think that if we are to draw a line somewhere on what to support and not, the decision should not be embedded as deep into the language. Ideally, it would be clearly visible in the code that you are counting code points.
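
The two kinds of index discussed in this post can be put side by side. This is a sketch that assumes a Phobos release with auto-decoding still in place, where countUntil counts code points and std.string.indexOf returns a code-unit offset; the variable names cp and cu are illustrative.

```d
import std.algorithm.searching : countUntil;
import std.range : drop, walkLength;
import std.string : indexOf;

void main()
{
    string s = "日本語";             // 3 code points, 9 UTF-8 code units
    assert(s.length == 9);
    assert(s.walkLength == 3);

    auto cp = s.countUntil("本語");  // counts decoded code points
    auto cu = s.indexOf("本語");     // offset in code units
    assert(cp == 1);
    assert(cu == 3);

    // A code-unit offset is a valid slice bound.
    assert(s[cu .. $] == "本語");

    // A code-point count is only meaningful through range primitives.
    // Slicing with it (s[cp .. $]) would cut inside a multi-byte sequence.
    assert(s.drop(cp) == "本語");
}
```

Both values are plain integers, which is why mixing them up compiles without complaint.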
Mar 07 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Friday, 7 March 2014 at 17:04:30 UTC, Vladimir Panteleev wrote:
 I think that if we are to draw a line somewhere on what to 
 support and not, the decision should not be embedded as deep 
 into the language. Ideally, it would be clearly visible in the 
 code that you are counting code points.

Well if you consider really breaking changes, simply prohibiting plain random access to char[] and forcing to use either .raw or .decode is one thing I'd love to see (with .byGrapheme as library cherry on top)
Mar 07 2014
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Mar 07, 2014 at 05:08:02PM +0000, Dicebot wrote:
 On Friday, 7 March 2014 at 17:04:30 UTC, Vladimir Panteleev wrote:
I think that if we are to draw a line somewhere on what to support
and not, the decision should not be embedded as deep into the
language. Ideally, it would be clearly visible in the code that
you are counting code points.

Well if you consider really breaking changes, simply prohibiting plain random access to char[] and forcing to use either .raw or .decode is one thing I'd love to see (with .byGrapheme as library cherry on top)

I don't understand what advantage this would bring. T -- Frank disagreement binds closer than feigned agreement.
Mar 07 2014
prev sibling next sibling parent "Chris" <wendlec tcd.ie> writes:
I only hope it won't break my code. It mainly deals with string / 
character processing and our project in D is now almost ready for 
take off (at least for a beta flight). It deals with characters 
like "é", it is not dealing with English input. Hope the landing 
will be soft!
Mar 07 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Friday, 7 March 2014 at 17:39:41 UTC, H. S. Teoh wrote:
 Well if you consider really breaking changes, simply 
 prohibiting
 plain random access to char[] and forcing to use either .raw or
 .decode is one thing I'd love to see (with .byGrapheme as 
 library
 cherry on top)

I don't understand what advantage this would bring.

Making sure that whatever interpretation is chosen by the programmer it is actually a conscious choice and he does not hold any false illusions.
Mar 07 2014
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/6/14, 6:37 PM, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up about the
 automatic encoding and decoding of char ranges.

 Is there any hope of fixing this?

There's nothing to fix. Allow me to enumerate the functions of std.algorithm and how they work today and how they'd work with the proposed change. Let s be a variable of some string type.

1. s.all!(x => x == 'é') currently works as expected. Proposed: fails silently.

2. s.any!(x => x == 'é') currently works as expected. Proposed: fails silently.

3. s.canFind!(x => x == 'é') currently works as expected. Proposed: fails silently.

4. s.canFind('é') currently works as expected. Proposed: fails silently.

5. s.count() currently works as expected. Proposed: fails silently.

6. s.count!((a, b) => std.uni.toLower(a) == std.uni.toLower(b))("é") currently works as expected (with the known issues of lowercase conversion). Proposed: fails silently.

7. s.count('é') currently works as expected. Proposed: fails silently.

8. s.countUntil("a") currently works as expected. Proposed: fails silently. This applies to all variations of countUntil.

9. s.endsWith('é') currently works as expected. Proposed: fails silently.

10. s.find('é') currently works as expected. Proposed: fails silently. This applies to other variations of find that include custom predicates.

11. ...

I went down std.algorithm in the order listed in its documentation and found pernicious issues with almost every single algorithm.

I designed the range behavior of strings after much thinking and consideration back in the day when I designed std.algorithm. It was painfully obvious (but it seems to have been forgotten now that it's working so well) that approaching strings as arrays of char[] would break almost every single algorithm leaving us essentially in the pre-UTF C++aveman era.

Making strings bidirectional ranges has been a very good choice within the constraints. There was already a string type, and that was immutable(char)[], and a bunch of code depended on that definition.

Clearly one might argue that their app has no business dealing with diacriticals or Asian characters. But that's the typical provincial view that marred many languages' approach to UTF and internationalization. If you know your string is ASCII, the remedy is simple - don't use char[] and friends. From day 1, the type "char" was meant to mean "code unit of UTF characters".

So please ponder the above before going to do surgery on the patient that's going to kill him.

Andrei
Mar 07 2014
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu 
wrote:
 Allow me to enumerate the functions of std.algorithm and how 
 they work today and how they'd work with the proposed change. 
 Let s be a variable of some string type.

 s.canFind('é') currently works as expected.

No, it doesn't.

    import std.algorithm;

    void main()
    {
        auto s = "cassé";
        assert(s.canFind('é'));
    }

That's the whole problem - all this hot steam and it still does not work properly. Because it can't - not without pulling in all of the Unicode algorithms implicitly, and that would be much worse.
 I went down std.algorithm in the order listed in its 
 documentation and found pernicious issues with almost every 
 single algorithm.

All of your examples are variations of one and the same case: searching for a non-ASCII dchar or dchar literal. How often does this pattern occur in real programs? I think the only real metric is to try the change and find out.
 Clearly one might argue that their app has no business dealing 
 with diacriticals or Asian characters. But that's the typical 
 provincial view that marred many languages' approach to UTF and 
 internationalization.

So is yours, if you think that making everything magically a dchar is going to solve all problems. The TDPL example only showcases the problem. Yes, it works with Swedish. Now try it again with Sanskrit.
Mar 07 2014
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 12:43 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu wrote:
 Allow me to enumerate the functions of std.algorithm and how they work
 today and how they'd work with the proposed change. Let s be a
 variable of some string type.

 s.canFind('é') currently works as expected.

No, it doesn't.

    import std.algorithm;

    void main()
    {
        auto s = "cassé";
        assert(s.canFind('é'));
    }

worksforme
Mar 07 2014
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }

Hm, I'm not following? Works perfectly fine on my system?

Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

Yup, the grapheme issue. This should work.

    import std.algorithm, std.uni;

    void main()
    {
        auto s = "cassé";
        assert(s.byGrapheme.canFind('é'));
    }

It doesn't compile, seems like a library bug.

Graphemes are the next level of Nirvana above code points, but that doesn't mean it's graphemes or nothing.

Andrei
Mar 07 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Mar-2014 05:23, Andrei Alexandrescu пишет:
 On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }

Hm, I'm not following? Works perfectly fine on my system?

Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

Yup, the grapheme issue. This should work.

    import std.algorithm, std.uni;

    void main()
    {
        auto s = "cassé";
        assert(s.byGrapheme.canFind('é'));
    }

It doesn't compile, seems like a library bug.

Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings.
 Graphemes are the next level of Nirvana above code points, but that
 doesn't mean it's graphemes or nothing.


 Andrei

-- Dmitry Olshansky
Mar 08 2014
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Mar-2014 12:09, Dmitry Olshansky пишет:
 08-Mar-2014 05:23, Andrei Alexandrescu пишет:
 On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }

Hm, I'm not following? Works perfectly fine on my system?

Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

Yup, the grapheme issue. This should work.

    import std.algorithm, std.uni;

    void main()
    {
        auto s = "cassé";
        assert(s.byGrapheme.canFind('é'));
    }

It doesn't compile, seems like a library bug.

Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings.
 Graphemes are the next level of Nirvana above code points, but that
 doesn't mean it's graphemes or nothing.


Plus it won't help the matters, you need both "é" and "cassé" to have the same normalization. -- Dmitry Olshansky
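
The normalization point can be sketched as follows, using explicit escapes so that the two spellings of "é" cannot be altered by an editor or the forum software. The names composed and decomposed are illustrative; std.uni.normalize and the NFC form are existing Phobos symbols, and the result of canFind on the first string assumes auto-decoding is in effect.

```d
import std.algorithm.searching : canFind;
import std.uni : NFC, normalize;

void main()
{
    string composed   = "cass\u00E9";  // é as one code point, U+00E9
    string decomposed = "casse\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    // Searching by code point finds the precomposed form only.
    assert(composed.canFind('\u00E9'));
    assert(!decomposed.canFind('\u00E9'));

    // After normalizing to NFC, both spellings are the same sequence.
    assert(normalize!NFC(decomposed) == composed);
    assert(normalize!NFC(decomposed).canFind('\u00E9'));
}
```

Here the normalization is an explicit, eager step chosen by the caller rather than something done implicitly during iteration.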
Mar 08 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 12:14 AM, Dmitry Olshansky wrote:
 08-Mar-2014 12:09, Dmitry Olshansky пишет:
 08-Mar-2014 05:23, Andrei Alexandrescu пишет:
 On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }

Hm, I'm not following? Works perfectly fine on my system?

Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

Yup, the grapheme issue. This should work.

    import std.algorithm, std.uni;

    void main()
    {
        auto s = "cassé";
        assert(s.byGrapheme.canFind('é'));
    }

It doesn't compile, seems like a library bug.

Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings.
 Graphemes are the next level of Nirvana above code points, but that
 doesn't mean it's graphemes or nothing.


Plus it won't help the matters, you need both "é" and "cassé" to have the same normalization.

Why? Couldn't the grapheme compare true with the character? I.e. the byGrapheme iteration normalizes on the fly.

Andrei
Mar 08 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Mar-2014 19:33, Andrei Alexandrescu пишет:
 On 3/8/14, 12:14 AM, Dmitry Olshansky wrote:
 08-Mar-2014 12:09, Dmitry Olshansky пишет:
 08-Mar-2014 05:23, Andrei Alexandrescu пишет:
 On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }

Hm, I'm not following? Works perfectly fine on my system?

Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

Yup, the grapheme issue. This should work.

    import std.algorithm, std.uni;

    void main()
    {
        auto s = "cassé";
        assert(s.byGrapheme.canFind('é'));
    }

It doesn't compile, seems like a library bug.

Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings.
 Graphemes are the next level of Nirvana above code points, but that
 doesn't mean it's graphemes or nothing.


Plus it won't help the matters, you need both "é" and "cassé" to have the same normalization.

Why? Couldn't the grapheme compare true with the character?

Iff it consists of one codepoint, it technically may.
 I.e. the
 byGrapheme iteration normalizes on the fly.

Oh crap, please no. It's not only _Slow_ but it's also horribly complicated (even in off-line, eager version). + there are 4 normalizations, of which 2 are lossy. You simply can't be serious on this one, though seeing that you introduced auto-decoding then by extension you must have proposed to normalize on the fly :) -- Dmitry Olshansky
Mar 08 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 8:08 AM, Dmitry Olshansky wrote:
 08-Mar-2014 19:33, Andrei Alexandrescu пишет:
 I.e. the
 byGrapheme iteration normalizes on the fly.

Oh crap, please no. It's not only _Slow_ but it's also horribly complicated (even in off-line, eager version). + there are 4 normalizations, of which 2 are lossy. You simply can't be serious on this one, though seeing that you introduced auto-decoding then by extension you must have proposed to normalize on the fly :)

Yah, just pushing my luck :o). I don't know much about graphemes and normalization, so leaving that stuff to you guys. Andrei
Mar 08 2014
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 12:09 AM, Dmitry Olshansky wrote:
 08-Mar-2014 05:23, Andrei Alexandrescu пишет:
 On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }

Hm, I'm not following? Works perfectly fine on my system?

Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

Yup, the grapheme issue. This should work.

    import std.algorithm, std.uni;

    void main()
    {
        auto s = "cassé";
        assert(s.byGrapheme.canFind('é'));
    }

It doesn't compile, seems like a library bug.

Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings.

Yah but I think they should support comparison with individual characters. No? Andrei
Mar 08 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Mar-2014 19:32, Andrei Alexandrescu пишет:
 On 3/8/14, 12:09 AM, Dmitry Olshansky wrote:
 08-Mar-2014 05:23, Andrei Alexandrescu пишет:
 On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }

Hm, I'm not following? Works perfectly fine on my system?

Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

Yup, the grapheme issue. This should work.

    import std.algorithm, std.uni;

    void main()
    {
        auto s = "cassé";
        assert(s.byGrapheme.canFind('é'));
    }

It doesn't compile, seems like a library bug.

Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings.

Yah but I think they should support comparison with individual characters. No?

We could add one. I don't think Grapheme interface is optimal or set in stone. The following should work as is though: s.byGrapheme.canFind(Grapheme("é"))
 Andrei

-- Dmitry Olshansky
Mar 08 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Mar-2014 20:43, Vladimir Panteleev пишет:
 On Saturday, 8 March 2014 at 15:56:08 UTC, Dmitry Olshansky wrote:
 The following should work as is though:

 s.byGrapheme.canFind(Grapheme("é"))

Doesn't work here. Not sure why.

    Grapheme(1000065, 3, 0, 33554432, [101, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 2) // last byGrapheme

vs.

    Grapheme(E9, 0, 0, 16777216, [233, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 1) // Grapheme("é")

Sounds like a bug, file it before we derailed. -- Dmitry Olshansky
Mar 08 2014
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/8/2014 12:09 AM, Dmitry Olshansky wrote:
 Because Graphemes do not auto-magically convert to dchar and back? After all
 they are just small strings.

std.uni.Grapheme is a struct, and that struct contains a string of arbitrary length. I don't know if that is the right design or not, or if a Grapheme should instead be an alias for a slice (rather than be a distinct type). Graphemes do not appear to have a 1:1 mapping with dchars, and any attempt to do so would likely be a giant mistake.
Mar 08 2014
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Mar-2014 01:15, Walter Bright пишет:
 On 3/8/2014 12:09 AM, Dmitry Olshansky wrote:
 Because Graphemes do not auto-magically convert to dchar and back?
 After all
 they are just small strings.

std.uni.Grapheme is a struct, and that struct contains a string of arbitrary length. I don't know if that is the right design or not, or if a Grapheme should instead be an alias for a slice (rather than be a distinct type).

They use small-string optimization with great success, as indeed plenty of graphemes are just 1 codepoint. Many others are just a couple.
 Graphemes do not appear to have a 1:1 mapping with dchars, and any
 attempt to do so would likely be a giant mistake.

-- Dmitry Olshansky
Mar 08 2014
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 1:15 PM, Walter Bright wrote:
 On 3/8/2014 12:09 AM, Dmitry Olshansky wrote:
 Because Graphemes do not auto-magically convert to dchar and back?
 After all
 they are just small strings.

std.uni.Grapheme is a struct, and that struct contains a string of arbitrary length. I don't know if that is the right design or not, or if a Grapheme should instead be an alias for a slice (rather than be a distinct type).

I think basic encapsulation suggests Grapheme should be a distinct type. It's a restricted slice, not just any slice.
 Graphemes do not appear to have a 1:1 mapping with dchars, and any
 attempt to do so would likely be a giant mistake.

I think they may be comparable to dchar. Andrei
Mar 08 2014
parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-08 23:50:43 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 Graphemes do not appear to have a 1:1 mapping with dchars, and any
 attempt to do so would likely be a giant mistake.

I think they may be comparable to dchar.

Dchar, aka code points, are much more clearly defined than graphemes. A quick search shows me there's more than one way to segment a string into graphemes. There are the legacy and extended boundary algorithms for general processing, and then there are some tailored algorithms that can segment code points differently depending on the locale, or other considerations.

Reference: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

There are three examples of locale-specific graphemes in the table in the section linked above. "Ch" is one of them. Quoting Wikipedia: "Ch is a digraph in the Latin script. It is treated as a letter of its own in Chamorro, Czech, Slovak, Igbo, Quechua, Guarani, Welsh, Cornish, Breton and Belarusian Łacinka alphabets."
https://en.wikipedia.org/wiki/Ch_(digraph)

Also, there are some code points that represent ligatures (such as “fl”), which are in theory two graphemes. I'm not sure what the general algorithm does with that, but depending on what you're doing (counting characters? spell checking?) you might want to split it in two.

So basically you just can't make an algorithm capable of counting letters/graphemes/characters in a universal fashion. There's no such thing as a universal grapheme segmentation algorithm, even though there is a general one. It'd be wise for any API to expose this subtlety whenever segmenting graphemes.

Text is an interesting topic for never-ending discussions.

-- 
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca
Mar 08 2014
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/8/2014 9:15 PM, Michel Fortin wrote:
 Text is an interesting topic for never-ending discussions.

It's also a good example for when non-programmers are surprised to hear that I *don't* see the world as binary "black and white" *because* of my programming experience ;) Problems like text-handling make it [painfully] obvious to programmers that reality is shades-of-grey - laymen don't often expect that!
Mar 09 2014
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 2:26 PM, H. S. Teoh wrote:
 This illustrates one of my objections to Andrei's post: by auto-decoding
 behind the user's back and hiding the intricacies of unicode from him,
 it has masked the fact that codepoint-for-codepoint comparison of a
 unicode string is not guaranteed to always return the correct results,
 due to the possibility of non-normalized strings.

 Basically, to have correct behaviour in all cases, the user must be
 aware of, and use, the Unicode collation / normalization algorithms
 prescribed by the Unicode standard.

Which is a reasonable thing to ask for. Andrei
Mar 07 2014
prev sibling next sibling parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/7/2014 6:33 PM, H. S. Teoh wrote:
 On Fri, Mar 07, 2014 at 11:13:50PM +0000, Sarath Kodali wrote:
 On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:
 +1
 In Indian languages, a character consists of one or more UNICODE
 code points. For example, in Sanskrit "ddhrya"
 http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
 consists of 7 UNICODE code points. So to search for this char I
 have to use string search.

 - Sarath

Oops, incomplete reply ... Since a single "alphabet" in Indian languages can contain multiple code-points, iterating over single code-points is like iterating over char[] for non English European languages. So decode is of no use other than decreasing the performance. A raw char[] comparison is much faster.

Yes. The more I think about it, the more auto-decoding sounds like a wrong decision. The question, though, is whether it's worth the massive code breakage needed to undo it. :-(

I'm leaning the same way too. But I also think Andrei is right that, at this point in time, it'd be a terrible move to change things so that "by code unit" is default. For better or worse, that ship has sailed. Perhaps we *can* deal with the auto-decoding problem not by killing auto-decoding, but by marginalizing it in an additive way: Convincing arguments have been made that any string-processing code which *isn't* done entirely with the official Unicode algorithms is likely wrong *regardless* of whether std.algorithm defaults to per-code-unit or per-code-point. So...How's this?: We add any of these Unicode algorithms we may be missing, encourage their use for strings, discourage use of std.algorithm for string processing, and in the meantime, just do our best to reduce unnecessary decoding wherever possible. Then we call it a day and all be happy :)
Mar 09 2014
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/9/2014 7:47 AM, w0rp wrote:
 My knowledge of Unicode pretty much just comes from having
 to deal with foreign language customers and discovering the problems
 with the code unit abstraction most languages seem to use. (Java and
 Python suffer from similar issues, but they don't really have algorithms
 in the way that we do.)

Python 2 or 3 (out of curiosity)? If you're including Python3, then that somewhat surprises me as I thought greatly improved Unicode was one of the biggest reasons for the jump from 2 to 3. (Although it isn't *completely* surprising since, as we all know far too well here, fully correct Unicode is *not* easy.)
Mar 09 2014
prev sibling parent Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-09 14:12:28 +0000, "Marc Schütz" <schuetzm gmx.net> said:

 That won't work, because your needle might be in a different 
 normalization form than your haystack, thus a byte-by-byte comparison 
 will not be able to find it.

The core of the problem is that sometimes this byte-by-byte comparison is exactly what you want; when searching for some terminal character(s) in some kind of parser for instance. Other times you want to do a proper Unicode search using Unicode comparison algorithms; when the user is searching for a particular string in a text document for instance.

The former is very easy to do with the current API. But what's the API for the latter? And how to make the correct API the obvious choice depending on the use case?

These two questions are what this thread should be about. Although not unimportant, performance of std.array.front() and whether it should decode is a secondary issue in comparison.

-- 
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca
Mar 09 2014
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Mar-2014 23:57, Andrei Alexandrescu пишет:
 On 3/6/14, 6:37 PM, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up about the
 automatic encoding and decoding of char ranges.

 Is there any hope of fixing this?

There's nothing to fix.

There is, all right. ElementEncodingType for starters.
 Allow me to enumerate the functions of std.algorithm and how they work
 today and how they'd work with the proposed change. Let s be a variable
 of some string type.

Special case was wrong though - special casing arrays of char[] and throwing all other ranges of char out the window. The amount of code to support this schizophrenia is enormous.
 Making strings bidirectional ranges has been a very good choice within
 the constraints. There was already a string type, and that was
 immutable(char)[], and a bunch of code depended on that definition.

Trying to make it work by blowing a hole in the generic range concept now seems like it wasn't worth it. -- Dmitry Olshansky
Mar 07 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 12:48 PM, Dmitry Olshansky wrote:
 07-Mar-2014 23:57, Andrei Alexandrescu пишет:
 On 3/6/14, 6:37 PM, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up about the
 automatic encoding and decoding of char ranges.

 Is there any hope of fixing this?

There's nothing to fix.

There is, all right. ElementEncodingType for starters.
 Allow me to enumerate the functions of std.algorithm and how they work
 today and how they'd work with the proposed change. Let s be a variable
 of some string type.

Special case was wrong though - special casing arrays of char[] and throwing all other ranges of char out the window. The amount of code to support this schizophrenia is enormous.

I think this is a confusion. The code in e.g. std.algorithm is specialized for efficiency of stuff that already works.
 Making strings bidirectional ranges has been a very good choice within
 the constraints. There was already a string type, and that was
 immutable(char)[], and a bunch of code depended on that definition.

Trying to make it work by blowing a hole in the generic range concept now seems like it wasn't worth it.

I disagree. Also what hole? Andrei
Mar 07 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Mar-2014 05:18, Andrei Alexandrescu пишет:
 On 3/7/14, 12:48 PM, Dmitry Olshansky wrote:
 07-Mar-2014 23:57, Andrei Alexandrescu пишет:
 On 3/6/14, 6:37 PM, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up about the
 automatic encoding and decoding of char ranges.

Allow me to enumerate the functions of std.algorithm and how they work today and how they'd work with the proposed change. Let s be a variable of some string type.

Special case was wrong though - special casing arrays of char[] and throwing all other ranges of char out the window. The amount of code to support this schizophrenia is enormous.

I think this is a confusion. The code in e.g. std.algorithm is specialized for efficiency of stuff that already works.

Well, I've said it elsewhere - specialization was too fine grained. Either it's generic or it doesn't work.
 Making strings bidirectional ranges has been a very good choice within
 the constraints. There was already a string type, and that was
 immutable(char)[], and a bunch of code depended on that definition.

Trying to make it work by blowing a hole in the generic range concept now seems like it wasn't worth it.

I disagree. Also what hole?

Let's say we keep it. Yesterday I had to write constraints like this:

    if((isNarrowString!Range && is(Unqual!(ElementEncodingType!Range) == wchar)) ||
       (isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar)))

Just to accept anything that works like an array of wchar, buffers and whatnot included. I expect that this should have been enough:

    isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar)

Or maybe introduce something to indicate any "DualRange" of narrow chars.

-- 
Dmitry Olshansky
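For illustration only, the longer constraint can be wrapped in a named trait so it is written once. This is a rough sketch: the name isWcharSequence is made up here and is not part of Phobos, and the static asserts assume the auto-decoding range primitives discussed in this thread.

```d
import std.range.primitives : ElementEncodingType, ElementType, isRandomAccessRange;
import std.traits : isNarrowString, Unqual;

// Hypothetical helper trait (not in Phobos): true for narrow strings whose
// code unit is wchar, and for random-access ranges whose element is wchar.
enum isWcharSequence(Range) =
    (isNarrowString!Range && is(Unqual!(ElementEncodingType!Range) == wchar)) ||
    (isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar));

// wstring and wchar[] are narrow strings of wchar code units.
static assert( isWcharSequence!wstring);
static assert( isWcharSequence!(wchar[]));
// string has char code units; dchar[] is random access but its element is dchar.
static assert(!isWcharSequence!string);
static assert(!isWcharSequence!(dchar[]));

void main() {}
```

Note that wstring only passes through the first branch: with auto-decoding it is not a random-access range and its ElementType is dchar, which is exactly why the short form of the constraint is not enough today.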
Mar 08 2014
prev sibling next sibling parent "Eyrk" <eyrk hotmail.com> writes:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
     auto s = "cassé";
     assert(s.canFind('é'));
 }

Hm, I'm not following? Works perfectly fine on my system?
Mar 07 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev 
 wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }

Hm, I'm not following? Works perfectly fine on my system?

Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Mar 07 2014
prev sibling next sibling parent "TC" <chalucha gmail.com> writes:
 Hm, I'm not following? Works perfectly fine on my system?

Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

Used hex view on referenced file and it does not seem to be the same symbol. Works for me with same ones.
Mar 07 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 22:16:58 UTC, TC wrote:
 Hm, I'm not following? Works perfectly fine on my system?

Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

Used hex view on referenced file and it does not seem to be the same symbol.

Define "symbol". :)
Mar 07 2014
prev sibling next sibling parent "Eyrk" <eyrk hotmail.com> writes:
On Friday, 7 March 2014 at 21:58:40 UTC, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev 
 wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
   auto s = "cassé";
   assert(s.canFind('é'));
 }

Hm, I'm not following? Works perfectly fine on my system?

Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

ah right, missing normalization, I get your point, thanks.
Mar 07 2014
prev sibling next sibling parent "TC" <chalucha gmail.com> writes:
On Friday, 7 March 2014 at 22:18:17 UTC, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 22:16:58 UTC, TC wrote:
 Hm, I'm not following? Works perfectly fine on my system?

Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

Used hex view on referenced file and it does not seem to be the same symbol.

Define "symbol". :)

"cassé" - 22 63 61 73 73 65 cc 81 22 vs 'é' - 27 c3 a9 27
Mar 07 2014
prev sibling next sibling parent "TC" <chalucha gmail.com> writes:
 ah right, missing normalization, I get your point, thanks.

Oops :)
Mar 07 2014
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Mar 07, 2014 at 09:58:39PM +0000, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
No, it doesn't.

import std.algorithm;

void main()
{
   auto s = "cassé";
   assert(s.canFind('é'));
}

Hm, I'm not following? Works perfectly fine on my system?


Probably because your browser is normalizing the unicode string when you copy-n-paste Vladimir's message? See below:
 Something's messing with your Unicode. Try downloading and compiling
 this file:
 http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

I downloaded the file and looked at it through `od -ctx1`: the first é is encoded as the byte sequence 65 cc 81, that is, [U+65, U+301] (small letter e + combining diacritic acute accent), whereas the second é is encoded as c3 a9, that is, U+E9 (precomposed small letter e with acute accent).

This illustrates one of my objections to Andrei's post: by auto-decoding behind the user's back and hiding the intricacies of unicode from him, it has masked the fact that codepoint-for-codepoint comparison of a unicode string is not guaranteed to always return the correct results, due to the possibility of non-normalized strings.

Basically, to have correct behaviour in all cases, the user must be aware of, and use, the Unicode collation / normalization algorithms prescribed by the Unicode standard. What we have in std.algorithm right now is an incomplete implementation with non-working edge cases (like Vladimir's example) that has poor performance to start with. Its only redeeming factor is that the auto-decoding hack has given it the illusion of being correct, when actually it's not correct according to the Unicode standard. I don't see how this is necessarily superior to Walter's proposal.

T

-- 
Just because you survived after you did it, doesn't mean it wasn't stupid!
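As a small sketch of the point about normalization, the two encodings of é can be reproduced with escapes, and the search can be made to succeed by normalizing the haystack first. The helper name canFindNormalized is made up for this example; std.uni.normalize and the NFC form are existing Phobos facilities, and the needle is assumed to already be a precomposed code point.

```d
import std.algorithm.searching : canFind;
import std.uni : normalize, NFC;

// Hypothetical helper: normalize the haystack to NFC (composed form)
// before doing the usual code point search.
bool canFindNormalized(string haystack, dchar needle)
{
    return haystack.normalize!NFC.canFind(needle);
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT (bytes 65 cc 81)
    string decomposed = "casse\u0301";
    // U+00E9, the precomposed letter (bytes c3 a9)
    dchar precomposed = '\u00E9';

    // Code point comparison alone does not see these as the same character.
    assert(!decomposed.canFind(precomposed));
    // After NFC normalization the haystack contains U+00E9.
    assert(canFindNormalized(decomposed, precomposed));
}
```

This only covers the single code point case; a needle that is itself a multi-code-point sequence would need to be normalized too and searched for as a substring.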
Mar 07 2014
prev sibling next sibling parent "Sarath Kodali" <sarath dummy.com> writes:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu 
 wrote:
 Allow me to enumerate the functions of std.algorithm and how 
 they work today and how they'd work with the proposed change. 
 Let s be a variable of some string type.

 s.canFind('é') currently works as expected.

No, it doesn't.

    import std.algorithm;

    void main()
    {
        auto s = "cassé";
        assert(s.canFind('é'));
    }

That's the whole problem - all this hot steam and it still does not work properly. Because it can't - not without pulling in all of the Unicode algorithms implicitly, and that would be much worse.
 I went down std.algorithm in the order listed in its 
 documentation and found pernicious issues with almost every 
 single algorithm.

All of your examples are variations of one and the same case: searching for a non-ASCII dchar or dchar literal. How often does this pattern occur in real programs? I think the only real metric is to try the change and find out.
 Clearly one might argue that their app has no business dealing 
 with diacriticals or Asian characters. But that's the typical 
 provincial view that marred many languages' approach to UTF 
 and internationalization.

So is yours, if you think that making everything magically a dchar is going to solve all problems. The TDPL example only showcases the problem. Yes, it works with Swedish. Now try it again with Sanskrit.

+1 In Indian languages, a character consists of one or more UNICODE code points. For example, in Sanskrit "ddhrya" http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg consists of 7 UNICODE code points. So to search for this char I have to use string search. - Sarath
Mar 07 2014
prev sibling next sibling parent "Eyrk" <eyrk hotmail.com> writes:
On Friday, 7 March 2014 at 22:27:35 UTC, H. S. Teoh wrote:
 This illustrates one of my objections to Andrei's post: by 
 auto-decoding
 behind the user's back and hiding the intricacies of unicode 
 from him,
 it has masked the fact that codepoint-for-codepoint comparison 
 of a
 unicode string is not guaranteed to always return the correct 
 results,
 due to the possibility of non-normalized strings.

 Basically, to have correct behaviour in all cases, the user 
 must be
 aware of, and use, the Unicode collation / normalization 
 algorithms
 prescribed by the Unicode standard. What we have in 
 std.algorithm right
 now is an incomplete implementation with non-working edge cases 
 (like
 Vladimir's example) that has poor performance to start with. 
 Its only
 redeeming factor is that the auto-decoding hack has given it the
 illusion of being correct, when actually it's not correct 
 according to
 the Unicode standard. I don't see how this is necessarily 
 superior to
 Walter's proposal.


 T

Yes, I realised too late. Would it not be beneficial to have different types of literals, one type which is implicitly normalized and one which is "raw"(like today)? Since typically you'd want to normalize most string literals at compile-time, then you only have to normalize external input at run-time.
Mar 07 2014
prev sibling next sibling parent "TC" <chalucha gmail.com> writes:
 Probably because your browser is normalizing the unicode string 
 when you
 copy-n-paste Vladimir's message? See below:

Just out of curiosity I tried it with C# to see how it is handled there, and it works like this:

    using System;
    using System.Diagnostics;

    namespace Test
    {
        class Program
        {
            static void Main()
            {
                var s = "cassé";
                Debug.Assert(s.IndexOf('é') < 0);
                s = s.Normalize();
                Debug.Assert(s.IndexOf('é') == 4);
            }
        }
    }

So it doesn't work by default there either, and Normalize has to be used.
Mar 07 2014
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Mar 07, 2014 at 10:35:46PM +0000, Sarath Kodali wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu
wrote:


Clearly one might argue that their app has no business dealing
with diacriticals or Asian characters. But that's the typical
provincial view that marred many languages' approach to UTF and
internationalization.

So is yours, if you think that making everything magically a dchar is going to solve all problems. The TDPL example only showcases the problem. Yes, it works with Swedish. Now try it again with Sanskrit.

+1 In Indian languages, a character consists of one or more UNICODE code points. For example, in Sanskrit "ddhrya" http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg consists of 7 UNICODE code points. So to search for this char I have to use string search.

That's what I've been arguing for. The most general form of character searching in Unicode requires substring searching, and similarly many character-based operations on Unicode strings are effectively substring-based operations, because said "character" may be a multibyte code point, or, in your case, multiple code points. Since that's the case, we might as well just forget about the distinction between "character" and "string", and treat all such operations as substring operations (even if the operand is supposedly "just 1 character long"). This would allow us to get rid of the hackish auto-decoding of narrow strings, and thus eliminate the needless overhead of always decoding. T -- All men are mortal. Socrates is mortal. Therefore all men are Socrates.
Mar 07 2014
prev sibling next sibling parent "Sarath Kodali" <sarath dummy.com> writes:
On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:
 +1
 In Indian languages, a character consists of one or more 
 UNICODE code points. For example, in Sanskrit "ddhrya" 
 http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg 
 consists of 7 UNICODE code points. So to search for this char I 
 have to use string search.

 - Sarath

Oops, incomplete reply ... Since a single "alphabet" in Indian languages can contain multiple code-points, iterating over single code-points is like iterating over char[] for non English European languages. So decode is of no use other than decreasing the performance. A raw char[] comparison is much faster. And then there is this "unicode normalization" that makes it very difficult for string searches or comparisons. - Sarath
Mar 07 2014
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Mar 07, 2014 at 11:13:50PM +0000, Sarath Kodali wrote:
 On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:
+1
In Indian languages, a character consists of one or more UNICODE
code points. For example, in Sanskrit "ddhrya"
http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
consists of 7 UNICODE code points. So to search for this char I
have to use string search.

- Sarath

Oops, incomplete reply ... Since a single "alphabet" in Indian languages can contain multiple code-points, iterating over single code-points is like iterating over char[] for non English European languages. So decode is of no use other than decreasing the performance. A raw char[] comparison is much faster.

Yes. The more I think about it, the more auto-decoding sounds like a wrong decision. The question, though, is whether it's worth the massive code breakage needed to undo it. :-(
 And then there is this "unicode normalization" that makes it very
 difficult for string searches or comparisons.

I believe the convention is to always normalize strings before performing operations on them, in order to prevent these sorts of problems. I think many of the unicode prescribed algorithms have normalization as a prerequisite, since otherwise there's no guarantee that the algorithm will produce the correct results. T -- "I'm not childish; I'm just in touch with the child within!" - RL
Mar 07 2014
prev sibling next sibling parent "Brad Anderson" <eco gnuk.net> writes:
On Friday, 7 March 2014 at 22:27:35 UTC, H. S. Teoh wrote:
 On Fri, Mar 07, 2014 at 09:58:39PM +0000, Vladimir Panteleev 
 wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev 
wrote:
No, it doesn't.

import std.algorithm;

void main()
{
   auto s = "cassé";
   assert(s.canFind('é'));
}

Hm, I'm not following? Works perfectly fine on my system?


Probably because your browser is normalizing the unicode string when you copy-n-paste Vladimir's message? See below:
 Something's messing with your Unicode. Try downloading and 
 compiling
 this file:
 http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d

I downloaded the file and looked at it through `od -ctx1`: the first é is encoded as the byte sequence 65 cc 81, that is, [U+65, U+301] (small letter e + combining diacritic acute accent), whereas the second é is encoded as c3 a9, that is, U+E9 (precomposed small letter e with acute accent).

This illustrates one of my objections to Andrei's post: by auto-decoding behind the user's back and hiding the intricacies of unicode from him, it has masked the fact that codepoint-for-codepoint comparison of a unicode string is not guaranteed to always return the correct results, due to the possibility of non-normalized strings.

Basically, to have correct behaviour in all cases, the user must be aware of, and use, the Unicode collation / normalization algorithms prescribed by the Unicode standard. What we have in std.algorithm right now is an incomplete implementation with non-working edge cases (like Vladimir's example) that has poor performance to start with. Its only redeeming factor is that the auto-decoding hack has given it the illusion of being correct, when actually it's not correct according to the Unicode standard. I don't see how this is necessarily superior to Walter's proposal.

T

To me, the status quo feels like an ok compromise between performance and correctness. Everyone is pointing out that working at the code point level is bad because it's not correct but working at the code unit level as Walter proposes is correct even less often so that's not really an argument for moving to that. It is, however, an argument for forcing the user to decide what level of correctness and performance they need.

Walter's idea (code unit level) would be fastest but least correct. The current is somewhat fast and is somewhat correct. The next level, graphemes, would be slowest of all but most correct.

It seems like there is just no way to avoid the tradeoff between speed and correctness so we shouldn't try, only try to force the user to make a decision. Maybe some more string types are in order (hrm). In order of performance to correctness:

    string, wstring (code units)
    dstring (code points)
    +gstring (graphemes)

(do graphemes completely normalize? If not probably need another level, say, nstring)

Then if a user needs correctness over performance they just work with gstrings. If they need performance over correctness they work with strings (assuming some of Walter's idea happens, otherwise they'd work with string.representation).
Mar 07 2014
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu 
wrote:
 s.all!(x => x == 'é')
 s.any!(x => x == 'é')
 s.canFind!(x => x == 'é')

These are a variation of the following:

    ubyte b = ...;
    if (b == 1000) { ... }

The compiler could emit a warning here, and indeed some languages/compilers do. It might not be in the vein of D metaprogramming, though, as the compiler will not emit a warning for "if (false) { ... }".
 s.canFind('é')
 s.endsWith('é')
 s.find('é')
 s.count('é')
 s.countUntil('é')

These should not compile post-change, because the sought element (dchar) is not of the same type as the string. So they will not fail silently.
 s.count()
 s.count!((a, b) => std.uni.toLower(a) == 
 std.uni.toLower(b))("é")
 s.countUntil('é')

As has already been mentioned, counting code points is borderline useless.
 s.count!((a, b) => std.uni.toLower(a) == 
 std.uni.toLower(b))("é")

And this is just wrong on many levels. I hope you know better than to actually use this for case-insensitive comparisons in production software.
Mar 07 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 4:39 PM, Vladimir Panteleev wrote:
 s.canFind('é')
 s.endsWith('é')
 s.find('é')
 s.count('é')
 s.countUntil('é')

These should not compile post-change, because the sought element (dchar) is not of the same type as the string. So they will not fail silently.

The compared element need not have the same type (otherwise we'd break some other code). Andrei
Mar 07 2014
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 12:26 PM, H. S. Teoh wrote:
 On Fri, Mar 07, 2014 at 11:57:23AM -0800, Andrei Alexandrescu wrote:
 s.canFind('é') currently works as expected. Proposed: fails silently.

The problem is that the current implementation of this correct behaviour leaves a lot to be desired in terms of performance. Ideally, you should not need to decode every single character in s just to see if it happens to contain é. Rather, canFind, et al should convert the dchar literal 'é' into a UTF-8 (resp. UTF-16) sequence and do a substring search instead. Decoding every character in s, while correct, is also needlessly inefficient.

That's an optimization that fits the current design and goes in the library transparently, i.e. the good stuff.
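A rough sketch of that optimization, for illustration: encode the sought code point once, then search the raw code units without decoding the haystack. The function name containsCodePoint is made up here; std.utf.encode and std.string.representation are existing Phobos functions. This is not how canFind is actually implemented, just the idea in isolation.

```d
import std.algorithm.searching : canFind;
import std.string : representation;
import std.utf : encode;

// Hypothetical helper: look for a code point by searching for its UTF-8
// byte sequence, so the haystack is never decoded.
bool containsCodePoint(string haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle);   // number of code units written
    // representation() views the chars as ubytes, which bypasses auto-decoding.
    return haystack.representation.canFind(buf[0 .. len].representation);
}

void main()
{
    assert( containsCodePoint("cass\u00E9", '\u00E9'));
    assert(!containsCodePoint("casse", '\u00E9'));
    // Still a code point search: the decomposed form is not matched.
    assert(!containsCodePoint("casse\u0301", '\u00E9'));
}
```

Because valid UTF-8 is self-synchronizing, a match on the encoded byte sequence cannot start in the middle of another character, so the result agrees with the decoding search on well-formed input.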
 5.

 s.count() currently works as expected. Proposed: fails silently.

Wrong. The current behaviour of s.count() does not work as expected, it only gives an illusion that it does.

Depends on what one expects :o).
 Its return value is misleading when
 combining diacritics and other such Unicode "niceness" are involved.
 Arguably, such things should be prohibited altogether, and more
 semantically transparent algorithms used, namely s.countCodePoints,
 s.countGraphemes, etc..

I think s.byGrapheme.count is the right way instead of specializing a bunch of algorithms to work with graphemes.
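A minimal sketch of the three counting levels being argued over, assuming a Phobos where narrow strings still auto-decode. The helper names (countCodeUnits, countCodePoints, countGraphemes) are made up for illustration and are not Phobos functions.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, named only for this illustration.
size_t countCodeUnits(string s)  { return s.length; }
size_t countCodePoints(string s) { return s.walkLength; }            // auto-decodes to dchar
size_t countGraphemes(string s)  { return s.byGrapheme.walkLength; } // explicit grapheme range

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT, displayed as a single "é".
    string s = "e\u0301";

    assert(countCodeUnits(s) == 3);   // 1 byte for 'e' + 2 bytes for U+0301
    assert(countCodePoints(s) == 2);
    assert(countGraphemes(s) == 1);
}
```

Composing byGrapheme with the generic walkLength is the "range adaptor plus existing algorithm" style, as opposed to a dedicated countGraphemes in the library.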
 s.endsWith('é') currently works as expected. Proposed: fails silently.

Arguable, because it imposes a performance hit by needless decoding. Ideally, you should have 3 overloads:

    bool endsWith(string s, char asciiChar);
    bool endsWith(string s, wchar wideChar);
    bool endsWith(string s, dchar codepoint);

Nice idea. Fits current design. Then interesting complications arise with things like bool endsWith(string, wstring) etc.
 [...]
 I designed the range behavior of strings after much thinking and
 consideration back in the day when I designed std.algorithm. It was
 painfully obvious (but it seems to have been forgotten now that it's
 working so well) that approaching strings as arrays of char[] would
 break almost every single algorithm leaving us essentially in the
 pre-UTF C++aveman era.

I agree, but it is also painfully obvious that the current implementation is lackluster in terms of performance.

It's not painfully obvious to me at all. What is obvious to me is people are happy campers with the way D's strings work, including UTF support and performance. I don't remember people bringing this up in forums and here at Facebook "yeah, just look at the crappy way they handle strings..." Silent approval is easy to forget about. Walter has been working on an application in which anything slower than 2x baseline would have been a failure. In that app (which I know very well) the right option from day 1 would have been ubyte[], which he discovered the hard way. His incomplete understanding of how D strings work is the single largest problem there, and indicates an issue with the documentation. He discovered that, was surprised, and overreacted. No need to amplify that into mass hysteria. There are improvements that can be made, in the form of additions, not breaking changes that would inflict massive breakage on the community. This is the way in which this discussion can have a positive outcome. (I've shared in fact a few ideas with Walter.)
 Clearly one might argue that their app has no business dealing with
 diacriticals or Asian characters. But that's the typical provincial
 view that marred many languages' approach to UTF and
 internationalization. If you know your string is ASCII, the remedy
 is simple - don't use char[] and friends. From day 1, the type
 "char" was meant to mean "code unit of UTF characters".

Yes, but currently Phobos support for non-UTF strings is rather poor, and requires many explicit casts to/from ubyte[].

Non-UTF strings are currently modeled as ubyte[], so I don't see what you'd be casting to and fro. You have absolutely no business representing anything non-UTF with char and char[] etc.
 So please ponder the above before going to do surgery on the patient
 that's going to kill him.

Yeah I was surprised Walter was actually seriously going to pursue this. It's a change of a far vaster magnitude than many of the other DIPs and other proposals that have been rejected because they were deemed to cause too much breakage of existing code.

Compared with what's going on now with D at Facebook, this agitation is but a little side show. We have way bigger fish to fry. Andrei
Mar 07 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 00:44:53 UTC, Andrei Alexandrescu 
wrote:
 worksforme

http://forum.dlang.org/post/fhqradggtvwnpqpuehgg forum.dlang.org
Mar 07 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 01:23:27 UTC, Andrei Alexandrescu 
wrote:
 Yup, the grapheme issue. This should work.

No. It does not work because grapheme segmentation is not the same as normalization. Even if you fix the code (should be: assert(s.byGrapheme.canFind!"a[] == b"("é"))), it will not work because byGrapheme does not normalize (and not all graphemes can be normalized to a single code point anyway). And there is more than one type of normalization - you need to use the one depending on what you're trying to achieve.
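To make the segmentation-versus-normalization point concrete, here is a small sketch. It only assumes std.uni's byGrapheme and the Grapheme type's length and slice operations; the two spellings of "é" are chosen for illustration.

```d
import std.algorithm.comparison : equal;
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string composed   = "\u00E9";   // é as a single code point
    string decomposed = "e\u0301";  // e + combining acute accent

    // Segmentation: each spelling is exactly one grapheme...
    assert(composed.byGrapheme.walkLength == 1);
    assert(decomposed.byGrapheme.walkLength == 1);

    // ...but segmentation does not normalize, so the two graphemes
    // still hold different code point sequences.
    auto g1 = composed.byGrapheme.front;
    auto g2 = decomposed.byGrapheme.front;
    assert(g1.length == 1);
    assert(g2.length == 2);
    assert(!equal(g1[], g2[]));
}
```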
 Graphemes are the next level of Nirvana above code points, but 
 that doesn't mean it's graphemes or nothing.

It's not about types, it's about algorithms. It's never "X or nothing" - unless X is "actually understanding Unicode". Everything else is a compromise. Compromises are acceptable, but not when they are built into the language as the standard way of working with text, thus hiding the problems that come with them.
Mar 07 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 01:38:39 UTC, Andrei Alexandrescu 
wrote:
 On 3/7/14, 4:39 PM, Vladimir Panteleev wrote:
 s.canFind('é')
 s.endsWith('é')
 s.find('é')
 s.count('é')
 s.countUntil('é')

These should not compile post-change, because the sought element (dchar) is not of the same type as the string. So they will not fail silently.

The compared element need not have the same type (otherwise we'd break some other code).

Do you think such code will appear often in practice? Even if the type is a dchar, in some cases the programmer may not have intended to do decoding (e.g. the "dchar" type was a result of type deduction from .front OSLT).
Mar 07 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 01:41:01 UTC, Vladimir Panteleev 
wrote:
 On Saturday, 8 March 2014 at 01:38:39 UTC, Andrei Alexandrescu 
 wrote:
 On 3/7/14, 4:39 PM, Vladimir Panteleev wrote:
 These should not compile post-change, because the sought 
 element (dchar)
 is not of the same type as the string. So they will not fail 
 silently.

The compared element need not have the same type (otherwise we'd break some other code).

Do you think such code will appear often in practice? Even if the type is a dchar, in some cases the programmer may not have intended to do decoding (e.g. the "dchar" type was a result of type deduction from .front OSLT).

Sorry, I see now that you were referring to algorithms in general. I think adding a temporary warning for character types only, as with .front, would be appropriate...
Mar 07 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Vladimir Panteleev:

 It's not about types, it's about algorithms.

Given sufficiently refined types, it can be about types :-) Bye, bearophile
Mar 07 2014
prev sibling next sibling parent "Eyrk" <eyrk hotmail.com> writes:
On Saturday, 8 March 2014 at 02:04:12 UTC, bearophile wrote:
 Vladimir Panteleev:

 It's not about types, it's about algorithms.

Given sufficiently refined types, it can be about types :-) Bye, bearophile

I think Bear is onto something, we already solved an analogous problem in an elegant way. See SortedRange with assumeSorted etc. But for this to be convenient to use, I still think we should expand the current 'String Literal Postfix' types to include both normalization and graphemes.

    Postfix  Type                Aka
    c        immutable(char)[]   string
    w        immutable(wchar)[]  wstring
    d        immutable(dchar)[]  dstring
Mar 08 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 15:33:34 UTC, Andrei Alexandrescu 
wrote:
 Why? Couldn't the grapheme 'compare true with the character? 
 I.e. the byGrapheme iteration normalizes on the fly.

Grapheme segmentation and normalization are distinct Unicode algorithms:

http://www.unicode.org/reports/tr15/
http://www.unicode.org/reports/tr29/

There are also several normalization algorithms.

http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
Mar 08 2014
prev sibling next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 8 March 2014 at 16:00:38 UTC, Vladimir Panteleev 
wrote:
 On Saturday, 8 March 2014 at 15:33:34 UTC, Andrei Alexandrescu 
 wrote:
 Why? Couldn't the grapheme 'compare true with the character? 
 I.e. the byGrapheme iteration normalizes on the fly.

Grapheme segmentation and normalization are distinct Unicode algorithms:

http://www.unicode.org/reports/tr15/
http://www.unicode.org/reports/tr29/

There are also several normalization algorithms.

http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

How about this?

    s.normalize!NFKD

To return a range of normalized code points? Clearly, no definition of string can handle this natively. As you say, there are multiple algorithms, so there is no one 'right' answer. byGrapheme is useful, but doesn't and cannot solve the normalization issue.

I feel this discussion is tangential to the main debate: whether strings should be ranges of code points or code units.

By code unit is faster by default, and simpler to implement in Phobos (no more special code).

By code point works better when searching for individual code points, but as you rightly point out this might not be useful in practice as you rarely search for individual non-ASCII code points, and it isn't a complete solution anyway because of normalization.

There are a few problems with by code unit:

1. Searching string/wstring for dchar fails silently. You have suggested making this a compilation error, but Andrei argues this would break lots of code. You counter that it's possible that people rarely search for dchar anyway, so it may not matter.

2. It's a fundamental change. Regardless of which is better, we need to consider the impact of such a change.

3. Ranges of code units are random access and sliceable, which means they will be accepted by algorithms such as sort, which will just produce garbage strings. Maybe this isn't an issue.
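To illustrate problem 3, here is a hedged sketch of what sorting code units does to a UTF-8 string. It uses std.string.representation to get the code unit view; sortCodeUnits is a hypothetical helper written for this example, not an existing function.

```d
import std.algorithm.sorting : sort;
import std.exception : assertThrown;
import std.string : representation;
import std.utf : UTFException, validate;

// Hypothetical helper: sorts the code units of s in place and returns them.
ubyte[] sortCodeUnits(char[] s)
{
    ubyte[] units = s.representation;  // same memory, viewed as ubyte[]
    sort(units);
    return units;
}

void main()
{
    char[] s = "\u00E9a".dup;          // UTF-8 bytes: C3 A9 61
    ubyte[] sorted = sortCodeUnits(s);

    ubyte[] expected = [0x61, 0xA9, 0xC3];
    assert(sorted == expected);

    // A9 is a continuation byte with no lead byte before it,
    // so the sorted buffer is no longer valid UTF-8.
    assertThrown!UTFException(validate(s));
}
```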
Mar 08 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 15:56:08 UTC, Dmitry Olshansky wrote:
 The following should work as is though:

 s.byGrapheme.canFind(Grapheme("é"))

Doesn't work here. Not sure why.

    Grapheme(1000065, 3, 0, 33554432, [101, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 2) // last byGrapheme

vs.

    Grapheme(E9, 0, 0, 16777216, [233, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 1) // Grapheme("é")
Mar 08 2014
prev sibling next sibling parent reply "Luís Marques" <luis luismarques.eu> writes:
On Friday, 7 March 2014 at 20:27:38 UTC, H. S. Teoh wrote:
 	s.indexOf("a");			// for slicing
 	s.byCodepoint.countUntil("a");	// count code points
 	s.byGrapheme.countUntil("a");	// count graphemes

(BTW, byGrapheme is currently missing in the std.uni docs)
Mar 08 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/8/2014 9:44 AM, "Luís Marques" <luis luismarques.eu>" wrote:
 (BTW, byGrapheme is currently missing in the std.uni docs)

https://github.com/D-Programming-Language/phobos/pull/1985
Mar 08 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 22:42:20 UTC, Dmitry Olshansky wrote:
 Sounds like a bug, file it before we derailed.

https://d.puremagic.com/issues/show_bug.cgi?id=12324
Mar 08 2014
prev sibling next sibling parent "w0rp" <devw0rp gmail.com> writes:
On Sunday, 9 March 2014 at 09:24:02 UTC, Nick Sabalausky wrote:
 I'm leaning the same way too. But I also think Andrei is right 
 that, at this point in time, it'd be a terrible move to change 
 things so that "by code unit" is default. For better or worse, 
 that ship has sailed.

 Perhaps we *can* deal with the auto-decoding problem not by 
 killing auto-decoding, but by marginalizing it in an additive 
 way:

 Convincing arguments have been made that any string-processing 
 code which *isn't* done entirely with the official Unicode 
 algorithms is likely wrong *regardless* of whether 
 std.algorithm defaults to per-code-unit or per-code-point.

 So...How's this?: We add any of these Unicode algorithms we may 
 be missing, encourage their use for strings, discourage use of 
 std.algorithm for string processing, and in the meantime, just 
 do our best to reduce unnecessary decoding wherever possible. 
 Then we call it a day and all be happy :)

I've been watching this discussion for the last few days, and I'm kind of a nobody jumping in pretty late, but I think after thinking about the problem for a while I would agree on a solution along the lines of what you have suggested.

I think Vladimir is definitely right when he's saying that when you have algorithms that deal with natural languages, simply working on the basis of a code unit isn't enough. I think it is also true that you need to select a particular algorithm for dealing with strings of characters, as there are many different algorithms you can use for different languages which behave differently, perhaps several in a single language. I also think Andrei is right when he is saying we need to minimise code breakage, and that the string decoding and encoding by default isn't the biggest of performance problems.

I think our best option is to offer a function which creates a range in std.array for getting a range over raw character data, without decoding to code points.

    myArray.someAlgorithm; // std.array .front used today with decode calls
    myArray.rawData.someAlgorithm; // New range which doesn't decode.

Then we could look at creating algorithms for string processing which don't use the existing dchar abstraction.

    myArray.rawData.byNaturalSymbol!SomeIndianEncodingHere; // Range of strings, maybe range of range of characters, not dchars

Or even specialise the new algorithm so it looks for arrays and turns them into the ranges for you via the transformation myArray -> myArray.rawData.

    myArray.byNaturalSymbol!SomeIndianEncodingHere;

Honestly, I'd leave the details of such an algorithm to Vladimir and not myself, because he's spent far more time looking into Unicode processing than myself. My knowledge of Unicode pretty much just comes from having to deal with foreign language customers and discovering the problems with the code unit abstraction most languages seem to use. (Java and Python suffer from similar issues, but they don't really have algorithms in the way that we do.)

This new set of algorithms taking settings for different encodings could be first implemented in a third party library, tested there, and eventually submitted to Phobos, probably in std.string. There's my input, I'll duck before I'm beheaded.
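For what it's worth, current Phobos has range adaptors in std.utf that behave much like the rawData idea: byCodeUnit iterates without decoding and byDchar decodes explicitly. The sketch below assumes those two functions and the default auto-decoding of string; it says nothing about the byNaturalSymbol part, which remains hypothetical.

```d
import std.range : ElementType, walkLength;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "caf\u00E9";   // 'é' takes two UTF-8 code units

    // Default range view of a string: auto-decoded code points.
    static assert(is(ElementType!string == dchar));
    assert(s.walkLength == 4);

    // Opt out of decoding: a range over the raw code units.
    auto raw = s.byCodeUnit;
    static assert(is(Unqual!(ElementType!(typeof(raw))) == char));
    assert(raw.walkLength == 5);

    // Opt in to decoding explicitly, independent of the default.
    assert(s.byDchar.walkLength == 4);
}
```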
Mar 09 2014
prev sibling next sibling parent "Marc Schütz" <schuetzm gmx.net> writes:
On Friday, 7 March 2014 at 23:13:50 UTC, H. S. Teoh wrote:
 On Fri, Mar 07, 2014 at 10:35:46PM +0000, Sarath Kodali wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev 
 wrote:
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu
wrote:


Clearly one might argue that their app has no business 
dealing
with diacriticals or Asian characters. But that's the typical
provincial view that marred many languages' approach to UTF 
and
internationalization.

So is yours, if you think that making everything magically a dchar is going to solve all problems. The TDPL example only showcases the problem. Yes, it works with Swedish. Now try it again with Sanskrit.

+1 In Indian languages, a character consists of one or more UNICODE code points. For example, in Sanskrit "ddhrya" http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg consists of 7 UNICODE code points. So to search for this char I have to use string search.

That's what I've been arguing for. The most general form of character searching in Unicode requires substring searching, and similarly many character-based operations on Unicode strings are effectively substring-based operations, because said "character" may be a multibyte code point, or, in your case, multiple code points. Since that's the case, we might as well just forget about the distinction between "character" and "string", and treat all such operations as substring operations (even if the operand is supposedly "just 1 character long"). This would allow us to get rid of the hackish auto-decoding of narrow strings, and thus eliminate the needless overhead of always decoding.

That won't work, because your needle might be in a different normalization form than your haystack, thus a byte-by-byte comparison will not be able to find it.
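A small sketch of that failure mode and one way around it, assuming std.uni.normalize with NFC is acceptable for the application (which form is right depends on what you're trying to achieve). canFindNormalized is a hypothetical helper, not a Phobos function.

```d
import std.algorithm.searching : canFind;
import std.uni : NFC, normalize;

// Hypothetical helper: substring search after bringing both sides to NFC.
bool canFindNormalized(string haystack, string needle)
{
    return normalize!NFC(haystack).canFind(normalize!NFC(needle));
}

void main()
{
    string haystack = "cafe\u0301";  // decomposed: e + U+0301
    string needle   = "\u00E9";      // precomposed é

    // Plain substring search misses it: the code points differ.
    assert(!haystack.canFind(needle));

    // After normalizing both sides to the same form, it is found.
    assert(canFindNormalized(haystack, needle));
}
```

Normalizing on every call allocates and rescans the haystack, so real code would more likely normalize once at the input boundary.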
Mar 09 2014
prev sibling parent "w0rp" <devw0rp gmail.com> writes:
On Sunday, 9 March 2014 at 21:38:06 UTC, Nick Sabalausky wrote:
 On 3/9/2014 7:47 AM, w0rp wrote:
 My knowledge of Unicode pretty much just comes from having
 to deal with foreign language customers and discovering the 
 problems
 with the code unit abstraction most languages seem to use. 
 (Java and
 Python suffer from similar issues, but they don't really have 
 algorithms
 in the way that we do.)

Python 2 or 3 (out of curiosity)? If you're including Python3, then that somewhat surprises me as I thought greatly improved Unicode was one of the biggest reasons for the jump from 2 to 3. (Although it isn't *completely* surprising since, as we all know far too well here, fully correct Unicode is *not* easy.)

Late reply here. Python 3 is a lot better in terms of Unicode support than 2. The situation in Python 2 was this.

1. The default string type is 'str', an immutable array of bytes.
2. 'str' could be one of many encodings, including UTF-16, etc.
3. There is an extra 'unicode' type for when you want a Unicode string.
4. Python implicitly converts between the two, often in wrong ways, often causing exceptions to appear where you didn't expect them to.

In 3, this changed to...

1. The default string type is still named 'str', only now it's like the 'unicode' of olde.
2. 'bytes' is a new immutable array of bytes type like the Python 2 'str'.
3. Conversion between 'str' and 'bytes' is always explicit.

However, Python 3 works on a code point level, probably some code unit level in fact, and you don't see very many algorithms which take, say, combining characters into account. So Python suffers from similar issues.
Mar 11 2014
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Mar 07, 2014 at 11:57:23AM -0800, Andrei Alexandrescu wrote:
 On 3/6/14, 6:37 PM, Walter Bright wrote:
In "Lots of low hanging fruit in Phobos" the issue came up about the
automatic encoding and decoding of char ranges.

Is there any hope of fixing this?

There's nothing to fix.

:D I knew this was going to happen.
 Allow me to enumerate the functions of std.algorithm and how they
 work today and how they'd work with the proposed change. Let s be a
 variable of some string type.
 
 1.
 
 s.all!(x => x == 'é') currently works as expected. Proposed: fails silently.
 
 2.
 
 s.any!(x => x == 'é') currently works as expected. Proposed: fails silently.
 
 3.
 
 s.canFind!(x => x == 'é') currently works as expected. Proposed:
 fails silently.
 
 4.
 
 s.canFind('é') currently works as expected. Proposed: fails silently.

The problem is that the current implementation of this correct behaviour leaves a lot to be desired in terms of performance. Ideally, you should not need to decode every single character in s just to see if it happens to contain é. Rather, canFind, et al should convert the dchar literal 'é' into a UTF-8 (resp. UTF-16) sequence and do a substring search instead. Decoding every character in s, while correct, is also needlessly inefficient.
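A rough sketch of that approach for the UTF-8 case, assuming the haystack is valid UTF-8. containsCodePoint is a hypothetical helper and not how Phobos implements canFind; it only relies on std.utf.encode and std.string.representation.

```d
import std.algorithm.searching : canFind;
import std.string : representation;
import std.utf : encode;

// Hypothetical helper, not the Phobos implementation.
bool containsCodePoint(string haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle);  // encode the needle once, up front

    // Search over raw code units; nothing in haystack is decoded.
    return haystack.representation.canFind(buf[0 .. len].representation);
}

void main()
{
    assert(containsCodePoint("caf\u00E9", '\u00E9'));
    assert(!containsCodePoint("cafe", '\u00E9'));
    assert(containsCodePoint("abc", 'b'));
}
```

Searching at the code unit level is sound here because a complete UTF-8 sequence cannot match starting in the middle of another character's encoding.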
 5.
 
 s.count() currently works as expected. Proposed: fails silently.

Wrong. The current behaviour of s.count() does not work as expected, it only gives an illusion that it does. Its return value is misleading when combining diacritics and other such Unicode "niceness" are involved. Arguably, such things should be prohibited altogether, and more semantically transparent algorithms used, namely s.countCodePoints, s.countGraphemes, etc..
 6.
 
 s.count!((a, b) => std.uni.toLower(a) == std.uni.toLower(b))("é")
 currently works as expected (with the known issues of lowercase
 conversion). Proposed: fails silently.

Again, I don't like this. It sweeps the issues of comparing unicode strings under the carpet and gives the programmer a false sense of code correctness. Users instead should be encouraged to use proper Unicode collation functions that are actually correct, instead of giving an illusion of correctness.
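As one illustration of how a per-code-point toLower comparison can mislead, the sketch below contrasts it with std.uni.icmp, which is documented to use full case folding. This is case folding only, not locale-aware collation, so take it as a partial example of the point rather than the "proper" function.

```d
import std.algorithm.comparison : equal;
import std.uni : icmp, toLower;

void main()
{
    string a = "Ru\u00DFland";  // "Rußland"
    string b = "Russland";

    // Code point by code point: 'ß' lowercases to itself and never
    // equals 's', so the comparison reports a mismatch.
    assert(!equal!((x, y) => toLower(x) == toLower(y))(a, b));

    // Full case folding maps 'ß' to "ss", so icmp treats them as equal.
    assert(icmp(a, b) == 0);
}
```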
 7.
 
 s.count('é') currently works as expected. Proposed: fails silently.

This is a repetition of #5. :)
 8.
 
 s.countUntil("a") currently work as expected. Proposed: fails
 silently. This applies to all variations of countUntil.

Whether this is correct or not depends on what the intention is. If you're looking to slice a string, this most definitely does NOT work as expected. If you're looking to count graphemes, this doesn't work as expected either. This only works if you just so happen to be counting code points. The correct approach, IMO, is to help the user make a conscious choice between these different semantics:

    s.indexOf("a");                 // for slicing
    s.byCodepoint.countUntil("a");  // count code points
    s.byGrapheme.countUntil("a");   // count graphemes

Things like s.countUntil("a") are misleading and lead to subtle Unicode bugs.
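A short sketch of the slicing pitfall, assuming today's auto-decoding countUntil and std.string.indexOf. The test string is chosen so that the code unit offset and the code point count differ.

```d
import std.algorithm.searching : countUntil;
import std.string : indexOf;

void main()
{
    string s = "\u00E9a";  // 'é' is two UTF-8 code units, 'a' is one

    auto unitIndex  = s.indexOf("a");     // offset in code units
    auto pointCount = s.countUntil("a");  // code points skipped (auto-decoding)

    assert(unitIndex == 2);
    assert(pointCount == 1);

    // Only the code unit offset is a valid slicing index.
    assert(s[unitIndex .. $] == "a");
    assert(s[pointCount .. $] != "a");    // starts in the middle of 'é'
}
```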
 9.
 
 s.endsWith('é') currently works as expected. Proposed: fails silently.

Arguable, because it imposes a performance hit by needless decoding. Ideally, you should have 3 overloads:

    bool endsWith(string s, char asciiChar);
    bool endsWith(string s, wchar wideChar);
    bool endsWith(string s, dchar codepoint);

In the wchar and dchar overloads you'd do substring search. There is no need to decode.
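A minimal sketch of the char and dchar cases, with the wchar case left out for brevity. The names endsWithUnit and endsWithCodePoint are hypothetical and deliberately distinct, so the sketch does not depend on overload resolution between character types; this is not the Phobos endsWith.

```d
import std.utf : encode;

// Hypothetical: suffix test against a single code unit.
bool endsWithUnit(string s, char c)
{
    return s.length > 0 && s[$ - 1] == c;
}

// Hypothetical: suffix test against a code point, without decoding s.
bool endsWithCodePoint(string s, dchar c)
{
    char[4] buf;
    immutable len = encode(buf, c);  // UTF-8 encode the needle
    return s.length >= len && s[$ - len .. $] == buf[0 .. len];
}

void main()
{
    assert(endsWithUnit("abc", 'c'));
    assert(endsWithCodePoint("caf\u00E9", '\u00E9'));
    assert(!endsWithCodePoint("cafe", '\u00E9'));
}
```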
 10.
 
 s.find('é') currently works as expected. Proposed: fails silently.
 This applies to other variations of find that include custom
 predicates.

Not necessarily. Arguably we should be overloading on needle type to eliminate needless decoding:

    string find(string s, char c);   // ubyte search
    string find(string s, wchar c);  // substring search with char[2]
    string find(string s, dchar c);  // substring search with char[4]

This makes sense to me because string is immutable(char)[], so from the point of view of being an array, searching for wchar is not something that is obvious (how do you search for a value of type T in an array of elements of type U?), so explicit overloads for handling those cases make sense. Decoding every single character in s is a lot of needless work. [...]
 I designed the range behavior of strings after much thinking and
 consideration back in the day when I designed std.algorithm. It was
 painfully obvious (but it seems to have been forgotten now that it's
 working so well) that approaching strings as arrays of char[] would
 break almost every single algorithm leaving us essentially in the
 pre-UTF C++aveman era.

I agree, but it is also painfully obvious that the current implementation is lackluster in terms of performance.
 Making strings bidirectional ranges has been a very good choice
 within the constraints. There was already a string type, and that
 was immutable(char)[], and a bunch of code depended on that
 definition.
 
 Clearly one might argue that their app has no business dealing with
 diacriticals or Asian characters. But that's the typical provincial
 view that marred many languages' approach to UTF and
 internationalization. If you know your string is ASCII, the remedy
 is simple - don't use char[] and friends. From day 1, the type
 "char" was meant to mean "code unit of UTF characters".

Yes, but currently Phobos support for non-UTF strings is rather poor, and requires many explicit casts to/from ubyte[].
 So please ponder the above before going to do surgery on the patient
 that's going to kill him.

Yeah I was surprised Walter was actually seriously going to pursue this. It's a change of a far vaster magnitude than many of the other DIPs and other proposals that have been rejected because they were deemed to cause too much breakage of existing code.

T

-- 
Having a smoking section in a restaurant is like having a peeing section in a swimming pool. -- Edward Burr
Mar 07 2014
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 03/07/2014 03:37 AM, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up about the
 automatic encoding and decoding of char ranges.
 ...

I think this is among the most annoying aspects of Phobos.
Mar 07 2014
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Andrei suggests that this change would destroy D by breaking too much existing 
code. He might be right. Can we afford the risk that he is right?

We should think about a way to have our cake and eat it, too.

Keep in mind that this issue is a Phobos one, not a core language issue.
Mar 07 2014
next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
 Andrei suggests that this change would destroy D by breaking 
 too much existing code. He might be right. Can we afford the 
 risk that he is right?

 We should think about a way to have our cake and eat it, too.

 Keep in mind that this issue is a Phobos one, not a core 
 language issue.

Before we discuss risk in the change, we need to agree that it is even a desirable change. I don't think we have reached that point. It's worth pointing out that all the performance issues can be resolved in Phobos through specialisation with no disruption to the users.
Mar 07 2014
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Mar 08, 2014 at 12:46:21AM +0000, Peter Alexander wrote:
 On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
Andrei suggests that this change would destroy D by breaking too
much existing code. He might be right. Can we afford the risk that
he is right?

We should think about a way to have our cake and eat it, too.

Keep in mind that this issue is a Phobos one, not a core language
issue.

Before we discuss risk in the change, we need to agree that it is even a desirable change. I don't think we have reached that point. It's worth pointing out that all the performance issues can be resolved in Phobos through specialisation with no disruption to the users.

Regardless of which way we decide in the end, I hope the one thing good that will come out of this thread is to improve the performance of string algorithms in Phobos. Things like substring searching to implement multibyte character (or multi-codepoint "characters") operations efficiently are quite needed, IMO.

T

-- 
If a person can't communicate, the very least he could do is to shut up. -- Tom Lehrer, on people who bemoan their communication woes with their loved ones.
Mar 07 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
 We should think about a way to have our cake and eat it, too.

I think a good place to start would be to have a draft implementation of the proposal. This will allow people to try it with their projects and see how much code it will really affect. As I mentioned here[1], I suspect that certain valid code that used the range primitives will continue to work unaffected even after a sudden switch, so perhaps the "deprecation" and "error" stage can be replaced with a longer "warning" stage instead. This is similar to how git changed the meaning of the "push" command: it just nagged users for a long time, and included the instructions to switch to the new behavior early (thus squelching the warning) or permanently accepting the old behavior. (For our case it is adding .representation or .byCodepoint depending on the intent.) [1]: http://forum.dlang.org/post/dlpmchtaqzrxxylpmiwh forum.dlang.org
Mar 07 2014
prev sibling next sibling parent reply "Sean Kelly" <sean invisibleduck.org> writes:
On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
 Andrei suggests that this change would destroy D by breaking 
 too much existing code. He might be right. Can we afford the 
 risk that he is right?

Perhaps not. But I think the current approach is totally broken, it just also happens to be what people have coded to. Andrei used algorithms operating on a code point level as an example of what would break if this change were made, and in that he's absolutely correct. But what bothers me is whether it's appropriate to perform this sort of character-based operation on Unicode strings in the first place.

The current approach is a cut above treating strings as arrays of bytes for some languages, and still utterly broken for others. If I'm operating on a right to left language like Hebrew, what would I expect the result to be from something like countUntil? And how useful would such a result be? I'm inclined to say that the correct approach is to state that algorithms operate explicitly on a T.sizeof basis and that if the data contained in a particular range has some multi-element encoding then separate, specialized routines should be used when the T.sizeof behavior will not produce the desired result.

So the problem to me is that we're stuck not fixing something that's horribly broken just because it's broken in a way that people presumably now expect. I'd personally like to see this fixed and I think the new behavior is preferable overall, but I do share Andrei's concern that such a big change might hurt the language anyway.
Mar 08 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 9:33 AM, Sean Kelly wrote:
 On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
 Andrei suggests that this change would destroy D by breaking too much
 existing code. He might be right. Can we afford the risk that he is
 right?

Perhaps not. But I think the current approach is totally broken, it just also happens to be what people have coded to.

I think that's an exaggeration poorly supported by evidence. My definition of "totally broken" would be "essentially unusable" and I think we're well past the point we need to prove that. Virtually all applications need to deal with strings to some extent, and I myself wrote a couple of relatively string-heavy ones. D strings work well. Even the most ardent detractors of D on e.g. reddit.com admit by omission that string processing is one of its strengths. Though they'll probably pick up on this thread soon :o).
 Andrei used
 algorithms operating on a code point level as an example of what would
 break if this change were made, and in that he's absolutely correct.
 But what bothers me is whether it's appropriate to perform this sort of
 character-based operation on Unicode strings in the first place.

Searching for characters in strings would be difficult to deem inappropriate. When I designed std.algorithm I recall I put the following options on the table:

1. All algorithms would by default operate on strings at char/wchar level (i.e. code unit). That would cause the usual issues and confusions I was aware of from C++. Certain algorithms would require specialization and/or the user using byDchar for correctness. At some point I swear I've had a byDchar definition somewhere; I've searched the recent git history for it, to no avail.

2. All algorithms would by default operate at code point level. That way correctness would be achieved by default, and certain algorithms would require specialization for efficiency. (Back then I didn't know about graphemes and normalization. I'm not sure how that would have affected the final decision.)

3. Change the alias string, wstring etc. to be some type that requires explicit access for code units/code points etc. instead of implicitly mixing the two.

My fave was (3). And not mine only - several people suggested alternative definitions of the "default" string type. Back then however we were in the middle of the D1/D2 transition and one more aftershock didn't seem like a good idea at all. Walter opposed such a change, and didn't really have to convince me.

From experience with C++ I knew (1) had a bad track record, and (2) "generically conservative, specialize for speed" was a successful pattern.

What would you have chosen given that context?
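For illustration only, option (3) might look roughly like the sketch below: a wrapper that is not itself a range and hands out code units or code points only on request. The Text type and its member names are hypothetical, and the sketch leans on std.utf.byCodeUnit and byDchar from current Phobos rather than anything that existed when std.algorithm was designed.

```d
import std.range : isInputRange, walkLength;
import std.utf : byCodeUnit, byDchar;

// Hypothetical string type with no implicit element level.
struct Text
{
    private string data;

    auto codeUnits()  { return data.byCodeUnit; }  // no decoding
    auto codePoints() { return data.byDchar; }     // explicit decoding
}

// Text is not a range, so a generic algorithm cannot silently pick a level.
static assert(!isInputRange!Text);

void main()
{
    auto t = Text("caf\u00E9");
    assert(t.codeUnits.walkLength == 5);
    assert(t.codePoints.walkLength == 4);
}
```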
 The current approach is a cut above treating strings as arrays of bytes
 for some languages, and still utterly broken for others. If I'm
 operating on a right to left language like Hebrew, what would I expect
 the result to be from something like countUntil?

The entire string processing paraphernalia is left to right. I figure RTL languages are under-supported, but s.retro.countUntil comes to mind.
 And how useful would
 such a result be?

I don't know.
 I'm inclined to say that the correct approach is to
 state that algorithms operate explicitly on a T.sizeof basis and that if
 the data contained in a particular range has some multi-element encoding
 then separate, specialized routines should be used with the T.sizeof
 behavior will not produce the desired result.

That sounds quite like C++ plus ICU. It doesn't strike me as the golden standard for Unicode integration.
 So the problem to me is that we're stuck not fixing something that's
 horribly broken just because it's broken in a way that people presumably
 now expect.

Clearly I'm being subjective here but again I'd find it difficult to get convinced we have something horribly broken from the evidence I gathered inside and outside Facebook.
 I'd personally like to see this fixed and I think the new behavior is
 preferable overall, but I do share Andrei's concern that such a big
 change might hurt the language anyway.

I've said this once and I'm saying it again: the best way to convert this discussion into something useful is to devise ideas for useful non-breaking additions. Andrei
Mar 08 2014
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 12:26 PM, Sean Kelly wrote:
 But you're right. I was being dramatic when I called it utterly broken.
 It's simply not useful to me as-is. The solution for me is fairly simple
 though if inelegant--cast the string to an array of ubyte.

Ain't nobody know nothing about http://dlang.org/phobos/std_string.html#.representation around here! Andrei
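
For readers who have not met it, std.string.representation reinterprets a string as an array of the integer type underlying its code units, so no cast is needed. A minimal sketch follows; the sample string and variable names are my own, chosen only for illustration:

```d
import std.string : representation;

void main()
{
    string s = "héllo";                          // 'é' takes two UTF-8 code units
    immutable(ubyte)[] bytes = s.representation; // same memory, no copy, no cast

    assert(bytes.length == 6);
    assert(bytes[1] == 0xC3 && bytes[2] == 0xA9); // UTF-8 encoding of U+00E9
}
```

Because the result is a plain array of ubyte, range algorithms applied to it see code units and do not decode.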
Mar 08 2014
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 12:38 PM, Vladimir Panteleev wrote:
 On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote:
 1. All algorithms would by default operate on strings at char/wchar
 level (i.e. code unit). That would cause the usual issues and
 confusions I was aware of from C++. Certain algorithms would require
 specialization and/or the user using byDchar for correctness.

As previously discussed, "correctness" here is conditional. I would not use that word, it is another extreme.

Agreed.
 From experience with C++ I knew (1) had a bad track record, and (2)
 "generically conservative, specialize for speed" was a successful
 pattern.

 What would you have chosen given that context?

Ideally, we would have the Unicode algorithms in the standard library from day 1, and advocated their use throughout the documentation.

It's not late to do a lot of that.
 I'm inclined to say that the correct approach is to
 state that algorithms operate explicitly on a T.sizeof basis and that if
 the data contained in a particular range has some multi-element encoding
 then separate, specialized routines should be used with the T.sizeof
 behavior will not produce the desired result.

That sounds quite like C++ plus ICU. It doesn't strike me as the golden standard for Unicode integration.

Why not? Because it sounds like D needs exactly that. Plus its amazing slicing and range capabilities, of course.

Pretty much everyone using ICU hates it.
 So the problem to me is that we're stuck not fixing something that's
 horribly broken just because it's broken in a way that people presumably
 now expect.

Clearly I'm being subjective here but again I'd find it difficult to get convinced we have something horribly broken from the evidence I gathered inside and outside Facebook.

Have you or anyone you personally know tried to process text in D containing a writing system such as Sanskrit's?

No. Point being?
 I'd personally like to see this fixed and I think the new behavior is
 preferable overall, but I do share Andrei's concern that such a big
 change might hurt the language anyway.

I've said this once and I'm saying it again: the best way to convert this discussion into something useful is to devise ideas for useful non-breaking additions.

I disagree. As I've argued, I believe that currently most uses of dchars in an application are incorrect, and ultimately a time bomb for proper internationalization support. We need to apply the same procedure that we do with any language construct that was deemed to have been a poor decision: put it through a deprecation cycle and fix it.

I think there are too large risks for that, and it's quite unclear this is solving a problem. "Slightly better Unicode support" is hardly a good justification. Andrei
Mar 08 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 1:13 PM, Vladimir Panteleev wrote:
 On Saturday, 8 March 2014 at 20:50:49 UTC, Andrei Alexandrescu wrote:
 On 3/8/14, 12:38 PM, Vladimir Panteleev wrote:
 On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote:
 That sounds quite like C++ plus ICU. It doesn't strike me as the
 golden standard for Unicode integration.

Why not? Because it sounds like D needs exactly that. Plus its amazing slicing and range capabilities, of course.

Pretty much everyone using ICU hates it.

I admit I never used it personally.

Time to do due diligence :o).
 I just thought you meant that
 implied "D implementations of relevant Unicode algorithms, adapted to D
 style (range interface)". Is there more to this than the limitations of
 C++ or the implementers' design choices?

 Have you or anyone you personally know tried to process text in D
 containing a writing system such as Sanskrit's?

No. Point being?

Point being, we don't have solid data to conclude whether D's current approach is actually good enough for such cases as you claim.

My only claim is that recognizing and iterating strings by code point is better than doing things by the octet.
 We do have one post in this thread:
 http://forum.dlang.org/post/jlgfkxlrhlzdpwkpsrot forum.dlang.org

 I think there are too large risks for that,

For what? We have not discussed a possible plan yet. Are you referring to Walter Bright's proposal?

Any plan to inflict a large breaking change for strings incurs a risk. To add insult to injury, the improvement brought about by the change is debatable.
 and it's quite unclear this is solving a problem. "Slightly better
 Unicode support" is hardly a good justification.

What this will solve: 1. Eliminating dangerous constructs, such as s.countUntil and s.indexOf both returning integers, yet possibly having different values in circumstances that the developer may not foresee.

I disagree there's any danger. They deal in code points, end of story.
 2. Very high complexity of implementations (the ElementEncodingType
 problem previously mentioned).

I disagree with "very high". Besides if you want to do Unicode you gotta crack some eggs.
 3. Hidden, difficult-to-detect performance problems. The reason why this
 thread was started. I've had to deal with them in several places myself.

I disagree with "hidden, difficult to detect". Also I'd add that I'd rather not have hidden, difficult to detect correctness problems.
 4. Encourage D programmers to write Unicode-capable code that is correct
 in the full sense of the word.

I disagree we are presently discouraging them. I do agree a change would make certain things clearer. But not enough to nearly make up for the breakage.
 I think the above list has enough weight to merit at least considering
 *some* breaking changes.

I think a better approach is to figure what to add. Andrei
Mar 08 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 4:42 PM, Vladimir Panteleev wrote:
 On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu wrote:
 My only claim is that recognizing and iterating strings by code point
 is better than doing things by the octet.

Considering or disregarding the disadvantages of this choice?

Doing my best to weigh everything with the right measures.
 1. Eliminating dangerous constructs, such as s.countUntil and s.indexOf
 both returning integers, yet possibly having different values in
 circumstances that the developer may not foresee.

I disagree there's any danger. They deal in code points, end of story.

Perhaps I did not explain clearly enough.

    auto pos = s.countUntil(sub);
    writeln(s[pos..$]);

This will compile, and work for English text. For someone without complete knowledge of Phobos functions and how D handles Unicode, it is not obvious that this code is actually wrong.

I agree. At a point or another, the dual nature of strings (and dual means to iterate them) will cause trouble for the unwary.
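
To make the mismatch concrete, here is a small sketch. The sample strings are assumptions picked for illustration; the calls are the existing countUntil from std.algorithm and indexOf from std.string:

```d
import std.algorithm : countUntil;
import std.string : indexOf;

void main()
{
    string s = "Привет, world";   // six two-byte Cyrillic letters, then ASCII
    string sub = "world";

    auto cp = s.countUntil(sub);  // counts decoded code points
    auto cu = s.indexOf(sub);     // index into the UTF-8 code unit array

    assert(cp == 8);
    assert(cu == 14);

    assert(s[cu .. $] == "world");      // slicing wants the code unit index
    assert(s[cp .. $] == "ет, world");  // the code point count slices too early
}
```

Both results are integers of the same type, so nothing stops the code point count from being used as a slice index.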
 In certain situations,
 this can have devastating effects: consider, for example, if this code
 is extracting a slice from a string that elsewhere contains sensitive
 data (e.g. a configuration file containing, among other data, a
 password).

Whaaa, passwords in clear?
 An attacker could supply an Unicode string where the
 developer did not expect it, thus causing "pos" to have a smaller value
 than the corresponding indexOf result, thus revealing a slice of "s"
 which was not intended to be visible. Thus, a developer currently needs
 to tread very carefully wherever he is slicing strings, so as to not
 accidentally use indices obtained from functions that count code points.

Okay, though when you opened with "devastating" I was hoping for nothing short of death and dismemberment. Anyhow the fix is obvious per this brief tutorial: http://www.youtube.com/watch?v=hkDD03yeLnU
 2. Very high complexity of implementations (the ElementEncodingType
 problem previously mentioned).

I disagree with "very high".

I'm quite sure that std.range and std.algorithm will lose a LOT of weight if they were fixed to not treat strings specially.

I'm not so sure. Most of the string-specific optimizations simply detect certain string cases and forward them to array algorithms that need be written anyway. You would, indeed, save a fair amount of isSomeString conditionals and stuff (thus simplifying on scaffolding), but probably not a lot of code. That's not useless work - it'd go somewhere in any design.
 Besides if you want to do Unicode you gotta crack some eggs.

No, I can't see how this justifies the choice. An explicit decoding range would have simplified things greatly while offering much of the same advantages.

My point there is that there's no useless or duplicated code that would be thrown away. A better design would indeed make for better modular separation - would be great if the string-related optimizations in std.algorithm went elsewhere. They wouldn't disappear.
 Whether the fact that it is there "by default" an advantage of the
 current approach at all is debatable.

Clearly. If I'd do things over again, I'd definitely change a thing or two. (I wouldn't go with Walter's proposal, which I think is worse than what we have now.) But the current approach has something very difficult to talk away: it's there. And that makes a whole lotta difference. Do I believe it's perfect? Hell no. Does it blunt much of the point of this debate? I'm afraid so.
 3. Hidden, difficult-to-detect performance problems. The reason why this
 thread was started. I've had to deal with them in several places myself.

I disagree with "hidden, difficult to detect".

Why? You can only find out that an algorithm is slower than it needs to be via either profiling (at which point you're wondering why the #$% the thing is so slow), or feeding it invalid UTF. If you had made a different choice for Unicode in D, this problem would not exist altogether.

Disagree.
 Also I'd add that I'd rather not have hidden, difficult to detect
 correctness problems.

Except we already do. Arguments have already been presented in this thread that demonstrate correctness problems with the current approach. I don't think that these can stand up to the problems that the simpler by-char iteration approach would have.

Sure there are, and you yourself illustrated a misuse of the APIs. My point is: code point is better than code unit and not all that much slower. Grapheme is better than code point but a lot slower. It seems we're quite in a sweet spot here wrt performance/correctness.
 4. Encourage D programmers to write Unicode-capable code that is correct
 in the full sense of the word.

I disagree we are presently discouraging them.

I did not say we are. The problem is that we aren't encouraging them either - we are instead setting an example of how to do it in a wrong (incomplete) way.

Code unit is what it is. Those programming for natural languages for which code units are not sufficient would need to exercise due diligence. We ought to help them without crippling efficiency.
 I do agree a change would make certain things clearer.

I have an issue with all the counter-arguments presented in this thread being shoved behind the one word "clearer".

What is the issue you are having? I don't see a much better API being proposed. I see a marginally improved API at the very best, and possibly quite a bit more prone to error.
 But not enough to nearly make up for the breakage.

I would still like to go ahead with my suggestion to attempt some possible changes without releasing them. I'm going to try them with my own programs first to see how much it will break.

I think that's great.
 I believe that you are
 too eagerly dismissing all proposals without even evaluating them.

Perspective is everything, isn't it :o). I thought I'm being reasonable and accepting in discussing a number of proposed points, although in my heart of hearts many arguments seem rather frivolous. With what has been put forward so far, that's not even close to justifying a breaking change. If that great better design is just get back to code unit iteration, the change will not happen while I work on D. It is possible, however, that a much better idea comes forward, and I'd be looking forward to such.
 I think the above list has enough weight to merit at least considering
 *some* breaking changes.

I think a better approach is to figure what to add.

This is obvious:
- more Unicode algorithms (normalization, segmentation, etc.)
- better documentation

I was thinking of these too:

1. Revisit std.encoding and perhaps confer legitimacy to the character types defined there. The implementation in std.encoding is wanting, but I think the idea is sound. Essentially give more love to various encodings, including Ascii and "bypass encoding, I'll deal with stuff myself".

2. Add byChar that returns a random-access range iterating a string by character. Add byWchar that does on-the-fly transcoding to UTF16. Add byDchar that accepts any range of char and does decoding. And such stuff. Then whenever one wants to go through a string by code point can just use str.byChar.

Andrei
Mar 08 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 6:14 PM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 01:23:27 UTC, Andrei Alexandrescu wrote:
 On 3/8/14, 4:42 PM, Vladimir Panteleev wrote:
 My point there is that there's no useless or duplicated code that
 would be thrown away. A better design would indeed make for better
 modular separation - would be great if the string-related
 optimizations in std.algorithm went elsewhere. They wouldn't disappear.

Why? Isn't the whole issue that std.range presents strings as dchar ranges, and std.algorithm needs to detect dchar ranges and then treat them as char arrays? As opposed to std.algorithm just detecting arrays and treating them all as arrays (which it should be doing now anyway)?

That's scaffolding, not actual executable code.
 Why? You can only find out that an algorithm is slower than it needs to
 be via either profiling (at which point you're wondering why the  #$%
 the thing is so slow), or feeding it invalid UTF. If you had made a
 different choice for Unicode in D, this problem would not exist
 altogether.

Disagree.

Could you please elaborate? This is the second uninformative reply to this argument.

What can I say? The answer is obvious. It's not hard to figure for me. Performance of D's UTF strings has never been a mystery to me. From where I stand all this "hidden, difficult-to-detect performance problems" drama is just posturing. We'd do well to wean such out of the discussion. No myriad of bug reports "D strings are awfully slow" on bugzilla. No long threads "Why are D strings so slow" on stack overflow. No trolling on reddit or hackernews "D? Just look at their strings. How could anyone think that's a good idea lol." And it's not like people aren't talking. In contrast, D has been (and often rightly) criticized in the past for things like floating point performance and garbage collection. No evidence we are having an acute performance problem with UTF strings.
 Sure there are, and you yourself illustrated a misuse of the APIs.

If UTF decoding was explicit, the problem would stand out. I don't think this is a valid argument.

Yours? Indeed isn't, if what you want is iterate by code unit (= meaningless for all but ASCII strings) by default.
 My point is: code point is better than code unit

This was debated... people should not be looking at individual code points, unless they really know what they're doing.

Should they be looking at code units instead?
 Grapheme is better than code point but a lot slower.

We are going in circles. People should have very good reasons for looking at individual graphemes as well.

And it's good we have increasing support for graphemes. I don't think they should be the default.
 It seems we're quite in a sweet spot here wrt performance/correctness.

This does not seem like an objective summary of this thread's arguments so far.

What is an objective summary? Those who want to inflict massive breakage are not even done arguing we have a better design.
 I guess I'll get working on that wiki page to organize the arguments.
 This discussion is starting to feel like a quicksand roundabout.

That's great. Yes, we're exchanging jabs right now which is not our best use of time. Also in the interest of time, please understand you'd need to show the second coming if you want to break backward compatibility. Additions are a much better path.
 With what has been put forward so far, that's not even close to
 justifying a breaking change. If that great better design is just get
 back to code unit iteration, the change will not happen while I work
 on D. It is possible, however, that a much better idea comes forward,
 and I'd be looking forward to such.

Actually, could you post some examples of real-world code that would be broken by a hypothetical sudden switch? I think I would be hard-pressed to find some in my own code, but I'd need to check for sure to find out.

I'm afraid burden of proof is on you. Far as I'm concerned every breakage of string processing is unacceptable or at least very undesirable.
 2. Add byChar that returns a random-access range iterating a string by
 character. Add byWchar that does on-the-fly transcoding to UTF16. Add
 byDchar that accepts any range of char and does decoding. And such
 stuff. Then whenever one wants to go through a string by code point
 can just use str.byChar.

This is confusing. Did you mean to say that byChar iterates a string by code unit (not character / code point)?

Unit. s.byChar.front is a (possibly ref, possibly qualified) char. Andrei
Mar 08 2014
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 7:53 PM, Vladimir Panteleev wrote:
  From my POV, I could say I see consensus, with just you defending a
 decision you made a while ago :) But I'd prefer a constructive discussion.

What exactly is the consensus? From your wiki page I see "One of the proposals in the thread is to switch the iteration type of string ranges from dchar to the string's character type." I can tell you straight out: That will not happen for as long as I'm working on D. I'm ready to fight on this not only Walter Bright, but him and Walter White together. (Fortunately the former agrees the breakage is too large; haven't asked the latter yet.)
 Anyway, I don't want to "inflict massive breakage" either. I want the
 amount of breakage to be a justified cost of fixing a mistake and
 permanently improving the language's design going forward.

It seems you and I have a different view of the tradeoffs involved.
 In all seriousness, at this point I'm worried that you will defend the
 status quo even if the breakage turns out minimal. Instead of dealing
 with absolutes, advantages and disadvantages should be weighed against
 another (even with the breaking-backwards-compatibility penalty being
 very high).

Of course. If you come with something better, I'd be glad to take a look.
 Unit. s.byChar.front is a (possibly ref, possibly qualified) char.

So... does byChar for wstrings do the same thing as byWchar?

No, it transcodes from UTF16 to UTF8.
 And what if
 you want to iterate a wstring by char?

byChar.
 Wouldn't it be better to have
 byChar/byWchar/byDchar be a range of char/wchar/dchar regardless of the
 string type

that's right
, and have byCodeUnit which iterates by the code unit type?

We must add that too. I agree the resulting design is roundabout (you have char[] which is by default iterated by code point, and you need to wrap it to get to its units that were there in the first place). I also wanted to add some ASCII string love (by ascribing it a separate type) but Walter has good arguments opposing that. Andrei
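
A sketch of how such wrappers behave, using the names under which ranges of this kind exist in std.utf today (byCodeUnit, byChar, byWchar, byDchar). Whether they match the design being sketched in this post in every detail is an assumption on my part; the sample strings are illustrative only:

```d
import std.range : walkLength;
import std.utf : byChar, byCodeUnit, byDchar, byWchar;

void main()
{
    string  s = "Привет";   // 6 code points, 12 UTF-8 code units
    wstring w = "Привет"w;  // 6 code points, 6 UTF-16 code units

    assert(s.walkLength == 6);             // default iteration decodes
    assert(s.byCodeUnit.walkLength == 12); // the units that were there all along
    assert(w.byCodeUnit.walkLength == 6);

    assert(w.byChar.walkLength == 12);     // wstring transcoded to UTF-8 on the fly
    assert(s.byWchar.walkLength == 6);     // string transcoded to UTF-16 on the fly
    assert(s.byDchar.walkLength == 6);     // explicit decoding

    char first = s.byCodeUnit.front;       // a code unit, not a code point
    assert(first == 0xD0);                 // lead byte of 'П' (U+041F)
}
```

The element type is fixed by the wrapper name rather than by the type of the string being wrapped, which is the property asked for above.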
Mar 08 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 8:24 PM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu wrote:
 What exactly is the consensus? From your wiki page I see "One of the
 proposals in the thread is to switch the iteration type of string
 ranges from dchar to the string's character type."

 I can tell you straight out: That will not happen for as long as I'm
 working on D.

Why?

From the cycle "going in circles": because I think the breakage is way too large compared to the alleged improvement. In fact I believe that that design is inferior to the current one regardless. Andrei
Mar 08 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 8:18 AM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 05:10:26 UTC, Andrei Alexandrescu wrote:
 On 3/8/14, 8:24 PM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu wrote:
 What exactly is the consensus? From your wiki page I see "One of the
 proposals in the thread is to switch the iteration type of string
 ranges from dchar to the string's character type."

 I can tell you straight out: That will not happen for as long as I'm
 working on D.

Why?

From the cycle "going in circles": because I think the breakage is way too large compared to the alleged improvement.

All right. I was wondering if there was something more fundamental behind such an ultimatum.

It's just factual information with no drama attached (i.e. I'm not threatening to leave the language, just plainly explain I'll never approve that particular change).

That said a larger explanation is in order. There have been cases in the past when our community has worked itself in a froth over a non-issue and ultimately caused a language change imposed by "the faction that shouted the loudest". The "lazy" keyword and recently the "virtual" keyword come to mind as cases in which the language leadership has been essentially annoyed into making a change it didn't believe in.

I am all about listening to the community's needs and desires. But at some point there is a need to stick to one's guns in matters of judgment call. See e.g. https://d.puremagic.com/issues/show_bug.cgi?id=11837 for a very recent example in which reasonable people may disagree but at some point you can't choose both options.

What we now have works as intended. As I mentioned, there is quite a bit more evidence the design is useful to people, than detrimental. Unicode is all about code points. Code units are incidental to each encoding. The fact that we recognize code points at language and library level is, in my opinion, a Good Thing(tm). I understand that doesn't reach the ninth level of Nirvana and there are still issues to work on, and issues where good-looking code is actually incorrect. But I think we're overall in good shape.

A regression from that to code unit level would be very destructive. Even a clear slight improvement that breaks backward compatibility would be destructive.

So I wanted to limit the potential damage of this discussion. It is made only a lot more dangerous that Walter himself started it, something that others didn't fail to tune into. The sheer fact that we got to contemplate an unbelievably massive breakage on no other evidence than one misuse case and for the sake of possibly an illusory improvement - that's a sign we need to grow up. We can't go like this about changing the language and aim to play in the big leagues.
 In fact I believe that that design is inferior to the current one
 regardless.

I was hoping we could come to an agreement at least on this point.

Sorry to disappoint.
 ---

 BTW, a thought struck me while thinking about the problem yesterday.

 char and dchar should not be implicitly convertible between one another,
 or comparable to the other.

I think only the char -> dchar conversion works, and I can see arguments against it. Also comparison of char with dchar is dicey. But there are also cases in which it's legitimate to do that (e.g. assign ASCII chars etc) and this would be a breaking change. One good way to think about breaking changes is "if this change were executed to perfection, how much would that improve the overall quality of D?" Because breakages _are_ "overall" - users don't care whether they come from this or the other part of the type system. Really puts things into perspective.
 void main()
 {
      string s = "Привет";
      foreach (c; s)
          assert(c != 'Ñ');
 }

 Instead, std.conv.to should allow converting between character types,
 iff they represent one whole code point and fit into the destination
 type, and throw an exception otherwise (similar to how it deals with
 integer overflow). Char literals should be special-cased by the compiler
 to implicitly convert to any sufficiently large type.

 This would break more[1] code, but it would avoid the silent failures of
 the earlier proposal.

 [1] I went through my own larger programs. I actually couldn't find any
 uses of dchar which would be impacted by such a hypothetical change.

Generally I think we should steer away from slight improvements of the language at the cost of breaking existing code. Instead, we must think of ways to improve the language without the breakage. You may want to pursue (bugzilla + pull request) adding the std.conv routines with the semantics you mentioned. Andrei
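
For concreteness, a sketch of both halves of the quoted idea: the silent char/dchar comparison, and a checked conversion in the spirit of what was suggested for std.conv. The helper toCodeUnit is hypothetical and not part of Phobos; it only illustrates the "throw if it does not fit" semantics for the dchar to char direction:

```d
import std.exception : assertThrown;

// Hypothetical helper, not a Phobos function: converts a code point to a
// single UTF-8 code unit, throwing when the code point needs more than one.
char toCodeUnit(dchar c)
{
    if (c > 0x7F)
        throw new Exception("code point does not fit in one UTF-8 code unit");
    return cast(char) c;
}

void main()
{
    string s = "Привет";
    bool matched = false;
    foreach (c; s)          // c is inferred as char, i.e. one UTF-8 code unit
        if (c == 'Ñ')       // compared numerically against U+00D1
            matched = true;

    // 0xD1 is the lead byte of 'р' and 'т', so the comparison succeeds even
    // though the string contains no 'Ñ'.
    assert(matched);

    assert(toCodeUnit('a') == 'a');
    assertThrown(toCodeUnit('Ñ'));
}
```

Such a routine is a pure addition: existing code keeps compiling, and code that opts in gets an exception instead of a silent mismatch.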
Mar 09 2014
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
 On 09/03/14 04:26, Andrei Alexandrescu wrote:
 2. Add byChar that returns a random-access range iterating a string by
 character. Add byWchar that does on-the-fly transcoding to UTF16. Add
 byDchar that accepts any range of char and does decoding. And such
 stuff. Then whenever one wants to go through a string by code point
 can just use str.byChar.

This is confusing. Did you mean to say that byChar iterates a string by code unit (not character / code point)?

Unit. s.byChar.front is a (possibly ref, possibly qualified) char.

So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about?

That is correct. Andrei
Mar 09 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 10:21 AM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote:
 On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
 So IIUC iterating over s.byChar would not encounter the decoding-related
 speed hits that Walter is concerned about?

That is correct.

Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar.

Good point. Off the top of my head I can't remember any algorithm that relies on array representation to do better on arrays than on random-access ranges offering all of arrays' primitives. But I'm sure there are a few. Andrei
Mar 09 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Mar-2014 21:45, Andrei Alexandrescu wrote:
 On 3/9/14, 10:21 AM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote:
 On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
 So IIUC iterating over s.byChar would not encounter the
 decoding-related
 speed hits that Walter is concerned about?

That is correct.

Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar.

Good point. Off the top of my head I can't remember any algorithm that relies on array representation to do better on arrays than on random-access ranges offering all of arrays' primitives. But I'm sure there are a few.

copy to begin with. And it's about 80x faster with plain arrays. -- Dmitry Olshansky
Mar 09 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 11:14 AM, Dmitry Olshansky wrote:
 09-Mar-2014 21:45, Andrei Alexandrescu wrote:
 On 3/9/14, 10:21 AM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote:
 On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
 So IIUC iterating over s.byChar would not encounter the
 decoding-related
 speed hits that Walter is concerned about?

That is correct.

Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar.

Good point. Off the top of my head I can't remember any algorithm that relies on array representation to do better on arrays than on random-access ranges offering all of arrays' primitives. But I'm sure there are a few.

copy to begin with. And it's about 80x faster with plain arrays.

Question is if there are a bunch of them. Andrei
Mar 09 2014
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Mar-2014 07:53, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu wrote:
 I don't understand this argument. Iterating by code unit is not
 meaningless if you don't want to extract meaning from each unit
 iteration. For example, if you're parsing JSON or XML, you only care
 about the syntax characters, which are all ASCII. And there is no
 confusion of "what exactly are we counting here".

 This was debated... people should not be looking at individual code
 points, unless they really know what they're doing.

Should they be looking at code units instead?

No. They should only be looking at substrings.

This. Anyhow searching dchar makes sense for _some_ languages, the problem is that it shouldn't decode the whole string but rather encode the needle properly and search that.

Basically the whole thread is about: how do I work efficiently (no-decoding) with UTF-8/UTF-16 in cases where it obviously can be done?

The current situation is bad in that it undermines writing decode-less generic code. One easily falls into auto-decode trap on first .front, especially when called from some standard algorithm. The algo sees char[]/wchar[] and gets into decode mode via some special case. If it would do that with _all_ char/wchar random access ranges it'd be at least consistent.

That and wrapping your head around 2 sets of constraints. The amount of code around 2 types - wchar[]/char[] is way too much, that much is clear.

-- Dmitry Olshansky
Mar 09 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 11:34 AM, Dmitry Olshansky wrote:
 09-Mar-2014 07:53, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu wrote:
 I don't understand this argument. Iterating by code unit is not
 meaningless if you don't want to extract meaning from each unit
 iteration. For example, if you're parsing JSON or XML, you only care
 about the syntax characters, which are all ASCII. And there is no
 confusion of "what exactly are we counting here".

 This was debated... people should not be looking at individual code
 points, unless they really know what they're doing.

Should they be looking at code units instead?

No. They should only be looking at substrings.

This. Anyhow searching dchar makes sense for _some_ languages, the problem is that it shouldn't decode the whole string but rather encode the needle properly and search that.

That's just an optimization. Conceptually what happens is we're looking for a code point in a sequence of code points.
 Basically the whole thread is about:
 how do I work efficiently (no-decoding) with UTF-8/UTF-16 in cases where
 it obviously can be done?

 The current situation is bad in that it undermines writing decode-less
 generic code.

s/undermines writing/makes writing explicit/
 One easily falls into auto-decode trap on first .front,
 especially when called from some standard algorithm. The algo sees
 char[]/wchar[] and gets into decode mode via some special case. If it
 would do that with _all_ char/wchar random access ranges it'd be at
 least consistent.

 That and wrapping your head around 2 sets of constraints. The amount of
 code around 2 types - wchar[]/char[] is way too much, that much is clear.

We're engineers so we should quantify. Ideally that would be as simple as "git grep isNarrowString|wc -l" which currently prints 42 of all numbers :o). Overall I suspect there are a few good simplifications we can make by using isNarrowString and .representation. Andrei
Mar 09 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Mar-2014 22:41, Andrei Alexandrescu wrote:
 On 3/9/14, 11:34 AM, Dmitry Olshansky wrote:
 This. Anyhow searching dchar makes sense for _some_ languages, the
 problem is that it shouldn't decode the whole string but rather encode
 the needle properly and search that.

That's just an optimization. Conceptually what happens is we're looking for a code point in a sequence of code points.

Yup. It's still not a good idea to introduce this in std.algorithm in a non-generic way.
 That and wrapping your head around 2 sets of constraints. The amount of
 code around 2 types - wchar[]/char[] is way too much, that much is clear.

We're engineers so we should quantify. Ideally that would be as simple as "git grep isNarrowString|wc -l" which currently prints 42 of all numbers :o).

Add to that some uses of isSomeString and ElementEncodingType. 138 and 80 respectively. And in most cases it means that nice generic code was hacked to care about 2 types in particular. That is what bothers me.
 Overall I suspect there are a few good simplifications we can make by
 using isNarrowString and .representation.

Okay putting potential breakage aside. Let me sketch up an additive way of improving current situation.

1. Say we recognize any indexable entity of char/wchar/dchar, that however has .front returning a dchar as a "narrow string". Nothing fancy - it's just a generalization of isNarrowString. At least a range over Array!char will work as string now.

2. Likewise representation must be made something more explicit say byCodeUnit and work on any isNarrowString per above. The opposite of that is byCodePoint.

3. ElementEncodingType is too verbose and misleading. Something more explicit would be useful. ItemType/UnitType maybe?

4. We lack lots of good stuff from Unicode standard. Some recently landed in std.uni. We need many more, and deprecate crappy ones in std.string. (e.g. wrapping text is one)

5. Most algorithms conceptually decode, but may be enhanced to work directly on UTF-8/UTF-16. That together with 1, should IMHO solve most of our problems.

6. Take into account ASCII and maybe other alphabets? Should be as trivial as .assumeASCII and then on you march with all of std.algo/etc.

-- Dmitry Olshansky
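For illustration of point 2, here is a small sketch of an explicit code-unit view. It assumes a Phobos release that ships `std.utf.byCodeUnit` (a wrapper added after this thread; the `byCodePoint` name proposed above is not used here). The function name `countAscii` is hypothetical.

```d
import std.algorithm : count;
import std.range : ElementType;
import std.utf : byCodeUnit;

// With auto-decoding, the range element type of a string is dchar ...
static assert(is(ElementType!string == dchar));
// ... while the byCodeUnit wrapper yields plain chars, so nothing is decoded.
static assert(is(ElementType!(typeof("".byCodeUnit())) : char));

/// Hypothetical helper: counts an ASCII syntax character (for example ','
/// in JSON) by scanning code units only. This is safe because bytes below
/// 0x80 never occur inside a multi-byte UTF-8 sequence.
size_t countAscii(string text, char needle)
{
    assert(needle < 0x80, "needle must be ASCII");
    return text.byCodeUnit.count(needle);
}
```

The same call chain without `.byCodeUnit` would give the same count for an ASCII needle, but it would decode every code point of `text` along the way.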
Mar 09 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 12:25 PM, Dmitry Olshansky wrote:
 Okay putting potential breakage aside.
 Let me sketch up an additive way of improving current situation.

Now you're talking.
 1. Say we recognize any indexable entity of char/wchar/dchar, that
 however has .front returning a dchar as a "narrow string". Nothing fancy
 - it's just a generalization of isNarrowString. At least a range over
 Array!char will work as string now.

Wait, why is dchar[] a narrow string?
 2. Likewise representation must be made something more explicit say
 byCodeUnit and work on any isNarrowString per above. The opposite of
 that is byCodePoint.

Fine.
 3. ElementEncodingType is too verbose and misleading. Something more
 explicit would be useful. ItemType/UnitType maybe?

We're stuck with that name.
 4. We lack lots of good stuff from Unicode standard. Some recently
 landed in std.uni. We need many more, and deprecate crappy ones in
 std.string. (e.g. wrapping text is one)

Add away.
 5. Most algorithms conceptually decode, but may be enhanced to work
 directly on UTF-8/UTF-16. That together with 1, should IMHO solve most
 of our problems.

Great!
 6. Take into account ASCII and maybe other alphabets? Should be as
 trivial as .assumeASCII and then on you march with all of std.algo/etc.

Walter is against that. His main argument is that UTF already covers ASCII with only a marginal cost (that can be avoided) and that we should go farther into the future instead of catering to an obsolete representation. Andrei
Mar 09 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Mar-2014 23:40, Andrei Alexandrescu wrote:
 On 3/9/14, 12:25 PM, Dmitry Olshansky wrote:
 Okay putting potential breakage aside.
 Let me sketch up an additive way of improving current situation.

Now you're talking.
 1. Say we recognize any indexable entity of char/wchar/dchar, that
 however has .front returning a dchar as a "narrow string". Nothing fancy
 - it's just a generalization of isNarrowString. At least a range over
 Array!char will work as string now.

Wait, why is dchar[] a narrow string?

Indeed `...entity of char/wchar/dchar` --> `...entity of char/wchar`.
 3. ElementEncodingType is too verbose and misleading. Something more
 explicit would be useful. ItemType/UnitType maybe?

We're stuck with that name.

Too bad, but we have renamed imports... if only they worked correctly. But let's not derail. [snip] Great, so this may be turned into a smallish DIP or bugzilla enhancements.
 6. Take into account ASCII and maybe other alphabets? Should be as
 trivial as .assumeASCII and then on you march with all of std.algo/etc.

Walter is against that. His main argument is that UTF already covers ASCII with only a marginal cost

He certainly doesn't have things like case-insensitive matching or collation on his list. Some cute tables are what "directly to the UTF" algorithms require for almost anything beyond simple-minded "find me a substring". Walter certainly would have a different stance the moment he observes the extra bulk of object code for these.
 (that can be avoided)

How? I'm not talking about `x < 0x80` branches, these wouldn't cost a dime. I really don't feel strongly about the 6th point. I see it as a good idea to allow custom alphabets and reap performance benefits where it makes sense, the need for that is less urgent though.
 and that we should
 go farther into the future instead of catering to an obsolete
 representation.

That is something I agree with. -- Dmitry Olshansky
Mar 09 2014
prev sibling next sibling parent Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-09 13:00:45 +0000, "monarch_dodra" <monarchdodra gmail.com> said:

 AFAIK, the most common algorithm "case insensitive search" *must* decode.

Not necessarily. While the Unicode collation algorithms (which should be used to compare text) are defined in terms of code points, you could build a collation element table using code units as keys and bypass the decoding step for searching the table. I'm not sure if there would be a significant performance gain though. That remains an optimization. The natural way to implement a Unicode algorithm is to base it on code points. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Mar 09 2014
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 4:34 AM, Peter Alexander wrote:
 I think this is the main confusion: the belief that iterating by code
 point has utility.

 If you care about normalization then neither by code unit, by code
 point, nor by grapheme are correct (except in certain language subsets).

I suspect that code point iteration is the worst as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code unit iteration that works with a larger spectrum of languages. One question would be how large that spectrum is. If it's larger than English, then that would be nice because we would've made progress. I don't know about normalization beyond discussions in this group, but as far as I understand from http://www.unicode.org/faq/normalization.html, normalization would be a one-step process, after which code point iteration would cover still more human languages. No? I'm pretty sure it's more complicated than that, so please illuminate me :o).
 If you don't care about normalization then by code unit is just as good
 as by code point, but you don't need to specialise everywhere in Phobos.

 AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
 but as Vladimir correctly points out: (a) by code point, this is still
 broken in the face of normalization, and (b) are there any real
 applications that search a string for a specific non-ASCII character?

What happened to counting characters and such?
 To those that think the status quo is better, can you give an example of
 a real-life use case that demonstrates this?

split(ter) comes to mind.
 I do think it's probably too late to change this, but I think there is
 value in at least getting everyone on the same page.

Awesome. Andrei
Mar 09 2014
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 10:34 AM, Peter Alexander wrote:
 If we assume strings are normalized then substring search, equality
 testing, sorting all work the same with either code units or code points.

But others such as edit distance or equal(some_string, some_wstring) will not.
 If you don't care about normalization then by code unit is just as good
 as by code point, but you don't need to specialise everywhere in Phobos.

 AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
 but as Vladimir correctly points out: (a) by code point, this is still
 broken in the face of normalization, and (b) are there any real
 applications that search a string for a specific non-ASCII character?

What happened to counting characters and such?

I can't think of any case where you would want to count characters.

wc

(Generally: I've always been very very very doubtful about arguments that start with "I can't think of..." because I've historically tried them so many times, and with terrible results.) Andrei
Mar 09 2014
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 11:19 AM, Peter Alexander wrote:
 On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote:
 On 3/9/14, 10:34 AM, Peter Alexander wrote:
 If we assume strings are normalized then substring search, equality
 testing, sorting all work the same with either code units or code
 points.

But others such as edit distance or equal(some_string, some_wstring) will not.

equal(string, wstring) should either not compile, or would be overloaded to do the right thing.

These would be possible designs each with its pros and cons. The current design works out of the box across all encodings. It has its own pros and cons. Puts in perspective what should and shouldn't be.
 In an ideal world, char, wchar, and dchar should
 not be comparable.

Probably. But that has nothing to do with equal() working.
 Edit distance on code points is of questionable utility. Like Vladimir
 says, its meaning is pretty philosophical, even in ASCII (is "\r\n"
 really two "edits"? What is an "edit"?)

Nothing philosophical - it's as cut and dried as it gets. An edit is as defined by the Levenshtein algorithm using code points as the unit of comparison.
 I can't think of any case where you would want to count characters.

wc

% echo € | wc -c
4

:-)

Noice.
 (Generally: I've always been very very very doubtful about arguments
 that start with "I can't think of..." because I've historically tried
 them so many times, and with terrible results.)

Fair point... but it's not as if we would be removing the ability (you could always do s.byCodePoint.count); we are talking about defaults. The argument that we shouldn't iterate by code unit by default because people might want to count code points is without substance. Also, with the proposal, string.count(dchar) would encode the dchar to a string first for performance, so it would still work.

That's a good enhancement for the current design as well - care to submit a request for it?
 Anyway, I think this discussion isn't really going anywhere so I think
 I'll agree to disagree and retire.

The part that advocates a breaking change will indeed not lead anywhere. The parts where we improve Unicode support for D are very fertile. Andrei
Mar 09 2014
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Mar-2014 21:54, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote:
 wc

What should wc produce on a Sanskrit text? The problem is that such questions quickly become philosophical.

Technically it could use a word-breaking algorithm for words. Or count grapheme clusters, or count code points; it all may have value, depending on the user and writing system. -- Dmitry Olshansky
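To make the three possible "counts" concrete, here is a small sketch using existing Phobos calls (`walkLength` from std.range and `byGrapheme` from std.uni). The three function names are hypothetical and only serve the illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Code units: what `wc -c` counts for UTF-8 input (bytes).
size_t countCodeUnits(string s)  { return s.length; }

/// Code points: what auto-decoding iteration over a string counts.
size_t countCodePoints(string s) { return s.walkLength; }

/// Grapheme clusters: closer to what a reader perceives as "characters".
size_t countGraphemes(string s)  { return s.byGrapheme.walkLength; }
```

For the string "e\u0301\u20AC" ('e' followed by a combining acute accent, then the euro sign) these return 6, 3 and 2 respectively, so which number is "the" character count depends on the question being asked.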
Mar 09 2014
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Mar-2014 21:16, Andrei Alexandrescu wrote:
 On 3/9/14, 4:34 AM, Peter Alexander wrote:
 I think this is the main confusion: the belief that iterating by code
 point has utility.

 If you care about normalization then neither by code unit, by code
 point, nor by grapheme are correct (except in certain language subsets).

I suspect that code point iteration is the worst as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code unit iteration that works with a larger spectrum of languages.

Was clearly meant to be: code point <--> code unit
 One
 question would be how large that spectrum is. If it's larger than
 English, then that would be nice because we would've made progress.

Code points help only in so far that many (~all) high-level algorithms in Unicode are described in terms of code points. Code points have properties; code units do not have anything. Code points with assigned semantic value are "abstract characters". It's up to the programmer to implement a particular algorithm to make it "as if" decoding really happened, either working directly on code units or decoding and working with code points, which is simpler. The current std.uni offering mostly works on code points and decodes; a crucial building block for working directly on code units is in review: https://github.com/D-Programming-Language/phobos/pull/1685
 I don't know about normalization beyond discussions in this group, but
 as far as I understand from
 http://www.unicode.org/faq/normalization.html, normalization would be a
 one-step process, after which code point iteration would cover still
 more human languages. No? I'm pretty sure it's more complicated than
 that, so please illuminate me :o).

Technically most apps just assume, say, "input comes in UTF-8 that is in normalization form C". Others, such as browsers, strive to get a uniform representation for any input and normalize it (oftentimes normalization turns out to be just a no-op).
 If you don't care about normalization then by code unit is just as good
 as by code point, but you don't need to specialise everywhere in Phobos.

 AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
 but as Vladimir correctly points out: (a) by code point, this is still
 broken in the face of normalization, and (b) are there any real
 applications that search a string for a specific non-ASCII character?

What happened to counting characters and such?

Counting chars is dubious. But, for instance, collation is defined in terms of code points. Regex pattern matching is _defined_ in terms of code points (even the mystical level 3 Unicode support of it). So there is certain merit to work at that level. But hacking it to be this way isn't the way to go. The least intrusive change would be to generalize the current choice w.r.t. RA ranges of char/wchar. -- Dmitry Olshansky
Mar 09 2014
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 9:02 AM, bearophile wrote:
 Time ago I have even asked for a helper function:
 https://d.puremagic.com/issues/show_bug.cgi?id=10162

I commented on that and preapproved it. Andrei
Mar 09 2014
prev sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/9/2014 11:27 AM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
 On topic, I think D's implicit default decode to dchar is *infinity*
 times better than C++'s char-based strings. While imperfect in terms
 of grapheme, it was still a design decision made of win.

Care to elaborate?

It's simple: Breaking things on all non-English languages is worse than breaking things on non-western[1] languages. It is still breakage, and that *is* bad, but there's no question which breakage is significantly larger. [1] (And yes, I realize "western" is a gross over-simplification here. Point is "one working language" vs "several working languages".)
Mar 10 2014
prev sibling next sibling parent "Sean Kelly" <sean invisibleduck.org> writes:
I'll admit that I'm probably not the best person to make 
suggestions here. As a back-end programmer, a large portion of my 
work is dealing with text streams of various types. And the data 
I work with is in any number of encodings and none can be assumed 
to be in English. But literally all of my work is either parsing 
protocols where the symbols are single byte and so the C way is 
appropriate, or they are with blocks of text where I basically 
never work at the per character level. In fact I can think of 
only one case--trimming a block of text for display in a small 
frame. And there I use an explicit routine for trimming to a 
specific number of Unicode characters.

So regarding std.algorithm, I couldn't use it because I need to 
be able to slice based on the result. Knowing the number of 
multibyte code points between the beginning of the string and the 
thing I was searching for is utterly useless. Also, the 
performance is way too bad to make it a consideration.

But you're right. I was being dramatic when I called it utterly 
broken. It's simply not useful to me as-is. The solution for me 
is fairly simple though if inelegant--cast the string to an array 
of ubyte. Having both options is nice I suppose. I just can't 
comment on the utility of the default behavior because I can't 
imagine a use for it.
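A sketch of the kind of explicit trimming routine described above, assuming "Unicode characters" means code points and the input is valid UTF-8. The name `trimToCodePoints` is hypothetical; `std.utf.stride` is an existing Phobos function that returns the length in code units of the sequence starting at a given index.

```d
import std.utf : stride;

/// Hypothetical helper: returns the longest prefix of `s` that holds at
/// most `maxCodePoints` code points. It walks lead bytes only, never
/// decodes a value, and the result is a plain slice of the input.
string trimToCodePoints(string s, size_t maxCodePoints)
{
    size_t i = 0;                      // code-unit index, usable for slicing
    while (maxCodePoints-- > 0 && i < s.length)
        i += stride(s, i);             // skip one whole encoded sequence
    return s[0 .. i];
}
```

Since the loop tracks a code-unit index, the result can be sliced without a second pass, which is the property the post says the code-point counts from std.algorithm lack.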
Mar 08 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu 
wrote:
 Searching for characters in strings would be difficult to deem 
 inappropriate.

The notion of "character" exists only in certain writing systems. It is thus a flawed practice, and I think it should not be encouraged, as it will only make writing truly-international software more difficult. A more correct approach is searching for a certain substring. If non-exact matching is needed (normalization, case insensitivity etc.), then the appropriate solution is to use the Unicode algorithms. If you look at the situation from this point of view, single code points become merely an implementation detail.
 1. All algorithms would by default operate on strings at 
 char/wchar level (i.e. code unit). That would cause the usual 
 issues and confusions I was aware of from C++. Certain 
 algorithms would require specialization and/or the user using 
 byDchar for correctness.

As previously discussed, "correctness" here is conditional. I would not use that word, it is another extreme.
 From experience with C++ I knew (1) had a bad track record, and 
 (2) "generically conservative, specialize for speed" was a 
 successful pattern.

 What would you have chosen given that context?

Ideally, we would have the Unicode algorithms in the standard library from day 1, and advocated their use throughout the documentation.
 I'm inclined to say that the correct approach is to
 state that algorithms operate explicitly on a T.sizeof basis 
 and that if
 the data contained in a particular range has some 
 multi-element encoding
 then separate, specialized routines should be used with the 
 T.sizeof
 behavior will not produce the desired result.

That sounds quite like C++ plus ICU. It doesn't strike me as the golden standard for Unicode integration.

Why not? Because it sounds like D needs exactly that. Plus its amazing slicing and range capabilities, of course.
 So the problem to me is that we're stuck not fixing something 
 that's
 horribly broken just because it's broken in a way that people 
 presumably
 now expect.

Clearly I'm being subjective here but again I'd find it difficult to get convinced we have something horribly broken from the evidence I gathered inside and outside Facebook.

Have you or anyone you personally know tried to process text in D containing a writing system such as Sanskrit's?
 I'd personally like to see this fixed and I think the new 
 behavior is
 preferable overall, but I do share Andrei's concern that such 
 a big
 change might hurt the language anyway.

I've said this once and I'm saying it again: the best way to convert this discussion into something useful is to devise ideas for useful non-breaking additions.

I disagree. As I've argued, I believe that currently most uses of dchars in an application are incorrect, and ultimately a time bomb for proper internationalization support. We need to apply the same procedure that we do with any language construct that was deemed to have been a poor decision: put it through a deprecation cycle and fix it.
Mar 08 2014
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Mar 08, 2014 at 08:38:40PM +0000, Vladimir Panteleev wrote:
 On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu
 wrote:
Searching for characters in strings would be difficult to deem
inappropriate.

The notion of "character" exists only in certain writing systems. It is thus a flawed practice, and I think it should not be encouraged, as it will only make writing truly-international software more difficult. A more correct approach is searching for a certain substring. If non-exact matching is needed (normalization, case insensitivity etc.), then the appropriate solution is to use the Unicode algorithms.

+1. Most "character"-based Unicode string operations are actually *substring* operations, because the notion of "character" is not universal to every writing system, and doesn't map 1-to-1 to Unicode code points anyway. I would argue that most instances of code that perform character-based operations on strings are incorrect, in the sense that they will fail to correctly process strings in certain languages. [...]
From experience with C++ I knew (1) had a bad track record, and
(2) "generically conservative, specialize for speed" was a
successful pattern.

What would you have chosen given that context?

Ideally, we would have the Unicode algorithms in the standard library from day 1, and advocated their use throughout the documentation.

+1. I came to D expecting this to be the case... and was a little let down when I discovered the actual state of affairs in std.uni at the time. Thankfully, things have improved since, and all those who worked on that have my gratitude. But it's still not quite there yet. [...]
So the problem to me is that we're stuck not fixing something that's
horribly broken just because it's broken in a way that people
presumably now expect.

Clearly I'm being subjective here but again I'd find it difficult to get convinced we have something horribly broken from the evidence I gathered inside and outside Facebook.

Have you or anyone you personally know tried to process text in D containing a writing system such as Sanskrit's?

Or more to the point, do you know of any experience that you can share about code that attempts to process these sorts of strings on a per character basis? My suspicion is that any code that operates on such strings, if they have any claim to correctness at all, must be substring-based, rather than character-based. T -- I think Debian's doing something wrong, `apt-get install pesticide', doesn't seem to remove the bugs on my system! -- Mike Dresser
Mar 08 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 20:50:49 UTC, Andrei Alexandrescu 
wrote:
 On 3/8/14, 12:38 PM, Vladimir Panteleev wrote:
 On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu 
 wrote:
 That sounds quite like C++ plus ICU. It doesn't strike me as 
 the
 golden standard for Unicode integration.

Why not? Because it sounds like D needs exactly that. Plus its amazing slicing and range capabilities, of course.

Pretty much everyone using ICU hates it.

I admit I never used it personally. I just thought that it implied "D implementations of relevant Unicode algorithms, adapted to D style (range interface)". Is there more to this than the limitations of C++ or the implementers' design choices?
 Have you or anyone you personally know tried to process text 
 in D
 containing a writing system such as Sanskrit's?

No. Point being?

Point being, we don't have solid data to conclude whether D's current approach is actually good enough for such cases as you claim. We do have one post in this thread: http://forum.dlang.org/post/jlgfkxlrhlzdpwkpsrot forum.dlang.org
 I think there are too large risks for that,

For what? We have not discussed a possible plan yet. Are you referring to Walter Bright's proposal?
 and it's quite unclear this is solving a problem. "Slightly 
 better Unicode support" is hardly a good justification.

What this will solve:

1. Eliminating dangerous constructs, such as s.countUntil and s.indexOf both returning integers, yet possibly having different values in circumstances that the developer may not foresee.

2. Very high complexity of implementations (the ElementEncodingType problem previously mentioned).

3. Hidden, difficult-to-detect performance problems. The reason why this thread was started. I've had to deal with them in several places myself.

4. Encourage D programmers to write Unicode-capable code that is correct in the full sense of the word.

I think the above list has enough weight to merit at least considering *some* breaking changes.
Mar 08 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 20:52:40 UTC, H. S. Teoh wrote:
 Or more to the point, do you know of any experience that you 
 can share
 about code that attempts to process these sorts of strings on a 
 per
 character basis? My suspicion is that any code that operates on 
 such
 strings, if they have any claim to correctness at all, must be
 substring-based, rather than character-based.

That's pretty much it. Unless you are working in the confines of certain languages (alphabets, scripts, etc.), many notions that are valid for English or European languages lose meaning in general. This includes the notion of "characters" - at full abstraction, you can only treat a string as a stream of code units (or code points, if you wish, but as has been discussed to death this is rarely useful).

An application which has to handle user text (said text being possibly in any language), has to pretty much treat string variables as "holy":

- no indexing
- no slicing
- no counting anything
- no toUpper/toLower (std.ascii or std.uni)

etc. All processing and transformations (line breaking, normalization, etc.) need to be done using the relevant Unicode algorithms.

I've posted something earlier which I'd like to take back:
 [a-z] makes sense in English, and [а-я] makes sense in Russian

[а-я] makes sense for Russian, but it doesn't for Ukrainian, in the same way how [a-z] is useless for Portuguese. There are probably only a few such ranges in Unicode which encompass exactly one alphabet, due to how much letters overlap across alphabets of similar languages.
Mar 08 2014
prev sibling next sibling parent "Sean Kelly" <sean invisibleduck.org> writes:
On Saturday, 8 March 2014 at 20:50:49 UTC, Andrei Alexandrescu 
wrote:
 Pretty much everyone using ICU hates it.

I think the biggest problem with ICU is documentation. It can take a long time to figure out how to do something if you've never done it before. Also, the C interface in ICU seems better than the C++ interface. And I'll grant that a few things are just far harder than they need to be. I wanted a transcoding iterator and ICU almost has this but not quite, so I've got to write my own. In fact, iterating across an arbitrary encoding in general is at least not intuitive and perhaps not possible. I kinda gave up on that. Um, and using UTF-16 as the standard encoding, which makes many transcoding operations require two conversions. Okay, I guess there are a lot of problems with ICU, but it handles nearly every requirement I have, which is in itself quite a lot.
Mar 08 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu 
wrote:
 My only claim is that recognizing and iterating strings by code 
 point is better than doing things by the octet.

Considering or disregarding the disadvantages of this choice?
 1. Eliminating dangerous constructs, such as s.countUntil and 
 s.indexOf
 both returning integers, yet possibly having different values 
 in
 circumstances that the developer may not foresee.

I disagree there's any danger. They deal in code points, end of story.

Perhaps I did not explain clearly enough.

    auto pos = s.countUntil(sub);
    writeln(s[pos..$]);

This will compile, and work for English text. For someone without complete knowledge of Phobos functions and how D handles Unicode, it is not obvious that this code is actually wrong. In certain situations, this can have devastating effects: consider, for example, if this code is extracting a slice from a string that elsewhere contains sensitive data (e.g. a configuration file containing, among other data, a password). An attacker could supply a Unicode string where the developer did not expect it, thus causing "pos" to have a smaller value than the corresponding indexOf result, thus revealing a slice of "s" which was not intended to be visible. Thus, a developer currently needs to tread very carefully wherever he is slicing strings, so as to not accidentally use indices obtained from functions that count code points.
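A minimal sketch of the mismatch described above, using the existing `std.algorithm.countUntil` and `std.string.indexOf`. The two wrapper function names and the sample input are made up for illustration.

```d
import std.algorithm : countUntil;
import std.string : indexOf;

/// Hypothetical helper: slices using indexOf, which returns a code-unit
/// offset and therefore matches what the slice operator expects.
string sliceByIndexOf(string s, string sub)
{
    immutable pos = s.indexOf(sub);
    return pos < 0 ? null : s[pos .. $];
}

/// Hypothetical helper: the same intent, but countUntil on a string counts
/// code points, so the slice starts too early whenever a multi-byte
/// sequence precedes the match.
string sliceByCountUntil(string s, string sub)
{
    immutable pos = s.countUntil(sub);
    return pos < 0 ? null : s[pos .. $];
}
```

With the input "\u00E9=secret;key=value" and the needle "key", the first function returns "key=value", while the second returns ";key=value": one extra byte leaks for every additional code unit in the text before the match.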
 2. Very high complexity of implementations (the 
 ElementEncodingType
 problem previously mentioned).

I disagree with "very high".

I'm quite sure that std.range and std.algorithm would lose a LOT of weight if they were fixed to not treat strings specially.
 Besides if you want to do Unicode you gotta crack some eggs.

No, I can't see how this justifies the choice. An explicit decoding range would have simplified things greatly while offering much of the same advantages. Whether the fact that it is there "by default" is an advantage of the current approach at all is debatable.
 3. Hidden, difficult-to-detect performance problems. The 
 reason why this
 thread was started. I've had to deal with them in several 
 places myself.

I disagree with "hidden, difficult to detect".

Why? You can only find out that an algorithm is slower than it needs to be via either profiling (at which point you're wondering why the #$% the thing is so slow), or feeding it invalid UTF. If you had made a different choice for Unicode in D, this problem would not exist at all.
 Also I'd add that I'd rather not have hidden, difficult to 
 detect correctness problems.

Except we already do. Arguments have already been presented in this thread that demonstrate correctness problems with the current approach. I don't think that these can stand up to the problems that the simpler by-char iteration approach would have.
 4. Encourage D programmers to write Unicode-capable code that 
 is correct
 in the full sense of the word.

I disagree we are presently discouraging them.

I did not say we are. The problem is that we aren't encouraging them either - we are instead setting an example of how to do it in a wrong (incomplete) way.
 I do agree a change would make certain things clearer.

I have an issue with all the counter-arguments presented in this thread being shoved behind the one word "clearer".
 But not enough to nearly make up for the breakage.

I would still like to go ahead with my suggestion to attempt some possible changes without releasing them. I'm going to try them with my own programs first to see how much it will break. I believe that you are too eagerly dismissing all proposals without even evaluating them.
 I think the above list has enough weight to merit at least 
 considering
 *some* breaking changes.

I think a better approach is to figure what to add.

This is obvious:
- more Unicode algorithms (normalization, segmentation, etc.)
- better documentation
Mar 08 2014
"Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 01:23:27 UTC, Andrei Alexandrescu 
wrote:
 On 3/8/14, 4:42 PM, Vladimir Panteleev wrote:
 On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu 
 wrote:
 My only claim is that recognizing and iterating strings by 
 code point
 is better than doing things by the octet.

Considering or disregarding the disadvantages of this choice?

Doing my best to weigh everything with the right measures.

I think it would be good to get a comparison of the two approaches, and list the arguments presented so far. I'll look into starting a Wiki page.
 Okay, though when you opened with "devastating" I was hoping 
 for nothing short of death and dismemberment.

In proportion. To the best of my knowledge, no one here writes software for military or industrial robots in D. Security issues rank as the worst kind of bugs in software on my scale.
 Anyhow the fix is obvious per this brief tutorial: 
 http://www.youtube.com/watch?v=hkDD03yeLnU

I don't get it.
 I'm quite sure that std.range and std.algorithm will lose a 
 LOT of
 weight if they were fixed to not treat strings specially.

I'm not so sure. Most of the string-specific optimizations simply detect certain string cases and forward them to array algorithms that need be written anyway. You would, indeed, save a fair amount of isSomeString conditionals and stuff (thus simplifying on scaffolding), but probably not a lot of code. That's not useless work - it'd go somewhere in any design.

One way to find out.
 Besides if you want to do Unicode you gotta crack some eggs.

No, I can't see how this justifies the choice. An explicit decoding range would have simplified things greatly while offering much of the same advantages.

My point there is that there's no useless or duplicated code that would be thrown away. A better design would indeed make for better modular separation - would be great if the string-related optimizations in std.algorithm went elsewhere. They wouldn't disappear.

Why? Isn't the whole issue that std.range presents strings as dchar ranges, and std.algorithm needs to detect dchar ranges and then treat them as char arrays? As opposed to std.algorithm just detecting arrays and treating them all as arrays (which it should be doing now anyway)?
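The split between "how a string is stored" and "how std.range presents it" can be checked at compile time. This is a small sketch using the range traits from std.range.primitives; it assumes a Phobos with auto-decoding still in place.

```d
import std.range.primitives : ElementEncodingType, ElementType, hasLength,
    isRandomAccessRange;
import std.traits : Unqual;

// As a range, a string yields decoded code points...
static assert(is(ElementType!string == dchar));
// ...although the array itself stores UTF-8 code units.
static assert(is(Unqual!(ElementEncodingType!string) == char));

// Because of that, the range view gives up length and random access,
// so generic algorithms must special-case strings to get them back.
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);

// Arrays of non-character types keep both.
static assert(hasLength!(int[]) && isRandomAccessRange!(int[]));

void main() {}
```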
 3. Hidden, difficult-to-detect performance problems. The 
 reason why this
 thread was started. I've had to deal with them in several 
 places myself.

I disagree with "hidden, difficult to detect".

Why? You can only find out that an algorithm is slower than it needs to be via either profiling (at which point you're wondering why the #$% the thing is so slow), or feeding it invalid UTF. If you had made a different choice for Unicode in D, this problem would not exist at all.

Disagree.

Could you please elaborate? This is the second uninformative reply to this argument.
 Except we already do. Arguments have already been presented in 
 this
 thread that demonstrate correctness problems with the current 
 approach.
 I don't think that these can stand up to the problems that the 
 simpler
 by-char iteration approach would have.

Sure there are, and you yourself illustrated a misuse of the APIs.

If UTF decoding was explicit, the problem would stand out. I don't think this is a valid argument.
 My point is: code point is better than code unit

This was debated... people should not be looking at individual code points, unless they really know what they're doing.
 Grapheme is better than code point but a lot slower.

We are going in circles. People should have very good reasons for looking at individual graphemes as well.
 It seems we're quite in a sweet spot here wrt 
 performance/correctness.

This does not seem like an objective summary of this thread's arguments so far. I guess I'll get working on that wiki page to organize the arguments. This discussion is starting to feel like a quicksand roundabout.
 With what has been put forward so far, that's not even close to 
 justifying a breaking change. If that great better design is 
 just get back to code unit iteration, the change will not 
 happen while I work on D. It is possible, however, that a much 
 better idea comes forward, and I'd be looking forward to such.

Actually, could you post some examples of real-world code that would be broken by a hypothetical sudden switch? I think I would be hard-pressed to find some in my own code, but I'd need to check for sure to find out.
 2. Add byChar that returns a random-access range iterating a 
 string by character. Add byWchar that does on-the-fly 
 transcoding to UTF16. Add byDchar that accepts any range of 
 char and does decoding. And such stuff. Then whenever one wants 
 to go through a string by code point can just use str.byChar.

This is confusing. Did you mean to say that byChar iterates a string by code unit (not character / code point)?
Mar 08 2014
"Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu 
wrote:
 And it's not like people aren't talking. In contrast, D has 
 been (and often rightly) criticized in the past for things like 
 floating point performance and garbage collection. No evidence 
 we are having an acute performance problem with UTF strings.

The size of this thread is one factor. But I see your point - I agree that is evidently not one of D's more glaring current problems. I hope I never alluded to that not being the case. That doesn't mean the problem doesn't exist at all, though.
 If UTF decoding was explicit, the problem would stand out. I 
 don't think
 this is a valid argument.

Yours? Indeed isn't, if what you want is iterate by code unit (= meaningless for all but ASCII strings) by default.

I don't understand this argument. Iterating by code unit is not meaningless if you don't want to extract meaning from each unit iteration. For example, if you're parsing JSON or XML, you only care about the syntax characters, which are all ASCII. And there is no confusion of "what exactly are we counting here".
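To illustrate the parsing case: in UTF-8, bytes below 0x80 never occur inside a multi-byte sequence, so ASCII syntax characters can be matched on raw code units. The sketch below is only an illustration (countStructural is a hypothetical helper, and it does not skip syntax characters that appear inside string literals, so it is not a JSON parser).

```d
import std.algorithm.searching : count;
import std.string : representation;

// Counts JSON structural characters by scanning code units, with no decoding.
size_t countStructural(string json)
{
    return json.representation.count!(b =>
        b == '{' || b == '}' || b == '[' || b == ']' || b == ':' || b == ',');
}

void main()
{
    // Hypothetical document with non-ASCII text in the key and values.
    assert(countStructural(`{"имя": ["Привет", "мир"]}`) == 6);
}
```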
 This was debated... people should not be looking at individual 
 code
 points, unless they really know what they're doing.

Should they be looking at code units instead?

No. They should only be looking at substrings. Unless they're e.g. parsing a computer language (regardless if it has international text data), as above.
 We are going in circles. People should have very good reasons 
 for
 looking at individual graphemes as well.

And it's good we have increasing support for graphemes. I don't think they should be the default.

I don't think so either. Did I somehow imply that?
 What is an objective summary? Those who want to inflict massive 
 breakage are not even done arguing we have a better design.

From my POV, I could say I see consensus, with just you defending a decision you made a while ago :) But I'd prefer a constructive discussion. Anyway, I don't want to "inflict massive breakage" either. I want the amount of breakage to be a justified cost of fixing a mistake and permanently improving the language's design going forward. Here's what I have so far, BTW: http://wiki.dlang.org/Element_type_of_string_ranges I'll have to review it in the morning. Or rather, afternoon, given that it's 6 AM here.
 I'm afraid burden of proof is on you.

Why? I'm not saying that if you can't produce an example of breakage then your arguments are invalid. Rather, concrete examples give us a concrete problem to work with. I'm not trying to put any "burden of proof" on anyone.
 That's great. Yes, we're exchanging jabs right now which is not 
 our best use of time. Also in the interest of time, please 
 understand you'd need to show the second coming if you want to 
 break backward compatibility. Additions are a much better path.

Even a teensy-weensy breakage? :)
 Far as I'm concerned every breakage of string processing is 
 unacceptable or at least very undesirable.

In all seriousness, at this point I'm worried that you will defend the status quo even if the breakage turns out minimal. Instead of dealing with absolutes, advantages and disadvantages should be weighed against one another (even with the breaking-backwards-compatibility penalty being very high).
 Unit. s.byChar.front is a (possibly ref, possibly qualified) 
 char.

So... does byChar for wstrings do the same thing as byWchar? And what if you want to iterate a wstring by char? Wouldn't it be better to have byChar/byWchar/byDchar be a range of char/wchar/dchar regardless of the string type, and have byCodeUnit which iterates by the code unit type?
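For comparison, here is a sketch of the naming scheme suggested above, written against the range adaptors that std.utf provides in later Phobos releases (byCodeUnit, byChar, byWchar, byDchar). It assumes a release that ships them; at the time of this thread they were only a proposal. The helper names are hypothetical.

```d
import std.range : walkLength;
import std.utf : byChar, byCodeUnit, byDchar, byWchar;

// Number of UTF-8 code units needed for a UTF-16 string.
size_t utf8Units(wstring w)
{
    return w.byChar.walkLength;
}

// Number of code points in a UTF-16 string.
size_t codePoints(wstring w)
{
    return w.byDchar.walkLength;
}

void main()
{
    wstring w = "h\u00E9llo"w; // five UTF-16 code units

    assert(w.byCodeUnit.walkLength == 5); // the string's own units, no decoding
    assert(w.byWchar.walkLength == 5);    // UTF-16 view of the same text
    assert(utf8Units(w) == 6);            // 'é' takes two units in UTF-8
    assert(codePoints(w) == 5);           // decoded code points
}
```

With this split, byChar/byWchar/byDchar name the target encoding regardless of the source string type, and byCodeUnit means "whatever the string stores".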
Mar 08 2014
"Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu 
wrote:
 What exactly is the consensus? From your wiki page I see "One 
 of the proposals in the thread is to switch the iteration type 
 of string ranges from dchar to the string's character type."

 I can tell you straight out: That will not happen for as long 
 as I'm working on D.

Why?
Mar 08 2014
"Joseph Cassman" <jc7919 outlook.com> writes:
On Sunday, 9 March 2014 at 01:23:27 UTC, Andrei Alexandrescu 
wrote:
 I was thinking of these too:

 1. Revisit std.encoding and perhaps confer legitimacy to the 
 character types defined there. The implementation in 
 std.encoding is wanting, but I think the idea is sound. 
 Essentially give more love to various encodings, including 
 Ascii and "bypass encoding, I'll deal with stuff myself".

 2. Add byChar that returns a random-access range iterating a 
 string by character. Add byWchar that does on-the-fly 
 transcoding to UTF16. Add byDchar that accepts any range of 
 char and does decoding. And such stuff. Then whenever one wants 
 to go through a string by code point can just use str.byChar.


 Andrei

I like these two points you make here. In particular, I like the recent addition of byGrapheme, and other ideas along this line which provide a custom range interface to a string. Such additions do not break code but add opt-in functionality for those who need it, while leaving the default case intact.

Overall, I think the current string design in D2 strikes a nice balance between performance and functionality. It does not reach Unicode perfection but gets rather close to good usability while still maintaining good C compatibility and performance in the default case.

As for Walter's original post regarding the use of decode by default in std.array.front, if I had it my way, I would prefer all performance hits to be explicit so that way I know what I am paying for by simply reading the code. Nonetheless, this change will break code in the wild relying on its current behavior. As a result, I feel that making such a fundamental change would be better to postpone until the next major version of D is considered.

D currently seems to carry much hope due to its potential, but is struggling to gain reputation as a reliable, quality, production-ready language. If such fundamental changes are made at this point it will do a lot of harm to D's reputation which it may never recover from. Rather than making such a change now, I feel that fixing all open issues in bugzilla and 'completing' D2 would do much good. Then, near the close of implementing D2, a new library implementation of text capabilities could be prototyped for D3 and flagged as beta-please-test-but-avoid-use-in-production-code. Such an approach would benefit from the insights gained from implementing this version in D2 and also get much needed input from actual usage.

Joseph
Mar 08 2014
"monarch_dodra" <monarchdodra gmail.com> writes:
On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu 
wrote:
 The current approach is a cut above treating strings as arrays 
 of bytes
 for some languages, and still utterly broken for others. If I'm
 operating on a right to left language like Hebrew, what would 
 I expect
 the result to be from something like countUntil?

The entire string processing paraphernalia is left to right. I figure RTL languages are under-supported, but s.retro.countUntil comes to mind. Andrei

I'm pretty sure that all string operations are actually "front to back". If I recall correctly, even languages that "read" right to left are stored in a front to back manner: e.g. string[0] would be the right-most character. It is only a question of "display", and changes nothing in the code. As for "countUntil", it would still work perfectly fine, as an RTL reader would expect the counting to start at the "beginning", i.e. the "right" side. I'm pretty confident RTL is 100% supported. The only issue is the "front"/"left" ambiguity, and the only one I know of is the oddly named "stripLeft" function, which actually does a "stripFront" anyways. So I wouldn't worry about RTL. But as mentioned, it is Indian languages, which have complex graphemes, or languages with accented characters, e.g. most European ones, that can have problems, such as canFind("cassé", 'e'). On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of grapheme, it was still a design decision made of win. I'd be tempted to not ask "how do we back out", but rather, "how can we take this further"? I'd love to ditch the whole "char"/"dchar" thing altogether, and work with graphemes. But that would be massive involvement.
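The canFind("cassé", 'e') case depends on how the accented letter happens to be encoded. A small sketch, using escape sequences so that the two spellings are unambiguous (the helper name is hypothetical):

```d
import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.uni : byGrapheme;

// Number of user-perceived characters (grapheme clusters) in a string.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string composed = "cass\u00E9";    // 'é' as a single code point
    string decomposed = "casse\u0301"; // 'e' followed by a combining acute accent

    // Same text on screen, different answers when searching by code point.
    assert(!composed.canFind('e'));
    assert(decomposed.canFind('e'));

    // The code point counts differ as well...
    assert(composed.walkLength == 5);
    assert(decomposed.walkLength == 6);

    // ...while grapheme iteration sees five characters in both.
    assert(graphemeCount(composed) == 5);
    assert(graphemeCount(decomposed) == 5);
}
```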
Mar 09 2014
"Peter Alexander" <peter.alexander.au gmail.com> writes:
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
 On topic, I think D's implicit default decode to dchar is 
 *infinity* times better than C++'s char-based strings. While 
 imperfect in terms of grapheme, it was still a design decision 
 made of win.

 I'd be tempted to not ask "how do we back out", but rather, 
 "how can we take this further"? I'd love to ditch the whole 
 "char"/"dchar" thing altogether, and work with graphemes. But 
 that would be massive involvement.

Why do you think it is better? Let's be clear here: if you are searching/iterating/comparing by code point then your program is either not correct, or no better than doing so by code unit. Graphemes don't really fix this either. I think this is the main confusion: the belief that iterating by code point has utility. If you care about normalization then neither by code unit, by code point, nor by grapheme are correct (except in certain language subsets). If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos. AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character? To those that think the status quo is better, can you give an example of a real-life use case that demonstrates this? I do think it's probably too late to change this, but I think there is value in at least getting everyone on the same page.
Mar 09 2014
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 09/03/14 04:26, Andrei Alexandrescu wrote:
 2. Add byChar that returns a random-access range iterating a string by
 character. Add byWchar that does on-the-fly transcoding to UTF16. Add
 byDchar that accepts any range of char and does decoding. And such
 stuff. Then whenever one wants to go through a string by code point
 can just use str.byChar.

This is confusing. Did you mean to say that byChar iterates a string by code unit (not character / code point)?

Unit. s.byChar.front is a (possibly ref, possibly qualified) char.

So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about? In which case it seems to me a better solution -- "safe" strings by default, unsafe speed-focused solution available if you want it. ("Safe" here in the more general sense of "Doesn't generate unexpected errors" rather than memory safety.)
Mar 09 2014
"monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 9 March 2014 at 11:34:31 UTC, Peter Alexander wrote:
 On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
 On topic, I think D's implicit default decode to dchar is 
 *infinity* times better than C++'s char-based strings. While 
 imperfect in terms of grapheme, it was still a design decision 
 made of win.

 I'd be tempted to not ask "how do we back out", but rather, 
 "how can we take this further"? I'd love to ditch the whole 
 "char"/"dchar" thing altogether, and work with graphemes. But 
 that would be massive involvement.

Why do you think it is better? Let's be clear here: if you are searching/iterating/comparing by code point then your program is either not correct, or no better than doing so by code unit. Graphemes don't really fix this either. I think this is the main confusion: the belief that iterating by code point has utility. If you care about normalization then neither by code unit, by code point, nor by grapheme are correct (except in certain language subsets). If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos.

IMO, the "normalization" argument is overrated. I've yet to encounter a real-world case of normalization: only hand written counter-examples. Not saying it doesn't exist, just that:
1. It occurs only in special cases that the program should be aware of beforehand.
2. Arguably, it can be taken care of eagerly, or in a special pass.

As for "the belief that iterating by code point has utility." I have to strongly disagree. Unicode is composed of codepoints, and that is what we handle. The fact that it can be encoded and stored as UTF is an implementation detail. As for the grapheme thing, I'm not actually so sure about it myself, so don't take it too seriously.
 AFAIK, there is only one exception, stuff like s.all!(c => c == 
 'é'), but as Vladimir correctly points out: (a) by code point, 
 this is still broken in the face of normalization, and (b) are 
 there any real applications that search a string for a specific 
 non-ASCII character?

But *what* other kinds of algorithms are there? AFAIK, the *only* type of algorithm that doesn't need decoding is searching, and you know what? std.algorithm.find does it perfectly well. This trickles into most other algorithms too: split, splitter or findAmong don't decode if they don't have to. AFAIK, the most common algorithm "case insensitive search" *must* decode. There may still be cases where it is still not working as intended in the face of normalization, but it is still leaps and bounds better than what we get iterating with codeunits. To turn it the other way around, *what* are you guys doing, that doesn't require decoding, and where performance is such a killer?
 To those that think the status quo is better, can you give an 
 example of a real-life use case that demonstrates this?

I do not know of a single bug report in regards to buggy phobos code that used front/popFront. Not_a_single_one (AFAIK). On the other hand, there are plenty of cases of bugs for attempting to not decode strings, or incorrectly decoding strings. They are being corrected on a continuous basis. Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong. Even Walter pointed out that such code should work. *Maybe* it is still wrong in regards to graphemes and normalization, but at *least*, the result is not a corrupted UTF-8 stream. Walter keeps grinding on about "myCharArray.put('é')" not working, but I'm not sure he realizes how dangerous it would actually be to allow such a thing to work. In particular, in all these cases, a simple call to "representation" will deactivate the feature, giving you the tools you want.
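A sketch of the opt-out mentioned above: std.string.representation reinterprets a character array as its raw code units, so ordinary array algorithms apply. The helper name is hypothetical, and the result is only meaningful as sorted "characters" when the input is known to be ASCII.

```d
import std.algorithm.sorting : sort;
import std.string : representation;

// Under auto-decoding, char[] is not a random-access range, so sort rejects it.
static assert(!__traits(compiles, sort("DCBA".dup)));

// Sorts the code units of `s` in place by viewing them as ubyte[].
char[] sortAsciiInPlace(char[] s)
{
    s.representation.sort();
    return s;
}

void main()
{
    assert(sortAsciiInPlace("DCBA".dup) == "ABCD");
}
```

Applied to non-ASCII text, the same call would reorder the bytes of multi-byte sequences and produce invalid UTF-8, which is the danger the post refers to.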
 I do think it's probably too late to change this, but I think 
 there is value in at least getting everyone on the same page.

Me too. I do see the value in being able to do decode-less iteration. I just think the *default* behavior has the advantage of being correct *most* of the time, and definitely much more correct than without decoding. I think opt-out of decoding is just a much much much saner approach to string handling.
Mar 09 2014
"Peter Alexander" <peter.alexander.au gmail.com> writes:
On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote:
 IMO, the "normalization" argument is overrated. I've yet to 
 encounter a real-world case of normalization: only hand written 
 counter-examples. Not saying it doesn't exist, just that:
 1. It occurs only in special cases that the program should be 
 aware of before hand.
 2. Arguably, be taken care of eagerly, or in a special pass.

 As for "the belief that iterating by code point has utility." I 
 have to strongly disagree. Unicode is composed of codepoints, 
 and that is what we handle. The fact that it can be encoded 
 and stored as UTF is implementation detail.

We don't "handle" code points (when have you ever wanted to handle a combining character separate to the character it combines with?) You are just thinking of a subset of languages and locales. Normalization is an issue any time you have a user enter text into your program and you then want to search for that text. I hope we can agree this isn't a rare occurrence.
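The user-input case can be sketched with std.uni.normalize: the stored text and the query are brought to the same normalization form before searching. containsNormalized is a hypothetical helper, and NFC is just one possible choice of form.

```d
import std.algorithm.searching : canFind;
import std.uni : NFC, normalize;

// Substring test that normalizes both sides to NFC first.
bool containsNormalized(string text, string query)
{
    return normalize!NFC(text).canFind(normalize!NFC(query));
}

void main()
{
    string stored = "caf\u00E9 au lait"; // precomposed 'é'
    string typed = "cafe\u0301";         // 'e' plus combining accent, e.g. from an input method

    // Neither code unit nor code point comparison sees a match...
    assert(!stored.canFind(typed));
    // ...but the normalized comparison does.
    assert(containsNormalized(stored, typed));
}
```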
 AFAIK, there is only one exception, stuff like s.all!(c => c 
 == 'é'), but as Vladimir correctly points out: (a) by code 
 point, this is still broken in the face of normalization, and 
 (b) are there any real applications that search a string for a 
 specific non-ASCII character?

But *what* other kinds of algorithms are there? AFAIK, the *only* type of algorithm that doesn't need decoding is searching, and you know what? std.algorithm.find does it perfectly well. This trickles into most other algorithms too: split, splitter or findAmong don't decode if they don't have to.

Searching, equality testing, copying, sorting, hashing, splitting, joining... I can't think of a single use-case for searching for a non-ASCII code point. You can search for strings, but searching by code unit is just as good (and fast by default).
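Substring search is a case where working on code units loses nothing: UTF-8 is self-synchronizing, so a valid needle can only match at a code point boundary. A minimal sketch (unitSearch is a hypothetical helper):

```d
import std.algorithm.searching : countUntil;
import std.string : indexOf, representation;

// Finds `needle` by comparing raw code units; returns a byte offset or -1.
ptrdiff_t unitSearch(string haystack, string needle)
{
    return haystack.representation.countUntil(needle.representation);
}

void main()
{
    string text = "Привет, мир";

    // Same offset as the string-aware search, usable directly for slicing.
    auto pos = unitSearch(text, "мир");
    assert(pos == text.indexOf("мир"));
    assert(text[pos .. $] == "мир");
}
```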
 AFAIK, the most common algorithm "case insensitive search" 
 *must* decode.

But it must also normalize and take locales into account, so by code point is insufficient (unless you are willing to ignore languages like Turkish). See Turkish I. http://en.wikipedia.org/wiki/Turkish_I Sure, if you just want to ignore normalization and several languages then by code point is just fine... but that's the point: by code point is incorrect in general.
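The Turkish I problem shows up even for a single code point. The sketch below uses the locale-independent mappings in std.uni; the helper name is hypothetical.

```d
import std.uni : toLower;

// True if the default (locale-independent) lowercase of `upper` is `lower`.
bool simpleLowerMatches(dchar upper, dchar lower)
{
    return toLower(upper) == lower;
}

void main()
{
    // The default mapping pairs 'I' with 'i'.
    assert(simpleLowerMatches('I', 'i'));

    // Turkish pairs 'I' with dotless 'ı' (U+0131), so a code point by code point
    // case-insensitive comparison gives the wrong answer for Turkish text.
    assert(!simpleLowerMatches('I', '\u0131'));
}
```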
 There may still be cases where it is still not working as 
 intended in the face of normalization, but it is still leaps 
 and bounds better than what we get iterating with codeunits.

 To turn it the other way around, *what* are you guys doing, 
 that doesn't require decoding, and where performance is such a 
 killer?

Searching, equality testing, copying, sorting, hashing, splitting, joining... The performance thing can be fixed in the library, but my concern is (a) it takes a significant amount of code to do so (b) complicates implementations. There are many, many algorithms in Phobos that are special cased for strings, and I don't think it needs to be that way.
 To those that think the status quo is better, can you give an 
 example of a real-life use case that demonstrates this?

I do not know of a single bug report in regards to buggy phobos code that used front/popFront. Not_a_single_one (AFAIK). On the other hand, there are plenty of cases of bugs for attempting to not decode strings, or incorrectly decoding strings. They are being corrected on a continuous basis.

Can you provide a link to a bug? Also, you haven't answered the question :-) Can you give a real-life example of a case where code point decoding was necessary where code units wouldn't have sufficed? You have mentioned case-insensitive searching, but I think I've adequately demonstrated that this doesn't work in general by code point: you need to normalize and take locales into account.
Mar 09 2014
"Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 05:10:26 UTC, Andrei Alexandrescu 
wrote:
 On 3/8/14, 8:24 PM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu 
 wrote:
 What exactly is the consensus? From your wiki page I see "One 
 of the
 proposals in the thread is to switch the iteration type of 
 string
 ranges from dchar to the string's character type."

 I can tell you straight out: That will not happen for as long 
 as I'm
 working on D.

Why?

From the cycle "going in circles": because I think the breakage is way too large compared to the alleged improvement.

All right. I was wondering if there was something more fundamental behind such an ultimatum.
 In fact I believe that that design is inferior to the current 
 one regardless.

I was hoping we could come to an agreement at least on this point.

---

BTW, a thought struck me while thinking about the problem yesterday. char and dchar should not be implicitly convertible between one another, or comparable to the other.

    void main()
    {
        string s = "Привет";
        foreach (c; s)
            assert(c != 'Ñ');
    }

Instead, std.conv.to should allow converting between character types, iff they represent one whole code point and fit into the destination type, and throw an exception otherwise (similar to how it deals with integer overflow). Char literals should be special-cased by the compiler to implicitly convert to any sufficiently large type. This would break more[1] code, but it would avoid the silent failures of the earlier proposal.

[1] I went through my own larger programs. I actually couldn't find any uses of dchar which would be impacted by such a hypothetical change.
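To spell out why that assert fires: 'Ñ' is U+00D1, and 0xD1 is also the lead byte of the UTF-8 encodings of "р" and "т". A runnable sketch (falseMatches is a hypothetical helper):

```d
// Counts code units of `s` that compare equal to the code point `needle`.
size_t falseMatches(string s, dchar needle)
{
    size_t hits;
    foreach (char c; s)  // iterate UTF-8 code units
        if (c == needle) // char is implicitly widened to dchar
            ++hits;
    return hits;
}

void main()
{
    // "Привет" contains no 'Ñ', yet two of its code units equal 0xD1.
    assert(falseMatches("Привет", '\u00D1') == 2);
}
```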
Mar 09 2014
"Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
 On topic, I think D's implicit default decode to dchar is 
 *infinity* times better than C++'s char-based strings. While 
 imperfect in terms of grapheme, it was still a design decision 
 made of win.

Care to argue the point?
 I'd be tempted to not ask "how do we back out", but rather, 
 "how can we take this further"? I'd love to ditch the whole 
 "char"/"dchar" thing altogether, and work with graphemes. But 
 that would be massive involvement.

As has been discussed, this does not make sense. Graphemes are also a concept which applies only to certain writing systems; all it would do is exchange one set of tradeoffs for another, without solving anything. Text isn't that simple.
Mar 09 2014
"Sean Kelly" <sean invisibleduck.org> writes:
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
 On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu 
 wrote:
 The current approach is a cut above treating strings as 
 arrays of bytes
 for some languages, and still utterly broken for others. If 
 I'm
 operating on a right to left language like Hebrew, what would 
 I expect
 the result to be from something like countUntil?

The entire string processing paraphernalia is left to right. I figure RTL languages are under-supported, but s.retro.countUntil comes to mind. Andrei

I'm pretty sure that all string operations are actually "front to back". If I recall correctly, even languages that "read" right to left are stored in a front to back manner: e.g. string[0] would be the right-most character. It is only a question of "display", and changes nothing in the code. As for "countUntil", it would still work perfectly fine, as an RTL reader would expect the counting to start at the "beginning", i.e. the "right" side. I'm pretty confident RTL is 100% supported. The only issue is the "front"/"left" ambiguity, and the only one I know of is the oddly named "stripLeft" function, which actually does a "stripFront" anyways. So I wouldn't worry about RTL.

Yeah, I think RTL strings are preceded by a code point that indicates RTL display. It was just something I mentioned because some operations might be confusing to the programmer.
 But as mentioned, it is Indian languages, which have 
 complex graphemes, or languages with accented characters, 
 e.g. most European ones, that can have problems, such as 
 canFind("cassé", 'e').

True. I still question why anyone would want to do character-based operations on Unicode strings. I guess substring searches could even end up with the same problem in some cases if not implemented specifically for Unicode for the same reason, but those should be far less common.
Mar 09 2014
"Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote:
 As for "the belief that iterating by code point has utility." I 
 have to strongly disagree. Unicode is composed of codepoints, 
 and that is what we handle. The fact that it can be encoded 
 and stored as UTF is implementation detail.

But you don't deal with Unicode. You deal with *text*. Unless you are implementing Unicode algorithms, code points solve nothing in the general case.
 Seriously, Bearophile suggested "ABCD".sort(), and it took 
 about 6 pages (!) for someone to point out this would be wrong.

Sorting a string has quite limited use in the general case, so I think this is another artificial example.
 Even Walter pointed out that such code should work. *Maybe* it 
 is still wrong in regards to graphemes and normalization, but 
 at *least*, the result is not a corrupted UTF-8 stream.

I think this is no worse than putting all combining marks all clustered at the end of the string, thus attached to the last non-combining letter.
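
A rough sketch of the two outcomes being compared here, assuming standard Phobos. The function names `sortCodePoints` and `sortedUnitsStayValid` are made up for the example. Sorting decoded code points always yields valid code points, while sorting raw UTF-8 code units can leave a continuation byte in front of its lead byte:

```d
import std.algorithm.sorting : sort;
import std.conv : to;
import std.string : representation;
import std.utf : UTFException, validate;

// Sort the decoded code points; the result is still a valid sequence of code points.
dstring sortCodePoints(string s)
{
    auto points = s.to!dstring.dup; // mutable dchar[] copy
    sort(points);
    return points.idup;
}

// Sort the raw UTF-8 code units, then report whether the bytes are still valid UTF-8.
bool sortedUnitsStayValid(string s)
{
    auto units = s.dup.representation; // mutable ubyte[] copy
    sort(units);
    try
    {
        validate(cast(const(char)[]) units);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    assert(sortCodePoints("d\u00E9ba") == "abd\u00E9"d);
    assert(sortedUnitsStayValid("dcba"));       // pure ASCII survives
    assert(!sortedUnitsStayValid("d\u00E9ba")); // 0xA9 ends up before 0xC3
}
```

Neither variant addresses combining marks, which is the objection raised above.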
Mar 09 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Vladimir Panteleev:

 Seriously, Bearophile suggested "ABCD".sort(), and it took 
 about 6 pages (!) for someone to point out this would be wrong.

Sorting a string has quite limited use in the general case,

It seems I am sorting arrays of mutable ASCII chars often enough :-) Some time ago I even asked for a helper function: https://d.puremagic.com/issues/show_bug.cgi?id=10162 Bye, bearophile
Mar 09 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 16:02:55 UTC, bearophile wrote:
 Vladimir Panteleev:

 Seriously, Bearophile suggested "ABCD".sort(), and it took 
 about 6 pages (!) for someone to point out this would be 
 wrong.

Sorting a string has quite limited use in the general case,

It seems I am sorting arrays of mutable ASCII chars often enough :-)

What do you use this for? I can think of sort being useful e.g. to see which characters appear in a string (and with which frequency), but as the concept does not apply to all languages, one would need to draw a line somewhere for which languages they want to support. I think this should be done explicitly in user code.
Mar 09 2014
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Vladimir Panteleev:

 What do you use this for?

For lots of different reasons (counting, testing, histograms, to unique-ify, to allow binary searches, etc.), though you can find alternative solutions for every one of those use cases.
 I can think of sort being useful e.g. to see which characters 
 appear in a string (and with which frequency), but as the 
 concept does not apply to all languages, one would need to draw 
 a line somewhere for which languages they want to support. I 
 think this should be done explicitly in user code.

So far I have needed to sort 7-bit ASCII chars. Bye, bearophile
Mar 09 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu 
wrote:
 On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
 So IIUC iterating over s.byChar would not encounter the 
 decoding-related
 speed hits that Walter is concerned about?

That is correct.

Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar.
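
To make the difference concrete, here is a small sketch assuming a standard Phobos build with auto-decoding. It uses `std.utf.byCodeUnit` as the wrapper (the thread says byChar; treating the two as interchangeable for a `string` is an assumption of this example), and the helper names are hypothetical:

```d
import std.range.primitives : ElementType, walkLength;
import std.utf : byCodeUnit;

// With auto-decoding, generic range code sees a string as a range of dchar.
static assert(is(ElementType!string == dchar));

// Walks the string through front/popFront, decoding every code point.
size_t countDecoded(string s)
{
    return s.walkLength;
}

// Walks the wrapped string code unit by code unit, with no decoding.
size_t countCodeUnits(string s)
{
    return s.byCodeUnit.walkLength;
}

void main()
{
    string s = "h\u00E9llo"; // 'é' takes two UTF-8 code units
    assert(countDecoded(s) == 5);
    assert(countCodeUnits(s) == 6);
    assert(s.byCodeUnit.front == 'h');
}
```

An algorithm that has a fast path for plain arrays would indeed need to recognise the wrapper type to keep that fast path, which is the concern raised above.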
Mar 09 2014
prev sibling next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Sunday, 9 March 2014 at 17:15:59 UTC, Andrei Alexandrescu 
wrote:
 On 3/9/14, 4:34 AM, Peter Alexander wrote:
 I think this is the main confusion: the belief that iterating 
 by code
 point has utility.

 If you care about normalization then neither by code unit, by 
 code
 point, nor by grapheme are correct (except in certain language 
 subsets).

I suspect that code unit iteration is the worst as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code point iteration that works with a larger spectrum of languages. One question would be how large that spectrum is. If it's larger than English, then that would be nice because we would've made progress. I don't know about normalization beyond discussions in this group, but as far as I understand from http://www.unicode.org/faq/normalization.html, normalization would be a one-step process, after which code point iteration would cover still more human languages. No? I'm pretty sure it's more complicated than that, so please illuminate me :o).

It depends what you mean by "cover" :-) If we assume strings are normalized then substring search, equality testing, sorting all work the same with either code units or code points.
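
A short sketch of the "assume strings are normalized" case, using `std.uni.normalize`. The helper name `sameAfterNormalizing` is hypothetical. Plain `==` on strings compares code units, so it only agrees with the reader's idea of equality once both sides are in the same normalization form:

```d
import std.uni : NFC, normalize;

// Compare two strings after bringing both into NFC.
bool sameAfterNormalizing(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed = "caf\u00E9";    // precomposed 'é'
    string decomposed = "cafe\u0301"; // 'e' + combining acute

    assert(composed != decomposed);   // code units differ
    assert(sameAfterNormalizing(composed, decomposed));
}
```

After normalization, the comparison needs no decoding at all; it is a plain comparison of code units.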
 If you don't care about normalization then by code unit is 
 just as good
 as by code point, but you don't need to specialise everywhere 
 in Phobos.

 AFAIK, there is only one exception, stuff like s.all!(c => c 
 == 'é'),
 but as Vladimir correctly points out: (a) by code point, this 
 is still
 broken in the face of normalization, and (b) are there any real
 applications that search a string for a specific non-ASCII 
 character?

What happened to counting characters and such?

I can't think of any case where you would want to count characters.
* If you want an index to slice from, then you need code units.
* If you want a buffer size, then you need code units.
* If you are doing something like word wrapping then you need to count glyphs, which is not the same as counting code points (and that only works with mono-spaced fonts anyway -- with variable width fonts you need to add up the widths of those glyphs)
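
The three different "counts" in the list above can be sketched as follows, assuming standard Phobos; the three helper names are hypothetical. Graphemes are used here as a stand-in for glyphs, which is itself an approximation:

```d
import std.range.primitives : walkLength;
import std.uni : byGrapheme;

size_t codeUnits(string s)  { return s.length; }                // slicing, buffer sizes
size_t codePoints(string s) { return s.walkLength; }            // what auto-decoding counts
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // closer to what is displayed

void main()
{
    string s = "noe\u0308l"; // "noël" with a combining diaeresis

    assert(codeUnits(s) == 6);
    assert(codePoints(s) == 5);
    assert(graphemes(s) == 4);
}
```

Only the first number is usable as a slice index or buffer size, and only the last one is close to what a reader would call the number of characters.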
 To those that think the status quo is better, can you give an 
 example of
 a real-life use case that demonstrates this?

split(ter) comes to mind.

splitter is just an application of substring search, no? Substring search works the same with both code units and code points (e.g. strstr in C works with UTF-encoded strings without any need to decode). All you need to do is ensure that mismatched encodings in the delimiter are re-encoded (you want to do this for performance anyway):

    auto splitter(string str, dchar delim)
    {
        char[4] enc;
        return splitter(str, enc[0..encode(enc, delim)]);
    }
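
A runnable variation on the snippet above, offered as a sketch rather than as how Phobos does it. The name `splitOnCodePoint` is hypothetical, and the encoded delimiter is copied to the heap (an addition of this example) because the lazy splitter result would otherwise refer to a stack buffer:

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : splitter;
import std.utf : encode;

auto splitOnCodePoint(string str, dchar delim)
{
    char[4] buf;
    immutable len = encode(buf, delim); // UTF-8 encode the delimiter once
    // Copy to the heap so the lazy range does not keep a slice of stack memory.
    string needle = buf[0 .. len].idup;
    return str.splitter(needle);        // plain substring search from here on
}

void main()
{
    auto parts = splitOnCodePoint("a\u20ACb\u20ACc", '\u20AC'); // split on '€'
    assert(equal(parts, ["a", "b", "c"]));
}
```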
Mar 09 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu 
wrote:
 wc

What should wc produce on a Sanskrit text? The problem is that such questions quickly become philosophical.
 (Generally: I've always been very very very doubtful about 
 arguments that start with "I can't think of..." because I've 
 historically tried them so many times, and with terrible 
 results.)

I agree, which is why I think that although such arguments are not unwelcome, it's much better to find out by experiment. Break something in Phobos and see how much of your code is affected :)
Mar 09 2014
prev sibling next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu 
wrote:
 On 3/9/14, 10:34 AM, Peter Alexander wrote:
 If we assume strings are normalized then substring search, 
 equality
 testing, sorting all work the same with either code units or 
 code points.

But others such as edit distance or equal(some_string, some_wstring) will not.

equal(string, wstring) should either not compile, or would be overloaded to do the right thing. In an ideal world, char, wchar, and dchar should not be comparable. Edit distance on code points is of questionable utility. Like Vladimir says, its meaning is pretty philosophical, even in ASCII (is "\r\n" really two "edits"? What is an "edit"?)
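
For reference, a sketch of what the two interpretations of equal(string, wstring) give today, assuming standard Phobos with auto-decoding; both helper names are hypothetical:

```d
import std.algorithm.comparison : equal;
import std.utf : byCodeUnit;

// Current default: both sides are decoded and compared as code points.
bool equalByCodePoint(string a, wstring b)
{
    return equal(a, b);
}

// Comparing raw code units of different encodings: UTF-8 bytes against UTF-16 units.
bool equalByCodeUnit(string a, wstring b)
{
    return equal(a.byCodeUnit, b.byCodeUnit);
}

void main()
{
    assert(equalByCodePoint("h\u00E9", "h\u00E9"w));
    assert(!equalByCodeUnit("h\u00E9", "h\u00E9"w)); // 3 code units vs 2
    assert(equalByCodeUnit("abc", "abc"w));          // ASCII happens to match
}
```

The code-unit comparison compiles and silently gives the wrong answer for non-ASCII text, which is the case for rejecting it at compile time or overloading it.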
 I can't think of any case where you would want to count 
 characters.

wc

% echo € | wc -c
4

:-)
 (Generally: I've always been very very very doubtful about 
 arguments that start with "I can't think of..." because I've 
 historically tried them so many times, and with terrible 
 results.)

Fair point... but it's not as if we would be removing the ability (you could always do s.byCodePoint.count); we are talking about defaults. The argument that we shouldn't iterate by code unit by default because people might want to count code points is without substance. Also, with the proposal, string.count(dchar) would encode the dchar to a string first for performance, so it would still work. Anyway, I think this discussion isn't really going anywhere so I think I'll agree to disagree and retire.
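
A sketch of the "encode the dchar first" idea for counting, under the assumption that substring counting is all that is needed. `countCodePoint` is a hypothetical name, not an existing Phobos overload:

```d
import std.algorithm.searching : count;
import std.utf : encode;

// Count occurrences of a code point by searching for its UTF-8 encoding.
size_t countCodePoint(string s, dchar c)
{
    char[4] buf;
    immutable len = encode(buf, c);
    string needle = buf[0 .. len].idup; // encoded needle, e.g. 3 bytes for '€'
    return s.count(needle);             // substring count, no per-element predicate
}

void main()
{
    assert(countCodePoint("1\u20AC + 2\u20AC", '\u20AC') == 2);
    assert(countCodePoint("abc", 'x') == 0);
}
```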
Mar 09 2014
prev sibling next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 9 March 2014 at 14:57:32 UTC, Peter Alexander wrote:
 You have mentioned case-insensitive searching, but I think I've 
 adequately demonstrated that this doesn't work in general by 
 code point: you need to normalize and take locales into account.

I don't understand what your argument is. Is it "by code point is not 100% correct, so let's just drop it and go for raw code units instead"? We *are* arguing about whether or not "front/popFront" should decode by dchar, right...? You mention the algorithms "Searching, equality testing, copying, sorting, hashing, splitting, joining..." I said "by codepoint is not correct", but I still think it's a hell of a lot more accurate than by codeunit. Unless you want to ignore any and all algorithms that take a predicate? You say "unless you are willing to ignore languages like Turkish", but... if you don't decode front, then aren't you just ignoring *all* languages that basically aren't English...? As I said, maybe by codepoint is not correct, but if it isn't, I think we should be moving further *into* the correct behavior by default, not away from it.
Mar 09 2014
prev sibling parent "w0rp" <devw0rp gmail.com> writes:
On Sunday, 9 March 2014 at 19:40:32 UTC, Andrei Alexandrescu 
wrote:
 6. Take into account ASCII and maybe other alphabets? Should 
 be as
 trivial as .assumeASCII and then on you march with all of 
 std.algo/etc.

Walter is against that. His main argument is that UTF already covers ASCII with only a marginal cost (that can be avoided) and that we should go farther into the future instead of catering to an obsolete representation. Andrei

When I've wanted to write code especially for ASCII, I think it hasn't been for use in generic algorithms anyway. Mostly it's stuff for manipulating segments of memory in a particular way, as seen here in my library which does some work to generate D code. https://github.com/w0rp/dsmoke/blob/master/source/smoke/string_util.d#L45 Anything else would be something like running through an algorithm and then copying data into a new array or similar, and that would miss the point. When it comes to generic algorithms and ASCII I think UTF-x is sufficient.
Mar 09 2014
prev sibling next sibling parent "ponce" <contact gam3sfrommars.fr> writes:
 - In lots of places, I've discovered that Phobos did UTF 
 decoding (thus murdering performance) when it didn't need to. 
 Such cases included format (now fixed), appender (now fixed), 
 startsWith (now fixed - recently), skipOver (still unfixed). 
 These have caused latent bugs in my programs that happened to 
 be fed non-UTF data. There's no reason for why D should fail on 
 non-UTF data if it has no reason to decode it in the first 
 place! These failures have only served to identify places in 
 Phobos where redundant decoding was occurring.

With all due respect, D's string type is exclusively for UTF-8 strings. If it is not valid UTF-8, it should never have been a D string in the first place. In the other cases, ubyte[] is there.
Mar 09 2014
prev sibling next sibling parent "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
 2) It is regression back to C++ days of 
 no-one-cares-about-Unicode pain. Thinking about strings as 
 character arrays is so natural and convenient that if 
 language/Phobos won't punish you for that, it will be extremely 
 widespread.

Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else.
Mar 09 2014
prev sibling next sibling parent "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
 On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev
 Can we look at some example situations that this will break?

Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string?

This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead.
Mar 09 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 12:24:11 UTC, ponce wrote:
 - In lots of places, I've discovered that Phobos did UTF 
 decoding (thus murdering performance) when it didn't need to. 
 Such cases included format (now fixed), appender (now fixed), 
 startsWith (now fixed - recently), skipOver (still unfixed). 
 These have caused latent bugs in my programs that happened to 
 be fed non-UTF data. There's no reason for why D should fail 
 on non-UTF data if it has no reason to decode it in the first 
 place! These failures have only served to identify places in 
 Phobos where redundant decoding was occurring.

With all due respect, D's string type is exclusively for UTF-8 strings. If it is not valid UTF-8, it should never have been a D string in the first place. In the other cases, ubyte[] is there.

This is an arbitrary self-imposed limitation caused by the choice in how strings are handled in Phobos.
Mar 09 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 13:51:12 UTC, Marc Schütz wrote:
 On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
 On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev
 Can we look at some example situations that this will break?

Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string?

This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead.

Why? There's no reason why dchar[] would stop being a range. It will be treated as now, like any other array.
Mar 09 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 13:47:26 UTC, Marc Schütz wrote:
 On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
 2) It is regression back to C++ days of 
 no-one-cares-about-Unicode pain. Thinking about strings as 
 character arrays is so natural and convenient that if 
 language/Phobos won't punish you for that, it will be 
 extremely widespread.

Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else.

Andrei has made it clear that the code breakage this would involve would be unacceptable.
Mar 09 2014
prev sibling next sibling parent "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Sunday, 9 March 2014 at 15:23:57 UTC, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 13:51:12 UTC, Marc Schütz wrote:
 On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
 On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev
 Can we look at some example situations that this will break?

Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string?

This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead.

Why? There's no reason why dchar[] would stop being a range. It will be treated as now, like any other array.

This was under the assumption that Nick's proposal (and my "amendment" to extend it to dchar because of graphemes e.a.) would be implemented. But I made the mistake of replying to posts as I read them, just to notice a few posts later that someone else already posted something to the same effect, or that made my point irrelevant. Sorry for the confusion.
Mar 09 2014
prev sibling next sibling parent "ponce" <contact gam3sfrommars.fr> writes:
On Sunday, 9 March 2014 at 21:14:30 UTC, Nick Sabalausky wrote:
 With all due respect, D string type is exclusively for UTF-8 
 strings.
 If it is not valid UTF-8, it should never have been a D string 
 in the
 first place. In the other cases, ubyte[] is there.

This is an arbitrary self-imposed limitation caused by the choice in how strings are handled in Phobos.

Yea, I've had problems before - completely unnecessary problems that were *not* helpful or indicative of latent bugs - which were a direct result of Phobos being overly pedantic and eager about UTF validation. And yet the implicit UTF validation has never actually *helped* me in any way.

 self-imposed limitation


I find this article very telling about why strings should be converted to UTF-8 as often as possible. http://www.utf8everywhere.org/ I agree 100% with its content; it's impossibly hard to have sane handling of encodings on Windows (even more so in a team) without following the drastic rules the article lays out. This happens to be what Phobos gently mandates; UTF validation is certainly the lesser evil compared to the mess that everything becomes without it. How is mandating valid UTF-8 being overly pedantic? This is the sanest behaviour. Just use sanitizeUTF8 (http://vibed.org/api/vibe.utils.string/sanitizeUTF8) or equivalent.
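
As a sketch of the "validate at the boundary" approach using only Phobos (not the vibe.d function linked above): incoming bytes stay `ubyte[]` until they pass `std.utf.validate`. The helper name `isValidUtf8` is hypothetical:

```d
import std.utf : UTFException, validate;

// Check raw input before it is allowed to become a D string.
bool isValidUtf8(const(ubyte)[] raw)
{
    try
    {
        validate(cast(const(char)[]) raw);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    immutable ubyte[] good = [0x63, 0x61, 0x66, 0xC3, 0xA9]; // "café" in UTF-8
    immutable ubyte[] bad = [0x63, 0xFF];                    // 0xFF never occurs in UTF-8

    assert(isValidUtf8(good));
    assert(!isValidUtf8(bad));
}
```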
Mar 10 2014
prev sibling next sibling parent "Andrea Fontana" <nospam example.com> writes:
I'm not sure I understood the point of this (long) thread.
Is the main problem that decode() is called even when not needed?

Well, in that case it's a problem not only for strings. I found
this problem also when I was writing other ranges, for example
when I read binary data from a db stream. front represents a single
row, and I decode it every time, even if it's not needed.

On Friday, 7 March 2014 at 02:37:11 UTC, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up 
 about the automatic encoding and decoding of char ranges.

 Throughout D's history, there are regular and repeated 
 proposals to redesign D's view of char[] to pretend it is not 
 UTF-8, but UTF-32. I.e. so D will automatically generate code 
 to decode and encode on every attempt to index char[].

 I have strongly objected to these proposals on the grounds that:

 1. It is a MAJOR performance problem to do this.

 2. Very, very few manipulations of strings ever actually need 
 decoded values.

 3. D is a systems/native programming language, and 
 systems/native programming languages must not hide the 
 underlying representation (I make similar arguments about 
 proposals to make ints issue errors on overflow, etc.).

 4. Users should choose when decode/encode happens, not the 
 language.

 and I have been successful at heading these off. But one 
 slipped by me. See this in std.array:

    @property dchar front(T)(T[] a) @safe pure if 
 (isNarrowString!(T[]))
   {
     assert(a.length, "Attempting to fetch the front of an empty 
 array of " ~
            T.stringof);
     size_t i = 0;
     return decode(a, i);
   }

 What that means is that if I implement an algorithm that 
 accepts, as input, an InputRange of char's, it will ALWAYS try 
 to decode it. This means that even:

    from.copy(to)

 will decode 'from', and then re-encode it for 'to'. And it will 
 do it SILENTLY. The user won't notice, and he'll just assume 
 that D performance sux. Even if he does notice, his options to 
 make his code run faster are poor.

 If the user wants decoding, it should be explicit, as in:

     from.decode.copy(encode!to)

 The USER should decide where and when the decoding goes. 
 'decode' should be just another algorithm.

 (Yes, I know that std.algorithm.copy() has some specializations 
 to take care of this. But these specializations would have to 
 be written for EVERY algorithm, which is thoroughly 
 unreasonable. Furthermore, copy()'s specializations only apply 
 if BOTH source and destination are arrays. If just one is, the 
 decode/encode penalty applies.)

 Is there any hope of fixing this?

Mar 10 2014
prev sibling next sibling parent "ponce" <contact gam3sfrommars.fr> writes:
On Monday, 10 March 2014 at 11:04:43 UTC, Nick Sabalausky wrote:
 I may have missed it, but I don't see where it says anything 
 about validation or immediate sanitation of invalid sequences. 
 It's mostly "UTF-16 sucks and so does Windows" (not that I'm 
 necessarily disagreeing with it). (ot: Kinda wish they hadn't 
 used such a hard to read font...)

I should have highlighted it: their recommendations for proper encoding handling on Windows are in section 5 ("How to do text on Windows"). One of them is "std::strings and char*, anywhere in the program, are considered UTF-8 (if not said otherwise)." I find it interesting that D tends to enforce this lesson learned from mixed-encoding codebases.
Mar 10 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Sunday, 9 March 2014 at 17:27:20 UTC, Andrei Alexandrescu 
wrote:
 On 3/9/14, 6:47 AM, "Marc Schütz" <schuetzm gmx.net>" wrote:
 On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
 2) It is regression back to C++ days of 
 no-one-cares-about-Unicode
 pain. Thinking about strings as character arrays is so 
 natural and
 convenient that if language/Phobos won't punish you for that, 
 it will
 be extremely widespread.

Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else.

Such as giving up on that crappy language that keeps on breaking their code. Andrei

That was more about "if you are crazy enough to even consider such breakage, this is closer to my personal idea of perfection" than an actual proposal ;)
Mar 10 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Friday, 7 March 2014 at 19:43:57 UTC, Walter Bright wrote:
 On 3/7/2014 7:03 AM, Dicebot wrote:
 1) It is a huge breakage and you have been refusing to do one 
 even for more
 important problems. What is about this sudden change of mind?

1. Performance Performance Performance

Not important enough. D has always been a "safe by default, fast when asked to" language, not the other way around. There is no fundamental performance problem here, only lack of knowledge about Phobos.
 2. The current behavior is surprising (it sure surprised me, I 
 didn't notice it until I looked at the assembler to figure out 
 why the performance sucked)

That may imply that better documentation is needed. You were only surprised because of a wrong initial assumption about what the `char[]` type means.
 3. Weirdnesses like ElementEncodingType

ElementEncodingType is extremely annoying, but I think it is just a side effect of a bigger problem with how string algorithms are currently handled. It does not need to be that way.
 4. Strange behavior differences between char[], char*, and 
 InputRange!char types

Again, there is nothing strange about it. `char[]` is a special type with special semantics that are defined in the documentation, and it consistently follows that definition in everything but raw array indexing/slicing (which I find unfortunate, but also beyond any feasible fix).
 5. Funky anomalous issues with writing OutputRange!char (the 
 put(T) must take a dchar)

Bad but not worth even a small breaking change.
 2) lack of convenient .raw property which will effectively do 
 cast(ubyte[])

I've done the cast as a workaround, but when working with generic code it turns out the ubyte type becomes viral - you have to use it everywhere. So all over the place you're having casts between ubyte <=> char in unexpected places. You also wind up with ugly ubyte <=> dchar casts, with the commensurate risk that you goofed and have a truncation bug.

Of course it is viral. Because you never ever want to have char[] at all if you don't work with Unicode (or work with it on the raw byte level). And in that case it is your responsibility to do manual decoding when appropriate. Trying to squeeze out that performance often means going to a low level with all the associated risks; there is nothing special about char[] here. It is not a common use case.
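
For what it's worth, the raw view discussed here can be had without writing a cast, via `std.string.representation`. A minimal sketch, with `copyRaw` as a hypothetical helper name:

```d
import std.algorithm.mutation : copy;
import std.string : representation;

// Copy a string's code units as plain bytes; nothing is decoded or re-encoded.
ubyte[] copyRaw(string src)
{
    auto dst = new ubyte[](src.length);
    src.representation.copy(dst); // immutable(ubyte)[] -> ubyte[]
    return dst;
}

void main()
{
    string s = "h\u00E9";
    auto bytes = copyRaw(s);

    assert(bytes.length == 3); // 'h' plus the two bytes of 'é'
    assert(bytes == s.representation);
}
```

This does not remove the virality described above: everything downstream of `representation` is typed as `ubyte`, not `char`.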
 Essentially, the auto-decode makes trivial code look better, 
 but if you're writing a more comprehensive string processing 
 program, and care about performance, it makes a regular ugly 
 mess of things.

And this is how it should be. Again, I am all for creating a language that favors performance-critical power programming needs over common/casual needs, but that is not what D is, and you have been making such choices consistently for quite a long time now (array literals that allocate, I will never forgive that). Suddenly changing your mind only because you have encountered this specific issue personally, as opposed to just hearing reports of it, does not befit a language author. It does not really matter whether any new approach is itself good or bad; being unpredictable is reputational damage D simply can't afford.
Mar 10 2014
prev sibling next sibling parent "Abdulhaq" <alynch4047 gmail.com> writes:
On Monday, 10 March 2014 at 10:52:02 UTC, Andrea Fontana wrote:
 I'm not sure I understood the point of this (long) thread.
 The main problem is that decode() is called also if not needed?

I'd like to offer up one D 'user' perspective, it's just a single data point but perhaps useful. I write applications that process Arabic, and I'm thinking about converting one of those apps from Python to D, for performance reasons. My app deals with Unicode Arabic text that is 'out there', and the UnicodeTM support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to various ways that other developers have sequenced the code points. So, my needs as a 'user' are:
* I want to encode all incoming data immediately into Unicode, usually UTF-8, if it isn't already.
* I want to iterate over code points. I don't care about the raw data.
* When I get the length of my string it should be the number of code points.
* When I index my string it should return the nth code point.
* When I manipulate my strings I want to work with code points
... you get the drift.

If I want to access the raw data, which I don't, then I'm very happy to cast to ubyte etc. If encode/decode is a performance issue then perhaps there could be a cache for recently used strings where the code point representation is held. BTW to answer a question in the thread, yes the data is left-to-right and visualised right-to-left.
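
One way to get the code-point semantics listed above with today's Phobos is to decode once into a `dstring`, which is in effect the "cache" suggested here. A minimal sketch; `toCodePoints` is a hypothetical helper name:

```d
import std.conv : to;

// Decode once; afterwards length and indexing are in code points.
dstring toCodePoints(string s)
{
    return s.to!dstring;
}

void main()
{
    // Five Arabic letters (U+0645 U+0631 U+062D U+0628 U+0627), two UTF-8 bytes each.
    string text = "\u0645\u0631\u062D\u0628\u0627";
    assert(text.length == 10);         // code units

    auto points = toCodePoints(text);
    assert(points.length == 5);        // code points
    assert(points[1] == '\u0631');     // the nth code point, in O(1)
}
```

A code point is still not a displayed character once diacritics are stored as combining marks, so the sequencing inconsistencies mentioned above remain the application's job.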
Mar 10 2014
prev sibling next sibling parent "Andrea Fontana" <nospam example.com> writes:
In Italian we need Unicode too. We have several accented letters 
and often programming languages don't handle UTF-8 and other 
encodings so well...

In D I never had any problem with this, and I work a lot on text 
processing.

So my question: is there any problem I'm missing in D with 
Unicode support, or is it just a performance problem in algorithms?

If the problem is performance in algorithms that use .front() but 
don't care to understand its data, why don't we add a .rawFront() 
property, implemented only where it makes sense, and then a "fallback" 
like:

auto rawFront(R)(R range) if ( ... isrange ... && 
!__traits(compiles, range.rawFront))  { return range.front; }

In this way on copy() or other algorithms we can use rawFront() 
and it's backward compatible with other ranges too.
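
A compilable sketch of that idea, assuming standard Phobos. It uses two constrained overloads in place of the `__traits(compiles, ...)` check above, which is a substitution made for this example; `rawFront` itself is the hypothetical name from the proposal, not an existing Phobos function:

```d
import std.range.primitives : front, isInputRange;
import std.traits : isNarrowString;

// Narrow strings: hand back the first code unit without decoding.
auto rawFront(R)(R range) if (isNarrowString!R)
{
    return range[0];
}

// Every other input range: fall back to the ordinary front.
auto rawFront(R)(R range) if (isInputRange!R && !isNarrowString!R)
{
    return range.front;
}

void main()
{
    string s = "\u00E9";           // 'é' is 0xC3 0xA9 in UTF-8
    assert(s.front == '\u00E9');   // decoded code point
    assert(s.rawFront == 0xC3);    // first code unit, no decoding
    assert([10, 20].rawFront == 10);
}
```

Algorithms such as copy() would still have to be changed to call rawFront() where they do not need decoded values.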

But I guess I'm missing the point :)


On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote:
 On Monday, 10 March 2014 at 10:52:02 UTC, Andrea Fontana wrote:
 I'm not sure I understood the point of this (long) thread.
 The main problem is that decode() is called also if not needed?

I'd like to offer up one D 'user' perspective, it's just a single data point but perhaps useful. I write applications that process Arabic, and I'm thinking about converting one of those apps from Python to D, for performance reasons. My app deals with Unicode Arabic text that is 'out there', and the UnicodeTM support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to various ways that other developers have sequenced the code points. So, my needs as a 'user' are:
* I want to encode all incoming data immediately into Unicode, usually UTF-8, if it isn't already.
* I want to iterate over code points. I don't care about the raw data.
* When I get the length of my string it should be the number of code points.
* When I index my string it should return the nth code point.
* When I manipulate my strings I want to work with code points
... you get the drift.

If I want to access the raw data, which I don't, then I'm very happy to cast to ubyte etc. If encode/decode is a performance issue then perhaps there could be a cache for recently used strings where the code point representation is held. BTW to answer a question in the thread, yes the data is left-to-right and visualised right-to-left.

Mar 10 2014
prev sibling next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
On 07.03.2014 03:37, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up about the automatic
 encoding and decoding of char ranges.

After reading many of the attached posts, the question is: what could be D's future design for introducing breaking changes? It's not a solution to say it's not possible because of too many breaking changes; that will become more and more of a problem for D's evolution, much like C++.
Mar 10 2014
next sibling parent reply "Dicebot" <public dicebot.lv> writes:
On Monday, 10 March 2014 at 14:05:39 UTC, dennis luehring wrote:
 On 07.03.2014 03:37, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up 
 about the automatic
 encoding and decoding of char ranges.

After reading many of the attached posts, the question is: what could be D's future design for introducing breaking changes? It's not a solution to say it's not possible because of too many breaking changes; that will become more and more of a problem for D's evolution, much like C++.

Historically 2 approaches have been practiced:

1) argue a lot and then do nothing
2) suddenly change something and tell users it was necessary

I also think that this is a much more important issue than this whole thread, but it does not seem to attract any real attention when mentioned.
Mar 10 2014
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/10/2014 7:35 PM, Yota wrote:
 On Monday, 10 March 2014 at 14:42:18 UTC, Dicebot wrote:
 Yes. I have given up about this idea at some point as there seemed to
 be consensus that no breaking changes will be even considered for D2
 and those that come from fixing bugs are not worth the fuss.

So at what point are we going to discuss these things in the context of D-next?

Not until (at least) the D2/Phobos implementations mature, the current issues get worked out, and the library/tool ecosystem grows and matures.
Mar 10 2014
prev sibling next sibling parent "Abdulhaq" <alynch4047 gmail.com> writes:
On Monday, 10 March 2014 at 14:05:39 UTC, dennis luehring wrote:
 On 07.03.2014 03:37, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up 
 about the automatic
 encoding and decoding of char ranges.

After reading many of the attached posts, the question is: what could be D's future design for introducing breaking changes? It's not a solution to say it's not possible because of too many breaking changes; that will become more and more of a problem for D's evolution, much like C++.

I'm a newbie here but I've been waiting for D to mature for a long time. D IMHO has to stabilise now because:
* D needs a bigger community so that the big fish who have learnt the ins and outs don't get bored and move on due to lack of kudos etc.
* To get the bigger community D needs more _working_ libraries for major toolkits (GUI etc. etc.)
* Libraries will cease to work if there is significant change in D, and then can stay broken because there isn't the inertial mass of other developers to maintain them after the initial developer has moved on. You can see that this has happened a LOT
* Anyway the D that I read about in TDPL is already very exciting for programmers like myself, we just want that, thanks.

Breaking changes can go into D3, if and whenever that is. Keep breaking D2 now and it risks just being forevermore a playpen for computer scientist types. Anyway, who cares what I think, but I think it reflects a lot of people's opinions too.
Mar 10 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Monday, 10 March 2014 at 14:11:13 UTC, Dicebot wrote:
 Historically 2 approaches have been practiced:

 1) argue a lot and then do nothing
 2) suddenly change something and tell users it was necessary

These are one and the same, just from the two opposing points of view.
 I also think that this is much more important issue than this 
 whole thread but it does not seem to attract any real attention 
 when mentioned.

You mean the whole policy on breaking changes?
Mar 10 2014
prev sibling next sibling parent "Abdulhaq" <alynch4047 gmail.com> writes:
 Historically 2 approaches have been practiced:

 1) argue a lot and then do nothing

This happens (I think) because Andrei and Walter really value your and other experts' opinions, but nevertheless have to preserve the general way things work to preserve the long term future of D. They have to be open to persuasion but it would have to be very compelling to get them to change basics now - it seems to me. D is at that difficult 90% stage that we all know about where the boring difficult stuff is left to do. People like to discuss interesting new stuff which at the time seems oh-so-important.
Mar 10 2014
prev sibling next sibling parent "Dicebot" <public dicebot.lv> writes:
On Monday, 10 March 2014 at 14:27:02 UTC, Vladimir Panteleev 
wrote:
 On Monday, 10 March 2014 at 14:11:13 UTC, Dicebot wrote:
 Historically 2 approaches have been practiced:

 1) argue a lot and then do nothing
 2) suddenly change something and tell users it was necessary

These are one and the same, just from the two opposing points of view.

</sarcasm> :)
 I also think that this is much more important issue than this 
 whole thread but it does not seem to attract any real 
 attention when mentioned.

You mean the whole policy on breaking changes?

Yes. I have given up on this idea at some point as there seemed to be consensus that no breaking changes will even be considered for D2 and those that come from fixing bugs are not worth the fuss. This is exactly why I was so shocked that Walter even started this thread. If breaking changes are actually considered (rare or not), then it is absolutely critical to define the process for them and put a link to its description on the dlang.org front page.
Mar 10 2014
prev sibling parent "Yota" <yotaxp thatGoogleMailThing.com> writes:
On Monday, 10 March 2014 at 14:42:18 UTC, Dicebot wrote:
 Yes. I have given up on this idea at some point as there 
 seemed to be consensus that no breaking changes will even be 
 considered for D2 and those that come from fixing bugs are not 
 worth the fuss.

So at what point are we going to discuss these things in the context of D-next? These topics have us group up and focus on compromises instead of ideals. As was said, D2 is at the 90% point. It only has room left for bug fixes. I think we would make much more productive use of our time and minds coming up with ideas that actually have a chance of coming to fruition, even if D3 ends up being half a decade away.
Mar 10 2014
prev sibling next sibling parent Johannes Pfau <nospam example.com> writes:
Am Mon, 10 Mar 2014 14:05:03 +0000
schrieb "Andrea Fontana" <nospam example.com>:

 In Italian we need unicode too. We have several accented letters 
 and often programming languages don't handle utf-8 and other 
 encodings so well...
 
 In D I never had any problem with this, and I work a lot on text 
 processing.
 
 So my question: is there any problem I'm missing in D with 
 unicode support or is just a performance problem on algorithms?

The only real problem apart from potential performance issues I've seen mentioned in this thread is that indexing/slicing is done with code units. I think this:

    auto index = countUntil(...);
    auto slice = str[0 .. index];

is really the only problem with the current implementation.

If we could start from scratch I'd say we keep operating on code points by default but don't make strings arrays of char/wchar/dchar. Instead they should be special types which do all operations (especially indexing, slicing) on code points. This would be as safe as the current implementation, always consistent but probably even slower in some cases. Then offer some nice way to get the raw data for algorithms which can deal with it.

However, I think it's too late to make these changes.
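
The countUntil/slice mismatch can be reproduced in a few lines. This is only a sketch: the sample string and the helper names prefixByCodeUnits and prefixByCodePoints are made up for illustration, while countUntil and std.utf.toUTFindex are existing Phobos functions. Both helpers assume the needle is present in the string.

```d
import std.algorithm.searching : countUntil;
import std.utf : toUTFindex;

// Hypothetical helper: uses the code point count directly as a slice bound.
// Slicing works on code units, so the prefix comes out short whenever a
// multi-byte character precedes the match.
string prefixByCodeUnits(string str, dchar needle)
{
    auto index = str.countUntil(needle); // counts decoded code points
    return str[0 .. index];
}

// Hypothetical helper: converts the code point count into a code unit
// offset before slicing.
string prefixByCodePoints(string str, dchar needle)
{
    auto index = str.countUntil(needle);
    return str[0 .. str.toUTFindex(index)];
}

void main()
{
    string str = "h\u00E9llo w\u00F6rld"; // U+00E9 and U+00F6 take two code units each
    assert(str.countUntil(' ') == 5);                      // five code points before the space
    assert(prefixByCodeUnits(str, ' ') == "h\u00E9ll");    // one character short
    assert(prefixByCodePoints(str, ' ') == "h\u00E9llo");  // the intended prefix
}
```

With other inputs the first helper can also cut a multi-byte sequence in half and return invalid UTF-8, which is arguably worse than a short prefix.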
Mar 10 2014
prev sibling next sibling parent "Marc Schütz" <schuetzm gmx.net> writes:
On Monday, 10 March 2014 at 13:18:50 UTC, Dicebot wrote:
 On Sunday, 9 March 2014 at 17:27:20 UTC, Andrei Alexandrescu 
 wrote:
 On 3/9/14, 6:47 AM, "Marc Schütz" <schuetzm gmx.net> wrote:
 On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
 2) It is regression back to C++ days of 
 no-one-cares-about-Unicode
 pain. Thinking about strings as character arrays is so 
 natural and
 convenient that if language/Phobos won't punish you for 
 that, it will
 be extremely widespread.

Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else.

Such as giving up on that crappy language that keeps on breaking their code. Andrei

That was more about "if you are crazy enough to even consider such breakage, this is closer to my personal perfection" than an actual proposal ;)

BTW, I don't believe it would be that bad, because there's a straightforward path of deprecation: First, std.range.front for narrow strings (and dchar, for consistency) can be marked as deprecated. The deprecation message can say: "Please specify .byCodePoint()/.byCodeUnit()", guiding the users towards a better style (assuming one agrees that explicit is indeed better than implicit in this case). After some time, the functionality can be moved into a compatibility module, with the deprecated functions still there, but now additionally telling the user about the quick fix of importing that module. The deprecation period can be very long, and even if the functions should never be removed, at least everyone writing new code would do so in the new style.
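
For readers who want to see what the explicit style looks like: later Phobos releases provide std.utf.byCodeUnit and std.utf.byDchar, which are close to the .byCodeUnit()/.byCodePoint() names proposed here (byDchar playing the role of byCodePoint). The sketch below uses those; the function names codeUnitCount and codePointCount and the sample string are illustrative assumptions, not part of any proposal in this thread.

```d
import std.range : walkLength;
import std.utf : byCodeUnit, byDchar;

// Explicit code unit view: the caller asks for raw UTF-8 units,
// so nothing is decoded.
size_t codeUnitCount(string s)
{
    return s.byCodeUnit.walkLength;
}

// Explicit code point view: decoding happens because the caller
// requested it, not as a hidden default.
size_t codePointCount(string s)
{
    return s.byDchar.walkLength;
}

void main()
{
    string s = "na\u00EFve"; // U+00EF occupies two UTF-8 code units
    assert(codeUnitCount(s) == 6);
    assert(codePointCount(s) == 5);
}
```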
Mar 10 2014
prev sibling next sibling parent "Marc Schütz" <schuetzm gmx.net> writes:
On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote:
 My app deals with unicode arabic text that is 'out there', and 
 the UnicodeTM support for Arabic is not that well thought out, 
 so the data is often (always) inconsistent in terms of 
 sequencing diacritics etc. Even the code page can vary. 
 Therefore my code has to cater to various ways that other 
 developers have sequenced the code points.

 So, my needs as a 'user' are:
 * I want to encode all incoming data immediately into unicode, 
 usually UTF8, if it isn't already.
 * I want to iterate over code points. I don't care about the 
 raw data.
 * When I get the length of my string it should be the number of 
 code points.
 * When I index my string it should return the nth code point.
 * When I manipulate my strings I want to work with code points
 ... you get the drift.

Are you sure that code points is what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters...
Mar 10 2014
prev sibling next sibling parent "Abdulhaq" <alynch4047 gmail.com> writes:
On Monday, 10 March 2014 at 18:54:26 UTC, Marc Schütz wrote:
 On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote:
 My app deals with unicode arabic text that is 'out there', and 
 the UnicodeTM support for Arabic is not that well thought out, 
 so the data is often (always) inconsistent in terms of 
 sequencing diacritics etc. Even the code page can vary. 
 Therefore my code has to cater to various ways that other 
 developers have sequenced the code points.

 So, my needs as a 'user' are:
 * I want to encode all incoming data immediately into unicode, 
 usually UTF8, if it isn't already.
 * I want to iterate over code points. I don't care about the 
 raw data.
 * When I get the length of my string it should be the number 
 of code points.
 * When I index my string it should return the nth code point.
 * When I manipulate my strings I want to work with code points
 ... you get the drift.

Are you sure that code points is what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters...

I checked the terminology before posting so I'm pretty sure. Arabic has a code page for the logical characters, one code point for each letter of the alphabet and others for various diacritics. Because of the 'shaping' each logical character has various glyphs, found on other code pages. Text editing programs tend to store typed Arabic as the user entered it, and because there can be more than one diacritic per alphabetic letter the sequence varies as to how the user sequenced them.
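
To make the sequencing issue concrete, here is a small sketch using std.uni. It assumes a single letter (beh, U+0628) carrying shadda (U+0651) and fatha (U+064E) typed in two different orders; the helper name sameAfterNormalization is made up for illustration. Canonical normalization only reorders combining marks, so this says nothing about shaping, glyph code pages, or the Quranic annotation marks mentioned in this thread.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize;

// Hypothetical helper: compares two strings after canonical
// normalization (NFC is the default form of std.uni.normalize).
bool sameAfterNormalization(string a, string b)
{
    return normalize(a) == normalize(b);
}

void main()
{
    string a = "\u0628\u0651\u064E"; // beh, shadda, fatha
    string b = "\u0628\u064E\u0651"; // beh, fatha, shadda

    assert(a != b);                       // different code point sequences
    assert(a.walkLength == 3);            // three code points ...
    assert(a.byGrapheme.walkLength == 1); // ... forming one grapheme
    assert(sameAfterNormalization(a, b)); // equal once marks are put in canonical order
}
```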
Mar 10 2014
prev sibling next sibling parent "Abdulhaq" <alynch4047 gmail.com> writes:
On Monday, 10 March 2014 at 18:54:26 UTC, Marc Schütz wrote:
 On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote:
 My app deals with unicode arabic text that is 'out there', and 
 the UnicodeTM support for Arabic is not that well thought out, 
 so the data is often (always) inconsistent in terms of 
 sequencing diacritics etc. Even the code page can vary. 
 Therefore my code has to cater to various ways that other 
 developers have sequenced the code points.

 So, my needs as a 'user' are:
 * I want to encode all incoming data immediately into unicode, 
 usually UTF8, if it isn't already.
 * I want to iterate over code points. I don't care about the 
 raw data.
 * When I get the length of my string it should be the number 
 of code points.
 * When I index my string it should return the nth code point.
 * When I manipulate my strings I want to work with code points
 ... you get the drift.

Are you sure that code points is what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters...

Adding to my other comment, I don't expect a string type to understand Arabic and merge the diacritics for me. In fact there are other symbols (code points) that can also be present, for instance instructions on how Quranic text is to be read. These issues have not been standardised and I would say are not well understood generally.
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/10/2014 6:47 AM, Dicebot wrote:
 (array literals that allocate, I will never forgive that).

It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature.

I think you forget about this:

    foo(int v, int w)
    {
       auto x = [v, w];
    }

Which cannot pre-allocate.

That said, I would not mind if this code broke and you had to use array(v, w) instead, for the sake of avoiding unnecessary allocations.

-Steve
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 22:56:22 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/10/14, 7:07 PM, Steven Schveighoffer wrote:
 On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 3/10/2014 6:47 AM, Dicebot wrote:
 (array literals that allocate, I will never forgive that).

It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature.

I think you forget about this:

    foo(int v, int w)
    {
       auto x = [v, w];
    }

Which cannot pre-allocate.

It actually can, seeing as x is a dead assignment :o).

Actually, it can't do anything, seeing as it's invalid code ;)
 That said, I would not mind if this code broke and you had to use
 array(v, w) instead, for the sake of avoiding unnecessary allocations.

Fixing that:

    int[] foo(int v, int w) { return [v, w]; }

This one would allocate. But analyses of varying complexity may eliminate a variety of allocation patterns.

I think you are missing what I'm saying, I don't want the allocation eliminated, but if we eliminate some allocations with [] and not others, it will be confusing.

The path I'd always hoped we would go in was to make all array literals immutable, and make allocation of mutable arrays on the heap explicit. Adding eliding of some allocations for optimization is good, but I (and I think possibly Dicebot) think all array literals should not allocate.

-Steve
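
As a rough illustration of the distinction being argued about, the sketch below contrasts a literal that has to live on the GC heap with one that initializes a fixed-size array. It relies on the @nogc attribute, which was added to the language after this discussion; the function names makeDynamic and makeStatic are made up for the example.

```d
// Returns a dynamic array: the literal's storage must outlive the call,
// so it is allocated on the GC heap.
int[] makeDynamic(int v, int w)
{
    return [v, w];
}

// A fixed-size array is a value type. Initializing it from a literal needs
// no GC allocation, and @nogc makes the compiler verify that.
int[2] makeStatic(int v, int w) @nogc
{
    int[2] pair = [v, w];
    return pair;
}

void main()
{
    assert(makeDynamic(1, 2) == [1, 2]);
    assert(makeStatic(1, 2) == [1, 2]);
}
```

Marking makeDynamic as @nogc would be rejected by the compiler, which gives a quick way to find the literals that allocate.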
Mar 10 2014
prev sibling next sibling parent "Sean Kelly" <sean invisibleduck.org> writes:
On Tuesday, 11 March 2014 at 02:07:19 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright 
 <newshound2 digitalmars.com> wrote:

 On 3/10/2014 6:47 AM, Dicebot wrote:
 (array literals that allocate, I will never forgive that).

It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature.

I think you forget about this:

    foo(int v, int w)
    {
       auto x = [v, w];
    }

Which cannot pre-allocate.

The array is small and does not escape. It could be allocated on the stack as an optimization.
Mar 11 2014
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, March 06, 2014 18:37:13 Walter Bright wrote:
 Is there any hope of fixing this?

I agree with Andrei. I don't think that there's really anything to fix. The problem is that there are roughly 3 levels at which string operations can be done

1. By code unit
2. By code point
3. By grapheme

and which is correct depends on what you're trying to do. Phobos attempts to go for correctness by default without seriously impacting performance, so it treats all strings as ranges of dchar (so, level #2). If we went with #1, then pretty much any algorithm which operated on individual characters would be broken, as unless your strings are ASCII-only, code units are very much the wrong level to be operating on if you're trying to deal with characters. If we went with #3, then we'd have full correctness, but we'd tank performance. With #2, we're far more correct than is typically the case with C++ while still being reasonably performant. And those who want full performance can use immutable(ubyte)[] to get #1, and those who want #3 can use the grapheme support in std.uni.

We've gone to great lengths in Phobos to specialize on narrow strings in order to make it more efficient while still maintaining correctness, and anyone who really wants performance can do the same. But by operating on the code point level, we at least get a reasonable level of unicode-correctness by default. With your suggestion, I'd fully expect most D programs to be wrong with regards to Unicode, because most programmers don't know or care about how Unicode works. And changing what we're doing now would be code breakage of astronomical proportions. It will essentially break all uses of range-based string code. Certainly, it would be the largest code breakage that D has seen in years if not ever. So, it's almost certainly a bad idea, but if it isn't, we need to be darn sure that what we change to is significantly better and worth the huge amount of code breakage that it will cause.

I really don't think that there's any way to get this right. Regardless of which level you operate at by default - be it code unit, code point, or grapheme - it will be wrong a good chunk of the time. So, it becomes a question of which of the three has the best tradeoffs, and I think that our current solution of operating on code points by default does that. If there are things that we can do to better support operating on code units or graphemes for those who want it, then great. And it's great if we can find ways to make operating at the code point level more efficient or less prone to bugs due to not operating at the grapheme level. But I think that operating on the code point level like we currently do is by far the best approach.

If anything, it's the fact that the language doesn't do that that's a bigger concern IMHO - the main place where that's an issue being the fact that foreach iterates by code unit by default. But I don't know of a good way to solve that other than treating all arrays of char, wchar, and dchar specially, and disabling their array operations like ranges do, so that you have to convert them to code units via the representation function in order to operate on them as code units - which Andrei has suggested a number of times before, but you've shot him down each time. If that were fixed, then at least we'd be consistent, which is usually the biggest complaint with regards to how D treats strings. But I really don't think that there's a magical fix for range-based string operations, and I think that our current approach is a good one.

- Jonathan M Davis
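
The three levels, and the foreach inconsistency mentioned above, can be seen side by side in a short sketch. The helper names codeUnits, codePoints and graphemes and the sample string are assumptions made for illustration; representation, walkLength and byGrapheme are existing Phobos functions.

```d
import std.range : walkLength;
import std.string : representation;
import std.uni : byGrapheme;

// Level 1: raw UTF-8 code units, via the representation function.
size_t codeUnits(string s)  { return s.representation.length; }

// Level 2: code points, which is what range primitives see on a string.
size_t codePoints(string s) { return s.walkLength; }

// Level 3: graphemes, using the support in std.uni.
size_t graphemes(string s)  { return s.byGrapheme.walkLength; }

void main()
{
    string s = "e\u0301x"; // 'e', combining acute accent, 'x'

    assert(codeUnits(s) == 4);  // the accent takes two code units
    assert(codePoints(s) == 3);
    assert(graphemes(s) == 2);  // 'e' plus accent count as one

    // foreach without an explicit type iterates by code unit ...
    size_t n;
    foreach (c; s) ++n;
    assert(n == 4);

    // ... while asking for dchar makes the language decode.
    n = 0;
    foreach (dchar c; s) ++n;
    assert(n == 3);
}
```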
Mar 12 2014
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 10 Mar 2014 17:44:22 -0400
schrieb Nick Sabalausky <SeeWebsiteToContactMe semitwist.com>:

 On 3/7/2014 8:40 AM, Michel Fortin wrote:
 On 2014-03-07 03:59:55 +0000, "bearophile" <bearophileHUGS lycos.com> said:

 Walter Bright:

 I understand this all too well. (Note that we currently have a
 different silent problem: unnoticed large performance problems.)

On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).

The way Phobos works isn't any more correct than dealing with code units. Many graphemes span multiple code points -- because of combined diacritics or character variant modifiers -- and decoding at the code-point level is thus often insufficient for correctness.

Well, it is *more* correct, as many western languages are more likely in current Phobos to "just work" in most cases. It's just that things still aren't completely correct overall.
  From my experience, I'd suggest these basic operations for a "string
 range" instead of the regular range interface:

 .empty
 .frontCodeUnit
 .frontCodePoint
 .frontGrapheme
 .popFrontCodeUnit
 .popFrontCodePoint
 .popFrontGrapheme
 .codeUnitLength (aka length)
 .codePointLength (for dchar[] only)
 .codePointLengthLinear
 .graphemeLengthLinear

 Someone should be able to mix all the three 'front' and 'pop' function
 variants above in any code dealing with a string type. In my XML parser
 for instance I regularly use frontCodeUnit to avoid the decoding penalty
 when matching the next character with an ASCII one such as '<' or '&'.
 An API like the one above forces you to be aware of the level you're
 working on, making bugs and inefficiencies stand out (as long as you're
 familiar with each representation).

 If someone wants to use a generic array/range algorithm with a string,
 my opinion is that he should have to wrap it in a range type that maps
 front and popFront to one of the above variant. Having to do that should
 make it obvious that there's an inefficiency there, as you're using an
 algorithm that wasn't tailored to work with strings and that more
 decoding than strictly necessary is being done.

I actually like this suggestion quite a bit.

+1

Reminds me of my proposal for Rust (https://github.com/mozilla/rust/issues/7043#issuecomment-19187984)

-- Marco
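
A rough sketch of what part of the "string range" interface quoted above could look like in D. The type name StringCursor is hypothetical and only some of the listed members are shown; decode, stride and graphemeStride are existing Phobos functions. It is meant to show how the three levels can be mixed on one value, not as a finished design.

```d
import std.uni : graphemeStride;
import std.utf : decode, stride;

// Hypothetical cursor over a UTF-8 string; member names follow the
// list in the quoted post.
struct StringCursor
{
    string data;

    @property bool empty() { return data.length == 0; }
    @property size_t codeUnitLength() { return data.length; }

    // Code unit level: no decoding at all.
    @property char frontCodeUnit() { return data[0]; }
    void popFrontCodeUnit() { data = data[1 .. $]; }

    // Code point level: decode exactly one UTF-8 sequence.
    @property dchar frontCodePoint()
    {
        size_t i = 0;
        return decode(data, i);
    }
    void popFrontCodePoint() { data = data[stride(data, 0) .. $]; }

    // Grapheme level: a base character plus its combining marks.
    @property string frontGrapheme() { return data[0 .. graphemeStride(data, 0)]; }
    void popFrontGrapheme() { data = data[graphemeStride(data, 0) .. $]; }
}

void main()
{
    auto c = StringCursor("<e\u0301>");
    assert(c.frontCodeUnit == '<');       // ASCII test without decoding
    c.popFrontCodeUnit();
    assert(c.frontCodePoint == 'e');      // first code point of the grapheme
    assert(c.frontGrapheme == "e\u0301"); // 'e' plus combining accent
    c.popFrontGrapheme();
    assert(c.frontCodeUnit == '>');
    c.popFrontCodeUnit();
    assert(c.empty);
}
```

The frontCodeUnit check mirrors the XML parser case from the quoted post, where the next character is compared against an ASCII one such as '<' or '&' without paying for decoding.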
Mar 18 2014
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sat, 08 Mar 2014 22:07:09 +0000
schrieb "Sean Kelly" <sean invisibleduck.org>:

 On Saturday, 8 March 2014 at 20:50:49 UTC, Andrei Alexandrescu 
 wrote:
 Pretty much everyone using ICU hates it.

I think the biggest problem with ICU is documentation. It can take a long time to figure out how to do something if you've never done it before. Also, the C interface in ICU seems better than the C++ interface. And I'll grant that a few things are just far harder than they need to be. I wanted a transcoding iterator and ICU almost has this but not quite, so I've got to write my own. In fact, iterating across an arbitrary encoding in general is at least not intuitive and perhaps not possible. I kinda gave up on that. Um, and it uses UTF-16 as the standard encoding, so many transcoding operations require two conversions. Okay, I guess there are a lot of problems with ICU, but it handles nearly every requirement I have, which is in itself quite a lot.

You'll find the answer here: http://userguide.icu-project.org/icufaq#TOC-What-is-the-performance-difference-between-UTF-8-and-UTF-16-

In addition it is infeasible to maintain code for direct conversions with all the encodings they support. The project doesn't aim at providing a specific transcoding but all of them equally. What can you do. For Java it is easier to accept since they use UTF-16 internally.

-- Marco
Mar 19 2014