digitalmars.D - Range of chars (narrow string ranges)

Martin Nowak (7/7) Apr 24 2015 Just want to make this a bit more visible.

H. S. Teoh via Digitalmars-d (10/19) Apr 24 2015 I really wish we would just *make the darn decision* already, whether to

Walter Bright (16/22) Apr 24 2015 Some facts:

Martin Nowak (4/13) Apr 24 2015 It probably won't be too disruptive to optimize algorithms such as
Brad Anderson (17/23) Apr 24 2015 Yay!

Walter Bright (7/22) Apr 24 2015 If they accept ranges, and don't special case narrow strings, then they ...

Jonathan M Davis (27/59) Apr 24 2015 I really think that leaving things with autodecoding in some
Steven Schveighoffer (4/6) Apr 24 2015 This is pretty easy. We just have to create a string type that is backed...

Walter Bright (2/4) Apr 24 2015 Just shoot me now!

Steven Schveighoffer (5/10) Apr 24 2015 Yeah, that's the reaction I figured I'd get ;) But it doesn't hurt to

Jonathan M Davis (25/37) Apr 24 2015 Honestly, even if that were the ideal way to go (and I don't

H. S. Teoh via Digitalmars-d (12/30) Apr 27 2015 [...]

Jonathan M Davis (31/78) Apr 27 2015 I expect that the two biggest problems causing the current

Chris (10/89) Apr 28 2015 Would it be much work to show have example code or even an

Jonathan M Davis (82/91) Apr 28 2015 Honestly, most code won't care. If we just switched out all of

Vladimir Panteleev (11/17) Apr 28 2015 One proposal is to make char and dchar comparisons illegal (after

H. S. Teoh via Digitalmars-d (6/25) Apr 28 2015 Oooh, Jonathan has the code ready? Haha, maybe I'll start using that

Damian (3/37) Apr 28 2015 I second that! If we all make the switch, perhaps Walter will

Jonathan M Davis (13/15) Apr 28 2015 Walter isn't necessarily the one we have to convince in this

Jonathan M Davis (33/52) Apr 28 2015 It would, but it doesn't necessarily play nicely with the

Chris (5/100) Apr 29 2015 This sounds like a good starting point for a transition plan. One

Jonathan M Davis (63/67) Apr 29 2015 Well, personally, I think that it's worth it even if the

Chris (7/75) Apr 29 2015 Ok, I see. Well, if we don't want to repeat C++'s mistakes, we

ketmar (4/6) Apr 25 2015 the more time passing the harder autodecode to kill. kill it while it's...
Kagamin (3/5) Apr 30 2015 Also NSString. It used to support UTF-16 and C encoding. AFAIK,

Martin Nowak <code+news.digitalmars dawg.eu> writes:

Just want to make this a bit more visible.
https://github.com/D-Programming-Language/phobos/pull/3206#issuecomment-95681812

We just added entabber to std.string, and AFAIK, it's the first range
algorithm that transforms narrow strings to a range of chars, instead of
decoding the original string and returning a range of dchars.

Most of phobos can't handle such ranges like strings and you'd have to
decode them using byDchar to work with them.

Apr 24 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Fri, Apr 24, 2015 at 08:39:36PM +0200, Martin Nowak via Digitalmars-d wrote:
 Just want to make this a bit more visible.
 https://github.com/D-Programming-Language/phobos/pull/3206#issuecomment-95681812
 
 We just added entabber to std.string, and AFAIK, it's the first range
 algorithm that transforms narrow strings to a range of chars, instead
 of decoding the original string and returning a range of dchars.
 
 Most of phobos can't handle such ranges like strings and you'd have to
 decode them using byDchar to work with them.

I really wish we would just *make the darn decision* already, whether to
kill off autodecoding or not, and MAKE IT CONSISTENT ACROSS PHOBOS,
instead of introducing this schizophrenic dichotomy where some functions
give you a range of dchar while others give you a range of char/wchar,
and the two don't work well together. This is totally going to make a
laughing stock of D one day.


T

-- 
Guns don't kill people. Bullets do.

Apr 24 2015

Walter Bright <newshound2 digitalmars.com> writes:

On 4/24/2015 11:52 AM, H. S. Teoh via Digitalmars-d wrote:
 I really wish we would just *make the darn decision* already, whether to
 kill off autodecoding or not, and MAKE IT CONSISTENT ACROSS PHOBOS,
 instead of introducing this schizophrenic dichotomy where some functions
 give you a range of dchar while others give you a range of char/wchar,
 and the two don't work well together. This is totally going to make a
 laughing stock of D one day.

Some facts:

1. When I started D, there was a lot of speculation about whether the world
would settle on UTF8, UTF16, or UTF32. So D supports natively all three. Time
has shown, however, that UTF8 has pretty much won. wchar only exists for
Windows API and Java, dchar strings pretty much don't exist in the wild.

2. dchar is very useful as a character type, but not as a string type.

3. Pretty much none of the algorithms in Phobos work when presented with a
range of chars or wchars. This is not even documented.

4. Autodecoding is inefficient, especially considering that few algorithms 
actually need decoding. Re-encoding the result back to UTF8 is another
inefficiency.

I'm afraid we are stuck with autodecoding, as taking it out may be far too 
disruptive.

But all is not lost. The Phobos algorithms can all be fixed to not care about 
autodecoding. The changes I've made to std.string all reflect that.

https://github.com/D-Programming-Language/phobos/pulls/WalterBright
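
For readers who have not run into autodecoding before, here is a minimal sketch of the behavior under discussion. The helper name countElements is made up for illustration, and the sketch assumes a Phobos recent enough to provide std.utf.byCodeUnit.

```d
import std.range : ElementType, front, walkLength;
import std.traits : Unqual;
import std.utf : byCodeUnit;

// Hypothetical helper: how many elements a range yields when iterated.
size_t countElements(R)(R r) { return r.walkLength; }

void main()
{
    string s = "\u00E9t\u00E9";   // "été": 3 code points, 5 UTF-8 code units

    // The array itself is measured in code units...
    assert(s.length == 5);

    // ...but the range primitives for narrow strings decode to dchar.
    static assert(is(ElementType!string == dchar));
    assert(s.front == '\u00E9');
    assert(countElements(s) == 3);

    // A wrapper range over the same data is not autodecoded.
    auto units = s.byCodeUnit;
    static assert(is(Unqual!(ElementType!(typeof(units))) == char));
    assert(countElements(units) == 5);
}
```

The same bytes therefore look like a range of dchar or a range of char depending on whether an algorithm receives the array or a wrapper around it.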

Apr 24 2015

Martin Nowak <code+news.digitalmars dawg.eu> writes:

On 04/24/2015 10:44 PM, Walter Bright wrote:
 4. Autodecoding is inefficient, especially considering that few
 algorithms actually need decoding. Re-encoding the result back to UTF8
 is another inefficiency.
 
 I'm afraid we are stuck with autodecoding, as taking it out may be far
 too disruptive.
 
 But all is not lost. The Phobos algorithms can all be fixed to not care
 about autodecoding. The changes I've made to std.string all reflect that.

It probably won't be too disruptive to optimize algorithms such as
filter to return a range of chars, but only if we support such ranges as
narrow strings everywhere.

Apr 24 2015

"Brad Anderson" <eco gnuk.net> writes:

On Friday, 24 April 2015 at 20:44:34 UTC, Walter Bright wrote:
 [snip]
 I'm afraid we are stuck with autodecoding, as taking it out may 
 be far too disruptive.

No!

 But all is not lost. The Phobos algorithms can all be fixed to 
 not care about autodecoding. The changes I've made to 
 std.string all reflect that.

Yay!

I haven't really followed the autodecoding conversations. The 
problem is that front on char ranges decodes, right? Is there a 
quick way to tell which functions are auto decoding so we can 
have a list of candidates for replacement? It'd be good for 
hackweek.

I'm reminded of this conversation 
http://forum.dlang.org/post/xgnurdjcqiyatpvnwznd forum.dlang.org
which contains a partial list of candidates. Following your lead 
with implementing these lazy versions (without autodecoding) 
would be good hackweek projects.

Finally, there is this http://goo.gl/Wmotu4 list from 
http://forum.dlang.org/post/lvmydbvjivsvmwtimobs forum.dlang.org 
that has some good candidates for hackweek I think.

Are we collecting hackweek ideas anywhere?

Apr 24 2015

Walter Bright <newshound2 digitalmars.com> writes:

On 4/24/2015 3:29 PM, Brad Anderson wrote:
 I haven't really followed the autodecoding conversations. The problem is that
 front on char ranges decode, right?

Nope. Only front on narrow string arrays. Ranges aren't autodecoded.


 Is there a quick way to tell which functions
 are auto decoding so we can have a list of candidates for replacement? It'd be
 good for hackweek.

If they accept ranges, and don't special case narrow strings, then they
autodecode.


 I'm reminded of this conversation
 http://forum.dlang.org/post/xgnurdjcqiyatpvnwznd forum.dlang.org
 which contains a partial list of candidates.

PR's exist for most of these now.

 Following your lead with
 implementing these lazy versions (without autodecoding) would be good hackweek
 projects.

Yup.


 Finally, there is this http://goo.gl/Wmotu4 list from
 http://forum.dlang.org/post/lvmydbvjivsvmwtimobs forum.dlang.org that has some
 good candidates for hackweek I think.

Yes, we should have an answer for each of the Boost string algorithms.


 Are we collecting hackweek ideas anywhere?

Andrei?

Apr 24 2015

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Friday, 24 April 2015 at 20:44:34 UTC, Walter Bright wrote:
 On 4/24/2015 11:52 AM, H. S. Teoh via Digitalmars-d wrote:
 I really wish we would just *make the darn decision* already, 
 whether to
 kill off autodecoding or not, and MAKE IT CONSISTENT ACROSS 
 PHOBOS,
 instead of introducing this schizophrenic dichotomy where some 
 functions
 give you a range of dchar while others give you a range of 
 char/wchar,
 and the two don't work well together. This is totally going to 
 make a
 laughing stock of D one day.

 Some facts:

 1. When I started D, there was a lot of speculation about 
 whether the world would settle on UTF8, UTF16, or UTF32. So D 
 supports natively all three. Time has shown, however, that UTF8 
 has pretty much won. wchar only exists for Windows API and 
 Java, dchar strings pretty much don't exist in the wild.

 2. dchar is very useful as a character type, but not as a 
 string type.

 3. Pretty much none of the algorithms in Phobos work when 
 presented with a range of chars or wchars. This is not even 
 documented.

 4. Autodecoding is inefficient, especially considering that few 
 algorithms actually need decoding. Re-encoding the result back 
 to UTF8 is another inefficiency.

 I'm afraid we are stuck with autodecoding, as taking it out may 
 be far too disruptive.

 But all is not lost. The Phobos algorithms can all be fixed to 
 not care about autodecoding. The changes I've made to 
 std.string all reflect that.

 https://github.com/D-Programming-Language/phobos/pulls/WalterBright

I really think that leaving things with autodecoding in some 
cases and not in others is just asking for trouble. Even if we 
manage to figure out how to fix it so that Phobos doesn't 
autodecode in any of its algorithms without breaking any user 
code in the process, that then leaves user code with the problem, 
and since Phobos _wouldn't_ have the problem, it then would be 
all the more confusing.

It _is_ possible to get rid of it entirely without breaking code 
if we move the array range primitives to a new module and later 
deprecate the old ones, though that would probably mean breaking 
up std.array into submodules and deprecating _all_ of it in favor 
of its submodules, since anyone importing std.array would then 
have the old array range primitives rather than the new ones - or 
both, causing conflicts. And it's made worse by the fact that 
std.range publicly imports std.array. So, yes, it _is_ ugly. But 
it _can_ be done.

If we leave autodecoding in and just work around it everywhere in 
Phobos, it's just going to forever screw with user code and 
confuse users. They get confused enough by it as it is, and at 
least now, they're running into it in Phobos where we can explain 
it, whereas if they don't see it with Phobos and only with their 
own code, then they're going to think that they're doing 
something wrong and potentially get very frustrated.

I definitely share the concern that removing autodecoding 
outright will be too disruptive, but at the same time, I don't 
know if we can afford to go halfway with it.

Apr 24 2015

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 4/24/15 4:44 PM, Walter Bright wrote:

 I'm afraid we are stuck with autodecoding, as taking it out may be far
 too disruptive.

This is pretty easy. We just have to create a string type that is backed 
by, but isn't simply an alias to, an array of char.
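
One possible shape for such a type, purely as a sketch: the name Utf8String and its two methods are hypothetical and do not correspond to any existing or proposed Phobos API.

```d
import std.range : isInputRange, walkLength;
import std.utf : byCodeUnit, byDchar;

// Hypothetical type, named here only for illustration.
struct Utf8String
{
    private string data;   // backed by an array of char...

    this(string s) { data = s; }

    // ...but it has no front/popFront/empty of its own,
    // so callers must pick a view explicitly.
    auto codeUnits()  { return data.byCodeUnit; }
    auto codePoints() { return data.byDchar; }
}

// Not a range itself, so range algorithms cannot silently pick a decoding.
static assert(!isInputRange!Utf8String);

void main()
{
    auto s = Utf8String("h\u00E9");          // 'h' plus U+00E9
    assert(s.codeUnits.walkLength == 3);     // 1 + 2 UTF-8 code units
    assert(s.codePoints.walkLength == 2);    // 2 code points
}
```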

-Steve

Apr 24 2015

Walter Bright <newshound2 digitalmars.com> writes:

On 4/24/2015 4:56 PM, Steven Schveighoffer wrote:
 This is pretty easy. We just have to create a string type that is backed by,
 but isn't simply an alias to, an array of char.

Just shoot me now!

Apr 24 2015

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 4/24/15 9:02 PM, Walter Bright wrote:
 On 4/24/2015 4:56 PM, Steven Schveighoffer wrote:
 This is pretty easy. We just have to create a string type that is
 backed by, but
 isn't simply an alias to, an array of char.

 Just shoot me now!

Yeah, that's the reaction I figured I'd get ;) But it doesn't hurt to 
keep trying since we keep coming back to this over, and over, and over, 
and over...

-Steve

Apr 24 2015

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Saturday, 25 April 2015 at 02:04:02 UTC, Steven Schveighoffer 
wrote:
 On 4/24/15 9:02 PM, Walter Bright wrote:
 On 4/24/2015 4:56 PM, Steven Schveighoffer wrote:
 This is pretty easy. We just have to create a string type 
 that is
 backed by, but
 isn't simply an alias to, an array of char.

 Just shoot me now!

 Yeah, that's the reaction I figured I'd get ;) But it doesn't 
 hurt to keep trying since we keep coming back to this over, and 
 over, and over, and over...

Honestly, even if that were the ideal way to go (and I don't 
think that it is), I'd expect that to be even more disruptive 
than trying to rearrange the modules so that front and friends 
don't autodecode for strings.

I suppose that a related alternative would be to change it so 
that strings aren't considered ranges anymore (at least 
temporarily), and force folks to use stuff like byChar or byDchar 
(or whatever those functions are) whenever they use strings as 
ranges. And actually, that _would_ allow us to get rid of the 
autodecoding without rearranging modules. Later, we could change 
them to being ranges of their actual element types, or we could 
just force folks to be explicit forever in an effort to make the 
Unicode issues clear, if we thought that that were better (though 
it would probably be better to just change front and friends later 
to work with strings again but not autodecode). And if an 
algorithm would work with either autodecoding or without it, then 
maybe it could be special cased to accept strings as ranges, only 
forcing it in the cases where the behavior of the algorithm 
would change based on whether autodecoding were used or not.

Hmmm. I'm not sure what all of the repercussions of such an 
approach would be, but the more I think about it, the more 
tempting it seems to me.
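
A minimal sketch of what being explicit at the call site looks like with functions that already exist in Phobos (std.utf.byChar, std.utf.byDchar, std.uni.byGrapheme). The sample string is an assumption chosen so that all three levels differ.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byChar, byDchar;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";

    assert(s.byChar.walkLength == 3);      // UTF-8 code units
    assert(s.byDchar.walkLength == 2);     // code points
    assert(s.byGrapheme.walkLength == 1);  // graphemes
}
```

Each call states which unit the caller means, which is the point of forcing the choice rather than defaulting to code points.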

- Jonathan M Davis

Apr 24 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Sat, Apr 25, 2015 at 02:27:45AM +0000, Jonathan M Davis via Digitalmars-d wrote:
[...]
 I suppose that a related alternative would be to change it so that
 strings aren't considered ranges anymore (at least temporarily), and
 force folks to use stuff like byChar or byDchar (or whatever those
 functions are) whenever they use strings as ranges. And actually, that
 _would_ allow us to get rid of the autodecoding without rearranging
 modules. Later, we could change them to being ranges of their actual
 element types, or we could just force folks to be explicit forever in
 an effort to make the Unicode issues clear, if we thought that that
 were better (though it would probably be better to just change front
 and friends later to work with strings again but not autodecode). And
 if an algorithm would work with either autodecoding or without it,
 then maybe it could be special cased to accept strings as ranges, only
 forcing it in the cases where the behavior of the algorithm would
 change based on whether autodecoding were used or not.
 
 Hmmm. I'm not sure what all of the repercussions of such an approach
 would be, but the more I think about it, the more tempting it seems to
 me.

[...]

I would vote for this approach, if we ever decide to get rid of
autodecoding. I'm OK with either option -- get rid of autodecoding, or
keep it and use it consistently. What I am *not* OK with is the present,
and growing, schizophrenic mixture of autodecoding and non-autodecoding
string functions in Phobos. This inconsistency is going to come back to
bite us later.


T

-- 
One reason that few people are aware there are programs running the internet is
that they never crash in any significant way: the free software underlying the
internet is reliable to the point of invisibility. -- Glyn Moody, from the
article "Giving it all away"

Apr 27 2015

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Monday, 27 April 2015 at 17:01:03 UTC, H. S. Teoh wrote:
 On Sat, Apr 25, 2015 at 02:27:45AM +0000, Jonathan M Davis via 
 Digitalmars-d wrote:
 [...]
 I suppose that a related alternative would be to change it so that
 strings aren't considered ranges anymore (at least temporarily), and
 force folks to use stuff like byChar or byDchar (or whatever those
 functions are) whenever they use strings as ranges. And actually, that
 _would_ allow us to get rid of the autodecoding without rearranging
 modules. Later, we could change them to being ranges of their actual
 element types, or we could just force folks to be explicit forever in
 an effort to make the Unicode issues clear, if we thought that that
 were better (though it would probably be better to just change front
 and friends later to work with strings again but not autodecode). And
 if an algorithm would work with either autodecoding or without it,
 then maybe it could be special cased to accept strings as ranges, only
 forcing it in the cases where the behavior of the algorithm would
 change based on whether autodecoding were used or not.
 
 Hmmm. I'm not sure what all of the repercussions of such an 
 approach
 would be, but the more I think about it, the more tempting it 
 seems to
 me.

 [...]

 I would vote for this approach, if we ever decide to get rid of
 autodecoding. I'm OK with either option -- get rid of 
 autodecoding, or
 keep it and use it consistently. What I am *not* OK with is the 
 present,
 and growing, schizophrenic mixture of autodecoding and 
 non-autodecoding
 string functions in Phobos. This inconsistency is going to come 
 back to
 bite us later.

I expect that the two biggest problems causing the current 
situation are

1. Andrei and Walter don't seem to agree on the issue (Andrei 
seems to think that it's not a big deal to leave in the 
autodecoding).

2. While most of the core devs want to get rid of the 
autodecoding, it's a big enough change that we're afraid to do it 
and/or aren't sure of how we could do it without being too 
disruptive.

So, Walter has been pushing the schizophrenic approach in an 
effort to work around the problem. If the core devs could agree 
on an approach to removing autodecoding that wasn't too 
disruptive and somehow get Andrei to go along with it, then we 
could do that and fix the problem, but otherwise, Walter is just 
going to push for the schizophrenic approach, because it at least 
partially fixes the autodecoding problem, and enough of the core 
devs want to ditch the autodecoding that at least some of those 
changes are likely to make it in.

Honestly, I think that we need to figure out what the best 
options are for killing autodecoding and then figure out how to 
convince Andrei of it, but I haven't a clue how to convince 
Andrei unless maybe a solution which isn't very disruptive can be 
found, but it seems like every time the issue comes up, he gets 
annoyed that we're spending time on something unimportant. I do 
think that this limbo needs to stop though, and I think that it's 
clear that while autodecoding seemed like a good idea at first 
(especially if code points really were full characters instead of 
having to worry about graphemes), ultimately, autodecoding is a 
mistake.

- Jonathan M Davis

Apr 27 2015

"Chris" <wendlec tcd.ie> writes:

On Monday, 27 April 2015 at 17:49:04 UTC, Jonathan M Davis wrote:
 On Monday, 27 April 2015 at 17:01:03 UTC, H. S. Teoh wrote:
 On Sat, Apr 25, 2015 at 02:27:45AM +0000, Jonathan M Davis via 
 Digitalmars-d wrote:
 [...]
 I suppose that a related alternative would be to change it so that
 strings aren't considered ranges anymore (at least temporarily), and
 force folks to use stuff like byChar or byDchar (or whatever those
 functions are) whenever they use strings as ranges. And actually, that
 _would_ allow us to get rid of the autodecoding without rearranging
 modules. Later, we could change them to being ranges of their actual
 element types, or we could just force folks to be explicit forever in
 an effort to make the Unicode issues clear, if we thought that that
 were better (though it would probably be better to just change front
 and friends later to work with strings again but not autodecode). And
 if an algorithm would work with either autodecoding or without it,
 then maybe it could be special cased to accept strings as ranges, only
 forcing it in the cases where the behavior of the algorithm would
 change based on whether autodecoding were used or not.
 
 Hmmm. I'm not sure what all of the repercussions of such an 
 approach
 would be, but the more I think about it, the more tempting it 
 seems to
 me.

 [...]

 I would vote for this approach, if we ever decide to get rid of
 autodecoding. I'm OK with either option -- get rid of 
 autodecoding, or
 keep it and use it consistently. What I am *not* OK with is 
 the present,
 and growing, schizophrenic mixture of autodecoding and 
 non-autodecoding
 string functions in Phobos. This inconsistency is going to 
 come back to
 bite us later.

 I expect that the two biggest problems causing the current 
 situation are

 1. Andrei and Walter don't seem to agree on the issue (Andrei 
 seems to think that it's not a big deal to leave in the 
 autodecoding).

 2. While most of the core devs want to get rid of the 
 autodecoding, it's a big enough change that we're afraid to do 
 it and/or aren't sure of how we could do it without being too 
 disruptive.

 So, Walter has been pushing the schizophrenic approach in an 
 effort to work around the problem. If the core devs could agree 
 on an approach to removing autodecoding that wasn't too 
 disruptive and somehow get Andrei to go along with it, then we 
 could do that and fix the problem, but otherwise, Walter is 
 just going to push for the schizophrenic approach, because it 
 at least partially fixes the autodecoding problem, and enough 
 of the core devs want to ditch the autodecoding that at least 
 some of those changes are likely to make it in.

 Honestly, I think that we need to figure out what the best 
 options are for killing autodecoding and then figure out how to 
 convince Andrei of it, but I haven't a clue how to convince 
 Andrei unless maybe a solution which isn't very disruptive can 
 be found, but it seems like every time the issue comes up, he 
 gets annoyed that we're spending time on something unimportant. 
 I do think that this limbo needs to stop though, and I think 
 that it's clear that while autodecoding seemed like a good idea 
 at first (especially if code points really were full characters 
 instead of having to worry about graphemes), ultimately, 
 autodecoding is a mistake.

 - Jonathan M Davis

Would it be much work to have example code or even an 
experimental module that gets rid of auto-decoding, so we could 
see what would be affected in general and how actual code we have 
would be affected by it?

The topic keeps coming up again and again, and while I'm in favor 
of anything that enhances performance, I'm afraid of having to 
refactor large chunks of my code. However, this fear may be 
unfounded, but I would need some examples to visualize the 
problem.

Apr 28 2015

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, 28 April 2015 at 09:11:10 UTC, Chris wrote:
 Would it be much work to have example code or even an 
 experimental module that gets rid of auto-decoding, so we could 
 see what would be affected in general and how actual code we 
 have would be affected by it?

 The topic keeps coming up again and again, and while I'm in 
 favor of anything that enhances performance, I'm afraid of 
 having to refactor large chunks of my code. However, this fear 
 may be unfounded, but I would need some examples to visualize 
 the problem.

Honestly, most code won't care. If we just switched out all of 
the auto-decoding right now, pretty much anything using only 
ASCII would just work, and most anything that's trying to 
manipulate ASCII characters in a Unicode string will just work, 
whereas code that's specifically manipulating Unicode characters 
might have problems (e.g. comparing front with a dchar will no 
longer have the same result, since front would just be the first 
code unit rather than necessarily the first code point). Since 
most Phobos range-based functions which operate on strings are 
special-cased on strings already, many of them would continue to 
just work (e.g. find returns the same range type as what's passed 
to it even if it's given a string, so it might just work with the 
change, or it might need to be tweaked slightly), and those that 
wouldn't would generally either need to call encode on an argument 
to make it match the string type in the cases where string types 
mix (e.g. "foo".find("fo"d) would need to call encode on "fo"d to 
make it a string for comparison), or the caller would need to use 
std.utf.byDchar or std.uni.byGrapheme to operate on code points 
or graphemes rather than code units.

The two biggest places in Phobos that would potentially have 
problems are functions that special-cased strings but still used 
front and those which have to return a new range type. e.g. 
filter would be a good example, because it's forced to return a 
new range type. Right now, it would filter on dchars, but with 
the change, it would filter on the code unit type (most typically 
char). If you're filtering on ASCII characters, it wouldn't 
matter aside from the fact that the resulting range would have an 
element type of char rather than dchar, but if you're filtering 
on Unicode characters, it wouldn't work anymore. For situations 
like that, you'd be forced to use std.utf.byDchar or 
std.uni.byGrapheme. However, since most string code tends to 
operate on substrings rather than characters, I don't know how 
common it even is to use a function like filter on a string (as 
opposed to a range of strings). Such code might actually be 
fairly rare.
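
The filter case can be sketched as follows. std.utf.byCodeUnit stands in here for what front would return without autodecoding; that substitution is an assumption made for illustration, not the behavior of any released Phobos.

```d
import std.algorithm : equal, filter;
import std.range : ElementType;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "na\u00EFve";          // "naïve"
    enum dchar target = '\u00EF';     // the code point to drop

    // Today: the string is autodecoded, so the predicate sees code points.
    auto decoded = s.filter!(c => c != target);
    static assert(is(ElementType!(typeof(decoded)) == dchar));
    assert(decoded.equal("nave"));

    // Code units: in this string no single char has the value 0xEF,
    // so the same predicate silently removes nothing.
    auto units = s.byCodeUnit.filter!(c => c != target);
    assert(units.equal(s.byCodeUnit));

    // An explicit byDchar restores the code point behavior.
    auto restored = s.byDchar.filter!(c => c != target);
    assert(restored.equal("nave"));
}
```

The second case compiles and runs without complaint, which is the silent change in behavior a transition would need to guard against.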

So, there _are_ a few functions which stop working the same way 
in a potentially silent manner if we just made it so that front 
didn't autodecode anymore. However, in general, because Phobos 
almost always special-cases strings, calls to Phobos functions 
probably wouldn't need to change in most cases, and when they do, 
a call to byDchar would restore the old behavior. But of course, 
we'd want to do the transition in a way that didn't result in 
silent behavioral changes that would break code, even though in 
most cases, it wouldn't matter, because most code will be 
operating on ASCII strings even if the strings themselves contain 
Unicode - e.g. unicodeString.find(asciiString) is far more common 
than unicodeString.find(otherUnicodeString).

I suspect that the code that's at the greatest risk is code that 
checks for is(Unqual!(ElementType!Range) == dchar) to operate on 
strings and wrapper ranges around strings, since it would then 
only match the cases where byDchar had been used. In general 
though, the code that's going to run into the most trouble is 
user code that contains range-based functions similar to what you 
might find in Phobos rather than code that's simply using the 
Phobos functions like startsWith and find - i.e. if you're 
writing range-based code that worries about doing stuff like 
special-casing strings or which specifically needs to operate on 
code points, then you're going to have to make changes, whereas 
to a great extent, if all you're doing is passing strings to 
Phobos functions, your code will tend to just work.
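
As an illustration of the kind of user code at risk, here is a sketch of a constraint written against today's behavior. The name isDcharRange is hypothetical.

```d
import std.range : ElementType, isInputRange;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

// Hypothetical constraint of the kind user code might rely on.
enum isDcharRange(R) = isInputRange!R && is(Unqual!(ElementType!R) == dchar);

void main()
{
    // True today only because narrow strings are autodecoded.
    static assert(isDcharRange!string);

    // A code unit range, which is what a string would look like
    // without autodecoding, no longer matches.
    static assert(!isDcharRange!(typeof("".byCodeUnit)));

    // Only an explicit byDchar would match after such a change.
    static assert(isDcharRange!(typeof("".byDchar)));
}
```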

To actually see what the impact would be, we'd have to just 
change Phobos, I think, and then see what the impact was on user 
code. It could be surprising how much or how little it affects 
things, though in most cases, I expect that it'll mean that code 
will just work. And if we really wanted to do that, we could 
create a version flag that turned off autodecoding and version the 
changes in Phobos appropriately to see what we got. In many 
cases, if we simply made sure that Phobos functions which 
special-cased strings didn't use front directly but instead 
didn't care whether they were operating on ranges of char, wchar, 
or dchar, then we wouldn't even need to version anything (e.g. 
find could easily be made to work that way if it doesn't 
already), but some functions (like filter) would need to be 
versioned differently.

So, maybe what we need to do to start is to just go through 
Phobos and make as many functions as possible not care about 
whether they're dealing with strings as ranges of char, wchar, or 
dchar. And at least then, we'd minimize how much code would have 
to be versioned differently if we were to test out getting rid of 
autodecoding with versioning.
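
A rough sketch of the versioning idea. The version identifier NoAutodecodeSketch and the helper asRange are invented for this example and are not part of the proposal or of Phobos.

```d
import std.range : ElementType;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helper: the view of a string an algorithm would iterate over.
auto asRange(string s)
{
    version (NoAutodecodeSketch)
        return s.byCodeUnit;   // experimental build: code units
    else
        return s.byDchar;      // default build: code points, as today
}

void main()
{
    alias E = Unqual!(ElementType!(typeof(asRange(""))));

    version (NoAutodecodeSketch)
        static assert(is(E == char));
    else
        static assert(is(E == dchar));
}
```

Compiling with -version=NoAutodecodeSketch flips the element type, so code that depends on seeing dchar fails to compile or changes behavior in the experimental build only.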

- Jonathan M Davis

Apr 28 2015

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
 But of course, we'd want to do the transition in a way that 
 didn't result in silent behavioral changes that would break 
 code,

One proposal is to make char and dchar comparisons illegal (after 
all, they are comparing different things - a UTF-8 code unit 
with a code point, and even though in some cases this comparison 
makes sense, in many it doesn't). That would solve most silent 
breakages at the expense of more not-so-silent breakages.
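
A small sketch of the kind of comparison in question. It compiles with current compilers; the example values are assumptions chosen to show the mismatch.

```d
import std.utf : decode;

void main()
{
    string s = "\u00E9";          // encoded as the two code units 0xC3 0xA9
    dchar codePoint = '\u00E9';

    // Compiles today, but compares a UTF-8 lead byte with a code point.
    assert(s[0] == 0xC3);
    assert(s[0] != codePoint);

    // For ASCII the two views coincide, which is why such
    // comparisons often appear to work.
    assert("a"[0] == cast(dchar) 'a');

    // Decoding first gives a like-for-like comparison.
    size_t index = 0;
    assert(decode(s, index) == codePoint);
    assert(index == 2);
}
```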

 And if we really wanted to do that, we could create a version 
 flag that turned off autodecoding and version the changes in 
 Phobos appropriately to see what we got.

Shameless self-promotion alert: An alternative is a GitHub fork. 
You can easily install and try out D forks with Digger, it's two 
commands:

digger build master+jmdavis/phobos/noautodecode
digger install

Apr 28 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Tue, Apr 28, 2015 at 09:57:29PM +0000, Vladimir Panteleev via Digitalmars-d wrote:
 On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
But of course, we'd want to do the transition in a way that didn't
result in silent behavioral changes that would break code,

 
 One proposal is to make char and dchar comparisons illegal (after all,
 they are comparing different things - a UTF-8 code unit with a code
 point, and even though in some cases this comparison makes sense, in
 many it doesn't).  That would solve most silent breakages at the
 expense of more not-so-silent breakages.
 
And if we really wanted to do that, we could create a version flag
that turned off autodecoding and version the changes in Phobos
appropriately to see what we got.

 
 Shameless self-promotion alert: An alternative is a GitHub fork. You
 can easily install and try out D forks with Digger, it's two commands:
 
 digger build master+jmdavis/phobos/noautodecode
 digger install

Oooh, Jonathan has the code ready? Haha, maybe I'll start using that
instead of git master! ;-)


T

-- 
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash, / The day
and hour soon are coming / When all the IT folks say "Gosh!" / It isn't from a
clever lawsuit / That Windowsland will finally fall, / But thousands writing
open source code / Like mice who nibble through a wall. -- The Linux-nationale
by Greg Baker

Apr 28 2015

"Damian" <damianday hotmail.co.uk> writes:

On Tuesday, 28 April 2015 at 23:15:40 UTC, H. S. Teoh wrote:
 On Tue, Apr 28, 2015 at 09:57:29PM +0000, Vladimir Panteleev 
 via Digitalmars-d wrote:
 On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis 
 wrote:
But of course, we'd want to do the transition in a way that 
didn't
result in silent behavioral changes that would break code,

 
 One proposal is to make char and dchar comparisons illegal (after all,
 they are comparing different things - a UTF-8 code unit with a code
 point, and even though in some cases this comparison makes sense, in
 many it doesn't).  That would solve most silent breakages at the
 expense of more not-so-silent breakages.
 
And if we really wanted to do that, we could create a version flag
that turned off autodecoding and version the changes in Phobos
appropriately to see what we got.

 
 Shameless self-promotion alert: An alternative is a GitHub fork. You
 can easily install and try out D forks with Digger, it's two commands:
 
 digger build master+jmdavis/phobos/noautodecode
 digger install

 Oooh, Jonathan has the code ready? Haha, maybe I'll start using that
 instead of git master! ;-)


 T

I second that! If we all make the switch, perhaps Walter will 
too? :D

Apr 28 2015

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, 28 April 2015 at 23:26:14 UTC, Damian wrote:
 I second that! If we all make the switch, perhaps Walter will 
 too? :D

Walter isn't necessarily the one we have to convince in this 
case. He'll be very concerned about avoiding breaking existing 
code, so we'd need a solid transition plan, but he very much 
wants to get rid of autodecoding, so he'll welcome it if we can 
do it cleanly. The bigger problem is convincing Andrei, since he 
seems to think that even discussing the issue is a waste of time 
and takes away from more important topics. And I don't dispute 
that there are other important topics, and coming back to this 
one over and over again is arguably a problem, but if we can just 
figure out how to make the transition and get it over with, then 
it wouldn't need to keep getting discussed like this.

- Jonathan M Davis

Apr 28 2015

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, 28 April 2015 at 21:57:31 UTC, Vladimir Panteleev 
wrote:
 On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis 
 wrote:
 But of course, we'd want to do the transition in a way that 
 didn't result in silent behavioral changes that would break 
 code,

 One proposal is to make char and dchar comparisons illegal 
 (after all, they are comparing different things - a UTF-8 code 
 unit with a code point, and even though in some cases this 
 comparison makes sense, in many it doesn't). That would solve 
 most silent breakages at the expense of more not-so-silent 
 breakages.

It would, but it doesn't necessarily play nicely with the 
promotion rules, and since the character types tend to be treated 
as integral types, I suspect that it would be problematic in a 
number of cases. I also suspect that it's not something that 
Walter would go for given his typical attitude about conversions 
(though I don't know). It's definitely an interesting thought, 
but I doubt that it would fly.
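
To make the concern concrete, here is a minimal sketch of the kind of comparison being discussed. It only illustrates how D compares a char with a dchar today; the variable names are made up for the example.

```d
import std.stdio : writeln;

void main()
{
    string s = "\u00E9";     // U+00E9, stored as two UTF-8 code units: 0xC3 0xA9
    char unit = s[0];        // the first code unit
    dchar point = '\u00E9';  // the code point itself

    // This compiles: both operands are treated as integers and compared by value.
    writeln(unit == point);                               // false
    writeln(cast(uint) unit, " vs ", cast(uint) point);   // 195 vs 233

    // For ASCII, the code unit and the code point have the same value,
    // so the comparison happens to be meaningful.
    writeln("a"[0] == cast(dchar) 'a');                   // true
}
```

Making such comparisons illegal would turn the first case into a compile error, but it would also reject the last one, which is the kind of trade-off discussed above.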

 And if we really wanted to do that, we could create a version 
 flag that turned off autodecoding and version the changes in 
 Phobos appropriately to see what we got.

 Shameless self-promotion alert: An alternative is a GitHub 
 fork. You can easily install and try out D forks with Digger, 
 it's two commands:

 digger build master+jmdavis/phobos/noautodecode
 digger install

Well, that may very well be what needs to happen as an 
experiment, but if we want to actually transition to not having 
autodecoding, we need a transition plan in master itself rather 
than a fork, and a temporary version would be one way to do that.
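
As a rough sketch of what a temporary version could look like, the following uses a version identifier to select what a front-like function returns. The identifier NoAutodecode and the function firstElement are hypothetical names chosen for this example; nothing here is taken from Phobos or from the fork mentioned above.

```d
import std.stdio : writeln;
import std.utf : decode;

// Hypothetical identifier, enabled with: dmd -version=NoAutodecode
version (NoAutodecode)
{
    // Return the first code unit without decoding.
    char firstElement(string s)
    {
        return s[0];
    }
}
else
{
    // Decode and return the first code point, as autodecoding does.
    dchar firstElement(string s)
    {
        size_t index = 0;
        return decode(s, index);
    }
}

void main()
{
    // Prints the element type selected at compile time.
    writeln(typeof(firstElement("")).stringof);
    writeln(cast(uint) firstElement("\u00E9"));
}
```

Built without the flag, the element type is dchar and the value is 233; built with it, the element type is char and the value is the first code unit, 195.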

After thinking about the situation some over the past few days 
though, I think that what we need to do to begin with is to make 
it so that as many functions in Phobos as possible don't care 
whether they're dealing with ranges of char or dchar so that 
they'll work regardless of what front does on strings (either by 
simply not using front on strings - or by making it so that the 
code will work whether front returns char or dchar). And that will 
reduce the number of changes that will have to be done in Phobos 
via versioning or deprecation or whatever we'd have to do to 
actually remove autodecoding. I suspect that it would mean that 
very little would have to be versioned or deprecated if/when we 
make the switch.
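
A minimal sketch of a function that does not care what front returns might look like the following. The helper countAscii is hypothetical and written only for this illustration. It relies on the fact that in UTF-8 an ASCII byte never occurs inside a multi-byte sequence, so matching an ASCII character gives the same answer on code units and on code points.

```d
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;
import std.stdio : writeln;
import std.traits : isSomeChar;
import std.utf : byCodeUnit;

/// Hypothetical helper: counts occurrences of an ASCII character in any
/// range of characters, whether its elements are char, wchar, or dchar.
size_t countAscii(R)(R r, char needle)
if (isInputRange!R && isSomeChar!(ElementType!R))
{
    assert(needle < 0x80, "needle must be ASCII");
    size_t count = 0;
    for (; !r.empty; r.popFront())
    {
        if (r.front == needle)
            ++count;
    }
    return count;
}

void main()
{
    string s = "na\u00EFve, caf\u00E9, a";

    writeln(countAscii(s, 'a'));            // front is an autodecoded dchar
    writeln(countAscii(s.byCodeUnit, 'a')); // front is a char code unit
}
```

Both calls print 3, so a function written this way would keep working regardless of whether front autodecodes.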

The bigger problem, though, is probably third-party range-based 
functions that use front with strings or check the element type, 
rather than Phobos itself or code that simply uses Phobos. Much of 
the latter would just work even if we outright switched front from 
autodecoding to non-autodecoding, and most of what wouldn't can be 
made to work by making it so that those functions don't care whether 
they're dealing with autodecoded strings or not.
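
As an illustration of the kind of check that third-party code might contain, here is a minimal sketch. The template yieldsDchar is a hypothetical name; the point is only to show which ranges currently satisfy an element-type test for dchar.

```d
import std.range.primitives : ElementType;
import std.stdio : writeln;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

// Hypothetical check of the sort a range-based function might use
// to decide that it is looking at a string-like range.
enum yieldsDchar(R) = is(Unqual!(ElementType!R) == dchar);

// With autodecoding, a plain string is a range of dchar.
static assert( yieldsDchar!string);
// A range of code units is not, so the check would stop matching it.
static assert(!yieldsDchar!(typeof("".byCodeUnit)));
// An explicit byDchar still matches.
static assert( yieldsDchar!(typeof("".byDchar)));

void main()
{
    writeln(yieldsDchar!string); // true
}
```

If front on strings stopped decoding, the first assertion is the one that would change, which is why code relying on such a check would need attention.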

- Jonathan M Davis

Apr 28 2015

"Chris" <wendlec tcd.ie> writes:

On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
 On Tuesday, 28 April 2015 at 09:11:10 UTC, Chris wrote:
 Would it be much work to show example code or even an 
 experimental module that gets rid of auto-decoding, so we 
 could see what would be affected in general and how actual 
 code we have would be affected by it?

 The topic keeps coming up again and again, and while I'm in 
 favor of anything that enhances performance, I'm afraid of 
 having to refactor large chunks of my code. However, this fear 
 may be unfounded, but I would need some examples to visualize 
 the problem.

 Honestly, most code won't care. If we just switched out all of 
 the auto-decoding right now, pretty much anything using only 
 ASCII would just work, and most anything that's trying to 
 manipulate ASCII characters in a Unicode string will just work, 
 whereas code that's specifically manipulating Unicode 
 characters might have problems (e.g. comparing front with a 
 dchar will no longer have the same result, since front would 
 just be the first code unit rather than necessarily the first 
 code point). Since most Phobos range-based functions which 
 operate on strings are special-cased on strings already, many 
 of them would continue to just work (e.g. find returns the same 
 range type as what's passed to it even if it's given a string, 
 so it might just work with the change, or it might need to be 
 tweaked slightly), and those that would then generally either 
 need to call encode on an argument to make it match the string 
 type in the cases string types mix (e.g. "foo".find("fo"d) 
 would need to call encode on "fo"d to make it a string for 
 comparison), or the caller would need to use std.utf.byDchar or 
 std.uni.byGrapheme to operate on code points or graphemes 
 rather than code units.

 The two biggest places in Phobos that would potentially have 
 problems are functions that special-cased strings but still 
 used front and those which have to return a new range type. 
 e.g. filter would be a good example, because it's forced to 
 return a new range type. Right now, it would filter on dchars, 
 but with the change, it would filter on the code unit type 
 (most typically char). If you're filtering on ASCII characters, 
 it wouldn't matter aside from the fact that the resulting range 
 would have an element type of char rather than dchar, but if 
 you're filtering on Unicode characters, it wouldn't work 
 anymore. For situations like that, you'd be forced to use 
 std.utf.byDchar or std.uni.byGrapheme. However, since most 
 string code tends to operate on substrings rather than 
 characters, I don't know how common it even is to use a 
 function like filter on a string (as opposed to a range of 
 strings). Such code might actually be fairly rare.

 So, there _are_ a few functions which stop working the same way 
 in a potentially silent manner if we just made it so that front 
 didn't autodecode anymore. However, in general, because Phobos 
 almost always special-cases strings, calls to Phobos functions 
 probably wouldn't need to change in most cases, and when they 
 do, a call to byDchar would restore the old behavior. But of 
 course, we'd want to do the transition in a way that didn't 
 result in silent behavioral changes that would break code, even 
 though in most cases, it wouldn't matter, because most code 
 will be operating on ASCII strings even if the strings 
 themselves contain Unicode - e.g. 
 unicodeString.find(asciiString) is far more common than 
 unicodeString.find(otherUnicodeString).

 I suspect that the code that's at the greatest risk is code 
 that checks for is(Unqual!(ElementType!Range) == dchar) to 
 operate on strings and wrapper ranges around strings, since it 
 would then only match the cases where byDchar had been used. In 
 general though, the code that's going to run into the most 
 trouble is user code that contains range-based functions 
 similar to what you might find in Phobos rather than code 
 that's simply using the Phobos functions like startsWith and 
 find - i.e. if you're writing range-based code that worries 
 about doing stuff like special-casing strings or which 
 specifically needs to operate on code points, then you're going 
 to have to make changes, whereas to a great extent, if all 
 you're doing is passing strings to Phobos functions, your code 
 will tend to just work.

 To actually see what the impact would be, we'd have to just 
 change Phobos, I think, and then see what the impact was on 
 user code. It could be surprising how much or how little it 
 affects things, though in most cases, I expect that it'll mean 
 that code will just work. And if we really wanted to do that, 
 we could create a version flag that turned off autodecoding and 
 version the changes in Phobos appropriately to see what we got. 
 In many cases, if we simply made sure that Phobos functions 
 which special-cased strings didn't use front directly but 
 instead didn't care whether they were operating on ranges of 
 char, wchar, or dchar, then we wouldn't even need to version 
 anything (e.g. find could easily be made to work that way if it 
 doesn't already), but some functions (like filter) would need 
 to be versioned differently.

 So, maybe what we need to do to start is to just go through 
 Phobos and make as many functions as possible not care about 
 whether they're dealing with strings as ranges of char, wchar, 
 or dchar. And at least then, we'd minimize how much code would 
 have to be versioned differently if we were to test out getting 
 rid of autodecoding with versioning.

 - Jonathan M Davis

This sounds like a good starting point for a transition plan. One 
important thing, though, would be to do some benchmarking with 
and without autodecoding, to see if it really boosts performance 
in a way that would justify the transition.

Apr 29 2015

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
 This sounds like a good starting point for a transition plan. 
 One important thing, though, would be to do some benchmarking 
 with and without autodecoding, to see if it really boosts 
 performance in a way that would justify the transition.

Well, personally, I think that it's worth it even if the 
performance is identical (and it's a guarantee that it's going to 
be better without autodecoding - it's just a question of how much 
better - since it's going to have less work to do without 
autodecoding). Simply operating at the code point level like we 
do now is the worst of all worlds in terms of flexibility and 
correctness. As long as the Unicode is normalized, operating at 
the code unit level is the most efficient, and decoding is often 
unnecessary for correctness, and if you need to decode, then you 
really need to go up to the grapheme level in order to be 
operating on the full character, meaning that operating on code 
points really has the same problems as operating on code units as 
far as correctness goes. So, it's less performant without 
actually being correct. It just gives the illusion of correctness.
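
The three levels mentioned above can be seen side by side in a short sketch. It assumes a string containing a combining accent, which is one case where a single visible character spans more than one code point.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // a single user-perceived character.
    string s = "e\u0301";

    writeln(s.byCodeUnit.walkLength); // 3 code units (1 for 'e', 2 for U+0301)
    writeln(s.byDchar.walkLength);    // 2 code points
    writeln(s.byGrapheme.walkLength); // 1 grapheme
}
```

Iterating by code point yields 2 here, which is neither the storage length nor the number of characters a reader would see.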

By treating strings as ranges of code units, you don't take a 
performance hit when you don't need to, and it forces you to 
actually consider something like byDchar or byGrapheme if you 
want to operate on full, Unicode characters. It's similar to how 
operating on UTF-16 code units as if they were characters (as 

impression that you're handling Unicode correctly, because you 
have to work harder at coming up with characters that can't fit 
in a single code unit, whereas with UTF-8, anything but ASCII is 
screwed if you treat code units as code points. Treating code 
points as if they were full characters like we're doing now in 
Phobos with ranges just makes it that much harder to notice that 
you're not handling Unicode correctly.

Also, treating strings as ranges of code units makes it so that 
they're not so special and actually are treated like every other 
type of array, which eliminates a lot of the special casing that 
we're forced to do right now, and it eliminates all of the 
confusion that folks keep running into when string doesn't work 
with many functions, because it's not a random-access range or 
doesn't have length, or because the resulting range isn't the 
same type (copy would be a prime example of a function that 
doesn't work with char[] when it should). By leaving in 
autodecoding, we're basically leaving in technical debt in D 
permanently. We'll forever have to be explaining it to folks and 
forever have to be working around it in order to achieve either 
performance or correctness.
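
The special status of strings shows up directly in the range traits. The following sketch compares a string with an ordinary array and with a code-unit view of the same string.

```d
import std.range.primitives : hasLength, isRandomAccessRange;
import std.utf : byCodeUnit;

// As a range, a string has neither length nor random access,
// even though the underlying array has both.
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);

// Any other array type is a random-access range with length.
static assert(hasLength!(int[]));
static assert(isRandomAccessRange!(int[]));

// Viewing the string as code units restores both properties.
static assert(hasLength!(typeof("".byCodeUnit)));
static assert(isRandomAccessRange!(typeof("".byCodeUnit)));

void main()
{
    string s = "abc";
    assert(s.length == 3); // the array itself still has a length
    assert(s[1] == 'b');   // and supports indexing
}
```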

What we have now isn't performant, correct, or flexible, and 
we'll be forever paying for that if we don't get rid of 
autodecoding.

I don't criticize Andrei in the least for coming up with it, 
since if you don't take graphemes into account (and he didn't 
know about them at the time), it seems like a great idea and 
allows us to be correct by default and performant if we put some 
effort into it, but after having seen how it's worked out, how much 
code has to be special-cased, how much confusion there is over 
it, and how it's not actually correct anyway, I think that it's 
quite clear that autodecoding was a mistake. And at this point, 
it's mainly a question of how we can get rid of it without being 
too disruptive and whether we can convince Andrei that it makes 
sense to make the change, since he seems to still think that 
autodecoding is fine in spite of the fact that it's neither 
performant nor correct.

It may be that the decision will be that it's too disruptive to 
remove autodecoding, but I think that that's really a question of 
whether we can find a way to do it that doesn't break tons of 
code rather than whether it's worth the performance or 
correctness gain.

- Jonathan M Davis

Apr 29 2015

"Chris" <wendlec tcd.ie> writes:

On Wednesday, 29 April 2015 at 15:13:15 UTC, Jonathan M Davis 
wrote:
 On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
 This sounds like a good starting point for a transition plan. 
 One important thing, though, would be to do some benchmarking 
 with and without autodecoding, to see if it really boosts 
 performance in a way that would justify the transition.

 Well, personally, I think that it's worth it even if the 
 performance is identical (and it's a guarantee that it's going 
 to be better without autodecoding - it's just a question of how 
 much better - since it's going to have less work to do without 
 autodecoding). Simply operating at the code point level like we 
 do now is the worst of all worlds in terms of flexibility and 
 correctness. As long as the Unicode is normalized, operating at 
 the code unit level is the most efficient, and decoding is 
 often unnecessary for correctness, and if you need to decode, 
 then you really need to go up to the grapheme level in order to 
 be operating on the full character, meaning that operating on 
 code points really has the same problems as operating on code 
 units as far as correctness goes. So, it's less performant 
 without actually being correct. It just gives the illusion of 
 correctness.

 By treating strings as ranges of code units, you don't take a 
 performance hit when you don't need to, and it forces you to 
 actually consider something like byDchar or byGrapheme if you 
 want to operate on full, Unicode characters. It's similar to 
 how operating on UTF-16 code units as if they were characters 

 impression that you're handling Unicode correctly, because you 
 have to work harder at coming up with characters that can't fit 
 in a single code unit, whereas with UTF-8, anything but ASCII 
 is screwed if you treat code units as code points. Treating 
 code points as if they were full characters like we're doing 
 now in Phobos with ranges just makes it that much harder to 
 notice that you're not handling Unicode correctly.

 Also, treating strings as ranges of code units makes it so that 
 they're not so special and actually are treated like every 
 other type of array, which eliminates a lot of the special 
 casing that we're forced to do right now, and it eliminates all 
 of the confusion that folks keep running into when string 
 doesn't work with many functions, because it's not a 
 random-access range or doesn't have length, or because the 
 resulting range isn't the same type (copy would be a prime 
 example of a function that doesn't work with char[] when it 
 should). By leaving in autodecoding, we're basically leaving in 
 technical debt in D permanently. We'll forever have to be 
 explaining it to folks and forever have to be working around it 
 in order to achieve either performance or correctness.

 What we have now isn't performant, correct, or flexible, and 
 we'll be forever paying for that if we don't get rid of 
 autodecoding.

 I don't criticize Andrei in the least for coming up with it, 
 since if you don't take graphemes into account (and he didn't 
 know about them at the time), it seems like a great idea and 
 allows us to be correct by default and performant if we put 
 some effort into it, but after having seen how it's worked out, 
 how much code has to be special-cased, how much confusion there 
 is over it, and how it's not actually correct anyway, I think 
 that it's quite clear that autodecoding was a mistake. And at 
 this point, it's mainly a question of how we can get rid of it 
 without being too disruptive and whether we can convince Andrei 
 that it makes sense to make the change, since he seems to still 
 think that autodecoding is fine in spite of the fact that it's 
 neither performant nor correct.

 It may be that the decision will be that it's too disruptive to 
 remove autodecoding, but I think that that's really a question 
 of whether we can find a way to do it that doesn't break tons 
 of code rather than whether it's worth the performance or 
 correctness gain.

 - Jonathan M Davis

Ok, I see. Well, if we don't want to repeat C++'s mistakes, we 
should fix it before it's too late. Since I'm dealing a lot with 
strings (non ASCII) and depend on Unicode (and correctness!), I 
would be more than happy to test any changes to Phobos with my 
programs to see if it screws up anything.

Apr 29 2015

ketmar <ketmar ketmar.no-ip.org> writes:

On Fri, 24 Apr 2015 13:44:43 -0700, Walter Bright wrote:

 I'm afraid we are stuck with autodecoding, as taking it out may be far
 too disruptive.

the more time passing the harder autodecode to kill. kill it while it's
not too late. make the next DMD release 2.100 and KILL AUTODECODE for
good.

Apr 25 2015

"Kagamin" <spam here.lot> writes:

On Friday, 24 April 2015 at 20:44:34 UTC, Walter Bright wrote:
 Time has shown, however, that UTF8 has pretty much won. wchar 
 only exists for Windows API and Java

Also NSString. It used to support UTF-16 and C encoding. AFAIK, 
the latter later evolved into UTF-8.

Apr 30 2015
