digitalmars.D - Range of chars (narrow string ranges)

reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
Just want to make this a bit more visible.
https://github.com/D-Programming-Language/phobos/pull/3206#issuecomment-95681812

We just added entabber to Phobos (std.string), and AFAIK, it's the first range
algorithm that transforms narrow strings to a range of chars, instead of
decoding the original string and returning a range of dchars.

Most of Phobos can't handle such ranges the way it handles strings, so you'd
have to decode them using byDchar to work with them.
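
To make the distinction concrete, here is a minimal sketch (not taken from the pull request) of how a range of chars differs from a range of dchars on the caller's side. It assumes a Phobos release in which autodecoding is still in place and std.utf.byCodeUnit is available; the sample string is arbitrary.

```d
import std.range : walkLength;
import std.range.primitives : ElementType, front;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "héllo"; // 'é' takes two UTF-8 code units

    // With autodecoding, the range primitives for a narrow string yield dchar.
    static assert(is(ElementType!string == dchar));
    assert(s.front == 'h');
    assert(s.length == 6);     // array length counts code units
    assert(s.walkLength == 5); // range iteration counts code points

    // byCodeUnit presents the same data as a range of char, with no decoding.
    auto units = s.byCodeUnit;
    static assert(is(Unqual!(ElementType!(typeof(units))) == char));
    assert(units.length == 6);

    // byDchar decodes explicitly, on request.
    auto points = units.byDchar;
    static assert(is(Unqual!(ElementType!(typeof(points))) == dchar));
    assert(points.walkLength == 5);
}
```

An algorithm that returns something like `units` instead of `points` is what is meant here by a range of chars.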
Apr 24 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Apr 24, 2015 at 08:39:36PM +0200, Martin Nowak via Digitalmars-d wrote:
 Just want to make this a bit more visible.
 https://github.com/D-Programming-Language/phobos/pull/3206#issuecomment-95681812
 
 We just added entabber to Phobos (std.string), and AFAIK, it's the first range
 algorithm that transforms narrow strings to a range of chars, instead
 of decoding the original string and returning a range of dchars.
 
 Most of phobos can't handle such ranges like strings and you'd have to
 decode them using byDchar to work with them.
I really wish we would just *make the darn decision* already, whether to kill off autodecoding or not, and MAKE IT CONSISTENT ACROSS PHOBOS, instead of introducing this schizophrenic dichotomy where some functions give you a range of dchar while others give you a range of char/wchar, and the two don't work well together. This is totally going to make a laughing stock of D one day.

T

-- 
Guns don't kill people. Bullets do.
Apr 24 2015
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/24/2015 11:52 AM, H. S. Teoh via Digitalmars-d wrote:
 I really wish we would just *make the darn decision* already, whether to
 kill off autodecoding or not, and MAKE IT CONSISTENT ACROSS PHOBOS,
 instead of introducing this schizophrenic dichotomy where some functions
 give you a range of dchar while others give you a range of char/wchar,
 and the two don't work well together. This is totally going to make a
 laughing stock of D one day.
Some facts:

1. When I started D, there was a lot of speculation about whether the world would settle on UTF8, UTF16, or UTF32. So D supports natively all three. Time has shown, however, that UTF8 has pretty much won. wchar only exists for Windows API and Java, dchar strings pretty much don't exist in the wild.

2. dchar is very useful as a character type, but not as a string type.

3. Pretty much none of the algorithms in Phobos work when presented with a range of chars or wchars. This is not even documented.

4. Autodecoding is inefficient, especially considering that few algorithms actually need decoding. Re-encoding the result back to UTF8 is another inefficiency.

I'm afraid we are stuck with autodecoding, as taking it out may be far too disruptive.

But all is not lost. The Phobos algorithms can all be fixed to not care about autodecoding. The changes I've made to std.string all reflect that.

https://github.com/D-Programming-Language/phobos/pulls/WalterBright
Apr 24 2015
next sibling parent Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 04/24/2015 10:44 PM, Walter Bright wrote:
 4. Autodecoding is inefficient, especially considering that few
 algorithms actually need decoding. Re-encoding the result back to UTF8
 is another inefficiency.
 
 I'm afraid we are stuck with autodecoding, as taking it out may be far
 too disruptive.
 
 But all is not lost. The Phobos algorithms can all be fixed to not care
 about autodecoding. The changes I've made to std.string all reflect that.
It probably won't be too disruptive to optimize algorithms such as filter to return a range of chars, but only if we support such ranges as narrow strings everywhere.
Apr 24 2015
prev sibling next sibling parent reply "Brad Anderson" <eco gnuk.net> writes:
On Friday, 24 April 2015 at 20:44:34 UTC, Walter Bright wrote:
 [snip]
 I'm afraid we are stuck with autodecoding, as taking it out may 
 be far too disruptive.
No!
 But all is not lost. The Phobos algorithms can all be fixed to 
 not care about autodecoding. The changes I've made to 
 std.string all reflect that.
Yay!

I haven't really followed the autodecoding conversations. The problem is that front on char ranges decode, right? Is there a quick way to tell which functions are autodecoding so we can have a list of candidates for replacement? It'd be good for hackweek.

I'm reminded of this conversation http://forum.dlang.org/post/xgnurdjcqiyatpvnwznd forum.dlang.org which contains a partial list of candidates. Following your lead with implementing these lazy versions (without autodecoding) would be good hackweek projects.

Finally, there is this http://goo.gl/Wmotu4 list from http://forum.dlang.org/post/lvmydbvjivsvmwtimobs forum.dlang.org that has some good candidates for hackweek, I think.

Are we collecting hackweek ideas anywhere?
Apr 24 2015
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/24/2015 3:29 PM, Brad Anderson wrote:
 I haven't really followed the autodecoding conversations. The problem is that
 front on char ranges decode, right?
Nope. Only front on narrow string arrays. Ranges aren't autodecoded.
 Is there a quick way to tell which functions
 are auto decoding so we can have a list of candidates for replacement? It'd be
 good for hackweek.
If they accept ranges, and don't special case narrow strings, then they autodecode.
 I'm reminded of this conversation
 http://forum.dlang.org/post/xgnurdjcqiyatpvnwznd forum.dlang.org
 which contains a partial list of candidates.
PRs exist for most of these now.
 Following your lead with
 implementing these lazy versions (without autodecoding) would be good hackweek
 projects.
Yup.
 Finally, there is this http://goo.gl/Wmotu4 list from
 http://forum.dlang.org/post/lvmydbvjivsvmwtimobs forum.dlang.org that has some
 good candidates for hackweek I think.
Yes, we should have an answer for each of the Boost string algorithms.
 Are we collecting hackweek ideas anywhere?
Andrei?
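
As a rough illustration of the rule of thumb above ("if they accept ranges, and don't special case narrow strings, then they autodecode"), the sketch below shows a generic function with such a special case. `countAscii` is a hypothetical helper invented for this example, not a Phobos function; it assumes std.utf.byCodeUnit is available and that the needle is ASCII, so that comparing individual code units is safe.

```d
import std.range.primitives : empty, front, isInputRange, popFront;
import std.traits : isNarrowString;
import std.utf : byCodeUnit;

/// Hypothetical helper: counts the elements equal to an ASCII `needle`.
size_t countAscii(R)(R r, char needle)
if (isInputRange!R)
{
    assert(needle < 0x80, "needle must be ASCII");

    static if (isNarrowString!R)
    {
        // Special case: walk the code units directly, so front never decodes.
        return countAscii(r.byCodeUnit, needle);
    }
    else
    {
        // Generic path: uses whatever element type front yields.
        size_t n;
        for (; !r.empty; r.popFront())
        {
            if (r.front == needle)
                ++n;
        }
        return n;
    }
}

void main()
{
    assert(countAscii("a,b,ñ,c", ',') == 3);  // narrow string, no decoding
    assert(countAscii("a,b,ñ,c"d, ',') == 3); // dstring, generic path
}
```

Without the `static if` branch, passing a `string` would go through the generic loop and decode every element to dchar.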
Apr 24 2015
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Friday, 24 April 2015 at 20:44:34 UTC, Walter Bright wrote:
 On 4/24/2015 11:52 AM, H. S. Teoh via Digitalmars-d wrote:
 I really wish we would just *make the darn decision* already, 
 whether to
 kill off autodecoding or not, and MAKE IT CONSISTENT ACROSS 
 PHOBOS,
 instead of introducing this schizophrenic dichotomy where some 
 functions
 give you a range of dchar while others give you a range of 
 char/wchar,
 and the two don't work well together. This is totally going to 
 make a
 laughing stock of D one day.
Some facts: 1. When I started D, there was a lot of speculation about whether the world would settle on UTF8, UTF16, or UTF32. So D supports natively all three. Time has shown, however, that UTF8 has pretty much won. wchar only exists for Windows API and Java, dchar strings pretty much don't exist in the wild. 2. dchar is very useful as a character type, but not as a string type. 3. Pretty much none of the algorithms in Phobos work when presented with a range of chars or wchars. This is not even documented. 4. Autodecoding is inefficient, especially considering that few algorithms actually need decoding. Re-encoding the result back to UTF8 is another inefficiency. I'm afraid we are stuck with autodecoding, as taking it out may be far too disruptive. But all is not lost. The Phobos algorithms can all be fixed to not care about autodecoding. The changes I've made to std.string all reflect that. https://github.com/D-Programming-Language/phobos/pulls/WalterBright
I really think that leaving things with autodecoding in some cases and not in others is just asking for trouble. Even if we manage to figure out how to fix it so that Phobos doesn't autodecode in any of its algorithms without breaking any user code in the process, that then leaves user code with the problem, and since Phobos _wouldn't_ have the problem, it then would be all the more confusing.

It _is_ possible to get rid of it entirely without breaking code if we move the array range primitives to a new module and later deprecate the old ones, though that would probably mean breaking up std.array into submodules and deprecating _all_ of it in favor of its submodules, since anyone importing std.array would then have the old array range primitives rather than the new ones - or both, causing conflicts. And it's made worse by the fact that std.range publicly imports std.array. So, yes, it _is_ ugly. But it _can_ be done.

If we leave autodecoding in and just work around it everywhere in Phobos, it's just going to forever screw with user code and confuse users. They get confused enough by it as it is, and at least now, they're running into it in Phobos where we can explain it, whereas if they don't see it with Phobos and only with their own code, then they're going to think that they're doing something wrong and potentially get very frustrated.

I definitely share the concern that removing autodecoding outright will be too disruptive, but at the same time, I don't know if we can afford to go halfway with it.
Apr 24 2015
prev sibling next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 4/24/15 4:44 PM, Walter Bright wrote:

 I'm afraid we are stuck with autodecoding, as taking it out may be far
 too disruptive.
This is pretty easy. We just have to create a string type that is backed by, but isn't simply an alias to, an array of char. -Steve
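
One possible reading of this idea, purely as an illustrative sketch: the `String` type and its `byCodePoint` member below are hypothetical names, not an existing Phobos type and not necessarily the design intended in this post. Because the type is a struct rather than an alias to `immutable(char)[]`, the array range primitives (and their autodecoding) do not apply to it.

```d
import std.range : walkLength;
import std.range.primitives : ElementType, isInputRange;
import std.utf : byDchar;

/// Hypothetical wrapper; neither the name nor the API exists in Phobos.
struct String
{
    private immutable(char)[] data;

    this(string s) { data = s; }

    // Range primitives over UTF-8 code units; nothing is decoded here.
    @property bool empty() const { return data.length == 0; }
    @property char front() const { return data[0]; }
    void popFront() { data = data[1 .. $]; }
    @property size_t length() const { return data.length; }

    /// Decoding becomes an explicit request by the caller.
    auto byCodePoint() { return data.byDchar; }
}

static assert(isInputRange!String);
static assert(is(ElementType!String == char));

void main()
{
    auto s = String("héllo");
    assert(s.length == 6);                 // code units
    assert(s.byCodePoint.walkLength == 5); // code points
}
```

The cost, as the replies suggest, is that every existing API taking `string` would need to learn about such a type.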
Apr 24 2015
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/24/2015 4:56 PM, Steven Schveighoffer wrote:
 This is pretty easy. We just have to create a string type that is backed by,
 but isn't simply an alias to, an array of char.
Just shoot me now!
Apr 24 2015
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 4/24/15 9:02 PM, Walter Bright wrote:
 On 4/24/2015 4:56 PM, Steven Schveighoffer wrote:
 This is pretty easy. We just have to create a string type that is
 backed by, but
 isn't simply an alias to, an array of char.
Just shoot me now!
Yeah, that's the reaction I figured I'd get ;) But it doesn't hurt to keep trying since we keep coming back to this over, and over, and over, and over... -Steve
Apr 24 2015
parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Saturday, 25 April 2015 at 02:04:02 UTC, Steven Schveighoffer 
wrote:
 On 4/24/15 9:02 PM, Walter Bright wrote:
 On 4/24/2015 4:56 PM, Steven Schveighoffer wrote:
 This is pretty easy. We just have to create a string type 
 that is
 backed by, but
 isn't simply an alias to, an array of char.
Just shoot me now!
Yeah, that's the reaction I figured I'd get ;) But it doesn't hurt to keep trying since we keep coming back to this over, and over, and over, and over...
Honestly, even if that were the ideal way to go (and I don't think that it is), I'd expect that to be even more disruptive than trying to rearrange the modules so that front and friends don't autodecode for strings.

I suppose that a related alternative would be to change it so that strings aren't considered ranges anymore (at least temporarily), and force folks to use stuff like byChar or byDChar (or whatever those functions are) whenever they use strings as ranges. And actually, that _would_ allow us to get rid of the autodecoding without rearranging modules. Later, we could change them to being ranges of their actual element types, or we could just force folks to be explicit forever in an effort to make the Unicode issues clear, if we thought that that were better (though it would probably be better to just change front and friends later to work with strings again but not autodecode). And if an algorithm would work either with autodecoding or without it, then maybe it could be special-cased to accept strings as ranges, only forcing the explicit choice in the cases where the behavior of the algorithm would change based on whether autodecoding were used or not.

Hmmm. I'm not sure what all of the repercussions of such an approach would be, but the more I think about it, the more tempting it seems to me.

- Jonathan M Davis
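
For readers unfamiliar with the functions mentioned, a small sketch of what "being explicit" looks like with the helpers that exist in current Phobos (std.utf.byChar, std.utf.byDchar, and std.uni.byGrapheme). The sample string is an arbitrary choice made for this illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byChar, byDchar;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";

    assert(s.byChar.walkLength == 3);     // UTF-8 code units
    assert(s.byDchar.walkLength == 2);    // code points
    assert(s.byGrapheme.walkLength == 1); // graphemes
}
```

Each call names the level at which the string is iterated, so the result does not depend on what front does for plain strings.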
Apr 24 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, Apr 25, 2015 at 02:27:45AM +0000, Jonathan M Davis via Digitalmars-d
wrote:
[...]
 I suppose that a related alternative would be to change it so that
 strings aren't considered ranges anymore (at least temporarily), and
 force folks to use stuff like byChar or byDChar (or whatever those
 functions are) whenever they use strings as ranges. And actually, that
 _would_ allow us to get rid of the autodecoding without rearranging
 modules. Later, we could change them to being ranges of their actual
 element types, or we could just force folks to be explicit forever in
 an effort to make the Unicode issues clear, if we thought that that
 were better (though it would probably be better to just change front and
 friends later to work with strings again but not autodecode). And if
 an algorithm would work with either autodecoding or without it, then
 maybe it could be special cased to accept strings as ranges, only
 forcing it in the cases where the behavior of the algorithm would
 change based on whether autodecoding were used or not.
 
 Hmmm. I'm not sure what all of the repercussions of such an approach
 would be, but the more I think about it, the more tempting it seems to
 me.
[...]

I would vote for this approach, if we ever decide to get rid of autodecoding.

I'm OK with either option -- get rid of autodecoding, or keep it and use it consistently. What I am *not* OK with is the present, and growing, schizophrenic mixture of autodecoding and non-autodecoding string functions in Phobos. This inconsistency is going to come back to bite us later.

T

-- 
One reason that few people are aware there are programs running the internet is that they never crash in any significant way: the free software underlying the internet is reliable to the point of invisibility. -- Glyn Moody, from the article "Giving it all away"
Apr 27 2015
parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Monday, 27 April 2015 at 17:01:03 UTC, H. S. Teoh wrote:
 On Sat, Apr 25, 2015 at 02:27:45AM +0000, Jonathan M Davis via 
 Digitalmars-d wrote:
 [...]
 I suppose that a related alternative would be to change it so 
 that
 strings aren't considered ranges anymore (at least 
 temporarily), and
 force folks to use stuff like byChar or byDChar (or whatever 
 those
 functions are) whenever they use strings as ranges. And 
 actually, that
 _would_ allow us to get rid of the autodecoding without 
 rearranging
 modules. Later, we could change them to being ranges of their 
 actual
 element types, or we could just force folks to be explicit 
 forever in
 an effort to make the Unicode issues clear, if we thought that 
 that
 were better (though it would probably better to just change 
 front and
 friends later to work with strings again but not autodecode). 
 And if
 an algorithm would work with either autodecoding or without 
 it, then
 maybe it could be special cased to accept strings as ranges, 
 only
 forcing it in the cases where it the behavior of the algorithm 
 would
 change based on whether autodecoding were used or not.
 
 Hmmm. I'm not sure what all of the repercussions of such an 
 approach
 would be, but the more I think about it, the more tempting it 
 seems to
 me.
[...] I would vote for this approach, if we ever decide to get rid of autodecoding. I'm OK with either option -- get rid of autodecoding, or keep it and use it consistently. What I am *not* OK with is the present, and growing, schizophrenic mixture of autodecoding and non-autodecoding string functions in Phobos. This inconsistency is going to come back to bite us later.
I expect that the two biggest problems causing the current situation are

1. Andrei and Walter don't seem to agree on the issue (Andrei seems to think that it's not a big deal to leave in the autodecoding).

2. While most of the core devs want to get rid of the autodecoding, it's a big enough change that we're afraid to do it and/or aren't sure of how we could do it without being too disruptive.

So, Walter has been pushing the schizophrenic approach in an effort to work around the problem. If the core devs could agree on an approach to removing autodecoding that wasn't too disruptive and somehow get Andrei to go along with it, then we could do that and fix the problem, but otherwise, Walter is just going to push for the schizophrenic approach, because it at least partially fixes the autodecoding problem, and enough of the core devs want to ditch the autodecoding that at least some of those changes are likely to make it in.

Honestly, I think that we need to figure out what the best options are for killing autodecoding and then figure out how to convince Andrei of it, but I haven't a clue how to convince Andrei unless maybe a solution which isn't very disruptive can be found, but it seems like every time the issue comes up, he gets annoyed that we're spending time on something unimportant. I do think that this limbo needs to stop though, and I think that it's clear that while autodecoding seemed like a good idea at first (especially if code points really were full characters instead of having to worry about graphemes), ultimately, autodecoding is a mistake.

- Jonathan M Davis
Apr 27 2015
parent reply "Chris" <wendlec tcd.ie> writes:
On Monday, 27 April 2015 at 17:49:04 UTC, Jonathan M Davis wrote:
 On Monday, 27 April 2015 at 17:01:03 UTC, H. S. Teoh wrote:
 On Sat, Apr 25, 2015 at 02:27:45AM +0000, Jonathan M Davis via 
 Digitalmars-d wrote:
 [...]
 I suppose that a related alternative would be to change it so 
 that
 strings aren't considered ranges anymore (at least 
 temporarily), and
 force folks to use stuff like byChar or byDChar (or whatever 
 those
 functions are) whenever they use strings as ranges. And 
 actually, that
 _would_ allow us to get rid of the autodecoding without 
 rearranging
 modules. Later, we could change them to being ranges of their 
 actual
 element types, or we could just force folks to be explicit 
 forever in
 an effort to make the Unicode issues clear, if we thought 
 that that
 were better (though it would probably better to just change 
 front and
 friends later to work with strings again but not autodecode). 
 And if
 an algorithm would work with either autodecoding or without 
 it, then
 maybe it could be special cased to accept strings as ranges, 
 only
 forcing it in the cases where it the behavior of the 
 algorithm would
 change based on whether autodecoding were used or not.
 
 Hmmm. I'm not sure what all of the repercussions of such an 
 approach
 would be, but the more I think about it, the more tempting it 
 seems to
 me.
[...] I would vote for this approach, if we ever decide to get rid of autodecoding. I'm OK with either option -- get rid of autodecoding, or keep it and use it consistently. What I am *not* OK with is the present, and growing, schizophrenic mixture of autodecoding and non-autodecoding string functions in Phobos. This inconsistency is going to come back to bite us later.
I expect that the two biggest problems causing the current situation are 1. Andrei and Walter don't seem to agree on the issue (Andrei seems to think that it's not a big deal to leave in the autodecoding). 2. While most of the core devs want to get rid of the autodecoding, it's a big enough change that we're afraid to do it and/or aren't sure of how we could do it without being too disruptive. So, Walter has been pushing the schizophrenic approach in an effort to work around the problem. If the core devs could agree on an approach to removing autodecoding that wasn't too disruptive and somehow get Andrei to go along with it, then we could do that and fix the problem, but otherwise, Walter is just going to push for the schizophrenic approach, because it at least partially fixes the autodecoding problem, and enough of the core devs want to ditch the autodecoding that at least some of those changes are likely to make it in. Honestly, I think that we need to figure out what the best options are for killing autodecoding and then figure out how to convince Andrei of it, but I haven't a clue how to convince Andrei unless maybe a solution which isn't very disruptive can be found, but it seems like every time the issue comes up, he gets annoyed that we're spending time on something unimportant. I do think that this limbo needs to stop though, and I think that it's clear that while autodecoding seemed like a good idea at first (especially if code points really were full characters instead of having to worry about graphemes), ultimately, autodecoding is a mistake. - Jonathan M Davis
Would it be much work to have example code or even an experimental module that gets rid of auto-decoding, so we could see what would be affected in general and how the actual code we have would be affected by it?

The topic keeps coming up again and again, and while I'm in favor of anything that enhances performance, I'm afraid of having to refactor large chunks of my code. This fear may be unfounded, but I would need some examples to visualize the problem.
Apr 28 2015
parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, 28 April 2015 at 09:11:10 UTC, Chris wrote:
 Would it be much work to have example code or even an 
 experimental module that gets rid of auto-decoding, so we could 
 see what would be affected in general and how actual code we 
 have would be affected by it?

 The topic keeps coming up again and again, and while I'm in 
 favor of anything that enhances performance, I'm afraid of 
 having to refactor large chunks of my code. However, this fear 
 may be unfounded, but I would need some examples to visualize 
 the problem.
Honestly, most code won't care. If we just switched out all of the auto-decoding right now, pretty much anything using only ASCII would just work, and most anything that's trying to manipulate ASCII characters in a Unicode string will just work, whereas code that's specifically manipulating Unicode characters might have problems (e.g. comparing front with a dchar will no longer have the same result, since front would just be the first code unit rather than necessarily the first code point).

Since most Phobos range-based functions which operate on strings are special-cased on strings already, many of them would continue to just work (e.g. find returns the same range type as what's passed to it even if it's given a string, so it might just work with the change, or it might need to be tweaked slightly). Those that wouldn't would generally either need to call encode on an argument to make it match the string type in the cases where string types mix (e.g. "foo".find("fo"d) would need to call encode on "fo"d to make it a string for comparison), or the caller would need to use std.utf.byDchar or std.uni.byGrapheme to operate on code points or graphemes rather than code units.

The two biggest places in Phobos that would potentially have problems are functions that special-cased strings but still used front, and those which have to return a new range type. filter is a good example, because it's forced to return a new range type. Right now, it would filter on dchars, but with the change, it would filter on the code unit type (most typically char). If you're filtering on ASCII characters, it wouldn't matter aside from the fact that the resulting range would have an element type of char rather than dchar, but if you're filtering on Unicode characters, it wouldn't work anymore. For situations like that, you'd be forced to use std.utf.byDchar or std.uni.byGrapheme. However, since most string code tends to operate on substrings rather than characters, I don't know how common it even is to use a function like filter on a string (as opposed to a range of strings). Such code might actually be fairly rare.

So, there _are_ a few functions which would stop working the same way in a potentially silent manner if we just made it so that front didn't autodecode anymore. However, in general, because Phobos almost always special-cases strings, calls to Phobos functions probably wouldn't need to change in most cases, and when they do, a call to byDchar would restore the old behavior. But of course, we'd want to do the transition in a way that didn't result in silent behavioral changes that would break code, even though in most cases, it wouldn't matter, because most code will be operating on ASCII strings even if the strings themselves contain Unicode - e.g. unicodeString.find(asciiString) is far more common than unicodeString.find(otherUnicodeString).

I suspect that the code that's at the greatest risk is code that checks for is(Unqual!(ElementType!Range) == dchar) to operate on strings and wrapper ranges around strings, since it would then only match the cases where byDchar had been used.

In general though, the code that's going to run into the most trouble is user code that contains range-based functions similar to what you might find in Phobos rather than code that's simply using Phobos functions like startsWith and find - i.e. if you're writing range-based code that worries about doing stuff like special-casing strings or which specifically needs to operate on code points, then you're going to have to make changes, whereas to a great extent, if all you're doing is passing strings to Phobos functions, your code will tend to just work.

To actually see what the impact would be, we'd have to just change Phobos, I think, and then see what the impact was on user code. It could be surprising how much or how little it affects things, though in most cases, I expect that it'll mean that code will just work. And if we really wanted to do that, we could create a version flag that turned off autodecoding and version the changes in Phobos appropriately to see what we got. In many cases, if we simply made sure that Phobos functions which special-cased strings didn't use front directly but instead didn't care whether they were operating on ranges of char, wchar, or dchar, then we wouldn't even need to version anything (e.g. find could easily be made to work that way if it doesn't already), but some functions (like filter) would need to be versioned differently.

So, maybe what we need to do to start is to just go through Phobos and make as many functions as possible not care about whether they're dealing with strings as ranges of char, wchar, or dchar. And at least then, we'd minimize how much code would have to be versioned differently if we were to test out getting rid of autodecoding with versioning.

- Jonathan M Davis
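
The filter case described above can be approximated today, as a rough sketch. Here std.utf.byCodeUnit stands in for what front would yield if strings stopped autodecoding; that substitution is an assumption made for illustration, not an actual Phobos switch, and the sample string is arbitrary.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : filter;
import std.range.primitives : ElementType;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "naïve café";
    dchar target = 'ï'; // a non-ASCII code point

    // Today: filter sees autodecoded dchars, so the predicate matches.
    auto decoded = s.filter!(c => c != target);
    static assert(is(ElementType!(typeof(decoded)) == dchar));
    assert(decoded.equal("nave café"));

    // Code units, ASCII predicate: same logical result, element type char.
    auto noSpaces = s.byCodeUnit.filter!(c => c != ' ');
    static assert(is(Unqual!(ElementType!(typeof(noSpaces))) == char));
    assert(noSpaces.equal("naïvecafé".byCodeUnit));

    // Code units, non-ASCII predicate: compiles, but no single code unit
    // equals U+00EF, so nothing is removed (the silent change).
    auto unchanged = s.byCodeUnit.filter!(c => c != target);
    assert(unchanged.equal(s.byCodeUnit));

    // An explicit byDchar restores the old behavior.
    auto restored = s.byCodeUnit.byDchar.filter!(c => c != target);
    assert(restored.equal("nave café"));
}
```

The third case is the one a transition plan would have to catch, since it neither fails to compile nor throws.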
Apr 28 2015
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
 But of course, we'd want to do the transition in a way that 
 didn't result in silent behavioral changes that would break 
 code,
One proposal is to make char and dchar comparisons illegal (after all, they are comparing different things - a UTF-8 code unit with a code point, and even though in some cases this comparison makes sense, in many it doesn't). That would solve most silent breakages at the expense of more not-so-silent breakages.
 And if we really wanted to do that, we could create a version 
 flag that turned off autodecoding and version the changes in 
 Phobos appropriately to see what we got.
Shameless self-promotion alert: An alternative is a GitHub fork. You can easily install and try out D forks with Digger, it's two commands:

digger build master+jmdavis/phobos/noautodecode
digger install
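
As background for the comparison proposal above, a small sketch of what currently compiles when a UTF-8 code unit is compared with a code point. The variable names are arbitrary and only serve this illustration.

```d
void main()
{
    string s = "é";     // stored as the two UTF-8 code units 0xC3 0xA9
    dchar target = 'é'; // the code point U+00E9

    // Accepted by the compiler: the char is promoted and compared numerically.
    assert(s[0] != target);
    assert(s[0] == 0xC3 && s[1] == 0xA9);

    // A lone 0xE9 byte is not valid UTF-8, yet it compares equal to U+00E9.
    char stray = 0xE9;
    assert(stray == target);
}
```

For ASCII values the comparison gives the expected answer, which is why rejecting it outright would also flag a good deal of working code.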
Apr 28 2015
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, Apr 28, 2015 at 09:57:29PM +0000, Vladimir Panteleev via Digitalmars-d
wrote:
 On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
But of course, we'd want to do the transition in a way that didn't
result in silent behavioral changes that would break code,
One proposal is to make char and dchar comparisons illegal (after all, they are comparing different things - an UTF-8 code unit with a code point, and even though in some cases this comparison makes sense, in many it doesn't). That would solve most silent breakages at the expense of more not-so-silent breakages.
And if we really wanted to do that, we could create a version flag
that turned of autodecoding and version the changes in Phobos
appropriately to see what we got.
Shameless self-promotion alert: An alternative is a GitHub fork. You can easily install and try out D forks with Digger, it's two commands: digger build master+jmdavis/phobos/noautodecode digger install
Oooh, Jonathan has the code ready? Haha, maybe I'll start using that instead of git master! ;-)

T

-- 
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash, / The day and hour soon are coming / When all the IT folks say "Gosh!" / It isn't from a clever lawsuit / That Windowsland will finally fall, / But thousands writing open source code / Like mice who nibble through a wall. -- The Linux-nationale by Greg Baker
Apr 28 2015
parent reply "Damian" <damianday hotmail.co.uk> writes:
On Tuesday, 28 April 2015 at 23:15:40 UTC, H. S. Teoh wrote:
 On Tue, Apr 28, 2015 at 09:57:29PM +0000, Vladimir Panteleev 
 via Digitalmars-d wrote:
 On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis 
 wrote:
But of course, we'd want to do the transition in a way that 
didn't
result in silent behavioral changes that would break code,
One proposal is to make char and dchar comparisons illegal (after all, they are comparing different things - an UTF-8 code unit with a code point, and even though in some cases this comparison makes sense, in many it doesn't). That would solve most silent breakages at the expense of more not-so-silent breakages.
And if we really wanted to do that, we could create a version 
flag
that turned of autodecoding and version the changes in Phobos
appropriately to see what we got.
Shameless self-promotion alert: An alternative is a GitHub fork. You can easily install and try out D forks with Digger, it's two commands: digger build master+jmdavis/phobos/noautodecode digger install
Oooh, Jonathan has the code ready? Haha, maybe I'll start using that instead of git master! ;-) T
I second that! If we all make the switch, perhaps Walter will too? :D
Apr 28 2015
parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, 28 April 2015 at 23:26:14 UTC, Damian wrote:
 I second that! If we all make the switch, perhaps Walter will 
 too? :D
Walter isn't necessarily the one we have to convince in this case. He'll be very concerned about avoiding breaking existing code, so we'd need a solid transition plan, but he very much wants to get rid of autodecoding, so he'll welcome it if we can do it cleanly. The bigger problem is convincing Andrei, since he seems to think that even discussing the issue is a waste of time and takes away from more important topics. And I don't dispute that there are other important topics, and coming back to this one over and over again is arguably a problem, but if we can just figure out how to make the transition and get it over with, then it wouldn't need to keep getting discussed like this. - Jonathan M Davis
Apr 28 2015
prev sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, 28 April 2015 at 21:57:31 UTC, Vladimir Panteleev 
wrote:
 On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis 
 wrote:
 But of course, we'd want to do the transition in a way that 
 didn't result in silent behavioral changes that would break 
 code,
One proposal is to make char and dchar comparisons illegal (after all, they are comparing different things - an UTF-8 code unit with a code point, and even though in some cases this comparison makes sense, in many it doesn't). That would solve most silent breakages at the expense of more not-so-silent breakages.
It would, but it doesn't necessarily play nicely with the promotion rules, and since the character types tend to be treated as integral types, I suspect that it would be problematic in a number of cases. I also suspect that it's not something that Walter would go for given his typical attitude about conversions (though I don't know). It's definitely an interesting thought, but I doubt that it would fly.
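To make the comparison issue concrete, here is a minimal sketch (not from the thread; the helper names firstCodePointIs and firstCodeUnitIs are made up for illustration). It assumes a Phobos where front still autodecodes strings, and uses std.utf.byCodeUnit to stand in for what front would yield without autodecoding:

```d
import std.range.primitives : empty, front;
import std.utf : byCodeUnit;

/// Compares the first code point (today's autodecoding front) with c.
bool firstCodePointIs(string s, dchar c)
{
    return !s.empty && s.front == c;
}

/// Compares the first code unit with c. This compiles without complaint,
/// because the char is promoted for the comparison.
bool firstCodeUnitIs(string s, dchar c)
{
    auto units = s.byCodeUnit;
    return !units.empty && units.front == c;
}

void main()
{
    // ASCII: a code unit and a code point coincide, so both checks agree.
    assert(firstCodePointIs("abc", 'a'));
    assert(firstCodeUnitIs("abc", 'a'));

    // U+00E9 is encoded in UTF-8 as two code units (0xC3 0xA9), so the
    // code unit comparison silently gives a different answer.
    assert(firstCodePointIs("\u00E9t\u00E9", '\u00E9'));
    assert(!firstCodeUnitIs("\u00E9t\u00E9", '\u00E9'));
}
```

Making the char/dchar comparison illegal would turn the second helper into a compile error instead of a silent change in behavior.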
 And if we really wanted to do that, we could create a version 
 flag that turned off autodecoding and version the changes in 
 Phobos appropriately to see what we got.
Shameless self-promotion alert: An alternative is a GitHub fork. You can easily install and try out D forks with Digger; it's two commands:
    digger build master+jmdavis/phobos/noautodecode
    digger install
Well, that may very well be what needs to happen as an experiment, but if we want to actually transition to not having autodecoding, we need a transition plan in master itself rather than a fork, and a temporary version would be one way to do that.

After thinking about the situation some over the past few days though, I think that what we need to do to begin with is to make it so that as many functions in Phobos as possible don't care whether they're dealing with ranges of char or dchar, so that they'll work regardless of what front does on strings (either by simply not using front on strings or by making it so that the code will work whether front returns char or dchar). That will reduce the number of changes that will have to be done in Phobos via versioning or deprecation or whatever we'd have to do to actually remove autodecoding. I suspect that it would mean that very little would have to be versioned or deprecated if/when we make the switch.

The bigger problem though is probably 3rd party range-based functions using front with strings or checking the element type, rather than Phobos itself or code using Phobos, since much of that would just work even if we outright switched front from autodecoding to non-autodecoding, and most of what wouldn't can be made to work by making it so that those functions don't care whether they're dealing with autodecoded strings or not.

- Jonathan M Davis
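As a rough sketch of what "not caring whether front returns char or dchar" could look like (countSpaces is a hypothetical function, not something in Phobos): searching for an ASCII character gives the same answer at the code unit and code point level, because in UTF-8 and UTF-16 the code units of a multi-unit sequence never fall in the ASCII range.

```d
import std.range.primitives : ElementType, isInputRange;
import std.traits : isSomeChar;
import std.utf : byCodeUnit, byDchar;

/// Counts ASCII spaces in any range of char, wchar, or dchar.
/// The result does not depend on whether the elements are code units
/// or code points.
size_t countSpaces(R)(R r)
if (isInputRange!R && isSomeChar!(ElementType!R))
{
    size_t n;
    // For a plain string, foreach iterates code units; for other ranges
    // it goes through front, whatever element type that yields.
    foreach (c; r)
    {
        if (c == ' ')
            ++n;
    }
    return n;
}

void main()
{
    string s = "na\u00EFve caf\u00E9 au lait";

    assert(countSpaces(s) == 3);            // plain string
    assert(countSpaces(s.byCodeUnit) == 3); // explicit code units
    assert(countSpaces(s.byDchar) == 3);    // explicit code points
    assert(countSpaces("a b"w) == 1);       // UTF-16 works the same way
}
```

A function written this way would not need to be versioned or deprecated if front stopped decoding.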
Apr 28 2015
prev sibling parent reply "Chris" <wendlec tcd.ie> writes:
On Tuesday, 28 April 2015 at 16:48:48 UTC, Jonathan M Davis wrote:
 On Tuesday, 28 April 2015 at 09:11:10 UTC, Chris wrote:
 Would it be much work to have example code or even an 
 experimental module that gets rid of auto-decoding, so we 
 could see what would be affected in general and how actual 
 code we have would be affected by it?

 The topic keeps coming up again and again, and while I'm in 
 favor of anything that enhances performance, I'm afraid of 
 having to refactor large chunks of my code. However, this fear 
 may be unfounded, but I would need some examples to visualize 
 the problem.
Honestly, most code won't care. If we just switched out all of the auto-decoding right now, pretty much anything using only ASCII would just work, and most anything that's trying to manipulate ASCII characters in a Unicode string will just work, whereas code that's specifically manipulating Unicode characters might have problems (e.g. comparing front with a dchar will no longer have the same result, since front would just be the first code unit rather than necessarily the first code point).

Since most Phobos range-based functions which operate on strings are special-cased on strings already, many of them would continue to just work (e.g. find returns the same range type as what's passed to it even if it's given a string, so it might just work with the change, or it might need to be tweaked slightly). Those that wouldn't would generally either need to call encode on an argument to make it match the string type in cases where string types are mixed (e.g. "foo".find("fo"d) would need to call encode on "fo"d to make it a string for comparison), or the caller would need to use std.utf.byDchar or std.uni.byGrapheme to operate on code points or graphemes rather than code units.

The two biggest places in Phobos that would potentially have problems are functions that special-cased strings but still used front and those which have to return a new range type. filter would be a good example, because it's forced to return a new range type. Right now, it would filter on dchars, but with the change, it would filter on the code unit type (most typically char). If you're filtering on ASCII characters, it wouldn't matter aside from the fact that the resulting range would have an element type of char rather than dchar, but if you're filtering on Unicode characters, it wouldn't work anymore. For situations like that, you'd be forced to use std.utf.byDchar or std.uni.byGrapheme. However, since most string code tends to operate on substrings rather than characters, I don't know how common it even is to use a function like filter on a string (as opposed to a range of strings). Such code might actually be fairly rare.

So, there _are_ a few functions which would stop working the same way in a potentially silent manner if we just made it so that front didn't autodecode anymore. However, in general, because Phobos almost always special-cases strings, calls to Phobos functions probably wouldn't need to change in most cases, and when they do, a call to byDchar would restore the old behavior. But of course, we'd want to do the transition in a way that didn't result in silent behavioral changes that would break code, even though in most cases, it wouldn't matter, because most code will be operating on ASCII strings even if the strings themselves contain Unicode - e.g. unicodeString.find(asciiString) is far more common than unicodeString.find(otherUnicodeString).

I suspect that the code that's at the greatest risk is code that checks for is(Unqual!(ElementType!Range) == dchar) to operate on strings and wrapper ranges around strings, since it would then only match the cases where byDchar had been used. In general though, the code that's going to run into the most trouble is user code that contains range-based functions similar to what you might find in Phobos rather than code that's simply using the Phobos functions like startsWith and find - i.e. if you're writing range-based code that worries about doing stuff like special-casing strings or which specifically needs to operate on code points, then you're going to have to make changes, whereas to a great extent, if all you're doing is passing strings to Phobos functions, your code will tend to just work.

To actually see what the impact would be, we'd have to just change Phobos, I think, and then see what the impact was on user code. It could be surprising how much or how little it affects things, though in most cases, I expect that it'll mean that code will just work. And if we really wanted to do that, we could create a version flag that turned off autodecoding and version the changes in Phobos appropriately to see what we got. In many cases, if we simply made sure that Phobos functions which special-cased strings didn't use front directly but instead didn't care whether they were operating on ranges of char, wchar, or dchar, then we wouldn't even need to version anything (e.g. find could easily be made to work that way if it doesn't already), but some functions (like filter) would need to be versioned differently.

So, maybe what we need to do to start is to just go through Phobos and make as many functions as possible not care about whether they're dealing with strings as ranges of char, wchar, or dchar. And at least then, we'd minimize how much code would have to be versioned differently if we were to test out getting rid of autodecoding with versioning.

- Jonathan M Davis
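The filter case and the element type check described above can be sketched as follows. This is only an illustration under the assumption that front still autodecodes; yieldsCodePoints is a made-up name, and byCodeUnit is used to show what a string would look like as a range of code units:

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : filter;
import std.range.primitives : ElementType;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

/// The kind of element type check that user code may rely on.
enum bool yieldsCodePoints(R) = is(Unqual!(ElementType!R) == dchar);

void main()
{
    string s = "caf\u00E9 cr\u00E8me";

    // With autodecoding, a plain string matches the check. A range of
    // code units does not, and only an explicit byDchar matches again.
    static assert(yieldsCodePoints!string);
    static assert(!yieldsCodePoints!(typeof(s.byCodeUnit)));
    static assert(yieldsCodePoints!(typeof(s.byDchar)));

    // Filtering on an ASCII character works fine on code units; only the
    // element type of the result differs (char instead of dchar).
    assert(s.byCodeUnit.filter!(c => c != ' ')
            .equal("caf\u00E9cr\u00E8me".byCodeUnit));

    // Filtering on a non-ASCII character has to operate on code points,
    // so the caller asks for them explicitly.
    assert(s.byDchar.filter!(c => c != '\u00E9')
            .equal("caf cr\u00E8me".byDchar));
}
```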
This sounds like a good starting point for a transition plan. One important thing, though, would be to do some benchmarking with and without autodecoding, to see if it really boosts performance in a way that would justify the transition.
Apr 29 2015
parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
 This sounds like a good starting point for a transition plan. 
 One important thing, though, would be to do some benchmarking 
 with and without autodecoding, to see if it really boosts 
 performance in a way that would justify the transition.
Well, personally, I think that it's worth it even if the performance is identical (and it's a guarantee that it's going to be better without autodecoding - it's just a question of how much better - since it's going to have less work to do without autodecoding). Simply operating at the code point level like we do now is the worst of all worlds in terms of flexibility and correctness. As long as the Unicode is normalized, operating at the code unit level is the most efficient, and decoding is often unnecessary for correctness, and if you need to decode, then you really need to go up to the grapheme level in order to be operating on the full character, meaning that operating on code points really has the same problems as operating on code units as far as correctness goes. So, it's less performant without actually being correct. It just gives the illusion of correctness.

By treating strings as ranges of code units, you don't take a performance hit when you don't need to, and it forces you to actually consider something like byDchar or byGrapheme if you want to operate on full Unicode characters. It's similar to how operating on UTF-16 code units as if they were characters (as Java and C# generally do) frequently gives the incorrect impression that you're handling Unicode correctly, because you have to work harder at coming up with characters that can't fit in a single code unit, whereas with UTF-8, anything but ASCII is screwed if you treat code units as code points. Treating code points as if they were full characters like we're doing now in Phobos with ranges just makes it that much harder to notice that you're not handling Unicode correctly.

Also, treating strings as ranges of code units makes it so that they're not so special and actually are treated like every other type of array, which eliminates a lot of the special casing that we're forced to do right now, and it eliminates all of the confusion that folks keep running into when string doesn't work with many functions, because it's not a random-access range or doesn't have length, or because the resulting range isn't the same type (copy would be a prime example of a function that doesn't work with char[] when it should).

By leaving in autodecoding, we're basically leaving in technical debt in D permanently. We'll forever have to be explaining it to folks and forever have to be working around it in order to achieve either performance or correctness. What we have now isn't performant, correct, or flexible, and we'll be forever paying for that if we don't get rid of autodecoding.

I don't criticize Andrei in the least for coming up with it, since if you don't take graphemes into account (and he didn't know about them at the time), it seems like a great idea and allows us to be correct by default and performant if we put some effort into it, but after having seen how it's worked out, how much code has to be special-cased, how much confusion there is over it, and how it's not actually correct anyway, I think that it's quite clear that autodecoding was a mistake. And at this point, it's mainly a question of how we can get rid of it without being too disruptive and whether we can convince Andrei that it makes sense to make the change, since he seems to still think that autodecoding is fine in spite of the fact that it's neither performant nor correct.

It may be that the decision will be that it's too disruptive to remove autodecoding, but I think that that's really a question of whether we can find a way to do it that doesn't break tons of code rather than whether it's worth the performance or correctness gain.

- Jonathan M Davis
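For anyone who wants to see the three levels side by side, here is a small sketch (Counts and countLevels are hypothetical names used only for this example). It assumes today's autodecoding Phobos, so a plain string is walked by code point, and it also shows the range traits mentioned above:

```d
import std.range.primitives : hasLength, isRandomAccessRange, walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

/// Number of elements seen at each level of iteration.
struct Counts
{
    size_t codeUnits;
    size_t codePoints;
    size_t graphemes;
}

Counts countLevels(string s)
{
    return Counts(s.byCodeUnit.walkLength, // UTF-8 code units
                  s.walkLength,            // autodecoded code points
                  s.byGrapheme.walkLength); // user-perceived characters
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one character on
    // screen, two code points, three UTF-8 code units.
    assert(countLevels("e\u0301") == Counts(3, 2, 1));

    // Pure ASCII: all three levels agree.
    assert(countLevels("abc") == Counts(3, 3, 3));

    // As a range, a string has neither length nor random access, unlike
    // other arrays; the code unit view has both.
    static assert(!hasLength!string);
    static assert(!isRandomAccessRange!string);
    static assert(isRandomAccessRange!(int[]));
    static assert(hasLength!(typeof("".byCodeUnit)));
    static assert(isRandomAccessRange!(typeof("".byCodeUnit)));
}
```

The middle number is the one autodecoding gives you by default, and it matches neither the storage size nor the character count.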
Apr 29 2015
parent "Chris" <wendlec tcd.ie> writes:
On Wednesday, 29 April 2015 at 15:13:15 UTC, Jonathan M Davis 
wrote:
Ok, I see. Well, if we don't want to repeat C++'s mistakes, we should fix it before it's too late. Since I'm dealing a lot with strings (non ASCII) and depend on Unicode (and correctness!), I would be more than happy to test any changes to Phobos with my programs to see if it screws up anything.
Apr 29 2015
prev sibling next sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
On Fri, 24 Apr 2015 13:44:43 -0700, Walter Bright wrote:

 I'm afraid we are stuck with autodecoding, as taking it out may be far
 too disruptive.
the more time passing the harder autodecode to kill. kill it while it's not too late. make the next DMD release 2.100 and KILL AUTODECODE for good.
Apr 25 2015
prev sibling parent "Kagamin" <spam here.lot> writes:
On Friday, 24 April 2015 at 20:44:34 UTC, Walter Bright wrote:
 Time has shown, however, that UTF8 has pretty much won. wchar 
 only exists for Windows API and Java
Also NSString. It used to support UTF-16 and C encoding. AFAIK, the latter later evolved into UTF-8.
Apr 30 2015