digitalmars.D - The Case Against Autodecode

reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
 I am as unclear about the problems of autodecoding as I am about the necessity
 to remove curl. Whenever I ask I hear some arguments that work well emotionally
 but are scant on reason and engineering. Maybe it's time to rehash them? I just
 did so about curl, no solid argument seemed to come together. I'd be curious of
 a crisp list of grievances about autodecoding. -- Andrei
Here are some that are not matters of opinion.

1. Ranges of characters do not autodecode, but arrays of characters do. This is a glaring inconsistency.

2. Every time one wants an algorithm to work with both strings and ranges, you wind up special casing the strings to defeat the autodecoding, or to decode the ranges. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unittesting because it is convenient to unittest only with arrays.

3. Wrapping an array in a struct with an alias this to an array turns off autodecoding, another special case.

4. Autodecoding is slow and has no place in high speed string processing.

5. Very few algorithms require decoding.

6. Autodecoding has two choices when encountering invalid code units - throw or produce an error dchar. Currently, it throws, meaning no algorithms using autodecode can be made nothrow.

7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autodecode does not play well with that.

8. In my work with UTF-8 streams, dealing with autodecode has caused me considerable extra work every time. A convenient timesaver it ain't.

9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.

10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.

11. Indexing an array produces different results than autodecoding, another glaring special case.
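A minimal sketch of points 1, 10 and 11 in code, assuming current Phobos behaviour:

import std.range : ElementType, isRandomAccessRange;

void main()
{
    string s = "hello";
    static assert(is(typeof(s[0]) == immutable(char))); // indexing yields a code unit
    static assert(is(ElementType!string == dchar));     // range iteration autodecodes to code points
    static assert(!isRandomAccessRange!string);         // so the array is not treated as a random-access range
}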
May 12 2016
next sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
 I am as unclear about the problems of autodecoding as I am
about the necessity
 to remove curl. Whenever I ask I hear some arguments that
work well emotionally
 but are scant on reason and engineering. Maybe it's time to
rehash them? I just
 did so about curl, no solid argument seemed to come together.
I'd be curious of
 a crisp list of grievances about autodecoding. -- Andrei
[...]
12. The result of autodecoding, a range of Unicode code points, is rarely actually useful, and code that relies on autodecoding is rarely actually universally correct. Graphemes are occasionally useful for a subset of scripts, and a subset of that subset has all graphemes mapped to single code points, but this only applies to some scripts/languages.

In the majority of cases, autodecoding provides only the illusion of correctness.
May 12 2016
next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, May 12, 2016 at 08:24:23PM +0000, Vladimir Panteleev via Digitalmars-d
wrote:
 On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
[...]
1. Ranges of characters do not autodecode, but arrays of characters
do.  This is a glaring inconsistency.

2. Every time one wants an algorithm to work with both strings and
ranges, you wind up special casing the strings to defeat the
autodecoding, or to decode the ranges. Having to constantly special
case it makes for more special cases when plugging together
components. These issues often escape detection when unittesting
because it is convenient to unittest only with arrays.
Example of string special-casing leading to bugs: https://issues.dlang.org/show_bug.cgi?id=15972

This particular issue highlights the problem quite well: how could a single char possibly need to be "auto-decoded" to a dchar? Unfortunately, due to Phobos algorithms assuming autodecoding, the resulting range of char is not recognized as "string-like" data by .joiner, thus causing a compile error.

The workaround (as described in the bug comments) also illustrates the inconsistency in handling ranges of char vs. ranges of dchar: writing .joiner("\n".byCodeUnit) will actually fix the problem, basically by explicitly disabling autodecoding.

We can, of course, fix .joiner to recognize this case and handle it correctly, but the fact that using .byCodeUnit works perfectly proves that autodecoding is not necessary here. Which begs the question: why have autodecoding at all, and then require .byCodeUnit to work around issues it causes?

T

-- 
It is widely believed that reinventing the wheel is a waste of time; but I disagree: without wheel reinventers, we would still be stuck with wooden horse-cart wheels.
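A minimal sketch of the workaround shape described above, assuming current std.utf and std.algorithm (the actual failing code is in the bug report):

import std.algorithm : joiner, map;
import std.utf : byCodeUnit;

void main()
{
    auto lines = ["foo", "bar"];
    // Joining ranges of char trips the autodecoding special cases in .joiner;
    // pushing both the elements and the separator through .byCodeUnit sidesteps them.
    auto joined = lines.map!(l => l.byCodeUnit).joiner("\n".byCodeUnit);
}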
May 12 2016
prev sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, May 12, 2016 at 08:24:23PM +0000, Vladimir Panteleev via Digitalmars-d
wrote:
[...]
 12. The result of autodecoding, a range of Unicode code points, is
 rarely actually useful, and code that relies on autodecoding is rarely
 actually, universally correct. Graphemes are occasionally useful for a
 subset of scripts, and a subset of that subset has all graphemes
 mapped to single code points, but this only applies to some
 scripts/languages.
 
 In the majority of cases, autodecoding provides only the illusion of
 correctness.
A range of Unicode code points is not the same as a range of graphemes (a grapheme is what a layperson would consider to be a "character"). Autodecoding returns dchar, a code point, rather than a grapheme.

Therefore, autodecoding actually only produces intuitively correct results when your string has a 1-to-1 correspondence between grapheme and code point. In general, this is only true for a small subset of languages, mainly a few common European languages and a handful of others. It doesn't work for Korean, and doesn't work for any language that uses combining diacritics or other modifiers. You need byGrapheme to have the correct results.

So basically autodecoding, as currently implemented, fails to meet its goal of segmenting a string by "character" (i.e., grapheme), and yet imposes a performance penalty that is difficult to "turn off" (you have to sprinkle your code with byCodeUnit everywhere, and many Phobos algorithms just return a range of dchar anyway). Not to mention that a good number of string algorithms don't actually *need* autodecoding at all.

(One could make a case for auto-segmenting by grapheme, but that's even worse in terms of performance: it requires a non-trivial Unicode algorithm involving lookup tables, and may need memory allocation. At the end of the day, we're back to square one: iterate by code unit, and explicitly ask for byGrapheme where necessary.)

T

-- 
"I'm running Windows '98." "Yes." "My computer isn't working now." "Yes, you already said that." -- User-Friendly
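A small sketch of the code point vs. grapheme distinction, assuming std.uni and std.range as they are today:

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "noe\u0308l";              // "noël" with a combining diaeresis (NFD)
    assert(s.length == 6);                // 6 UTF-8 code units
    assert(s.walkLength == 5);            // autodecoded: 5 code points
    assert(s.byGrapheme.walkLength == 4); // 4 graphemes, i.e. what a reader sees
}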
May 12 2016
parent reply Marc Schütz <schuetzm gmx.net> writes:
On Thursday, 12 May 2016 at 23:16:23 UTC, H. S. Teoh wrote:
 Therefore, autodecoding actually only produces intuitively 
 correct results when your string has a 1-to-1 correspondence 
 between grapheme and code point. In general, this is only true 
 for a small subset of languages, mainly a few common European 
 languages and a handful of others.  It doesn't work for Korean, 
 and doesn't work for any language that uses combining 
 diacritics or other modifiers.  You need byGrapheme to have the 
 correct results.
In fact, even most European languages are affected if NFD normalization is used, which is the default on MacOS X. And this is actually the main problem with it: It was introduced to make unicode string handling correct. Well, it doesn't, therefore it has no justification.
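A small sketch of the NFC/NFD point, assuming std.uni's normalize:

import std.uni;

void main()
{
    string nfc = "\u00F6";   // 'ö' precomposed (NFC)
    string nfd = "o\u0308";  // 'o' plus a combining diaeresis (NFD), as in Mac OS X file names
    assert(nfc != nfd);                 // different code units and code points
    assert(normalize!NFC(nfd) == nfc);  // but the same text after normalization
}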
May 13 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Fri, 13 May 2016 10:49:24 +0000, Marc Schütz <schuetzm gmx.net> wrote:

 In fact, even most European languages are affected if NFD
 normalization is used, which is the default on MacOS X.

 And this is actually the main problem with it: It was introduced
 to make unicode string handling correct. Well, it doesn't,
 therefore it has no justification.
+1 for leaning back and contemplating exactly what auto-decode was aiming for and how it missed that goal. You'll see that an ö may still be cut between the o and the ¨. Hangul symbols are composed of pieces that go in different corners. Those would also be split up by auto-decode. Can we handle real world text AT ALL?

Are graphemes good enough to find the column in a fixed width display of some string (e.g. line+column of an error)? No, there may still be full-width characters in there that take 2 columns. :p

-- 
Marco
May 13 2016
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, May 13, 2016 at 09:26:40PM +0200, Marco Leise via Digitalmars-d wrote:
 On Fri, 13 May 2016 10:49:24 +0000, Marc Schütz <schuetzm gmx.net> wrote:
 
 In fact, even most European languages are affected if NFD 
 normalization is used, which is the default on MacOS X.
 
 And this is actually the main problem with it: It was introduced 
 to make unicode string handling correct. Well, it doesn't, 
 therefore it has no justification.
+1 for leaning back and contemplating exactly what auto-decode was aiming for and how it missed that goal. You'll see that an ö may still be cut between the o and the ¨. Hangul symbols are composed of pieces that go in different corners. Those would also be split up by auto-decode. Can we handle real world text AT ALL? Are graphemes good enough to find the column in a fixed width display of some string (e.g. line+column of an error)? No, there may still be full-width characters in there that take 2 columns. :p
[...]

A simple lookup table ought to fix this. Preferably in std.uni so that it doesn't get reinvented by every other project.

T

-- 
Don't modify spaghetti code unless you can eat the consequences.
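A rough sketch of such a lookup; the ranges below cover only a few of the East Asian Wide blocks and are purely illustrative, a real table would be generated from the Unicode data files:

// Number of terminal columns a code point occupies (illustrative subset only).
int displayWidth(dchar c)
{
    if ((c >= 0x1100 && c <= 0x115F) ||  // Hangul Jamo leading consonants
        (c >= 0x4E00 && c <= 0x9FFF) ||  // CJK Unified Ideographs
        (c >= 0xAC00 && c <= 0xD7A3) ||  // Hangul syllables
        (c >= 0xFF01 && c <= 0xFF60))    // Fullwidth forms
        return 2;
    return 1;
}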
May 13 2016
prev sibling next sibling parent reply Daniel Kozak <kozzi11 gmail.com> writes:
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
 I am as unclear about the problems of autodecoding as I am
about the necessity
 to remove curl. Whenever I ask I hear some arguments that
work well emotionally
 but are scant on reason and engineering. Maybe it's time to
rehash them? I just
 did so about curl, no solid argument seemed to come together.
I'd be curious of
 a crisp list of grievances about autodecoding. -- Andrei
[...]
For me it is not about autodecoding. I would like to have something like a String type which does that. But what I am really pissed off about is that the current string type is an alias to immutable(char)[] (so it is not usable at all). This is a real problem for me, because it makes working on arrays of chars almost impossible.

Even char[] is unusable, so I am forced to use ubyte[], but this is really not an array of chars.

ATM D does not support even full Unicode strings, nor even a basic array of chars :(.

I hope this will be fixed one day, so I could start to expand D in Czech; until then I am unable to do that.
May 12 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/12/2016 4:23 PM, Daniel Kozak wrote:
 But what I am really pissed off about is that the current string type is
 an alias to immutable(char)[] (so it is not usable at all). This is a real
 problem for me, because it makes working on arrays of chars almost impossible.

 Even char[] is unusable, so I am forced to use ubyte[], but this is really not
 an array of chars.

 ATM D does not support even full Unicode strings, nor even a basic array of chars
 :(.

 I hope this will be fixed one day, so I could start to expand D in Czech; until
 then I am unable to do that.
I can't find any actionable information in this.
May 12 2016
prev sibling next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Thu, 12 May 2016 13:15:45 -0700, Walter Bright <newshound2 digitalmars.com> wrote:

 7. Autodecode cannot be used with unicode path/filenames, because it is legal 
 (at least on Linux) to have invalid UTF-8 as filenames.
More precisely they are byte strings with '/' reserved to separate path elements. While on an out-of-the-box Linux nowadays everything is typically presented as UTF-8, there are still die-hards that use code pages, corrupted file systems or incorrectly bound network shares displaying with the wrong charset. It is safer to work with them as a ubyte[] and that also bypasses auto-decoding. I'd like 'string' to mean valid UTF-8 in D as far as the encoding goes. A filename should not be a 'string'. -- Marco
May 12 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/12/2016 4:52 PM, Marco Leise wrote:
 I'd like 'string' to mean valid UTF-8 in D as far as the
 encoding goes. A filename should not be a 'string'.
I would have agreed with you in the past, but more and more it just doesn't seem practical. UTF-8 is dirty in the real world, and D code will have to deal with it. By dealing with it I mean not crashing, throwing exceptions, or other tantrums when encountering it. Unless it matters, it should pass the invalid encodings along unmolested and without comment.

For example, if you're searching for 'a' in a UTF-8 string, what does it matter if there are invalid encodings in that string? For filenames/paths in particular, having redone the file/path code in Phobos, I realized that invalid encodings are completely immaterial.
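A small sketch of that point, assuming std.utf.byCodeUnit (0xFF is never a valid UTF-8 code unit):

import std.algorithm : canFind;
import std.utf : byCodeUnit;

void main()
{
    // A "dirty" string containing invalid UTF-8 bytes.
    auto dirty = "path\xFFto\xFFa-file";
    // Searching at the code-unit level neither throws nor cares about the bad bytes.
    assert(dirty.byCodeUnit.canFind('a'));
}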
May 12 2016
prev sibling next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 Here are some that are not matters of opinion.
If you're serious about removing auto-decoding, which I think you and others have shown has merits, you have to have THE SIMPLEST migration path ever, or you will kill D. I'm talking a simple press of a button.

I'm not exaggerating here. Python, a language which was much more popular than D at the time, came out with two versions in 2008: Python 2.7 which had numerous unicode problems, and Python 3.0 which fixed those problems. Almost eight years later, and Python 2 is STILL the more popular version despite Py3 having five major point releases since and Python 2 only getting security patches. Think the tango vs phobos problem, only a little worse.

D is much less popular now than was Python at the time, and Python 2 problems were more straight forward than the auto-decoding problem. You'll need a very clear migration path, years long deprecations, and automatic tools in order to make the transition work, or else D's usage will be permanently damaged.
May 12 2016
next sibling parent Jack Stouffer <jack jackstouffer.com> writes:
On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:
 I'm not exaggerating here. Python, a language which was much 
 more popular than D at the time, came out with two versions in 
 2008: Python 2.7 which had numerous unicode problems, and 
 Python 3.0 which fixed those problems. Almost eight years 
 later, and Python 2 is STILL the more popular version despite 
 Py3 having five major point releases since and Python 2 only 
 getting security patches. Think the tango vs phobos problem, 
 only a little worse.
To hammer this home a little more, Python 3 had a really useful library in order to abstract most of the differences automatically. But despite that, here is a list of the top 200 Python packages in 2011, three years after the fork, and whether they supported Python 3 or not: https://web.archive.org/web/20110215214547/http://python3wos.appspot.com/

This is _three years_ later, and only 18 out of the top 200 supported Python 3. And here it is now, eight years later, at 174 out of 200: https://python3wos.appspot.com/
May 12 2016
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/12/2016 5:47 PM, Jack Stouffer wrote:
 D is much less popular now than was Python at the time, and Python 2 problems
 were more straight forward than the auto-decoding problem.  You'll need a very
 clear migration path, years long deprecations, and automatic tools in order to
 make the transition work, or else D's usage will be permanently damaged.
I agree, if it is possible at all.
May 12 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Friday, 13 May 2016 at 01:00:54 UTC, Walter Bright wrote:
 On 5/12/2016 5:47 PM, Jack Stouffer wrote:
 D is much less popular now than was Python at the time, and 
 Python 2 problems
 were more straight forward than the auto-decoding problem.  
 You'll need a very
 clear migration path, years long deprecations, and automatic 
 tools in order to
 make the transition work, or else D's usage will be 
 permanently damaged.
I agree, if it is possible at all.
I don't know to which extent my problems with string handling are related to autodecode. However, I had to write some utility functions to get around issues with code points, graphemes and the like. While it is not a huge issue in terms of programming time, it does slow down my program, because even simple operations may be deferred to a utility function to make sure the result is correct (.length for example). But that might be an issue related to Unicode in general (or D's handling of it).

If autodecode is killed, could we have a test version asap? I'd be willing to test my programs with autodecode turned off and see what happens. Others should do likewise and we could come up with a transition strategy based on what happened.
May 13 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/13/2016 2:12 AM, Chris wrote:
 If autodecode is killed, could we have a test version asap? I'd be willing to
 test my programs with autodecode turned off and see what happens. Others should
 do likewise and we could come up with a transition strategy based on what
happened.
You can avoid autodecode by using .byChar
May 13 2016
parent reply Chris <wendlec tcd.ie> writes:
On Friday, 13 May 2016 at 13:17:44 UTC, Walter Bright wrote:
 On 5/13/2016 2:12 AM, Chris wrote:
 If autodecode is killed, could we have a test version asap? 
 I'd be willing to
 test my programs with autodecode turned off and see what 
 happens. Others should
 do likewise and we could come up with a transition strategy 
 based on what happened.
You can avoid autodecode by using .byChar
Hm. It would be difficult to make sure that my whole code base doesn't do something, somewhere, that triggers auto decode.

PS Why do I get a "StopForumSpam error" every time I post today? Has anyone else experienced the same problem:

"StopForumSpam error: Socket error: Lookup error: getaddrinfo error: Name or service not known. Please solve a CAPTCHA to continue."
May 13 2016
parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Friday, 13 May 2016 at 13:41:30 UTC, Chris wrote:
 PS Why do I get a "StopForumSpam error" every time I post 
 today? Has anyone else experienced the same problem:

 "StopForumSpam error: Socket error: Lookup error: getaddrinfo 
 error: Name or service not known. Please solve a CAPTCHA to 
 continue."
https://twitter.com/StopForumSpam
May 13 2016
parent Chris <wendlec tcd.ie> writes:
On Friday, 13 May 2016 at 14:06:28 UTC, Vladimir Panteleev wrote:
 On Friday, 13 May 2016 at 13:41:30 UTC, Chris wrote:
 PS Why do I get a "StopForumSpam error" every time I post 
 today? Has anyone else experienced the same problem:

 "StopForumSpam error: Socket error: Lookup error: getaddrinfo 
 error: Name or service not known. Please solve a CAPTCHA to 
 continue."
https://twitter.com/StopForumSpam
I don't understand. Does that mean we have to solve CAPTCHAs every time we post? Annoying CAPTCHAs at that.
May 13 2016
prev sibling parent Iakh <iaktakh gmail.com> writes:
On Friday, 13 May 2016 at 01:00:54 UTC, Walter Bright wrote:
 On 5/12/2016 5:47 PM, Jack Stouffer wrote:
 D is much less popular now than was Python at the time, and 
 Python 2 problems
 were more straight forward than the auto-decoding problem.  
 You'll need a very
 clear migration path, years long deprecations, and automatic 
 tools in order to
 make the transition work, or else D's usage will be 
 permanently damaged.
I agree, if it is possible at all.
A plan:
1. Mark as deprecated the places where auto-decoding is used. I think it's all "range" functions for strings (front, popFront, back, ...). Force using byChar & co.
2. Introduce a new String type in Phobos.
3. After ages, make immutable(char)[] an ordinary array.

Is it OK? Profit?
May 13 2016
prev sibling next sibling parent Ola Fosheim Grøstad writes:
On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:
 D is much less popular now than was Python at the time, and 
 Python 2 problems were more straight forward than the 
 auto-decoding problem.  You'll need a very clear migration 
 path, years long deprecations, and automatic tools in order to 
 make the transition work, or else D's usage will be permanently 
 damaged.
Python 2 is/was deployed at a much larger scale and with far more library dependencies, so I don't think it is comparable. It is easier for D to get away with breaking changes.

I am still using Python 2.7 exclusively, but now I use:

from __future__ import division, absolute_import, with_statement, unicode_literals

D can do something similar. C++ is using a comparable solution. Use switches to turn on different compatibility levels.
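In D terms, a rough sketch of that opt-in idea; the version identifier is hypothetical, not an existing compiler switch:

// Compile with -version=NoAutodecode to opt in to code-unit iteration early.
version (NoAutodecode)
{
    import std.utf : byCodeUnit;
    auto chars(string s) { return s.byCodeUnit; } // range of char
}
else
{
    auto chars(string s) { return s; } // today's autodecoded range of dchar
}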
May 13 2016
prev sibling next sibling parent reply Nick Treleaven <ntrel-pub mybtinternet.com> writes:
On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:
 If you're serious about removing auto-decoding, which I think 
 you and others have shown has merits, you have to the THE 
 SIMPLEST migration path ever, or you will kill D. I'm talking a 
 simple press of a button.
char[] is always going to be unsafe for UTF-8. I don't think we can remove it or auto-decoding, only discourage use of it. We need a String struct IMO, without length or indexing. Its front can do autodecoding, and it has a ubyte[] raw() property too. (Possibly the byte length of front can be cached for use in popFront, assuming it was faster). This would be a gradual transition.
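A minimal sketch of the kind of String struct described above; the names and details are illustrative, not a concrete proposal:

import std.utf : decode;

struct String
{
    private immutable(char)[] data;   // raw UTF-8; no length or indexing exposed

    @property bool empty() const { return data.length == 0; }

    // front decodes one code point, like autodecoding does today.
    @property dchar front() const
    {
        immutable(char)[] s = data;
        size_t i = 0;
        return decode(s, i);
    }

    void popFront()
    {
        size_t i = 0;
        decode(data, i);              // find the byte length of the current code point
        data = data[i .. $];
    }

    // Escape hatch: the raw code units, no decoding.
    @property immutable(ubyte)[] raw() const
    {
        return cast(immutable(ubyte)[]) data;
    }
}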
May 13 2016
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, May 13, 2016 at 12:16:30PM +0000, Nick Treleaven via Digitalmars-d
wrote:
 On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:
If you're serious about removing auto-decoding, which I think you and
others have shown has merits, you have to the THE SIMPLEST migration
path ever, or you will kill D. I'm talking a simple press of a
button.
char[] is always going to be unsafe for UTF-8. I don't think we can remove it or auto-decoding, only discourage use of it. We need a String struct IMO, without length or indexing. Its front can do autodecoding, and it has a ubyte[] raw() property too. (Possibly the byte length of front can be cached for use in popFront, assuming it was faster). This would be a gradual transition.
alias String = typeof(std.uni.byGrapheme(immutable(char)[].init));

:-)

Well, OK, perhaps you could wrap this in a struct that allows extraction of .raw, etc. But basically this isn't hard to implement today. We already have all of the tools necessary.

T

-- 
Dogs have owners ... cats have staff. -- Krista Casada
May 13 2016
prev sibling parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 05/12/2016 08:47 PM, Jack Stouffer wrote:
 If you're serious about removing auto-decoding, which I think you and
 others have shown has merits, you have to the THE SIMPLEST migration
 path ever, or you will kill D. I'm talking a simple press of a button.

 I'm not exaggerating here. Python, a language which was much more
 popular than D at the time, came out with two versions in 2008: Python
 2.7 which had numerous unicode problems, and Python 3.0 which fixed
 those problems. Almost eight years later, and Python 2 is STILL the more
 popular version despite Py3 having five major point releases since and
 Python 2 only getting security patches. Think the tango vs phobos
 problem, only a little worse.

 D is much less popular now than was Python at the time, and Python 2
 problems were more straight forward than the auto-decoding problem.
 You'll need a very clear migration path, years long deprecations, and
 automatic tools in order to make the transition work, or else D's usage
 will be permanently damaged.
As much as I agree on the importance of a good smooth migration path, I don't think the "Python 2 vs 3" situation is really all that comparable here. Unlike Python, we wouldn't be maintaining a "with auto-decoding" fork for years and years and years, ensuring nobody ever had a pressing reason to bother migrating. And on top of that, we don't have a culture and design philosophy that promotes "do the lazy thing first and the robust thing never". D users are more likely than dynamic language users to be willing to make a few changes for the sake of improvement.

Heck, we weather breaking fixes enough anyway. There was even one point within the last couple years where something (forget offhand what it was) was removed from std.datetime and its replacement was added *in the very same compiler release*. No transition period. It was an annoying pain (at least to me), but I got through it fine and never even entertained the thought of just sticking with the old compiler. Not sure most people even noticed it. Point is, in D, even when something does need to change, life goes on fine. As long as we don't maintain a long-term fork ;)

Naturally, minimizing breakage is important here, but I really don't think Python's UTF migration situation is all that comparable.
May 29 2016
next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Sunday, 29 May 2016 at 17:35:35 UTC, Nick Sabalausky wrote:
 Unlike Python, we wouldn't be maintaining a "with 
 auto-decoding" fork for years and years and years, ensuring 
 nobody ever had a pressing reason to bother migrating.
If it happens, they better. The D1 fork was maintained for almost three years for a good reason.
 Heck, we weather breaking fixes enough anyway.
Not nearly on a scale similar to changing how strings are iterated; not since the D1/D2 split.
 It was an annoying pain (at least to me), but I got through it 
 fine and never even entertained the thought of just sticking 
 with the old compiler.
 Not sure most people even noticed it. Point is, in D, even when 
 something does need to change, life goes on fine. As long as we 
 don't maintain a long-term fork ;)
The problem is not active users. The problem is companies who have > 10K LOC and libraries that are no longer maintained. E.g. It took Sociomantic eight years after D2's release to switch only a few parts of their projects to D2. With the loss of old libraries/old code (even old answers on SO), all of a sudden you lose a lot of the network effect that makes programming languages much more useful.
May 29 2016
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 05/29/2016 09:58 PM, Jack Stouffer wrote:
 The problem is not active users. The problem is companies who have > 10K
 LOC and libraries that are no longer maintained. E.g. It took
 Sociomantic eight years after D2's release to switch only a few parts of
 their projects to D2. With the loss of old libraries/old code (even old
 answers on SO), all of a sudden you lose a lot of the network effect
 that makes programming languages much more useful.
D1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.
May 30 2016
next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
 D1 -> D2 was a vastly more disruptive change than getting rid 
 of auto-decoding would be.
Don't be so sure. All string handling code would become broken, even if it appears to work at first.
May 30 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 12:34 PM, Jack Stouffer wrote:
 On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
 D1 -> D2 was a vastly more disruptive change than getting rid of
 auto-decoding would be.
Don't be so sure. All string handling code would become broken, even if it appears to work at first.
That kind of makes this thread less productive than "How to improve autodecoding?" -- Andrei
May 30 2016
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 30-May-2016 21:24, Andrei Alexandrescu wrote:
 On 05/30/2016 12:34 PM, Jack Stouffer wrote:
 On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
 D1 -> D2 was a vastly more disruptive change than getting rid of
 auto-decoding would be.
Don't be so sure. All string handling code would become broken, even if it appears to work at first.
That kind of makes this thread less productive than "How to improve autodecoding?" -- Andrei
1. Generalize to all ranges of code units, i.e. ranges of char/wchar.

2. Operating on code units explicitly would then always involve a step through ubyte/byte.

-- 
Dmitry Olshansky
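A rough sketch of what such a generalized constraint could look like (illustrative only):

import std.range : ElementEncodingType, isInputRange;
import std.traits : isSomeChar;

// True for strings, byCodeUnit ranges, and any other range of char/wchar/dchar.
enum isCodeUnitRange(R) = isInputRange!R && isSomeChar!(ElementEncodingType!R);

static assert(isCodeUnitRange!string);
static assert(isCodeUnitRange!(wchar[]));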
May 30 2016
prev sibling next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 30 May 2016 at 18:24:23 UTC, Andrei Alexandrescu wrote:
 That kind of makes this thread less productive than "How to 
 improve autodecoding?" -- Andrei
Please don't misunderstand, I'm for fixing string behavior. But, let's not pretend that this wouldn't be one of the (if not the) largest breaking change since D2. As I said, straight up removing auto-decoding would break all string handling code.
May 30 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 03:00 PM, Jack Stouffer wrote:
 On Monday, 30 May 2016 at 18:24:23 UTC, Andrei Alexandrescu wrote:
 That kind of makes this thread less productive than "How to improve
 autodecoding?" -- Andrei
Please don't misunderstand, I'm for fixing string behavior.
Surely the misunderstanding is not on this side of the table :o). By "that" I meant your assertion at face value (i.e. assuming it's a fact) "All string handling code would become broken, even if it appears to work at first". -- Andrei
May 30 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Monday, May 30, 2016 14:24:23 Andrei Alexandrescu via Digitalmars-d wrote:
 On 05/30/2016 12:34 PM, Jack Stouffer wrote:
 On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
 D1 -> D2 was a vastly more disruptive change than getting rid of
 auto-decoding would be.
Don't be so sure. All string handling code would become broken, even if it appears to work at first.
That kind of makes this thread less productive than "How to improve autodecoding?" -- Andrei
I think that the first step is getting Phobos to work with all ranges of character types - be they char, wchar, dchar, or graphemes. Then the algorithms themselves will work whether we have auto-decoding or not. With that done, we can at minimum tell folks to use byCodeUnit, byChar!T, byGrapheme, etc. to get the correct, efficient behavior. Right now, if you try to use ranges like byCodeUnit, they work with some of Phobos but not enough to really work as a viable replacement to auto-decoding strings.

With all that done, at least it should be reasonably easy for folks to sanely get around auto-decoding, though the question still remains at that point how possible it will be to remove auto-decoding and treat ranges of char the same way that byCodeUnit would. But at bare minimum, it's what we need to do to make it possible and reasonable to work around auto-decoding when you need to while specifying the level of Unicode that you actually want to operate at.

- Jonathan M Davis
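A small sketch of explicitly picking the iteration level, assuming the current std.utf and std.uni names:

import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byUTF;

void main()
{
    string s = "résumé";                   // NFC: 6 characters, 8 UTF-8 bytes
    assert(s.byCodeUnit.walkLength == 8);  // code units
    assert(s.byUTF!dchar.walkLength == 6); // code points
    assert(s.byGrapheme.walkLength == 6);  // graphemes
}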
May 31 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/16 2:21 PM, Jonathan M Davis via Digitalmars-d wrote:
 I think that the first step is getting Phobos to work with all ranges of
 character types - be they char, wchar, dchar, or graphemes. Then the
 algorithms themselves will work whether we have auto-decoding or not. With
 that done, we can at minimum tell folks to use byCodeUnit, byChar!T,
 byGrapheme, etc. to get the correct, efficient behavior. Right now, if you
 try to use ranges like byCodeUnit, they work with some of Phobos but not
 enough to really work as a viable replacement to auto-decoding strings.
Great. Could you put together a sample PR so we understand the implications better? Thanks! -- Andrei
May 31 2016
prev sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Monday, 30 May 2016 at 16:34:49 UTC, Jack Stouffer wrote:
 On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
 D1 -> D2 was a vastly more disruptive change than getting rid 
 of auto-decoding would be.
Don't be so sure. All string handling code would become broken, even if it appears to work at first.
Assuming silent breakage is on the table, what would be broken, really? Code that must intentionally count or otherwise operate on code points, sure. But how much of all string handling code is like that?

Perhaps it would be worth trying to silently remove autodecoding and seeing how much of Phobos breaks, as an experiment. Has this been tried before?

(Not saying this is a route we should take, but it doesn't seem to me that it will break "all string handling code" either.)
May 30 2016
next sibling parent reply Seb <seb wilzba.ch> writes:
On Monday, 30 May 2016 at 21:39:14 UTC, Vladimir Panteleev wrote:
 On Monday, 30 May 2016 at 16:34:49 UTC, Jack Stouffer wrote:
 On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
 D1 -> D2 was a vastly more disruptive change than getting rid 
 of auto-decoding would be.
Don't be so sure. All string handling code would become broken, even if it appears to work at first.
Assuming silent breakage is on the table, what would be broken, really? Code that must intentionally count or otherwise operate code points, sure. But how much of all string handling code is like that? Perhaps it would be worth trying to silently remove autodecoding and seeing how much of Phobos breaks, as an experiment. Has this been tried before? (Not saying this is a route we should take, but it doesn't seem to me that it will break "all string handling code" either.)
132 lines in Phobos use auto-decoding - that should be fixable ;-)

See them: http://sprunge.us/hUCL

More details: https://github.com/dlang/phobos/pull/4384
May 30 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/30/16 7:52 PM, Seb wrote:
 On Monday, 30 May 2016 at 21:39:14 UTC, Vladimir Panteleev wrote:
 On Monday, 30 May 2016 at 16:34:49 UTC, Jack Stouffer wrote:
 On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
 D1 -> D2 was a vastly more disruptive change than getting rid of
 auto-decoding would be.
Don't be so sure. All string handling code would become broken, even if it appears to work at first.
Assuming silent breakage is on the table, what would be broken, really? Code that must intentionally count or otherwise operate code points, sure. But how much of all string handling code is like that? Perhaps it would be worth trying to silently remove autodecoding and seeing how much of Phobos breaks, as an experiment. Has this been tried before? (Not saying this is a route we should take, but it doesn't seem to me that it will break "all string handling code" either.)
132 lines in Phobos use auto-decoding - that should be fixable ;-) See them: http://sprunge.us/hUCL More details: https://github.com/dlang/phobos/pull/4384
Thanks for this investigation! Results are about as I'd have speculated. -- Andrei
May 30 2016
prev sibling parent Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 30 May 2016 at 21:39:14 UTC, Vladimir Panteleev wrote:
 Perhaps it would be worth trying to silently remove 
 autodecoding and seeing how much of Phobos breaks, as an 
 experiment. Has this been tried before?
Did it; the result is that a large number of Phobos modules fail to compile because of template constraints that test for is(Unqual!(ElementType!S2) == dchar). As a result, anything that imports std.format or std.uni fails to compile. Also, I see some errors caused by the fact that is(string.front == immutable) now holds.

It's hard to find specifics because D halts execution after one test failure.
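For illustration, the kind of constraint that breaks; the signature is hypothetical, not a specific Phobos function:

import std.range : ElementType, isInputRange;
import std.traits : Unqual;

// Assumes that iterating a string yields dchar, which is exactly what
// autodecoding provides and what its removal silently invalidates.
void process(S)(S str)
    if (isInputRange!S && is(Unqual!(ElementType!S) == dchar))
{
    // ...
}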
May 30 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 12:25 PM, Nick Sabalausky wrote:
 On 05/29/2016 09:58 PM, Jack Stouffer wrote:
 The problem is not active users. The problem is companies who have > 10K
 LOC and libraries that are no longer maintained. E.g. It took
 Sociomantic eight years after D2's release to switch only a few parts of
 their projects to D2. With the loss of old libraries/old code (even old
 answers on SO), all of a sudden you lose a lot of the network effect
 that makes programming languages much more useful.
D1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.
It was also made at a time when the community was smaller by a couple orders of magnitude. -- Andrei
May 30 2016
prev sibling parent reply Chris <wendlec tcd.ie> writes:
On Sunday, 29 May 2016 at 17:35:35 UTC, Nick Sabalausky wrote:
 On 05/12/2016 08:47 PM, Jack Stouffer wrote:

 As much as I agree on the importance of a good smooth migration 
 path, I don't think the "Python 2 vs 3" situation is really all 
 that comparable here. Unlike Python, we wouldn't be maintaining 
 a "with auto-decoding" fork for years and years and years, 
 ensuring nobody ever had a pressing reason to bother migrating. 
 And on top of that, we don't have a culture and design 
 philosophy that promotes "do the lazy thing first and the 
 robust thing never". D users are more likely than dynamic 
 language users to be willing to make a few changes for the sake 
 of improvement.

 Heck, we weather breaking fixes enough anyway. There was even 
 one point within the last couple years where something (forget 
 offhand what it was) was removed from std.datetime and its 
 replacement was added *in the very same compiler release*. No 
 transition period. It was an annoying pain (at least to me), 
 but I got through it fine and never even entertained the 
 thought of just sticking with the old compiler. Not sure most 
 people even noticed it. Point is, in D, even when something 
 does need to change, life goes on fine. As long as we don't 
 maintain a long-term fork ;)

 Naturally, minimizing breakage is important here, but I really 
 don't think Python's UTF migration situation is all that 
 comparable.
I suggest providing an automatic tool (either within the compiler or as a separate program like dfix) to help with the transition. Ideally the tool would advise the user where potential problems are and how to fix them. If it's true that auto decode is unnecessary in many cases, then it shouldn't affect the whole code base. But I might be mistaken here. Maybe we should make a list of the functions where auto decode does make a difference, see how common they are, and work out a strategy from there. Destroy.
May 30 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Mon, 30 May 2016 09:26:09 +0000, Chris <wendlec tcd.ie> wrote:

 If it's true that auto decode is unnecessary in many cases, then
 it shouldn't affect the whole code base. But I might be mistaken
 here. Maybe we should make a list of the functions where auto
 decode does make a difference, see how common they are, and work
 out a strategy from there. Destroy.
It makes a difference for every function. But it still isn't necessary in many cases. It's fairly simple:

code unit == bytes/chars
code point == auto-decode
grapheme* == .byGrapheme

So if for now you used auto-decode you iterated code-points, which works correctly for most scripts in NFC**. And here lies the rub and why people say auto-decoding is unnecessary most of the time: If you are working with XML, CSV or JSON or another structured text format, these all use ASCII characters for their syntax elements. Code unit, code point and graphemes become all the same and auto-decoding just slows you down.

When on the other hand you work with real world international text, you'll want to work with graphemes. One example is putting an ellipsis in long text: "Alle Segeltörns im Überblick" (in NFD, e.g. OS X file name) may display as this with auto-decode: "Alle Segelto…¨berblick" and this with byGrapheme: "Alle Segeltö…Überblick".

But at that point you are likely also in need of localized sorting of strings, a set of algorithms that may change with the rise and fall of nations or reformations. So you'll use the platform's go-to Unicode library instead of what Phobos offers. For Java and Linux that would be ICU***. That last point makes me think we should not bother much with decoding in Phobos at all. Odds are we miss other capabilities to make good use of it. Users of auto-decode should review their code to see if code-points is really what they want and potentially switch to no-decoding or .byGrapheme.

* What we typically perceive as one unit in written text.
** A normalization form where e.g. 'ö' is a single code-point, as opposed to NFD, where 'ö' would be assembled from the two 'o' and '¨' code-points as in OS X file names.
*** http://site.icu-project.org/home#TOC-What-is-ICU-

-- 
Marco
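A small sketch of that ellipsis example in code, assuming std.uni's byGrapheme and byCodePoint (the full-width column issue mentioned earlier still applies):

import std.array : array;
import std.conv : to;
import std.range : take, walkLength;
import std.uni : byCodePoint, byGrapheme;

// Truncate to at most n user-perceived characters, keeping combining marks
// attached to their base characters.
string ellipsize(string s, size_t n)
{
    if (s.byGrapheme.walkLength <= n)
        return s;
    return s.byGrapheme.take(n).byCodePoint.array.to!string ~ "…";
}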
May 30 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:

 *** http://site.icu-project.org/home#TOC-What-is-ICU-
I was actually talking about ICU with a colleague today. Could it be that Unicode itself is broken? I've often heard criticism of Unicode but never looked into it.
May 30 2016
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Mon, 30 May 2016 17:35:36 +0000, Chris <wendlec tcd.ie> wrote:

 I was actually talking about ICU with a colleague today. Could it 
 be that Unicode itself is broken? I've often heard criticism of 
 Unicode but never looked into it.
You have to compare to the situation before, when every operating system with every localization had its own encoding. Have some text file with ASCII art in a DOS code page? Doesn't render on Windows with the same locale. Open Cyrillic text on a Latin system? Indigestible. Someone wrote a website on Windows and incorrectly tagged it with an ISO charset? The browser has to fix it up for them.

One objection I remember was the Han Unification: https://en.wikipedia.org/wiki/Han_unification
Not everyone liked how Chinese, Japanese, Korean were represented with a common set of ideograms. At the time Unicode was still 16-bit and the unified symbols would already make up 32% of all code points.

In my eyes many of the perceived problems of Unicode stem from the fact that it raises awareness of different writing systems all over the globe, in a way we didn't have to deal with when software was developed locally instead of globally on GitHub, when the target was Windows instead of cross-platform and mobile, when we were lucky if we localized for a couple of Latin languages, but Asia was a real barrier.

I don't know what you and your colleague discussed about ICU, but likely whether you should add another dependency and what alternatives there are. In Linux user space, almost everything is an outside project, an extra library, most of them with alternatives. My own research led me to the point where I came to think that there was one set of libraries without real alternatives:

ICU -> HarfBuzz -> Pango

That's the go-to chain for Unicode text: from text processing over rendering to layouting. Moreover many successful open-source projects make use of it: LibreOffice, sqlite, Qt, libxml2, WebKit to name a few. Unicode is here to stay, no matter what could have been done better in the past, and I think it is perfectly safe to bet on ICU on Linux for what e.g. Windows has built-in. Otherwise just do as Adam Ruppe said:
 Don't mess with strings. Get them from the user, store them
 without modification, spit them back out again.
:p -- Marco
May 30 2016
prev sibling parent reply Joakim <dlang joakim.fea.st> writes:
On Monday, 30 May 2016 at 17:35:36 UTC, Chris wrote:
 On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:

 *** http://site.icu-project.org/home#TOC-What-is-ICU-
I was actually talking about ICU with a colleague today. Could it be that Unicode itself is broken? I've often heard criticism of Unicode but never looked into it.
Part of it is the complexity of written language, part of it is bad technical decisions. Building the default string type in D around the horrible UTF-8 encoding was a fundamental mistake, both in terms of efficiency and complexity. I noted this in one of my first threads in this forum, and as Andrei said at the time, nobody agreed with me, with a lot of hand-waving about how efficiency wasn't an issue or that UTF-8 arrays were fine. Fast-forward years later and exactly the issues I raised are now causing pain.

UTF-8 is an antiquated hack that needs to be eradicated. It forces all other languages than English to be twice as long, for no good reason, have fun with that when you're downloading text on a 2G connection in the developing world. It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time till UTF-8 is ditched.

D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_

The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language! This is madness.

Yes, the complexity of diacritics and combining characters will remain, but that is complexity that is inherent to the variety of written language. UTF-8 is not: it is just a bad technical decision, likely chosen for ASCII compatibility and some misguided notion that being able to combine arbitrary language strings with no other metadata was worthwhile. It is not.
May 31 2016
next sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 16:29:33 Joakim via Digitalmars-d wrote:
 UTF-8 is an antiquated hack that needs to be eradicated.  It
 forces all other languages than English to be twice as long, for
 no good reason, have fun with that when you're downloading text
 on a 2G connection in the developing world.  It is unnecessarily
 inefficient, which is precisely why auto-decoding is a problem.
 It is only a matter of time till UTF-8 is ditched.
Considering that *nix land uses UTF-8 almost exclusively, and many C libraries do even on Windows, I very much doubt that UTF-8 is going anywhere anytime soon - if ever. The Win32 API is UTF-16, but the vast sea of code that is C or C++ generally uses UTF-8, as do plenty of other programming languages.

And even aside from English, most European languages are going to be more efficient with UTF-8, because they're still primarily ASCII even if they contain characters that are not. Stuff like Chinese is definitely worse in UTF-8 than it would be in UTF-16, but there are a lot of languages other than English which are going to encode better with UTF-8 than UTF-16 - let alone UTF-32.

Regardless, UTF-8 isn't going anywhere anytime soon. _Way_ too much uses it for it to be going anywhere, and most folks have no problem with that. Any attempt to get rid of it would be a huge, uphill battle. But D supports UTF-8, UTF-16, _and_ UTF-32 natively - even without involving the standard library - so anyone who wants to avoid UTF-8 is free to do so.

- Jonathan M Davis
May 31 2016
parent Joakim <dlang joakim.fea.st> writes:
On Tuesday, 31 May 2016 at 18:34:54 UTC, Jonathan M Davis wrote:
 On Tuesday, May 31, 2016 16:29:33 Joakim via Digitalmars-d 
 wrote:
 UTF-8 is an antiquated hack that needs to be eradicated.  It 
 forces all other languages than English to be twice as long, 
 for no good reason, have fun with that when you're downloading 
 text on a 2G connection in the developing world.  It is 
 unnecessarily inefficient, which is precisely why 
 auto-decoding is a problem. It is only a matter of time till 
 UTF-8 is ditched.
 Considering that *nix land uses UTF-8 almost exclusively, and many C libraries do even on Windows, I very much doubt that UTF-8 is going anywhere anytime soon - if ever. The Win32 API is UTF-16, but the vast sea of code that is C or C++ generally uses UTF-8, as do plenty of other programming languages.
I agree that both UTF encodings are somewhat popular now.
 And even aside from English, most European languages are going 
 to be more efficient with UTF-8, because they're still 
 primarily ASCII even if they contain characters that are not. 
 Stuff like Chinese is definitely worse in UTF-8 than it would 
 be in UTF-16, but there are a lot of languages other than 
 English which are going to encode better with UTF-8 than UTF-16 
 - let alone UTF-32.
And there are a lot more languages that will be twice as long as English, i.e. ASCII.
 Regardless, UTF-8 isn't going anywhere anytime soon. _Way_ too 
 much uses it for it to be going anywhere, and most folks have 
 no problem with that. Any attempt to get rid of it would be a 
 huge, uphill battle.
I disagree, it is inevitable. Any tech so complex and inefficient cannot last long.
 But D supports UTF-8, UTF-16, _and_ UTF-32 natively - even 
 without involving the standard library - so anyone who wants to 
 avoid UTF-8 is free to do so.
Yes, but not by using UTF-16/32, which use too much memory. I've suggested a single-byte encoding for most languages instead, both in my last post and the earlier thread. D could use this new encoding internally, while keeping its current UTF-8/16 strings around for any outside UTF-8/16 data passed in. Any of that data run through algorithms that don't require decoding could be kept in UTF-8, but the moment any decoding is required, D would translate UTF-8 to the new encoding, which would be much easier for programmers to understand and manipulate. If UTF-8 output is needed, you'd have to encode back again. Yes, this translation layer would be a bit of a pain, but the new encoding would be so much more efficient and understandable that it would be worth it, and you're already decoding and encoding back to UTF-8 for those algorithms now. All that's changing is that you're using a new and different encoding than dchar as the default. If it succeeds for D, it could then be sold more widely as a replacement for UTF-8/16. I think this would be the right path forward, not navigating this UTF-8/16 mess further.
May 31 2016
prev sibling next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Tue, 31 May 2016 16:29:33 +0000, Joakim <dlang joakim.fea.st> wrote:

 Part of it is the complexity of written language, part of it is
 bad technical decisions.  Building the default string type in D
 around the horrible UTF-8 encoding was a fundamental mistake,
 both in terms of efficiency and complexity.  I noted this in one
 of my first threads in this forum, and as Andrei said at the
 time, nobody agreed with me, with a lot of hand-waving about how
 efficiency wasn't an issue or that UTF-8 arrays were fine.
 Fast-forward years later and exactly the issues I raised are now
 causing pain.
Maybe you can dig up your old post and we can look at each of your complaints in detail.
 UTF-8 is an antiquated hack that needs to be eradicated.  It
 forces all other languages than English to be twice as long, for
 no good reason, have fun with that when you're downloading text
 on a 2G connection in the developing world.  It is unnecessarily
 inefficient, which is precisely why auto-decoding is a problem.
 It is only a matter of time till UTF-8 is ditched.
You don't download twice the data. First of all, some languages had two-byte encodings before UTF-8, and second web content is full of HTML syntax and gzip compressed afterwards.

Take this Thai Wikipedia entry for example:
https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2

The download of the gzipped html is 11% larger in UTF-8 than in Thai TIS-620 single-byte encoding. And that is dwarfed by the size of JS + images. (I don't have the numbers, but I expect the effective overhead to be ~2%).

Ironically a lot of symbols we take for granted would then have to be implemented as HTML entities using their Unicode code points(sic!). Amongst them basic stuff like dashes, degree (°) and minute (′), accents in names, non-breaking space or footnotes (↑).
 D devs should lead the way in getting rid of the UTF-8 encoding,
 not bickering about how to make it more palatable.  I suggested a
 single-byte encoding for most languages, with double-byte for the
 ones which wouldn't fit in a byte.  Use some kind of header or
 other metadata to combine strings of different languages, _rather
 than encoding the language into every character!_
That would have put D on an island. "Some kind of header" would be a horrible mess to have in strings, because you have to account for it when concatenating strings and scan for them all the time to see if there is some interspersed 2 byte encoding in the stream. That's hardly better than UTF-8. And yes, a huge amount of websites mix scripts and a lot of other text uses the available extra symbols like ° or α,β,γ.
 The common string-handling use case, by far, is strings with only
 one language, with a distant second some substrings in a second
 language, yet here we are putting the overhead into every
 character to allow inserting characters from an arbitrary
 language!  This is madness.
No thx, madness was when we couldn't reliably open text files, because nowhere was the encoding stored and when you had to compile programs for each of a dozen codepages, so localized text would be rendered correctly. And your retro codepage system won't convince the world to drop Unicode either.
 Yes, the complexity of diacritics and combining characters will
 remain, but that is complexity that is inherent to the variety of
 written language.  UTF-8 is not: it is just a bad technical
 decision, likely chosen for ASCII compatibility and some
 misguided notion that being able to combine arbitrary language
 strings with no other metadata was worthwhile.  It is not.
The web proves you wrong. Scripts do get mixed often. Be it Wikipedia, a foreign language learning site or mathematical symbols. -- Marco
May 31 2016
next sibling parent Joakim <dlang joakim.fea.st> writes:
On Tuesday, 31 May 2016 at 20:20:46 UTC, Marco Leise wrote:
 Am Tue, 31 May 2016 16:29:33 +0000
 schrieb Joakim <dlang joakim.fea.st>:

 Part of it is the complexity of written language, part of it 
 is bad technical decisions.  Building the default string type 
 in D around the horrible UTF-8 encoding was a fundamental 
 mistake, both in terms of efficiency and complexity.  I noted 
 this in one of my first threads in this forum, and as Andrei 
 said at the time, nobody agreed with me, with a lot of 
 hand-waving about how efficiency wasn't an issue or that UTF-8 
 arrays were fine. Fast-forward years later and exactly the 
 issues I raised are now causing pain.
Maybe you can dig up your old post and we can look at each of your complaints in detail.
Not interested. I believe you were part of that thread then. Google it if you want to read it again.
 UTF-8 is an antiquated hack that needs to be eradicated.  It 
 forces all other languages than English to be twice as long, 
 for no good reason, have fun with that when you're downloading 
 text on a 2G connection in the developing world.  It is 
 unnecessarily inefficient, which is precisely why 
 auto-decoding is a problem. It is only a matter of time till 
 UTF-8 is ditched.
You don't download twice the data. First of all, some languages had two-byte encodings before UTF-8, and second web content is full of HTML syntax and gzip compressed afterwards.
The vast majority can be encoded in a single byte, and are unnecessarily forced to two bytes by the inefficient UTF-8/16 encodings. HTML syntax is a non sequitur; compression helps but isn't as efficient as a proper encoding.
 Take this Thai Wikipedia entry for example:
 https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2
 The download of the gzipped html is 11% larger in UTF-8 than
 in Thai TIS-620 single-byte encoding. And that is dwarfed by
 the size of JS + images. (I don't have the numbers, but I
 expect the effective overhead to be ~2%).
Nobody on a 2G connection is waiting minutes to download such massive web pages. They are mostly sending text to each other on their favorite chat app, and waiting longer and using up more of their mobile data quota if they're forced to use bad encodings.
 Ironically a lot of symbols we take for granted would then
 have to be implemented as HTML entities using their Unicode
 code points(sic!). Amongst them basic stuff like dashes, degree
 (°) and minute (′), accents in names, non-breaking space or
 footnotes (↑).
No, they just don't use HTML, opting for much superior mobile apps instead. :)
 D devs should lead the way in getting rid of the UTF-8 
 encoding, not bickering about how to make it more palatable.  
 I suggested a single-byte encoding for most languages, with 
 double-byte for the ones which wouldn't fit in a byte.  Use 
 some kind of header or other metadata to combine strings of 
 different languages, _rather than encoding the language into 
 every character!_
That would have put D on an island. "Some kind of header" would be a horrible mess to have in strings, because you have to account for it when concatenating strings and scan for them all the time to see if there is some interspersed 2 byte encoding in the stream. That's hardly better than UTF-8. And yes, a huge amount of websites mix scripts and a lot of other text uses the available extra symbols like ° or α,β,γ.
Let's see: a constant-time addition to a header or constantly decoding every character every time I want to manipulate the string... I wonder which is a better choice?! You would not "intersperse" any other encodings, unless you kept track of those substrings in the header. My whole point is that such mixing of languages or "extra symbols" is an extreme minority use case: the vast majority of strings are a single language.
 The common string-handling use case, by far, is strings with 
 only one language, with a distant second some substrings in a 
 second language, yet here we are putting the overhead into 
 every character to allow inserting characters from an 
 arbitrary language!  This is madness.
No thx, madness was when we couldn't reliably open text files, because nowhere was the encoding stored and when you had to compile programs for each of a dozen codepages, so localized text would be rendered correctly. And your retro codepage system won't convince the world to drop Unicode either.
Unicode _is_ a retro codepage system, they merely standardized a bunch of the most popular codepages. So that's not going away no matter what system you use. :)
 Yes, the complexity of diacritics and combining characters 
 will remain, but that is complexity that is inherent to the 
 variety of written language.  UTF-8 is not: it is just a bad 
 technical decision, likely chosen for ASCII compatibility and 
 some misguided notion that being able to combine arbitrary 
 language strings with no other metadata was worthwhile.  It is 
 not.
The web proves you wrong. Scripts do get mixed often. Be it Wikipedia, a foreign language learning site or mathematical symbols.
Those are some of the least-trafficked parts of the web, which itself is dying off as the developing world comes online through mobile apps, not the bloated web stack. Anyway, I'm not interested in rehashing this dumb argument again. The UTF-8/16 encodings are a horrible mess, and D made a big mistake by baking them in.
May 31 2016
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 31.05.2016 22:20, Marco Leise wrote:
 Am Tue, 31 May 2016 16:29:33 +0000
 schrieb Joakim<dlang joakim.fea.st>:

Part of it is the complexity of written language, part of it is
bad technical decisions.  Building the default string type in D
around the horrible UTF-8 encoding was a fundamental mistake,
both in terms of efficiency and complexity.  I noted this in one
of my first threads in this forum, and as Andrei said at the
time, nobody agreed with me, with a lot of hand-waving about how
efficiency wasn't an issue or that UTF-8 arrays were fine.
Fast-forward years later and exactly the issues I raised are now
causing pain.
Maybe you can dig up your old post and we can look at each of your complaints in detail.
It is probably this one. Not sure what "exactly the issues" are though. http://forum.dlang.org/thread/bwbuowkblpdxcpysejpb forum.dlang.org
May 31 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/31/2016 1:20 PM, Marco Leise wrote:
 [...]
I agree. I dealt with the madness of code pages, Shift-JIS, EBCDIC, locales, etc., in the pre-Unicode days. Despite its problems, Unicode (and UTF-8) is a major improvement, and I mean major. 16 years ago, I bet that Unicode was the future, and events have shown that to be correct. But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so D bet on all three. If I had a do-over, I'd just support UTF-8. UTF-16 is useful pretty much only as a transitional encoding to talk with Windows APIs. Nobody uses UCS-2 (it consumes far too much memory).
May 31 2016
next sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/01/2016 12:47 AM, Walter Bright wrote:
 But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so
 D bet on all three. If I had a do-over, I'd just support UTF-8. UTF-16
 is useful pretty much only as a transitional encoding to talk with
 Windows APIs. Nobody uses UCS-2 (it consumes far too much memory).
Wikipedia says [1] that UCS-2 is essentially UTF-16 without surrogate pairs. I suppose you mean UTF-32/UCS-4. [1] https://en.wikipedia.org/wiki/UTF-16
May 31 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/31/2016 4:00 PM, ag0aep6g wrote:
 Wikipedia says [1] that UCS-2 is essentially UTF-16 without surrogate pairs. I
 suppose you mean UTF-32/UCS-4.
 [1] https://en.wikipedia.org/wiki/UTF-16
Thanks for the correction.
May 31 2016
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Tue, 31 May 2016 15:47:02 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so D bet 
 on all three. If I had a do-over, I'd just support UTF-8. UTF-16 is useful 
 pretty much only as a transitional encoding to talk with Windows APIs.
I think so too, although more APIs than just Windows use UTF-16. Think of Java or ICU. Aside from their Java heritage they found that it is the fastest encoding for transcoding from and to Unicode as UTF-16 codepoints cover most 8-bit codepages. Also Qt defined a char as UTF-16 code point, but they probably regret it as the 'charmap' program KCharSelect is now unable to show Unicode characters >= 0x10000. -- Marco
May 31 2016
prev sibling next sibling parent reply ag0aep6g <anonymous example.com> writes:
On 05/31/2016 06:29 PM, Joakim wrote:
 D devs should lead the way in getting rid of the UTF-8 encoding, not
 bickering about how to make it more palatable.  I suggested a
 single-byte encoding for most languages, with double-byte for the ones
 which wouldn't fit in a byte.  Use some kind of header or other metadata
 to combine strings of different languages, _rather than encoding the
 language into every character!_
Guys, may I ask you to move this discussion to a new thread? I'd like to follow the (already crowded) autodecode thing, and this is really a separate topic.
May 31 2016
parent Joakim <dlang joakim.fea.st> writes:
On Tuesday, 31 May 2016 at 20:28:32 UTC, ag0aep6g wrote:
 On 05/31/2016 06:29 PM, Joakim wrote:
 D devs should lead the way in getting rid of the UTF-8 
 encoding, not
 bickering about how to make it more palatable.  I suggested a
 single-byte encoding for most languages, with double-byte for 
 the ones
 which wouldn't fit in a byte.  Use some kind of header or 
 other metadata
 to combine strings of different languages, _rather than 
 encoding the
 language into every character!_
Guys, may I ask you to move this discussion to a new thread? I'd like to follow the (already crowded) autodecode thing, and this is really a separate topic.
No, this is the root of the problem, but I'm not interested in debating it, so you can go back to discussing how to avoid the elephant in the room.
May 31 2016
prev sibling parent reply Marc Schütz <schuetzm gmx.net> writes:
On Tuesday, 31 May 2016 at 16:29:33 UTC, Joakim wrote:
 UTF-8 is an antiquated hack that needs to be eradicated.  It 
 forces all other languages than English to be twice as long, 
 for no good reason, have fun with that when you're downloading 
 text on a 2G connection in the developing world.
I assume you're talking about the web here. In this case, plain text makes up only a minor part of the entire traffic, the majority of which is images (binary data), javascript and stylesheets (almost pure ASCII), and HTML markup (ditto). It's likely not significant even without taking compression into account, which is ubiquitous.
 It is unnecessarily inefficient, which is precisely why 
 auto-decoding is a problem.
No, inefficiency is the least of the problems with auto-decoding.
 It is only a matter of time till UTF-8 is ditched.
This is ridiculous, even if your other claims were true.
 D devs should lead the way in getting rid of the UTF-8 
 encoding, not bickering about how to make it more palatable.  I 
 suggested a single-byte encoding for most languages, with 
 double-byte for the ones which wouldn't fit in a byte.  Use 
 some kind of header or other metadata to combine strings of 
 different languages, _rather than encoding the language into 
 every character!_
I think I remember that post, and - sorry to be so blunt - it was one of the worst things I've ever seen proposed regarding text encoding.
 The common string-handling use case, by far, is strings with 
 only one language, with a distant second some substrings in a 
 second language, yet here we are putting the overhead into 
 every character to allow inserting characters from an arbitrary 
 language!  This is madness.
No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.
Jun 01 2016
parent reply Joakim <dlang joakim.fea.st> writes:
On Wednesday, 1 June 2016 at 10:04:42 UTC, Marc Schütz wrote:
 On Tuesday, 31 May 2016 at 16:29:33 UTC, Joakim wrote:
 UTF-8 is an antiquated hack that needs to be eradicated.  It 
 forces all other languages than English to be twice as long, 
 for no good reason, have fun with that when you're downloading 
 text on a 2G connection in the developing world.
I assume you're talking about the web here. In this case, plain text makes up only a minor part of the entire traffic, the majority of which is images (binary data), javascript and stylesheets (almost pure ASCII), and HTML markup (ditto). It's likely not significant even without taking compression into account, which is ubiquitous.
No, I explicitly said not the web in a subsequent post. The ignorance here of what 2G speeds are like is mind-boggling.
 It is unnecessarily inefficient, which is precisely why 
 auto-decoding is a problem.
No, inefficiency is the least of the problems with auto-decoding.
Right... that's why this 200-post thread was spawned with that as the main reason.
 It is only a matter of time till UTF-8 is ditched.
This is ridiculous, even if your other claims were true.
The UTF-8 encoding is what's ridiculous.
 D devs should lead the way in getting rid of the UTF-8 
 encoding, not bickering about how to make it more palatable.  
 I suggested a single-byte encoding for most languages, with 
 double-byte for the ones which wouldn't fit in a byte.  Use 
 some kind of header or other metadata to combine strings of 
 different languages, _rather than encoding the language into 
 every character!_
I think I remember that post, and - sorry to be so blunt - it was one of the worst things I've ever seen proposed regarding text encoding.
Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.
 The common string-handling use case, by far, is strings with 
 only one language, with a distant second some substrings in a 
 second language, yet here we are putting the overhead into 
 every character to allow inserting characters from an 
 arbitrary language!  This is madness.
No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.
Lol, this may be the dumbest argument put forth yet. I don't think anyone here even understands what a good encoding is and what it's for, which is why there's no point in debating this.
Jun 01 2016
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 01 Jun 2016 13:57:27 +0000
schrieb Joakim <dlang joakim.fea.st>:

 No, I explicitly said not the web in a subsequent post.  The 
 ignorance here of what 2G speeds are like is mind-boggling.
I've used 56k and had a phone conversation with my sister while she was downloading a 800 MiB file over 2G. You just learn to be patient (or you already are when the next major city is hundreds of kilometers away) and load only what you need. Your point about the costs convinced me more. Here is one article spiced up with numbers and figures: http://www.thequint.com/technology/2016/05/30/almost-every-indian-may-be-online-if-data-cost-cut-to-one-third But even if you could prove with a study that UTF-8 caused a notable bandwidth cost in real life, it would - I think - be a matter of regional ISPs to provide special servers and apps that reduce data volume. There is also the overhead of key exchange when establishing a secure connection: http://stackoverflow.com/a/20306907/4038614 Something every app should do, but will increase bandwidth use. Then there is the overhead of using XML in applications like WhatsApp, which I presume is quite popular around the world. I'm just trying to broaden the view a bit here. This note from the XMPP that WhatsApp and Jabber use will make you cringe: https://tools.ietf.org/html/rfc6120#section-11.6 -- Marco
Jun 01 2016
parent reply Joakim <dlang joakim.fea.st> writes:
On Wednesday, 1 June 2016 at 14:58:47 UTC, Marco Leise wrote:
 Am Wed, 01 Jun 2016 13:57:27 +0000
 schrieb Joakim <dlang joakim.fea.st>:

 No, I explicitly said not the web in a subsequent post.  The 
 ignorance here of what 2G speeds are like is mind-boggling.
I've used 56k and had a phone conversation with my sister while she was downloading a 800 MiB file over 2G. You just learn to be patient (or you already are when the next major city is hundreds of kilometers away) and load only what you need. Your point about the costs convinced me more.
I see that max 2G speeds are 100-200 kbits/s. At that rate, it would have taken her more than 10 hours to download such a large file, that's nuts. The worst part is when the download gets interrupted and you have to start over again because most download managers don't know how to resume, including the stock one on Android. Also, people in these countries buy packs of around 100-200 MB for 30-60 US cents, so they would never download such a large file. They use messaging apps like Whatsapp or WeChat, which nobody in the US uses, to avoid onerous SMS charges.
 Here is one article spiced up with numbers and figures: 
 http://www.thequint.com/technology/2016/05/30/almost-every-indian-may-be-online-if-data-cost-cut-to-one-third
Yes, only the middle class, which are at most 10-30% of the population in these developing countries, can even afford 2G. The way to get costs down even further is to make the tech as efficient as possible. Of course, much of the rest of the population are illiterate, so there are bigger problems there.
 But even if you could prove with a study that UTF-8 caused a
 notable bandwidth cost in real life, it would - I think - be a
 matter of regional ISPs to provide special servers and apps
 that reduce data volume.
Yes, by ditching UTF-8.
 There is also the overhead of
 key exchange when establishing a secure connection:
 http://stackoverflow.com/a/20306907/4038614
 Something every app should do, but will increase bandwidth use.
That's not going to happen, even HTTP/2 ditched that requirement. Also, many of those countries' govts will not allow it: google how Blackberry had to give up their keys for "secure" BBM in many countries. It's not just Canada and the US spying on their citizens.
 Then there is the overhead of using XML in applications
 like WhatsApp, which I presume is quite popular around the
 world. I'm just trying to broaden the view a bit here.
I didn't know they used XML. Googling it now, I see mention that they switched to an "internally developed protocol" at some point, so I doubt they're using XML now.
 This note from the XMPP that WhatsApp and Jabber use will make
 you cringe: https://tools.ietf.org/html/rfc6120#section-11.6
Haha, no wonder Jabber is dead. :) I jumped on Jabber for my own messages a decade ago, as it seemed like an open way out of that proprietary messaging mess, then I read that they're using XML and gave up on it. On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
 On Wednesday, 1 June 2016 at 13:57:27 UTC, Joakim wrote:
 No, I explicitly said not the web in a subsequent post.  The 
 ignorance here of what 2G speeds are like is mind-boggling.
It's not hard. I think a lot of us remember when a 14.4 modem was cutting-edge.
Well, then apparently you're unaware of how bloated web pages are nowadays. It used to take me minutes to download popular web pages _back then_ at _top speed_, and those pages were a _lot_ smaller.
 Codepages and incompatible encodings were terrible then, too.

 Never again.
This only shows you probably don't know the difference between an encoding and a code page, which are orthogonal concepts in Unicode. It's not surprising, as Walter and many others responding show the same ignorance. I explained this repeatedly in the previous thread, but it depends on understanding the tech, and I can't spoon-feed that to everyone.
 Well, when you _like_ a ludicrous encoding like UTF-8, not 
 sure your opinion matters.
It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.
I think we can do a lot better.
 No. The common string-handling use case is code that is 
 unaware which script (not language, btw) your text is in.
Lol, this may be the dumbest argument put forth yet.
This just makes it feel like you're trolling. You're not just trolling, right?
Are you trolling? Because I was just calling it like it is. The vast majority of software is written for _one_ language, the local one. You may think otherwise because the software that sells the most and makes the most money is internationalized software like Windows or iOS, because it can be resold into many markets. But as a percentage of lines of code written, such international code is almost nothing.
 I don't think anyone here even understands what a good 
 encoding is and what it's for, which is why there's no point 
 in debating this.
And I don't think you realise how backwards you sound to people who had to live through the character encoding hell of the past. This has been an ongoing headache for the better part of a century (it still comes up in old files, sites, and systems) and you're literally the only person I've ever seen seriously suggest we turn back now that the madness has been somewhat tamed.
No, I have never once suggested "turning back." I have suggested a new scheme that retains one technical aspect of the prior schemes, ie constant-width encoding for each language, with a single byte sufficing for most. _You and several others_, including Walter, see that and automatically translate that to, "He wants EBCDIC to come back!," as though that were the only possible single-byte encoding and largely ignoring the possibilities of the header scheme I suggested. I could call that "trolling" by all of you, :) but I'll instead call it what it likely is, reactionary thinking, and move on.
 If you have to deal with delivering the fastest possible i18n 
 at GSM data rates, well, that's a tough problem and it sounds 
 like you might need to do something pretty special. Turning the 
 entire ecosystem into your special case is not the answer.
I don't think you understand: _you_ are the special case. The 5 billion people outside the US and EU are _not the special case_. Yes, they have not mattered so far, because they were too poor to buy computers. But the "computers" with the most sales these days are smartphones, and Motorola just launched their new Moto G4 in India and Samsung their new C5 and C7 in China. They didn't bother announcing release dates for these mid-range phones- well, they're high-end in those countries- in the US. That's because "computer" sales in all these non-ASCII countries now greatly outweighs the US. Now, a large majority of people in those countries don't have smartphones or text each other, so a significant chunk of the minority who do buy mostly ~$100 smartphones over there can likely afford a fatter text encoding and I don't know what encodings these developing markets are commonly using now. The problem is all the rest, and those just below who cannot afford it at all, in part because the tech is not as efficient as it could be yet. Ditching UTF-8 will be one way to make it more efficient. On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:
 Indeed, Joakim's proposal is so insane it beggars belief (why 
 not go back to baudot encoding, it's only 5 bit, hurray, it's 
 so much faster when used with flag semaphores).
I suspect you don't understand my proposal.
 As a programmer in the European Commission translation unit, 
 working on the probably biggest translation memory in the world 
 for 14 years, I can attest that Unicode is a blessing. When I 
 remember the shit we had in our documents because of the code 
 pages before most programs could handle utf-8 or utf-16 (and 
 before 2004 we only had 2 alphabets to take care of, Western 
 and Greek). What Joakim does not understand, is that there are 
 huge, huge quantities of documents that are multi-lingual.
Oh, I'm well aware of this. I just think a variable-length encoding like UTF-8 or UTF-16 is a bad design. And what you have to realize is that most strings in most software will only have one language. Anyway, the scheme I sketched out handles multiple languages: it just doesn't optimize for completely random jumbles of characters from every possible language, which is what UTF-8 is optimized for and is a ridiculous decision.
 Translators of course handle nearly exclusively with at least 
 bi-lingual documents. Any document encountered by a translator 
 must at least be able to present the source and the target 
 language. But even outside of that specific population, 
 multilingual documents are very, very common.
You are likely biased by the fact that all your documents are bilingual: they're _not_ common for the vast majority of users. Even if they were, UTF-8 is as suboptimal, compared to the constant-width encoding scheme I've sketched, for bilingual or even trilingual documents as it is for a single language, so even if I were wrong about their frequency, it wouldn't matter.
Jun 01 2016
parent reply Wyatt <wyatt.epp gmail.com> writes:
On Wednesday, 1 June 2016 at 16:45:04 UTC, Joakim wrote:
 On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
 It's not hard.  I think a lot of us remember when a 14.4 modem 
 was cutting-edge.
Well, then apparently you're unaware of how bloated web pages are nowadays. It used to take me minutes to download popular web pages _back then_ at _top speed_, and those pages were a _lot_ smaller.
It's telling that you think the encoding of the text is anything but the tiniest fraction of the problem. You should look at where the actual weight of a "modern" web page comes from.
 Codepages and incompatible encodings were terrible then, too.

 Never again.
This only shows you probably don't know the difference between an encoding and a code page,
"I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_" Yeah, that? That's codepages. And your exact proposal to put encodings in the header was ALSO tried around the time that Unicode was getting hashed out. It sucked. A lot. (Not as bad as storing it in the directory metadata, though.)
 Well, when you _like_ a ludicrous encoding like UTF-8, not 
 sure your opinion matters.
It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.
I think we can do a lot better.
Maybe. But no one's done it yet.
 The vast majority of software is written for _one_ language, 
 the local one.  You may think otherwise because the software 
 that sells the most and makes the most money is 
 internationalized software like Windows or iOS, because it can 
 be resold into many markets.  But as a percentage of lines of 
 code written, such international code is almost nothing.
I'm surprised you think this even matters after talking about web pages. The browser is your most common string processing situation. Nothing else even comes close.
 largely ignoring the possibilities of the header scheme I 
 suggested.
"Possibilities" that were considered and discarded decades ago by people with way better credentials. The era of single-byte encodings is gone, it won't come back, and good riddance to bad rubbish.
 I could call that "trolling" by all of you, :) but I'll instead 
 call it what it likely is, reactionary thinking, and move on.
It's not trolling to call you out for clearly not doing your homework.
 I don't think you understand: _you_ are the special case.
Oh, I understand perfectly. _We_ (whoever "we" are) can handle any sequence of glyphs and combining characters (correctly-formed or not) in any language at any time, so we're the special case...? Yeah, it sounds funny to me, too.
 The 5 billion people outside the US and EU are _not the special 
 case_.
Fortunately, it works for them too.
 The problem is all the rest, and those just below who cannot 
 afford it at all, in part because the tech is not as efficient 
 as it could be yet.  Ditching UTF-8 will be one way to make it 
 more efficient.
All right, now you've found the special case; the case where the generic, unambiguous encoding may need to be lowered to something else: people for whom that encoding is suboptimal because of _current_ network constraints. I fully acknowledge it's a couple billion people and that's nothing to sneeze at, but I also see that it's a situation that will become less relevant over time. -Wyatt
Jun 01 2016
parent Joakim <dlang joakim.fea.st> writes:
On Wednesday, 1 June 2016 at 18:30:25 UTC, Wyatt wrote:
 On Wednesday, 1 June 2016 at 16:45:04 UTC, Joakim wrote:
 On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
 It's not hard.  I think a lot of us remember when a 14.4 
 modem was cutting-edge.
Well, then apparently you're unaware of how bloated web pages are nowadays. It used to take me minutes to download popular web pages _back then_ at _top speed_, and those pages were a _lot_ smaller.
It's telling that you think the encoding of the text is anything but the tiniest fraction of the problem. You should look at where the actual weight of a "modern" web page comes from.
I'm well aware that text is a small part of it. My point is that they're not downloading those web pages, they're using mobile instead, as I explicitly said in a prior post. My only point in mentioning the web bloat to you is that _your perception_ is off because you seem to think they're downloading _current_ web pages over 2G connections, and comparing it to your downloads of _past_ web pages with modems. Not only did it take minutes for us back then, it takes _even longer_ now. I know the text encoding won't help much with that. Where it will help is the mobile apps they're actually using, not the bloated websites they don't use.
 Codepages and incompatible encodings were terrible then, too.

 Never again.
This only shows you probably don't know the difference between an encoding and a code page,
"I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_" Yeah, that? That's codepages. And your exact proposal to put encodings in the header was ALSO tried around the time that Unicode was getting hashed out. It sucked. A lot. (Not as bad as storing it in the directory metadata, though.)
You know what's also codepages? Unicode. The UCS is a standardized set of code pages for each language, often merely picking the most popular code page at that time. I don't doubt that what I'm saying has been tried in some form before. The question is whether that alternate form would be better if designed and implemented properly, not if a botched design/implementation has ever been attempted.
 Well, when you _like_ a ludicrous encoding like UTF-8, not 
 sure your opinion matters.
It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.
I think we can do a lot better.
Maybe. But no one's done it yet.
That's what people said about mobile devices for a long time, until about a decade ago. It's time we got this right.
 The vast majority of software is written for _one_ language, 
 the local one.  You may think otherwise because the software 
 that sells the most and makes the most money is 
 internationalized software like Windows or iOS, because it can 
 be resold into many markets.  But as a percentage of lines of 
 code written, such international code is almost nothing.
I'm surprised you think this even matters after talking about web pages. The browser is your most common string processing situation. Nothing else even comes close.
No, it's certainly popular software, but at the scale we're talking about, ie all string processing in all software, it's fairly small. And the vast majority of webapps that handle strings passed from a browser are written to only handle one language, the local one.
 largely ignoring the possibilities of the header scheme I 
 suggested.
"Possibilities" that were considered and discarded decades ago by people with way better credentials. The era of single-byte encodings is gone, it won't come back, and good riddance to bad rubbish.
Lol, credentials. :D If you think that matters at all in the face of the blatant stupidity embodied by UTF-8, I don't know what to tell you.
 I could call that "trolling" by all of you, :) but I'll 
 instead call it what it likely is, reactionary thinking, and 
 move on.
It's not trolling to call you out for clearly not doing your homework.
That's funny, because it's precisely you and others who haven't done your homework. So are you all trolling me? By your definition of trolling, which btw is not the standard one, _you_ are the one doing it.
 I don't think you understand: _you_ are the special case.
Oh, I understand perfectly. _We_ (whoever "we" are) can handle any sequence of glyphs and combining characters (correctly-formed or not) in any language at any time, so we're the special case...?
And you're doing so by mostly using a single-byte encoding for _your own_ Euro-centric languages, ie ASCII, while imposing unnecessary double-byte and triple-byte encodings on everyone else, despite their outnumbering you 10 to 1. That is the very definition of a special case.
 Yeah, it sounds funny to me, too.
I'm happy to hear you find your privilege "funny," but I'm sorry to tell you, it won't last.
 The 5 billion people outside the US and EU are _not the 
 special case_.
 Fortunately, it works for them too.
At a higher and unnecessary cost, which is why it won't last.
 The problem is all the rest, and those just below who cannot 
 afford it at all, in part because the tech is not as efficient 
 as it could be yet.  Ditching UTF-8 will be one way to make it 
 more efficient.
All right, now you've found the special case; the case where the generic, unambiguous encoding may need to be lowered to something else: people for whom that encoding is suboptimal because of _current_ network constraints. I fully acknowledge it's a couple billion people and that's nothing to sneeze at, but I also see that it's a situation that will become less relevant over time.
I continue to marvel at your calling a couple billion people "the special case," presumably thinking ~700 million people in the US and EU primarily using the single-byte encoding of ASCII are the general case. As for the continued relevance of such constrained use, I suggest you read the link Marco provided above. The vast majority of the worldwide literate population doesn't have a smartphone or use a cellular data plan, whereas the opposite is true if you include featurephones, largely because they can be used only for voice. As that article notes, costs for smartphones and 2G data plans will have to come down for them to go wider. That will take decades to roll out, though the basic tech design will mostly be done now. The costs will go down by making the tech more efficient, and ditching UTF-8 will be one of the ways the tech will be made more efficient.
Jun 01 2016
prev sibling parent reply Wyatt <wyatt.epp gmail.com> writes:
On Wednesday, 1 June 2016 at 13:57:27 UTC, Joakim wrote:
 No, I explicitly said not the web in a subsequent post.  The 
 ignorance here of what 2G speeds are like is mind-boggling.
It's not hard. I think a lot of us remember when a 14.4 modem was cutting-edge. Codepages and incompatible encodings were terrible then, too. Never again.
 Well, when you _like_ a ludicrous encoding like UTF-8, not sure 
 your opinion matters.
It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.
 No. The common string-handling use case is code that is 
 unaware which script (not language, btw) your text is in.
Lol, this may be the dumbest argument put forth yet.
This just makes it feel like you're trolling. You're not just trolling, right?
 I don't think anyone here even understands what a good encoding 
 is and what it's for, which is why there's no point in debating 
 this.
And I don't think you realise how backwards you sound to people who had to live through the character encoding hell of the past. This has been an ongoing headache for the better part of a century (it still comes up in old files, sites, and systems) and you're literally the only person I've ever seen seriously suggest we turn back now that the madness has been somewhat tamed. If you have to deal with delivering the fastest possible i18n at GSM data rates, well, that's a tough problem and it sounds like you might need to do something pretty special. Turning the entire ecosystem into your special case is not the answer. -Wyatt
Jun 01 2016
next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
 On Wednesday, 1 June 2016 at 13:57:27 UTC, Joakim wrote:
 No, I explicitly said not the web in a subsequent post.  The 
 ignorance here of what 2G speeds are like is mind-boggling.
It's not hard. I think a lot of us remember when a 14.4 modem was cutting-edge. Codepages and incompatible encodings were terrible then, too. Never again.
 Well, when you _like_ a ludicrous encoding like UTF-8, not 
 sure your opinion matters.
It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.
 No. The common string-handling use case is code that is 
 unaware which script (not language, btw) your text is in.
Lol, this may be the dumbest argument put forth yet.
This just makes it feel like you're trolling. You're not just trolling, right?
 I don't think anyone here even understands what a good 
 encoding is and what it's for, which is why there's no point 
 in debating this.
And I don't think you realise how backwards you sound to people who had to live through the character encoding hell of the past. This has been an ongoing headache for the better part of a century (it still comes up in old files, sites, and systems) and you're literally the only person I've ever seen seriously suggest we turn back now that the madness has been somewhat tamed.
Indeed, Joakim's proposal is so insane it beggars belief (why not go back to baudot encoding, it's only 5 bit, hurray, it's so much faster when used with flag semaphores). As a programmer in the European Commission translation unit, working on the probably biggest translation memory in the world for 14 years, I can attest that Unicode is a blessing. When I remember the shit we had in our documents because of the code pages before most programs could handle utf-8 or utf-16 (and before 2004 we only had 2 alphabets to take care of, Western and Greek). What Joakim does not understand, is that there are huge, huge quantities of documents that are multi-lingual. Translators of course handle nearly exclusively with at least bi-lingual documents. Any document encountered by a translator must at least be able to present the source and the target language. But even outside of that specific population, multilingual documents are very, very common.
 If you have to deal with delivering the fastest possible i18n 
 at GSM data rates, well, that's a tough problem and it sounds 
 like you might need to do something pretty special. Turning the 
 entire ecosystem into your special case is not the answer.
Jun 01 2016
parent reply deadalnix <deadalnix gmail.com> writes:
On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:
 What Joakim does not understand, is that there are huge, huge 
 quantities of documents that are multi-lingual.
That should be obvious to anyone living outside the USA.
Jun 01 2016
next sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 06/01/2016 12:26 PM, deadalnix wrote:
 On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:
 What Joakim does not understand, is that there are huge, huge
 quantities of documents that are multi-lingual.
That should be obvious to anyone living outside the USA.
Or anyone in the USA who's ever touched a product that includes a manual or a safety warning, or gone to high school (a foreign language class is pretty much universally mandatory, even in the US).
Jun 01 2016
prev sibling parent Kagamin <spam here.lot> writes:
On Wednesday, 1 June 2016 at 16:26:36 UTC, deadalnix wrote:
 On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter 
 wrote:
 What Joakim does not understand, is that there are huge, huge 
 quantities of documents that are multi-lingual.
That should be obvious to anyone living outside the USA.
https://msdn.microsoft.com/th-th inside too :)
Jun 01 2016
prev sibling parent Kagamin <spam here.lot> writes:
On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
 If you have to deal with delivering the fastest possible i18n 
 at GSM data rates, well, that's a tough problem and it sounds 
 like you might need to do something pretty special. Turning the 
 entire ecosystem into your special case is not the answer.
UTF-8 encoded SMS work fine for me in GSM network, didn't notice any problem.
Jun 01 2016
prev sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:
 When on the other hand you work with real world international 
 text, you'll want to work with graphemes.
Actually, my main rule of thumb is: don't mess with strings. Get them from the user, store them without modification, spit them back out again. Wherever possible, don't do anything more. But if you do have to implement the rest, eh, it depends on what you're doing still. If I want an ellipsis, for example, I like to take font size into account too - basically, I do a dry-run of the whole font render to get the length in pixels, then slice off the partial grapheme... So yeah that's kinda complicated...
May 30 2016
prev sibling next sibling parent Jack Stouffer <jack jackstouffer.com> writes:
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 2. Every time one wants an algorithm to work with both strings 
 and ranges, you wind up special casing the strings to defeat 
 the autodecoding, or to decode the ranges. Having to constantly 
 special case it makes for more special cases when plugging 
 together components. These issues often escape detection when 
 unittesting because it is convenient to unittest only with 
 arrays.
This is a great example of special casing in Phobos that someone showed me: https://github.com/dlang/phobos/blob/master/std/algorithm/searching.d#L1714
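(Not the linked Phobos code itself, just a rough sketch of the pattern it points at; the helper name countElements and its exact behavior are made up for illustration. A generic algorithm ends up special-casing narrow strings just to bypass auto-decoding:)

import std.range.primitives : empty, popFront, isInputRange;
import std.traits : isNarrowString;

// Hypothetical generic algorithm: to avoid auto-decoding it must
// special-case narrow strings, the kind of duplication that shows
// up all over Phobos.
size_t countElements(R)(R r) if (isInputRange!R)
{
    static if (isNarrowString!R)
    {
        // Bypass auto-decoding: count code units directly.
        return r.length;
    }
    else
    {
        size_t n;
        for (; !r.empty; r.popFront())
            ++n;
        return n;
    }
}

unittest
{
    assert(countElements("héllo") == 6);   // code units, not code points
    assert(countElements([1, 2, 3]) == 3);
}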
May 12 2016
prev sibling next sibling parent reply Bill Hicks <billhicks reality.com> writes:
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 Here are some that are not matters of opinion.

 1. Ranges of characters do not autodecode, but arrays of 
 characters do. This is a glaring inconsistency.

 2. Every time one wants an algorithm to work with both strings 
 and ranges, you wind up special casing the strings to defeat 
 the autodecoding, or to decode the ranges. Having to constantly 
 special case it makes for more special cases when plugging 
 together components. These issues often escape detection when 
 unittesting because it is convenient to unittest only with 
 arrays.

 3. Wrapping an array in a struct with an alias this to an array 
 turns off autodecoding, another special case.

 4. Autodecoding is slow and has no place in high speed string 
 processing.

 5. Very few algorithms require decoding.

 6. Autodecoding has two choices when encountering invalid code 
 units - throw or produce an error dchar. Currently, it throws, 
 meaning no algorithms using autodecode can be made nothrow.

 7. Autodecode cannot be used with unicode path/filenames, 
 because it is legal (at least on Linux) to have invalid UTF-8 
 as filenames. It turns out in the wild that pure Unicode is not 
 universal - there's lots of dirty Unicode that should remain 
 unmolested, and autocode does not play with that.

 8. In my work with UTF-8 streams, dealing with autodecode has 
 caused me considerably extra work every time. A convenient 
 timesaver it ain't.

 9. Autodecode cannot be turned off, i.e. it isn't practical to 
 avoid importing std.array one way or another, and then 
 autodecode is there.

 10. Autodecoded arrays cannot be RandomAccessRanges, losing a 
 key benefit of being arrays in the first place.

 11. Indexing an array produces different results than 
 autodecoding, another glaring special case.
Wow, that's eleven things wrong with just one tiny element of D, with the potential to cause problems, whether fixed or not. And I get called a troll and other names when I list half a dozen things wrong with D, my posts get removed/censored, etc, all because I try to inform people not to waste time with D because it's a broken and failed language. *sigh* Phobos, a piece of useless rock orbiting a dead planet ... the irony.
May 12 2016
next sibling parent Ethan Watson <gooberman gmail.com> writes:
On Friday, 13 May 2016 at 06:50:49 UTC, Bill Hicks wrote:
 *rant*
Actually, chap, it's the attitude that's the turn-off in your post there. Listing problems in order to improve them, and listing problems to convince people something is a waste of time are incompatible mindsets around here.
May 13 2016
prev sibling next sibling parent reply poliklosio <poliklosio happypizza.com> writes:
On Friday, 13 May 2016 at 06:50:49 UTC, Bill Hicks wrote:
 On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 (...)
Wow, that's eleven things wrong with just one tiny element of D, with the potential to cause problems, whether fixed or not. And I get called a troll and other names when I list half a dozen things wrong with D, my posts get removed/censored, etc, all because I try to inform people not to waste time with D because it's a broken and failed language. *sigh* Phobos, a piece of useless rock orbiting a dead planet ... the irony.
You get banned because there is a difference between torpedoing a project and having constructive criticism. Also, you are missing the point by claiming that a technical problem is sure to kill D. Note that very successful languages like C++, Python and so on also have undergone heated discussions about various features, and often live with design mistakes for many years. The real reason why languages are successful is what they enable, not how many quirks they have. Quirks are why they get replaced by others 20 years later. :)
May 13 2016
parent Ola Fosheim Grøstad writes:
On Sunday, 15 May 2016 at 01:45:25 UTC, Bill Hicks wrote:
 From a technical point, D is not successful, for the most part.
  C/C++ at least can use the excuse that they were created 
 during a time when we didn't have the experience and the 
 knowledge that we do now.
Not really. The dominating precursor to C, BCPL, was a bootstrapping language for CPL. C was a quick hack to implement Unix. C++ has always been viewed as a hack and was heavily criticised since its inception as an ugly bastardized language that got many things wrong. Reality is, current mainstream programming languages draw on theory that has been well understood for 40+ years. There is virtually no innovation, but a lot of repeated mistakes. Some esoteric languages draw on more modern concepts and innovate, but I can't think of a single mainstream language that does that.
 If by successful you mean the size of the user base, then D 
 doesn't have that either.  The number of D users is most 
 definitely less than 10k.  The number of people who have tried 
 D is no doubt greater than that, but that's the thing with D, 
 it has a low retention rate, for obvious reasons.
Yes, but D can make breaking changes, something C++ cannot do. Unfortunately there is no real willingness to clean up the language, so D is moving way too slowly to become competitive. But that is more of a cultural issue than a language issue. I am personally increasingly involved with C++, but unfortunately, there is no single C++ language. The C/C++ committees have unfortunately tried to make the C-languages more performant and high level at the cost of correctness. So, now you either have to do heavy code reviews or carefully select compiler options to get a sane C++ environment. Like, in modern C/C++ the compiler assumes that there is no aliasing between pointers to different types. So if I cast a scalar float pointer to a simd pointer I either have to:

1. make sure that I turn off that assumption by using the compiler switch "-fno-strict-aliasing" and add "__restrict__" where I know there is no aliasing, or
2. put __may_alias__ on my simd pointers, or
3. carefully place memory barriers between pointer type casts, or
4. dig into the compiler internals to figure out what it does.

C++ is trying way too hard to become a high level language, without the foundation to support it. This is an area where D could do well, but it isn't doing enough to get there, neither on the theoretical level nor the implementation level. Rust seems to try, but I don't think they will make it as they don't seem to have a broad view of programming. Maybe someone will build a new language over the Rust mid-level IR (MIR) that will be successful. I'm hopeful, but hey, it won't happen in less than 5 years. Until then there are only three options for C++ish programming: C++, D and Loci. Currently C++ is the path of least resistance (but with very high initial investment, 1+ year for an experienced educated programmer). So clearly a language comparable to D _could_ make headway, but not without a philosophical change that makes it a significant improvement over C++ and systematically addresses the C++ shortcomings one by one (while retaining the application area and basic programming model).
May 15 2016
prev sibling next sibling parent Chris <wendlec tcd.ie> writes:
On Friday, 13 May 2016 at 06:50:49 UTC, Bill Hicks wrote:
 Wow, that's eleven things wrong with just one tiny element of 
 D, with the potential to cause problems, whether fixed or not.  
 And I get called a troll and other names when I list half a 
 dozen things wrong with D, my posts get removed/censored, etc, 
 all because I try to inform people not to waste time with D 
 because it's a broken and failed language.

 *sigh*

 Phobos, a piece of useless rock orbiting a dead planet ... the 
 irony.
Is there any PL that doesn't have multiple issues? Look at Swift. They keep changing it, although it started out as _the_ big alternative to the chronically ill C++. There is no such thing as the perfect PL, and as hardware is changing, PLs are outdated anyway and have to catch up. The question is not whether a language sucks or not, the question is which language sucks the least for the task at hand. PS I wonder does Bill Hicks know you're using his name? But I guess he's lost interest in this planet and happily lives on Mars now.
May 13 2016
prev sibling next sibling parent Kagamin <spam here.lot> writes:
On Friday, 13 May 2016 at 06:50:49 UTC, Bill Hicks wrote:
 not to waste time with D because it's a broken and failed 
 language.
D is a better broken thing among all the broken things in this broken world, so it's to be expected to be preferred to spend time on.
May 13 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/12/2016 11:50 PM, Bill Hicks wrote:
 And I get called a troll and
 other names when I list half a dozen things wrong with D, my posts get
 removed/censored, etc, all because I try to inform people not to waste time
with
 D because it's a broken and failed language.
Posts that engage in personal attacks and bring up personal issues about other forum members get removed. You're welcome to post here in a reasonably professional manner.
May 13 2016
prev sibling next sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, May 12, 2016 13:15:45 Walter Bright via Digitalmars-d wrote:
 On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
  > I am as unclear about the problems of autodecoding as I am about the
  > necessity to remove curl. Whenever I ask I hear some arguments that work
  > well emotionally but are scant on reason and engineering. Maybe it's
  > time to rehash them? I just did so about curl, no solid argument seemed
  > to come together. I'd be curious of a crisp list of grievances about
  > autodecoding. -- Andrei

 Here are some that are not matters of opinion.

 1. Ranges of characters do not autodecode, but arrays of characters do. This
 is a glaring inconsistency.

 2. Every time one wants an algorithm to work with both strings and ranges,
 you wind up special casing the strings to defeat the autodecoding, or to
 decode the ranges. Having to constantly special case it makes for more
 special cases when plugging together components. These issues often escape
 detection when unittesting because it is convenient to unittest only with
 arrays.

 3. Wrapping an array in a struct with an alias this to an array turns off
 autodecoding, another special case.

 4. Autodecoding is slow and has no place in high speed string processing.

 5. Very few algorithms require decoding.

 6. Autodecoding has two choices when encountering invalid code units - throw
 or produce an error dchar. Currently, it throws, meaning no algorithms
 using autodecode can be made nothrow.

 7. Autodecode cannot be used with unicode path/filenames, because it is
 legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out
 in the wild that pure Unicode is not universal - there's lots of dirty
 Unicode that should remain unmolested, and autocode does not play with
 that.

 8. In my work with UTF-8 streams, dealing with autodecode has caused me
 considerably extra work every time. A convenient timesaver it ain't.

 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
 importing std.array one way or another, and then autodecode is there.

 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of
 being arrays in the first place.

 11. Indexing an array produces different results than autodecoding, another
 glaring special case.
It also results in constantly special-casing algorithms for narrow strings in order to avoid auto-decoding. Phobos does this all over the place. We have a ridiculous amount of code in Phobos just to avoid auto-decoding, and anyone who wants high performance will have to do the same.

And it's not like auto-decoding is even correct. It would be one thing if auto-decoding were fully correct but slow, but to be fully correct, it would need to operate at the grapheme level, not the code point level. So, by default, we get slower code without actually getting fully correct code. So, we're neither fast nor correct. We _are_ correct in more cases than we'd be if we simply acted like ASCII was all there was, but what we end up with is the illusion that we're correct when we're not.

IIRC, Andrei talked in TDPL about how Java's choice to go with UTF-16 was worse than the choice to go with UTF-8, because it was correct in many more cases to operate on the code unit level as if a code unit were a character, and it was therefore harder to realize that what you were doing was wrong, whereas with UTF-8, it's obvious very quickly. We currently have that same problem with auto-decoding except that it's treating UTF-32 code units as if they were full characters rather than treating UTF-16 code units as if they were full characters.

Ideally, algorithms would be Unicode aware as appropriate, but the default would be to operate on code units with wrappers to handle decoding by code point or grapheme. Then it's easy to write fast code while still allowing for full correctness. Granted, it's not necessarily easy to get correct code that way, but anyone who wants full correctness without caring about efficiency can just use ranges of graphemes. Ranges of code points are rare regardless.

Based on what I've seen in previous conversations on auto-decoding over the past few years (be it in the newsgroup, on github, or at dconf), most of the core devs think that auto-decoding was a major blunder that we continue to pay for. But unfortunately, even if we all agree that it was a huge mistake and want to fix it, the question remains of how to do that without breaking tons of code - though since AFAIK, Andrei is still in favor of auto-decoding, we'd have a hard time going forward with plans to get rid of it even if we had come up with a good way of doing so. But I would love it if we could get rid of auto-decoding and clean up string handling in D.

- Jonathan M Davis
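(To make the code unit / code point / grapheme distinction concrete, here is a minimal sketch using Phobos as it exists today; the sample string is an illustrative example chosen because it contains a combining accent:)

import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "é" spelled as 'e' followed by a combining acute accent (U+0301).
    string s = "cafe\u0301";

    assert(s.length == 6);                 // UTF-8 code units
    assert(s.walkLength == 5);             // code points (what auto-decoding gives you)
    assert(s.byGrapheme.walkLength == 4);  // user-perceived characters
    assert(s.byCodeUnit.walkLength == 6);  // explicit code-unit view, no decoding at all
}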
May 13 2016
next sibling parent Chris <wendlec tcd.ie> writes:
On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
 Based on what I've seen in previous conversations on 
 auto-decoding over the past few years (be it in the newsgroup, 
 on github, or at dconf), most of the core devs think that 
 auto-decoding was a major blunder that we continue to pay for. 
 But unfortunately, even if we all agree that it was a huge 
 mistake and want to fix it, the question remains of how to do 
 that without breaking tons of code - though since AFAIK, Andrei 
 is still in favor of auto-decoding, we'd have a hard time going 
 forward with plans to get rid of it even if we had come up with 
 a good way of doing so. But I would love it if we could get rid 
 of auto-decoding and clean up string handling in D.

 - Jonathan M Davis
Why not just try it in a separate test release? Only then can we know to what extent it actually breaks code, and what remedies we could come up with.
May 13 2016
prev sibling next sibling parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
 Ideally, algorithms would be Unicode aware as appropriate, but 
 the default would be to operate on code units with wrappers to 
 handle decoding by code point or grapheme. Then it's easy to 
 write fast code while still allowing for full correctness. 
 Granted, it's not necessarily easy to get correct code that 
 way, but anyone who wants fully correctness without caring 
 about efficiency can just use ranges of graphemes. Ranges of 
 code points are rare regardless.
char[], wchar[] etc. can simply be made non-ranges, so that the user has to choose between .byCodePoint, .byCodeUnit (or .representation as it already exists), .byGrapheme, or even higher-level units like .byLine or .byWord. Ranges of char, wchar however stay as they are today. That way it's harder to accidentally get it wrong.
 Based on what I've seen in previous conversations on 
 auto-decoding over the past few years (be it in the newsgroup, 
 on github, or at dconf), most of the core devs think that 
 auto-decoding was a major blunder that we continue to pay for. 
 But unfortunately, even if we all agree that it was a huge 
 mistake and want to fix it, the question remains of how to do 
 that without breaking tons of code - though since AFAIK, Andrei 
 is still in favor of auto-decoding, we'd have a hard time going 
 forward with plans to get rid of it even if we had come up with 
 a good way of doing so. But I would love it if we could get rid 
 of auto-decoding and clean up string handling in D.
There is a simple deprecation path that's already been suggested. `isInputRange` and friends can output a helpful deprecation warning when they're called with a range that currently triggers auto-decoding.
May 13 2016
prev sibling parent reply Kagamin <spam here.lot> writes:
On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
 IIRC, Andrei talked in TDPL about how Java's choice to go with 
 UTF-16 was worse than the choice to go with UTF-8, because it 
 was correct in many more cases
UTF-16 was a migration from UCS-2, and UCS-2 was superior at the time.
May 13 2016
parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, May 13, 2016 12:52:13 Kagamin via Digitalmars-d wrote:
 On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
 IIRC, Andrei talked in TDPL about how Java's choice to go with
 UTF-16 was worse than the choice to go with UTF-8, because it
 was correct in many more cases
UTF-16 was a migration from UCS-2, and UCS-2 was superior at the time.
The history of why UTF-16 was chosen isn't really relevant to my point (Win32 has the same problem as Java and for similar reasons).

My point was that if you use UTF-8, then it's obvious _really_ fast when you screwed up Unicode-handling by treating a code unit as a character, because anything beyond ASCII is going to fall flat on its face. But with UTF-16, a _lot_ more code points are representable as a single code unit - as well as a single grapheme - so it's far easier to write code that treats a code unit as if it were a full character without realizing that you're screwing it up. UTF-8 is fail-fast in this regard, whereas UTF-16 is not.

UTF-32 takes that problem to a new level, because now you'll only notice problems when you're dealing with a grapheme constructed of multiple code points. So, odds are that even if you test with Unicode strings, you won't catch the bugs. It'll work 99% of the time, and you'll get subtle bugs the rest of the time.

There are reasons to operate at the code point level, but in general, you either want to be operating at the code unit level or the grapheme level, not the code point level, and if you don't know what you're doing, then anything other than the grapheme level is likely going to be wrong if you're manipulating individual characters. Fortunately, a lot of string processing doesn't need to operate on individual characters, and as long as the standard library functions get it right, you'll tend to be okay, but still, operating at the code point level is almost always wrong, and it's even harder to catch when it's wrong than when treating UTF-16 code units as characters.

- Jonathan M Davis
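A tiny sketch of that fail-fast property, assuming a naive "length == number of characters" mindset; the literals are arbitrary examples:

    void main()
    {
        string  u8  = "\u00e9";         // é, a single code point
        wstring u16 = "\u00e9"w;
        assert(u8.length  == 2);        // wrong immediately for a very common character
        assert(u16.length == 1);        // looks right...

        wstring emoji = "\U0001F600"w;  // outside the BMP
        assert(emoji.length == 2);      // ...until a surrogate pair shows up
    }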
May 13 2016
parent reply Kagamin <spam here.lot> writes:
On Friday, 13 May 2016 at 21:46:28 UTC, Jonathan M Davis wrote:
 The history of why UTF-16 was chosen isn't really relevant to 
 my point (Win32 has the same problem as Java and for similar 
 reasons).

 My point was that if you use UTF-8, then it's obvious _really_ 
 fast when you screwed up Unicode-handling by treating a code 
 unit as a character, because anything beyond ASCII is going to 
 fall flat on its face.
On the other hand, if you deal with UTF-16 text, you can't interpret it as anything other than UTF-16: people either get it correct or give up, even for ASCII, even with casts; it's that resilient. With UTF-8, problems happened on a massive scale in LAMP setups: MySQL used latin1 as a default encoding and almost everything worked fine.
May 17 2016
parent sarn <sarn theartofmachinery.com> writes:
On Tuesday, 17 May 2016 at 09:53:17 UTC, Kagamin wrote:
 With UTF-8 problems happened on a massive scale in LAMP setups: 
 mysql used latin1 as a default encoding and almost everything 
 worked fine.
^ latin-1 with Swedish collation rules.

And even if you set the encoding to "utf8", almost everything works fine until you discover that you need to set the encoding to "utf8mb4" to get real utf8.

Also, MySQL has per-connection character encoding settings, so even if your application is properly set up to use utf8, you can break things by accidentally connecting with a client using the default pretty-much-latin1 encoding. With MySQL's "silently ram the square peg into the round hole" design philosophy, this can cause data corruption. But, of course, almost everything works fine.

Just some examples of why broken utf8 exists (and some venting of MySQL trauma).
May 17 2016
prev sibling next sibling parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 7. Autodecode cannot be used with unicode path/filenames, 
 because it is legal (at least on Linux) to have invalid UTF-8 
 as filenames. It turns out in the wild that pure Unicode is not 
 universal - there's lots of dirty Unicode that should remain 
 unmolested, and autocode does not play with that.
This just means that filenames mustn't be represented as strings; it's unrelated to auto decoding.
May 13 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/13/2016 3:43 AM, Marc Schütz wrote:
 On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 7. Autodecode cannot be used with unicode path/filenames, because it is legal
 (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the
 wild that pure Unicode is not universal - there's lots of dirty Unicode that
 should remain unmolested, and autocode does not play with that.
This just means that filenames mustn't be represented as strings; it's unrelated to auto decoding.
It means much more than that; filenames are just an example. I recently fixed MicroEmacs (my text editor) to assume the source is UTF-8 and display Unicode characters. But it still needs to work with dirty UTF-8 without throwing exceptions, modifying the text in-place, or other tantrums.
May 13 2016
prev sibling next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/12/16 4:15 PM, Walter Bright wrote:

 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
 benefit of being arrays in the first place.
I'll repeat what I said in the other thread.

The problem isn't auto-decoding. The problem is hijacking the char[] and wchar[] (and variants) array type to mean autodecoding non-arrays.

If you think this code makes sense, then my definition of sane varies slightly from yours:

static assert(!hasLength!R && is(typeof(R.init.length)));
static assert(!is(ElementType!R == typeof(R.init[0])));
static assert(!isRandomAccessRange!R && is(typeof(R.init[0])) && is(typeof(R.init[0 .. $])));

I think D would be fine if string meant some auto-decoding struct with an immutable(char)[] array backing. I can accept and work with that. I can transform that into a char[] that makes sense if I have no use for auto-decoding.

As of today, I have to use byCodePoint, or .representation, etc. and it's very unwieldy.

If I ran D, that's what I would do.

-Steve
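A rough sketch of what such a struct could look like - purely hypothetical, not a proposed Phobos type - with the raw code units still reachable through the data member:

    struct String
    {
        immutable(char)[] data;   // the raw code units, always reachable

        // range primitives that decode on the fly
        bool empty() const { return data.length == 0; }

        dchar front() const
        {
            import std.utf : decodeFront;
            auto tmp = data;          // decodeFront advances its argument
            return decodeFront(tmp);
        }

        void popFront()
        {
            import std.utf : stride;
            data = data[stride(data) .. $];
        }
    }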
May 13 2016
parent reply Alex Parrill <initrd.gz gmail.com> writes:
On Friday, 13 May 2016 at 16:05:21 UTC, Steven Schveighoffer 
wrote:
 On 5/12/16 4:15 PM, Walter Bright wrote:

 10. Autodecoded arrays cannot be RandomAccessRanges, losing a 
 key
 benefit of being arrays in the first place.
I'll repeat what I said in the other thread. The problem isn't auto-decoding. The problem is hijacking the char[] and wchar[] (and variants) array type to mean autodecoding non-arrays. If you think this code makes sense, then my definition of sane varies slightly from yours: static assert(!hasLength!R && is(typeof(R.init.length))); static assert(!is(ElementType!R == R.init[0])); static assert(!isRandomAccessRange!R && is(typeof(R.init[0])) && is(typeof(R.init[0 .. $]))); I think D would be fine if string meant some auto-decoding struct with an immutable(char)[] array backing. I can accept and work with that. I can transform that into a char[] that makes sense if I have no use for auto-decoding. As of today, I have to use byCodePoint, or .representation, etc. and it's very unwieldy. If I ran D, that's what I would do. -Steve
Well, the "auto" part of autodecoding means "automatically doing it for plain strings", right? If you explicitly do decoding, I think it would just be "decoding"; there's no "auto" part. I doubt anyone is going to complain if you add in a struct wrapper around a string that iterates over code units or graphemes. The issue most people have, as you say, is the fact that the default for strings is to decode.
May 13 2016
parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/13/16 5:25 PM, Alex Parrill wrote:
 On Friday, 13 May 2016 at 16:05:21 UTC, Steven Schveighoffer wrote:
 On 5/12/16 4:15 PM, Walter Bright wrote:

 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
 benefit of being arrays in the first place.
I'll repeat what I said in the other thread. The problem isn't auto-decoding. The problem is hijacking the char[] and wchar[] (and variants) array type to mean autodecoding non-arrays. If you think this code makes sense, then my definition of sane varies slightly from yours: static assert(!hasLength!R && is(typeof(R.init.length))); static assert(!is(ElementType!R == R.init[0])); static assert(!isRandomAccessRange!R && is(typeof(R.init[0])) && is(typeof(R.init[0 .. $]))); I think D would be fine if string meant some auto-decoding struct with an immutable(char)[] array backing. I can accept and work with that. I can transform that into a char[] that makes sense if I have no use for auto-decoding. As of today, I have to use byCodePoint, or .representation, etc. and it's very unwieldy. If I ran D, that's what I would do.
Well, the "auto" part of autodecoding means "automatically doing it for plain strings", right? If you explicitly do decoding, I think it would just be "decoding"; there's no "auto" part.
No, the problem isn't the auto-decoding. The problem is having *arrays* do that. Sometimes.

I would be perfectly fine with a custom string type that all string literals were typed as, as long as I can get a sanely behaving array out of it.
 I doubt anyone is going to complain if you add in a struct wrapper
 around a string that iterates over code units or graphemes. The issue
 most people have, as you say, is the fact that the default for strings
 is to decode.
I want to clarify that I don't really care if strings by default auto-decode. I think that's fine. What I dislike is that immutable(char)[] auto-decodes. -Steve
May 13 2016
prev sibling next sibling parent reply Jon D <jond noreply.com> writes:
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
 I am as unclear about the problems of autodecoding as I am
about the necessity
 to remove curl. Whenever I ask I hear some arguments that
work well emotionally
 but are scant on reason and engineering. Maybe it's time to
rehash them? I just
 did so about curl, no solid argument seemed to come together.
I'd be curious of
 a crisp list of grievances about autodecoding. -- Andrei
Given the importance of performance in the auto-decoding topic, it seems reasonable to quantify it. I took a stab at this. It would of course be prudent to have others conduct similar analysis rather than rely on my numbers alone.

Measurements were done using an artificial scenario, counting lower-case ascii letters. This had the effect of calling front/popFront many times on a long block of text. Runs were done both treating the text as char[] and ubyte[] and comparing the run times. (char[] performs auto-decoding, ubyte[] does not.)

Timings were done with DMD and LDC, and on two different data sets. One data set was a mix of latin languages (e.g. German, English, Finnish, etc.), the other non-Latin languages (e.g. Japanese, Chinese, Greek, etc.). The goal being to distinguish between scenarios with high and low Ascii character content.

The result: For DMD, auto-decoding showed a 1.6x to 2.6x cost. For LDC, a 12.2x to 12.9x cost.

Details:
- Test program: https://dpaste.dzfl.pl/67c7be11301f
- DMD 2.071.0. Options: -release -O -boundscheck=off -inline
- LDC 1.0.0-beta1 (based on DMD v2.070.2). Options: -release -O -boundscheck=off
- Machine: Macbook Pro (2.8 GHz Intel I7, 16GB ram)

Runs for each combination were done five times and the median times used. The median times and the char[] to ubyte[] ratio are below:

|          |           |    char[] |   ubyte[] |       |
| Compiler | Text type | time (ms) | time (ms) | ratio |
|----------+-----------+-----------+-----------+-------|
| DMD      | Latin     |      7261 |      4513 |   1.6 |
| DMD      | Non-latin |     10240 |      3928 |   2.6 |
| LDC      | Latin     |     11773 |       913 |  12.9 |
| LDC      | Non-latin |     10756 |       883 |  12.2 |

Note: The numbers above don't provide enough info to derive a front/popFront rate. The program artificially makes multiple loops to increase the run-times. (For these runs, the program's repeat-count was set to 20).

Characteristics of the two data sets:

|           |         |         |             | Bytes per |           |
| Text type |   Bytes |  DChars | Ascii Chars |     DChar | Pct Ascii |
|-----------+---------+---------+-------------+-----------+-----------|
| Latin     | 4156697 | 4059016 |     3965585 |     1.024 |     97.7% |
| Non-latin | 4061554 | 1949290 |      348164 |     2.084 |     17.9% |

Run-to-run variability - The run times recorded were quite stable. The largest delta between minimum and median time for any group was 17 milliseconds.
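For orientation, a minimal sketch of the kind of loop such a benchmark times; this is not the actual test program (which is at the dpaste link above), just an illustration of char[] versus ubyte[] iteration:

    import std.range.primitives : empty, front, popFront;

    size_t countLowerAscii(T)(const(T)[] text)
    {
        size_t n = 0;
        auto r = text;
        while (!r.empty)
        {
            auto c = r.front;            // dchar for char[] (decoded), ubyte for ubyte[]
            if (c >= 'a' && c <= 'z')
                ++n;
            r.popFront();                // char[]: skips a whole code point
        }
        return n;
    }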
May 15 2016
next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Sunday, 15 May 2016 at 23:10:38 UTC, Jon D wrote:
 Given the importance of performance in the auto-decoding topic, 
 it seems reasonable to quantify it. I took a stab at this. It 
 would of course be prudent to have others conduct similar 
 analysis rather than rely on my numbers alone.
Here is another benchmark (see the above comment for the code to apply the patch to) that measures the iteration time difference:

http://forum.dlang.org/post/ndj6dm$a6c$1 digitalmars.com

The result is a 756% slow down
May 15 2016
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, May 16, 2016 at 12:31:04AM +0000, Jack Stouffer via Digitalmars-d wrote:
 On Sunday, 15 May 2016 at 23:10:38 UTC, Jon D wrote:
Given the importance of performance in the auto-decoding topic, it
seems reasonable to quantify it. I took a stab at this. It would of
course be prudent to have others conduct similar analysis rather than
rely on my numbers alone.
Here is another benchmark (see the above comment for the code to apply the patch to) that measures the iteration time difference: http://forum.dlang.org/post/ndj6dm$a6c$1 digitalmars.com The result is a 756% slow down
I decided to do my own benchmarking too. Here's the code:

/**
 * Simple-minded benchmark for measuring performance degradation caused by
 * autodecoding.
 */
import std.typecons : Flag, Yes, No;

size_t countNewlines(Flag!"autodecode" autodecode)(const(char)[] input)
{
    size_t count = 0;
    static if (autodecode)
    {
        import std.array;
        foreach (dchar ch; input)
        {
            if (ch == '\n') count++;
        }
    }
    else // !autodecode
    {
        import std.utf : byCodeUnit;
        foreach (char ch; input.byCodeUnit)
        {
            if (ch == '\n') count++;
        }
    }
    return count;
}

void main(string[] args)
{
    import std.datetime : benchmark;
    import std.file : read;
    import std.stdio : writeln, writefln;

    string input = (args.length >= 2) ? args[1]
                                      : "/usr/src/d/phobos/std/datetime.d";
    uint n = 50;
    auto data = cast(char[]) read(input);
    writefln("Input: %s (%d bytes)", input, data.length);

    size_t count;
    writeln("With autodecoding:");
    auto result = benchmark!({
        count = countNewlines!(Yes.autodecode)(data);
    })(n);
    writefln("Newlines: %d Time: %s msecs", count, result[0].msecs);

    writeln("Without autodecoding:");
    result = benchmark!({
        count = countNewlines!(No.autodecode)(data);
    })(n);
    writefln("Newlines: %d Time: %s msecs", count, result[0].msecs);
}

// vim:set sw=4 ts=4 et:

Just for fun, I decided to use std/datetime.d, one of the largest modules in Phobos, as a test case. For comparison, I compiled with dmd (latest git head) and gdc 5.3.1. The compile commands were:

    dmd -O -inline bench.d -ofbench.dmd
    gdc -O3 bench.d -o bench.gdc

Here are the results from bench.dmd:

    Input: /usr/src/d/phobos/std/datetime.d (1464089 bytes)
    With autodecoding:
    Newlines: 35398 Time: 331 msecs
    Without autodecoding:
    Newlines: 35398 Time: 254 msecs

And the results from bench.gdc:

    Input: /usr/src/d/phobos/std/datetime.d (1464089 bytes)
    With autodecoding:
    Newlines: 35398 Time: 253 msecs
    Without autodecoding:
    Newlines: 35398 Time: 25 msecs

These results are pretty typical across multiple runs. There is a variance of about 20 msecs or so between bench.dmd runs, but the bench.gdc runs vary only by about 1-2 msecs.

So for bench.dmd, autodecoding adds about a 30% overhead to running time, whereas for bench.gdc, autodecoding costs an order of magnitude increase in running time.

As an interesting aside, compiling with dmd without -O -inline causes the non-autodecoding case to be actually consistently *slower* than the autodecoding case. Apparently in this case the performance is dominated by the cost of calling non-inlined range primitives on byCodeUnit, whereas a manual for-loop over the array of chars produces similar results to the -O -inline case. I find this interesting, because it shows that the cost of autodecoding is relatively small compared to the cost of unoptimized range primitives. Nevertheless, it does make a big difference when range primitives are properly optimized. It is especially poignant in the case of gdc that, given a superior optimizer, the non-autodecoding case can be made an order of magnitude faster, whereas the autodecoding case is presumably complex enough to defeat the optimizer.


T

-- 
Democracy: The triumph of popularity over principle. -- C.Bond
May 15 2016
prev sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Sunday, 15 May 2016 at 23:10:38 UTC, Jon D wrote:
 Runs for each combination were done five times and the median 
 times used. The median times and the char[] to ubyte[] ratio 
 are below:
 |          |           |    char[] |   ubyte[] |
 | Compiler | Text type | time (ms) | time (ms) | ratio |
 |----------+-----------+-----------+-----------+-------|
 | DMD      | Latin     |      7261 |      4513 |   1.6 |
 | DMD      | Non-latin |     10240 |      3928 |   2.6 |
 | LDC      | Latin     |     11773 |       913 |  12.9 |
 | LDC      | Non-latin |     10756 |       883 |  12.2 |
Interesting that LDC is slower than DMD for char[].
May 16 2016
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
This might be a good time to discuss this a tad further. I'd appreciate 
if the debate stayed on point going forward. Thanks!

My thesis: the D1 design decision to represent strings as char[] was 
disastrous and probably one of the largest weaknesses of D1. The 
decision in D2 to use immutable(char)[] for strings is a vast 
improvement but still has a number of issues. The approach to 
autodecoding in Phobos is an improvement on that decision. The insistent 
shunning of a user-defined type to represent strings is not good and we 
need to rid ourselves of it.

On 05/12/2016 04:15 PM, Walter Bright wrote:
 On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
  > I am as unclear about the problems of autodecoding as I am about the
 necessity
  > to remove curl. Whenever I ask I hear some arguments that work well
 emotionally
  > but are scant on reason and engineering. Maybe it's time to rehash
 them? I just
  > did so about curl, no solid argument seemed to come together. I'd be
 curious of
  > a crisp list of grievances about autodecoding. -- Andrei

 Here are some that are not matters of opinion.

 1. Ranges of characters do not autodecode, but arrays of characters do.
 This is a glaring inconsistency.
Agreed. At the point of that decision, the party line was "arrays of characters are strings, nothing else is or should be". Now it is apparent that shouldn't have been the case.
 2. Every time one wants an algorithm to work with both strings and
 ranges, you wind up special casing the strings to defeat the
 autodecoding, or to decode the ranges. Having to constantly special case
 it makes for more special cases when plugging together components. These
 issues often escape detection when unittesting because it is convenient
 to unittest only with arrays.
This is a consequence of 1. It is at least partially fixable.
 3. Wrapping an array in a struct with an alias this to an array turns
 off autodecoding, another special case.
This is also a consequence of 1.
 4. Autodecoding is slow and has no place in high speed string processing.
I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D. Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing.

Also allow me to point out that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and we should arrange code to minimize impact.
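A sketch of that kind of tactical fast path - not Phobos code, just an illustration of the c < 0x80 test being hoisted in front of the full decode:

    import std.utf : decode;

    dchar frontFast(const(char)[] s)
    {
        // assumes s is non-empty, well-formed UTF-8
        if (s[0] < 0x80)
            return s[0];        // ASCII: one highly predictable compare, no decoding
        size_t i = 0;
        return decode(s, i);    // multi-byte case: full UTF-8 decode
    }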
 5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug):

s.find("abc")
s.findSplit("abc")
s.findSplit('a')
s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However the following do require autodecoding:

s.walkLength
s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
s.count!(c => c >= 32) // non-control characters

Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.
 6. Autodecoding has two choices when encountering invalid code units -
 throw or produce an error dchar. Currently, it throws, meaning no
 algorithms using autodecode can be made nothrow.
Agreed. This is probably the most glaring mistake. I think we should open a discussion on fixing this everywhere in the stdlib, even at the cost of breaking code.
 7. Autodecode cannot be used with unicode path/filenames, because it is
 legal (at least on Linux) to have invalid UTF-8 as filenames. It turns
 out in the wild that pure Unicode is not universal - there's lots of
 dirty Unicode that should remain unmolested, and autocode does not play
 with that.
If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.
 8. In my work with UTF-8 streams, dealing with autodecode has caused me
 considerably extra work every time. A convenient timesaver it ain't.
Objection. Vague.
 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
 importing std.array one way or another, and then autodecode is there.
Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
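For example (a small sketch, not from the post), .representation from std.string exposes the raw code units, so nothing decodes:

    import std.algorithm.searching : count;
    import std.string : representation;

    void main()
    {
        string s = "h\u00ebllo";                    // "hëllo"
        immutable(ubyte)[] raw = s.representation;  // same bytes, ubyte element type
        assert(raw.count!(b => b == 'l') == 2);     // iterates code units, nothing decodes
    }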
 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
 benefit of being arrays in the first place.
First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width.

Second, it's as it should. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[] and why do we need char as a separate type from ubyte? This is a fundamental question for which we need a rigorous answer.

What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce such, except it does special things in random places like for loops (any other)?

If char is to be distinct from ubyte, and char[] is to be distinct from ubyte[], then autodecoding does the right thing: it makes sure they are distinguished in behavior and embodies the assumption that char is, in fact, a UTF8 code point.
 11. Indexing an array produces different results than autodecoding,
 another glaring special case.
This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.

Overall, I think the one way to make real steps forward in improving string processing in the D language is to give a clear answer of what char, wchar, and dchar mean.


Andrei
May 26 2016
next sibling parent Jack Stouffer <jack jackstouffer.com> writes:
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu 
wrote:
 instead, it should use standard library algorithms for 
 searching,
 matching etc. When needed, iterating every code unit is 
 trivially
 done through indexing.
For an example where the std.algorithm/range functions don't cut it, my random format date string parser first breaks up the given character range into tokens. Once it has the tokens, it checks several known formats.

One piece of that is checking if some of the tokens are in AAs of month and day names for fast tests of presence. Because the AAs are int[string], and it's unknowable the encoding of string (it's complicated), during tokenization, the character range must be forced to UTF-8 with byChar with all isSomeString!R == true inputs to avoid the auto-decoding and subsequent AA key mismatch.
 Agreed. This is probably the most glaring mistake. I think we 
 should open a discussion no fixing this everywhere in the 
 stdlib, even at the cost of breaking code.
See the discussion here: https://issues.dlang.org/show_bug.cgi?id=14519 I think some of the proposals there are interesting.
 Overall, I think the one way to make real steps forward in 
 improving string processing in the D language is to give a 
 clear answer of what char, wchar, and dchar mean.
If you agree that iterating over code units and code points isn't what people want/need most of the time, then I will quote something from my article on the subject:

"I really don't see the benefit of the automatic behavior fulfilling this one specific corner case when you're going to make everyone else call a range generating function when they want to iterate over code units or graphemes. Just make everyone call a range generating function to specify the type of iteration and save a lot of people the trouble!"

I think the only clear way forward is to not make strings ranges and force people to make a decision when passing them to range functions. The HUGE problem is the code this will break, which is just about all of it.
May 26 2016
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, May 26, 2016 at 12:00:54PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
[...]
 On 05/12/2016 04:15 PM, Walter Bright wrote:
[...]
 4. Autodecoding is slow and has no place in high speed string processing.
I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D. Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing. Also allow me to point that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and we should arrange code to minimize impact.
General Unicode strings have a lot of non-ASCII characters. Why are we only optimizing for the ASCII case?
 5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug): s.find("abc") s.findSplit("abc") s.findSplit('a') s.count!(c => "!()-;:,.?".canFind(c)) // punctuation However the following do require autodecoding: s.walkLength s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation s.count!(c => c >= 32) // non-control characters
Question: what should count return, given a string containing (1) combining diacritics, or (2) Korean text? Or (3) zero-width spaces?
 Currently the standard library operates at code point level even
 though inside it may choose to use code units when admissible. Leaving
 such a decision to the library seems like a wise thing to do.
The problem is that often such decisions can only be made by the user, because it depends on what the user wants to accomplish. What should count return, given some Unicode string? If the user wants to determine the size of a buffer (e.g., to store a string minus some characters to be stripped), then count should return the byte count. If the user wants to count the number of matching visual characters, then count should return the number of graphemes. If the user wants to determine the visual width of the (filtered) string, then count should not be used at all, but instead a font metric algorithm. (I can't think of a practical use case where you'd actually need to count code points(!).)

Having the library arbitrarily choose one use case over the others (especially one that seems the least applicable to practical situations) just doesn't seem right to me at all. Rather, the user ought to specify what exactly is to be counted, i.e., s.byCodeUnit.count(), s.byCodePoint.count(), or s.byGrapheme.count().

[...]
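A small example of that ambiguity, where the three plausible answers all differ for the same string (a combining accent chosen deliberately):

    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "e\u0301";                   // 'e' plus a combining acute accent
        assert(s.byCodeUnit.walkLength == 3);   // code units
        assert(s.walkLength == 2);              // code points (today's autodecoded default)
        assert(s.byGrapheme.walkLength == 1);   // graphemes - one visible character
    }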
 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
 importing std.array one way or another, and then autodecode is there.
Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
Therefore, instead of:

	myString.splitter!"abc".joiner!"def".count;

we have to write:

	myString.representation
		.splitter!("abc".representation)
		.joiner!("def".representation)
		.count;

Great.

[...]
 Second, it's as it should. The entire scaffolding rests on the notion
 that char[] is distinguished from ubyte[] by having UTF8 code units,
 not arbitrary bytes. It seems that many arguments against autodecoding
 are in fact arguments in favor of eliminating virtually all
 distinctions between char[] and ubyte[].
That is a strawman. We are not arguing for eliminating the distinction between char[] and ubyte[]. Rather, the complaint is that autodecoding represents a constant overhead in string processing that's often *unnecessary*. Many string operations don't *need* to autodecode, and even those that may seem like they do, are often better implemented differently.

For example, filtering a string by a non-ASCII character can actually be done via substring search -- expand the non-ASCII character into 1 to 6 code units, and then do the equivalent of C's strstr(). This will not have false positives thanks to the way UTF-8 is designed. It eliminates the overhead of decoding every single character -- in implementational terms, it could, for example, first scan for the 1st byte by linear scan through the string without decoding, which is a lot faster than decoding every single character and then comparing with the target. Only when the first byte matches does it need to do the slightly more expensive operation of substring comparison.

Similarly, splitter does not need to operate on code points at all. It's unnecessarily slow that way. Most use cases of splitter have lots of data in between delimiters, which means most of the work done by autodecoding is wasted. Instead, splitter should just scan for the substring to split on -- again the design of UTF-8 guarantees there will be no false positives -- and only put in the effort where it's actually needed: at the delimiters, not the data in between.

The same could be said of joiner, and many other common string algorithms. There aren't many algorithms that actually need to decode; decoding should be restricted to them, rather than an overhead applied across the board.

[...]
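A short sketch of that property - a plain byte-level search over the representation, with no decoding anywhere; the strings are arbitrary examples:

    import std.algorithm.searching : find;
    import std.string : representation;

    void main()
    {
        auto hay    = "se\u00f1al de tr\u00e1fico".representation;
        auto needle = "\u00f1al".representation;
        // plain byte comparison; continuation bytes can never masquerade as start bytes
        assert(hay.find(needle).length != 0);
    }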
 Overall, I think the one way to make real steps forward in improving
 string processing in the D language is to give a clear answer of what
 char, wchar, and dchar mean.
[...]

We already have a clear definition: char, wchar, and dchar are Unicode code units, and the latter is also Unicode code points. That's all there is to it.

If we want Phobos to truly be able to take advantage of the fact that char[], wchar[], dchar[] contain Unicode strings, we need to stop the navel gazing at what byte representations and bits mean, and look at the bigger picture. Consider char[] as a unit in itself, a complete Unicode string -- the actual code units don't really matter, as they are just an implementation detail. What you want to be able to do is for a Phobos algorithm to decide, OK, in order to produce output X, it's faster to do substring scanning, and in order to produce output Y, it's better to decode first. In other words, decoding or not decoding ought to be a decision made at the algorithm level (or higher), depending on the need at hand. It should not be hard-boiled into the lower-level internals of how strings are handled, such that higher-level algorithms are straitjacketed and forced to work with the decoded stream, even when they actually don't *need* decoding to do what they want.

In the cases where Phobos is unable to make a decision (e.g., what should count return -- which depends on what the user is trying to accomplish), it should be left to the user. The user shouldn't have to work against a default setting that only works for a subset of use cases.


T

-- 
Without geometry, life would be pointless. -- VS
May 26 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/26/2016 07:23 PM, H. S. Teoh via Digitalmars-d wrote:
 Therefore, instead of:

 	myString.splitter!"abc".joiner!"def".count;

 we have to write:

 	myString.representation
 		.splitter!("abc".representation)
 		.joiner!("def".representation)
 		.count;
No, that's not necessary (or correct). -- Andrei
May 26 2016
prev sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Thu, 26 May 2016 16:23:16 -0700,
"H. S. Teoh via Digitalmars-d"
<digitalmars-d puremagic.com> wrote:

 On Thu, May 26, 2016 at 12:00:54PM -0400, Andrei Alexandrescu via
Digitalmars-d wrote:
 [...]
 s.walkLength
 s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
 s.count!(c => c >= 32) // non-control characters  
Question: what should count return, given a string containing (1) combining diacritics, or (2) Korean text? Or (3) zero-width spaces?
 Currently the standard library operates at code point level even
 though inside it may choose to use code units when admissible. Leaving
 such a decision to the library seems like a wise thing to do.  
The problem is that often such decisions can only be made by the user, because it depends on what the user wants to accomplish. What should count return, given some Unicode string? If the user wants to determine the size of a buffer (e.g., to store a string minus some characters to be stripped), then count should return the byte count. If the user wants to count the number of matching visual characters, then count should return the number of graphemes. If the user wants to determine the visual width of the (filtered) string, then count should not be used at all, but instead a font metric algorithm. (I can't think of a practical use case where you'd actually need to count code points(!).)
Hey, I was about to answer exactly the same. It reminds me that a few years ago I proposed making string iteration explicit by code-unit, code-point and grapheme in "Rust", and there was virtually no debate about doing it, in the sense that to enable people to write correct code they'd need to understand a bit of Unicode and pick the right primitive. If you don't know what to pick you look it up.

-- 
Marco
May 30 2016
parent reply Andrew Godfrey <X y.com> writes:
I like "make string iteration explicit" but I wonder about other 
constructs. E.g. What about "sort an array of strings"? How would 
you tell a generic sort function whether you want it to interpret 
strings by code unit vs code point vs grapheme?
May 30 2016
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 30 May 2016 at 17:14:47 UTC, Andrew Godfrey wrote:
 I like "make string iteration explicit" but I wonder about 
 other constructs. E.g. What about "sort an array of strings"? 
 How would you tell a generic sort function whether you want it 
 to interpret strings by code unit vs code point vs grapheme?
The comparison predicate does that...

sort!( (string a, string b) {
    /* you interpret a and b here and return the comparison */
})(["hi", "there"]);
May 30 2016
parent Andrew Godfrey <X y.com> writes:
On Monday, 30 May 2016 at 18:26:32 UTC, Adam D. Ruppe wrote:
 On Monday, 30 May 2016 at 17:14:47 UTC, Andrew Godfrey wrote:
 I like "make string iteration explicit" but I wonder about 
 other constructs. E.g. What about "sort an array of strings"? 
 How would you tell a generic sort function whether you want it 
 to interpret strings by code unit vs code point vs grapheme?
The comparison predicate does that... sort!( (string a, string b) { /* you interpret a and b here and return the comparison */ })(["hi", "there"]);
Thanks! You left out some details but I think I see - an example predicate might be "cmp(a.byGrapheme, b.byGrapheme)" and by the looks of it, that code works in D today. (However, "cmp(a, b)" would default to code points today, which is surprising to almost everyone, and that's more what this thread is about).
May 30 2016
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Mon, 30 May 2016 17:14:47 +0000,
Andrew Godfrey <X y.com> wrote:

 I like "make string iteration explicit" but I wonder about other=20
 constructs. E.g. What about "sort an array of strings"? How would=20
 you tell a generic sort function whether you want it to interpret=20
 strings by code unit vs code point vs grapheme?
You are just scratching the surface! Unicode strings are sorted following the Unicode Collation Algorithm, which is described in the 86 pages document here: (http://www.unicode.org/reports/tr10/) which is implemented in the ICU library mentioned before.

Some obvious considerations from the description of the algorithm:

In Sweden z comes before ö, while in Germany it's the reverse.

In Germany, words in a dictionary are sorted differently from lists of names in a phone book. dictionary: of < öf, phone book: öf < of

Spanish sorts 'll' as one character right after 'l'.

The default collation is selected in Windows through the control panel's localization app and on Linux (Posix) using the LC_COLLATE environment variable. The actual string sorting in the user's locale can then be performed with the C library using http://www.cplusplus.com/reference/cstring/strcoll/ or OS specific functions like CompareStringEx on Windows:
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317761%28v=vs.85%29.aspx

TL;DR neither code-points nor grapheme clusters are adequate for string sorting. Also two strings may compare unequal byte for byte, while they are actually the same text in different normalization forms. (E.g. Umlauts on OS X (NFD) vs. rest of the world (NFC)). Admittedly I find myself using str1 == str2 without first normalizing both, because it is frigging convenient and fast.

-- 
Marco
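A minimal sketch of the strcoll route from D, assuming the process locale is the one you want to collate by; the word list is just an example:

    import core.stdc.locale : LC_COLLATE, setlocale;
    import core.stdc.string : strcoll;
    import std.algorithm.sorting : sort;
    import std.string : toStringz;

    void main()
    {
        setlocale(LC_COLLATE, "");   // adopt the user's locale
        auto words = ["\u00f6f", "of", "zebra"];
        // locale-aware ordering through the C runtime
        words.sort!((a, b) => strcoll(a.toStringz, b.toStringz) < 0);
    }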
May 30 2016
prev sibling next sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu 
wrote:
 4. Autodecoding is slow and has no place in high speed string 
 processing.
I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D.
It is completely wasted mental effort.
 5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug): s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
As far as I can see, the language currently does not provide the facilities to implement the above without autodecoding.
 However the following do require autodecoding:

 s.walkLength
Usage of the result of this expression will be incorrect in many foreseeable cases.
 s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
Ditto.
 s.count!(c => c >= 32) // non-control characters
Ditto, with a big red flag. If you are dealing with control characters, the code is likely low-level enough that you need to be explicit in what you are counting. It is likely not what actually needs to be counted. Such confusion can lead to security risks.
 Currently the standard library operates at code point level 
 even though inside it may choose to use code units when 
 admissible. Leaving such a decision to the library seems like a 
 wise thing to do.
It should be explicit.
 7. Autodecode cannot be used with unicode path/filenames, 
 because it is
 legal (at least on Linux) to have invalid UTF-8 as filenames. 
 It turns
 out in the wild that pure Unicode is not universal - there's 
 lots of
 dirty Unicode that should remain unmolested, and autocode does 
 not play
 with that.
If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.
This is not practical. Do you really see changing std.file and std.path to accept ubyte[] for all path arguments?
 8. In my work with UTF-8 streams, dealing with autodecode has 
 caused me
 considerably extra work every time. A convenient timesaver it 
 ain't.
Objection. Vague.
I can confirm this vague subjective observation. For example, DustMite reimplements some std.string functions in order to be able to handle D files with invalid UTF-8 characters.
 9. Autodecode cannot be turned off, i.e. it isn't practical to 
 avoid
 importing std.array one way or another, and then autodecode is 
 there.
Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
This is neither easy nor practical. It makes writing reliable string handling code a chore in D. Because it is difficult to find all places where this must be done, it is not possible to do on a program-wide scale, thus bugs can only be discovered when this or that component fails because it was not tested with Unicode strings.
 10. Autodecoded arrays cannot be RandomAccessRanges, losing a 
 key
 benefit of being arrays in the first place.
First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width. Second, it's as it should. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[] and why do we need char as a separate type from ubyte? This is a fundamental question for which we need a rigorous answer.
Why?
 What is the purpose of char, wchar, and dchar? My current 
 understanding is that they're justified as pretty much 
 indistinguishable in primitives and behavior from ubyte, 
 ushort, and uint respectively, but they reflect a loose 
 subjective intent from the programmer that they hold actual UTF 
 code units. The core language does not enforce such, except it 
 does special things in random places like for loops (any other)?

 If char is to be distinct from ubyte, and char[] is to be 
 distinct from ubyte[], then autodecoding does the right thing: 
 it makes sure they are distinguished in behavior and embodies 
 the assumption that char is, in fact, a UTF8 code point.
I don't follow this line of reasoning at all.
 11. Indexing an array produces different results than 
 autodecoding,
 another glaring special case.
This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.
There is no convincing argument why indexing and slicing should not simply operate on code units.
 Overall, I think the one way to make real steps forward in 
 improving string processing in the D language is to give a 
 clear answer of what char, wchar, and dchar mean.
I don't follow. Though, making char implicitly convertible to wchar and dchar has clearly been a mistake.
May 26 2016
parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, May 27, 2016 04:31:49 Vladimir Panteleev via Digitalmars-d wrote:
 On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu
 9. Autodecode cannot be turned off, i.e. it isn't practical to
 avoid
 importing std.array one way or another, and then autodecode is
 there.
Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
This is neither easy nor practical. It makes writing reliable string handling code a chore in D. Because it is difficult to find all places where this must be done, it is not possible to do on a program-wide scale, thus bugs can only be discovered when this or that component fails because it was not tested with Unicode strings.
In addition, as soon as you have ubyte[], none of the string-related functions work. That's fixable, but as it stands, operating on ubyte[] instead of char[] is a royal pain. - Jonathan M Davis
May 31 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 02:57 PM, Jonathan M Davis via Digitalmars-d wrote:
 In addition, as soon as you have ubyte[], none of the string-related
 functions work. That's fixable, but as it stands, operating on ubyte[]
 instead of char[] is a royal pain.
That'd be nice to fix indeed. Please break the ground? -- Andrei
May 31 2016
prev sibling next sibling parent reply Kagamin <spam here.lot> writes:
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu 
wrote:
 11. Indexing an array produces different results than 
 autodecoding,
 another glaring special case.
This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.
Sounds like you want to say that string should be smarter than an array of code units in dealing with unicode. As I understand, the design rationale behind strings being plain arrays of code units is that it's impractical for the string to be smarter than an array of code units - it just won't cut it, while a plain array provides a simple and easy to understand implementation of string.
May 27 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 6:26 AM, Kagamin wrote:
 As I understand, design rationale
 behind strings being plain arrays of code units is that it's impractical
 for the string to smarter than array of code units - it just won't cut
 it, while plain array provides simple and easy to understand
 implementation of string.
That's my understanding too. And I think the design rationale is wrong. -- Andrei
May 27 2016
prev sibling next sibling parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu 
wrote:
 This might be a good time to discuss this a tad further. I'd 
 appreciate if the debate stayed on point going forward. Thanks!

 My thesis: the D1 design decision to represent strings as 
 char[] was disastrous and probably one of the largest 
 weaknesses of D1. The decision in D2 to use immutable(char)[] 
 for strings is a vast improvement but still has a number of 
 issues. The approach to autodecoding in Phobos is an 
 improvement on that decision.
It is not, which has been shown by various posts in this thread. Iterating by code points is at least as wrong as iterating by code units; it can be argued it is worse because it sometimes makes the fact that it's wrong harder to detect.
 The insistent shunning of a user-defined type to represent 
 strings is not good and we need to rid ourselves of it.
While this may be true, it has nothing to do with auto decoding. I assume you would want such a user-define string type to auto-decode as well, right?
 On 05/12/2016 04:15 PM, Walter Bright wrote:
 5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug): s.find("abc") s.findSplit("abc") s.findSplit('a')
Yes.
 s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
Ideally yes, but this is a special case that cannot be detected by `count`.
 However the following do require autodecoding:

 s.walkLength
 s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
 s.count!(c => c >= 32) // non-control characters
No, they do not need _auto_ decoding, they need a decision _by the user_ what they should be decoded to. Code units? Code points? Graphemes? Words? Lines?
 Currently the standard library operates at code point level
Because it auto decodes.
 even though inside it may choose to use code units when 
 admissible. Leaving such a decision to the library seems like a 
 wise thing to do.
No one wants to take that second part away. For example, the `find` can provide an overload that accepts `const(char)[]` directly, while `walkLength` doesn't, requiring a decision by the caller.
 7. Autodecode cannot be used with unicode path/filenames, 
 because it is
 legal (at least on Linux) to have invalid UTF-8 as filenames. 
 It turns
 out in the wild that pure Unicode is not universal - there's 
 lots of
 dirty Unicode that should remain unmolested, and autocode does 
 not play
 with that.
If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.
I believe a library type would be more appropriate than bare `ubyte[]`. It should provide conversion between the OS encoding (which can be detected automatically) and UTF strings, for example. And it should be used for any "strings" that comes from outside the program, like main's arguments, env variables...
 9. Autodecode cannot be turned off, i.e. it isn't practical to 
 avoid
 importing std.array one way or another, and then autodecode is 
 there.
Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
This would no longer work if char[] and char ranges were to be treated identically.
 10. Autodecoded arrays cannot be RandomAccessRanges, losing a 
 key
 benefit of being arrays in the first place.
First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width. Second, it's as it should. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[] and why do we need char as a separate type from ubyte? This is a fundamental question for which we need a rigorous answer. What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce such, except it does special things in random places like for loops (any other)?
Agreed.
 If char is to be distinct from ubyte, and char[] is to be 
 distinct from ubyte[], then autodecoding does the right thing: 
 it makes sure they are distinguished in behavior and embodies 
 the assumption that char is, in fact, a UTF8 code point.
Distinguishing them is the right thing to do, but auto decoding is not the way to achieve that, see above.
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 6:56 AM, Marc Schütz wrote:
 It is not, which has been shown by various posts in this thread.
Couldn't quite find strong arguments. Could you please be more explicit on which you found most convincing? -- Andrei
May 27 2016
parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Friday, 27 May 2016 at 13:34:33 UTC, Andrei Alexandrescu wrote:
 On 5/27/16 6:56 AM, Marc Schütz wrote:
 It is not, which has been shown by various posts in this 
 thread.
Couldn't quite find strong arguments. Could you please be more explicit on which you found most convincing? -- Andrei
There are several possibilities of what iteration over a char range can mean. (For the sake of simplicity, let's ignore special cases like `find` and `split`; instead, let's look at `walkLength`, `retro` and similar.)

BEFORE the introduction of auto decoding, it used to iterate over UTF8 code _units_, which is wrong for any non-ASCII data (except for the unlikely case where you really want code units).

AFTER the introduction of auto decoding, it iterates over UTF8 code _points_, which is wrong for combined characters, e.g. äöüéòàñ on MacOS X, more "exotic" ones everywhere (except for the even more unlikely case where you really want code points).

That is, both the BEFORE and AFTER behaviour are wrong, both break for various kinds of input in different ways. So, is AFTER an improvement over BEFORE? The set of inputs where auto decoding produces wrong output is likely smaller, making it slightly less likely to encounter problems in practice; on the other hand, it's still wrong, and it's harder to find these problems during testing. That's like "improving" a bicycle so that it only breaks down after riding it for 30 minutes instead of just after 10 minutes, so you won't notice it during a test ride.

But there are even more possibilities. It could iterate over graphemes, which is expensive, but more likely to produce the results that the user wants. Or it could iterate by lines, or words (and there are different ways to define what a word is), and so on. The fundamental problem is choosing one of those possibilities over the others without knowing what the user actually wants, which is what both BEFORE and AFTER do.

So, what was the original goal when introducing auto decoding? To improve correctness, right? I would argue that this goal has not been achieved. Have a look at the article [1], which IMO gives good criteria for how a _correct_ string type should behave. Both BEFORE and AFTER fail most of them.

[1] https://mortoray.com/2013/11/27/the-string-type-is-broken/
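A concrete case of the AFTER behaviour still being wrong: reversing a string by code points (which is what retro does on an autodecoded string) moves a combining accent onto the wrong letter, even though every code point was handled "correctly":

    import std.array : array;
    import std.range : retro;

    void main()
    {
        string s = "a\u0301b";          // 'a' + combining acute accent, then 'b'
        auto reversed = s.retro.array;  // autodecoding reverses code points
        // the accent now follows 'b' instead of 'a'
        assert(reversed == "b\u0301a"d);
    }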
May 28 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/28/16 6:59 AM, Marc Schütz wrote:
 The fundamental problem is choosing one of those possibilities over the
 others without knowing what the user actually wants, which is what both
 BEFORE and AFTER do.
OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string, and furthermore iterating for each constituent of a string should be fairly rare. Strings and substrings yes, but not individual points/units/graphemes unless expressly asked. (Indeed some languages treat strings as first-class entities and individual characters are mere short substrings.) So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives. Andrei
May 28 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
 So it harkens back to the original mistake: strings should NOT be arrays with
 the respective primitives.
An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required. A string class does not do that (from the article: "I admit the correct answer is not always clear").
May 28 2016
next sibling parent reply Andrew Godfrey <X y.com> writes:
On Saturday, 28 May 2016 at 19:04:14 UTC, Walter Bright wrote:
 On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
 So it harkens back to the original mistake: strings should NOT 
 be arrays with
 the respective primitives.
An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required. A string class does not do that (from the article: "I admit the correct answer is not always clear").
You're right. An "array of code units" is a very useful low-level primitive. I've dealt with a lot of code that uses these (more or less correctly) in various languages. But when providing such a thing, I think it's very important to make it *look* like a low-level primitive, and to use the type system to distinguish it from higher-level ones. E.g. a string literal should not implicitly convert into an array of code units.

What should it implicitly convert to? I'm not sure. Something close to how it looks in the source code, probably. A sequential range of graphemes? From all the detail in this thread, I wonder now if "a grapheme" is even an unambiguous concept across different environments.

But one thing I'm sure of (and this is from other languages/APIs, not from D specifically): a function which converts from one representation to another, but doesn't keep track of the change (e.g. a different compile-time type; e.g. state in a "string" class about whether it is in normalized form), is a "bug farm".
May 28 2016
parent reply Chris <wendlec tcd.ie> writes:
On Saturday, 28 May 2016 at 22:29:12 UTC, Andrew Godfrey wrote:
[snip]
 From all the detail in this thread, I wonder now if "a 
 grapheme" is even an unambiguous concept across different 
 environments.
Unicode graphemes are not always the same as graphemes in natural (written) languages. If <é> is composed in Unicode, it is still one grapheme in a written language, not two distinct characters. However, in natural languages two characters can be one grapheme, as in English <sh>, it represents the sound in `shower, shop, fish`. In German the same sound is represented by three characters <sch> as in `Schaf` ("sheep"). A bit nit-picky but we should make clear that we talk about "Unicode graphemes" that map to single characters on the written page. But is that at all possible across all languages? To avoid confusion and misunderstandings we should agree on the terminology first.
May 29 2016
parent reply Tobias Müller <troplin bluewin.ch> writes:
On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
 Unicode graphemes are not always the same as graphemes in 
 natural (written) languages. If <é> is composed in Unicode, it 
 is still one grapheme in a written language, not two distinct 
 characters. However, in natural languages two characters can be 
 one grapheme, as in English <sh>, it represents the sound in 
 `shower, shop, fish`. In German the same sound is represented 
 by three characters <sch> as in `Schaf` ("sheep"). A bit 
 nit-picky but we should make clear that we talk about "Unicode 
 graphemes" that map to single characters on the written page. 
 But is that at all possible across all languages?

 To avoid confusion and misunderstandings we should agree on the 
 terminology first.
No, this is well established terminology, you are confusing several things here:

- A grapheme is a "character" as written on the page
- A phoneme is a spoken "character"
- A codepoint is the fundamental "unit" of unicode

Graphemes are built from one or more codepoints. Phonemes are a different topic and not really covered by the unicode standard AFAIK. Except for the IPA notation, but these are again graphemes that represent phonemes.
May 29 2016
next sibling parent reply default0 <Kevin.Labschek gmx.de> writes:
On Sunday, 29 May 2016 at 11:47:30 UTC, Tobias Müller wrote:
 On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
 Unicode graphemes are not always the same as graphemes in 
 natural (written) languages. If <é> is composed in Unicode, it 
 is still one grapheme in a written language, not two distinct 
 characters. However, in natural languages two characters can 
 be one grapheme, as in English <sh>, it represents the sound 
 in `shower, shop, fish`. In German the same sound is 
 represented by three characters <sch> as in `Schaf` ("sheep"). 
 A bit nit-picky but we should make clear that we talk about 
 "Unicode graphemes" that map to single characters on the 
 written page. But is that at all possible across all languages?

 To avoid confusion and misunderstandings we should agree on 
 the terminology first.
No, this is well established terminology, you are confusing several things here: - A grapheme is a "character" as written on the page - A phoneme is a spoken "character" - A codepoint is the fundamental "unit" of unicode Graphemes are built from one or more codepoints. Phonemes are a different topic and not really covered by the unicode standard AFAIK. Except for the IPA notation, but these are again graphemes that represent phonemes.
I am pretty sure that a single grapheme in unicode does not correspond to your notion of "character". I am pretty sure that what you think of as a "character" is officially called "Grapheme Cluster" not "Grapheme". See here: http://www.unicode.org/glossary/#grapheme_cluster
May 29 2016
parent reply Tobias M <troplin bluewin.ch> writes:
On Sunday, 29 May 2016 at 12:08:52 UTC, default0 wrote:
 I am pretty sure that a single grapheme in unicode does not 
 correspond to your notion of "character". I am pretty sure that 
 what you think of as a "character" is officially called 
 "Grapheme Cluster" not "Grapheme".
Grapheme is a linguistic term. AFAIUI, a grapheme cluster is a cluster of codepoints representing a grapheme. It's called "cluster" in the unicode spec, because there is no dedicated grapheme unit. I put "character" into quotes, because the term is not really well defined. I just used it for a short and concise answer. I'm sure there's a better/more correct definition of grapheme/phoneme, but it's probably also much longer and more complicated.
May 29 2016
parent Chris <wendlec tcd.ie> writes:
On Sunday, 29 May 2016 at 13:04:18 UTC, Tobias M wrote:
 On Sunday, 29 May 2016 at 12:08:52 UTC, default0 wrote:
 I am pretty sure that a single grapheme in unicode does not 
 correspond to your notion of "character". I am pretty sure 
 that what you think of as a "character" is officially called 
 "Grapheme Cluster" not "Grapheme".
Grapheme is a linguistic term. AFAIUI, a grapheme cluster is a cluster of codepoints representing a grapheme. It's called "cluster" in the unicode spec, because there there is no dedicated grapheme unit.
 I put "character" into quotes, because the term is not really 
 well defined. I just used it for a short and pregnant answer. 
 I'm sure there's a better/more correct definition of 
 graphem/phoneme, but it's probably also much longer and 
 complicated.
Which is why we need to agree on a terminology, i.e. be clear when we use linguistic terms and when we use Unicode specific terminology.
May 29 2016
prev sibling next sibling parent reply Chris <wendlec tcd.ie> writes:
On Sunday, 29 May 2016 at 11:47:30 UTC, Tobias Müller wrote:
 On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
 Unicode graphemes are not always the same as graphemes in 
 natural (written) languages. If <é> is composed in Unicode, it 
 is still one grapheme in a written language, not two distinct 
 characters. However, in natural languages two characters can 
 be one grapheme, as in English <sh>, it represents the sound 
 in `shower, shop, fish`. In German the same sound is 
 represented by three characters <sch> as in `Schaf` ("sheep"). 
 A bit nit-picky but we should make clear that we talk about 
 "Unicode graphemes" that map to single characters on the 
 written page. But is that at all possible across all languages?

 To avoid confusion and misunderstandings we should agree on 
 the terminology first.
No, this is well established terminology, you are confusing several things here: - A grapheme is a "character" as written on the page - A phoneme is a spoken "character" - A codepoint is the fundamental "unit" of unicode Graphemes are built from one or more codepoints. Phonemes are a different topic and not really covered by the unicode standard AFAIK. Except for the IPA notation, but these are again graphemes that represent phonemes.
Ok, you have a point there; to be precise, <sh> is a multigraph (a digraph) (cf. [1]). In French you can have multigraphs consisting of three or more characters, <eau> /o/, as in Irish <aoi> => /i:/. However, a phoneme is not necessarily a spoken "character", as <sh> represents one phoneme but consists of two "characters" or graphemes. <th> can represent two different phonemes (voiced and unvoiced "th" as in `this` vs. `thorough`).

My point was that we have to be _very_ careful not to mix our cultural experience with written text with machine representations. There's bound to be confusion. That's why we should always make clear what we refer to when we use the words grapheme, character, code point etc.

[1] https://en.wikipedia.org/wiki/Grapheme
May 29 2016
parent reply Tobias M <troplin bluewin.ch> writes:
On Sunday, 29 May 2016 at 12:41:50 UTC, Chris wrote:
 Ok, you have a point there, to be precise <sh> is a multigraph 
 (a digraph)(cf. [1]). In French you can have multigraphs 
 consisting of three or more characters <eau> /o/, as in Irish 
 <aoi> => /i:/. However, a phoneme is not necessarily a spoken 
 "character" as <sh> represents one phoneme but consists of two 
 "characters" or graphemes. <th> can represent two different 
 phonemes (voiced and unvoiced "th" as in `this` vs. `thorough`).
What I meant was, a phoneme is the "character" (smallest unit) in a spoken language, not that it corresponds to a character (whatever that means).
 My point was that we have to be _very_ careful not to mix our 
 cultural experience with written text with machine 
 representations. There's bound to be confusion. That's why we 
 should always make clear what we refer to when we use the words 
 grapheme, character, code point etc.
I used 'character' in quotes, because it's not a well defined term. Code point, grapheme and phoneme are well defined.
May 29 2016
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sun, May 29, 2016 at 01:13:36PM +0000, Tobias M via Digitalmars-d wrote:
 On Sunday, 29 May 2016 at 12:41:50 UTC, Chris wrote:
 Ok, you have a point there, to be precise <sh> is a multigraph (a
 digraph)(cf. [1]). In French you can have multigraphs consisting of
 three or more characters <eau> /o/, as in Irish <aoi> => /i:/.
 However, a phoneme is not necessarily a spoken "character" as <sh>
 represents one phoneme but consists of two "characters" or
 graphemes. <th> can represent two different phonemes (voiced and
 unvoiced "th" as in `this` vs. `thorough`).
What I meant was, a phoneme is the "character" (smallest unit) in a spoken language, not that it corresponds to a character (whatever that means).
[...]

Calling a phoneme a "character" is misleading. A phoneme is a logical sound unit in a spoken language, whereas a "character" is a unit of written language. The two do not necessarily have a direct correspondence (or even any correspondence whatsoever).

In a language like English, whose writing system was codified many hundreds of years ago, the spoken language has sufficiently diverged from the written language (specifically, in the way words are spelt) that the correspondence between the two is complex at best, downright arbitrary at worst. For example, the 'o' in "women" and the 'i' in "fish" map to the same phoneme, the short /i/, in (common dialects of) spoken English, in spite of being two completely different characters. Therefore conflating "character" and "phoneme" is misleading and is only confusing the issue.

As far as Unicode is concerned, it is a standard for representing *written* text, not spoken language, so concepts like phonemes aren't even relevant in the first place. Let's not get derailed from the present discussion by confusing the two.

T

--
What are you when you run out of Monet? Baroque.
May 29 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/29/2016 5:56 PM, H. S. Teoh via Digitalmars-d wrote:
 As far as Unicode is concerned, it is a standard for representing
 *written* text, not spoken language, so concepts like phonemes aren't
 even relevant in the first place.  Let's not get derailed from the
 present discussion by confusing the two.
As far as D is concerned, we are not going to invent our own concepts around text that is different from Unicode or redefine Unicode terms. Unicode is what it is, and D is going to work with it.
May 29 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/29/2016 4:47 AM, Tobias Müller wrote:
 No, this is well established terminology, you are confusing several things
here:
For D, we should stick with the terminology as defined by Unicode.
May 29 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/28/2016 03:04 PM, Walter Bright wrote:
 On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
 So it harkens back to the original mistake: strings should NOT be
 arrays with
 the respective primitives.
An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.
Nope. Not buying it.
 A string class does not do that
Buying it. -- Andrei
May 30 2016
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 30.05.2016 18:01, Andrei Alexandrescu wrote:
 On 05/28/2016 03:04 PM, Walter Bright wrote:
 On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
 So it harkens back to the original mistake: strings should NOT be
 arrays with
 the respective primitives.
An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.
Nope. Not buying it.
I'm buying it. IMO alias string=immutable(char)[] is the most useful choice, and auto-decoding ideally wouldn't exist.
May 30 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 03:04 PM, Timon Gehr wrote:
 On 30.05.2016 18:01, Andrei Alexandrescu wrote:
 On 05/28/2016 03:04 PM, Walter Bright wrote:
 On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
 So it harkens back to the original mistake: strings should NOT be
 arrays with
 the respective primitives.
An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.
Nope. Not buying it.
I'm buying it. IMO alias string=immutable(char)[] is the most useful choice, and auto-decoding ideally wouldn't exist.
Wouldn't D then be seen (and rightfully so) as largely not supporting Unicode, seeing as its many many core generic algorithms seem to randomly work or not on arrays of characters? Wouldn't ranges - the most important artifact of D's stdlib - default for strings on the least meaningful approach to strings (dumb code units)? Would a smattering of Unicode primitives in std.utf and friends entitle us to claim D had dyed Unicode in its wool? (All are not rhetorical.)

I.e. wouldn't we be in a worse place than now? (This is rhetorical.) The best argument for autodecoding is to contemplate where we'd be without it: the ghetto of Unicode string handling.

I'm not going to debate this further (though I'll look for meaningful answers to the questions above). But this thread has been informative in that it did little to change my conviction that autodecoding is a good thing for D, all things considered (i.e. the wrong decision to not encapsulate string as a separate type distinct from bare array of code units). I'd lie if I said it did nothing. It did, but only a little.

Funny thing is that's not even what's important. What's important is that autodecoding is here to stay - there's no realistic way to eliminate it from D. So the focus should be making autodecoding the best it could ever be.

Andrei
May 30 2016
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, May 30, 2016 at 03:28:38PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 05/30/2016 03:04 PM, Timon Gehr wrote:
 On 30.05.2016 18:01, Andrei Alexandrescu wrote:
 On 05/28/2016 03:04 PM, Walter Bright wrote:
 On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
 So it harkens back to the original mistake: strings should NOT
 be arrays with the respective primitives.
An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.
Nope. Not buying it.
I'm buying it. IMO alias string=immutable(char)[] is the most useful choice, and auto-decoding ideally wouldn't exist.
Wouldn't D then be seen (and rightfully so) as largely not supporting Unicode, seeing as its many many core generic algorithms seem to randomly work or not on arrays of characters?
They already randomly work or not work on ranges of dchar. I hope we don't have to rehash all the examples of why things that seem to work, like count, filter, map, etc., actually *don't* work outside of a very narrow set of languages. The best of all this is that they *both* don't work properly *and* make your program pay for the performance overhead, even when you're not even using them -- thanks to ubiquitous autodecoding.
 Wouldn't ranges - the most important artifact of D's stdlib - default
 for strings on the least meaningful approach to strings (dumb code
 units)?
No, ideally there should *not* be a default range type -- the user needs to specify what he wants to iterate by, whether code unit, code point, or grapheme, etc..
 Would a smattering of Unicode primitives in std.utf and friends
 entitle us to claim D had dyed Unicode in its wool? (All are not
 rhetorical.)
I have no idea what this means.
 I.e. wouldn't be in a worse place than now? (This is rhetorical.) The
 best argument for autodecoding is to contemplate where we'd be without
 it: the ghetto of Unicode string handling.
I've no idea what you're talking about. Without autodecoding we'd actually have faster string handling, and forcing the user to specify the unit of iteration would actually bring more Unicode-awareness which would improve the quality of string handling code, instead of proliferating today's wrong code that just happens to work in some languages but makes a hash of things everywhere else.
 I'm not going to debate this further (though I'll look for meaningful
 answers to the questions above). But this thread has been informative
 in that it did little to change my conviction that autodecoding is a
 good thing for D, all things considered (i.e. the wrong decision to
 not encapsulate string as a separate type distinct from bare array of
 code units). I'd lie if I said it did nothing. It did, but only a
 little.
 
 Funny thing is that's not even what's important. What's important is
 that autodecoding is here to stay - there's no realistic way to
 eliminate it from D. So the focus should be making autodecoding the
 best it could ever be.
[...] If I ever had to write string-heavy code, I'd probably fork Phobos just so I can get decent performance. Just sayin'. T -- People walk. Computers run.
May 30 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2016 12:52 PM, H. S. Teoh via Digitalmars-d wrote:
 If I ever had to write string-heavy code, I'd probably fork Phobos just
 so I can get decent performance. Just sayin'.
When I wrote Warp, the only point of which was speed, I couldn't use phobos because of autodecoding. I have since recoded a number of phobos functions so they didn't autodecode, so the situation is better.
May 30 2016
parent reply Chris <wendlec tcd.ie> writes:
On Monday, 30 May 2016 at 21:39:00 UTC, Walter Bright wrote:
 On 5/30/2016 12:52 PM, H. S. Teoh via Digitalmars-d wrote:
 If I ever had to write string-heavy code, I'd probably fork 
 Phobos just
 so I can get decent performance. Just sayin'.
When I wrote Warp, the only point of which was speed, I couldn't use phobos because of autodecoding. I have since recoded a number of phobos functions so they didn't autodecode, so the situation is better.
Two questions:

1. Given your experience with Warp, how hard would it be to clean Phobos up?

2. After recoding a number of Phobos functions, how much code did actually break (yours or someone else's)?
May 31 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/31/2016 1:57 AM, Chris wrote:
 1. Given you experience with Warp, how hard would it be to clean Phobos up?
It's not hard, it's just a bit tedious.
 2. After recoding a number of Phobos functions, how much code did actually
break
 (yours or someone else's)?.
It's been a while so I don't remember exactly, but as I recall if the API had to change, I created a new overload or a new name, and left the old one as it is. For the std.path functions, I just changed them. While that technically changed the API, I'm not aware of any actual problems it caused. (Decoding file strings is a latent bug anyway, as pointed out elsewhere in this thread. It's a change that had to be made sooner or later.)
May 31 2016
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 30.05.2016 21:28, Andrei Alexandrescu wrote:
 On 05/30/2016 03:04 PM, Timon Gehr wrote:
 On 30.05.2016 18:01, Andrei Alexandrescu wrote:
 On 05/28/2016 03:04 PM, Walter Bright wrote:
 On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
 So it harkens back to the original mistake: strings should NOT be
 arrays with
 the respective primitives.
An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.
Nope. Not buying it.
I'm buying it. IMO alias string=immutable(char)[] is the most useful choice, and auto-decoding ideally wouldn't exist.
Wouldn't D then be seen (and rightfully so) as largely not supporting Unicode, seeing as its many many core generic algorithms seem to randomly work or not on arrays of characters?
In D, enum does not mean enumeration, const does not mean constant, pure is not pure, lazy is not lazy, and char does not mean character.
 Wouldn't ranges - the most
 important artifact of D's stdlib - default for strings on the least
 meaningful approach to strings (dumb code units)?
I don't see how that's the least meaningful approach. It's the data that you actually have sitting in memory. It's the data that you can slice and index and get a length for in constant time.
 Would a smattering of
 Unicode primitives in std.utf and friends entitle us to claim D had dyed
 Unicode in its wool? (All are not rhetorical.)
...
We should support Unicode by having all the required functionality and properly documenting the data formats used. What is the goal here? I.e. what does a language that has "Unicode dyed in its wool" have that other languages do not? Why isn't it enough to provide data types for UTF8/16/32 and Unicode algorithms operating on them?
 I.e. wouldn't be in a worse place than now? (This is rhetorical.) The
 best argument for autodecoding is to contemplate where we'd be without
 it: the ghetto of Unicode string handling.
 ...
Those questions seem to be mostly marketing concerns. I'm more concerned with whether I find it convenient to use. Autodecoding does not improve Unicode support.
 I'm not going to debate this further (though I'll look for meaningful
 answers to the questions above). But this thread has been informative in
 that it did little to change my conviction that autodecoding is a good
 thing for D, all things considered (i.e. the wrong decision to not
 encapsulate string as a separate type distinct from bare array of code
 units). I'd lie if I said it did nothing. It did, but only a little.

 Funny thing is that's not even what's important. What's important is
 that autodecoding is here to stay - there's no realistic way to
 eliminate it from D. So the focus should be making autodecoding the best
 it could ever be.


 Andrei
Sure, I didn't mean to engage in a debate (it seems there is no decision to be made here that might affect me in the future).
May 30 2016
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 05/30/2016 04:30 PM, Timon Gehr wrote:
 In D, enum does not mean enumeration, const does not mean constant, pure
 is not pure, lazy is not lazy, and char does not mean character.
My new favorite quote :)
May 30 2016
prev sibling next sibling parent Jack Stouffer <jack jackstouffer.com> writes:
On Saturday, 28 May 2016 at 12:04:20 UTC, Andrei Alexandrescu 
wrote:
 OK, that's a fair argument, thanks. So it seems there should be 
 no "default" way to iterate a string
Yes!
 So it harkens back to the original mistake: strings should NOT 
 be arrays with the respective primitives.
If you're proposing a library type, a la RCStr, as an alternative then yeah.
May 28 2016
prev sibling next sibling parent Dicebot <public dicebot.lv> writes:
On 05/28/2016 03:04 PM, Andrei Alexandrescu wrote:
 On 5/28/16 6:59 AM, Marc Schütz wrote:
 The fundamental problem is choosing one of those possibilities over the
 others without knowing what the user actually wants, which is what both
 BEFORE and AFTER do.
OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string.
Ideally there should not be a way to iterate a (unicode) string at all without explicitly stating the mode of operation, i.e.

struct String
{
    private void[] data;

    CodeUnitRange  byCodeUnit  ( );
    CodePointRange byCodePoint ( );
    GraphemeRange  byGrapheme  ( );
    bool normalize ( );
}

(byGrapheme and normalize have rather expensive dependencies so probably better to provide those via UFCS on demand)
May 29 2016
prev sibling parent reply Marc Schütz <schuetzm gmx.net> writes:
On Saturday, 28 May 2016 at 12:04:20 UTC, Andrei Alexandrescu 
wrote:
 On 5/28/16 6:59 AM, Marc Schütz wrote:
 The fundamental problem is choosing one of those possibilities 
 over the
 others without knowing what the user actually wants, which is 
 what both
 BEFORE and AFTER do.
OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string, and furthermore iterating for each constituent of a string should be fairly rare. Strings and substrings yes, but not individual points/units/graphemes unless expressly asked. (Indeed some languages treat strings as first-class entities and individual characters are mere short substrings.) So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.
I think this is going too far. It's sufficient if they (= char slices, not ranges) can't be iterated over directly, i.e. aren't input ranges (and maybe don't work with foreach). That would force the user to append .byCodeUnit etc. as needed. This provides a very nice deprecation path, by the way, it's just not clear whether it can be implemented with the way `deprecated` currently works. I.e. deprecate/warn every time auto decoding kicks in, print a nice message to the user, and later remove auto decoding and make isInputRange!string return false.
May 30 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 07:58 AM, Marc Schütz wrote:
 On Saturday, 28 May 2016 at 12:04:20 UTC, Andrei Alexandrescu wrote:
 On 5/28/16 6:59 AM, Marc Schütz wrote:
 The fundamental problem is choosing one of those possibilities over the
 others without knowing what the user actually wants, which is what both
 BEFORE and AFTER do.
OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string, and furthermore iterating for each constituent of a string should be fairly rare. Strings and substrings yes, but not individual points/units/graphemes unless expressly asked. (Indeed some languages treat strings as first-class entities and individual characters are mere short substrings.) So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.
I think this is going too far. It's sufficient if they (= char slices, not ranges) can't be iterated over directly, i.e. aren't input ranges (and maybe don't work with foreach).
That's... what I said. -- Andrei
May 30 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 30 May 2016 at 12:45:27 UTC, Andrei Alexandrescu wrote:
 That's... what I said. -- Andrei
You said "not arrays", he said "not ranges". So that just means making the std.range.primitives.popFront and front add a constraint if(!isSomeString()). Language built-ins still work, but the library rejects them. Indeed, we could add a deprecated overload then that points people to the other range getter methods (byCodeUnit, byCodePoint, byGrapheme, etc.)... this might be our migration path.
May 30 2016
parent reply Seb <seb wilzba.ch> writes:
On Monday, 30 May 2016 at 12:59:08 UTC, Adam D. Ruppe wrote:
 On Monday, 30 May 2016 at 12:45:27 UTC, Andrei Alexandrescu 
 wrote:
 That's... what I said. -- Andrei
You said "not arrays", he said "not ranges". So that just means making the std.range.primitives.popFront and front add a constraint if(!isSomeString()). Language built-ins still work, but the library rejects them. Indeed, we could add a deprecated overload then that points people to the other range getter methods (byCodeUnit, byCodePoint, byGrapheme, etc.)... this might be our migration path.
That's a great idea - the compiler should also issue deprecation warnings when I try to do things like:

string a = "你好";

a[1]; // deprecation: direct access to a Unicode string is highly error-prone. Please specify the type of access. More details (shortlink)

a[1] = 'b'; // deprecation: direct index assignment to a Unicode string is ...

a.length; // deprecation: a Unicode string has multiple definitions of length. Please specify your iteration (...). More details (shortlink)

...

Btw should a[] be an alias for `byCodeUnit` or also trigger a warning?
May 30 2016
next sibling parent reply ag0aep6g <anonymous example.com> writes:
On 05/30/2016 04:35 PM, Seb wrote:
 That's a great idea - the compiler should also issue deprecation
 warnings when I try to do things like:

 string a  = "你好";

 a[1]; // deprecation: direct access to a Unicode string is highly
 error-prone. Please specify the type of access. More details (shortlink)

 a[1] = "b"; // deprecation: direct index assignment to a Unicode string
 is ...

 a.length; // deprecation: a Unicode string has multiple definitions of
 length. Please specify your iteration (...). More details (shortlink)

 ...

 Btw should a[] be an alias for `byCodeUnit` or also trigger a warning?
All this is only sensible when we move to a dedicated string type that's not just an alias of `immutable(char)[]`. `immutable(char)[]` explicitly is an array of code units. It would not be acceptable, in my opinion, if the normal array syntax got broken for it.
May 30 2016
parent reply Marc Schütz <schuetzm gmx.net> writes:
On Monday, 30 May 2016 at 14:56:36 UTC, ag0aep6g wrote:
 All this is only sensible when we move to a dedicated string 
 type that's not just an alias of `immutable(char)[]`.

 `immutable(char)[]` explicitly is an array of code units. It 
 would not be acceptable, in my opinion, if the normal array 
 syntax got broken for it.
I agree; most of the troubles have been with auto-decoding. In an ideal world, we'd also want to change the way `length` and `opIndex` work, but if we only fix the range primitives, we've achieved almost as much with fewer compatibility problems.
May 30 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2016 8:34 AM, Marc Schütz wrote:
 In an ideal world, we'd also want to change the way `length` and `opIndex`
work,
Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.
May 30 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/30/16 5:51 PM, Walter Bright wrote:
 On 5/30/2016 8:34 AM, Marc Schütz wrote:
 In an ideal world, we'd also want to change the way `length` and
 `opIndex` work,
Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.
That's not an argument. Objects are arrays of bytes, or tuples of their fields, etc. The whole point of encapsulation is superimposing a more structured view on top of the representation. Operating on open-heart representation is risky, and strings are no exception. -- Andrei
May 30 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2016 9:16 PM, Andrei Alexandrescu wrote:
 On 5/30/16 5:51 PM, Walter Bright wrote:
 On 5/30/2016 8:34 AM, Marc Schütz wrote:
 In an ideal world, we'd also want to change the way `length` and
 `opIndex` work,
Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.
That's not an argument.
Consistency is a factual argument, and autodecode is not consistent.
 Objects are arrays of bytes, or tuples of their fields,
 etc. The whole point of encapsulation is superimposing a more structured view
on
 top of the representation. Operating on open-heart representation is risky, and
 strings are no exception.
If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it. Autodecoding is not it.
May 31 2016
next sibling parent deadalnix <deadalnix gmail.com> writes:
On Tuesday, 31 May 2016 at 07:56:54 UTC, Walter Bright wrote:
 On 5/30/2016 9:16 PM, Andrei Alexandrescu wrote:
 On 5/30/16 5:51 PM, Walter Bright wrote:
 On 5/30/2016 8:34 AM, Marc Schütz wrote:
 In an ideal world, we'd also want to change the way `length` 
 and
 `opIndex` work,
Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.
That's not an argument.
Consistency is a factual argument, and autodecode is not consistent.
+1
 Objects are arrays of bytes, or tuples of their fields,
 etc. The whole point of encapsulation is superimposing a more 
 structured view on
 top of the representation. Operating on open-heart 
 representation is risky, and
 strings are no exception.
If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it. Autodecoding is not it.
Thing is, more info is needed to support unicode properly. Collation for instance.
May 31 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/16 3:56 AM, Walter Bright wrote:
 On 5/30/2016 9:16 PM, Andrei Alexandrescu wrote:
 On 5/30/16 5:51 PM, Walter Bright wrote:
 On 5/30/2016 8:34 AM, Marc Schütz wrote:
 In an ideal world, we'd also want to change the way `length` and
 `opIndex` work,
Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.
That's not an argument.
Consistency is a factual argument, and autodecode is not consistent.
Consistency with what? Consistent with what?
 Objects are arrays of bytes, or tuples of their fields,
 etc. The whole point of encapsulation is superimposing a more
 structured view on
 top of the representation. Operating on open-heart representation is
 risky, and
 strings are no exception.
If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it.
It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges. -- Andrei
May 31 2016
next sibling parent deadalnix <deadalnix gmail.com> writes:
On Tuesday, 31 May 2016 at 15:07:09 UTC, Andrei Alexandrescu 
wrote:
 Consistency with what? Consistent with what?
It is a slice type. It should work as a slice type. Every other design stinks.
May 31 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 11:07:09 Andrei Alexandrescu via Digitalmars-d wrote:
 On 5/31/16 3:56 AM, Walter Bright wrote:
 If there is an abstraction for strings that is efficient, consistent,
 useful, and hides the fact that it is UTF, I am not aware of it.
It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges.
Not exactly. Such a string type does not hide the fact that it's UTF. Rather, it forces you to deal with the fact that it's UTF. I have to agree with Walter in that there really isn't a way to automatically handle Unicode correctly and efficiently while hiding the fact that it's doing all of the stuff that has to be done for UTF.

That being said, while an array of code units is really what a string should be underneath the hood, having a string type that provides byCodeUnit, byCodePoint, and byGrapheme is an improvement over treating immutable(char)[] as string, even if byCodeUnit returns immutable(char)[], because it forces the programmer to decide what they want to do rather than blindly operate on immutable(char)[] as if a char were a full character. And as long as it provides access to each level of Unicode, then it's possible for programmers who know what they're doing to efficiently operate on Unicode while simultaneously making it much more obvious to those who don't know what they're doing that they don't know what they're doing, rather than having them blindly act like char is a full character.

There's really no reason why we couldn't define a string type that operated that way while continuing to treat arrays of char the way that we do now in the language, though transitioning to such a scheme is not at all straightforward in terms of avoiding code breakage. Defining a String type would be simple enough, and any function in Phobos which accepted a string could be changed to accept a String, but we'd have problems with many functions which currently return string, since changing what they returned would break code. But even if Phobos were somehow completely changed over to use a new String type, and even if the string alias were deprecated/removed, we'd still have to deal with arrays of char, wchar, and dchar and run the risk of someone using those and having problems, because they didn't treat them as arrays of code units. We can't really prevent that, just make it so that string/String is something else that makes the Unicode issue obvious so that folks are less likely to blindly treat chars as full characters. But even then, it's not like it would be hard for folks to just use the wrong Unicode level. All we'd really be doing is shoving the issue in their face so that they'd have to acknowledge it on some level and maybe then actually learn enough to operate on Unicode strings correctly.

But then again, since all you're really doing at that point is shoving the Unicode issues in folks' faces by not treating strings as ranges or indexable and forcing them to call byCodeUnit, byCodePoint, byGrapheme, etc., I don't know that it actually solves much over treating immutable(char)[] as string. Programmers still have to learn Unicode enough to handle it correctly, just like they do now (whether we have autodecoding or not). And such a string type really doesn't make the Unicode handling any easier. It just makes it harder to ignore the Unicode issues.

The Unicode problem is a lot like the floating point problems that have been discussed recently. Programmers want it to "just work" without them having to worry about the details, but that really doesn't work, and while the average programmer may not understand either floating point operations or Unicode properly, the average programmer does actually have to work with both on a regular basis.
I'm not at all convinced that having string be an alias of immutable(char)[] was a mistake, but having a struct that's not a range may very well be an improvement. It _would_ at least make some of the Unicode issues more obvious, but it doesn't really solve much from what I can see. - Jonathan M Davis
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 12:45 PM, Jonathan M Davis via Digitalmars-d wrote:
 On Tuesday, May 31, 2016 11:07:09 Andrei Alexandrescu via Digitalmars-d wrote:
 On 5/31/16 3:56 AM, Walter Bright wrote:
 If there is an abstraction for strings that is efficient, consistent,
 useful, and hides the fact that it is UTF, I am not aware of it.
It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges.
Not exactly. Such a string type does not hide the fact that it's UTF. Rather, it forces you to deal with the fact that its UTF.
How is that different from what I said? -- Andrei
May 31 2016
parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 13:01:11 Andrei Alexandrescu via Digitalmars-d wrote:
 On 05/31/2016 12:45 PM, Jonathan M Davis via Digitalmars-d wrote:
 On Tuesday, May 31, 2016 11:07:09 Andrei Alexandrescu via Digitalmars-d 
wrote:
 On 5/31/16 3:56 AM, Walter Bright wrote:
 If there is an abstraction for strings that is efficient, consistent,
 useful, and hides the fact that it is UTF, I am not aware of it.
It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges.
Not exactly. Such a string type does not hide the fact that it's UTF. Rather, it forces you to deal with the fact that its UTF.
How is that different from what I said? -- Andrei
My point was that Walter was stating that you can't have a type that hides the fact that it's dealing with Unicode while still being efficient, whereas you mentioned a proposal for a type that does not hide the fact that it's dealing with Unicode. So, you weren't really responding with a type that rebutted Walter's statement. Rather, you responded with a type that attempts to make its Unicode nature more explicit than immutable(char)[]. - Jonathan M Davis
May 31 2016
prev sibling parent reply Marc Schütz <schuetzm gmx.net> writes:
On Monday, 30 May 2016 at 21:51:36 UTC, Walter Bright wrote:
 On 5/30/2016 8:34 AM, Marc Schütz wrote:
 In an ideal world, we'd also want to change the way `length` 
 and `opIndex` work,
Why? strings are arrays of code units.
So, strings are _implemented_ as arrays of code units. But indiscriminately treating them as such in all situations leads to wrong results (just like arrays of code points would). In an ideal world, the programs someone intuitively writes will do the right thing, and if they can't, they at least refuse to compile. If we agree that it's up to the user whether to iterate over a string by code unit or code points or graphemes, and that we shouldn't arbitrarily choose one of those (except when we know that it's what the user wants), then the same applies to indexing, slicing and counting. On the other hand, changing such low-level things will likely be impractical, that's why I said "In an ideal world".
 All the trouble comes from erratically pretending otherwise.
For me, the trouble comes from pretending otherwise _without being told to_.

To make sure there are no misunderstandings, here is what is suggested as an alternative to the current situation:

* `char[]`, `wchar[]` (and `dchar[]`?) no longer pass `isInputRange`.
* Ranges with element type `char`, `wchar`, and `dchar` do pass `isInputRange`.
* A bunch of rangeifying helpers are added to `std.string` (I believe they are already there): `byCodePoint`, `byCodeUnit`, `byChar`, `byWchar`, `byDchar`, ...
* Algorithms like `find`, `join(er)` get overloads that accept char slices directly.
* Built-in operators and `length` of char slices are unchanged.

Advantages:

* Algorithms that can work _correctly_ without any kind of decoding will do so.
* Algorithms that would yield incorrect results won't compile, requiring the user to make a decision regarding the desired element type.
* No auto-decoding.
  => Best performance depending on the actual requirements.
  => No results that look correct when tested with only precomposed characters but are wrong in the general case.
* Behaviour of [] and .length is no worse than today.
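A small sketch of what user code could look like under such a scheme (assuming the byCodeUnit/byGrapheme helpers that already exist in std.utf/std.uni; illustrative only):

import std.range : walkLength;
import std.utf : byCodeUnit;
import std.uni : byGrapheme;

void main()
{
    string s = "año";

    // Generic algorithms require an explicitly chosen view of the string:
    assert(s.byCodeUnit.walkLength == 4); // UTF-8 code units
    assert(s.byGrapheme.walkLength == 3); // user-perceived characters

    // Built-in indexing, slicing and .length stay code-unit based:
    assert(s.length == 4);
    assert(s[0 .. 1] == "a");
}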
May 31 2016
next sibling parent reply Seb <seb wilzba.ch> writes:
On Tuesday, 31 May 2016 at 13:33:14 UTC, Marc Schütz wrote:
 On Monday, 30 May 2016 at 21:51:36 UTC, Walter Bright wrote:
 [...]
So, strings are _implemented_ as arrays of code units. But indiscriminately treating them as such in all situations leads to wrong results (just like arrays of code points would). [...]
If we follow Adam's proposal to deprecate front, back, popFront and popBack, we don't even need to touch the compiler and it's trivial to do so. The proof of concept change needs eight lines. https://github.com/dlang/phobos/pull/4384 Explicitly stating the type of iteration in the 132 places with auto-decoding in Phobos doesn't sound that terrible.
May 31 2016
next sibling parent ag0aep6g <anonymous example.com> writes:
On 05/31/2016 04:33 PM, Seb wrote:
 https://github.com/dlang/phobos/pull/4384

 Explicitly stating the type of iteration in the 132 places with
 auto-decoding in Phobos doesn't sound that terrible.
After checking some of those 132 places, they are in generic functions that take ranges. std.algorithm.equal, std.range.take - stuff like that. That's expected, of course, as the range primitives are used there. But those places are not the ones we'd have to fix. We'd have to fix the code that uses those generic functions on strings.
May 31 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/16 10:33 AM, Seb wrote:
 Explicitly stating the type of iteration in the 132 places with
 auto-decoding in Phobos doesn't sound that terrible.
It is terrible, no two ways about it. We've been very very careful with changes that caused a handful of breakages in Phobos. It really means every D project on the planet will be broken. We can't contemplate that, it's suicide. -- Andrei
May 31 2016
prev sibling parent Kagamin <spam here.lot> writes:
On Tuesday, 31 May 2016 at 13:33:14 UTC, Marc Schütz wrote:
 In an ideal world, the programs someone intuitively writes will 
 do the right thing, and if they can't, they at least refuse to 
 compile. If we agree that it's up to the user whether to 
 iterate over a string by code unit or code points or graphemes, 
 and that we shouldn't arbitrarily choose one of those (except 
 when we know that it's what the user wants), then the same 
 applies to indexing, slicing and counting.
If the user doesn't know how he wants to iterate and you leave the decision to the user... erm... it's not going to give correct result :)
May 31 2016
prev sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 30 May 2016 at 14:35:03 UTC, Seb wrote:
 That's a great idea - the compiler should also issue 
 deprecation warnings when I try to do things like:
I don't agree on changing those. Indexing and slicing a char[] is really useful and actually not hard to do correctly (at least with regard to handling code units). Besides, it'd be a much bigger change than the library transition.
May 30 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
 I don't agree on changing those. Indexing and slicing a char[] is really useful
 and actually not hard to do correctly (at least with regard to handling code
 units).
Yup. It isn't hard at all to use arrays of codeunits correctly.
May 30 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/30/16 6:00 PM, Walter Bright wrote:
 On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
 I don't agree on changing those. Indexing and slicing a char[] is
 really useful
 and actually not hard to do correctly (at least with regard to
 handling code
 units).
Yup. It isn't hard at all to use arrays of codeunits correctly.
Trouble is, it isn't hard at all to use arrays of codeunits incorrectly, too. -- Andrei
May 30 2016
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, May 31, 2016 at 12:13:57AM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 5/30/16 6:00 PM, Walter Bright wrote:
 On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
 I don't agree on changing those. Indexing and slicing a char[] is
 really useful and actually not hard to do correctly (at least with
 regard to handling code units).
Yup. It isn't hard at all to use arrays of codeunits correctly.
Trouble is, it isn't hard at all to use arrays of codeunits incorrectly, too. -- Andrei
Neither does autodecoding make code any more correct. It just better hides the fact that the code is wrong. T -- I've been around long enough to have seen an endless parade of magic new techniques du jour, most of which purport to remove the necessity of thought about your programming problem. In the end they wind up contributing one or two pieces to the collective wisdom, and fade away in the rearview mirror. -- Walter Bright
May 30 2016
parent reply default0 <Kevin.Labschek gmx.de> writes:
On Tuesday, 31 May 2016 at 06:45:56 UTC, H. S. Teoh wrote:
 On Tue, May 31, 2016 at 12:13:57AM -0400, Andrei Alexandrescu 
 via Digitalmars-d wrote:
 On 5/30/16 6:00 PM, Walter Bright wrote:
 On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
 I don't agree on changing those. Indexing and slicing a 
 char[] is really useful and actually not hard to do 
 correctly (at least with regard to handling code units).
Yup. It isn't hard at all to use arrays of codeunits correctly.
Trouble is, it isn't hard at all to use arrays of codeunits incorrectly, too. -- Andrei
Neither does autodecoding make code anymore correct. It just better hides the fact that the code is wrong. T
Thinking about this a bit more - what algorithms are actually correct when implemented on the level of code units? Off the top of my head I can only really think of copying and hashing, since you want to do that on the byte level anyways. I would also think that if you know your strings are normalized in the same normalization form (for example because they come from the same normalized source), you can check two strings for equality on the code unit level, but my understanding of unicode is still quite lacking, so I'm not sure on that.
May 31 2016
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Tue, 31 May 2016 07:17:03 +0000, default0 <Kevin.Labschek gmx.de> wrote:

 Thinking about this a bit more - what algorithms are actually 
 correct when implemented on the level of code units?
Calculating the buffer size of a string, validation and fast versions of general algorithms that can be defined in terms of ASCII, like skipAsciiWhitespace(), splitByComma(), splitByLineAscii().
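For illustration, a sketch of one such helper (skipAsciiWhitespace is just an illustrative name from the list above, not an existing Phobos function); it can operate purely on code units because every ASCII byte stands for itself in UTF-8 and never occurs inside a multi-byte sequence:

inout(char)[] skipAsciiWhitespace(inout(char)[] s)
{
    size_t i;
    while (i < s.length &&
           (s[i] == ' ' || s[i] == '\t' || s[i] == '\n' || s[i] == '\r'))
        ++i;
    return s[i .. $];
}

unittest
{
    assert(skipAsciiWhitespace("  \tпривет") == "привет");
}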
 I would also think that if you know your strings are normalized 
 in the same normalization form (for example because they come 
 from the same normalized source), you can check two strings for 
 equality on the code unit level, but my understanding of unicode 
 is still quite lacking, so I'm not sure on that.
That's correct. -- Marco
May 31 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 07:17:03 default0 via Digitalmars-d wrote:
 Thinking about this a bit more - what algorithms are actually
 correct when implemented on the level of code units?
 Off the top of my head I can only really think of copying and
 hashing, since you want to do that on the byte level anyways.
 I would also think that if you know your strings are normalized
 in the same normalization form (for example because they come
 from the same normalized source), you can check two strings for
 equality on the code unit level, but my understanding of unicode
 is still quite lacking, so I'm not sure on that.
Equality does not require decoding. Similarly, functions like find don't either. Something like filter generally would, but it's also not particularly normal to filter a string on a by-character basis. You'd probably want to get to at least the word level in that case.

To make matters worse, functions like find or splitter are frequently used to look for ASCII delimiters, even when the strings themselves contain Unicode characters. So, even if decoding were necessary when looking for a Unicode character, it's utterly wasteful when the character you're looking for is ASCII. But searching generally does not require decoding so long as the same character is always encoded the same way.

So, Unicode normalization _can_ be a problem, but that's a problem with code points as well as code units (since the normalization has to do with the order of code points when multiple code points make up a single grapheme). You'd have to go to the grapheme level to avoid that problem. And that's why, at least some of the time, string-processing code is going to need to normalize its strings before doing searches. But the searches themselves can then operate at the code unit level.

- Jonathan M Davis
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 12:54 PM, Jonathan M Davis via Digitalmars-d wrote:
 Equality does not require decoding. Similarly, functions like find don't
 either. Something like filter generally would, but it's also not
 particularly normal to filter a string on a by-character basis. You'd
 probably want to get to at least the word level in that case.
It's nice that the stdlib takes care of that.
 To make matters worse, functions like find or splitter are frequently used
 to look for ASCII delimiters, even when the strings themselves contain
 Unicode characters. So, even if decoding were necessary when looking for a
 Unicode character, it's utterly wasteful when the character you're looking
 for is ASCII.
Good idea. We could overload functions such as find on char, wchar, and dchar. Jonathan, could you look into a PR to do that?
 But searching generally does not require decoding so long as
 the same character is always encoded the same way.
Yah, a good rule of thumb is to get the same (consistent, heh) results for a given string (including a given normalization) regardless of the encoding used. So e.g. it's nice that walkLength yields the same number for the string whether it's UTF8/16/32.

Andrei
May 31 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
On Tue, 31 May 2016 13:06:16 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 On 05/31/2016 12:54 PM, Jonathan M Davis via Digitalmars-d wrote:
 Equality does not require decoding. Similarly, functions like find don't
 either. Something like filter generally would, but it's also not
 particularly normal to filter a string on a by-character basis. You'd
probably want to get to at least the word level in that case.
It's nice that the stdlib takes care of that.
Both "equality" and "find" require byGrapheme.

⇰ The equivalence algorithm first brings both strings to a common normalization form (NFD or NFC), which works on one grapheme cluster at a time and afterwards does the binary comparison.
http://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence

⇰ Find would yield false positives for the start of grapheme clusters. I.e. will match 'o' in an NFD "ö" (simplified example).
http://www.unicode.org/reports/tr10/#Searching

--
Marco
May 31 2016
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 31-May-2016 01:00, Walter Bright wrote:
 On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
 I don't agree on changing those. Indexing and slicing a char[] is
 really useful
 and actually not hard to do correctly (at least with regard to
 handling code
 units).
Yup. It isn't hard at all to use arrays of codeunits correctly.
Ehm as long as all you care for is operating on substrings I'd say. Working with individual characters requires either decoding or clever tricks like operating on encoded UTF directly. -- Dmitry Olshansky
May 31 2016
next sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 22:47:56 Dmitry Olshansky via Digitalmars-d wrote:
 On 31-May-2016 01:00, Walter Bright wrote:
 On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
 I don't agree on changing those. Indexing and slicing a char[] is
 really useful
 and actually not hard to do correctly (at least with regard to
 handling code
 units).
Yup. It isn't hard at all to use arrays of codeunits correctly.
Ehm as long as all you care for is operating on substrings I'd say. Working with individual characters requires either decoding or clever tricks like operating on encoded UTF directly.
Yeah, but Phobos provides the tools to do that reasonably easily even when autodecoding isn't involved. Sure, it's slightly more tedious to call std.utf.decode or std.utf.encode yourself rather than letting autodecoding take care of it, but it's easy enough to do and allows you to control when it's done. And we have stuff like byChar!dchar or byGrapheme for the cases where you don't want to actually operate on arrays of code units. - Jonathan M Davis
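A minimal sketch of doing that decoding by hand with std.utf.decode (the counting loop is only for illustration):

import std.utf : decode;

void main()
{
    string s = "aé漢";   // 1 + 2 + 3 UTF-8 code units, 3 code points
    size_t i, count;
    while (i < s.length)
    {
        dchar c = decode(s, i);  // decodes one code point, advances i past its code units
        ++count;
    }
    assert(count == 3);
}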
May 31 2016
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, May 31, 2016 at 10:47:56PM +0300, Dmitry Olshansky via Digitalmars-d
wrote:
 On 31-May-2016 01:00, Walter Bright wrote:
 On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
 I don't agree on changing those. Indexing and slicing a char[] is
 really useful and actually not hard to do correctly (at least with
 regard to handling code units).
Yup. It isn't hard at all to use arrays of codeunits correctly.
Ehm as long as all you care for is operating on substrings I'd say. Working with individual characters requires either decoding or clever tricks like operating on encoded UTF directly.
[...] Working on individual characters needs byGrapheme, unless you know beforehand that the character(s) you're working with are ASCII, or fit in a single code unit.

About "clever tricks", it's not really that hard. I was thinking that things like s.canFind('Ш') should translate the 'Ш' into a UTF-8 byte sequence, and then do a substring search directly on the encoded string. This way, a large number of single-character algorithms don't even need to decode. The way UTF-8 is designed guarantees that there will not be any false positives. This will eliminate a lot of the current overhead of autodecoding.

T

--
Klein bottle for rent ... inquire within. -- Stephen Mulraney
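A minimal sketch of that idea (the helper name canFindNoDecode is made up, not a Phobos function): encode the needle once, then search the haystack's raw bytes.

import std.algorithm.searching : canFind;
import std.string : representation;
import std.utf : encode;

// Made-up helper: look for a single dchar without decoding the haystack.
bool canFindNoDecode(string haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle);   // needle as UTF-8 code units
    return haystack.representation.canFind(buf[0 .. len].representation);
}

void main()
{
    assert(canFindNoDecode("Привет, Шура!", 'Ш'));
    assert(!canFindNoDecode("hello", 'Ш'));
}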
May 31 2016
prev sibling next sibling parent reply Chris <wendlec tcd.ie> writes:
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu 
wrote:
[snip]
 I would agree only with the amendment "...if used naively", 
 which is important. Knowledge of how autodecoding works is a 
 prerequisite for writing fast string code in D. Also, little 
 code should deal with one code unit or code point at a time; 
 instead, it should use standard library algorithms for 
 searching, matching etc. When needed, iterating every code unit 
 is trivially done through indexing.
I disagree. "if used naively" shouldn't be the default. A user (naively) expects string algorithms to work as efficiently as possible without overheads. To tell the user later that s/he shouldn't _naively_ have used a certain algorithm provided by the library is a bit cynical. Having to redesign a code base because of hidden behavior is a big turn off, having to go through Phobos to determine where the hidden pitfalls are is not the user's job.
 Also allow me to point that much of the slowdown can be 
 addressed tactically. The test c < 0x80 is highly predictable 
 (in ASCII-heavy text) and therefore easily speculated. We can 
 and we should arrange code to minimize impact.
And what if you deal with non-ASCII heavy text? Does the user have to guess and micro-optimize for simple use cases?
 5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug):

s.find("abc")
s.findSplit("abc")
s.findSplit('a')
s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However the following do require autodecoding:

s.walkLength
s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
s.count!(c => c >= 32) // non-control characters

Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.
But how is the user supposed to know without being a core contributor to Phobos? If using a library method that works well in one case can slow down your code in a slightly different case, something is wrong with the language/library design. For simple cases the burden shouldn't be on the user, or, if it is, s/he should be informed about it in order to be able to make well-informed decisions. Personally I wouldn't mind having to decide in each case what I want (provided I have a best practices cheat sheet :)), so I can get the best out of it. But to keep guessing, testing and benchmarking each string handling library function is not good at all. [snip]
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 7:19 AM, Chris wrote:
 On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
 [snip]
 I would agree only with the amendment "...if used naively", which is
 important. Knowledge of how autodecoding works is a prerequisite for
 writing fast string code in D. Also, little code should deal with one
 code unit or code point at a time; instead, it should use standard
 library algorithms for searching, matching etc. When needed, iterating
 every code unit is trivially done through indexing.
I disagree.
Misunderstanding.
 "if used naively" shouldn't be the default. A user (naively)
 expects string algorithms to work as efficiently as possible without
 overheads.
That's what happens with autodecoding.
 Also allow me to point that much of the slowdown can be addressed
 tactically. The test c < 0x80 is highly predictable (in ASCII-heavy
 text) and therefore easily speculated. We can and we should arrange
 code to minimize impact.
And what if you deal with non-ASCII heavy text? Does the user have to guess and micro-optimize for simple use cases?
Misunderstanding.
 5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug):

s.find("abc")
s.findSplit("abc")
s.findSplit('a')
s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However the following do require autodecoding:

s.walkLength
s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
s.count!(c => c >= 32) // non-control characters

Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.
But how is the user supposed to know without being a core contributor to Phobos?
Misunderstanding. All examples work properly today because of autodecoding. -- Andrei
May 27 2016
parent reply ag0aep6g <anonymous example.com> writes:
On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:
 However the following do require autodecoding:

 s.walkLength
 s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
 s.count!(c => c >= 32) // non-control characters

 Currently the standard library operates at code point level even
 though inside it may choose to use code units when admissible. Leaving
 such a decision to the library seems like a wise thing to do.
But how is the user supposed to know without being a core contributor to Phobos?
Misunderstanding. All examples work properly today because of autodecoding. -- Andrei
They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2
May 27 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Friday, 27 May 2016 at 13:47:32 UTC, ag0aep6g wrote:
 Misunderstanding. All examples work properly today because of
 autodecoding. -- Andrei
They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2
I agree. It has happened to me that characters like "é" return length == 2, which has been the cause of some bugs in my code. I'm wiser now, of course, but you wouldn't expect this if you write

if (input.length == 1)
  speakCharacter(input);  // e.g. when spelling a word
else
  processInput(input);

The worst thing is that you never know what's going on under the hood and where autodecode slows you down, unbeknownst to yourself.
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 10:15 AM, Chris wrote:
 It has happened to me that characters like "é" return length == 2
Would normalization make length 1? -- Andrei
May 27 2016
next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Friday, 27 May 2016 at 18:11:22 UTC, Andrei Alexandrescu wrote:
 Would normalization make length 1? -- Andrei
In some, but not all cases.
May 27 2016
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 27-May-2016 21:11, Andrei Alexandrescu wrote:
 On 5/27/16 10:15 AM, Chris wrote:
 It has happened to me that characters like "é" return length == 2
Would normalization make length 1? -- Andrei
No, this is not the point of normalization. -- Dmitry Olshansky
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
 On 27-May-2016 21:11, Andrei Alexandrescu wrote:
 On 5/27/16 10:15 AM, Chris wrote:
 It has happened to me that characters like "é" return length == 2
Would normalization make length 1? -- Andrei
No, this is not the point of normalization.
What is? -- Andrei
May 27 2016
next sibling parent Minas Mina <minas_0 hotmail.co.uk> writes:
On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:
 On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
 On 27-May-2016 21:11, Andrei Alexandrescu wrote:
 On 5/27/16 10:15 AM, Chris wrote:
 It has happened to me that characters like "é" return length 
 == 2
Would normalization make length 1? -- Andrei
No, this is not the point of normalization.
What is? -- Andrei
This video will be helpful :) https://www.youtube.com/watch?v=n0GK-9f4dl8 It talks about Unicode in C++, but also explains how Unicode works.
May 27 2016
prev sibling next sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:
 On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
 No, this is not the point of normalization.
What is? -- Andrei
1) A grapheme may include several combining characters (such as diacritics) whose order is not supposed to be semantically significant. Normalization sorts them in a standardized way so that string comparisons return the expected result for graphemes which differ only by the internal order of their constituent combining code points.

2) Some graphemes (like accented latin letters) can be represented by a single code point OR a letter followed by a combining diacritic. Normalization either splits them all apart (NFD), or combines them whenever possible (NFC). Again, this is primarily intended to make things like string comparisons work as expected, and perhaps to simplify low-level tasks like graphical rendering of text.

(Disclaimer: This is an oversimplification, because nothing about Unicode is ever simple.)
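A small sketch of both points, using std.uni.normalize on dstring literals (as elsewhere in this thread):

import std.uni : normalize, NFC, NFD;

void main()
{
    dstring composed   = "\u00E9";    // é as a single precomposed code point
    dstring decomposed = "e\u0301";   // e followed by a combining acute accent

    assert(composed != decomposed);                              // raw code points differ
    assert(composed.normalize!NFD == decomposed.normalize!NFD);  // equal once normalized
    assert(composed.normalize!NFC == decomposed.normalize!NFC);
    assert(composed.normalize!NFD.length == 2);                  // NFD splits them apart
    assert(decomposed.normalize!NFC.length == 1);                // NFC combines them
}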
May 27 2016
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 28-May-2016 01:04, tsbockman wrote:
 On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:
 On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
 No, this is not the point of normalization.
What is? -- Andrei
1) A grapheme may include several combining characters (such as diacritics) whose order is not supposed to be semantically significant. Normalization sorts them in a standardized way so that string comparisons return the expected result for graphemes which differ only by the internal order of their constituent combining code points.

2) Some graphemes (like accented latin letters) can be represented by a single code point OR a letter followed by a combining diacritic. Normalization either splits them all apart (NFD), or combines them whenever possible (NFC). Again, this is primarily intended to make things like string comparisons work as expected, and perhaps to simplify low-level tasks like graphical rendering of text.
Quite accurate statement of the goals. Normalization is all about having canonical order of combining code points.
 (Disclaimer: This is an oversimplification, because nothing about
 Unicode is ever simple.)
-- Dmitry Olshansky
May 28 2016
prev sibling parent reply Minas Mina <minas_0 hotmail.co.uk> writes:
On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:
 On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
 On 27-May-2016 21:11, Andrei Alexandrescu wrote:
 On 5/27/16 10:15 AM, Chris wrote:
 It has happened to me that characters like "é" return length 
 == 2
Would normalization make length 1? -- Andrei
No, this is not the point of normalization.
What is? -- Andrei
Here is an example about normalization.

In Unicode, the grapheme Ä is composed of two code points: A (the ASCII A) and the ¨ character. However, one of the goals of Unicode was to be backwards compatible with earlier encodings that extended ASCII (codepages). In some codepages, Ä was an actual code point. So in some cases you would have the Unicode one, which is two code points, and the one from some codepages, which would be one. Those should be the same though, i.e. compare the same. In order to do that, there is normalization. What it does is to _expand_ the single code point Ä into A + ¨
May 27 2016
parent reply David Nadlinger <code klickverbot.at> writes:
On Friday, 27 May 2016 at 22:12:57 UTC, Minas Mina wrote:
 Those should be the same though, i.e compare the same. In order 
 to do that, there is normalization. What is does is to _expand_ 
 the single codepoint Ä into A + ¨
Unless I'm mistaken, this depends on the form used. For example, in NFKC you'd get the single codepoint Ä. — David
May 27 2016
parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, May 27, 2016 23:16:58 David Nadlinger via Digitalmars-d wrote:
 On Friday, 27 May 2016 at 22:12:57 UTC, Minas Mina wrote:
 Those should be the same though, i.e compare the same. In order
 to do that, there is normalization. What is does is to _expand_
the single codepoint Ä into A + ¨
Unless I'm mistaken, this depends on the form used. For example, in NFKC you'd get the single codepoint Ä.
Yeah. For better or worse, there are different normalization schemes for Unicode. A normalization scheme makes the encodings consistent, but that doesn't mean that each of the different normalization schemes does the same thing, just that if you apply the same normalization scheme to two strings, then all graphemes within those strings will be encoded identically. - Jonathan M Davis
May 31 2016
prev sibling parent Chris <wendlec tcd.ie> writes:
On Friday, 27 May 2016 at 18:11:22 UTC, Andrei Alexandrescu wrote:
 On 5/27/16 10:15 AM, Chris wrote:
 It has happened to me that characters like "é" return length 
 == 2
Would normalization make length 1? -- Andrei
No, I've tried it. I think dchar[] returns one, or you check by grapheme.
May 28 2016
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, May 27, 2016 at 03:47:32PM +0200, ag0aep6g via Digitalmars-d wrote:
 On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:
 However the following do require autodecoding:
 
 s.walkLength
 s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
 s.count!(c => c >= 32) // non-control characters
 
 Currently the standard library operates at code point level even
 though inside it may choose to use code units when admissible.
 Leaving such a decision to the library seems like a wise thing
 to do.
But how is the user supposed to know without being a core contributor to Phobos?
Misunderstanding. All examples work properly today because of autodecoding. -- Andrei
They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2
Exactly. And we just keep getting stuck on this point. It seems that the message just isn't getting through. The unfounded assumption continues to be made that iterating by code point is somehow "correct" by definition and nobody can challenge it.

String handling, especially in the standard library, ought to be (1) efficient where possible, and (2) be as correct as possible (meaning, most corresponding to user expectations -- principle of least surprise). If we can't have both, we should at least have one, right? However, the way autodecoding is currently implemented, we have neither.

Firstly, it is beyond clear that autodecoding adds a significant amount of overhead, and because it's automatic, it applies to ALL string processing in D. The only way around it is to fight against the standard library and use workarounds to bypass all that meticulously-crafted autodecoding code, begging the question of why we're even spending the effort on said code in the first place.

Secondly, it violates the principle of least surprise when the user, given a string of, say, Korean text, discovers that s.count() *doesn't* return the correct answer. Oh, it's "correct", all right, if your definition of correct is "number of Unicode code points", but to a Korean user, such an answer is completely meaningless because it has little correspondence with what he would perceive as the number of "characters" in the string. It might as well be a random number and it would be just as meaningful. It is just as wrong as s.count() returning the number of code units, except that in the current Euro-centric D community the wrong instances are less often encountered and so are often overlooked. But that doesn't change the fact that code that assumes s.count() returns anything remotely meaningful to the user is buggy. Autodecoding into code points only serves to hide the bugs.

As has been said before already countless times, autodecoding, as currently implemented, is neither "correct" nor efficient. Iterating by code point is much faster, but more prone to user mistakes; whereas iterating by grapheme more often corresponds with user expectations but performs quite poorly. The current implementation of autodecoding represents the worst of both worlds: it is both inefficient *and* prone to user mistakes, and worse yet, it serves to conceal such user mistakes by giving the false sense of security that because we're iterating by code points we're somehow magically "correct" by definition.

The fact of the matter is that if you're going to write Unicode string processing code, you're gonna hafta know the dirty nitty gritty of Unicode strings, including the fine distinctions between code units, code points, grapheme clusters, etc.. Since this is required knowledge anyway, why not just let the user worry about how to iterate over the string? Let the user choose what best suits his application, whether it's working directly with code units for speed, or iterating over grapheme clusters for correctness (in terms of visual "characters"), instead of choosing the pessimal middle ground that's neither efficient nor correct?

T

--
Do not reason with the unreasonable; you lose by definition.
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 12:40 PM, H. S. Teoh via Digitalmars-d wrote:
 Exactly. And we just keep getting stuck on this point. It seems that the
 message just isn't getting through. The unfounded assumption continues
 to be made that iterating by code point is somehow "correct" by
 definition and nobody can challenge it.
Which languages are covered by code points, and which languages require graphemes consisting of multiple code points? How does normalization play into this? -- Andrei
May 27 2016
next sibling parent reply ag0aep6g <anonymous example.com> writes:
On 05/27/2016 08:42 PM, Andrei Alexandrescu wrote:
 Which languages are covered by code points, and which languages require
 graphemes consisting of multiple code points? How does normalization
 play into this? -- Andrei
I don't think there is value in distinguishing by language. The point of Unicode is that you shouldn't need to do that. I think there are scripts that use combining characters extensively, but Unicode also has stuff like combining arrows. Those can make sense in an otherwise plain English text. For example: 'a' + U+20D7 = a⃗. There is no combined character for that, so normalization can't do anything here.
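In code, a small sketch of that (using a dstring so lengths are in code points):

import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    // 'a' + U+20D7 (combining right arrow above): no precomposed code point exists.
    dstring s = "a\u20D7";

    assert(s.normalize!NFC.length == 2);   // still two code points after NFC
    assert(s.byGrapheme.walkLength == 1);  // but a single grapheme
}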
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 3:10 PM, ag0aep6g wrote:
 I don't think there is value in distinguishing by language. The point of
 Unicode is that you shouldn't need to do that.
It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
May 27 2016
next sibling parent ag0aep6g <anonymous example.com> writes:
On 05/27/2016 09:30 PM, Andrei Alexandrescu wrote:
 It seems code points are kind of useless because they don't really mean
 anything, would that be accurate? -- Andrei
I think so, yeah. Due to combining characters, code points are similar to code units: a Unicode thing that you need to know about of when working below the human-perceived character (grapheme) level.
May 27 2016
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, May 27, 2016 at 03:30:53PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 5/27/16 3:10 PM, ag0aep6g wrote:
 I don't think there is value in distinguishing by language. The
 point of Unicode is that you shouldn't need to do that.
It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
That's what we've been trying to say all along! :-P They're a kind of low-level Unicode construct used for building "real" characters, i.e., what a layperson would consider to be a "character". T -- English is useful because it is a mess. Since English is a mess, it maps well onto the problem space, which is also a mess, which we call reality. Similarly, Perl was designed to be a mess, though in the nicest of all possible ways. -- Larry Wall
May 27 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/27/2016 03:43 PM, H. S. Teoh via Digitalmars-d wrote:
 That's what we've been trying to say all along!
If that's the case things are pretty dire, autodecoding or not. -- Andrei
May 27 2016
next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, May 27, 2016 at 04:41:09PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 05/27/2016 03:43 PM, H. S. Teoh via Digitalmars-d wrote:
 That's what we've been trying to say all along!
If that's the case things are pretty dire, autodecoding or not. -- Andrei
Like it or not, Unicode ain't merely some glorified form of C's ASCII char arrays. It's about time we faced the reality and dealt with it accordingly. Trying to sweep the complexities of Unicode under the rug is not doing us any good. T -- The fact that anyone still uses AOL shows that even the presence of options doesn't stop some people from picking the pessimal one. - Mike Ellis
May 27 2016
prev sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, May 27, 2016 16:41:09 Andrei Alexandrescu via Digitalmars-d wrote:
 On 05/27/2016 03:43 PM, H. S. Teoh via Digitalmars-d wrote:
 That's what we've been trying to say all along!
If that's the case things are pretty dire, autodecoding or not. -- Andrei
True enough. Correctly handling Unicode in the general case is ridiculously hard - especially if you want to be efficient. We could do everything at the grapheme level to get the correctness, but we'd be so slow that it would be ridiculous. Fortunately, many string algorithms really don't need to care much about Unicode so long as the strings involved are normalized. For instance, a function like find can usually compare code units without decoding anything (though even then, depending on the normalization, you run the risk of finding a part of a character if it involves combining code points - e.g. searching for e could give you the first part of é if it's encoded with the e followed by the accent).

But ultimately, fully correct string handling requires having a far better understanding of Unicode than most programmers have. Even the percentage of programmers here that have that level of understanding isn't all that great - though the fact that D supports UTF-8, UTF-16, and UTF-32 the way that it does has led a number of us to dig further into Unicode and learn it better in ways that we probably wouldn't have if all it had was char. It highlights that there is something that needs to be learned to get this right in a way that most languages don't.

- Jonathan M Davis
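A small sketch of that pitfall, assuming a decomposed é in the input:

import std.algorithm.searching : canFind;

void main()
{
    // "café" written with a decomposed é (e + combining acute accent).
    string s = "cafe\u0301";

    // A code-point-level search happily "finds" an e, even though the
    // visible text ends in é, not e.
    assert(s.canFind('e'));
}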
May 31 2016
prev sibling parent reply Tobias M <troplin bluewin.ch> writes:
On Friday, 27 May 2016 at 19:43:16 UTC, H. S. Teoh wrote:
 On Fri, May 27, 2016 at 03:30:53PM -0400, Andrei Alexandrescu 
 via Digitalmars-d wrote:
 On 5/27/16 3:10 PM, ag0aep6g wrote:
 I don't think there is value in distinguishing by language. 
 The point of Unicode is that you shouldn't need to do that.
It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
That's what we've been trying to say all along! :-P They're a kind of low-level Unicode construct used for building "real" characters, i.e., what a layperson would consider to be a "character".
Code points are *the fundamental unit* of Unicode. AFAIK most (all?) algorithms in the Unicode spec are defined in terms of code points. Sure, some algorithms also work on the code unit level. That can be used as an optimization, but they are still defined on code points. Code points are also abstracting over the different representations (UTF-...), providing a uniform "interface".
May 29 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/29/2016 09:42 AM, Tobias M wrote:
 On Friday, 27 May 2016 at 19:43:16 UTC, H. S. Teoh wrote:
 On Fri, May 27, 2016 at 03:30:53PM -0400, Andrei Alexandrescu via
 Digitalmars-d wrote:
 On 5/27/16 3:10 PM, ag0aep6g wrote:
 I don't think there is value in distinguishing by language. > The
point of Unicode is that you shouldn't need to do that. It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
That's what we've been trying to say all along! :-P They're a kind of low-level Unicode construct used for building "real" characters, i.e., what a layperson would consider to be a "character".
Code points are *the fundamental unit* of unicode. AFAIK most (all?) algorithms in the unicode spec are defined in terms of code points. Sure, some algorithms also work on the code unit level. That can be used as an optimization, but they are still defined on code points. Code points are also abstracting over the different representations (UTF-...), providing a uniform "interface".
So now code points are good? -- Andrei
May 29 2016
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sun, May 29, 2016 at 03:55:22PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 05/29/2016 09:42 AM, Tobias M wrote:
 On Friday, 27 May 2016 at 19:43:16 UTC, H. S. Teoh wrote:
 On Fri, May 27, 2016 at 03:30:53PM -0400, Andrei Alexandrescu via
 Digitalmars-d wrote:
 On 5/27/16 3:10 PM, ag0aep6g wrote:
 I don't think there is value in distinguishing by language.
 The point of Unicode is that you shouldn't need to do that.
It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
That's what we've been trying to say all along! :-P They're a kind of low-level Unicode construct used for building "real" characters, i.e., what a layperson would consider to be a "character".
Code points are *the fundamental unit* of unicode. AFAIK most (all?) algorithms in the unicode spec are defined in terms of code points. Sure, some algorithms also work on the code unit level. That can be used as an optimization, but they are still defined on code points. Code points are also abstracting over the different representations (UTF-...), providing a uniform "interface".
So now code points are good? -- Andrei
It depends on what you're trying to accomplish. That's the point we're trying to get at. For some operations, working with code points makes the most sense. But for other operations, it does not. There is no one representation that is best for all situations; it needs to be decided on a case-by-case basis. Which is why forcing everything to decode to code points eventually leads to problems. T -- Customer support: the art of getting your clients to pay for your own incompetence.
May 29 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/29/2016 04:47 PM, H. S. Teoh via Digitalmars-d wrote:
 It depends on what you're trying to accomplish. That's the point we're
 trying to get at.  For some operations, working with code points makes
 the most sense. But for other operations, it does not.  There is no one
 representation that is best for all situations; it needs to be decided
 on a case-by-case basis.  Which is why forcing everything to decode to
 code points eventually leads to problems.
I see. Again this all to me sounds like "naked arrays of characters are the wrong choice and should have been encapsulated in a dedicated string type". -- Andrei
May 30 2016
prev sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Sunday, May 29, 2016 13:47:32 H. S. Teoh via Digitalmars-d wrote:
 On Sun, May 29, 2016 at 03:55:22PM -0400, Andrei Alexandrescu via 
Digitalmars-d wrote:
 So now code points are good? -- Andrei
It depends on what you're trying to accomplish. That's the point we're trying to get at. For some operations, working with code points makes the most sense. But for other operations, it does not. There is no one representation that is best for all situations; it needs to be decided on a case-by-case basis. Which is why forcing everything to decode to code points eventually leads to problems.
Exactly. And even a given function can't necessarily always be defined to use a specific level of Unicode, because whether that's correct or not depends on what the programmer is actually trying to do with the function. And then there are cases where the programmer knows enough about the data that they're dealing with that they're able to operate at a different level of Unicode than would normally be correct. The most obvious example of that is when you know that your strings are pure ASCII, but it's not the only case. We should strive to make Phobos operate correctly on strings by default where we can, but there are cases where the programmer needs to know enough to specify the behavior that they want, and deciding for them is just going to lead to behavior that happens to be right some of the time while making it hard for code using Phobos to have the correct behavior the rest of the time. And the default behavior that we currently have is inefficient to boot. - Jonathan M Davis
May 31 2016
prev sibling next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Friday, 27 May 2016 at 19:30:53 UTC, Andrei Alexandrescu wrote:
 It seems code points are kind of useless because they don't 
 really mean anything, would that be accurate? -- Andrei
It might help to think of code points as being a kind of byte code for a text-representing VM. It's not meaningless, but it also isn't trivial, and relevant metrics can only be seen in application.

BTW you don't even have to get into Unicode to hit complications. Tab, backspace, carriage return: these are part of ASCII but already complicate questions. http://stackoverflow.com/questions/6792812/the-backspace-escape-character-b-in-c-unexpected-behavior came up on a quick search. Does the backspace character reduce the length of a string? In some contexts, maybe.
May 27 2016
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, May 27, 2016 at 07:53:30PM +0000, Adam D. Ruppe via Digitalmars-d wrote:
 On Friday, 27 May 2016 at 19:30:53 UTC, Andrei Alexandrescu wrote:
 It seems code points are kind of useless because they don't really
 mean anything, would that be accurate? -- Andrei
It might help to think of code points as being a kind of byte code for a text-representing VM. It's not meaningless, but it also isn't trivial and relevant metrics can only be seen in application. BTW you don't even have to get into unicode to hit complications. Tab, backspace, carriage return, these are part of ASCII but already complicate questions. http://stackoverflow.com/questions/6792812/the-backspace-escape-character-b-in-c-unexpected-behavior came up on a quick search. Does the backspace character reduce the length of a string? In some contexts, maybe.
Fun fact: on some old Unix boxen, Backspace + underscore was interpreted to mean "underline the previous character". Probably inherited from the old typewriter days. Scarily enough, some Posix terminals may still interpret this sequence this way! An early precursor of Unicode combining diacritics, perhaps? :-D T -- Everybody talks about it, but nobody does anything about it! -- Mark Twain
May 27 2016
prev sibling parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/27/16 3:30 PM, Andrei Alexandrescu wrote:
 On 5/27/16 3:10 PM, ag0aep6g wrote:
 I don't think there is value in distinguishing by language. The point of
 Unicode is that you shouldn't need to do that.
It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
The only unmistakably correct use I can think of is transcoding from one UTF representation to another. That is, in order to transcode from UTF8 to UTF16, I don't need to know anything about character composition. -Steve
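For example, a minimal sketch of that using std.utf's toUTF16/toUTF8:

import std.utf : toUTF16, toUTF8;

void main()
{
    string  s8  = "κόσμε";      // UTF-8
    wstring s16 = s8.toUTF16;   // pure code-unit transcoding, no notion of graphemes
    assert(s16.toUTF8 == s8);   // round-trips losslessly
}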
May 27 2016
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, May 27, 2016 at 02:42:27PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 5/27/16 12:40 PM, H. S. Teoh via Digitalmars-d wrote:
 Exactly. And we just keep getting stuck on this point. It seems that
 the message just isn't getting through. The unfounded assumption
 continues to be made that iterating by code point is somehow
 "correct" by definition and nobody can challenge it.
Which languages are covered by code points, and which languages require graphemes consisting of multiple code points? How does normalization play into this? -- Andrei
This is a complicated issue; for a full explanation you'll probably want to peruse the Unicode codices. For example:

http://www.unicode.org/faq/char_combmark.html

But in brief, it's mostly the case that a number of common European languages have a 1-to-1 code point to character mapping, as well as Chinese writing. Outside of this narrow set, you're on shaky ground. Examples (that I can think of, there are many others):

- Almost all Korean characters are composed of multiple code points.
- The Indic languages (which cover quite a good number of Unicode code pages) have ligatures that require multiple code points.
- The Thai block contains a series of combining diacritics for vowels and tones.
- Hebrew vowel points require multiple code points;
- A good number of native American scripts require combining marks, e.g., Navajo.
- International Phonetic Alphabet (primarily only for linguistic uses, but could be widespread because it's relevant everywhere language is spoken).
- Classical Greek accents (though this is less common, mostly being used only in academic circles).

Even within the realm of European languages and languages that use some version of the Latin script, there is an entire block of code points in Unicode (the U+0300 block) dedicated to combining diacritics. A good number of combinations do not have precomposed characters.

Now as far as normalization is concerned, it only helps if a particular combination of diacritics on a base glyph have a precomposed form. A large number of the above languages do not have precomposed characters simply because of the sheer number of combinations. The only reason the CJK block actually includes a huge number of precomposed characters was because the rules for combining the base forms are too complex to encode compositionally. Otherwise, most languages with combining diacritics would not have precomposed characters assigned to their respective blocks. In fact, a good number (all?) of precomposed Latin characters were included in Unicode only because they existed in pre-Unicode days and some form of compatibility was desired back when Unicode was still not yet widely adopted.

So basically, besides a small number of languages, the idea of 1 code point == 1 character is pretty unworkable. Especially in this day and age of worldwide connectivity.

T

--
The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!
May 27 2016
prev sibling next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Fri, 27 May 2016 15:47:32 +0200
schrieb ag0aep6g <anonymous example.com>:

 On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:
 However the following do require autodecoding:

 s.walkLength
 s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
 s.count!(c => c >= 32) // non-control characters

 Currently the standard library operates at code point level even
 though inside it may choose to use code units when admissible. Leaving
 such a decision to the library seems like a wise thing to do.  
But how is the user supposed to know without being a core contributor to Phobos?
Misunderstanding. All examples work properly today because of autodecoding. -- Andrei
They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2
1: Auto-decoding shall ALWAYS do the proper thing
2: Therefor humans shall read text in units of code points
3: OS X is an anomaly and must be purged from this planet
4: Indonesians shall be converted to a sane alphabet
5: He who useth combining diacritics shall burn in hell
6: We shall live in peace and harmony forevermore

Let's give this a rest.

--
Marco
May 30 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
 4: Indonesians* shall be converted to a sane alphabet
*Correction: Koreans (2-4 Hangul syllables (code points) form each letter) -- Marco
May 30 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, May 27, 2016 09:40:21 H. S. Teoh via Digitalmars-d wrote:
 On Fri, May 27, 2016 at 03:47:32PM +0200, ag0aep6g via Digitalmars-d wrote:
 On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:
 However the following do require autodecoding:

 s.walkLength
 s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
 s.count!(c => c >= 32) // non-control characters

 Currently the standard library operates at code point level even
 though inside it may choose to use code units when admissible.
 Leaving such a decision to the library seems like a wise thing
 to do.
But how is the user supposed to know without being a core contributor to Phobos?
Misunderstanding. All examples work properly today because of autodecoding. -- Andrei
They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2
Exactly. And we just keep getting stuck on this point. It seems that the message just isn't getting through. The unfounded assumption continues to be made that iterating by code point is somehow "correct" by definition and nobody can challenge it. String handling, especially in the standard library, ought to be (1) efficient where possible, and (2) be as correct as possible (meaning, most corresponding to user expectations -- principle of least surprise). If we can't have both, we should at least have one, right? However, the way autodecoding is currently implemented, we have neither.
Exactly. Saying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct. More full characters fit in a single code unit, but they still don't all fit. You have to go to the grapheme level for that.

IIRC, Andrei talked in TDPL about how UTF-8 was better than UTF-16, because you figured out when you screwed up Unicode handling more quickly, because very few Unicode characters fit in a single UTF-8 code unit, whereas many more fit in a single UTF-16 code unit, making it harder to catch errors with UTF-16. Well, we're making the same mistake but with UTF-32 instead of UTF-16. The code is still wrong, but it's that much harder to catch that it's wrong.
 Firstly, it is beyond clear that autodecoding adds a significant amount
 of overhead, and because it's automatic, it applies to ALL string
 processing in D.  The only way around it is to fight against the
 standard library and use workarounds to bypass all that
 meticulously-crafted autodecoding code, begging the question of why
 we're even spending the effort on said code in the first place.
The standard library has to fight against itself because of autodecoding! The vast majority of the algorithms in Phobos are special-cased on strings in an attempt to get around autodecoding. That alone should highlight the fact that autodecoding is problematic.
 The fact of the matter is that if you're going to write Unicode string
 processing code, you're gonna hafta to know the dirty nitty gritty of
 Unicode strings, including the fine distinctions between code units,
 code points, grapheme clusters, etc.. Since this is required knowledge
 anyway, why not just let the user worry about how to iterate over the
 string? Let the user choose what best suits his application, whether
 it's working directly with code units for speed, or iterating over
 grapheme clusters for correctness (in terms of visual "characters"),
 instead of choosing the pessimal middle ground that's neither efficient
 nor correct?
There is no solution here that's going to be both correct and efficient. Ideally, we either need to provide a fully correct solution that's dog slow, or we need to provide a solution that's efficient but requires that the programmer understand Unicode to write correct code. Right now, we have a slow solution that's incorrect. - Jonathan M Davis
May 31 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
 Saying that operating at the code point level - UTF-32 - is correct
 is like saying that operating at UTF-16 instead of UTF-8 is correct.
Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei
May 31 2016
next sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
 On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
 Saying that operating at the code point level - UTF-32 - is correct
 is like saying that operating at UTF-16 instead of UTF-8 is correct.
Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei
Okay. If you have the letter A, it will fit in one UTF-8 code unit, one UTF-16 code unit, and one UTF-32 code unit (so, one code point). assert("A"c.length == 1); assert("A"w.length == 1); assert("A"d.length == 1); If you have 月, then you get assert("月"c.length == 3); assert("月"w.length == 1); assert("月"d.length == 1); whereas if you have 𐀆, then you get assert("𐀆"c.length == 4); assert("𐀆"w.length == 2); assert("𐀆"d.length == 1); So, with these characters, it's clear that UTF-8 and UTF-16 don't cut it for holding an entire character, but it still looks like UTF-32 does. However, what about characters like é or שׂ? Notice that שׂ takes up more than one code point. assert("שׂ"c.length == 4); assert("שׂ"w.length == 2); assert("שׂ"d.length == 2); It's ש with some sort of dot marker on it that they have in Hebrew, but it's a single character in spite of the fact that it's multiple code points. é is in a similar, though more complicated boat. With D, you'll get assert("é"c.length == 2); assert("é"w.length == 1); assert("é"d.length == 1); because the compiler decides to use the version of é that's a single code point. However, Unicode is set up so that that accent can be its own code point and be applied to any other code point - be it an e, an a, or even something like the number 0. If we normalize é, we can see other versions of it that take up more than one code point. e.g. assert("é"d.normalize!NFC.length == 1); assert("é"d.normalize!NFD.length == 2); assert("é"d.normalize!NFKC.length == 1); assert("é"d.normalize!NFKD.length == 2); And you can even put that accent on 0 by doing something like assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d); One or more code units combine to make a single code point, but one or more code points also combine to make a grapheme. So, while there is a definite layer of separation between code units and code points, it's still the case that a single code point is not guaranteed to be a single character. You do indeed have encodings with code units and not code points (though those still have different normalizations, which is kind of like having different encodings), but in terms of correctness, you have the same problem with treating code points as characters that you have as treating code units as characters. You're still not guaranteed that you're operating on full characters and risk chopping them up. It's just that at the code point level, you're generally chopping something up that is visually separable (like an accent from a letter or a superscript on a symbol), whereas with code units, you end up with utter garbage when you chop them incorrectly. By operating at the code point level, we're correct for _way_ more characters than we would be than if we treated char like a full character, but we're still not fully correct, and it's a lot harder to notice when you screw it up, because the number of characters which are handled incorrectly is far smaller. - Jonathan M Davis
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
 On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
 On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
 Saying that operating at the code point level - UTF-32 - is correct
 is like saying that operating at UTF-16 instead of UTF-8 is correct.
Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei
Okay. If you have the letter A, it will fit in one UTF-8 code unit, one UTF-16 code unit, and one UTF-32 code unit (so, one code point).

assert("A"c.length == 1);
assert("A"w.length == 1);
assert("A"d.length == 1);

If you have 月, then you get

assert("月"c.length == 3);
assert("月"w.length == 1);
assert("月"d.length == 1);

whereas if you have 𐀆, then you get

assert("𐀆"c.length == 4);
assert("𐀆"w.length == 2);
assert("𐀆"d.length == 1);

So, with these characters, it's clear that UTF-8 and UTF-16 don't cut it for holding an entire character, but it still looks like UTF-32 does.
Does walkLength yield the same number for all representations?
 However,
 what about characters like é or שׂ? Notice that שׂ takes up more than one
code
 point.

 assert("שׂ"c.length == 4);
 assert("שׂ"w.length == 2);
 assert("שׂ"d.length == 2);

 It's ש with some sort of dot marker on it that they have in Hebrew, but it's
 a single character in spite of the fact that it's multiple code points. é is
 in a similar, though more complicated boat. With D, you'll get

 assert("é"c.length == 2);
 assert("é"w.length == 1);
 assert("é"d.length == 1);

 because the compiler decides to use the version of é that's a single code
 point.
Does walkLength yield the same number for all representations?
 However, Unicode is set up so that that accent can be its own code
 point and be applied to any other code point - be it an e, an a, or even
 something like the number 0. If we normalize é, we can see other
 versions of it that take up more than one code point. e.g.

 assert("é"d.normalize!NFC.length == 1);
 assert("é"d.normalize!NFD.length == 2);
 assert("é"d.normalize!NFKC.length == 1);
 assert("é"d.normalize!NFKD.length == 2);
Does walkLength yield the same number for all representations?
 And you can even put that accent on 0 by doing something like

 assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d);

 One or more code units combine to make a single code point, but one or more
 code points also combine to make a grapheme.
That's right. D's handling of UTF is at the code unit level (like all of Unicode is portably defined). If you want graphemes use byGrapheme. It seems you destroyed your own argument, which was:
 Saying that operating at the code point level - UTF-32 - is correct
 is like saying that operating at UTF-16 instead of UTF-8 is correct.
You can't claim code units are just a special case of code points. Andrei
May 31 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 31.05.2016 20:30, Andrei Alexandrescu wrote:
 D's
Phobos'
 handling of UTF is at the code unit
code point
 level (like all of Unicode is portably defined).
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 02:46 PM, Timon Gehr wrote:
 On 31.05.2016 20:30, Andrei Alexandrescu wrote:
 D's
Phobos'
foreach, too. -- Andrei
May 31 2016
parent reply ZombineDev <petar.p.kirov gmail.com> writes:
On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu 
wrote:
 On 05/31/2016 02:46 PM, Timon Gehr wrote:
 On 31.05.2016 20:30, Andrei Alexandrescu wrote:
 D's
Phobos'
foreach, too. -- Andrei
Incorrect. https://dpaste.dzfl.pl/ba7a65d59534
Jun 01 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/01/2016 01:35 PM, ZombineDev wrote:
 On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu wrote:
 On 05/31/2016 02:46 PM, Timon Gehr wrote:
 On 31.05.2016 20:30, Andrei Alexandrescu wrote:
 D's
Phobos'
foreach, too. -- Andrei
Incorrect. https://dpaste.dzfl.pl/ba7a65d59534
Try typing the iteration variable with "dchar". -- Andrei
Jun 01 2016
next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Wednesday, 1 June 2016 at 17:57:15 UTC, Andrei Alexandrescu 
wrote:
 Try typing the iteration variable with "dchar". -- Andrei
Or you can type it as wchar... But important to note: that's opt in, not automatic.
Jun 01 2016
prev sibling parent reply ZombineDev <petar.p.kirov gmail.com> writes:
On Wednesday, 1 June 2016 at 17:57:15 UTC, Andrei Alexandrescu 
wrote:
 On 06/01/2016 01:35 PM, ZombineDev wrote:
 On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu 
 wrote:
 On 05/31/2016 02:46 PM, Timon Gehr wrote:
 On 31.05.2016 20:30, Andrei Alexandrescu wrote:
 D's
Phobos'
foreach, too. -- Andrei
Incorrect. https://dpaste.dzfl.pl/ba7a65d59534
Try typing the iteration variable with "dchar". -- Andrei
I think you are not getting my point. This is not autodecoding. There is nothing auto-magic w.r.t. strings in plain foreach. Typing char, wchar or dchar is the same as using byChar, byWchar or byDchar - it is opt-in. The only problems are the front, empty and popFront overloads for narrow strings.
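To illustrate the opt-in nature, a small sketch:

void main()
{
    string s = "øy";   // 3 UTF-8 code units, 2 code points

    size_t units, points;
    foreach (char c; s)  ++units;   // iterates code units, no decoding
    foreach (dchar c; s) ++points;  // decodes on the fly, because you asked for dchar
    assert(units == 3);
    assert(points == 2);
}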
Jun 01 2016
next sibling parent ZombineDev <petar.p.kirov gmail.com> writes:
On Wednesday, 1 June 2016 at 19:07:26 UTC, ZombineDev wrote:
 On Wednesday, 1 June 2016 at 17:57:15 UTC, Andrei Alexandrescu 
 wrote:
 On 06/01/2016 01:35 PM, ZombineDev wrote:
 On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu 
 wrote:
 On 05/31/2016 02:46 PM, Timon Gehr wrote:
 On 31.05.2016 20:30, Andrei Alexandrescu wrote:
 D's
Phobos'
foreach, too. -- Andrei
Incorrect. https://dpaste.dzfl.pl/ba7a65d59534
Try typing the iteration variable with "dchar". -- Andrei
I think you are not getting my point. This is not autodecoding. There is nothing auto-magic w.r.t. strings in plain foreach. Typing char, wchar or dchar is the same as using byChar, byWchar or byDchar - it is opt-in. The only problems are the front, empty and popFront overloads for narrow strings...
in std.range.primitives.
Jun 01 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/01/2016 03:07 PM, ZombineDev wrote:
 This is not autodecoding. There is nothing auto-magic w.r.t. strings in
 plain foreach.
I understand where you're coming from, but it actually is autodecoding. Consider:

byte[] a;
foreach (byte x; a) {}
foreach (short x; a) {}
foreach (int x; a) {}

That works by means of a conversion short->int. However:

char[] a;
foreach (char x; a) {}
foreach (wchar x; a) {}
foreach (dchar x; a) {}

The latter two do autodecoding, not conversion as the rest of the language.

Andrei
Jun 01 2016
next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu 
wrote:
 foreach (dchar x; a) {}
 The latter two do autodecoding, not coversion as the rest of 
 the language.
This seems to be a miscommunication with semantics. This is not auto-decoding at all; you're decoding, but there is nothing "auto" about it. This code is an explicit choice by the programmer to do something. On the other hand, using std.range.primitives.front for narrow strings is auto-decoding because the programmer has not made a choice, the choice is made for the programmer.
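In code, that difference looks roughly like this (a small sketch):

import std.range.primitives : front;

void main()
{
    string s = "é";   // two UTF-8 code units

    static assert(is(typeof(s.front) == dchar));  // front decodes for you...
    assert(s.front == 'é');
    assert(s.length == 2);                        // ...while length/indexing stay at code units
}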
Jun 01 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/01/2016 05:30 PM, Jack Stouffer wrote:
 On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu wrote:
 foreach (dchar x; a) {}
 The latter two do autodecoding, not coversion as the rest of the
 language.
This seems to be a miscommunication with semantics. This is not auto-decoding at all; you're decoding, but there is nothing "auto" about it. This code is an explicit choice by the programmer to do something.
No, this is autodecoding pure and simple. We can't move the goalposts whenever we don't like where the ball lands. The usual language rules are not applied for strings - they are autodecoded (i.e. there's code generated that magically decodes UTF surprisingly for beginners, in apparent violation of the language rules, and without any user-visible request) by the foreach statement. -- Andrei
Jun 01 2016
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 01.06.2016 23:48, Andrei Alexandrescu wrote:
 On 06/01/2016 05:30 PM, Jack Stouffer wrote:
 On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu wrote:
 foreach (dchar x; a) {}
 The latter two do autodecoding, not coversion as the rest of the
 language.
This seems to be a miscommunication with semantics. This is not auto-decoding at all; you're decoding, but there is nothing "auto" about it. This code is an explicit choice by the programmer to do something.
No, this is autodecoding pure and simple. We can't move the goalposts whenever we don't like where the ball lands.
It does not share most of the characteristics that make Phobos' autodecoding painful in practice.
 The usual language rules are
 not applied for strings - they are autodecoded (i.e. there's code
 generated that magically decodes UTF surprisingly for beginners, in
 apparent violation of the language rules, and without any user-visible
 request) by the foreach statement. -- Andrei
Agreed. (But implicit conversion from char to dchar is a bad language rule.)
Jun 02 2016
prev sibling next sibling parent reply ZombineDev <petar.p.kirov gmail.com> writes:
On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu 
wrote:
 On 06/01/2016 03:07 PM, ZombineDev wrote:
 This is not autodecoding. There is nothing auto-magic w.r.t. 
 strings in
 plain foreach.
I understand where you're coming from, but it actually is autodecoding. Consider: byte[] a; foreach (byte x; a) {} foreach (short x; a) {} foreach (int x; a) {} That works by means of the usual implicit conversions (byte -> short, byte -> int). However: char[] a; foreach (char x; a) {} foreach (wchar x; a) {} foreach (dchar x; a) {} The latter two do autodecoding, not conversion as in the rest of the language. Andrei
Regardless of what different people may call it, it's not what this thread is about. Deprecating front, popFront and empty for narrow strings is what we are talking about here. This has little to do with explicit string transcoding in foreach. I don't think anyone has a problem with it, because it is **opt-in** and easy to change to get the desired behavior. On the other hand, trying to prevent Phobos from autodecoding without typesystem-defeating hacks like .representation is an uphill battle right now.

Removing range autodecoding will also be beneficial for library writers. For example, instead of writing find specializations for char, wchar and dchar needles, it would be much more productive to focus on optimising searching for T in T[] and specializing on element size and other type properties that generic code should care about. Having to specialize for all the char and string types, instead of just any type of that size that can be compared bitwise, is like programming in a language with no support for generic programming.

And like many others have pointed out, it is also about correctness. Only the users can decide if searching at code unit, code point or grapheme level (or something else) is right for their needs. A library that pretends that a single interpretation (i.e. code point) is right for every case is a false friend.
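To make the three levels concrete, a minimal sketch (my own illustration, assuming the 'ö' literals are in precomposed NFC form; 0xC3 is just the first UTF-8 code unit of 'ö'):

import std.algorithm.searching : canFind;
import std.string : representation;
import std.uni : byGrapheme, Grapheme;

void main()
{
    string s = "cöde";
    assert(s.representation.canFind(0xC3));      // code unit level: raw UTF-8 bytes
    assert(s.canFind('ö'));                      // code point level: today's autodecoding default
    assert(s.byGrapheme.canFind(Grapheme("ö"))); // grapheme level: explicit opt-in
}

Which of the three is "right" depends entirely on what the caller is trying to do.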
Jun 01 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/01/2016 06:09 PM, ZombineDev wrote:
 Regardless of how different people may call it, it's not what this
 thread is about.
Yes, definitely - but then again we can't, after each invalidated claim, go "yeah well, but that other point stands".
 Deprecating front, popFront and empty for narrow
 strings is what we are talking about here.
That will not happen. Walter and I consider the cost excessive and the benefit too small.
 This has little to do with
 explicit string transcoding in foreach.
It is implicit, not explicit.
 I don't think anyone has a
 problem with it, because it is **opt-in** and easy to change to get the
 desired behavior.
It's not opt-in. There is no way to tell foreach "iterate this array by converting char to dchar by the usual language rules, no autodecoding". You can if you e.g. use uint for the iteration variable. Same deal as with .representation.
 On the other hand, trying to prevent Phobos from autodecoding without
 typesystem defeating hacks like .representation is an uphill battle
 right now.
Characterizing .representation as a typesystem defeating hack is a stretch. What memory safety issues is it introducing? Andrei
Jun 01 2016
next sibling parent Kagamin <spam here.lot> writes:
On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu 
wrote:
 Deprecating front, popFront and empty for narrow
 strings is what we are talking about here.
That will not happen. Walter and I consider the cost excessive and the benefit too small.
 This has little to do with
 explicit string transcoding in foreach.
It is implicit, not explicit.
Do you mean you agree that range primitives for strings can be changed to stay (auto)decoding to dchar, but require some form of explicit opt-in?
Jun 02 2016
prev sibling next sibling parent reply ZombineDev <petar.p.kirov gmail.com> writes:
On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu 
wrote:
 On 06/01/2016 06:09 PM, ZombineDev wrote:
 Regardless of how different people may call it, it's not what 
 this
 thread is about.
Yes, definitely - but then again we can't after each invalidated claim to go "yeah well but that other point stands".
My claim was not invalidated. I just didn't want to waste time arguing about it, because it is off topic. My point was that foreach is purely a language construct that doesn't know about the std.range.primitives module, therefore doesn't use it, and therefore foreach doesn't perform **auto**decoding. It does perform explicit decoding, because you need to specify a different type of iteration variable to trigger the behavior. If the variable type is not specified, you won't get any decoding (it will instead iterate over the code units).
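A tiny sketch of that last point (my own illustration; std.traits.Unqual is used only to sidestep the exact qualifier the compiler infers):

import std.traits : Unqual;

void main()
{
    string s = "ö";
    foreach (c; s)                                      // no type specified for c
        static assert(is(Unqual!(typeof(c)) == char));  // inferred element is a code unit, not dchar
}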
 Deprecating front, popFront and empty for narrow
 strings is what we are talking about here.
That will not happen. Walter and I consider the cost excessive and the benefit too small.
On the other hand, many people think that the cost of using a language (like C++) that has accumulated an excessive number of bad design decisions and pitfalls is too high. Keeping bad design decisions alienates existing users and repulses new ones.

I know you are in a difficult decision-making position, but imagine telling people ten years from now:

A) For the last ten years we worked on fixing every bad design and improving all the good ones. That's why we managed to expand our market share/mind share 10x-100x compared to what we had before.

B) This strange feature you need to know about is here because we chose compatibility with old code over building the best language possible. The language managed to continue growing (but not as fast as we hoped) only because of the other good features. You should use this feature and here's a long list of things you need to consider when avoiding it.

The majority of D users ten years from now are not yet D users. That's the target group you need to consider. And given the overwhelming support for fixing this problem among the existing users, you need to reevaluate your cost vs benefit metrics.

This theme (breaking code) has come up many times before, and I think that instead of complaining about the cost, we should focus on lowering it with tooling. The problem I currently see is that there is not enough support for building and improving tools like dfix and leveraging them for the language/std lib design process.
 This has little to do with
 explicit string transcoding in foreach.
It is implicit, not explicit.
 I don't think anyone has a
 problem with it, because it is **opt-in** and easy to change 
 to get the
 desired behavior.
It's not opt-in.
You need to opt in by specifying the type of the iteration variable, and that type needs to be different from typeof(array[0]). That's opt-in in my book.
 There is no way to tell foreach "iterate this array by 
 converting char to dchar by the usual language rules, no 
 autodecoding". You can if you e.g. use uint for the iteration 
 variable. Same deal as with .representation.
Again, off topic. No sane person wants automatic conversion (bitcast) from char to dchar, because dchar gives the impression of a fully decoded code point, which the result of such a cast would certainly not provide.
 On the other hand, trying to prevent Phobos from autodecoding 
 without
 typesystem defeating hacks like .representation is an uphill 
 battle
 right now.
Characterizing .representation as a typesystem defeating hack is a stretch. What memory safety issues is it introducing?
Memory safety is not the only benefit of a type system. This goal is only a small subset of the larger goal of preventing logical errors and allowing greater expressiveness. You may as well invent a memory-safe subset of D that works only with ubyte, ushort, uint, ulong and arrays of those types, but I don't think anyone would want to use such a language. Using .representation in parts of your code makes those parts like the aforementioned language that no one wants to use.
Jun 02 2016
next sibling parent ZombineDev <petar.p.kirov gmail.com> writes:
 ...

 B) This strange feature you need to know about is here because 
 we chose compatibility with old code, over building the best 
 language possible. The language managed to continue growing 
 (but not as fast as we hoped) only because of the other good 
 features. You should use this feature and here's a long list of 
 things you need to consider when avoiding it.
B) This strange feature is here because we chose compatibility with old code, over building the best language possible. The language managed to continue growing (but not as fast as we hoped) only because of the other good features. You shouldn't use this feature because of this and that potential pitfall, and here's a long list of things you need to consider when avoiding it.
 ...
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 06:42 AM, ZombineDev wrote:
 On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu wrote:
 On 06/01/2016 06:09 PM, ZombineDev wrote:
 Regardless of how different people may call it, it's not what this
 thread is about.
Yes, definitely - but then again we can't after each invalidated claim to go "yeah well but that other point stands".
My claim was not invalidated. I just didn't want to waste time arguing about it, because it is off topic. My point was that foreach is a purely language construct that doesn't know about the std.range.primitives module, therefore doesn't use it and therefore foreach doesn't perform **auto**decoding. It does perform explicit decoding because you need to specify a different type of iteration variable to trigger the behavior. If the variable type is not specified, you won't get any decoding (it will instead iterate over the code units).
Your claim was obliterated, and now you continue arguing it by adjusting term definitions on the fly, while at the same time awesomely claiming to choose the high road by not wasting time to argue it. I should remember the trick :o). Stand with the points that stand, own those that don't.
 Deprecating front, popFront and empty for narrow
 strings is what we are talking about here.
That will not happen. Walter and I consider the cost excessive and the benefit too small.
On the other hand many people think that the cost of using a language (like C++) that has accumulated excessive number of bad design decisions and pitfalls is too high. Keeping bad design decisions alienates existing users and repulses new ones.
Definitely. It's a fine line to walk; this particular decision is not that much on the edge at all. We must stay with autodecoding.
 I know you are in a difficult decision making position, but imagine
 telling people ten years from now:

 A) For the last ten years we worked on fixing every bad design and
 improving all the good ones. That's why we managed to expand our market
 share/mind share 10x-100x to what we had before.
I think we have underperformed and we need to do radically better. I'm on the lookout for radical new approaches to things all the time. This is for another discussion though.
 B) This strange feature you need to know about is here because we chose
 compatibility with old code, over building the best language possible.
 The language managed to continue growing (but not as fast as we hoped)
 only because of the other good features. You should use this feature and
 here's a long list of things you need to consider when avoiding it.
There are many components to the decision, not only compatibility with old code.
 The majority of D users ten years from now are not yet D users. That's
 the target group you need to consider. And given the overwhelming
 support for fixing this problem by the existing users, you need to
 reevaluate your cost vs benefit metrics.
It's funny that evidence for the "overwhelming" support is the vote of 35 voters, which was cast in terms of percentages. Math is great.

ZombineDev, I've been at the top level in the C++ community for many many years, even after I wanted to exit :o). I'm familiar with how the committee that steers C++ works, a perspective that is unique in our community - even Walter lacks it. I see trends and patterns. It is interesting how easily a small but very influential priesthood can alienate itself from the needs of the larger community and get into a frenzy over matters that are simply missing the point.

This is what's happening here. We worked ourselves to a foam because the creator of the language started a thread entitled "The Case Against Autodecode", whilst fully understanding there is no way to actually eliminate autodecode. The very definition of a useless debate, the kind he and I had agreed to not initiate anymore. It was a mistake. I'm still metaphorically angry at him for it. I admit I started it by asking the question, but Walter shouldn't have answered. Following that, there was blood in the water; any of us loves to improve something by 2% by completely rewiring the thing. A proneness to doing that is why we self-select to be in this community and forum.

Meanwhile, I go to conferences. Train and consult at large companies. Dozens every year, cumulatively thousands of people. I talk about D and ask people what it would take for them to use the language. Invariably I hear a surprisingly small number of reasons:

* The garbage collector eliminates probably 60% of potential users right off.

* Tooling is immature and of poorer quality compared to the competition.

* Safety has holes and bugs.

* Hiring people who know D is a problem.

* Documentation and tutorials are weak.

* There's no web services framework (by this time many folks know of D, but of those a shockingly small fraction has even heard of vibe.d). I strongly argued with Sönke to bundle vibe.d with dmd over one year ago, and also in this forum. There wasn't enough interest.

* (On Windows) if it doesn't have a compelling Visual Studio plugin, it doesn't exist.

* Let's wait for the "herd effect" (corporate support) to start.

* Not enough advantages over the competition to make up for the weaknesses above.

There is a second echelon of arguments related to language proper issues, but those collectively count as much less than the above. And "inefficient/poor/error-prone string handling" has NEVER come up. Literally NEVER, even among people who had some familiarity with D and would otherwise make very informed comments about it.

Look at reddit and hackernews, too - admittedly other self-selected communities. Language debates often spring up. How often is the point being made that D is wanting because of its string support? Nada.
 This theme (breaking code) has come up many times before and I think
 that instead of complaining about the cost, we should focus on lower it
 with tooling. The problem I currently see is that there is not enough
 support for building and improving tools like dfix and leveraging them
 for language/std lib design process.
Currently dfix is weak because it doesn't do lookup. So we need to make the front end into a library. Daniel said he wants to be on it, but he has two jobs to worry about so he's short on time. There's only so many hours in the day, and I think the right focus is on attacking the matters above.
 This has little to do with
 explicit string transcoding in foreach.
It is implicit, not explicit.
 I don't think anyone has a
 problem with it, because it is **opt-in** and easy to change to get the
 desired behavior.
It's not opt-in.
You need to opt in by specifying the type of the iteration variable, and that type needs to be different from typeof(array[0]). That's opt-in in my book.
Taking exception to language rules for iteration with dchar is not opt-in.
 There is no way to tell foreach "iterate this array by converting char
 to dchar by the usual language rules, no autodecoding". You can if you
 e.g. use uint for the iteration variable. Same deal as with
 .representation.
Again, off topic.
It's very on-topic. These are surprising semantics compared to the rest of the language, which the user needs to be informed about.
 No sane person wants automatic conversion (bitcast)
 from char to dchar, because dchar gives the impression of a fully
 decoded code point, which the result of such cast would certainly not
 provide.
void fun(char c)
{
    if (c < 0x80)
    {
        // Look ma I'm not a sane person
        dchar d = c; // conversion is implicit, too
        ...
    }
}
 On the other hand, trying to prevent Phobos from autodecoding without
 typesystem defeating hacks like .representation is an uphill battle
 right now.
Characterizing .representation as a typesystem defeating hack is a stretch. What memory safety issues is it introducing?
Memory safety is not the only benefit of a type system. This goal is only a small subset of the larger goal of preventing logical errors and allowing greater expressiveness.
This sounds like "no comeback here so let's insert a filler". Care to substantiate?
 You may as well invent a memory safe subset of D that works only ubyte,
 ushort, uint, ulong and arrays of those types, but I don't think anyone
 would want to use such language. Using .representation in parts of your
 code, makes those parts like the aforementioned language that no one
 wants to use.
I disagree. Andrei
Jun 02 2016
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 15:06, Andrei Alexandrescu wrote:
 On 06/02/2016 06:42 AM, ZombineDev wrote:
 On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu wrote:
 On 06/01/2016 06:09 PM, ZombineDev wrote:
 Regardless of how different people may call it, it's not what this
 thread is about.
Yes, definitely - but then again we can't after each invalidated claim to go "yeah well but that other point stands".
My claim was not invalidated. I just didn't want to waste time arguing about it, because it is off topic. My point was that foreach is a purely language construct that doesn't know about the std.range.primitives module, therefore doesn't use it and therefore foreach doesn't perform **auto**decoding. It does perform explicit decoding because you need to specify a different type of iteration variable to trigger the behavior. If the variable type is not specified, you won't get any decoding (it will instead iterate over the code units).
Your claim was obliterated, and now you continue arguing it by adjusting term definitions on the fly, while at the same time awesomely claiming to choose the high road by not wasting time to argue it. I should remember the trick :o). Stand with the points that stand, own those that don't.
It's not "on the fly". You two were presumably using different definitions of terms all along.
Jun 02 2016
prev sibling next sibling parent reply cym13 <cpicard openmailbox.org> writes:
On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu 
wrote:
 Your claim was obliterated, and now you continue arguing it by 
 adjusting term definitions on the fly, while at the same time 
 awesomely claiming to choose the high road by not wasting time 
 to argue it. I should remember the trick :o). Stand with the 
 points that stand, own those that don't.
 Definitely. It's a fine line to walk; this particular decision 
 is not that much on the edge at all. We must stay with 
 autodecoding.
If you are to stay with autodecoding (and I hope you won't) then please, *please*, at least make it decode to graphemes, so that it decodes to something that actually has some kind of meaning of its own.
 I think we have underperformed and we need to do radically 
 better. I'm on lookout for radical new approaches to things all 
 the time. This is for another discussion though.

 There are many components to the decision, not only 
 compatibility with old code.

 It's funny that evidence for the "overwhelming" support is the 
 vote of 35 voters, which was cast in terms of percentages. Math 
 is great.

 ZombineDev, I've been at the top level in the C++ community for 
 many many years, even after I wanted to exit :o). I'm familiar 
 with how the committee that steers C++ works, perspective that 
 is unique in our community - even Walter lacks it. I see trends 
 and patterns. It is interesting how easily a small but very 
 influential priesthood can alienate itself from the needs of 
 the larger community and get into a frenzy over matters that 
 are simply missing the point.

 This is what's happening here. We worked ourselves to a foam 
 because the creator of the language started a thread entitled 
 "The Case Against Autodecode", whilst fully understanding there 
 is no way to actually eliminate autodecode. The very definition 
 of a useless debate, the kind he and I had agreed to not 
 initiate anymore. It was a mistake. I'm still metaphorically 
 angry at him for it. I admit I started it by asking the 
 question, but Walter shouldn't have answered. Following that, 
 there was blood in the water; any of us loves to improve 
 something by 2% by completely rewiring the thing. A proneness 
 to doing that is why we self-select to be in this community and 
 forum.

 Meanwhile, I go to conferences. Train and consult at large 
 companies. Dozens every year, cumulatively thousands of people. 
 I talk about D and ask people what it would take for them to 
 use the language. Invariably I hear a surprisingly small number 
 of reasons:

 * The garbage collector eliminates probably 60% of potential 
 users right off.

 * Tooling is immature and of poorer quality compared to the 
 competition.

 * Safety has holes and bugs.

 * Hiring people who know D is a problem.

 * Documentation and tutorials are weak.

 * There's no web services framework (by this time many folks 
 know of D, but of those a shockingly small fraction has even 
 heard of vibe.d). I have strongly argued with Sönke to bundle 
 vibe.d with dmd over one year ago, and also in this forum. 
 There wasn't enough interest.

 * (On Windows) if it doesn't have a compelling Visual Studio 
 plugin, it doesn't exist.

 * Let's wait for the "herd effect" (corporate support) to start.

 * Not enough advantages over the competition to make up for the 
 weaknesses above.

 There is a second echelon of arguments related to language 
 proper issues, but those collectively count as much less than 
 the above. And "inefficient/poor/error-prone string handling" 
 has NEVER come up. Literally NEVER, even among people who had 
 some familiarity with D and would otherwise make very informed 
 comments about it.

 Look at reddit and hackernews, too - admittedly other 
 self-selected communities. Language debates often spring about. 
 How often is the point being made that D is wanting because of 
 its string support? Nada.
I think the real reason why this isn't mentioned in the criticisms you list is that people don't know about it. Most people don't even imagine it can be as broken as it is. Heck, it even took Walter by surprise after years! This thread is the first real discussion we've had about it with proper deconstruction and very reasonable arguments against it. The only unreasonable thing here has been your own arguments. I'd like not to point a finger at you, but the fact is that you are the only one defending autodecoding, and not with good arguments.

Currently autodecoding relies on chance only. (Yes, I call “hoping the text we're manipulating can be represented by dchars” chance.) This cannot go on anymore.
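For what it's worth, a minimal sketch of why "fits in a dchar" is a gamble (the string is 'e' plus a combining acute accent, i.e. one visible character):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301";                  // one user-perceived character
    assert(s.length == 3);                 // UTF-8 code units
    assert(s.walkLength == 2);             // autodecoded code points - still not one "character"
    assert(s.byGrapheme.walkLength == 1);  // graphemes
}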
 Currently dfix is weak because it doesn't do lookup. So we need 
 to make the front end into a library. Daniel said he wants to 
 be on it, but he has two jobs to worry about so he's short on 
 time. There's only so many hours in the day, and I think the 
 right focus is on attacking the matters above.
...
 Andrei
Jun 02 2016
next sibling parent tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 13:55:28 UTC, cym13 wrote:
 If you are to stay with autodecoding (and I hope you won't) then
 please, *please*, at least make it decode to graphemes so that
 it decodes to something that actually have some kind of meaning
 of its own.
That would cause just as much - if not more - code breakage as ditching auto-decoding entirely. It would also be considerably slower and more memory-hungry.
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 09:55 AM, cym13 wrote:
 On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu wrote:
 Your claim was obliterated, and now you continue arguing it by
 adjusting term definitions on the fly, while at the same time
 awesomely claiming to choose the high road by not wasting time to
 argue it. I should remember the trick :o). Stand with the points that
 stand, own those that don't.
 Definitely. It's a fine line to walk; this particular decision is not
 that much on the edge at all. We must stay with autodecoding.
If you are to stay with autodecoding (and I hope you won't) then please, *please*, at least make it decode to graphemes so that it decodes to something that actually have some kind of meaning of its own.
That's not going to work. A false impression created in this thread has been that code points are useless and graphemes are da bomb. That's not the case even if we ignore the overwhelming issue of changing semantics of existing code.
 I think the real reason about why this isn't mentioned in the
 critics you mention is that people don't know about it. Most people
 don't even imagine it can be as broken as it is.
This should be taken at face value - rampant speculation. From my experience that's not how these things work.
 Heck, it even
 took Walter by surprise after years! This thread is the first real
 discussion we've had about it with proper deconstruction and
 very reasonnable arguments against it. The only unreasonnable thing
 here has been your own arguments. I'd like not to point a finger at
 you but the fact is that you are the only single one defending
 autodecoding and not with good arguments.
Fair enough. I accept continuous scrutiny of my competency - it comes with the territory.
 Currently autodecoding relies on chance only. (Yes, I call “hoping
 the text we're manipulating can be represented by dchars” chance.)
 This cannot be anymore.
The real ticket out of this is RCStr. It solves a major problem in the language (compulsive GC) and also a minor occasional annoyance (autodecoding). Andrei
Jun 02 2016
parent reply Marc Schütz <schuetzm gmx.net> writes:
On Thursday, 2 June 2016 at 14:28:44 UTC, Andrei Alexandrescu 
wrote:
 That's not going to work. A false impression created in this 
 thread has been that code points are useless
They _are_ useless for almost anything you can do with strings. The only places where they should be used are std.uni and std.regex. Again: What is the justification for using code points, in your opinion? Which practical tasks are made possible (and work _correctly_) if you decode to code points, that don't already work with code units?
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 01:54 PM, Marc Schütz wrote:
 On Thursday, 2 June 2016 at 14:28:44 UTC, Andrei Alexandrescu wrote:
 That's not going to work. A false impression created in this thread
 has been that code points are useless
They _are_ useless for almost anything you can do with strings. The only places where they should be used are std.uni and std.regex. Again: What is the justification for using code points, in your opinion? Which practical tasks are made possible (and work _correctly_) if you decode to code points, that don't already work with code units?
Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16).

* s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.

* s.any!(c => c == 'ö') works only with autodecoding. It returns always false without.

* s.balancedParens('〈', '〉') works only with autodecoding.

* s.canFind('ö') works only with autodecoding. It returns always false without.

* s.commonPrefix(s1) works only if they both use the same encoding; otherwise it still compiles but silently produces an incorrect result.

* s.count('ö') works only with autodecoding. It returns always zero without.

* s.countUntil(s1) is really odd - without autodecoding, whether it works at all, and the result it returns, depends on both encodings. With autodecoding it always works and returns a number independent of the encodings.

* s.endsWith('ö') works only with autodecoding. It returns always false without.

* s.endsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.

* s.find('ö') works only with autodecoding. It never finds it without.

* s.findAdjacent is a very interesting one. It works with autodecoding, but without it it just does odd things.

* s.findAmong(s1) is also interesting. It works only with autodecoding.

* s.findSkip(s1) works only if s and s1 have the same encoding. Otherwise it compiles and runs but produces incorrect results.

* s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) work only if s and s1 have the same encoding. Otherwise they compile and run but produce incorrect results.

* s.minCount, s.maxCount are unlikely to be terribly useful but with autodecoding it consistently returns the extremum numeric code unit regardless of representation. Without, they just return encoding-dependent and meaningless numbers.

* s.minPos, s.maxPos follow a similar semantics.

* s.skipOver(s1) only works with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.

* s.startsWith('ö') works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.

* s.startsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.

* s.until!(c => c == 'ö') works only with autodecoding. Otherwise, it will span the entire range.

===

The intent of autodecoding was to make std.algorithm work meaningfully with strings. As it's easy to see I just went through std.algorithm.searching alphabetically and found issues literally with every primitive in there. It's an easy exercise to go forth with the others.

Andrei
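For anyone who wants to try one of these, a minimal sketch of the count case ("öl" is an arbitrary example string, written with an explicit precomposed ö; .representation is the ubyte view discussed earlier in the thread):

import std.algorithm.searching : count;
import std.string : representation;

void main()
{
    string s = "\u00F6l";                     // "öl" with a precomposed ö
    assert(s.count('ö') == 1);                // autodecoding: compares decoded code points
    assert(s.representation.count('ö') == 0); // raw code units: no single byte equals 'ö'
}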
Jun 02 2016
next sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns always
 false without.
Doesn't work with autodecoding (to code points) when a combining diaeresis (U+0308) is used in s. Would actually work with UTF-16 and only combined 'ö's in s, because the combined character fits in a single UTF-16 code unit.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
ag0aep6g <anonymous example.com> wrote:
 On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with possibly
 different encodings (UTF8/UTF16).
 
 * s.all!(c => c == 'ö') works only with autodecoding. It returns always
 false without.
Doesn't work with autodecoding (to code points) when a combining diaeresis (U+0308) is used in s.
Works if s is normalized appropriately. No?
Jun 02 2016
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 21:26, Andrei Alexandrescu wrote:
 ag0aep6g <anonymous example.com> wrote:
 On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns always
 false without.
Doesn't work with autodecoding (to code points) when a combining diaeresis (U+0308) is used in s.
Works if s is normalized appropriately. No?
No.

assert(!"ö̶".normalize!NFC.any!(c => c == 'ö'));
Jun 02 2016
prev sibling parent ag0aep6g <anonymous example.com> writes:
On 06/02/2016 09:26 PM, Andrei Alexandrescu wrote:
 ag0aep6g <anonymous example.com> wrote:
 On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns always
 false without.
Doesn't work with autodecoding (to code points) when a combining diaeresis (U+0308) is used in s.
Works if s is normalized appropriately. No?
Works when normalized to precomposed characters, yes. That's not a given, of course. When the user is aware enough to normalize their strings that way, then they should be able to call byDchar explicitly. And of course you can't do s.all!(c => c == 'a⃗'), despite a⃗ looking like one character. Need byGrapheme for that.
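A small sketch of the byGrapheme route for exactly that case (my own illustration; the needle is 'a' plus a combining arrow, U+20D7, which no single dchar can represent):

import std.algorithm.searching : canFind;
import std.uni : byGrapheme, Grapheme;

void main()
{
    string s = "xa\u20D7y";                            // contains the grapheme 'a' + U+20D7
    assert(s.byGrapheme.canFind(Grapheme("a\u20D7"))); // found only at the grapheme level
}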
Jun 02 2016
prev sibling next sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu 
wrote:
 Pretty much everything. Consider s and s1 string variables with 
 possibly different encodings (UTF8/UTF16).
 ...
Your 'ö' examples will NOT work reliably with auto-decoded code points, and for nearly the same reason that they won't work with code units; you would have to use byGrapheme. The fact that you still don't get that, even after a dozen plus attempts by the community to explain the difference, makes you unfit to direct Phobos' Unicode support. Please, either go study Unicode until you really understand it, or delegate this issue to someone else.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 03:34 PM, tsbockman wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with
 possibly different encodings (UTF8/UTF16).
 ...
Your 'ö' examples will NOT work reliably with auto-decoded code points, and for nearly the same reason that they won't work with code units; you would have to use byGrapheme.
They do work per spec: find this code point. It would be surprising if 'ö' were found but the string were positioned at a different code point.
 The fact that you still don't get that, even after a dozen plus attempts
 by the community to explain the difference, makes you unfit to direct
 Phobos' Unicode support.
Well there's gotta be a reason why my basic comprehension is under constant scrutiny whereas yours is safe.
 Please, either go study Unicode until you
 really understand it, or delegate this issue to someone else.
Would be happy to. To whom would I delegate? Andrei
Jun 02 2016
next sibling parent Brad Anderson <eco gnuk.net> writes:
On Thursday, 2 June 2016 at 20:13:14 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 03:34 PM, tsbockman wrote:
 [...]
They do work per spec: find this code point. It would be surprising if 'ö' were found but the string were positioned at a different code point.
 [...]
Well there's gotta be a reason why my basic comprehension is under constant scrutiny whereas yours is safe.
 [...]
Would be happy to. To whom would I delegate? Andrei
If there were to be a unicode lieutenant, Dmitry seems to be the obvious choice (if he's interested).
Jun 02 2016
prev sibling next sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 10:13 PM, Andrei Alexandrescu wrote:
 They do work per spec: find this code point. It would be surprising if
 'ö' were found but the string were positioned at a different code point.
The "spec" here is how the range primitives for narrow strings are defined, right? I.e., the spec says auto-decode code units to code points. The discussion is about whether the spec is good or bad. No one is arguing that there are bugs in the decoding to code points. People are arguing that auto-decoding to code points is not useful.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:23 PM, ag0aep6g wrote:
 People are arguing that auto-decoding to code points is not useful.
And want to return to the point where char[] is but an indiscriminated array, which would take std.algorithm back to the stone age. -- Andrei
Jun 02 2016
next sibling parent reply default0 <Kevin.Labschek gmx.de> writes:
On Thursday, 2 June 2016 at 20:30:34 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 04:23 PM, ag0aep6g wrote:
 People are arguing that auto-decoding to code points is not 
 useful.
And want to return to the point where char[] is but an indiscriminated array, which would take std.algorithm back to the stone age. -- Andrei
Just make RCStr the most amazing string type of any standard library ever and everyone will be happy :o)
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:37 PM, default0 wrote:
 On Thursday, 2 June 2016 at 20:30:34 UTC, Andrei Alexandrescu wrote:
 On 06/02/2016 04:23 PM, ag0aep6g wrote:
 People are arguing that auto-decoding to code points is not useful.
And want to return to the point where char[] is but an indiscriminated array, which would take std.algorithm back to the stone age. -- Andrei
Just make RCStr the most amazing string type of any standard library ever and everyone will be happy :o)
Soon as this thread ends. -- Andrei
Jun 02 2016
prev sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 10:30 PM, Andrei Alexandrescu wrote:
 And want to return to the point where char[] is but an indiscriminated
 array, which would take std.algorithm back to the stone age. -- Andrei
I think you'd have to substantiate how that would be worse than auto-decoding. Your examples only show that treating code points as characters falls apart at a higher level than treating code units as characters. But it still falls apart. Failing early is a quality.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:47 PM, ag0aep6g wrote:
 On 06/02/2016 10:30 PM, Andrei Alexandrescu wrote:
 And want to return to the point where char[] is but an indiscriminated
 array, which would take std.algorithm back to the stone age. -- Andrei
I think you'd have to substantiate how that would be worse than auto-decoding.
I gave a long list of std.algorithm uses that perform virtually randomly on char[].
 Your examples only show that treating code points as characters falls
 apart at a higher level than treating code units as characters. But it
 still falls apart. Failing early is a quality.
It does not fall apart for code points. Andrei
Jun 02 2016
parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 10:50 PM, Andrei Alexandrescu wrote:
 It does not fall apart for code points.
Yes it does. You've been given plenty of examples where it falls apart. Your answer to that was that it operates on code points, not graphemes. Well, duh. Comparing UTF-8 code units against each other works, too. That's not an argument for doing that by default.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:01 PM, ag0aep6g wrote:
 On 06/02/2016 10:50 PM, Andrei Alexandrescu wrote:
 It does not fall apart for code points.
 Yes it does. You've been given plenty of examples where it falls apart.
There weren't any.
 Your answer to that was that it operates on code points, not graphemes.
That is correct.
 Well, duh. Comparing UTF-8 code units against each other works, too.
 That's not an argument for doing that by default.
Nope, that's a radically different matter. As the examples show, the examples would be entirely meaningless at code unit level. Andrei
Jun 02 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 23:06, Andrei Alexandrescu wrote:
 As the examples show, the examples would be entirely meaningless at code
 unit level.
So far, I needed to count the number of characters 'ö' inside some string exactly zero times, but I wanted to chain or join strings relatively often.
Jun 02 2016
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 23:16, Timon Gehr wrote:
 On 02.06.2016 23:06, Andrei Alexandrescu wrote:
 As the examples show, the examples would be entirely meaningless at code
 unit level.
So far, I needed to count the number of characters 'ö' inside some string exactly zero times,
(Obviously this isn't even what the example would do. I predict I will never need to count the number of code points 'ö' by calling some function from std.algorithm directly.)
 but I wanted to chain or join strings
 relatively often.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:19 PM, Timon Gehr wrote:
 On 02.06.2016 23:16, Timon Gehr wrote:
 On 02.06.2016 23:06, Andrei Alexandrescu wrote:
 As the examples show, the examples would be entirely meaningless at code
 unit level.
So far, I needed to count the number of characters 'ö' inside some string exactly zero times,
(Obviously this isn't even what the example would do. I predict I will never need to count the number of code points 'ö' by calling some function from std.algorithm directly.)
You may look for a specific dchar, and it'll work. How about findAmong("...") with a bunch of ASCII and Unicode punctuation symbols? -- Andrei
Jun 02 2016
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 23:23, Andrei Alexandrescu wrote:
 On 6/2/16 5:19 PM, Timon Gehr wrote:
 On 02.06.2016 23:16, Timon Gehr wrote:
 On 02.06.2016 23:06, Andrei Alexandrescu wrote:
 As the examples show, the examples would be entirely meaningless at
 code
 unit level.
So far, I needed to count the number of characters 'ö' inside some string exactly zero times,
(Obviously this isn't even what the example would do. I predict I will never need to count the number of code points 'ö' by calling some function from std.algorithm directly.)
You may look for a specific dchar, and it'll work. How about findAmong("...") with a bunch of ASCII and Unicode punctuation symbols? -- Andrei
.̂ ̪.̂

(Copy-paste it somewhere else, I think it might not be rendered correctly on the forum.)

The point is that if I do:

".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."), Grapheme(",")])

no match is returned.

If I use your method with dchars, I will get spurious matches. I.e. the suggested method to look for punctuation symbols is incorrect:

writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂"

(Also, do you have a use case for this?)
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:43 PM, Timon Gehr wrote:
 .̂ ̪.̂

 (Copy-paste it somewhere else, I think it might not be rendered
 correctly on the forum.)

 The point is that if I do:

 ".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")])

 no match is returned.

 If I use your method with dchars, I will get spurious matches. I.e. the
 suggested method to look for punctuation symbols is incorrect:

 writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂"
Nice example.
 (Also, do you have an use case for this?)
Count delimited words. Did you also look at balancedParens? Andrei
Jun 02 2016
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 23:46, Andrei Alexandrescu wrote:
 On 6/2/16 5:43 PM, Timon Gehr wrote:
 .̂ ̪.̂

 (Copy-paste it somewhere else, I think it might not be rendered
 correctly on the forum.)

 The point is that if I do:

 ".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")])

 no match is returned.

 If I use your method with dchars, I will get spurious matches. I.e. the
 suggested method to look for punctuation symbols is incorrect:

 writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂"
Nice example. ...
Thanks! :o)
 (Also, do you have an use case for this?)
Count delimited words. Did you also look at balancedParens? Andrei
On 02.06.2016 22:01, Timon Gehr wrote:
 * s.balancedParens('〈', '〉') works only with autodecoding.
 ...
Doesn't work, e.g. s="⟨⃖". Shouldn't compile.
assert("⟨⃖".normalize!NFC.byGrapheme.balancedParens(Grapheme("⟨"), Grapheme("⟩")));
writeln("⟨⃖".balancedParens('⟨', '⟩')); // false
Jun 02 2016
prev sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:
 Nope, that's a radically different matter. As the examples show, the
 examples would be entirely meaningless at code unit level.
They're simply not possible. Won't compile. There is no single UTF-8 code unit for 'ö', so you can't (easily) search for it in a range of code units. Just like there is no single code point for 'a⃗', so you can't search for it in a range of code points. You can still search for 'a', and 'o', and the rest of ASCII in a range of code units.
Jun 02 2016
next sibling parent ag0aep6g <anonymous example.com> writes:
On 06/02/2016 11:24 PM, ag0aep6g wrote:
 They're simply not possible. Won't compile. There is no single UTF-8
 code unit for 'ö', so you can't (easily) search for it in a range for
 code units. Just like there is no single code point for 'a⃗' so you can't
 search for it in a range of code points.

 You can still search for 'a', and 'o', and the rest of ASCII in a range
 of code units.
I'm ignoring combining characters there. You can search for 'a' in code units in the same way that you can search for 'ä' in code points. I.e., more or less, depending on how serious you are about combining characters.
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:24 PM, ag0aep6g wrote:
 On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:
 Nope, that's a radically different matter. As the examples show, the
 examples would be entirely meaningless at code unit level.
They're simply not possible. Won't compile.
They do compile.
 There is no single UTF-8
 code unit for 'ö', so you can't (easily) search for it in a range for
 code units.
Of course you can. Can you search for an int in a short[]? Oh yes you can. Can you search for a dchar in a char[]? Of course you can. Autodecoding also gives it meaning.
 Just like there is no single code point for 'a⃗' so you can't
 search for it in a range of code points.
Of course you can.
 You can still search for 'a', and 'o', and the rest of ASCII in a range
 of code units.
You can search for a dchar in a char[] because you can compare an individual dchar with either another dchar (correct, autodecoding) or with a char (incorrect, no autodecoding). As I said: this thread produces an unpleasant amount of arguments in favor of autodecoding. Even I don't like that :o). Andrei
Jun 02 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:27 PM, Andrei Alexandrescu wrote:
 On 6/2/16 5:24 PM, ag0aep6g wrote:
 Just like there is no single code point for 'a⃗' so you can't
 search for it in a range of code points.
Of course you can.
Correction, indeed you can't. -- Andrei
Jun 02 2016
prev sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 11:27 PM, Andrei Alexandrescu wrote:
 On 6/2/16 5:24 PM, ag0aep6g wrote:
 On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:
 Nope, that's a radically different matter. As the examples show, the
 examples would be entirely meaningless at code unit level.
They're simply not possible. Won't compile.
They do compile.
Yes, you're right, of course they do. char implicitly converts to dchar. I didn't think of that anti-feature.
 As I said: this thread produces an unpleasant amount of arguments in
 favor of autodecoding. Even I don't like that :o).
It's more of an argument against char : dchar, I'd say.
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:35 PM, ag0aep6g wrote:
 On 06/02/2016 11:27 PM, Andrei Alexandrescu wrote:
 On 6/2/16 5:24 PM, ag0aep6g wrote:
 On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:
 Nope, that's a radically different matter. As the examples show, the
 examples would be entirely meaningless at code unit level.
They're simply not possible. Won't compile.
They do compile.
Yes, you're right, of course they do. char implicitly converts to dchar. I didn't think of that anti-feature.
 As I said: this thread produces an unpleasant amount of arguments in
 favor of autodecoding. Even I don't like that :o).
It's more of an argument against char : dchar, I'd say.
I do think that's an interesting option in PL design space, but that would be super disruptive. -- Andrei
Jun 02 2016
prev sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 20:13:14 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 03:34 PM, tsbockman wrote:
 Your 'ö' examples will NOT work reliably with auto-decoded 
 code points,
 and for nearly the same reason that they won't work with code 
 units; you
 would have to use byGrapheme.
They do work per spec: find this code point. It would be surprising if 'ö' were found but the string were positioned at a different code point.
Your examples will pass or fail depending on how (and whether) the 'ö' grapheme is normalized. They only ever succeed because 'ö' happens to be one of the privileged graphemes that *can* be (but often isn't!) represented as a single code point. Many other graphemes have no such representation.

Working directly with code points is sometimes useful anyway - but then, working with code units can be, also. Neither will lead to inherently "correct" Unicode processing, and in the absence of a compelling context, your examples fall completely flat as an argument for the inherent superiority of processing at the code point level.
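A minimal sketch of that normalization dependence (same string, same algorithm, different normalization form; "öl" is written with an explicit precomposed ö):

import std.algorithm.searching : canFind;
import std.uni : normalize, NFC, NFD;

void main()
{
    string s = "\u00F6l";                  // "öl" with a precomposed ö
    assert(s.normalize!NFC.canFind('ö'));  // NFC keeps the single code point: found
    assert(!s.normalize!NFD.canFind('ö')); // NFD splits it into 'o' + U+0308: not found
}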
 The fact that you still don't get that, even after a dozen 
 plus attempts
 by the community to explain the difference, makes you unfit to 
 direct
 Phobos' Unicode support.
Well there's gotta be a reason why my basic comprehension is under constant scrutiny whereas yours is safe.
Who said mine is safe? I *know* that I'm not qualified to be in charge of this. Your comprehension is under greater scrutiny because you are proposing to overrule nearly all other active contributors combined.
 Please, either go study Unicode until you
 really understand it, or delegate this issue to someone else.
Would be happy to. To whom would I delegate?
If you're serious, I would suggest Dmitry Olshansky. He seems to be our top Unicode expert, based on his contributions to `std.uni` and `std.regex`. But, if he is unwilling/unsuitable for some reason there are other candidates participating in this thread (not me).
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:36 PM, tsbockman wrote:
 Your examples will pass or fail depending on how (and whether) the 'ö'
 grapheme is normalized.
And that's fine. Want graphemes, .byGrapheme wags its tail in that corner. Otherwise, you work on code points which is a completely meaningful way to go about things. What's not meaningful is the random results you get from operating on code units.
 They only ever succeeds because 'ö' happens to
 be one of the privileged graphemes that *can* be (but often isn't!)
 represented as a single code point. Many other graphemes have no such
 representation.
Then there's no dchar for them so no problem to start with. s.find(c) ----> "Find code unit c in string s" Andrei
Jun 02 2016
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jun 02, 2016 at 04:38:28PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 06/02/2016 04:36 PM, tsbockman wrote:
 Your examples will pass or fail depending on how (and whether) the
 'ö' grapheme is normalized.
And that's fine. Want graphemes, .byGrapheme wags its tail in that corner. Otherwise, you work on code points which is a completely meaningful way to go about things. What's not meaningful is the random results you get from operating on code units.
 They only ever succeeds because 'ö' happens to be one of the
 privileged graphemes that *can* be (but often isn't!) represented as
 a single code point. Many other graphemes have no such
 representation.
Then there's no dchar for them so no problem to start with. s.find(c) ----> "Find code unit c in string s"
[...]

This is a ridiculous argument. We might as well say, "there's no single-byte UTF-8 that can represent Ш, so that's no problem to start with" -- since we can just define it away by saying s.find(c) == "find byte c in string s", and thereby justify using ASCII as our standard string representation.

The point is that dchar is NOT ENOUGH TO REPRESENT A SINGLE CHARACTER in the general case. It is adequate for a subset of characters -- just like ASCII is also adequate for a subset of characters. If you only need to work with ASCII, it suffices to work with ubyte[]. Similarly, if your work is restricted to only languages without combining diacritics, then a range of dchar suffices. But a range of dchar is NOT good enough in the general case, and arguing that it is only makes you look like a fool.

Appealing to normalization doesn't change anything either, since only a subset of base character + diacritic combinations will normalize to a single code point. If the string has a base character + diacritic combination that doesn't have a precomposed code point, it will NOT fit in a dchar. (And keep in mind that the notion of diacritic is still very Euro-centric. In Korean, for example, a single character is composed of multiple parts, each of which occupies 1 code point. While some precomposed combinations do exist, they don't cover all of the possibilities, so normalization won't help you there.)

T

-- 
Frank disagreement binds closer than feigned agreement.
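A short sketch of the normalization point (é has a precomposed code point; x plus a combining circumflex, as far as I know, does not):

import std.range : walkLength;
import std.uni : normalize, NFC;

void main()
{
    assert("e\u0301".normalize!NFC.walkLength == 1); // composes to the single code point é
    assert("x\u0302".normalize!NFC.walkLength == 2); // no precomposed form; stays two code points
}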
Jun 02 2016
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu 
wrote:
 Pretty much everything. Consider s and s1 string variables with 
 possibly different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It 
 returns always false without.
False. Many characters can be represented by different sequences of code points. For instance, ê can be a single code point, or an 'e' followed by a combining circumflex modifier. ö is one such character.
 * s.any!(c => c == 'ö') works only with autodecoding. It 
 returns always false without.
False. (While this is pretty much the same as 1, one can come up with as many examples as desired by tweaking the same one to produce endless variations.)
 * s.balancedParens('〈', '〉') works only with autodecoding.
Not sure, so I'll say OK.
 * s.canFind('ö') works only with autodecoding. It returns 
 always false without.
False.
 * s.commonPrefix(s1) works only if they both use the same 
 encoding; otherwise it still compiles but silently produces an 
 incorrect result.
False.
 * s.count('ö') works only with autodecoding. It returns always 
 zero without.
False.
 * s.countUntil(s1) is really odd - without autodecoding, 
 whether it works at all, and the result it returns, depends on 
 both encodings. With autodecoding it always works and returns a 
 number independent of the encodings.
False.
 * s.endsWith('ö') works only with autodecoding. It returns 
 always false without.
False.
 * s.endsWith(s1) works only with autodecoding. Otherwise it 
 compiles and runs but produces incorrect results if s and s1 
 have different encodings.
False.
 * s.find('ö') works only with autodecoding. It never finds it 
 without.
False.
 * s.findAdjacent is a very interesting one. It works with 
 autodecoding, but without it it just does odd things.
Not sure so I'll say OK, while I strongly suspect that, like for other, this will only work if string are normalized.
 * s.findAmong(s1) is also interesting. It works only with 
 autodecoding.
False.
 * s.findSkip(s1) works only if s and s1 have the same encoding. 
 Otherwise it compiles and runs but produces incorrect results.
False.
 * s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) 
 work only if s and s1 have the same encoding. Otherwise they 
 compile and run but produce incorrect results.
False.
 * s.minCount, s.maxCount are unlikely to be terribly useful but 
 with autodecoding it consistently returns the extremum numeric 
 code unit regardless of representation. Without, they just 
 return encoding-dependent and meaningless numbers.
Note sure, so I'll say ok.
 * s.minPos, s.maxPos follow a similar semantics.
Note sure, so I'll say ok.
 * s.skipOver(s1) only works with autodecoding. Otherwise it 
 compiles and runs but produces incorrect results if s and s1 
 have different encodings.
False.
 * s.startsWith('ö') works only with autodecoding. Otherwise it 
 compiles and runs but produces incorrect results if s and s1 
 have different encodings.
False.
 * s.startsWith(s1) works only with autodecoding. Otherwise it 
 compiles and runs but produces incorrect results if s and s1 
 have different encodings.
False.
 * s.until!(c => c == 'ö') works only with autodecoding. 
 Otherwise, it will span the entire range.
False.
 ===

 The intent of autodecoding was to make std.algorithm work 
 meaningfully with strings. As it's easy to see I just went 
 through std.algorithm.searching alphabetically and found issues 
 literally with every primitive in there. It's an easy exercise 
 to go forth with the others.


 Andrei
I mean, what a trainwreck. Your examples say it all, don't they? Almost none of them would work without normalizing the string first. And that is the point you've been refusing to hear so far: autodecoding doesn't pay for itself, as it is unable to do what it is supposed to do in the general case.

Really, there is not much you can do with anything Unicode related without first going through normalization. If you want anything more than substring search or the like, you'll also need a collation, which is locale dependent (for sorting, for instance). Supporting Unicode, IMO, would mean providing facilities to normalize (preferably lazily, as a range), to manage collations, and so on. Decoding to code points just doesn't cut it.

As a result, any algorithm that needs to support strings has to either fight against the language because it doesn't need decoding, use decoding and accept being incorrect without normalization, or do the correct thing by itself (which is also going to require working against the language).
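As a sketch of what "do the correct thing by itself" can look like with what Phobos already provides (sameText here is a hypothetical helper, not an existing API):

import std.uni : normalize, NFC;

// Hypothetical helper: compare two strings as text rather than as arrays of
// code units or code points, by bringing both to the same normalization form.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    assert(sameText("\u00F6", "o\u0308")); // 'ö' in both spellings
    assert(!sameText("o", "\u00F6"));
}

Collation-aware ordering, as noted above, would additionally need locale data that Phobos does not currently ship.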
Jun 02 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 03:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with
 possibly different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns
 always false without.
False.
True. "Are all code points equal to this one?" -- Andrei
Jun 02 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 22:13, Andrei Alexandrescu wrote:
 On 06/02/2016 03:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with
 possibly different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns
 always false without.
False.
True. "Are all code points equal to this one?" -- Andrei
I.e. you are saying that 'works' means 'operates on code points'.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:17 PM, Timon Gehr wrote:
 I.e. you are saying that 'works' means 'operates on code points'.
Affirmative. -- Andrei
Jun 02 2016
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jun 02, 2016 at 04:28:45PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 06/02/2016 04:17 PM, Timon Gehr wrote:
 I.e. you are saying that 'works' means 'operates on code points'.
Affirmative. -- Andrei
Again, a ridiculous position. I can use exactly the same line of argument for why we should just standardize on ASCII. All I have to do is to define "work" to mean "operates on an ASCII character", and then every ASCII algorithm "works" by definition, so nobody can argue with me. Unfortunately, everybody else's definition of "work" is different from mine, so the argument doesn't hold water.

Similarly, you are the only one whose definition of "work" means "operates on code points". Basically nobody else here uses that definition, so while you may be right according to your own made-up tautological arguments, none of your conclusions actually have any bearing in the real world of Unicode handling.

Give it up. It is beyond reasonable doubt that autodecoding is a liability. D should be moving away from autodecoding instead of clinging to historical mistakes in the face of overwhelming evidence. (And note, I said *auto*-decoding; decoding by itself obviously is very relevant. But it needs to be opt-in because of its performance and correctness implications. The user needs to be able to choose whether to decode, and how to decode.)

T

-- 
Freedom: (n.) Man's self-given right to be enslaved by his own depravity.
Jun 02 2016
prev sibling next sibling parent reply cym13 <cpicard openmailbox.org> writes:
On Thursday, 2 June 2016 at 20:13:52 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 03:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu 
 wrote:
 Pretty much everything. Consider s and s1 string variables 
 with
 possibly different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It 
 returns
 always false without.
False.
True. "Are all code points equal to this one?" -- Andrei
A:“We should decode to code points”
B:“No, decoding to code points is a stupid idea.”
A:“No it's not!”
B:“Can you show a concrete example where it does something useful?”
A:“Sure, look at that!”
B:“This isn't working at all, look at all those counter-examples!”
A:“It may not work for your examples but look how easy it is to find code points!”

*Sigh*
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:22 PM, cym13 wrote:
 A:“We should decode to code points”
 B:“No, decoding to code points is a stupid idea.”
 A:“No it's not!”
 B:“Can you show a concrete example where it does something useful?”
 A:“Sure, look at that!”
 B:“This isn't working at all, look at all those counter-examples!”
 A:“It may not work for your examples but look how easy it is to
     find code points!”
With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- Andrei
Jun 02 2016
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 22:29, Andrei Alexandrescu wrote:
 On 06/02/2016 04:22 PM, cym13 wrote:
 A:“We should decode to code points”
 B:“No, decoding to code points is a stupid idea.”
 A:“No it's not!”
 B:“Can you show a concrete example where it does something useful?”
 A:“Sure, look at that!”
 B:“This isn't working at all, look at all those counter-examples!”
 A:“It may not work for your examples but look how easy it is to
     find code points!”
With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- Andrei
No, without it, it operates correctly on code units.
Jun 02 2016
prev sibling next sibling parent reply cym13 <cpicard openmailbox.org> writes:
On Thursday, 2 June 2016 at 20:29:48 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 04:22 PM, cym13 wrote:
 A:“We should decode to code points”
 B:“No, decoding to code points is a stupid idea.”
 A:“No it's not!”
 B:“Can you show a concrete example where it does something 
 useful?”
 A:“Sure, look at that!”
 B:“This isn't working at all, look at all those 
 counter-examples!”
 A:“It may not work for your examples but look how easy it is to
     find code points!”
With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- Andrei
Allow me to try another angle:

- There are different levels of unicode support and you don't want to support them all transparently. That's understandable.

- The level you choose to support is the code point level. There are many good arguments about why this isn't a good default but you won't change your mind. I don't like that at all and I'm not alone, but let's forget the entirety of the vocal D community for a moment.

- A huge part of unicode chars can be normalized to fit your definition. That way not everything works (far from it) but a sufficiently big subset works.

- On the other hand, without normalization it just doesn't make any sense from a user perspective. The ö example has clearly shown that much; you even admitted it yourself by stating that many counter arguments would have worked had the string been normalized.

- The most prominent problem is with graphemes that can have different representations, as those that can't be normalized can't be searched as dchars either.

- If autodecoding to code points is to stay, then in an effort to find a compromise normalizing should be done by default. Sure it would take some more time but it wouldn't break any code (I think) and would actually make things more correct. They still wouldn't be correct, but I feel that something as crazy as unicode cannot be tackled generically anyway.
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:38 PM, cym13 wrote:
 Allow me to try another angle:

 - There are different levels of unicode support and you don't want to
 support them all transparently. That's understandable.
Cool.
 - The level you choose to support is the code point level. There are
 many good arguments about why this isn't a good default but you won't
 change your mind. I don't like that at all and I'm not alone but let's
 forget the entirety of the vocal D community for a moment.
You mean all 35 of them? It's not about changing my mind! The massive thing is that code-point-level handling is the incumbent, and that changing it would need to be an absolutely Earth-shattering improvement to be worth it!
 - A huge part of unicode chars can be normalized to fit your
 definition. That way not everything work (far from it) but a
 sufficiently big subset works.
Cool.
 - On the other hand without normalization it just doesn't make any
 sense from a user perspective.The ö example has clearly shown that
 much, you even admitted it yourself by stating that many counter
 arguments would have worked had the string been normalized).
Yah, operating at code point level does not come free of caveats. It is vastly superior to operating on code units, and did I mention it's the incumbent.
 - The most proeminent problem is with graphems that can have different
 representations as those that can't be normalized can't be searched as
 dchars as well.
Yah, I'd say if the program needs graphemes the option is there. Phobos by default deals with code points which are not perfect but are independent of representation, produce meaningful and consistent results with std.algorithm etc.
 - If autodecoding to code points is to stay and in an effort to find a
 compromise then normalizing should be done by default. Sure it would
 take some more time but it wouldn't break any code (I think) and would
 actually make things more correct. They still wouldn't be correct but
 I feel that something as crazy as unicode cannot be tackled
 generically anyway.
Some more work on normalization at strategic points in Phobos would be interesting! Andrei
Jun 02 2016
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jun 02, 2016 at 04:29:48PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 06/02/2016 04:22 PM, cym13 wrote:
 
 A:“We should decode to code points”
 B:“No, decoding to code points is a stupid idea.”
 A:“No it's not!”
 B:“Can you show a concrete example where it does something useful?”
 A:“Sure, look at that!”
 B:“This isn't working at all, look at all those counter-examples!”
 A:“It may not work for your examples but look how easy it is to
     find code points!”
With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- Andrei
With ASCII strings, all of std.algorithm operates correctly on ASCII bytes. So let's standardize on ASCII strings. What a vacuous argument!

Basically you're saying "I define code points to be correct. Therefore, I conclude that decoding to code points is correct." Well, duh. Unfortunately such vacuous conclusions have no bearing in the real world of Unicode handling.

T

-- 
I am Ohm of Borg. Resistance is voltage over current.
Jun 02 2016
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 2 June 2016 at 20:13:52 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 03:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu 
 wrote:
 Pretty much everything. Consider s and s1 string variables 
 with
 possibly different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It 
 returns
 always false without.
False.
True. "Are all code points equal to this one?" -- Andrei
The good thing, when you define "works" as whatever it does right now, is that everything always works and there is literally never any bug. The bad thing is that this is a completely useless definition of "works".

The sample code won't count every instance of the grapheme 'ö', as some of its encodings won't be counted, which definitely counts as "doesn't work".

When your point needs to redefine words in ways that nobody agrees with, it is time to admit the point is bogus.
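For the record, a small D sketch of the counting problem being described (assuming std.uni.normalize behaves as documented):

import std.algorithm.searching : count;
import std.stdio : writeln;
import std.uni : normalize, NFC;

void main()
{
    // The grapheme 'ö' appears twice: once precomposed, once decomposed.
    string s = "\u00F6 o\u0308";

    writeln(s.count('\u00F6'));                // 1 -- the decomposed spelling is missed
    writeln(normalize!NFC(s).count('\u00F6')); // 2 -- only correct after normalizing first
}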
Jun 02 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:20 PM, deadalnix wrote:
 The good thing when you define works by whatever it does right now
No, it works as it was designed. -- Andrei
Jun 02 2016
parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu 
wrote:
 On 6/2/16 5:20 PM, deadalnix wrote:
 The good thing when you define works by whatever it does right 
 now
No, it works as it was designed. -- Andrei
Nobody says it doesn't. Everybody says the design is crap.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:35 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:
 On 6/2/16 5:20 PM, deadalnix wrote:
 The good thing when you define works by whatever it does right now
No, it works as it was designed. -- Andrei
Nobody says it doesn't. Everybody says the design is crap.
I think I like it more after this thread. -- Andrei
Jun 02 2016
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 2 June 2016 at 21:37:11 UTC, Andrei Alexandrescu 
wrote:
 On 6/2/16 5:35 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu 
 wrote:
 On 6/2/16 5:20 PM, deadalnix wrote:
 The good thing when you define works by whatever it does 
 right now
No, it works as it was designed. -- Andrei
Nobody says it doesn't. Everybody says the design is crap.
I think I like it more after this thread. -- Andrei
You're starting to remind me of the joke about the guy complaining that everybody else is going backward on the highway.
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:38 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 21:37:11 UTC, Andrei Alexandrescu wrote:
 On 6/2/16 5:35 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:
 On 6/2/16 5:20 PM, deadalnix wrote:
 The good thing when you define works by whatever it does right now
No, it works as it was designed. -- Andrei
Nobody says it doesn't. Everybody says the design is crap.
I think I like it more after this thread. -- Andrei
You start reminding me of the joke with that guy complaining that everybody is going backward on the highway.
Touché. (Get it?) -- Andrei
Jun 02 2016
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:37 PM, Andrei Alexandrescu wrote:
 On 6/2/16 5:35 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:
 On 6/2/16 5:20 PM, deadalnix wrote:
 The good thing when you define works by whatever it does right now
No, it works as it was designed. -- Andrei
Nobody says it doesn't. Everybody says the design is crap.
I think I like it more after this thread. -- Andrei
Meh, thinking of it again: I don't like it more, I'd still do it differently given a clean slate (viz. RCStr). But let's say I didn't get many compelling reasons to remove autodecoding from this thread. -- Andrei
Jun 02 2016
prev sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 06/02/2016 05:37 PM, Andrei Alexandrescu wrote:
 On 6/2/16 5:35 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:
 On 6/2/16 5:20 PM, deadalnix wrote:
 The good thing when you define works by whatever it does right now
No, it works as it was designed. -- Andrei
Nobody says it doesn't. Everybody says the design is crap.
I think I like it more after this thread. -- Andrei
Well there's a fantastic argument.
Jun 03 2016
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 23:20, deadalnix wrote:
 The sample code won't count the instance of the grapheme 'ö' as some of
 its encoding won't be counted, which definitively count as doesn't work.
It also has false positives (you can combine 'ö' with some combining character in order to get some strange character that is not an 'ö', and not even NFC helps with that).
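A quick D sketch of such a false positive (the combining low line is just one arbitrary choice of combining character):

import std.algorithm.searching : canFind;
import std.stdio : writeln;

void main()
{
    // Precomposed 'ö' followed by a combining low line: a single grapheme
    // that a reader would not consider to be an 'ö'.
    string s = "\u00F6\u0332";

    writeln(s.canFind('\u00F6')); // true at the code point level -- a false positive
}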
Jun 02 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 12:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns always false
 without.
False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.
There are 3 levels of Unicode support. What Andrei is talking about is Level 1.

http://unicode.org/reports/tr18/tr18-5.1.html

I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.
Jun 02 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:27 PM, Walter Bright wrote:
 On 6/2/2016 12:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns
 always false
 without.
False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.
There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.html
Apparently I'm not the only idiot. -- Andrei
Jun 02 2016
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:
 On 6/2/2016 12:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu 
 wrote:
 Pretty much everything. Consider s and s1 string variables 
 with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It 
 returns always false
 without.
False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.
There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.html I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.
To be able to convert back and forth from/to unicode in a lossless manner.
Jun 02 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 2:25 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:
 I wonder what rationale there is for Unicode to have two different sequences
 of codepoints be treated as the same. It's madness.
To be able to convert back and forth from/to unicode in a lossless manner.
Sorry, that makes no sense, as it is saying "they're the same, only different."
Jun 02 2016
prev sibling next sibling parent reply John Colvin <john.loughran.colvin gmail.com> writes:
On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:
 On 6/2/2016 12:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu 
 wrote:
 Pretty much everything. Consider s and s1 string variables 
 with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It 
 returns always false
 without.
False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.
There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.html I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.
There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character? It's an interesting idea, but it's not how things are.
Jun 02 2016
next sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, June 02, 2016 22:27:16 John Colvin via Digitalmars-d wrote:
 On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:
 I wonder what rationale there is for Unicode to have two
 different sequences of codepoints be treated as the same. It's
 madness.
There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character? It's an interesting idea, but it's not how things are.
Yeah. I'm inclined to think that the fact that there are multiple normalizations was a huge mistake on the part of the Unicode folks, but we're stuck dealing with it. And as horrible as it is for most cases, maybe it _does_ ultimately make sense because of certain use cases; I don't know. But bad idea or not, we're stuck. :( - Jonathan M Davis
Jun 02 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 3:27 PM, John Colvin wrote:
 I wonder what rationale there is for Unicode to have two different sequences
 of codepoints be treated as the same. It's madness.
There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character?
I didn't say ordering, I said there should be no such thing as "normalization" in Unicode, where two codepoints are considered to be identical to some other codepoint.
Jun 02 2016
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jun 02, 2016 at 05:19:48PM -0700, Walter Bright via Digitalmars-d wrote:
 On 6/2/2016 3:27 PM, John Colvin wrote:
 I wonder what rationale there is for Unicode to have two different
 sequences of codepoints be treated as the same. It's madness.
There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character?
I didn't say ordering, I said there should be no such thing as "normalization" in Unicode, where two codepoints are considered to be identical to some other codepoint.
I think it was a combination of historical baggage and trying to accommodate unusual but still valid use cases.

The historical baggage was that Unicode was trying to unify all of the various already-existing codepages out there, and many of those codepages already come with various precomposed characters. To maximize compatibility with existing codepages, Unicode tried to preserve as much of the original mappings as possible within each 256-point block, so these precomposed characters became part of the standard.

However, there weren't enough of them -- some people demanded less common character + diacritic combinations, and some languages had writing so complex their characters had to be composed from more basic parts. The original Unicode range was 16-bit, so there wasn't enough room to fit all of the precomposed characters people demanded, plus there were other things people wanted, like multiple diacritics (e.g., in IPA). So the concept of combining diacritics was invented, in part to prevent combinatorial explosion from soaking up the available code point space, in part to allow for novel combinations of diacritics that somebody out there somewhere might want to represent. However, this meant that some precomposed characters were "redundant": they represented character + diacritic combinations that could equally well be expressed separately. Normalization was the inevitable consequence. (Normalization, of course, also subsumes a few other things, such as collation, but this is one of the factors behind it.)

(This is a greatly over-simplified description, of course. At the time Unicode also had to grapple with tricky issues like what to do with lookalike characters that served different purposes or had different meanings, e.g., the mu sign in the math block vs. the real letter mu in the Greek block, or the Cyrillic A which looks and behaves exactly like the Latin A, yet the Cyrillic Р, which looks like the Latin P, does *not* mean the same thing (it's the equivalent of R), or the Cyrillic В whose lowercase is в not b, and also had a different sound, but lowercase Latin b looks very similar to Cyrillic ь, which serves a completely different purpose (the uppercase is Ь, not B, you see). Then you have the wonderful Indic and Arabic cursive writings, where letterforms mutate depending on the surrounding context, which, if you were to include all variants as distinct code points, would occupy many more pages than they currently do. And also sticky issues like the oft-mentioned Turkish i, which is encoded as a Latin i but behaves differently w.r.t. upper/lowercasing when in Turkish locale -- some cases of this, IIRC, are unfixable bugs in Phobos because we currently do not handle locales. So you see, imagining that code points == the solution to Unicode string handling is a joke. Writing correct Unicode handling is *hard*.)

As with all sufficiently complex software projects, Unicode represents a compromise between many contradictory factors -- writing systems in the world being the complex, not-very-consistent beasts they are -- so such "dirty" details are somewhat inevitable.

T

-- 
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
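The Turkish-i point is easy to demonstrate; std.uni.toUpper takes no locale argument, so this minimal sketch shows the locale-blind result:

import std.stdio : writeln;
import std.uni : toUpper;

void main()
{
    // Locale-independent mapping: always 'i' -> 'I'. For Turkish text the
    // expected result would be "İSTANBUL" (with a dotted capital I).
    writeln("istanbul".toUpper); // ISTANBUL
}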
Jun 03 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 However, this
 meant that some precomposed characters were "redundant": they
 represented character + diacritic combinations that could equally well
 be expressed separately. Normalization was the inevitable consequence.
It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead. There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple. I.e. have the normalization up front when the text is created rather than everywhere else.
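The precedent mentioned here can be checked directly; the sketch below assumes std.utf.validate rejects overlong sequences, which are invalid UTF-8 by definition:

import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // 0xC0 0xAF is an overlong (hence invalid) encoding of '/'.
    char[] overlong = [cast(char) 0xC0, cast(char) 0xAF];
    assertThrown!UTFException(validate(overlong));
}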
Jun 03 2016
parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
 On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 However, this
 meant that some precomposed characters were "redundant": they
 represented character + diacritic combinations that could 
 equally well
 be expressed separately. Normalization was the inevitable 
 consequence.
It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead. There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple. I.e. have the normalization up front when the text is created rather than everywhere else.
I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposed characters, because that means that previously valid sequences are now invalid.
Jun 03 2016
next sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via Digitalmars-d wrote:
 On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
 On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 However, this
 meant that some precomposed characters were "redundant": they
 represented character + diacritic combinations that could
 equally well
 be expressed separately. Normalization was the inevitable
 consequence.
It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead. There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple. I.e. have the normalization up front when the text is created rather than everywhere else.
I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposited characters, because that means that previously valid sequences are now invalid.
I would have argued that no precomposed characters should ever have existed, regardless of what was done in previous encodings, since they're redundant, and you need the non-precomposed characters to avoid a combinatorial explosion of characters, so you can't have characters that exist only in a precomposed version and be consistent. However, the Unicode folks obviously didn't go that route. But given where we sit now, even though we're stuck with some precomposed characters, I'd argue that we should at least never add any new ones. But who knows what the Unicode folks are actually going to do.

As it is, you probably should normalize strings in many cases where they enter the program, just like ideally, you'd validate them when they enter the program. But regardless, you have to deal with the fact that multiple normalization schemes exist and that there's no guarantee that you're even going to get valid Unicode, let alone Unicode that's normalized the way you want.

- Jonathan M Davis
Jun 03 2016
parent reply Chris <wendlec tcd.ie> writes:
On Friday, 3 June 2016 at 11:46:50 UTC, Jonathan M Davis wrote:
 On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via 
 Digitalmars-d wrote:
 On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
 On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 However, this
 meant that some precomposed characters were "redundant": 
 they
 represented character + diacritic combinations that could
 equally well
 be expressed separately. Normalization was the inevitable
 consequence.
It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead. There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple. I.e. have the normalization up front when the text is created rather than everywhere else.
I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposited characters, because that means that previously valid sequences are now invalid.
I would have argued that no composited characters should have ever existed regardless of what was done in previous encodings, since they're redundant, and you need the non-composited characters to avoid a combinatorial explosion of characters, so you can't have characters that just have a composited version and be consistent. However, the Unicode folks obviously didn't go that route. But given where we sit now, even though we're stuck with some composited characters, I'd argue that we should at least never add any new ones. But who knows what the Unicode folks are actually going to do. As it is, you probably should normalize strings in many cases where they enter the program, just like ideally, you'd validate them when they enter the program. But regardless, you have to deal with the fact that multiple normalization schemes exist and that there's no guarantee that you're even going to get valid Unicode, let alone Unicode that's normalized the way you want. - Jonathan M Davis
I do exactly this. Validate and normalize.
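A sketch of that boundary step with today's Phobos (sanitizeInput is a hypothetical name; validate throws std.utf.UTFException on bad input):

import std.uni : normalize, NFC;
import std.utf : validate;

// Hypothetical boundary helper: reject invalid UTF-8, then bring the text
// into a single normalization form before the rest of the program sees it.
auto sanitizeInput(string raw)
{
    validate(raw);            // throws UTFException if raw is not valid UTF-8
    return normalize!NFC(raw);
}

unittest
{
    assert(sanitizeInput("o\u0308") == "\u00F6");
}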
Jun 03 2016
parent deadalnix <deadalnix gmail.com> writes:
On Friday, 3 June 2016 at 12:04:39 UTC, Chris wrote:
 I do exactly this. Validate and normalize.
And once you've done this, auto decoding is useless because the same character has the same representation anyway.
Jun 05 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 3:10 AM, Vladimir Panteleev wrote:
 I don't think it would work (or at least, the analogy doesn't hold). It would
 mean that you can't add new precomposited characters, because that means that
 previously valid sequences are now invalid.
So don't add new precomposed characters when a recognized existing sequence exists.
Jun 03 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 At the time
 Unicode also had to grapple with tricky issues like what to do with
 lookalike characters that served different purposes or had different
 meanings, e.g., the mu sign in the math block vs. the real letter mu in
 the Greek block, or the Cyrillic A which looks and behaves exactly like
 the Latin A, yet the Cyrillic Р, which looks like the Latin P, does
 *not* mean the same thing (it's the equivalent of R), or the Cyrillic В
 whose lowercase is в not b, and also had a different sound, but
 lowercase Latin b looks very similar to Cyrillic ь, which serves a
 completely different purpose (the uppercase is Ь, not B, you see).
I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs. They should have put me in charge of Unicode. I'd have put a stop to much of the madness :-)
Jun 03 2016
next sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Friday, 3 June 2016 at 10:08:43 UTC, Walter Bright wrote:
 On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 At the time
 Unicode also had to grapple with tricky issues like what to do 
 with
 lookalike characters that served different purposes or had 
 different
 meanings, e.g., the mu sign in the math block vs. the real 
 letter mu in
 the Greek block, or the Cyrillic A which looks and behaves 
 exactly like
 the Latin A, yet the Cyrillic Р, which looks like the Latin P, 
 does
 *not* mean the same thing (it's the equivalent of R), or the 
 Cyrillic В
 whose lowercase is в not b, and also had a different sound, but
 lowercase Latin b looks very similar to Cyrillic ь, which 
 serves a
 completely different purpose (the uppercase is Ь, not B, you 
 see).
I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs.
That's not right either. Cyrillic letters can look slightly different from their Latin lookalikes in some circumstances. I'm sure there are extremely good reasons for not using the Latin lookalikes in the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use separate codes for the lookalikes. It's not restricted to Unicode.
Jun 03 2016
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jun 03, 2016 at 10:14:15AM +0000, Vladimir Panteleev via Digitalmars-d
wrote:
 On Friday, 3 June 2016 at 10:08:43 UTC, Walter Bright wrote:
 On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 At the time Unicode also had to grapple with tricky issues like
 what to do with lookalike characters that served different
 purposes or had different meanings, e.g., the mu sign in the math
 block vs. the real letter mu in the Greek block, or the Cyrillic A
 which looks and behaves exactly like the Latin A, yet the Cyrillic
 Р, which looks like the Latin P, does *not* mean the same thing
 (it's the equivalent of R), or the Cyrillic В whose lowercase is в
 not b, and also had a different sound, but lowercase Latin b looks
 very similar to Cyrillic ь, which serves a completely different
 purpose (the uppercase is Ь, not B, you see).
I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs.
That's not right either. Cyrillic letters can look slightly different from their latin lookalikes in some circumstances. I'm sure there are extremely good reasons for not using the latin lookalikes in the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use separate codes for the lookalikes. It's not restricted to Unicode.
Yeah, lowercase Cyrillic П is п, which looks like lowercase Greek π in some fonts, but in cursive form it looks more like Latin lowercase n. It wouldn't make sense to encode Cyrillic п the same as Greek π or Latin lowercase n just by appearance, since logically it stands as its own character despite its various appearances. But it wouldn't make sense to encode it differently just because you're using a different font!

Similarly, lowercase Cyrillic т in some cursive fonts looks like lowercase Latin m. I don't think it would make sense to encode lowercase т as Latin m just because of that.

Eventually you have no choice but to encode by logical meaning rather than by appearance, since there are many lookalikes between different languages that actually mean something completely different, and often behave completely differently.

T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
Jun 03 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:
 Eventually you have no choice but to encode by logical meaning rather
 than by appearance, since there are many lookalikes between different
 languages that actually mean something completely different, and often
 behaves completely differently.
It's almost as if printed documents and books have never existed!
Jun 03 2016
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jun 03, 2016 at 11:43:07AM -0700, Walter Bright via Digitalmars-d wrote:
 On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:
 Eventually you have no choice but to encode by logical meaning
 rather than by appearance, since there are many lookalikes between
 different languages that actually mean something completely
 different, and often behaves completely differently.
It's almost as if printed documents and books have never existed!
But if we were to encode appearance instead of logical meaning, that would mean the *same* lowercase Cyrillic ь would have multiple, different encodings depending on which font was in use. That doesn't seem like the right solution either. Do we really want Unicode strings to encode font information too?? 'Cos by that argument, serif and sans serif letters should have different encodings, because in languages like Hebrew, a tiny little serif could mean the difference between two completely different letters.

And what of the Arabic and Indic scripts? They would need to encode the same letter multiple times, each being a variation of the physical form that changes depending on the surrounding context. Even the Greek sigma has two forms depending on whether it's at the end of a word or not -- so should it be two code points or one? If you say two, then you'd have a problem with how to search for sigma in Greek text, and you'd have to search for either medial sigma or final sigma. But if you say one, then you'd have a problem with having two different letterforms for a single codepoint.

Besides, that still doesn't solve the problem of what "i".uppercase() should return. In most languages, it should return "I", but in Turkish it should not. And if we really went the route of encoding Cyrillic letters the same as their Latin lookalikes, we'd have a problem with what "m".uppercase() should return, because now it depends on which font is in effect (if it's a Cyrillic cursive font, the correct answer is "Т", if it's a Latin font, the correct answer is "M" -- the other combinations: who knows). That sounds far worse than what we have today.

T

-- 
Let's eat some disquits while we format the biskettes.
Jun 03 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 1:53 PM, H. S. Teoh via Digitalmars-d wrote:
 But if we were to encode appearance instead of logical meaning, that
 would mean the *same* lowercase Cyrillic ь would have multiple,
 different encodings depending on which font was in use.
I don't see that consequence at all.
 That doesn't
 seem like the right solution either.  Do we really want Unicode strings
 to encode font information too??
No.
  'Cos by that argument, serif and sans
 serif letters should have different encodings, because in languages like
 Hebrew, a tiny little serif could mean the difference between two
 completely different letters.
If they are different letters, then they should have a different code point. I don't see why this is such a hard concept.
 And what of the Arabic and Indic scripts? They would need to encode the
 same letter multiple times, each being a variation of the physical form
 that changes depending on the surrounding context. Even the Greek sigma
 has two forms depending on whether it's at the end of a word or not --
 so should it be two code points or one?
Two. Again, why is this hard to grasp? If there is meaning in having two different visual representations, then they are two codepoints. If the visual representation is the same, then it is one codepoint. If the difference is only due to font selection, then it is the same codepoint.
 Besides, that still doesn't solve the problem of what "i".uppercase()
 should return. In most languages, it should return "I", but in Turkish
 it should not.
 And if we really went the route of encoding Cyrillic
 letters the same as their Latin lookalikes, we'd have a problem with
 what "m".uppercase() should return, because now it depends on which font
 is in effect (if it's a Cyrillic cursive font, the correct answer is
 "Т", if it's a Latin font, the correct answer is "M" -- the other
 combinations: who knows).  That sounds far worse than what we have
 today.
The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.
Jun 03 2016
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jun 03, 2016 at 03:35:18PM -0700, Walter Bright via Digitalmars-d wrote:
 On 6/3/2016 1:53 PM, H. S. Teoh via Digitalmars-d wrote:
[...]
 'Cos by that argument, serif and sans serif letters should have
 different encodings, because in languages like Hebrew, a tiny little
 serif could mean the difference between two completely different
 letters.
If they are different letters, then they should have a different code point. I don't see why this is such a hard concept.
It's not a hard concept, except that these different letters have lookalike forms with completely unrelated letters. Again:

- Lowercase Latin m looks visually the same as lowercase Cyrillic Т in cursive form. In some font renderings the two are IDENTICAL glyphs, in spite of being completely different, unrelated letters. However, in non-cursive form, Cyrillic lowercase т is visually distinct.

- Similarly, lowercase Cyrillic П in cursive font looks like lowercase Latin n, and in some fonts they are identical glyphs. Again, completely unrelated letters, yet they have the SAME VISUAL REPRESENTATION. However, in non-cursive font, lowercase Cyrillic П is п, which is visually distinct from Latin n.

- These aren't the only ones, either. Other Cyrillic false friends include cursive Д, which in some fonts looks like lowercase Latin g. But in non-cursive font, it's д.

Just given the above, it should be clear that going by visual representation is NOT enough to disambiguate between these different letters. By your argument, since lowercase Cyrillic Т is, visually, just m, it should be encoded the same way as lowercase Latin m. But this is untenable, because the letterform changes with a different font. So you end up with the unworkable idea of a font-dependent encoding. Similarly, since lowercase Cyrillic П is n (in cursive font), we should encode it the same way as Latin lowercase n. But again, the letterform changes based on font.

Your criteria of "same visual representation" does not work outside of English. What you imagine to be a simple, straightforward concept is far from being simple once you're dealing with the diverse languages and writing systems of the world.

Or, to use an example closer to home, uppercase Latin O and the digit 0 are visually identical. Should they be encoded as a single code point or two? Worse, in some fonts, the digit 0 is rendered like Ø (to differentiate it from uppercase O). Does that mean that it should be encoded the same way as the Danish letter Ø? Obviously not, but according to your "visual representation" idea, the answer should be yes.

The bottomline is that uppercase O and the digit 0 represent different LOGICAL entities, in spite of their sharing the same visual representation. Eventually you have to resort to representing *logical* entities ("characters") rather than visual appearance, which is a property of the font, and has no place in a digital text encoding.
 Besides, that still doesn't solve the problem of what
 "i".uppercase() should return. In most languages, it should return
 "I", but in Turkish it should not.
 And if we really went the route of encoding Cyrillic letters the
 same as their Latin lookalikes, we'd have a problem with what
 "m".uppercase() should return, because now it depends on which font
 is in effect (if it's a Cyrillic cursive font, the correct answer is
 "Т", if it's a Latin font, the correct answer is "M" -- the other
 combinations: who knows).  That sounds far worse than what we have
 today.
The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.
But what should "i".toUpper return? Or are you saying the standard library should not include such a basic function as a case-changing function?

T

-- 
Customer support: the art of getting your clients to pay for your own incompetence.
Jun 03 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:
 It's not a hard concept, except that these different letters have
 lookalike forms with completely unrelated letters. Again:

 - Lowercase Latin m looks visually the same as lowercase Cyrillic Т in
   cursive form. In some font renderings the two are IDENTICAL glyphs, in
   spite of being completely different, unrelated letters.  However, in
   non-cursive form, Cyrillic lowercase т is visually distinct.

 - Similarly, lowercase Cyrillic П in cursive font looks like lowercase
   Latin n, and in some fonts they are identical glyphs. Again,
   completely unrelated letters, yet they have the SAME VISUAL
   REPRESENTATION.  However, in non-cursive font, lowercase Cyrillic П is
   п, which is visually distinct from Latin n.

 - These aren't the only ones, either.  Other Cyrillic false friends
   include cursive Д, which in some fonts looks like lowercase Latin g.
   But in non-cursive font, it's д.

 Just given the above, it should be clear that going by visual
 representation is NOT enough to disambiguate between these different
 letters.
It works for books.

Unicode invented a problem, and came up with a thoroughly wretched "solution" that we'll be stuck with for generations. One of those bad solutions is having the reader not know what a glyph actually is without pulling back the cover to read the codepoint. It's madness.
 By your argument, since lowercase Cyrillic Т is, visually,
 just m, it should be encoded the same way as lowercase Latin m. But this
 is untenable, because the letterform changes with a different font. So
 you end up with the unworkable idea of a font-dependent encoding.
Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.
 Or, to use an example closer to home, uppercase Latin O and the digit 0
 are visually identical. Should they be encoded as a single code point or
 two?  Worse, in some fonts, the digit 0 is rendered like Ø (to
 differentiate it from uppercase O). Does that mean that it should be
 encoded the same way as the Danish letter Ø?  Obviously not, but
 according to your "visual representation" idea, the answer should be
 yes.
Don't confuse fonts with code points. It'd be adequate if Unicode defined a canonical glyph for each code point, and let the font makers do what they wish.
 The notion of 'case' should not be part of Unicode, as that is
 semantic information that is beyond the scope of Unicode.
But what should "i".toUpper return?
Not relevant to my point that Unicode shouldn't decide what "upper case" for all languages means, any more than Unicode should specify a font. Now when you argue that Unicode should make such decisions, note what a spectacularly hopeless job of it they've done.
Jun 03 2016
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:
 On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:
 It's not a hard concept, except that these different letters have
 lookalike forms with completely unrelated letters. Again:
 
 - Lowercase Latin m looks visually the same as lowercase Cyrillic Т
 in cursive form. In some font renderings the two are IDENTICAL
 glyphs, in spite of being completely different, unrelated letters.
 However, in non-cursive form, Cyrillic lowercase т is visually
 distinct.
 
 - Similarly, lowercase Cyrillic П in cursive font looks like
 lowercase Latin n, and in some fonts they are identical glyphs.
 Again, completely unrelated letters, yet they have the SAME VISUAL
 REPRESENTATION.  However, in non-cursive font, lowercase Cyrillic П
 is п, which is visually distinct from Latin n.
 
 - These aren't the only ones, either.  Other Cyrillic false friends
 include cursive Д, which in some fonts looks like lowercase Latin g.
 But in non-cursive font, it's д.
 
 Just given the above, it should be clear that going by visual
 representation is NOT enough to disambiguate between these different
 letters.
It works for books.
Because books don't allow their readers to change the font.
 Unicode invented a problem, and came up with a thoroughly wretched
 "solution" that we'll be stuck with for generations. One of those bad
 solutions is have the reader not know what a glyph actually is without
 pulling back the cover to read the codepoint. It's madness.
This madness already exists *without* Unicode. If you have a page with a single glyph 'm' printed on it and show it to an English speaker, he will say it's lowercase M. Show it to a Russian speaker, and he will say it's lowercase Т. So which letter is it, M or Т?

The fundamental problem is that writing systems for different languages interpret the same letter forms differently. In English, lowercase g has at least two different forms that we recognize as the same letter. However, to a Cyrillic reader the two forms are distinct, because one of them looks like a Cyrillic letter but the other one looks foreign. So should g be encoded as a single point or two different points? In a similar vein, to a Cyrillic reader the glyphs т and m represent the same letter, but to an English reader they are clearly two different things.

If you're going to represent both languages, you cannot get away from needing to represent letters abstractly, rather than visually.
 By your argument, since lowercase Cyrillic Т is, visually, just m,
 it should be encoded the same way as lowercase Latin m. But this is
 untenable, because the letterform changes with a different font. So
 you end up with the unworkable idea of a font-dependent encoding.
Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.
It's not a bad font. It's standard practice to print Cyrillic cursive letters with different glyphs. Russian readers can read both without any problem. The same letter is represented by different glyphs, and therefore the abstract letter is a more fundamental unit of meaning than the glyph itself.
 Or, to use an example closer to home, uppercase Latin O and the
 digit 0 are visually identical. Should they be encoded as a single
 code point or two?  Worse, in some fonts, the digit 0 is rendered
 like Ø (to differentiate it from uppercase O). Does that mean that
 it should be encoded the same way as the Danish letter Ø?  Obviously
 not, but according to your "visual representation" idea, the answer
 should be yes.
Don't confuse fonts with code points. It'd be adequate if Unicode defined a canonical glyph for each code point, and let the font makers do what they wish.
So should O and 0 share the same glyph or not? They're visually the same thing, even though some fonts render them differently. What should be the canonical shape of O vs. 0? If they are the same shape, then by your argument they must be the same code point, regardless of what font makers do to disambiguate them. Good luck writing a parser that can't tell the difference between an identifier that begins with O and a number literal that begins with 0.

The very fact that we distinguish between O and 0, independently of what Unicode did/does, is already proof enough that going by visual representation is inadequate.
 The notion of 'case' should not be part of Unicode, as that is
 semantic information that is beyond the scope of Unicode.
But what should "i".toUpper return?
Not relevant to my point that Unicode shouldn't decide what "upper case" for all languages means, any more than Unicode should specify a font. Now when you argue that Unicode should make such decisions, note what a spectacularly hopeless job of it they've done.
In other words, toUpper and toLower do not belong in the standard library. Great.
Jun 03 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 11:17 PM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d
wrote:
 It works for books.
Because books don't allow their readers to change the font.
Unicode is not the font.
 This madness already exists *without* Unicode. If you have a page with a
 single glyph 'm' printed on it and show it to an English speaker, he
 will say it's lowercase M. Show it to a Russian speaker, and he will say
 it's lowercase Т.  So which letter is it, M or Т?
It's not a problem that Unicode can solve. As you said, the meaning is in the context. Unicode has no context, and tries to solve something it cannot. ('m' doesn't always mean m in English, either. It depends on the context.) Ya know, if Unicode actually solved these problems, you'd have a case. But it doesn't, and so you don't :-)
 If you're going to represent both languages, you cannot get away from
 needing to represent letters abstractly, rather than visually.
Books do visually just fine!
 So should O and 0 share the same glyph or not? They're visually the same
 thing,
No, they're not. Not even on old typewriters where every key was expensive. Even without the slash, the O tends to be fatter than the 0.
 The very fact that we distinguish between O and 0, independently of what
 Unicode did/does, is already proof enough that going by visual
 representation is inadequate.
Except that you right now are using a font where they are different enough that you have no trouble at all distinguishing them without bothering to look it up. And so am I.
 In other words toUpper and toLower does not belong in the standard
 library. Great.
Unicode and the standard library are two different things.
Jun 04 2016
parent docandrew <x x.com> writes:
On Saturday, 4 June 2016 at 08:12:47 UTC, Walter Bright wrote:
 On 6/3/2016 11:17 PM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via 
 Digitalmars-d wrote:
 It works for books.
Because books don't allow their readers to change the font.
Unicode is not the font.
 This madness already exists *without* Unicode. If you have a 
 page with a
 single glyph 'm' printed on it and show it to an English 
 speaker, he
 will say it's lowercase M. Show it to a Russian speaker, and 
 he will say
 it's lowercase Т.  So which letter is it, M or Т?
It's not a problem that Unicode can solve. As you said, the meaning is in the context. Unicode has no context, and tries to solve something it cannot. ('m' doesn't always mean m in english, either. It depends on the context.) Ya know, if Unicode actually solved these problems, you'd have a case. But it doesn't, and so you don't :-)
 If you're going to represent both languages, you cannot get 
 away from
 needing to represent letters abstractly, rather than visually.
Books do visually just fine!
 So should O and 0 share the same glyph or not? They're 
 visually the same
 thing,
No, they're not. Not even on old typewriters where every key was expensive. Even without the slash, the O tends to be fatter than the 0.
 The very fact that we distinguish between O and 0, 
 independently of what
 Unicode did/does, is already proof enough that going by visual
 representation is inadequate.
Except that you right now are using a font where they are different enough that you have no trouble at all distinguishing them without bothering to look it up. And so am I.
 In other words toUpper and toLower does not belong in the 
 standard
 library. Great.
Unicode and the standard library are two different things.
Even if characters in different languages share a glyph or look identical though, it makes sense to duplicate them with different code points/units/whatever. Simple functions like isCyrillicLetter() can then do a simple less-than / greater-than comparison instead of having a lookup table to check different numeric representations scattered throughout the Unicode table. Functions like toUpper and toLower become easier to write as well (for SOME languages anyhow): it's simply myletter +/- numlettersinalphabet. Redundancy here is very helpful. Maybe instead of Unicode they should have called it Babel... :) "The Lord said, “If as one people speaking the same language they have begun to do this, then nothing they plan to do will be impossible for them. Come, let us go down and confuse their language so they will not understand each other.”" -Jon
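To make that concrete, here is a minimal D sketch of the idea (the helper names are made up for illustration, not Phobos functions). It assumes only the basic Russian block U+0410..U+044F, where each lowercase letter sits exactly 0x20 above its uppercase form; Ё/ё (U+0401/U+0451) and the other Cyrillic extensions fall outside this range.

    // Range check instead of a lookup table (basic Russian block only).
    bool isBasicCyrillicLetter(dchar c)
    {
        return c >= '\u0410' && c <= '\u044F'; // А .. я
    }

    // Case conversion by a fixed offset within the block.
    dchar toUpperBasicCyrillic(dchar c)
    {
        return (c >= '\u0430' && c <= '\u044F') ? cast(dchar)(c - 0x20) : c;
    }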
Jun 05 2016
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Saturday, 4 June 2016 at 03:03:16 UTC, Walter Bright wrote:
 Oh rubbish. Let go of the idea that choosing bad fonts should 
 drive Unicode codepoint decisions.
Interestingly enough, I've mentioned earlier here that only people from the US would believe that documents with mixed languages aren't commonplace. I wasn't expecting to be proven right that fast.
Jun 05 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/5/2016 1:07 AM, deadalnix wrote:
 On Saturday, 4 June 2016 at 03:03:16 UTC, Walter Bright wrote:
 Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode
 codepoint decisions.
Interestingly enough, I've mentioned earlier here that only people from the US would believe that documents with mixed languages aren't commonplace. I wasn't expecting to be proven right that fast.
You'd be in error. I've been casually working on my grandfather's thesis trying to make a web version of it, and it is mixed German, French, and English. I've also made a digital version of an old history book that is mixed English, old English, German, French, Greek, old Greek, and Egyptian hieroglyphs (available on Amazons in your neighborhood!). I've also lived in Germany for 3 years, though that was before computers took over the world.
Jun 05 2016
prev sibling next sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 3 June 2016 at 20:53:32 UTC, H. S. Teoh wrote:

 Even the Greek sigma has two forms depending on whether it's at 
 the end of a word or not -- so should it be two code points or 
 one? If you say two, then you'd have a problem with how to 
 search for sigma in Greek text, and you'd have to search for 
 either medial sigma or final sigma. But if you say one, then 
 you'd have a problem with having two different letterforms for 
 a single codepoint.
In Unicode there are 2 different codepoints for lower case sigma, ς U+03C2 and σ U+03C3, but only one uppercase sigma, Σ U+03A3. Codepoint U+03A2 is undefined. So your objection is not hypothetical, it is actually an issue for uppercase() and lowercase() functions. Another difficulty, besides the dotless and dotted i of Turkic, is the double letters used in Latin transcription of Cyrillic text in eastern and southern Europe: dž, lj, nj and dz, which have an uppercase form (DŽ, LJ, NJ, DZ) and a titlecase form (Dž, Lj, Nj, Dz).
 Besides, that still doesn't solve the problem of what 
 "i".uppercase() should return. In most languages, it should 
 return "I", but in Turkish it should not.  And if we really 
 went the route of encoding Cyrillic letters the same as their 
 Latin lookalikes, we'd have a problem with what "m".uppercase() 
 should return, because now it depends on which font is in 
 effect (if it's a Cyrillic cursive font, the correct answer is 
 "Т", if it's a Latin font, the correct answer is "M" -- the 
 other combinations: who knows).  That sounds far worse than 
 what we have today.
As an anecdote I can tell the story of the accession to the European Union of Romania and Bulgaria in 2007. The issue was that 3 letters used by Romanian and Bulgarian had been forgotten by the Unicode consortium (Ș U+0218, ș U+0219, Ț U+021A, ț U+021B and 2 Cyrillic letters that I do not remember). The Romanians used as a replacement Ş, ş, Ţ and ţ (the cedilla forms U+015E, U+015F, U+0162 and U+0163), which look a little bit alike. When the Commission finally managed to force Microsoft to correct the fonts to include them, we could start to correct the data. The transition was finished in 2012 and was only possible because no other language we deal with uses the "wrong" codepoints (Turkish does, but fortunately we only have a handful of them in our db's). So 5 years of ad hoc processing for the substitution of 4 codepoints. BTW: using combining diacritics was out of the question at the time, simply because Microsoft Word didn't support them and many documents we encountered still only used codepages (one also has to remember that in a big institution like the EC, the IT is always several years behind the open market, which means that when a product is at release X, the Institution still might use release X-5 years).
Jun 04 2016
prev sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
One also has to take into consideration that Unicode is the way 
it is because it was not invented in an empty space. It had to 
take into account what already existed and find compromises allowing 
its adoption. Even if they had invented the perfect encoding, NO 
ONE WOULD HAVE USED IT, as it would have fubar'd everything existing.
As it was invented, it allowed a (relatively smooth) transition. 
Here are some points that made it even possible for Unicode to be 
adopted at all:
- 16 bits: while that choice was a bit shortsighted, 16 bits is a 
good compromise between compactness and richness (the BMP suffices to 
express nearly all living languages).
- Using more or less the same arrangement of codepoints as in the 
different codepages. This made it possible to transform legacy 
documents with simple scripts (as a matter of fact, I wrote a script 
to repair misencoded Greek documents; it consisted mainly of 
unich = ch > 0x80 ? ch + 0x2D0 : ch; see the rough sketch at the 
end of this post).
- UTF-8: this was the stroke of genius, the encoding that allowed 
mixing it all without requiring awful acrobatics (Joakim is 
completely out to lunch on that one; shifting encodings without 
self-synchronisation are hellish, which is why the Chinese and 
Japanese adopted Unicode without hesitation: they had enough 
experience with their legacy encodings).
- Allowing time for the transition.

So all the points that people here criticize were in fact the 
reason why Unicode could even become the standard it is now.
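The Greek repair mentioned above, rendered as a small D sketch (an illustration of the described mapping, not the original script): bytes above 0x80 from an ISO 8859-7 style Greek codepage are shifted by 0x2D0 into the Unicode Greek block (for example 0xC1, capital Alpha, becomes U+0391), while plain ASCII passes through unchanged.

    dstring repairGreek(const(ubyte)[] legacy)
    {
        dstring result;
        foreach (ch; legacy)
            result ~= cast(dchar)(ch > 0x80 ? ch + 0x2D0 : ch); // same mapping as above
        return result;
    }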
Jun 04 2016
prev sibling next sibling parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Friday, 3 June 2016 at 18:43:07 UTC, Walter Bright wrote:
 It's almost as if printed documents and books have never 
 existed!
some old xUSSR books which had some English text sometimes used a Cyrillic font to represent the English. it was awful, and barely readable. this was done to ease the work of compositors, and the result was unacceptable. do you feel a recognizable pattern here? ;-)
Jun 03 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 5:42 PM, ketmar wrote:
 sometimes used Cyrillic font to represent English.
Nobody here suggested using the wrong font, it's completely irrelevant.
Jun 03 2016
parent ketmar <ketmar ketmar.no-ip.org> writes:
On Saturday, 4 June 2016 at 02:46:31 UTC, Walter Bright wrote:
 On 6/3/2016 5:42 PM, ketmar wrote:
 sometimes used Cyrillic font to represent English.
Nobody here suggested using the wrong font, it's completely irrelevant.
you suggested that unicode designers should make similar-looking glyphs share the same code, and it reminds me of this little story. maybe i misunderstood you, though.
Jun 03 2016
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Friday, 3 June 2016 at 18:43:07 UTC, Walter Bright wrote:
 On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:
 Eventually you have no choice but to encode by logical meaning 
 rather
 than by appearance, since there are many lookalikes between 
 different
 languages that actually mean something completely different, 
 and often
 behaves completely differently.
It's almost as if printed documents and books have never existed!
TIL: books are read by computers.
Jun 05 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/5/2016 1:05 AM, deadalnix wrote:
 TIL: books are read by computers.
I should introduce you to a fabulous technology called OCR. :-)
Jun 05 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 3:14 AM, Vladimir Panteleev wrote:
 That's not right either. Cyrillic letters can look slightly different from
their
 latin lookalikes in some circumstances.

 I'm sure there are extremely good reasons for not using the latin lookalikes in
 the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use
 separate codes for the lookalikes. It's not restricted to Unicode.
How did people ever get by with printed books and documents?
Jun 03 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 03.06.2016 20:41, Walter Bright wrote:
 On 6/3/2016 3:14 AM, Vladimir Panteleev wrote:
 That's not right either. Cyrillic letters can look slightly different
 from their
 latin lookalikes in some circumstances.

 I'm sure there are extremely good reasons for not using the latin
 lookalikes in
 the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use
 separate codes for the lookalikes. It's not restricted to Unicode.
How did people ever get by with printed books and documents?
They can disambiguate the letters based on context well enough.
Jun 03 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 11:54 AM, Timon Gehr wrote:
 On 03.06.2016 20:41, Walter Bright wrote:
 How did people ever get by with printed books and documents?
They can disambiguate the letters based on context well enough.
Characters do not have semantic meaning. Their meaning is always inferred from the context. Unicode's troubles started the moment they stepped beyond their charter.
Jun 03 2016
prev sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Friday, 3 June 2016 at 18:41:36 UTC, Walter Bright wrote:
 How did people ever get by with printed books and documents?
Printed books pick one font and one layout, then are read by people. They don't have to be represented in some format where end users can change the font and size etc.
Jun 03 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, June 03, 2016 03:08:43 Walter Bright via Digitalmars-d wrote:
 On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 At the time
 Unicode also had to grapple with tricky issues like what to do with
 lookalike characters that served different purposes or had different
 meanings, e.g., the mu sign in the math block vs. the real letter mu in
 the Greek block, or the Cyrillic A which looks and behaves exactly like
 the Latin A, yet the Cyrillic Р, which looks like the Latin P, does
 *not* mean the same thing (it's the equivalent of R), or the Cyrillic В
 whose lowercase is в not b, and also had a different sound, but
 lowercase Latin b looks very similar to Cyrillic ь, which serves a
 completely different purpose (the uppercase is Ь, not B, you see).
I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs. They should have put me in charge of Unicode. I'd have put a stop to much of the madness :-)
Actually, I would argue that the moment that Unicode is concerned with what the character actually looks like rather than what character it logically is that it's gone outside of its charter. The way that characters actually look is far too dependent on fonts, and aside from display code, code does not care one whit what the character looks like. For instance, take the capital letter I, the lowercase letter l, and the number one. In some fonts that are feeling cruel towards folks who actually want to read them, two of those characters - or even all three of them - look identical. But I think that you'll agree that those characters should be represented as distinct characters in Unicode regardless of what they happen to look like in a particular font. Now, take a cyrllic letter that looks similar to a latin letter. If they're logically equivalent such that no code would ever want to distinguish between the two and such that no font would ever even consider representing them differently, then they're truly the same letter, and they should only have one Unicode representation. But if anyone would ever consider them to be logically distinct, then it makes no sense for them to be considered to be the same character by Unicode, because they don't have the same identity. And that distinction is quite clear if any font would ever consider representing the two characters differently, no matter how slight that difference might be. Really, what a character looks like has nothing to do with Unicode. The exact same Unicode is used regardless of how the text is displayed. Rather, what Unicode is doing is providing logical identifiers for characters so that code can operate on them, and display code can then do whatever it does to display those characters, whether they happen to look similar or not. I would think that the fact that non-display code does not care one whit about what a character looks like and that display code can have drastically different visual representations for the same character would make it clear that Unicode is concerned with having identifiers for logical characters and that that is distinct from any visual representation. - Jonathan M Davis
Jun 03 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 2:10 PM, Jonathan M Davis via Digitalmars-d wrote:
 Actually, I would argue that the moment that Unicode is concerned with what
 the character actually looks like rather than what character it logically is
 that it's gone outside of its charter. The way that characters actually look
 is far too dependent on fonts, and aside from display code, code does not
 care one whit what the character looks like.
What I meant was pretty clear. Font is an artistic style that does not change context nor semantic meaning. If a font choice changes the meaning then it is not a font.
Jun 03 2016
next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Friday, 3 June 2016 at 22:38:38 UTC, Walter Bright wrote:
 If a font choice changes the meaning then it is not a font.
Nah, then it is an Awesome Font that is totally Web Scale! i wish i was making that up http://fontawesome.io/ i hate that thing But, it is kinda legal: gotta love the Unicode private use area!
Jun 03 2016
prev sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, June 03, 2016 15:38:38 Walter Bright via Digitalmars-d wrote:
 On 6/3/2016 2:10 PM, Jonathan M Davis via Digitalmars-d wrote:
 Actually, I would argue that the moment that Unicode is concerned with
 what
 the character actually looks like rather than what character it logically
 is that it's gone outside of its charter. The way that characters
 actually look is far too dependent on fonts, and aside from display code,
 code does not care one whit what the character looks like.
What I meant was pretty clear. Font is an artistic style that does not change context nor semantic meaning. If a font choice changes the meaning then it is not a font.
Well, maybe I misunderstood what was being argued, but it seemed like you've been arguing that two characters should be considered the same just because they look similar, whereas H. S. Teoh is arguing that two characters can be logically distinct while still looking similar and that they should be treated as distinct in Unicode because they're logically distinct. And if that's what's being argued, then I agree with H. S. Teoh. I expect - at least ideally - for Unicode to contain identifiers for characters that are distinct from whatever their visual representation might be. Stuff like fonts then worries about how to display them, and hopefully don't do stupid stuff like make a capital I look like a lowercase l (though they often do, unfortunately). But if two characters in different scripts - be they latin and cyrillic or whatever - happen to often look the same but would be considered two different characters by humans, then I would expect Unicode to consider them to be different, whereas if no one would reasonably consider them to be anything but exactly the same character, then there should only be one character in Unicode. However, if we really have crazy stuff where subtly different visual representations of the letter g are considered to be one character in English and two in Russian, then maybe those should be three different characters in Unicode so that the English text can clearly be operating on g, whereas the Russian text is doing whatever it does with its two characters that happen to look like g. I don't know. That sort of thing just gets ugly. But I definitely think that Unicode characters should be made up of what the logical characters are and leave the visual representation up to the fonts and the like. Now, how to deal with uppercase vs lowercase and all of that sort of stuff is a completely separate issue IMHO, and that comes down to how the characters are somehow logically associated with one another, and it's going to be very locale-specific such that it's not really part of the core of Unicode's charter IMHO (though I'm not sure that it's bad if there's a set of locale rules that go along with Unicode for those looking to correctly apply such rules - they just have nothing to do with code points and graphemes and how they're represented in code). - Jonathan M Davis
Jun 05 2016
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 02-Jun-2016 23:27, Walter Bright wrote:
 On 6/2/2016 12:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns
 always false
 without.
False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.
There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.html I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.
Yeah, Unicode was not meant to be easy, it seems. Or this is what happens with evolutionary design that started with "everything is a 16-bit character". -- Dmitry Olshansky
Jun 03 2016
parent Alix Pexton <alix.pexton gmail.com> writes:
On 03/06/2016 20:12, Dmitry Olshansky wrote:
 On 02-Jun-2016 23:27, Walter Bright wrote:
 I wonder what rationale there is for Unicode to have two different
 sequences of codepoints be treated as the same. It's madness.
Yeah, Unicode was not meant to be easy it seems. Or this is whatever happens with evolutionary design that started with "everything is a 16-bit character".
Typing as someone who has spent some time creating typefaces, having two representations makes sense, and it didn't start with Unicode, it started with movable type. It is much easier for a font designer to create the two codepoint versions of characters for most instances, i.e. make the base letters and the diacritics once. Then what I often do is make single codepoint versions of the ones I'm likely to use, but only if they need more tweaking than the kerning options of the font format allow. I'll omit the history lesson on how this was similar in the case of movable type. Keyboards for different languages mean that a character that is a single keystroke in one case is two together or in sequence in another. This means that Unicode not only represents completed strings, but also those that are mid-composition. The ordering that it uses to ensure that graphemes have a single canonical representation is based on the order that those multi-key characters are entered. I wouldn't call it elegant, but it's not inelegant either. Trying to represent all sufficiently similar glyphs with the same codepoint would lead to a layout problem. How would you order them so that strings of any language can be sorted by their local sorting rules, without having to special case algorithms? Also consider ligatures, such as those for "ff", "fi", "ffi", "fl", "ffl" and many, many more. Typographers create these glyphs whenever available kerning tools do a poor job of combining them from the individual glyphs. From the point of view of meaning they should still be represented as individual codepoints, but for display (electronic or print) that sequence needs to be replaced with the single codepoint for the ligature. I think that in order to understand the decisions of the Unicode committee, one has to consider that they are trying to unify the concerns of representing written information from two sides. One side prioritises storage and manipulation, while the other considers aesthetics and design workflow more important. My experience of using Unicode from both sides gives me a different appreciation for the difficulties of reconciling the two. A... P.S. Then they started adding emojis, and I lost all faith in humanity ;)
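A small D sketch of the two-representation point made above, using std.uni.normalize (assuming current Phobos, where NFC and NFD are the composed and decomposed canonical forms): the precomposed glyph and the base-letter-plus-diacritic sequence are different code point sequences, yet each normalization form maps one onto the other.

    import std.uni : normalize, NFC, NFD;

    void main()
    {
        string composed   = "\u00E9";  // é as a single precomposed code point
        string decomposed = "e\u0301"; // base letter + COMBINING ACUTE ACCENT
        assert(composed != decomposed);                // raw sequences differ
        assert(normalize!NFC(decomposed) == composed); // compose
        assert(normalize!NFD(composed) == decomposed); // decompose
    }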
Jun 04 2016
prev sibling next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 21:05, Andrei Alexandrescu wrote:
 On 06/02/2016 01:54 PM, Marc Schütz wrote:
 On Thursday, 2 June 2016 at 14:28:44 UTC, Andrei Alexandrescu wrote:
 That's not going to work. A false impression created in this thread
 has been that code points are useless
They _are_ useless for almost anything you can do with strings. The only places where they should be used are std.uni and std.regex. Again: What is the justification for using code points, in your opinion? Which practical tasks are made possible (and work _correctly_) if you decode to code points, that don't already work with code units?
Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without. ...
Doesn't work. Shouldn't compile. (char and wchar shouldn't be comparable.) assert("ö".all!(c => c == 'ö')); // fails
 * s.any!(c => c == 'ö') works only with autodecoding. It returns always
 false without.
 ...
Doesn't work. Shouldn't compile. assert("ö".any!(c => c == 'ö")); // fails assert(!"̃ö⃖".any!(c => c== 'ö')); // fails
 * s.balancedParens('〈', '〉') works only with autodecoding.
 ...
Doesn't work, e.g. s="⟨⃖". Shouldn't compile.
 * s.canFind('ö') works only with autodecoding. It returns always false
 without.
 ...
Doesn't work. Shouldn't compile. assert("ö".canFind!(c => c == 'ö")); // fails
 * s.commonPrefix(s1) works only if they both use the same encoding;
 otherwise it still compiles but silently produces an incorrect result.
 ...
Doesn't work. Shouldn't compile.
 * s.count('ö') works only with autodecoding. It returns always zero
 without.
 ....
Doesn't work. Shouldn't compile.
 * s.countUntil(s1) is really odd - without autodecoding, whether it
 works at all, and the result it returns, depends on both encodings.  With
 autodecoding it always works and returns a number independent of the
 encodings.
 ...
Doesn't work. Shouldn't compile.
 * s.endsWith('ö') works only with autodecoding. It returns always false
 without.
 ...
Doesn't work. Shouldn't compile.
 * s.endsWith(s1) works only with autodecoding.
Doesn't work.
 Otherwise it compiles and
 runs but produces incorrect results if s and s1 have different encodings.
...
Shouldn't compile.
 * s.find('ö') works only with autodecoding. It never finds it without.
 ...
Doesn't work. Shouldn't compile.
 * s.findAdjacent is a very interesting one. It works with autodecoding,
 but without it it just does odd things.
 ....
Doesn't work. Shouldn't compile.
 * s.findAmong(s1) is also interesting. It works only with autodecoding.
 ...
Doesn't work. Shouldn't compile.
 * s.findSkip(s1) works only if s and s1 have the same encoding.
 Otherwise it compiles and runs but produces incorrect results.
 ...
Doesn't work. Shouldn't compile.
 * s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) work only
 if s and s1 have the same encoding.
Doesn't work.
 Otherwise they compile and run but produce incorrect results.
 ...
Shouldn't compile.
 * s.minCount, s.maxCount are unlikely to be terribly useful but with
 autodecoding it consistently returns the extremum numeric code unit
 regardless of representation. Without, they just return
 encoding-dependent and meaningless numbers.

 * s.minPos, s.maxPos follow a similar semantics.
 ...
Hardly a point in favour of autodecoding.
 * s.skipOver(s1) only works with autodecoding.
Doesn't work. Shouldn't compile.
 Otherwise it compiles and
 runs but produces incorrect results if s and s1 have different encodings.
 ...
Shouldn't compile.
 * s.startsWith('ö') works only with autodecoding. Otherwise it compiles
 and runs but produces incorrect results if s and s1 have different
 encodings.
 ...
Doesn't work. Shouldn't compile.
 * s.startsWith(s1) works only with autodecoding. Otherwise it compiles
 and runs but produces incorrect results if s and s1 have different
 encodings.
 ...
Doesn't work. Shouldn't compile.
 * s.until!(c => c == 'ö') works only with autodecoding. Otherwise, it
 will span the entire range.
 ...
Doesn't work. Shouldn't compile.
 ===

 The intent of autodecoding was to make std.algorithm work meaningfully
 with strings. As it's easy to see I just went through
 std.algorithm.searching alphabetically and found issues literally with
 every primitive in there. It's an easy exercise to go forth with the
 others.
 ...
Basically all of those still don't work with UTF-32 (assuming your goal is to operate on characters). You need to normalize and possibly iterate on graphemes. Also, many of those functions actually have valid uses intentionally operating on code units. The "shouldn't compile" remarks ideally would be handled at the language level: char/wchar/dchar should be incompatible types and char[], wchar[] and dchar[] should be handled like all arrays.
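To illustrate that point (a sketch relying on current Phobos behaviour, std.algorithm.searching.canFind and std.uni.normalize): even with auto-decoding to code points, a search for 'ö' misses the decomposed spelling, and normalizing first is what actually matches user expectations.

    import std.algorithm.searching : canFind;
    import std.uni : normalize;

    void main()
    {
        string s = "o\u0308";              // "ö" written as 'o' + COMBINING DIAERESIS
        assert(!s.canFind('ö'));           // code-point comparison does not find it
        assert(normalize(s).canFind('ö')); // found after NFC normalization
    }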
Jun 02 2016
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 2 June 2016 at 20:01:54 UTC, Timon Gehr wrote:
 Doesn't work. Shouldn't compile. (char and wchar shouldn't be 
 comparable.)
In Andrei's original post, he says that s is a string variable. He doesn't say it's a char. I find the weirder thing to be that t below is false, per deadalnix's point.

    import std.algorithm : all;
    import std.stdio : writeln;

    void main() {
        string s = "ö";
        auto t = s.all!(c => c == 'ö');
        writeln(t); //prints false
    }

I could imagine getting frustrated that something like the code below throws errors.

    import std.algorithm : all;
    import std.stdio : writeln;

    void main() {
        import std.uni : byGrapheme;
        string s = "ö";
        auto s2 = s.byGrapheme;
        auto t2 = s2.all!(c => c == 'ö');
        writeln(t2);
    }
Jun 02 2016
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:01 PM, Timon Gehr wrote:
 Doesn't work. Shouldn't compile. (char and wchar shouldn't be comparable.)
That would be another language design option, which we don't have the luxury to explore. -- Andrei
Jun 02 2016
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:01 PM, Timon Gehr wrote:
 assert("ö".all!(c => c == 'ö')); // fails
As expected. Different code units for different folks. That's a different matter than walking blindly through code units. -- Andrei
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:01 PM, Timon Gehr wrote:
 Basically all of those still don't work with UTF-32 (assuming your goal
 is to operate on characters).
The goal is to operate on code units. -- Andrei
Jun 02 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:26 PM, Andrei Alexandrescu wrote:
 On 06/02/2016 04:01 PM, Timon Gehr wrote:
 Basically all of those still don't work with UTF-32 (assuming your goal
 is to operate on characters).
The goal is to operate on code units. -- Andrei
s/units/points/
Jun 02 2016
prev sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 10:26 PM, Andrei Alexandrescu wrote:
 The goal is to operate on code units. -- Andrei
You sure you got the right word there? The code unit is the smallest building block. A code point is encoded with one or more code units. Also, if you mean code points, that's where people disagree. Operating on code points by default is seen as not particularly useful.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:33 PM, ag0aep6g wrote:
 Operating on code points by default is seen as not particularly useful.
By whom? The "support level 1" folks yonder at the Unicode standard? :o) -- Andrei
Jun 02 2016
next sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 20:36:12 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 04:33 PM, ag0aep6g wrote:
 Operating on code points by default is seen as not 
 particularly useful.
By whom? The "support level 1" folks yonder at the Unicode standard? :o) -- Andrei
From the standard:
 Level 1 support works well in many circumstances. However, it 
 does not handle more complex languages or extensions to the 
 Unicode Standard very well. Particularly important cases are 
 surrogates, canonical equivalence, word boundaries, grapheme 
 boundaries, and loose matches. (For more information about 
 boundary conditions, see The Unicode Standard, Section 5-15.)

 Level 2 support matches much more what user expectations are 
 for sequences of Unicode characters. It is still locale 
 independent and easily implementable. However, the 
 implementation may be slower when supporting Level 2, and some 
 expressions may require Level 1 matches. Thus it is usually 
 required to have some sort of syntax that will turn Level 2 
 support on and off.
That doesn't sound like much of an endorsement for defaulting to only level 1 support to me - "it does not handle more complex languages or extensions to the Unicode Standard very well".
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:47 PM, tsbockman wrote:
 That doesn't sound like much of an endorsement for defaulting to only
 level 1 support to me - "it does not handle more complex languages or
 extensions to the Unicode Standard very well".
Code point/Level 1 support sounds like a sweet spot between efficiency/complexity and conviviality. Level 2 is opt-in with byGrapheme. -- Andrei
Jun 02 2016
parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 20:49:52 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 04:47 PM, tsbockman wrote:
 That doesn't sound like much of an endorsement for defaulting 
 to only
 level 1 support to me - "it does not handle more complex 
 languages or
 extensions to the Unicode Standard very well".
Code point/Level 1 support sounds like a sweet spot between efficiency/complexity and conviviality. Level 2 is opt-in with byGrapheme. -- Andrei
Actually, according to the document Walter Bright linked level 1 does NOT operate at the code point level:
 Level 1: Basic Unicode Support. At this level, the regular 
 expression engine provides support for Unicode characters as 
 basic 16-bit logical units. (This is independent of the actual 
 serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, or 
 UTF-32.)
 ...
 Level 1 support works well in many circumstances. However, it 
 does not handle more complex languages or extensions to the 
 Unicode Standard very well. Particularly important cases are 
 **surrogates** ...
So, level 1 appears to be UTF-16 code units, not code points. To do code points it would have to recognize surrogates, which are specifically mentioned as not supported. Level 2 skips straight to graphemes, and there is no code point level. However, this document is very old - from Unicode 3.0 and the year 2000:
 While there are no surrogate characters in Unicode 3.0 (outside 
 of private use characters), future versions of Unicode will 
 contain them...
Perhaps level 1 has since been redefined?
Jun 02 2016
parent tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 21:00:17 UTC, tsbockman wrote:
 However, this document is very old - from Unicode 3.0 and the 
 year 2000:

 While there are no surrogate characters in Unicode 3.0 
 (outside of private use characters), future versions of 
 Unicode will contain them...
Perhaps level 1 has since been redefined?
I found the latest (unofficial) draft version: http://www.unicode.org/reports/tr18/tr18-18.html Relevant changes: * Level 1 is to be redefined as working on code points, not code units:
 A fundamental requirement is that Unicode text be interpreted 
 semantically by code point, not code units.
* Level 2 (graphemes) is explicitly described as a "default level":
 This is still a default level—independent of country or 
 language—but provides much better support for end-user 
 expectations than the raw level 1...
* All mention of level 2 being slow has been removed. The only reason given for making it toggle-able is for compatibility with level 1 algorithms:
 Level 2 support matches much more what user expectations are 
 for sequences of Unicode characters. It is still 
 locale-independent and easily implementable. However, for 
 compatibility with Level 1, it is useful to have some sort of 
 syntax that will turn Level 2 support on and off.
Jun 02 2016
prev sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:
 By whom? The "support level 1" folks yonder at the Unicode standard? :o)
 -- Andrei
Do they say that level 1 should be the default, and do they give a rationale for that? Would you kindly link or quote that?
Jun 02 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:52 PM, ag0aep6g wrote:
 On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:
 By whom? The "support level 1" folks yonder at the Unicode standard? :o)
 -- Andrei
Do they say that level 1 should be the default, and do they give a rationale for that? Would you kindly link or quote that?
No, but that sounds agreeable to me, especially since it breaks no code of ours. We really should document this better. Kudos to Walter for finding all that Level 1 support. Andrei
Jun 02 2016
prev sibling parent reply default0 <Kevin.Labschek gmx.de> writes:
On Thursday, 2 June 2016 at 20:52:29 UTC, ag0aep6g wrote:
 On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:
 By whom? The "support level 1" folks yonder at the Unicode 
 standard? :o)
 -- Andrei
Do they say that level 1 should be the default, and do they give a rationale for that? Would you kindly link or quote that?
The level 2 support description noted that it should be opt-in because it's slow. Arguably it should be easier to operate on code units if you know it's safe to do so, but either always working on code units or always working on graphemes as the default seems to be either too broken too often or too slow too often. Now one can argue either consistency for code units (because then we can treat char[] and friends as a slice) or correctness for graphemes, but really the more I think about it the more I think there is no good default and you need to learn unicode anyways. The only sad parts here are that 1) we hijacked an array type for strings, which sucks, and 2) that we don't have an API that is actually good at teaching the user what it does and doesn't do. The consequence of 1 is that generic code that also wants to deal with strings will want to special-case to get rid of auto-decoding, the consequence of 2 is that we will have tons of not-actually-correct string handling. I would assume that almost all string handling code that is out in the wild is broken anyways (in code I have encountered I have never seen attempts to normalize or do other things before or after comparisons, searching, etc), unless of course, YOU or one of your colleagues wrote it (consider that checking the length of characters is often done and wrong, because .Length is the number of UTF-16 code units in those languages) :o) So really, as bad and alarming as "incorrect string handling" by default seems, in practice it has not prevented people from writing working (internationalized!) applications in other languages that get used way more than D. One could say we should do it better than them, but I would be inclined to believe that RCStr provides our opportunity to do so. Having char[] be what it is is an annoying wart, and maybe at some point we can deprecate/remove that behaviour, but for now I'd rather see if RCStr is viable than attempt to change the semantics of all string handling code in D.
Jun 02 2016
parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 21:07:19 UTC, default0 wrote:
 The level 2 support description noted that it should be opt-in 
 because its slow.
1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default. 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have). 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then.
Jun 02 2016
parent reply default0 <Kevin.Labschek gmx.de> writes:
On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:
 On Thursday, 2 June 2016 at 21:07:19 UTC, default0 wrote:
 The level 2 support description noted that it should be opt-in 
 because its slow.
1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default. 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have). 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then.
1) Right, because a special toggleable syntax is definitely not "opt-in". 2) Several people in this thread noted that working on graphemes is way slower (which makes sense, because it's yet another processing step you need to do after you decode - therefore more work - therefore slower) than working on code points. 3) Not an argument - doing more work makes code slower. The only thing that changes is what specific operations have what cost (for instance, memory access has a much higher cost now than it had then). Considering the way the process works and judging from what others in this thread have said about it, I will stick with "always decoding to graphemes for all operations is very slow" and indulge in being too lazy to write benchmarks for it to show just how bad it is.
Jun 02 2016
parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 21:38:02 UTC, default0 wrote:
 On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:
 1) It does not say that level 2 should be opt-in; it says that 
 level 2 should be toggle-able. Nowhere does it say which of 
 level 1 and 2 should be the default.

 2) It says that working with graphemes is slower than UTF-16 
 code UNITS (level 1), but says nothing about streaming 
 decoding of code POINTS (what we have).

 3) That document is from 2000, and its claims about 
 performance are surely extremely out-dated, anyway. Computers 
 and the Unicode standard have both changed much since then.
1) Right because a special toggleable syntax is definitely not "opt-in".
It is not "opt-in" unless it is toggled off by default. The only reason it doesn't talk about toggling in the level 1 section, is because that section is written with the assumption that many programs will *only* support level 1.
 2) Several people in this thread noted that working on 
 graphemes is way slower (which makes sense, because its yet 
 another processing you need to do after you decoded - therefore 
 more work - therefore slower) than working on code points.
And working on code points is way slower than working on code units (the actual level 1).
 3) Not an argument - doing more work makes code slower.
What do you think I'm arguing for? It's not graphemes-by-default. What I actually want to see: permanently deprecate the auto-decoding range primitives. Force the user to explicitly specify whichever of `by!dchar`, `byCodePoint`, or `byGrapheme` their specific algorithm actually needs. Removing the implicit conversions between `char`, `wchar`, and `dchar` would also be nice, but isn't really necessary I think. That would be a standards-compliant solution (one of several possible). What we have now is non-standard, at least going by the old version Walter linked.
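For reference, a minimal sketch of what that explicit choice looks like with today's Phobos names (byCodeUnit and byDchar from std.utf, byGrapheme from std.uni); the `by!dchar` spelling above is part of the proposal, not an existing function.

    import std.uni : byGrapheme;
    import std.utf : byCodeUnit, byDchar;

    void main()
    {
        string s = "ö and more";
        auto units  = s.byCodeUnit; // range of char: UTF-8 code units
        auto points = s.byDchar;    // range of dchar: code points
        auto graphs = s.byGrapheme; // range of Grapheme: user-perceived characters
    }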
Jun 02 2016
parent reply default0 <Kevin.Labschek gmx.de> writes:
On Thursday, 2 June 2016 at 21:51:51 UTC, tsbockman wrote:
 On Thursday, 2 June 2016 at 21:38:02 UTC, default0 wrote:
 On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:
 1) It does not say that level 2 should be opt-in; it says 
 that level 2 should be toggle-able. Nowhere does it say which 
 of level 1 and 2 should be the default.

 2) It says that working with graphemes is slower than UTF-16 
 code UNITS (level 1), but says nothing about streaming 
 decoding of code POINTS (what we have).

 3) That document is from 2000, and its claims about 
 performance are surely extremely out-dated, anyway. Computers 
 and the Unicode standard have both changed much since then.
1) Right because a special toggleable syntax is definitely not "opt-in".
It is not "opt-in" unless it is toggled off by default. The only reason it doesn't talk about toggling in the level 1 section, is because that section is written with the assumption that many programs will *only* support level 1.
*sigh* reading comprehension. Needing to write .byGrapheme or similar to enable the behaviour qualifies for what that description was arguing for. I hope you understand that now that I am repeating this for you.
 2) Several people in this thread noted that working on 
 graphemes is way slower (which makes sense, because its yet 
 another processing you need to do after you decoded - 
 therefore more work - therefore slower) than working on code 
 points.
And working on code points is way slower than working on code units (the actual level 1).
Never claimed the opposite. Do note however that it's specifically talking about UTF-16 code units.
 3) Not an argument - doing more work makes code slower.
What do you think I'm arguing for? It's not graphemes-by-default.
Unrelated. I was refuting the point you made about the relevance of the performance claims of the unicode level 2 support description, not evaluating your hypothetical design. Please do not take what I say out of context, thank you.
Jun 02 2016
parent tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 22:03:01 UTC, default0 wrote:
 *sigh* reading comprehension.
 ...
 Please do not take what I say out of context, thank you.
Earlier you said:
 The level 2 support description noted that it should be opt-in 
 because its slow.
My main point is simply that you mischaracterized what the standard says. Making level 1 opt-in, rather than level 2, would be just as compliant as the reverse. The standard makes no suggestion as to which should be default.
Jun 02 2016
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:
 * s.all!(c => c == 'ö') works only with autodecoding. It returns always false
 without.
The o is inferred as a wchar. The lambda then is inferred to return a wchar. The algorithm can check that the input is char[], and is being tested against a wchar. Therefore, the algorithm can specialize to do the decoding itself. No autodecoding necessary, and it does the right thing.
Jun 02 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 22:07, Walter Bright wrote:
 On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:
 * s.all!(c => c == 'ö') works only with autodecoding. It returns
 always false
 without.
The o is inferred as a wchar. The lamda then is inferred to return a wchar.
No, the lambda returns a bool.
 The algorithm can check that the input is char[], and is being
 tested against a wchar. Therefore, the algorithm can specialize to do
 the decoding itself.

 No autodecoding necessary, and it does the right thing.
It still would not be the right thing. The lambda shouldn't compile. It is not meaningful to compare utf-8 and utf-16 code units directly.
Jun 02 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:12 PM, Timon Gehr wrote:
 It is not meaningful to compare utf-8 and utf-16 code units directly.
But it is meaningful to compare Unicode code points. -- Andrei
Jun 02 2016
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 22:28, Andrei Alexandrescu wrote:
 On 06/02/2016 04:12 PM, Timon Gehr wrote:
 It is not meaningful to compare utf-8 and utf-16 code units directly.
But it is meaningful to compare Unicode code points. -- Andrei
It is also meaningful to compare two utf-8 code units or two utf-16 code units.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:50 PM, Timon Gehr wrote:
 On 02.06.2016 22:28, Andrei Alexandrescu wrote:
 On 06/02/2016 04:12 PM, Timon Gehr wrote:
 It is not meaningful to compare utf-8 and utf-16 code units directly.
But it is meaningful to compare Unicode code points. -- Andrei
It is also meaningful to compare two utf-8 code units or two utf-16 code units.
By decoding them of course. -- Andrei
Jun 02 2016
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 22:51, Andrei Alexandrescu wrote:
 On 06/02/2016 04:50 PM, Timon Gehr wrote:
 On 02.06.2016 22:28, Andrei Alexandrescu wrote:
 On 06/02/2016 04:12 PM, Timon Gehr wrote:
 It is not meaningful to compare utf-8 and utf-16 code units directly.
But it is meaningful to compare Unicode code points. -- Andrei
It is also meaningful to compare two utf-8 code units or two utf-16 code units.
By decoding them of course. -- Andrei
That makes no sense, I cannot decode single code units. BTW, I guess the reason why char converts to wchar converts to dchar is that the lower half of code units in char and the lower half of code units in wchar are code points. Maybe code units and code points with low numerical values should have distinct types.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:23 PM, Timon Gehr wrote:
 On 02.06.2016 22:51, Andrei Alexandrescu wrote:
 On 06/02/2016 04:50 PM, Timon Gehr wrote:
 On 02.06.2016 22:28, Andrei Alexandrescu wrote:
 On 06/02/2016 04:12 PM, Timon Gehr wrote:
 It is not meaningful to compare utf-8 and utf-16 code units directly.
But it is meaningful to compare Unicode code points. -- Andrei
It is also meaningful to compare two utf-8 code units or two utf-16 code units.
By decoding them of course. -- Andrei
That makes no sense, I cannot decode single code units. BTW, I guess the reason why char converts to wchar converts to dchar is that the lower half of code units in char and the lower half of code units in wchar are code points. Maybe code units and code points with low numerical values should have distinct types.
Then you lost me. (I'm sure you're making a good point.) -- Andrei
Jun 02 2016
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 23:29, Andrei Alexandrescu wrote:
 On 6/2/16 5:23 PM, Timon Gehr wrote:
 On 02.06.2016 22:51, Andrei Alexandrescu wrote:
 On 06/02/2016 04:50 PM, Timon Gehr wrote:
 On 02.06.2016 22:28, Andrei Alexandrescu wrote:
 On 06/02/2016 04:12 PM, Timon Gehr wrote:
 It is not meaningful to compare utf-8 and utf-16 code units directly.
But it is meaningful to compare Unicode code points. -- Andrei
It is also meaningful to compare two utf-8 code units or two utf-16 code units.
By decoding them of course. -- Andrei
That makes no sense, I cannot decode single code units. BTW, I guess the reason why char converts to wchar converts to dchar is that the lower half of code units in char and the lower half of code units in wchar are code points. Maybe code units and code points with low numerical values should have distinct types.
Then you lost me. (I'm sure you're making a good point.) -- Andrei
Basically:

    bool bad(char c, dchar d){ return c==d; }  // ideally shouldn't compile
    bool good(char c, char d){ return c==d; }  // should compile
Jun 02 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 1:12 PM, Timon Gehr wrote:
 On 02.06.2016 22:07, Walter Bright wrote:
 On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:
 * s.all!(c => c == 'ö') works only with autodecoding. It returns
 always false
 without.
The o is inferred as a wchar. The lamda then is inferred to return a wchar.
No, the lambda returns a bool.
Thanks for the correction.
 The algorithm can check that the input is char[], and is being
 tested against a wchar. Therefore, the algorithm can specialize to do
 the decoding itself.

 No autodecoding necessary, and it does the right thing.
It still would not be the right thing. The lambda shouldn't compile. It is not meaningful to compare utf-8 and utf-16 code units directly.
Yes, you have a good point. But we do allow things like:

    byte b;
    if (b == 10000) ...
Jun 02 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 23:56, Walter Bright wrote:
 On 6/2/2016 1:12 PM, Timon Gehr wrote:
 ...
 It is not
 meaningful to compare utf-8 and utf-16 code units directly.
Yes, you have a good point. But we do allow things like: byte b; if (b == 10000) ...
Well, this is a somewhat different case, because 10000 is just not representable as a byte. Every value that fits in a byte fits in an int though. It's different for code units. They are incompatible both ways. E.g. dchar obviously does not fit in a char, and while the lower half of char is compatible with dchar, the upper half is specific to the encoding. dchar cannot represent upper half char code units. You get the code points with the corresponding values instead. E.g.:

    void main(){
        import std.stdio, std.utf;
        foreach(dchar d; "ö".byCodeUnit)
            writeln(d); // "Ã", "¶"
    }
Jun 02 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 3:11 PM, Timon Gehr wrote:
 Well, this is a somewhat different case, because 10000 is just not
representable
 as a byte. Every value that fits in a byte fits in an int though.

 It's different for code units. They are incompatible both ways.
Not exactly. (c == 'ö') is always false for the same reason that (b == 10000) is always false. I'm not sure what the right answer is here.
Jun 02 2016
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 03.06.2016 00:26, Walter Bright wrote:
 On 6/2/2016 3:11 PM, Timon Gehr wrote:
 Well, this is a somewhat different case, because 10000 is just not
 representable
 as a byte. Every value that fits in a byte fits in an int though.

 It's different for code units. They are incompatible both ways.
Not exactly. (c == 'ö') is always false for the same reason that (b == 1000) is always false. ...
Yes. And _additionally_, some other concerns apply that are not there for byte vs. int. I.e. if b == 10000 is disallowed, then c == d should be disallowed too, but b == 10000 can be allowed even if c == d is disallowed.
 I'm not sure what the right answer is here.
char to dchar is a lossy conversion, so it shouldn't happen. byte to int is a lossless conversion, so there is no problem a priori.
Jun 02 2016
prev sibling parent Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 2 June 2016 at 21:56:10 UTC, Walter Bright wrote:
 Yes, you have a good point. But we do allow things like:

    byte b;
    if (b == 10000) ...
Why allowing char/wchar/dchar comparisons is wrong:

    void main()
    {
        string s = "Привет";
        foreach (c; s)
            assert(c != 'Ñ');
    }

The assert fails: c iterates over UTF-8 code units (char), and the lead byte 0xD1 of several of those Cyrillic letters compares equal to 'Ñ' (U+00D1) after promotion, even though the text contains no Ñ. From my post from 2014: http://forum.dlang.org/post/knrwiqxhlvqwxqshyqpy forum.dlang.org
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:07 PM, Walter Bright wrote:
 On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:
 * s.all!(c => c == 'ö') works only with autodecoding. It returns
 always false
 without.
The o is inferred as a wchar. The lambda then is inferred to return a wchar.
The lambda returns bool. -- Andrei
Jun 02 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:
 The lambda returns bool. -- Andrei
Yes, I was wrong about that. But the point still stands with:
 * s.balancedParens('〈', '〉') works only with autodecoding.
 * s.canFind('ö') works only with autodecoding. It returns always false
without.
Can be made to work without autodecoding.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 05:58 PM, Walter Bright wrote:
 On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:
 The lambda returns bool. -- Andrei
Yes, I was wrong about that. But the point still stands with: > * s.balancedParens('〈', '〉') works only with autodecoding. > * s.canFind('ö') works only with autodecoding. It returns always false without. Can be made to work without autodecoding.
By special casing? Perhaps. I seem to recall though that one major issue with autodecoding was that it special-cases certain algorithms. So you'd need to go through all of std.algorithm and make sure you can special-case your way out of situations that work today. Andrei
Jun 02 2016
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 03.06.2016 00:23, Andrei Alexandrescu wrote:
 On 06/02/2016 05:58 PM, Walter Bright wrote:
 On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:
 The lambda returns bool. -- Andrei
Yes, I was wrong about that. But the point still stands with: > * s.balancedParens('〈', '〉') works only with autodecoding. > * s.canFind('ö') works only with autodecoding. It returns always false without. Can be made to work without autodecoding.
By special casing? Perhaps. I seem to recall though that one major issue with autodecoding was that it special-cases certain algorithms.
The major issue is that it special-cases when different, more natural semantics are available.
Jun 02 2016
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 3:23 PM, Andrei Alexandrescu wrote:
 On 06/02/2016 05:58 PM, Walter Bright wrote:
  > * s.balancedParens('〈', '〉') works only with autodecoding.
  > * s.canFind('ö') works only with autodecoding. It returns always
 false without.

 Can be made to work without autodecoding.
By special casing? Perhaps.
The argument to canFind() can be detected as not being a char, then decoded into a sequence of char's, then forwarded to a substring search.
 I seem to recall though that one major issue with
 autodecoding was that it special-cases certain algorithms. So you'd need to go
 through all of std.algorithm and make sure you can special-case your way out of
 situations that work today.
That's right. A side effect of that is that the algorithms will go even faster! So it's good. (Searching for a substring of code units is faster than decoding the input stream.)
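A rough sketch of that specialization (the helper name is made up; only std.utf.encode, std.string.representation and a plain substring search are assumed):

import std.algorithm.searching : canFind;
import std.string : representation;
import std.utf : encode;

// Encode the dchar needle once, then search raw code units - no per-element decoding.
bool canFindNoDecode(const(char)[] haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle); // UTF-8 encode the needle
    return haystack.representation.canFind(buf[0 .. len].representation);
}

unittest
{
    assert(canFindNoDecode("blöd", 'ö'));
    assert(!canFindNoDecode("blod", 'ö'));
}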
Jun 02 2016
parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, June 02, 2016 15:48:03 Walter Bright via Digitalmars-d wrote:
 On 6/2/2016 3:23 PM, Andrei Alexandrescu wrote:
 On 06/02/2016 05:58 PM, Walter Bright wrote:
  > * s.balancedParens('〈', '〉') works only with autodecoding.
  > * s.canFind('ö') works only with autodecoding. It returns always

 false without.

 Can be made to work without autodecoding.
By special casing? Perhaps.
The argument to canFind() can be detected as not being a char, then decoded into a sequence of char's, then forwarded to a substring search.
How do you suggest that we handle the normalization issue? Should we just assume NFC like std.uni.normalize does and provide an optional template argument to indicate a different normalization (like normalize does)? Since without providing a way to deal with the normalization, we're not actually making the code fully correct, just faster. - Jonathan M Davis
Jun 02 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote:
 How do you suggest that we handle the normalization issue?
Started a new thread for that one.
Jun 02 2016
prev sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, June 02, 2016 18:23:19 Andrei Alexandrescu via Digitalmars-d 
wrote:
 On 06/02/2016 05:58 PM, Walter Bright wrote:
 On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:
 The lambda returns bool. -- Andrei
Yes, I was wrong about that. But the point still stands with: > * s.balancedParens('〈', '〉') works only with autodecoding. > * s.canFind('ö') works only with autodecoding. It returns always false without. Can be made to work without autodecoding.
By special casing? Perhaps. I seem to recall though that one major issue with autodecoding was that it special-cases certain algorithms. So you'd need to go through all of std.algorithm and make sure you can special-case your way out of situations that work today.
Yeah, I believe that you do have to do some special casing, though it would be special casing on ranges of code units in general and not strings specifically, and a lot of those functions are already special cased on string in an attempt be efficient. In particular, with a function like find or canFind, you'd take the needle and encode it to match the haystack it was passed so that you can do the comparisons via code units. So, you incur the encoding cost once when encoding the needle rather than incurring the decoding cost of each code point or grapheme as you iterate over the haystack. So, you end up with something that's correct and efficient. It's also much friendlier to code that only operates on ASCII. The one issue that I'm not quite sure how we'd handle in that case is normalization (which auto-decoding doesn't handle either), since you'd need to normalize the needle to match the haystack (which also assumes that the haystack was already normalized). Certainly, it's the sort of thing that makes it so that you kind of wish you were dealing with a string type that had the normalization built into it rather than either an array of code units or an arbitrary range of code units. But maybe we could assume the NFC normalization like std.uni.normalize does and provide an optional template argument for the normalization scheme. In any case, while it's not entirely straightforward, it is quite possible to write some algorithms in a way which works on arbitrary ranges of code units and deals with Unicode correctly without auto-decoding or requiring that the user convert it to a range of code points or graphemes in order to properly handle the full range of Unicode. And even if we keep auto-decoding, we pretty much need to fix it so that std.algorithm and friends are Unicode-aware in this manner so that ranges of code units work in general without requiring that you use byGrapheme. So, this sort of thing could have a large impact on RCStr, even if we keep auto-decoding for narrow strings. Other algorithms, however, can't be made to work automatically with Unicode - at least not with the current range paradigm. filter, for instance, really needs to operate on graphemes to filter on characters, but with a range of code units, that would mean operating on groups of code units as a single element, which you can't do with something like a range of char, since that essentially becomes a range of ranges. It has to be wrapped in a range that's going to provide graphemes - and of course, if you know that you're operating only on ASCII, then you wouldn't want to deal with graphemes anyway, so automatically converting to graphemes would be undesirable. So, for a function like filter, it really does have to be up to the programmer to indicate what level of Unicode they want to operate at. But if we don't make functions Unicode-aware where possible, then we're going to take a performance hit by essentially forcing everyone to use explicit ranges of code points or graphemes even when they should be unnecessary. So, I think that we're stuck with some level of special casing, but it would then be for ranges of code units and code points and not strings. So, it would work efficiently for stuff like RCStr, which the current scheme does not. 
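To sketch the needle-normalization idea (a sketch only, assuming the haystack is already NFC and using nothing beyond std.uni and std.utf.byCodeUnit):

import std.algorithm.searching : canFind;
import std.uni;
import std.utf : byCodeUnit;

// Normalize the needle to the form the haystack is assumed to use (NFC here),
// then compare plain code units - no autodecoding involved.
bool canFindNormalized(string haystack, string needle)
{
    return haystack.byCodeUnit.canFind(normalize!NFC(needle).byCodeUnit);
}

unittest
{
    assert(canFindNormalized("blöd", "o\u0308")); // decomposed needle is still found
}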
I think that the reality of the matter is that regardless of whether we keep auto-decoding for narrow strings in place, we need to make Phobos operate on arbitrary ranges of code units and code points, since even stuff like RCStr won't work efficiently otherwise, and stuff like byCodeUnit won't be usable in as many cases otherwise, because if a generic function isn't Unicode-aware, then in many cases, byCodeUnit will be very wrong, just like byCodePoint would be wrong. So, as far as Phobos goes, I'm not sure that the question of auto-decoding matters much for what we need to do at this point. If we do what we need to do, then Phobos will work whether we have auto-decoding or not (working in a Unicode-aware manner where possible and forcing the user to decide the correct level of Unicode to work at where not), and then it just becomes a question of whether we can or should deprecate auto-decoding once all that's done. - Jonathan M Davis
Jun 02 2016
prev sibling next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 2 Jun 2016 15:05:44 -0400
schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:

 On 06/02/2016 01:54 PM, Marc Schütz wrote:
 Which practical tasks are made possible (and work _correctly_) if you
 decode to code points, that don't already work with code units?
Pretty much everything. s.all!(c => c == 'ö')
Andrei, your ignorance is really starting to grind on everyone's nerves. If after 350 posts you still don't see why this is incorrect: s.any!(c => c == 'o'), you must be actively skipping the informational content of this thread. You are in error, no one agrees with you, and you refuse to see it, and in the end we have to assume you will make a decisive vote against any PR with the intent to remove auto-decoding from Phobos. Your so called vocal minority is actually D's panel of Unicode experts who understand that auto-decoding is a false ally and should be on the deprecation track. Remember final-by-default? You promised that your objection about breaking code means that D2 will only continue to be fixed in a backwards compatible way, be it the implementation of shared or whatever else. Yet months later you opened a thread with the title "inout must go". So that must have been an appeasement back then. People don't forget these things easily and RCStr seems to be a similar distraction, considering we haven't looked into borrowing/scoped enough and you promise wonders from it. -- Marco
Jun 02 2016
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 3:10 PM, Marco Leise wrote:
 we haven't looked into borrowing/scoped enough
That's my fault. As for scoped, the idea is to make scope work analogously to DIP25's 'return ref'. I don't believe we need borrowing, we've worked out another solution that will work for ref counting. Please do not reply to this in this thread - start a new one if you wish to continue with this topic.
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 06:10 PM, Marco Leise wrote:
 Am Thu, 2 Jun 2016 15:05:44 -0400
 schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:

 On 06/02/2016 01:54 PM, Marc Schütz wrote:
 Which practical tasks are made possible (and work _correctly_) if you
 decode to code points, that don't already work with code units?
Pretty much everything. s.all!(c => c == 'ö')
 Andrei, your ignorance is really starting to grind on everyone's nerves.
Indeed there seem to be serious questions about my competence, basic comprehension, and now knowledge. I understand it is tempting to assume that a disagreement is caused by the other simply not understanding the matter. Even if that were true it's not worth sacrificing civility over it.
 If after 350 posts you still don't see
 why this is incorrect: s.any!(c => c == 'o'), you must be
 actively skipping the informational content of this thread.
Is it 'o' with an umlaut or without? At any rate, consider s of type string and x of type dchar. The dchar type is defined as "a Unicode code point", or at least my understanding that has been a reasonable definition to operate with in the D language ever since its first release. Also in the D language, the various string types char[], wchar[] etc. with their respective qualified versions are meant to hold Unicode strings with one of the UTF8, UTF16, and UTF32 encodings. Following these definitions, it stands to reason to infer that the call s.find(c => c == x) means "find the code point x in string s and return the balance of s positioned there". It's prima facie application of the definitions of the entities involved. Is this the only possible or recommended meaning? Most likely not, viz. the subtle cases in which a given grapheme is represented via either one or multiple code points by means of combining characters. Is it the best possible meaning? It's even difficult to define what "best" means (fastest, covering most languages, etc). I'm not claiming that meaning is the only possible, the only recommended, or the best possible. All I'm arguing is that it's not retarded, and within a certain universe confined to operating at code point level (which is reasonable per the definitions of the types involved) it can be considered correct. If at any point in the reasoning above some rampant ignorance comes about, please point it out.
 You are in error, no one agrees with you, and you refuse to see
 it and in the end we have to assume you will make a decisive
 vote against any PR with the intent to remove auto-decoding
 from Phobos.
This seems to assume I have some vesting in the position that makes it independent of facts. That is not the case. I do what I think is right to do, and you do what you think is right to do.
 Your so called vocal minority is actually D's panel of Unicode
 experts who understand that auto-decoding is a false ally and
 should be on the deprecation track.
They have failed to convince me. But I am more convinced than before that RCStr should not offer a default mode of iteration. I think its impact is lost in this discussion, because once it's understood RCStr will become D's recommended string type, the entire matter becomes moot.
 Remember final-by-default? You promised, that your objection
 about breaking code means that D2 will only continue to be
 fixed in a backwards compatible way, be it the implementation
 of shared or whatever else. Yet months later you opened a
 thread with the title "inout must go". So that must have been
 an appeasement back then. People don't forget these things
 easily and RCStr seems to be a similar distraction,
 considering we haven't looked into borrowing/scoped enough and
 you promise wonders from it.
What the hell is this, digging dirt on me? Paying back debts? Please stop that crap. Andrei
Jun 02 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 2 Jun 2016 18:54:21 -0400
schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:

 On 06/02/2016 06:10 PM, Marco Leise wrote:
 Am Thu, 2 Jun 2016 15:05:44 -0400
 schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:
 On 06/02/2016 01:54 PM, Marc Schütz wrote:
 Which practical tasks are made possible (and work _correctly_) if you
 decode to code points, that don't already work with code units?
 Pretty much everything. s.all!(c => c == 'ö')
 Andrei, your ignorance is really starting to grind on everyone's nerves.
Indeed there seem to be serious questions about my competence, basic comprehension, and now knowledge.
That's not my general impression, but something is different with this thread.
 I understand it is tempting to assume that a disagreement is caused by
 the other simply not understanding the matter. Even if that were true
 it's not worth sacrificing civility over it.
Civility has had us caught in a 36-page-long, tiresome debate with us mostly talking past each other. I was being impolite and can't say I regret it, because I prefer this answer over the rest of the thread. It's more informed, elaborate and conclusive.
 If after 350 posts you still don't see
 why this is incorrect: s.any!(c => c == 'o'), you must be
 actively skipping the informational content of this thread.
Is it 'o' with an umlaut or without? At any rate, consider s of type string and x of type dchar. The dchar type is defined as "a Unicode code point", or at least my understanding that has been a reasonable definition to operate with in the D language ever since its first release. Also in the D language, the various string types char[], wchar[] etc. with their respective qualified versions are meant to hold Unicode strings with one of the UTF8, UTF16, and UTF32 encodings. Following these definitions, it stands to reason to infer that the call s.find(c => c == x) means "find the code point x in string s and return the balance of s positioned there". It's prima facie application of the definitions of the entities involved.

Is this the only possible or recommended meaning? Most likely not, viz. the subtle cases in which a given grapheme is represented via either one or multiple code points by means of combining characters. Is it the best possible meaning? It's even difficult to define what "best" means (fastest, covering most languages, etc).

I'm not claiming that meaning is the only possible, the only recommended, or the best possible. All I'm arguing is that it's not retarded, and within a certain universe confined to operating at code point level (which is reasonable per the definitions of the types involved) it can be considered correct.

If at any point in the reasoning above some rampant ignorance comes about, please point it out.
No, it's pretty close now. We can all agree that there is no "best" way, only different use cases. Just defining Phobos to work on code points gives the illusion that it does the correct thing in all use cases - after all, D claims to support Unicode. But if you want to iterate over visual letters it is incorrect, and it is otherwise slow when you work on ASCII-structured formats (JSON, XML, paths, Warp, ...). Then there is explaining the different default iteration schemes when using foreach vs. range API (no big deal, just not easily justified) and the cost of implementation when dealing with char[]/wchar[]. From this observation we concluded that decoding should be opt-in and that when we need it, it should be a conscious decision. Unicode is quite complex, and learning about the difference between code points and grapheme clusters when segmenting strings will benefit code quality. As for the question, do multi-code-point graphemes ever appear in the wild? OS X is known to use NFD on its native file system and there is a hint on Wikipedia that some symbols from Thai or Hindi's Devanagari need them: https://en.wikipedia.org/wiki/UTF-8#Disadvantages Some form of Lithuanian seems to have a use for them, too: http://www.unicode.org/L2/L2012/12026r-n4191-lithuanian.pdf Aside from those there is nothing generally wrong about decomposed letters appearing in strings, even though the use of NFC is encouraged.
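A small sketch of that NFD point (only std.uni, std.range and std.algorithm are assumed; the word is made up):

import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.uni;

void main()
{
    string composed   = normalize!NFC("blöd"); // 'ö' as one code point
    string decomposed = normalize!NFD("blöd"); // 'o' + combining diaeresis

    // Code-point-level search only sees the composed form...
    assert(composed.canFind('ö'));
    assert(!decomposed.canFind('ö'));

    // ...while grapheme-level iteration treats both as four letters.
    assert(composed.byGrapheme.walkLength == 4);
    assert(decomposed.byGrapheme.walkLength == 4);
}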
 […harsh tone removed…] in the end we have to assume you
 will make a decisive vote against any PR with the intent
 to remove auto-decoding from Phobos.
This seems to assume I have some vesting in the position that makes it independent of facts. That is not the case. I do what I think is right to do, and you do what you think is right to do.
Your vote outweighs that of many others for better or worse. When a decision needs to be made and the community is divided, we need you or Walter or anyone who is invested in the matter to cast a ruling vote. However, when several dozen people support an idea after discussion, having heard everyone's arguments with practically no objections, and you overrule everyone, tensions build up. I welcome the idea to delegate some of the tasks to smaller groups. No single person is knowledgeable in every area of CS, and both a bus factor of 1 and too big a group can hinder decision making. It would help to know for the future whether you understand your role as one with veto powers, or whether you could see yourself giving up responsibility for some decisions to the community, and if so under what conditions.
 Your so called vocal minority is actually D's panel of Unicode
 experts who understand that auto-decoding is a false ally and
 should be on the deprecation track.
They have failed to convince me. But I am more convinced than before that RCStr should not offer a default mode of iteration. I think its impact is lost in this discussion, because once it's understood RCStr will become D's recommended string type, the entire matter becomes moot.
 Remember final-by-default? You promised, that your objection
 about breaking code means that D2 will only continue to be
 fixed in a backwards compatible way, be it the implementation
 of shared or whatever else. Yet months later you opened a
 thread with the title "inout must go". So that must have been
 an appeasement back then. People don't forget these things
 easily and RCStr seems to be a similar distraction,
 considering we haven't looked into borrowing/scoped enough and
 you promise wonders from it.
What the hell is this, digging dirt on me? Paying back debts? Please stop that crap.
No, that was my actual impression. I must apologize for generalizing it to other people though. I welcome the RCStr project and hope it will be good. At this time though it is not yet fleshed out and we can't tell how fast its adoption will be. Remember that DIPs on scope and RC have tended in the past to go into long debates with unclear outcomes. Unlike this thread, which may be the first in D's forum history with such a high agreement across the board.
 Andrei
-- Marco
Jun 03 2016
prev sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, June 02, 2016 15:05:44 Andrei Alexandrescu via Digitalmars-d 
wrote:
 The intent of autodecoding was to make std.algorithm work meaningfully
 with strings. As it's easy to see I just went through
 std.algorithm.searching alphabetically and found issues literally with
 every primitive in there. It's an easy exercise to go forth with the others.
It comes down to the question of whether it's better to fail quickly when Unicode is handled incorrectly so that it's obvious that you're doing it wrong, or whether it's better for it to work in a large number of cases so that for a lot of code it "just works" but is still wrong in the general case, and it's a lot less obvious that that's the case, so many folks won't realize that they need to do more in order to have their string handling be Unicode-correct. With code units - especially UTF-8 - it becomes obvious very quickly that treating each element of the string/range as a character is wrong. With code points, you have to work far harder to find examples that are incorrect. So, it's not at all obvious (especially to the lay programmer) that the Unicode handling is incorrect and that their code is wrong - but their code will end up working a large percentage of the time in spite of it being wrong in the general case. So, yes, it's trivial to show how operating on ranges of code units as if they were characters gives incorrect results far more easily than operating on ranges of code points does. But operating on code points as if they were characters is still going to give incorrect results in the general case. Regardless of auto-decoding, the answer is that the programmer needs to understand the Unicode issues and use ranges of code units or code points where appropriate and use ranges of graphemes where appropriate. It's just that if we default to handling code points, then a lot of code will be written which treats those as characters, and it will provide the correct result more often than it would if it treated code units as characters. In any case, I've probably posted too much in this thread already. It's clear that the first step to solving this problem is to improve Phobos so that it handles ranges of code units, code points, and graphemes correctly whether auto-decoding is involved or not, and only then can we consider the possibility of removing auto-decoding (and even then, the answer may still be that we're stuck, because we consider the resulting code breakage to be too great). But whether Phobos retains auto-decoding or not, the Unicode handling stuff in general is the same, and what we need to do to improve the situation is the same. So, clearly, I need to do a much better job of finding time to work on D so that I can create some PRs to help the situation. Unfortunately, it's far easier to find a few minutes here and there while waiting on other stuff to shoot off a post or two in the newsgroup than it is to find time to substantively work on code. :| - Jonathan M Davis
Jun 03 2016
prev sibling next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu 
wrote:
 Look at reddit and hackernews, too - admittedly other 
 self-selected communities. Language debates often spring about. 
 How often is the point being made that D is wanting because of 
 its string support? Nada.
I've been lurking on this thread for a while and was convinced by the arguments that autodecoding should go. Nevertheless, I think this is really the strongest argument you've made against using the community's resources to fix it now. If your position from the beginning were this clear, then I think the thread might not have gone on so long. As someone trained in economics, I get convinced by arguments about scarce resources. It makes more sense to focus on higher value issues. However, the case against autodecoding is clearly popular. At a minimum, it has resulted in a significant amount of time dedicated to forum discussion and has made you metaphorically angry at Walter. Resources spent grumbling about it could be better spent elsewhere. One way to deal with the problem of scarce resources is by reducing the cost of whatever action you want to take. For instance, Adam Ruppe just put up a good post in the Dealing with Autodecode thread https://forum.dlang.org/post/ksasfwpuvpwxjfniupiv forum.dlang.org noting that a compiler switch could easily be added to phobos. Combined with a long deprecation timeline, the cost that it would impose on D users who are not active forum members and might want to complain about the issue would be relatively small. Another problem related to scarce resources is that there is a division of labor in the community. People like yourself and Walter have fewer substitutes for your labor. It makes sense that the top contributors should be focusing on higher value issues where fewer people have the ability to contribute. I don't dispute that. However, there seem to be a number of people who can contribute on this issue and want to contribute. Scarcity of resources seems to be less of an issue here. Finally, when you discussed things people complain about with D, you mentioned tooling. In the time I've been following this forum, I haven't seen a single thread focusing on this issue. I don't mean a few comments like "oh D should improve its tooling." I mean a thread dedicated to D's tooling strengths and weaknesses with a goal of creating a plan on what to do to improve things.
 Currently dfix is weak because it doesn't do lookup. So we need 
 to make the front end into a library. Daniel said he wants to 
 be on it, but he has two jobs to worry about so he's short on 
 time. There's only so many hours in the day, and I think the 
 right focus is on attacking the matters above.
On a somewhat tangential basis, I was reading about Microsoft's Roslyn a week or so ago. They do something similar where they have a compiler API. I don't have a very good sense of how it works from their overview, but it seems to be an interesting approach.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 10:14 AM, jmh530 wrote:
 However, the case against autodecoding is clearly popular. At a minimum,
 it has resulted in a significant amount of time dedicated to forum
 discussion and has made you metaphorically angry at Walter. Resources
 spent grumbling about it could be better spent elsewhere.
Yah, this is a bummer and one of the larger issues of our community: there's too much talking about doing things and too little doing things. On one hand I want to empower people (as I said at DConf: please get me fired!), and on the other I need to prevent silly things from happening. The quality of some of the code that gets into Phobos when I look the other way is sadly sub-par. Cumulatively that has reduced its quality over time. That (improving the time * talent torque) is the real solution to Phobos' technical debt, of which autodecoding is negligible.
 One way to deal with the problem of scarce resources is by reducing the
 cost of whatever action you want to take. For instance, Adam Ruppe just
 put up a good post in the Dealing with Autodecode thread
 https://forum.dlang.org/post/ksasfwpuvpwxjfniupiv forum.dlang.org
 noting that a compiler switch could easily be added to phobos. Combined
 with a long deprecation timeline, the cost that it would impose on D
 users who are not active forum members and might want to complain about
 the issue would be relatively small.
This is a very costly solution to a very small problem. I'm here to prevent silly things like this from happening and to bring back perspective. We've had huge issues with language changes that were much more important and brought much less breakage. The fact that people talk about 132 breakages in Phobos with a straight face is a good sign that the heat of the debate has taken perspective away. I'm sure it will come back in a few weeks. Just need to keep the dam until then. The real ticket out of this is RCStr. It solves a major problem in the language (compulsive GC) and also a minor occasional annoyance (autodecoding). This is what I need to work on, instead of writing long messages to put back sense into people. Many don't realize that the only reason current strings ever work in safe code is because of the GC. char[] is too little encapsulation, so it needs GC as a crutch to be safe. That's the problem with D's strings, not autodecoding. That's why we need to change things. That's what keeps me awake at night. Andrei
Jun 02 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 2 June 2016 at 15:02:13 UTC, Andrei Alexandrescu 
wrote:
 Yah, this is a bummer and one of the larger issues of our 
 community: there's too much talking about doing things and too 
 little doing things.
We wrote a PR to implement the first step in the autodecode deprecation cycle. Granted, it wasn't ready to merge, but you just closed it with a flippant "not gonna happen" despite the *unanimous* agreement that the status quo sucks, and now complain that there's too much talking and too little doing! When we do something, you just shut it down then blame us. What's even the point of trying anymore?
Jun 02 2016
next sibling parent deadalnix <deadalnix gmail.com> writes:
On Thursday, 2 June 2016 at 15:38:46 UTC, Adam D. Ruppe wrote:
 On Thursday, 2 June 2016 at 15:02:13 UTC, Andrei Alexandrescu 
 wrote:
 Yah, this is a bummer and one of the larger issues of our 
 community: there's too much talking about doing things and too 
 little doing things.
We wrote a PR to implement the first step in the autodecode deprecation cycle. Granted, it wasn't ready to merge, but you just closed it with a flippant "not gonna happen" despite the *unanimous* agreement that the status quo sucks, and now complain that there's too much talking and too little doing! When we do something, you just shut it down then blame us. What's even the point of trying anymore?
https://www.youtube.com/watch?v=MJiBjfvltQw
Jun 02 2016
prev sibling next sibling parent reply Kagamin <spam here.lot> writes:
On Thursday, 2 June 2016 at 15:38:46 UTC, Adam D. Ruppe wrote:
 We wrote a PR to implement the first step in the autodecode 
 deprecation cycle.
It outright deprecated popFront - that's not the first step in the migration.
Jun 02 2016
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 2 June 2016 at 15:50:54 UTC, Kagamin wrote:
 It outright deprecated popFront - that's not the first step in 
 the migration.
Which gave us the list of places inside Phobos to fix, only about two hours of work, and proved that the version() method was viable (and REALLY easy to implement).
Jun 02 2016
next sibling parent reply Kagamin <spam here.lot> writes:
On Thursday, 2 June 2016 at 16:02:18 UTC, Adam D. Ruppe wrote:
 Which gave us the list of places inside Phobos to fix, only 
 about two hours of work, and proved that the version() method 
 was viable (and REALLY easy to implement).
Yes, it was a research PR that was never meant to be an implementation of the first step. You used the wrong wording, which just unnecessarily freaked Andrei out.
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 12:45 PM, Kagamin wrote:
 On Thursday, 2 June 2016 at 16:02:18 UTC, Adam D. Ruppe wrote:
 Which gave us the list of places inside Phobos to fix, only about two
 hours of work, and proved that the version() method was viable (and
 REALLY easy to implement).
Yes, it was a research PR that was never meant to be an implementation of the first step. You used wrong wording that just unnecessarily freaked Andrei out.
I closed it because it wasn't an actual implementation, in full understanding that the discussion in it could continue. -- Andrei
Jun 02 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 9:02 AM, Adam D. Ruppe wrote:
 Which gave us the list of places inside Phobos to fix, only about two hours of
 work, and proved that the version() method was viable (and REALLY easy to
 implement).
Nothing prevents anyone from doing that on their own (it's trivial) in order to find Phobos problems, and pick one or three to fix.
Jun 02 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 8:50 AM, Kagamin wrote:
 It outright deprecated popFront - that's not the first step in the migration.
That's right. It's going about things backwards. The first step is to adjust Phobos implementations and documentation so they do not rely on autodecoding. This will take some time and care, particularly with algorithms that support mixed codeunit argument types. (Or perhaps mixed codeunit argument types can be deprecated.) This is not so simple, as they have to be dealt with one by one.
Jun 02 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 2 June 2016 at 20:32:39 UTC, Walter Bright wrote:
 The first step is to adjust Phobos implementations and 
 documentation so they do not rely on autodecoding.
The compiler can help you with that. That's the point of the do not merge PR: it got an actionable list out of the compiler and proved the way forward was viable.
Jun 02 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 1:46 PM, Adam D. Ruppe wrote:
 The compiler can help you with that. That's the point of the do not merge PR:
it
 got an actionable list out of the compiler and proved the way forward was
viable.
What is supposed to be done with "do not merge" PRs other than close them?
Jun 02 2016
next sibling parent Jack Stouffer <jack jackstouffer.com> writes:
On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:
 What is supposed to be done with "do not merge" PRs other than 
 close them?
Experimentally iterate until something workable comes about. This way it's done publicly and people can collaborate.
Jun 02 2016
prev sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:
 What is supposed to be done with "do not merge" PRs other than 
 close them?
Occasionally people need to try something on the auto tester (not sure if that's relevant to that particular PR, though). Presumably if someone marks their own PR as "do not merge", it means they're planning to either close it themselves after it has served its purpose, or they plan to fix/finish it and then remove the "do not merge" label. Either way, they shouldn't be closed just because they say "do not merge" (unless they're abandoned or something, obviously).
Jun 02 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:05 PM, tsbockman wrote:
 On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:
 What is supposed to be done with "do not merge" PRs other than close
 them?
Occasionally people need to try something on the auto tester (not sure if that's relevant to that particular PR, though). Presumably if someone marks their own PR as "do not merge", it means they're planning to either close it themselves after it has served its purpose, or they plan to fix/finish it and then remove the "do not merge" label.
Feel free to reopen if it helps, it wasn't closed in anger. -- Andrei
Jun 02 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 2:05 PM, tsbockman wrote:
 On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:
 What is supposed to be done with "do not merge" PRs other than close them?
Occasionally people need to try something on the auto tester (not sure if that's relevant to that particular PR, though).
I've done that, but that doesn't apply here.
 Presumably if someone marks their own
 PR as "do not merge", it means they're planning to either close it themselves
 after it has served its purpose, or they plan to fix/finish it and then remove
 the "do not merge" label.
That doesn't seem to apply here, either.
 Either way, they shouldn't be closed just because they say "do not merge"
 (unless they're abandoned or something, obviously).
Something like that could not be merged until 132 other PRs are done to fix Phobos. It doesn't belong as a PR.
Jun 02 2016
parent tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 22:20:49 UTC, Walter Bright wrote:
 On 6/2/2016 2:05 PM, tsbockman wrote:
 Presumably if someone marks their own
 PR as "do not merge", it means they're planning to either 
 close it themselves
 after it has served its purpose, or they plan to fix/finish it 
 and then remove
 the "do not merge" label.
That doesn't seem to apply here, either.
 Either way, they shouldn't be closed just because they say "do 
 not merge"
 (unless they're abandoned or something, obviously).
Something like that could not be merged until 132 other PRs are done to fix Phobos. It doesn't belong as a PR.
I was just responding to the general question you posed about "do not merge" PRs, not really arguing for that one, in particular, to be re-opened. I'm sure wilzbach is willing to explain if anyone cares to ask him why he did it as a PR, though.
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 11:38 AM, Adam D. Ruppe wrote:
 On Thursday, 2 June 2016 at 15:02:13 UTC, Andrei Alexandrescu wrote:
 Yah, this is a bummer and one of the larger issues of our
 community: there's too much talking about doing things and too
 little doing things.
We wrote a PR to implement the first step in the autodecode deprecation cycle. Granted, it wasn't ready to merge, but you just closed it with a flippant "not gonna happen" despite the *unanimous* agreement that the status quo sucks, and now complain that there's too much talking and too little doing!
You mean https://github.com/dlang/phobos/pull/4384, the one with "[do not merge]" in the title? Would you realistically have advised me to merge it? I spent time writing what I thought was a reasonable and reasonably long answer. Allow me to quote it below:
  wilzbach thanks for running this experiment.

 Andrei is wrong.
Definitely wouldn't be the first time and not the last.
 We can all see it, and maybe if we demonstrate that a migration
 path is possible, even actually pretty easy following a simple
 deprecation path, maybe he can see it too.
I'm not sure who "all" is but that's beside the point. Taking a step back, we'd take in a change that breaks Phobos in 132 places only if it was a major language overhaul bringing dramatic improvements to the quality of life for D programmers. An artifact as earth shattering as ranges, or an ownership system that was massively simple and beneficial. For comparison, the recent changes in name lookup broke Phobos in fewer places (I don't have an exact number, but I think they were at most a couple dozen.) Those changes were closing an enormous hole in the language and mark a huge step forward. I'd be really hard pressed to characterize the elimination of autodecoding as enough of an improvement to warrant this kind of breakage. (I do realize there's a difference between breakage and deprecation, but for many folks the distinction is academic.) The better end game here is to improve efficiency of code that uses autodecoding (e.g. per the recent `find()` work), and to make sure `RCStr` is the right design. A string that manages its own memory _and_ does the right things with regard to Unicode is the ticket. Let's focus future efforts on that.
Could you please point me at the parts you found flippant in it, or merely unreasonable?
 When we do something, you just shut it down then blame us. What's
 even the point of trying anymore?
At some point I need to stick with what I think is the better course for D, even if that means disagreeing with you. But I hope you understand this is not "flippant" or teasing people then shutting down their good work. Andrei
Jun 02 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 2 June 2016 at 16:12:01 UTC, Andrei Alexandrescu 
wrote:
 Would you realistically have advised me to merge it?
Not at this time, no, but I also wouldn't advise you to close it and tell us to stop trying if you were actually open to a chance. You closed that and posted this at about the same time: http://forum.dlang.org/post/nii497$2p79$1 digitalmars.com "I'm not going to debate this further" "What's important is that autodecoding is here to stay - there's no realistic way to eliminate it from D." So, what do you seriously expect us to think? We had a migration plan and enough excitement to start working on the code, then within about 15 minutes of each other, you close the study PR and post that the discussion is over and your mistake is here to stay.
 I'm not sure who "all" is but that's beside the point.
This sentence makes me pretty mad too. This topic has come up many times and nobody, NOBODY, with the exception of yourself agrees with the current behavior anymore. It is a very frequently asked question among new users, and we have no real justification because there is no technical merit to it.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 02:36 PM, Adam D. Ruppe wrote:
 We had a migration plan and enough excitement to start working on the code
I don't think the plan is realistic. How can I tell you this without you getting mad at me? Apparently the only way to go is do as you say. -- Andrei
Jun 02 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 2 June 2016 at 18:43:54 UTC, Andrei Alexandrescu 
wrote:
 I don't think the plan is realistic. How can I tell you this 
 without you getting mad at me?
You get out of the way and let the community get to work. Actually delegate, let people take ownership of problems, success and failure alike. If we fail then, at least it will be from our own experience instead of from executive meddling.
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 03:13 PM, Adam D. Ruppe wrote:
 On Thursday, 2 June 2016 at 18:43:54 UTC, Andrei Alexandrescu wrote:
 I don't think the plan is realistic. How can I tell you this without
 you getting mad at me?
You get out of the way and let the community get to work. Actually delegate, let people take ownership of problems, success and failure alike.
That's a good point. We plan to do more of that in the future.
 If we fail then, at least it will be from our own experience instead of
 from executive meddling.
This applies to high-risk work that is also of commensurately extraordinary value. My assessment is this is not it. If you were in my position you'd also do what you think is the best thing to do, and nobody should feel offended by that. Andrei
Jun 02 2016
prev sibling next sibling parent reply Kagamin <spam here.lot> writes:
On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu 
wrote:
 This is what's happening here. We worked ourselves to a foam 
 because the creator of the language started a thread entitled 
 "The Case Against Autodecode", whilst fully understanding there 
 is no way to actually eliminate autodecode.
Autodecode doesn't need to be removed from phobos completely, it only needs to be more bearable, like it is in the foreach statement. E.g. byDchar will stay, initial idea is to actually put it to more intensive usage in phobos and user code, no need to remove it.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 10:53 AM, Kagamin wrote:
 On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu wrote:
 This is what's happening here. We worked ourselves to a foam because
 the creator of the language started a thread entitled "The Case
 Against Autodecode", whilst fully understanding there is no way to
 actually eliminate autodecode.
Autodecode doesn't need to be removed from phobos completely, it only needs to be more bearable, like it is in the foreach statement. E.g. byDchar will stay, initial idea is to actually put it to more intensive usage in phobos and user code, no need to remove it.
Yah, and then such code will work with RCStr. -- Andrei
Jun 02 2016
parent reply Kagamin <spam here.lot> writes:
On Thursday, 2 June 2016 at 15:06:20 UTC, Andrei Alexandrescu 
wrote:
 Autodecode doesn't need to be removed from phobos completely, 
 it only
 needs to be more bearable, like it is in the foreach 
 statement. E.g.
 byDchar will stay, initial idea is to actually put it to more 
 intensive
 usage in phobos and user code, no need to remove it.
Yah, and then such code will work with RCStr. -- Andrei
Yes, do consider Walter's proposal, it will be an enabling technology for RCStr too: the more phobos works with string-like ranges the more it is usable for RCStr.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 12:14 PM, Kagamin wrote:
 On Thursday, 2 June 2016 at 15:06:20 UTC, Andrei Alexandrescu wrote:
 Autodecode doesn't need to be removed from phobos completely, it only
 needs to be more bearable, like it is in the foreach statement. E.g.
 byDchar will stay, initial idea is to actually put it to more intensive
 usage in phobos and user code, no need to remove it.
Yah, and then such code will work with RCStr. -- Andrei
Yes, do consider Walter's proposal, it will be an enabling technology for RCStr too: the more phobos works with string-like ranges the more it is usable for RCStr.
Walter and I have a unified view on this. Although I'd need to raise the issue that the primitive should be by!dchar, not byDchar. -- Andrei
Jun 02 2016
parent ZombineDev <petar.p.kirov gmail.com> writes:
On Thursday, 2 June 2016 at 16:21:33 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 12:14 PM, Kagamin wrote:
 On Thursday, 2 June 2016 at 15:06:20 UTC, Andrei Alexandrescu 
 wrote:
 Autodecode doesn't need to be removed from phobos 
 completely, it only
 needs to be more bearable, like it is in the foreach 
 statement. E.g.
 byDchar will stay, initial idea is to actually put it to 
 more intensive
 usage in phobos and user code, no need to remove it.
Yah, and then such code will work with RCStr. -- Andrei
Yes, do consider Walter's proposal, it will be an enabling technology for RCStr too: the more phobos works with string-like ranges the more it is usable for RCStr.
Walter and I have a unified view on this. Although I'd need to raise the issue that the primitive should be by!dchar, not byDchar. -- Andrei
The primitive is byUTF!dchar:
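A minimal usage sketch (byUTF lives in std.utf and transcodes lazily, without allocating):

import std.algorithm.searching : canFind;
import std.utf : byUTF;

void main()
{
    string s = "blöd";
    // Lazily decode UTF-8 code units to dchars on the fly.
    assert(s.byUTF!dchar.canFind('ö'));
    // The same primitive can transcode to other code unit widths, too.
    assert(s.byUTF!wchar.canFind(cast(wchar) 'ö'));
}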
Jun 02 2016
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jun 02, 2016 at 09:06:44AM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
[...]
 ZombineDev, I've been at the top level in the C++ community for many
 many years, even after I wanted to exit :o). I'm familiar with how the
 committee that steers C++ works, perspective that is unique in our
 community - even Walter lacks it. I see trends and patterns. It is
 interesting how easily a small but very influential priesthood can
 alienate itself from the needs of the larger community and get into a
 frenzy over matters that are simply missing the point.
Appeal to authority.
 This is what's happening here. We worked ourselves to a foam because
 the creator of the language started a thread entitled "The Case
 Against Autodecode", whilst fully understanding there is no way to
 actually eliminate autodecode.
I think that's a misrepresentation of the situation. I was getting increasingly unhappy with autodecoding myself, completely independently of Walter, and in fact have filed bugs and posted complaints about it long before Walter started his thread. I used to be a supporter of autodecoding, but over time it has become increasingly clear to me that it was a mistake. The fact that you continue to deny this and write it off in the face of similar complaints raised by many active D users is very off-putting, to say the least, and does not inspire confidence. Not to mention the fact that you started this thread yourself with a question about what it is we dislike about autodecoding, yet after having received a multitude of complaints, corroborated by many forum members, you simply write off the whole thing like it was nothing. If you want D to succeed, you need to raise the morale of the community, and this is not the way to raise morale.
 The very definition of a useless debate, the kind he and I had agreed
 to not initiate anymore. It was a mistake. I'm still metaphorically
 angry at him for it.
On the contrary, I found that Walter's willingness to admit past mistakes very refreshing, even if practically speaking we can't actually get rid of autodecoding today. What he proposed in the other thread is actually a workable step towards reversing the wrong decision behind autodecoding, that doesn't leave existing users out in the cold, and that we might actually be able to pull off if done carefully. I know you probably won't see it the same way, since you still seem convinced that autodecoding was a good idea, but you need to understand that your opinion is not representative in this case. [...]
 Meanwhile, I go to conferences. Train and consult at large companies.
 Dozens every year, cumulatively thousands of people. I talk about D
 and ask people what it would take for them to use the language.
 Invariably I hear a surprisingly small number of reasons:
 
 * The garbage collector eliminates probably 60% of potential users
 right off.
At least we have begun to do something about this. That's good news.
 * Tooling is immature and of poorer quality compared to the
 competition.
And what have we done about it? How long has it been since dfix existed, yet we still haven't really integrated it into the dmd toolchain?
 * Safety has holes and bugs.
And what have we done about it?
 * Hiring people who know D is a problem.
There are many willing candidates right here. :-P
 * Documentation and tutorials are weak.
And what have we done about this?
 * There's no web services framework (by this time many folks know of
 D, but of those a shockingly small fraction has even heard of vibe.d).
 I have strongly argued with Sönke to bundle vibe.d with dmd over one
 year ago, and also in this forum. There wasn't enough interest.
What about linking to it in a prominent place on dlang.org? This isn't a big problem, AFAICT. I don't think it takes months and years to put up a big prominent banner promoting vibe.d on, say, the download page of dlang.org.
 * (On Windows) if it doesn't have a compelling Visual Studio plugin,
 it doesn't exist.
And what have we done about this? One of the things that I have found a little disappointing with D is that while it has many very promising features, it lacks polish in many small details. Such as the way features interact with each other in corner cases. E.g., the whole can't-use-gc from dtor debacle, the semantics of closures over aggregate members, holes in safe, holes in const/immutable in unions, the whole import mess that took oh-how-many-years to clean up that thankfully was finally improved recently, can't use nogc with Phobos, can't use const/pure/etc. in Object.toString, Object.opEqual, et al (which we've been trying to get of since how many years ago now?), and a whole long list of small irritations that in themselves are nothing, but together add up like a dustball to an overall perception of lack of polish. I'm more sympathetic to Walter's stance of improving the language for *current* users, instead of bending over backwards to please would-be adopters who may never actually adopt the language -- they'd just come back with new excuses of why they can't adopt D yet. If you make existing users happier, they will do all the work of evangelism for you, instead of you having to fight the uphill battle by yourself while bleeding away current users due to poor morale. T -- Why ask rhetorical questions? -- JC
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 10:48 AM, H. S. Teoh via Digitalmars-d wrote:
 On Thu, Jun 02, 2016 at 09:06:44AM -0400, Andrei Alexandrescu via
Digitalmars-d wrote:
 [...]
 ZombineDev, I've been at the top level in the C++ community for many
 many years, even after I wanted to exit :o). I'm familiar with how the
 committee that steers C++ works, perspective that is unique in our
 community - even Walter lacks it. I see trends and patterns. It is
 interesting how easily a small but very influential priesthood can
 alienate itself from the needs of the larger community and get into a
 frenzy over matters that are simply missing the point.
Appeal to authority.
You cut the context, which was rampant speculation.
 This is what's happening here. We worked ourselves to a foam because
 the creator of the language started a thread entitled "The Case
 Against Autodecode", whilst fully understanding there is no way to
 actually eliminate autodecode.
I think that's a misrepresentation of the situation. I was getting increasingly unhappy with autodecoding myself, completely independently of Walter, and in fact have filed bugs and posted complaints about it long before Walter started his thread. I used to be a supporter of autodecoding, but over time it has become increasingly clear to me that it was a mistake. The fact that you continue to deny this and write it off in the face of similar complaints raised by many active D users is very off-putting, to say the least, and does not inspire confidence. Not to mention the fact that you started this thread yourself with a question about what it is we dislike about autodecoding, yet after having received a multitude of complaints, corrobated by many forum members, you simply write off the whole thing like it was nothing. If you want D to succeed, you need to raise the morale of the community, and this is not the way to raise morale.
There is no denying. If I did things all over again, autodecoding would not be in. But also string would not be immutable(char)[] which is the real mistake. Some of the arguments in here have been good, but many (probably the majority) of them were not so much. A good one didn't even come up, Walter told it to me over the phone: the reality of invalid UTF strings forces you to mind the representation more often than you'd want in an ideal world. There is no "writing off". Again, the real solution here is RCStr. We can't continue with immutable(char)[] as our flagship string. Autodecoding is the least of its problems.
 The very definition of a useless debate, the kind he and I had agreed
 to not initiate anymore. It was a mistake. I'm still metaphorically
 angry at him for it.
On the contrary, I found Walter's willingness to admit past mistakes very refreshing, even if practically speaking we can't actually get rid of autodecoding today. What he proposed in the other thread is actually a workable step towards reversing the wrong decision behind autodecoding, one that doesn't leave existing users out in the cold, and that we might actually be able to pull off if done carefully. I know you probably won't see it the same way, since you still seem convinced that autodecoding was a good idea, but you need to understand that your opinion is not representative in this case.
I don't see it the same way. Yes, I agree my opinion is not representative. I'd also say I'm glad I can do something about this.
 [...]
 Meanwhile, I go to conferences. Train and consult at large companies.
 Dozens every year, cumulatively thousands of people. I talk about D
 and ask people what it would take for them to use the language.
 Invariably I hear a surprisingly small number of reasons:

 * The garbage collector eliminates probably 60% of potential users
 right off.
At least we have begun to do something about this. That's good news.
I've been working on RCStr for the past few days. I'd get a lot more work done if I didn't need to talk sense into people in this thread.
 * Tooling is immature and of poorer quality compared to the
 competition.
And what have we done about it? How long has it been since dfix existed, yet we still haven't really integrated it into the dmd toolchain?
I've spoken to Brian about it. Dfix does not do lookup, which sadly makes it unsuitable for meaningful use.
 * Safety has holes and bugs.
And what have we done about it?
Walter and I are working on safe RC.
 * Hiring people who know D is a problem.
There are many willing candidates right here. :-P
Nice.
 * Documentation and tutorials are weak.
And what have we done about this?
http://tour.dlang.org is a good start.
 * There's no web services framework (by this time many folks know of
 D, but of those a shockingly small fraction has even heard of vibe.d).
 I have strongly argued with Sönke to bundle vibe.d with dmd over one
 year ago, and also in this forum. There wasn't enough interest.
What about linking to it in a prominent place on dlang.org? This isn't a big problem, AFAICT. I don't think it takes months and years to put up a big prominent banner promoting vibe.d on, say, the download page of dlang.org.
PR please. I can't babysit everything. I'm preparing for a conference where I'll evangelize for D next week (http://ndcoslo.com/speaker/andrei-alexandrescu/). As I mentioned at DConf, for better or worse this is the kind of stuff I cannot delegate. That kind of work is where the community would really make an impact, not a large debate that I need to worry will lead to some silly rash decision.
 * (On Windows) if it doesn't have a compelling Visual Studio plugin,
 it doesn't exist.
And what have we done about this?
I'm actively looking for a collaboration.
 One of the things that I have found a little disappointing with D is
 that while it has many very promising features, it lacks polish in many
 small details. Such as the way features interact with each other in
 corner cases. E.g., the whole can't-use-gc from dtor debacle, the
 semantics of closures over aggregate members, holes in @safe, holes in
 const/immutable in unions, the whole import mess that took
 oh-how-many-years to clean up that thankfully was finally improved
 recently, can't use @nogc with Phobos, can't use const/pure/etc. in
 Object.toString, Object.opEquals, et al. (which we've been trying to get
 rid of since how many years ago now?), and a whole long list of small
 irritations that in themselves are nothing, but together add up like a
 dustball to an overall perception of lack of polish.
It's a fair perspective. Those annoy me as well. I'll also note that every language has such matters, including the mainstream ones. At some point we need to acknowledge they're there but they're small enough to live with. (Some of those you enumerated aren't small, e.g. the holes in @safe.)
 I'm more sympathetic to Walter's stance of improving the language for
 *current* users, instead of bending over backwards to please would-be
 adopters who may never actually adopt the language -- they'd just come
 back with new excuses of why they can't adopt D yet. If you make
 existing users happier, they will do all the work of evangelism for you,
 instead of you having to fight the uphill battle by yourself while
 bleeding away current users due to poor morale.
We want to improve the language for current AND future users. RCStr is part of that. Andrei
Jun 02 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, June 02, 2016 09:06:44 Andrei Alexandrescu via Digitalmars-d 
wrote:
 Meanwhile, I go to conferences. Train and consult at large companies.
 Dozens every year, cumulatively thousands of people. I talk about D and
 ask people what it would take for them to use the language. Invariably I
 hear a surprisingly small number of reasons:
Are folks going to not start using D because of auto-decoding? No, because they won't know anything about it. Many of them don't even know anything about ranges. But it _will_ result in a WTF moment for pretty much everyone. It happens all the time and results in plenty of questions on D.Learn and stackoverflow, because no one expects it, and it causes them problems. Can we sanely remove auto-decoding from Phobos? I don't know. It's entrenched enough that doing so without breaking code is going to be very difficult. But at minimum, we need to mitigate its effects, and I'm sure that we're going to be sorry in the long run if we don't figure out how to actually excise it. It's already a major wart that causes frequent problems, and it's the sort of thing that's going to make a number of folks unhappy with D in the long run, even if you can convince them to switch to it now while auto-decoding is still in place. Will it make them unhappy enough to switch away from D? Probably not. But it is going to be a constant pain point of the sort that folks frequently complain about with C++ - only this is one that we'll have, and C++ won't. - Jonathan M Davis
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 11:58 AM, Jonathan M Davis via Digitalmars-d wrote:
 On Thursday, June 02, 2016 09:06:44 Andrei Alexandrescu via Digitalmars-d
 wrote:
 Meanwhile, I go to conferences. Train and consult at large companies.
 Dozens every year, cumulatively thousands of people. I talk about D and
 ask people what it would take for them to use the language. Invariably I
 hear a surprisingly small number of reasons:
Are folks going to not start using D because of auto-decoding? No, because they won't know anything about it. Many of them don't even know anything about ranges.
Actually, ranges are a major reason why people look into D. -- Andrei
Jun 02 2016
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/1/16 6:24 PM, Andrei Alexandrescu wrote:
 On 06/01/2016 06:09 PM, ZombineDev wrote:
 Deprecating front, popFront and empty for narrow
 strings is what we are talking about here.
That will not happen. Walter and I consider the cost excessive and the benefit too small.
If this doesn't happen, then all this push to change anything in Phobos is completely wasted effort. As long as arrays aren't treated like arrays, we will have to deal with auto-decoding. You can change string literals to be something other than arrays, and then we have a path forward. But as long as char[] is not an array, we have lost the battle of sanity. -Steve
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 09:05 AM, Steven Schveighoffer wrote:
 On 6/1/16 6:24 PM, Andrei Alexandrescu wrote:
 On 06/01/2016 06:09 PM, ZombineDev wrote:
 Deprecating front, popFront and empty for narrow
 strings is what we are talking about here.
That will not happen. Walter and I consider the cost excessive and the benefit too small.
If this doesn't happen, then all this push to change anything in Phobos is completely wasted effort.
Really? "Anything"?
 As long as arrays aren't treated like
 arrays, we will have to deal with auto-decoding.

 You can change string literals to be something other than arrays, and
 then we have a path forward. But as long as char[] is not an array, we
 have lost the battle of sanity.
Yeah, it's a miracle the language stays glued eh. Your post is a prime example that this thread has lost the battle of sanity. I'll destroy you in person tonight. Andrei
Jun 02 2016
next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/2/16 9:09 AM, Andrei Alexandrescu wrote:
 On 06/02/2016 09:05 AM, Steven Schveighoffer wrote:
 On 6/1/16 6:24 PM, Andrei Alexandrescu wrote:
 On 06/01/2016 06:09 PM, ZombineDev wrote:
 Deprecating front, popFront and empty for narrow
 strings is what we are talking about here.
That will not happen. Walter and I consider the cost excessive and the benefit too small.
If this doesn't happen, then all this push to change anything in Phobos is completely wasted effort.
Really? "Anything"?
The push to make Phobos only use byDchar (or any other band-aid fix for this issue) is what I meant by anything. Not "anything" anything :)
 As long as arrays aren't treated like
 arrays, we will have to deal with auto-decoding.

 You can change string literals to be something other than arrays, and
 then we have a path forward. But as long as char[] is not an array, we
 have lost the battle of sanity.
Yeah, it's a miracle the language stays glued eh.
I mean as far as narrow strings are concerned. To have the language tell me, yes, char[] is an array with a .length member, but hasLength is false? What, str[4] works, but isRandomAccessRange is false? Maybe it's more Orwellian than insane: Phobos is saying 2 + 2 = 5 ;)
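For the record, that inconsistency is easy to check against the range traits. A minimal sketch, assuming the current definitions in std.range:

import std.range;

void main()
{
    string s = "hello";
    assert(s.length == 5);    // it has a length...
    assert(s[4] == 'o');      // ...and it supports indexing...
    static assert(!hasLength!string);               // ...yet Phobos says there is no usable length
    static assert(!isRandomAccessRange!string);     // ...and no random access
    static assert(is(ElementType!string == dchar)); // because the "elements" are decoded dchars
}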
 Your post is a prime example that this thread has lost the battle of
 sanity. I'll destroy you in person tonight.
It's the cynicism of talking/debating about this for years and years and not seeing any progress. We can discuss of course, and see who gets destroyed :) And yes, I'm about to kill this thread from my newsreader, since it's wasting too much of my time... -Steve
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 09:25 AM, Steven Schveighoffer wrote:
 And yes, I'm about to kill this thread from my newsreader, since it's
 wasting too much of my time...
A good idea for all of us. Could you also please look at my post on our meetup page? Thx! -- Andrei
Jun 02 2016
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 15:09, Andrei Alexandrescu wrote:
 You can change string literals to be something other than arrays, and
 then we have a path forward. But as long as char[] is not an array, we
 have lost the battle of sanity.
Yeah, it's a miracle the language stays glued eh. ...
It's not a language problem. Just avoid Phobos.
 Your post is a prime example that this thread has lost the battle of
 sanity.
He is just saying that the fundamental reason why autodecoding is bad is that it denies that T[] is an array for any T.
Jun 02 2016
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu 
wrote:
 On 06/01/2016 03:07 PM, ZombineDev wrote:
 This is not autodecoding. There is nothing auto-magic w.r.t. 
 strings in
 plain foreach.
I understand where you're coming from, but it actually is autodecoding. Consider:

byte[] a;
foreach (byte x; a) {}
foreach (short x; a) {}
foreach (int x; a) {}

That works by means of a conversion short->int. However:

char[] a;
foreach (char x; a) {}
foreach (wchar x; a) {}
foreach (dchar x; a) {}

The latter two do autodecoding, not conversion as the rest of the language.

Andrei
This, deep down, points at the fact that conversions from/to char types are ill-defined. One should be able to convert from char to byte/ubyte but not the other way around. One should be able to convert from byte to short but not from char to wchar. Once you disable the naive conversions, then the autodecoding in foreach isn't inconsistent anymore.
Jun 02 2016
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 12:38, deadalnix wrote:

 This, deep down, points at the fact that conversions from/to char types
 are ill-defined.

 One should be able to convert from char to byte/ubyte but not the other
 way around.
 One should be able to convert from byte to short but not from char to
 wchar.

 Once you disable the naive conversions, then the autodecoding in foreach
 isn't inconsistent anymore.
The current situation is bad:

void main(){
    import std.utf, std.stdio;
    foreach(dchar d; "∑") writeln(d);            // "∑"
    foreach(dchar d; "∑".byCodeUnit) writeln(d); // "â", "\210", "\221"
}

Implicit conversion should not happen, and I'd prefer both of them to behave the same. (I.e. make both a compile-time error or decode for both).
Jun 02 2016
prev sibling next sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:
 On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
 On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d 
wrote:
 On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
 Saying that operating at the code point level - UTF-32 - is correct
 is like saying that operating at UTF-16 instead of UTF-8 is correct.
Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei
Does walkLength yield the same number for all representations?
walkLength treats a code point like it's a character. My point is that that's incorrect behavior. It will not result in correct string processing in the general case, because a code point is not guaranteed to be a full character. walkLength does not report the length of a character as one in all cases just like length does not report the length of a character as one in all cases. walkLength is counting bigger units than length, but it's still counting pieces of a character rather than counting full characters.
 And you can even put that accent on 0 by doing something like

 assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d);

 One or more code units combine to make a single code point, but one or
 more
 code points also combine to make a grapheme.
That's right. D's handling of UTF is at the code unit level (like all of Unicode is portably defined). If you want graphemes use byGrapheme. It seems you destroyed your own argument, which was:
 Saying that operating at the code point level - UTF-32 - is correct
 is like saying that operating at UTF-16 instead of UTF-8 is correct.
You can't claim code units are just a special case of code points.
The point is that treating a code point like it's a full character is just as wrong as treating a code unit as if it were a full character. It's _not_ guaranteed to be a full character. Treating code points as full characters does give you the correct result in more cases than treating a code unit as a full character gives you the correct result, but it still gives you the wrong result in many cases. If we want to have fully correct behavior without making the programmer deal with all of the Unicode issues themselves, then we need to operate at the grapheme level so that we are operating on full characters (though that obviously comes at a high cost to efficiency). Treating code points as characters like we do right now does not give the correct result in the general case just like treating code units as characters doesn't give the correct result in the general case. Both work some of the time, but neither works all of the time. Autodecoding attempts to hide the fact that it's operating on Unicode but does not actually go far enough to result in correct behavior. So, we pay the cost of decoding without getting the benefit of correctness. - Jonathan M Davis
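To make the different counts concrete, here is a minimal sketch (assuming std.range.walkLength and std.uni.byGrapheme behave as documented) using a single user-visible character built from two code points:

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301";                  // 'e' followed by a combining acute accent: one character
    assert(s.length == 3);                 // UTF-8 code units
    assert(s.walkLength == 2);             // code points -- what autodecoding counts
    assert(s.byGrapheme.walkLength == 1);  // graphemes, i.e. full characters
}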
May 31 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 31.05.2016 20:53, Jonathan M Davis via Digitalmars-d wrote:
 On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:
On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d
wrote:
On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
Saying that operating at the code point level - UTF-32 - is correct
is like saying that operating at UTF-16 instead of UTF-8 is correct.
Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei
Does walkLength yield the same number for all representations?
walkLength treats a code point like it's a character. My point is that that's incorrect behavior. It will not result in correct string processing in the general case, because a code point is not guaranteed to be a full character. ...
What's "correct"? Maybe the user intended to count the number of code points in order to pre-allocate a dchar[] of the correct size. Generally, I don't see how algorithms become magically "incorrect" when applied to utf code units.
 walkLength does not report the length of a character as one in all cases
 just like length does not report the length of a character as one in all
 cases. walkLength is counting bigger units than length, but it's still
 counting pieces of a character rather than counting full characters.
The 'length' of a character is not one in all contexts. The following text takes six columns in my terminal: 日本語 123456
May 31 2016
next sibling parent reply Wyatt <wyatt.epp gmail.com> writes:
On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:
 The 'length' of a character is not one in all contexts.
 The following text takes six columns in my terminal:

 日本語
 123456
That's a property of your font and font rendering engine, not Unicode. (Also, it's probably not quite six columns; in most fonts I've tested, 漢字 are rendered as something like 1.5 characters wide, assuming your terminal doesn't overlap them.) -Wyatt
May 31 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 31.05.2016 21:40, Wyatt wrote:
 On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:
 The 'length' of a character is not one in all contexts.
 The following text takes six columns in my terminal:

 日本語
 123456
That's a property of your font and font rendering engine, not Unicode.
Sure. Hence "context". If you are e.g. trying to manually underline some text in console output, for example in a compiler error message, counting the number of characters will not actually be what you want, even though it works reliably for ASCII text.
 (Also, it's probably not quite six columns; most fonts I've tested, 漢字
 are rendered as something like 1.5 characters wide, assuming your
 terminal doesn't overlap them.)

 -Wyatt
It's precisely six columns in my terminal (also in emacs and in gedit). My point was, how can std.algorithm ever guess correctly what you /actually/ intended to do?
May 31 2016
parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 21:48:36 Timon Gehr via Digitalmars-d wrote:
 On 31.05.2016 21:40, Wyatt wrote:
 On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:
 The 'length' of a character is not one in all contexts.
 The following text takes six columns in my terminal:

 日本語
 123456
That's a property of your font and font rendering engine, not Unicode.
Sure. Hence "context". If you are e.g. trying to manually underline some text in console output, for example in a compiler error message, counting the number of characters will not actually be what you want, even though it works reliably for ASCII text.
 (Also, it's probably not quite six columns; most fonts I've tested, 漢字
 are rendered as something like 1.5 characters wide, assuming your
 terminal doesn't overlap them.)

 -Wyatt
It's precisely six columns in my terminal (also in emacs and in gedit). My point was, how can std.algorithm ever guess correctly what you /actually/ intended to do?
It can't, which is precisely why having it select for you was a bad design decision. The programmer needs to be making that decision. And the fact that Phobos currently makes that decision for you means that it's often doing the wrong thing, and the fact that it chose to decode code points by default means that it's often eating up unnecessary cycles to boot. - Jonathan M Davis
May 31 2016
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, May 31, 2016 at 07:40:13PM +0000, Wyatt via Digitalmars-d wrote:
 On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:
 
 The 'length' of a character is not one in all contexts.
 The following text takes six columns in my terminal:
 
 日本語
 123456
That's a property of your font and font rendering engine, not Unicode. (Also, it's probably not quite six columns; most fonts I've tested, 漢字 are rendered as something like 1.5 characters wide, assuming your terminal doesn't overlap them.)
[...] I believe he was talking about a console terminal that uses 2 columns to render the so-called "double width" characters. The CJK block does contain "double-width" versions of selected blocks (e.g., the ASCII block), to be used with said characters. Of course, using string length to measure string width is a risky venture fraught with pitfalls, because your terminal may not actually render them the way you think it should. Nevertheless, it does serve to highlight why a construct like s.walkLength is essentially buggy, because there is not enough information to determine which length it should return -- length of the buffer in bytes, or the number of code points, or the number of graphemes, or the width of the string. No matter which choice you make, it only works for a subset of cases and is wrong for the other cases. This is a prime illustration of why forcing autodecoding on every string in D is a wrong design. T -- Не дорог подарок, дорога любовь.
May 31 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 21:20:19 Timon Gehr via Digitalmars-d wrote:
 On 31.05.2016 20:53, Jonathan M Davis via Digitalmars-d wrote:
 On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d 
wrote:
On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via
Digitalmars-d
wrote:
On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
Saying that operating at the code point level - UTF-32 - is
correct
is like saying that operating at UTF-16 instead of UTF-8 is
correct.
Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei
Does walkLength yield the same number for all representations?
walkLength treats a code point like it's a character. My point is that that's incorrect behavior. It will not result in correct string processing in the general case, because a code point is not guaranteed to be a full character. ...
What's "correct"? Maybe the user intended to count the number of code points in order to pre-allocate a dchar[] of the correct size. Generally, I don't see how algorithms become magically "incorrect" when applied to utf code units.
In the vast majority of cases what folks care about is full characters, which is not what code points are. But the fact that they want different things in different situation just highlights the fact that just converting everything to code points by default is a bad idea. And even worse, code points are usually the worst choice. Many operations don't require decoding and can be done at the code unit level, meaning that operating at the code point level is just plain inefficient. And the vast majority of the operations that can't operate at the code point level, then need to operate on full characters, which means that they need to be operating at the grapheme level. Code points are in this weird middle ground that's useful in some cases but usually isn't what you want or need. We need to be able to operate at the code unit level, the code point level, and the grapheme level. But defaulting to the code point level really makes no sense.
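All three levels are already reachable explicitly; the complaint is only about which one is forced on you by default. A minimal sketch, assuming std.utf.byCodeUnit and std.uni.byGrapheme from current Phobos:

import std.range : walkLength;
import std.utf : byCodeUnit;
import std.uni : byGrapheme;

void main()
{
    string s = "ma\u00F1ana";                // "mañana" with a precomposed ñ
    assert(s.byCodeUnit.walkLength == 7);    // code unit level: no decoding at all
    assert(s.walkLength == 6);               // code point level: the forced default
    assert(s.byGrapheme.walkLength == 6);    // grapheme level: full characters
}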
 walkLength does not report the length of a character as one in all cases
 just like length does not report the length of a character as one in all
 cases. walkLength is counting bigger units than length, but it's still
 counting pieces of a character rather than counting full characters.
The 'length' of a character is not one in all contexts. The following text takes six columns in my terminal: 日本語 123456
Well, that's getting into displaying characters which is a whole other can of worms, but it also highlights that assuming that the programmer wants a particular level of unicode is not a particularly good idea and that we should avoid converting for them without being asked, since it risks being inefficient to no benefit. - Jonathan M Davis
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:
 In the vast majority of cases what folks care about is full character
How are you so sure? -- Andrei
May 31 2016
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Tue, 31 May 2016 16:56:43 -0400
schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:

 On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:
 In the vast majority of cases what folks care about is full character  
How are you so sure? -- Andrei
Because a full character is the typical unit of a written language. It's what we visualize in our heads when we think about finding a substring or counting characters. A special case of this is the reduction to ASCII where we can use code units in place of grapheme clusters. -- Marco
May 31 2016
parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 23:36:20 Marco Leise via Digitalmars-d wrote:
 Am Tue, 31 May 2016 16:56:43 -0400

 schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:
 On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:
 In the vast majority of cases what folks care about is full character
How are you so sure? -- Andrei
Because a full character is the typical unit of a written language. It's what we visualize in our heads when we think about finding a substring or counting characters. A special case of this is the reduction to ASCII where we can use code units in place of grapheme clusters.
Exactly. How many folks here have written code where the correct thing to do is to search on code points? Under what circumstances is that even useful? Code points are a mid-level abstraction between UTF-8/16 and graphemes that are not particularly useful on their own. Yes, by using code points, we eliminate the differences between the encodings, but how much code even operates on multiple string types? Having all of your strings have the same encoding fixes the consistency problem just as well as autodecoding to dchar everywhere does - and without the efficiency hit. Typically, folks operate on string or char[] unless they're talking to the Windows API, in which case, they need wchar[]. Our general recommendation is that D code operate on UTF-8 except when it needs to operate on a different encoding because of other stuff it has to interact with (like the Win32 API), in which case, ideally it converts those strings to UTF-8 once they get into the D code and operates on them as UTF-8, and anything that has to be output in a different encoding is operated on as UTF-8 until it needs to be output, in which case, it's converted to UTF-16 or whatever the target encoding is. Not much of anyone is recommending that you use dchar[] everywhere, but that's essentially what the range API is trying to force. I think that it's very safe to say that the vast majority of string processing either is looking to operate on strings as a whole or on individual, full characters within a string. Code points are neither. While code may play tricks with Unicode to be efficient (e.g. operating at the code unit level where it can rather than decoding to either code points or graphemes), or it might make assumptions about its data being ASCII-only, aside from explicit Unicode processing code, I have _never_ seen code that was actually looking to logically operate on only pieces of characters. While it may operate on code units for efficiency, it's always looking to be logically operating on string as a unit or on whole characters. Anyone looking to operate on code points is going to need to take into account the fact that they're not full characters, just like anyone who operates on code units needs to take into account the fact that they're not whole characters. Operating on code points as if they were characters - which is exactly what D currently does with ranges - is just plain wrong. We need to support operating at the code point level for those rare cases where it's actually useful, but autodecoding makes no sense. It incurs a performance penalty without actually giving correct results except in those rare cases where you want code points instead of full characters. And only Unicode experts are ever going to want that. The average programmer who is not super Unicode savvy doesn't even know what code points are. They're clearly going to be looking to operate on strings as sequences of characters, not sequences of code points. I don't see how anyone could expect otherwise. Code points are a mid-level, Unicode abstraction that only those who are Unicode savvy are going to know or care about, let alone want to operate on. - Jonathan M Davis
May 31 2016
parent Jack Stouffer <jack jackstouffer.com> writes:
On Wednesday, 1 June 2016 at 02:17:21 UTC, Jonathan M Davis wrote:
 ...
This thread is going in circles; the against crowd has stated each of their arguments very clearly at least five times in different ways. The cost/benefit problems with auto decoding are as clear as day. If the evidence already presented in this thread (and in the many others) isn't enough to convince people of that, then I don't think anything else said will have an impact. I don't want to sound like someone telling people not to discuss this anymore, but honestly, what is continuing this thread going to accomplish?
May 31 2016
prev sibling parent Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Tuesday, 31 May 2016 at 20:56:43 UTC, Andrei Alexandrescu 
wrote:
 On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d 
 wrote:
 In the vast majority of cases what folks care about is full 
 character
How are you so sure? -- Andrei
He doesn't need to be sure. You are the one advocating for code points, so the burden is on you to present evidence that it's the correct choice.
Jun 01 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 02:53 PM, Jonathan M Davis via Digitalmars-d wrote:
 walkLength treats a code point like it's a character.
No, it treats a code point like it's a code point. -- Andrei
May 31 2016
parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 15:33:38 Andrei Alexandrescu via Digitalmars-d wrote:
 On 05/31/2016 02:53 PM, Jonathan M Davis via Digitalmars-d wrote:
 walkLength treats a code point like it's a character.
No, it treats a code point like it's a code point. -- Andrei
Wasn't the whole point of operating at the code point level by default to make it so that code would be operating on full characters by default instead of chopping them up as is so easy to do when operating at the code unit level? Thanks to how Phobos treats strings as ranges of dchar, most D code treats code points as if they were characters. So, whether it's correct or not, a _lot_ of D code is treating walkLength like it returns the number of characters in a string. And if walkLength doesn't provide the number of characters in a string, why would I want to use it under normal circumstances? Why would I want to be operating at the code point level in my code? It's not necessarily a full character, since it's not necessarily a grapheme. So, by using walkLength and front and popFront and whatnot with strings, I'm not getting full characters. I'm still only getting pieces of characters - just like would happen if strings were treated as ranges of code units. I'm just getting bigger pieces of the characters out of the deal. But if they're not full characters, what's the point? I am sure that there is code that is going to want to operate at the code point level, but your average program is either operating on strings as a whole or individual characters. As long as strings are being operated on as a whole, code units are generally plenty, and careful encoding of characters into code units for comparisons means that much of the time that you want to operate on individual characters, you can still operate at the code unit level. But if you can't, then you need the grapheme level, because a code point is not necessarily a full character. So, what is the point of operating on code points in your average D program? walkLength will not always tell me the number of characters in a string. front risks giving me a partial character rather than a whole one. Slicing dchar[] risks chopping up characters just like slicing char[] does. Operating on code points by default does not result in correct string processing. I honestly don't see how autodecoding is defensible. We may not be able to get rid of it due to the breakage that doing that would cause, but I fail to see how it is at all desirable that we have autodecoded strings. I can understand how we got it if it's based on a misunderstanding on your part about how Unicode works. We all make mistakes. But I fail to see how autodecoding wasn't a mistake. It's the worst of both worlds - inefficient while still incorrect. At least operating at the code unit level would be fast while being incorrect, and it would be obviously incorrect once you did anything with non-ASCII values, whereas it's easy to miss that ranges of dchar are doing the wrong thing too - Jonathan M Davis
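As a minimal illustration of both failure modes (assuming std.utf.validate and the usual range primitives): front hands back a lone code point with the accent stripped, and slicing, whether by code unit or by code point, can still cut a character apart.

import std.range : front;
import std.utf : validate;
import std.exception : assertThrown;

void main()
{
    string s = "e\u0301";                // 'e' + combining acute: one character, two code points
    assert(s.front == 'e');              // the accent is silently dropped
    assertThrown(validate(s[0 .. 2]));   // slicing by code units can even split the accent's UTF-8
    dstring d = "e\u0301"d;
    assert(d[0 .. 1] == "e"d);           // slicing dchar[] still chops the character apart
}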
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote:
 Wasn't the whole point of operating at the code point level by default to
 make it so that code would be operating on full characters by default
 instead of chopping them up as is so easy to do when operating at the code
 unit level?
The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units). That's the contract, and it seems meaningful seeing how Unicode is defined in terms of code points as its abstract building block. If user code needs to go lower at the code unit level, they can do so. If user code needs to go upper at the grapheme level, they can do so. If anything this thread strengthens my opinion that autodecoding is a sweet spot. -- Andrei
May 31 2016
next sibling parent Max Samukha <maxsamukha gmail.com> writes:
On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu 
wrote:

 If user code needs to go upper at the grapheme level, they can 
 If anything this thread strengthens my opinion that 
 autodecoding is a sweet spot. -- Andrei
Unicode FAQ disagrees (http://unicode.org/faq/utf_bom.html): "Q: How about using UTF-32 interfaces in my APIs? A: Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. With UTF-16 APIs the low level indexing is at the storage or code unit level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the code units. This provides efficiency at the low levels, and the required functionality at the high levels."
May 31 2016
prev sibling next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, May 31, 2016 at 05:01:17PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote:
 Wasn't the whole point of operating at the code point level by
 default to make it so that code would be operating on full
 characters by default instead of chopping them up as is so easy to
 do when operating at the code unit level?
The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units).
This is basically saying that we operate on dchar[] by default, except that we disguise its detrimental memory usage consequences by compressing to UTF-8/UTF-16 and incurring the cost of decompression every time we access its elements. Perhaps you love the idea of running an OS that stores all files in compressed form and always decompresses upon every syscall to read(), but I prefer a higher-performance system.
 That's the contract, and it seems meaningful
 seeing how Unicode is defined in terms of code points as its abstract
 building block.
Where's this contract stated, and when did we sign up for this?
 If user code needs to go lower at the code unit level, they can do so.
 If user code needs to go upper at the grapheme level, they can do so.
Only with much pain by using workarounds to bypass meticulously-crafted autodecoding algorithms in Phobos.
 If anything this thread strengthens my opinion that autodecoding is a
 sweet spot. -- Andrei
No, autodecoding is a stalemate that's neither fast nor correct. T -- "Real programmers can write assembly code in any language. :-)" -- Larry Wall
May 31 2016
prev sibling parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu 
wrote:
 On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d 
 wrote:
 Wasn't the whole point of operating at the code point level by 
 default to
 make it so that code would be operating on full characters by 
 default
 instead of chopping them up as is so easy to do when operating 
 at the code
 unit level?
The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units).
_Both_ are low-level representation-specific artifacts.
Jun 01 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/01/2016 06:25 AM, Marc Schütz wrote:
 On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:
 On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote:
 Wasn't the whole point of operating at the code point level by
 default to
 make it so that code would be operating on full characters by default
 instead of chopping them up as is so easy to do when operating at the
 code
 unit level?
The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units).
_Both_ are low-level representation-specific artifacts.
Maybe this is a misunderstanding. Representation = how things are laid out in memory. What does associating numbers with various Unicode symbols have to do with representation? -- Andrei
Jun 01 2016
next sibling parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 06/01/2016 10:29 AM, Andrei Alexandrescu wrote:
 On 06/01/2016 06:25 AM, Marc Schütz wrote:
 On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:
 The point is to operate on representation-independent entities
 (Unicode code points) instead of low-level representation-specific
 artifacts (code units).
_Both_ are low-level representation-specific artifacts.
Maybe this is a misunderstanding. Representation = how things are laid out in memory. What does associating numbers with various Unicode symbols have to do with representation? -- Andrei
As has been explained countless times already, code points are a non-1:1 internal representation of graphemes. Code points don't exist for their own sake, their entire existence is purely as a way to encode graphemes. Whether that technically qualifies as "memory representation" or not is irrelevant: it's still a low-level implementation detail of text.
Jun 01 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/01/2016 12:41 PM, Nick Sabalausky wrote:
 As has been explained countless times already, code points are a non-1:1
 internal representation of graphemes. Code points don't exist for their
 own sake, their entire existence is purely as a way to encode graphemes.
Of course, thank you.
 Whether that technically qualifies as "memory representation" or not is
 irrelevant: it's still a low-level implementation detail of text.
The relevance is meandering across the discussion, and it's good to have the same definitions for terms. Unicode code points are abstract notions with meanings attached to them, whereas UTF8/16/32 are concerned with their representation. Andrei
Jun 01 2016
prev sibling parent Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Wednesday, 1 June 2016 at 14:29:58 UTC, Andrei Alexandrescu 
wrote:
 On 06/01/2016 06:25 AM, Marc Schütz wrote:
 On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu 
 wrote:
 The point is to operate on representation-independent entities
 (Unicode code points) instead of low-level 
 representation-specific
 artifacts (code units).
_Both_ are low-level representation-specific artifacts.
Maybe this is a misunderstanding. Representation = how things are laid out in memory. What does associating numbers with various Unicode symbols have to do with representation? --
Ok, if you define it that way, sure. I was thinking in terms of the actual text: Unicode is a way to represent that text using a variety of low-level representations: UTF8/NFC, UTF8/NFD, unnormalized UTF8, UTF16 big/little endian x normalization, UTF32 x normalization, some other more obscure ones. From that viewpoint, auto decoded char[] (= UTF8) is equivalent to dchar[] (= UTF32). Neither of them is the actual text. Both writing and the memory representation consist of fundamental units. But there is no 1:1 relationship between the units of char[] (UTF8 code units) or auto decoded strings (Unicode code points) on the one hand, and the units of writing (graphemes) on the other.
Jun 02 2016
prev sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
[...]
 Does walkLength yield the same number for all representations?
Let's put the question this way. Given the following string, what do *you* think walkLength should return? şŭt̥ḛ́k̠ I think any reasonable person would have to say it should return 5, because there are 5 visual "characters" here. Otherwise, what is even the meaning of walkLength?! For it to return anything other than 5 means that it's a leaky abstraction, because it's leaking low-level "implementation details" of the Unicode representation of this string. However, with the current implementation of autodecoding, walkLength returns 11. Can anyone reasonably argue that it's reasonable for "şŭt̥ḛ́k̠".walkLength to equal 11? What difference does this make if we get rid of autodecoding, and walkLength returns 17 instead? *Both* are wrong. 17 is actually the right answer if you're looking to allocate a buffer large enough to hold this string, because that's the number of bytes it occupies. 5 is the right answer to an end user who knows nothing about Unicode. 11 is an answer to a question that only makes sense to a Unicode specialist, and that no layperson understands. 11 is the answer we currently give. And that, at the cost of across-the-board performance degradation. Yet you're seriously arguing that 11 should be the right answer, by insisting that the current implementation of autodecoding is "correct". It boggles the mind. T -- Today's society is one of specialization: as you grow, you learn more and more about less and less. Eventually, you know everything about nothing.
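Those three numbers are easy to check. A minimal sketch, assuming one plausible decomposition of that string into five ASCII letters plus six combining marks (the exact marks chosen do not change the counts):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "s\u0327u\u0306t\u0325e\u0301\u0330k\u0320";
    assert(s.length == 17);                // UTF-8 code units -- the buffer size
    assert(s.walkLength == 11);            // code points -- today's answer
    assert(s.byGrapheme.walkLength == 5);  // graphemes -- what a reader calls characters
}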
May 31 2016
next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:
 On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via
Digitalmars-d wrote:
 [...]
 Does walkLength yield the same number for all representations?
Let's put the question this way. Given the following string, what do *you* think walkLength should return?
Compiler error. -Steve
May 31 2016
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 31.05.2016 21:51, Steven Schveighoffer wrote:
 On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:
 On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via
 Digitalmars-d wrote:
 [...]
 Does walkLength yield the same number for all representations?
Let's put the question this way. Given the following string, what do *you* think walkLength should return?
Compiler error. -Steve
What about e.g. joiner?
May 31 2016
next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, May 31, 2016 at 10:38:03PM +0200, Timon Gehr via Digitalmars-d wrote:
 On 31.05.2016 21:51, Steven Schveighoffer wrote:
 On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:
 On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via
 Digitalmars-d wrote:
 [...]
 Does walkLength yield the same number for all representations?
Let's put the question this way. Given the following string, what do *you* think walkLength should return?
Compiler error. -Steve
What about e.g. joiner?
joiner is one of those algorithms that can work perfectly fine *without* autodecoding anything at all. The only time it'd actually need to decode would be if you're joining a set of UTF-8 strings with a UTF-16 delimiter, or some other such combination, which should be pretty rare. After all, within the same application you'd usually only be dealing with a single encoding rather than mixing UTF-8, UTF-16, and UTF-32 willy-nilly. (Unless the code is specifically written for transcoding, in which case decoding is part of the job description, so it should be expected that the programmer ought to know how to do it properly without needing Phobos to do it for him.) Even in the case of s.joiner('Ш'), joiner could easily convert that dchar into a short UTF-8 string and then operate directly on UTF-8. T -- Just because you survived after you did it, doesn't mean it wasn't stupid!
May 31 2016
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/31/16 4:38 PM, Timon Gehr wrote:
 On 31.05.2016 21:51, Steven Schveighoffer wrote:
 On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:
 On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via
 Digitalmars-d wrote:
 [...]
 Does walkLength yield the same number for all representations?
Let's put the question this way. Given the following string, what do *you* think walkLength should return?
Compiler error.
What about e.g. joiner?
Compiler error. Better than what it does now. -Steve
May 31 2016
parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Wednesday, 1 June 2016 at 01:13:17 UTC, Steven Schveighoffer 
wrote:
 On 5/31/16 4:38 PM, Timon Gehr wrote:
 What about e.g. joiner?
Compiler error. Better than what it does now.
I believe everything that does only concatenation will work correctly. That's why joiner() is one of those algorithms that should accept strings directly without going through any decoding (but it may need to recode the joining element itself, of course).
Jun 01 2016
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/1/16 6:31 AM, Marc Schütz wrote:
 On Wednesday, 1 June 2016 at 01:13:17 UTC, Steven Schveighoffer wrote:
 On 5/31/16 4:38 PM, Timon Gehr wrote:
 What about e.g. joiner?
Compiler error. Better than what it does now.
I believe everything that does only concatenation will work correctly. That's why joiner() is one of those algorithms that should accept strings directly without going through any decoding (but it may need to recode the joining element itself, of course).
This means that a string is a range. What is it a range of? If you want to make it a range of code units, I think you will lose that battle. If you want to special-case joiner for strings, that's always possible. Or string could be changed to be a range of dchar struct explicitly. Then at least joiner makes sense, and I can reasonably explain why it behaves the way it does. -Steve
Jun 02 2016
next sibling parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Thursday, 2 June 2016 at 13:11:10 UTC, Steven Schveighoffer 
wrote:
 On 6/1/16 6:31 AM, Marc Schütz wrote:
 I believe everything that does only concatenation will work 
 correctly.
 That's why joiner() is one of those algorithms that should 
 accept
 strings directly without going through any decoding (but it 
 may need to
 recode the joining element itself, of course).
This means that a string is a range. What is it a range of? If you want to make it a range of code units, I think you will lose that battle.
No, I don't want to make string a range of anything, I want to provide an additional overload for joiner() that accepts a const(char)[], and returns a range of chars. The remark about the joining element is that ["abc", "xyz"].joiner(","d) should convert ","d to "," first, to match the element type of the elements. But this is purely a convenience; it can also be pushed to the user.
 If you want to special-case joiner for strings, that's always 
 possible.
Yes, that's what I want. Sorry if it wasn't clear.
 Or string could be changed to be a range of dchar struct 
 explicitly. Then at least joiner makes sense, and I can 
 reasonably explain why it behaves the way it does.

 -Steve
Jun 02 2016
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 15:48, Marc Schütz wrote:

 No, I don't want to make string a range of anything, I want to provide
 an additional overload for joiner() that accepts a const(char)[], and
 returns a range of chars.
If strings are not ranges, returning a range of chars is inconsistent.
Jun 02 2016
prev sibling parent Kagamin <spam here.lot> writes:
On Thursday, 2 June 2016 at 13:11:10 UTC, Steven Schveighoffer 
wrote:
 This means that a string is a range. What is it a range of? If 
 you want to make it a range of code units, I think you will 
 lose that battle.
After the first migration step, joiner will return a decoded dchar range just like it does now; only the code will change internally, and there will be no observable semantic difference to the user. Anyway, read Walter's proposal in the thread about dealing with autodecode.
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:
 Let's put the question this way. Given the following string, what do
 *you*  think walkLength should return?

 	şŭt̥ḛ́k̠
The number of code units in the string. That's the contract promised and honored by Phobos. -- Andrei
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
 On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:
 Let's put the question this way. Given the following string, what do
 *you*  think walkLength should return?

     şŭt̥ḛ́k̠
The number of code units in the string. That's the contract promised and honored by Phobos. -- Andrei
Code points I mean. -- Andrei
May 31 2016
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
 On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
 On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:
 Let's put the question this way. Given the following string, what do
 *you*  think walkLength should return?

     şŭt̥ḛ́k̠
The number of code units in the string. That's the contract promised and honored by Phobos. -- Andrei
Code points I mean. -- Andrei
Yes, we know it's the contract. ***That's the problem.*** As everybody is saying, it *SHOULDN'T* be the contract. Why shouldn't it be the contract? Because it's proven itself, both logically (as presented by pretty much everybody other than you in both this and other threads) and empirically (in phobos, warp, and other user code) to be both the least useful and most PITA option.
May 31 2016
parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 20:38:14 Nick Sabalausky via Digitalmars-d wrote:
 On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
 On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
 On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:
 Let's put the question this way. Given the following string, what do
 *you*  think walkLength should return?

     şŭt̥ḛ́k̠
The number of code units in the string. That's the contract promised and honored by Phobos. -- Andrei
Code points I mean. -- Andrei
Yes, we know it's the contract. ***That's the problem.*** As everybody is saying, it *SHOULDN'T* be the contract. Why shouldn't it be the contract? Because it's proven itself, both logically (as presented by pretty much everybody other than you in both this and other threads) and empirically (in phobos, warp, and other user code) to be both the least useful and most PITA option.
Exactly. Operating at the code point level rarely makes sense. What sorts of algorithms purposefully do that in a typical program? Unless you're doing very specific Unicode stuff or somehow know that your strings don't contain any graphemes that are made up of multiple code points, operating at the code point level is just bug-prone, and unless you're using dchar[] everywhere, it's slow to boot, because your strings have to be decoded whether the algorithm needs to or not. I think that it's very safe to say that the vast majority of string algorithms are either able to operate at the code unit level without decoding (though possibly encoding another string to match - e.g. with a string comparison or search), or they have to operate at the grapheme level in order to deal with full characters. A code point is borderline useless on its own. It's just a step above the different UTF encodings without actually getting to proper characters. - Jonathan M Davis
May 31 2016
prev sibling parent reply ag0aep6g <anonymous example.com> writes:
On 05/31/2016 07:21 PM, Andrei Alexandrescu wrote:
 Could you please substantiate that? My understanding is that code unit
 is a higher-level Unicode notion independent of encoding, whereas code
 point is an encoding-dependent representation detail. -- Andrei
You got the terms mixed up. Code unit is lower level. Code point is higher level. One code point is encoded with one or more code units. char is a UTF-8 code unit. wchar is a UTF-16 code unit. dchar is both a UTF-32 code unit and a code point, because in UTF-32 it's a 1-to-1 relation.
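A minimal sketch of that relation, using literal suffixes to pick the encoding:

void main()
{
    // The same code point takes a different number of code units in each encoding.
    assert("\u2211".length  == 3);  // '∑': three UTF-8 code units (char)
    assert("\u2211"w.length == 1);  // one UTF-16 code unit (wchar)
    assert("\u2211"d.length == 1);  // one UTF-32 code unit (dchar) == one code point
    assert("\U0001F600".length  == 4);  // an emoji: four chars,
    assert("\U0001F600"w.length == 2);  // a surrogate pair of wchars,
    assert("\U0001F600"d.length == 1);  // and still exactly one code point
}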
May 31 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 03:34 PM, ag0aep6g wrote:
 On 05/31/2016 07:21 PM, Andrei Alexandrescu wrote:
 Could you please substantiate that? My understanding is that code unit
 is a higher-level Unicode notion independent of encoding, whereas code
 point is an encoding-dependent representation detail. -- Andrei
You got the terms mixed up. Code unit is lower level. Code point is higher level.
Apologies and thank you. -- Andrei
May 31 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
 The standard library has to fight against itself because of autodecoding!
 The vast majority of the algorithms in Phobos are special-cased on strings
 in an attempt to get around autodecoding. That alone should highlight the
 fact that autodecoding is problematic.
The way I see it is it's specialization to speed things up without giving up the higher level abstraction. -- Andrei
May 31 2016
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 05/31/2016 01:23 PM, Andrei Alexandrescu wrote:
 On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
 The standard library has to fight against itself because of autodecoding!
 The vast majority of the algorithms in Phobos are special-cased on
 strings
 in an attempt to get around autodecoding. That alone should highlight the
 fact that autodecoding is problematic.
The way I see it is it's specialization to speed things up without giving up the higher level abstraction. -- Andrei
Problem is, that "higher"[1] level abstraction you don't want to give up (i.e. working on code points) is rarely useful, and yet by default everyone pays its price. [1] It's really the mid-level abstraction - grapheme is the high-level one (and the more likely to be useful).
May 31 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/26/2016 9:00 AM, Andrei Alexandrescu wrote:
 My thesis: the D1 design decision to represent strings as char[] was disastrous
 and probably one of the largest weaknesses of D1. The decision in D2 to use
 immutable(char)[] for strings is a vast improvement but still has a number of
 issues.
The mutable vs immutable has nothing to do with autodecoding.
 On 05/12/2016 04:15 PM, Walter Bright wrote:
 On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
 2. Every time one wants an algorithm to work with both strings and
 ranges, you wind up special casing the strings to defeat the
 autodecoding, or to decode the ranges. Having to constantly special case
 it makes for more special cases when plugging together components. These
 issues often escape detection when unittesting because it is convenient
 to unittest only with arrays.
This is a consequence of 1. It is at least partially fixable.
It's a consequence of autodecoding, not arrays.
 4. Autodecoding is slow and has no place in high speed string processing.
I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D.
Having written high-speed string processing code in D that also deals with Unicode (i.e. Warp), the only knowledge of autodecoding I needed was how to keep it from happening. Autodecoding made the code slower than necessary in every case it was used. I found no place in Warp where autodecoding was desirable.
 Also, little code should deal with one code unit or code point at a
 time; instead, it should use standard library algorithms for searching,
matching
 etc.
That doesn't work so well. There always seems to be a need for custom string processing. Worse, when pipelining strings, the autodecoding changes the element type to dchar, which then needs to be re-encoded into the result. The std.string algorithms I wrote all work much better (i.e. faster) without autodecoding, while maintaining proper Unicode support. I.e. the autodecoding did not benefit the algorithms at all, and if the user is to use standard algorithms instead of custom ones, then autodecoding is not necessary.
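A minimal sketch of that effect (the variable names are mine): filtering a string yields dchars because of autodecoding, so getting a string back out requires re-encoding.

    import std.algorithm : filter;
    import std.conv : to;

    void main()
    {
        string s = "hëllo";
        auto piped = s.filter!(c => c != 'l');

        // Autodecoding makes the pipeline's element type dchar, not char:
        static assert(is(typeof(piped.front) == dchar));

        // To end up with a string again, the dchars must be re-encoded:
        string result = piped.to!string;
        assert(result == "hëo");
    }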
 When needed, iterating every code unit is trivially done through indexing.
This implies replacing pipelining with loops, and also falls apart if indexing is redone to index by code points.
 Also allow me to point that much of the slowdown can be addressed tactically.
 The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore
 easily speculated. We can and we should arrange code to minimize impact.
I.e. special case the code to avoid autodecoding. The trouble is that the low level code cannot avoid autodecoding, as it happens before the low level code gets it. This is conceptually backwards, and winds up requiring every algorithm to special case strings, even when completely unnecessary. (The 'copy' algorithm is an example of utterly unnecessary decoding.) When teaching people how to write algorithms, having to write every one twice, once for ranges and arrays, and a specialization for strings even when decoding is never necessary (such as for 'copy'), is embarrassing.
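A hedged sketch of the special-casing pattern being criticized (the function name is mine, not Phobos'): the generic branch would autodecode, so strings get a dedicated branch that works on code units.

    import std.string : representation;
    import std.traits : isSomeString;

    // Count the elements of a range. Strings need their own branch just to
    // step around autodecoding; everything else takes the generic path.
    size_t elementCount(R)(R r)
    {
        static if (isSomeString!R)
        {
            return r.representation.length; // code units, no decoding
        }
        else
        {
            size_t n;
            foreach (e; r) ++n;             // generic range path
            return n;
        }
    }

    void main()
    {
        assert(elementCount("héllo") == 6); // 6 UTF-8 code units
        assert(elementCount([1, 2, 3]) == 3);
    }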
 5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug):

    s.find("abc")
    s.findSplit("abc")
    s.findSplit('a')
    s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However the following do require autodecoding:

    s.walkLength
    s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
    s.count!(c => c >= 32) // non-control characters

Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.
Running my char[] through a pipeline and having it come out sometimes as char[] and sometimes dchar[] and sometimes ubyte[] is hidden and surprising behavior.
 6. Autodecoding has two choices when encountering invalid code units -
 throw or produce an error dchar. Currently, it throws, meaning no
 algorithms using autodecode can be made nothrow.
Agreed. This is probably the most glaring mistake. I think we should open a discussion on fixing this everywhere in the stdlib, even at the cost of breaking code.
A third option is to pass the invalid code units through unmolested, which won't work if autodecoding is used.
 7. Autodecode cannot be used with unicode path/filenames, because it is
 legal (at least on Linux) to have invalid UTF-8 as filenames. It turns
 out in the wild that pure Unicode is not universal - there's lots of
 dirty Unicode that should remain unmolested, and autocode does not play
 with that.
If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.
Requiring code units to be all 100% valid is not workable, nor is redoing them to be ubytes. More on that below.
 8. In my work with UTF-8 streams, dealing with autodecode has caused me
 considerably extra work every time. A convenient timesaver it ain't.
Objection. Vague.
Sorry I didn't log the time I spent on it.
 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
 importing std.array one way or another, and then autodecode is there.
Turning off autodecoding is as easy as inserting .representation after any string.
.representation changes the type to ubyte[]. All knowledge that this is a Unicode string then gets lost for the rest of the pipeline.
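To make that concrete, a minimal sketch of the type change:

    import std.string : representation;

    void main()
    {
        string s = "héllo";
        auto r = s.representation;

        // The element type is now a plain integer; that this is UTF-8 text
        // is no longer visible to the rest of the pipeline.
        static assert(is(typeof(r) == immutable(ubyte)[]));
    }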
 (Not to mention using indexing directly.)
Doesn't work if you're pipelining.
 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
 benefit of being arrays in the first place.
First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width.
I found .representation to be unworkable because it changed the type.
 11. Indexing an array produces different results than autodecoding,
 another glaring special case.
This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.
Even if it is made a special type, the problem of what an index means will remain. Of course, indexing by code point is an O(n) operation, which I submit is surprising and shouldn't be supported as [i] even by a special type (for the same reason that indexing of linked lists is frowned upon). Giving up indexing means giving up efficient slicing, which would be a major downgrade for D.
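A small sketch of the mismatch in point 11, using only the language and std.range.primitives: indexing yields code units, while the range primitives autodecode to code points.

    import std.range.primitives : front;

    void main()
    {
        string s = "öx";       // 'ö' is 2 UTF-8 code units, 'x' is 1

        assert(s.length == 3); // length counts code units
        assert(s[2] == 'x');   // indexing is by code unit, O(1)

        static assert(is(typeof(s[0])    == immutable(char))); // a code unit
        static assert(is(typeof(s.front) == dchar));           // a decoded code point
    }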
 Overall, I think the one way to make real steps forward in improving string
 processing in the D language is to give a clear answer of what char, wchar, and
 dchar mean.
They mean code units. This is not ambiguous.

How a code unit is different from a ubyte:

A. I know you hate bringing up my personal experience, but here goes. I've programmed in C forever. In C, char is used for both small integers and characters. It's always been a source of confusion, and sometimes bugs, to conflate the two:

    struct S { char field; };

Which is it, a character or a small integer? I have to rely on reading the code. It's a definite improvement in D that they are distinguished, and I feel that improvement every time I have to deal with C/C++ code and see 'char' used as a small integer instead of a character.

B. Overloading is different, and that's good. For example, writeln(T[]) produces different results for char[] and ubyte[], and this is unsurprising and expected. It "just works".

C. More overloading:

    writeln('a');

Does anyone want that to print 97? Does anyone really want 'a' to be of type dchar? (The trouble with that is type inference when building up more complex types, as you'll wind up with hidden dchar[] if not careful. My experience with dchar[] is it is almost never desirable, as it is too memory hungry.)
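To illustrate point B above, a minimal sketch of the overloading difference (values chosen arbitrarily):

    import std.stdio : writeln;

    void main()
    {
        char[]  c = "hi".dup;
        ubyte[] b = [104, 105];

        writeln(c);   // prints: hi
        writeln(b);   // prints: [104, 105]
        writeln('a'); // prints: a  (not 97, because char is a character type)
    }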
May 27 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 1:11 PM, Walter Bright wrote:
 They mean code units.
Always valid or potentially invalid as well? -- Andrei
May 27 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2016 11:27 AM, Andrei Alexandrescu wrote:
 On 5/27/16 1:11 PM, Walter Bright wrote:
 They mean code units.
Always valid or potentially invalid as well? -- Andrei
Some years ago I would have said always valid. Experience, however, says that Unicode is often dirty and code should be tolerant of that. Consider Unicode in a text editor. You can't have it throwing exceptions, silently changing things to replacement characters, etc., when there are a few invalid sequences in it. You also can't just say "the file isn't Unicode" and refuse to display the Unicode in it. It isn't hard to deal with invalid Unicode in a user-friendly manner.
May 27 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 1:11 PM, Walter Bright wrote:
 The std.string algorithms I wrote all work much better (i.e. faster)
 without autodecoding, while maintaining proper Unicode support.
Violent agreement is occurring here. We have plenty of those and need more. -- Andrei
May 27 2016
prev sibling next sibling parent Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 05/12/2016 10:15 PM, Walter Bright wrote:
 On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
 I am as unclear about the problems of autodecoding as I am about the
necessity
 to remove curl. Whenever I ask I hear some arguments that work well
emotionally
 but are scant on reason and engineering. Maybe it's time to rehash
them? I just
 did so about curl, no solid argument seemed to come together. I'd be
curious of
 a crisp list of grievances about autodecoding. -- Andrei
Here are some that are not matters of opinion. 6. Autodecoding has two choices when encountering invalid code units - throw or produce an error dchar. Currently, it throws, meaning no algorithms using autodecode can be made nothrow.
There are more than two choices here; see the related discussion on avoiding redundant Unicode validation: https://issues.dlang.org/show_bug.cgi?id=14519#c32.
May 29 2016
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
A relevant thread in the Rust bug tracker I remember from
three years ago: https://github.com/rust-lang/rust/issues/7043
May it be of inspiration.

-- 
Marco
May 30 2016