digitalmars.D - Fix Phobos dependencies on autodecoding

reply Walter Bright <newshound2 digitalmars.com> writes:
We don't yet have a good plan on how to remove autodecoding and yet provide 
backward compatibility with autodecoding-reliant projects, but one thing we can 
do is make Phobos work properly with and without autodecoding.

To that end, I created a build of Phobos that disables autodecoding:

https://github.com/dlang/phobos/pull/7130

Of course, it fails. If people want impactful things to work on, fixing each 
failure is worthwhile (each in separate PRs).

Note that this is neither trivial nor mindless code editing. Each case has to be examined as to why it is doing autodecoding, whether autodecoding is necessary there, and whether to replace it with byChar, byDchar, or simply hardcoded decoding logic.
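
As a concrete illustration of the kind of change involved, here is a minimal sketch (the countWhitespace* function names are made up; nothing here is from the PR): where decoding is genuinely needed, request it explicitly with byDchar instead of relying on the string range doing it silently.

```
import std.algorithm : count;
import std.uni : isWhite;
import std.utf : byDchar;

// Relies on autodecoding: count sees the string as a range of dchar.
size_t countWhitespaceAuto(string s)
{
    return s.count!isWhite;
}

// Decoding is requested explicitly, so this keeps working even if
// string ranges stop decoding by default.
size_t countWhitespaceExplicit(string s)
{
    return s.byDchar.count!isWhite;
}

void main()
{
    assert(countWhitespaceAuto("a b\tc") == 2);
    assert(countWhitespaceExplicit("a b\tc") == 2);
}
```
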
Aug 13 2019
next sibling parent reply a11e99z <black80 bk.ru> writes:
On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright wrote:
 We don't yet have a good plan on how to remove autodecoding and 
 yet provide backward compatibility with autodecoding-reliant 
 projects, but one thing we can do is make Phobos work properly 
 with and without autodecoding.

 To that end, I created a build of Phobos that disables 
 autodecoding:

 https://github.com/dlang/phobos/pull/7130

 Of course, it fails. If people want impactful things to work 
 on, fixing each failure is worthwhile (each in separate PRs).

 Note that this is neither trivial nor mindless code editing. 
 Each case has to be examined as to why it is doing 
 autodecoding, is autodecoding necessary, and deciding to 
 replace it with byChar, byDchar, or simply hardcoding the 
 decoding logic.
imo autodecoding is the right thing. maybe it would be better to leave it as is and just add

 immutable(ubyte)[] bytes( string str ) @nogc nothrow {
     return *cast( immutable(ubyte)[]* )&str;
 }

and use it as

 foreach( b; "Привет, Мир!".bytes ) // Hello world in RU
     writefln( "%x", b );           // 21 bytes, 12 runes

? why did you decide to fight autodecoding?
Aug 13 2019
next sibling parent reply Alexandru Ermicioi <alexandru.ermicioi gmail.com> writes:
On Tuesday, 13 August 2019 at 07:31:28 UTC, a11e99z wrote:
 On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright wrote:
 We don't yet have a good plan on how to remove autodecoding 
 and yet provide backward compatibility with 
 autodecoding-reliant projects, but one thing we can do is make 
 Phobos work properly with and without autodecoding.

 To that end, I created a build of Phobos that disables 
 autodecoding:

 https://github.com/dlang/phobos/pull/7130

 Of course, it fails. If people want impactful things to work 
 on, fixing each failure is worthwhile (each in separate PRs).

 Note that this is neither trivial nor mindless code editing. 
 Each case has to be examined as to why it is doing 
 autodecoding, is autodecoding necessary, and deciding to 
 replace it with byChar, byDchar, or simply hardcoding the 
 decoding logic.
imo autodecoding is one of right thing. maybe will be better to leave it as is and just to add
 immutable(ubyte)[] bytes( string str ) @nogc nothrow {
     return *cast( immutable(ubyte)[]* )&str;
 }
and use it as
 foreach( b; "Привет, Мир!".bytes) // Hello world in RU
     writefln( "%x", b );          // 21 bytes, 12 runes
? why u decide to fight with autodecoding?
One of the reasons is that it adds unnecessary complexity to templated code that works with ranges. Check the function prototypes of some algorithms in the std.algorithm package; you're bound to find special treatment for autodecoding strings.

It also defeats user expectations: apply a range function to a string and, instead of a char, front suddenly gives you a dchar.

Best regards,
Alexandru
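
A minimal sketch of that surprise, just probing the types under the current autodecoding behaviour:

```
import std.range.primitives : ElementType, front;

void main()
{
    string s = "abc";
    static assert(is(typeof(s[0])    == immutable(char))); // what the array stores
    static assert(is(typeof(s.front) == dchar));           // what range code gets
    static assert(is(ElementType!string == dchar));        // hence the special-casing
}
```
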
Aug 13 2019
parent reply a11e99z <black80 bk.ru> writes:
On Tuesday, 13 August 2019 at 07:51:23 UTC, Alexandru Ermicioi 
wrote:
 On Tuesday, 13 August 2019 at 07:31:28 UTC, a11e99z wrote:
 On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright
One of the reasons is that it adds unnecessary complexity for templated code that is working with ranges. Check function prototypes for some algorithms found in std.algorithm package, you're bound to find special treatment for autodecoding strings. It also messes up user expectation when suddenly applying a range function on a string instead of front char you're getting dchar.
imo this is a contrived problem. a string contains chars, not "char" in the sense of the type but runes or code points. and the world is not perfect, so chars/runes are stored utf8-encoded. in a world where "char" is an alias for "byte"/"ubyte" such a view was a problem: is this buffer a string (a sequence of chars) or just raw bytes? how should it be enumerated? but we have a better world now, with distinct bytes and chars. probably "char" should have been named "utf8cp" or something (not to be mixed up with the C/C++ type), and once you see a string from that point of view everything falls into place.

I don't see a problem with str.front returning a code point in 0..0x10ffff while str.length returns 21 and str.count returns 12. but somebody sees a problem here, so again, this is a contrived problem. and now this vision problem will mean recreating/rechecking tons of code. I thought WB didn't want to change code peremptorily. Should be a BIG problem when he does.
Aug 13 2019
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, August 13, 2019 2:52:58 AM MDT a11e99z via Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 07:51:23 UTC, Alexandru Ermicioi

 wrote:
 On Tuesday, 13 August 2019 at 07:31:28 UTC, a11e99z wrote:
 On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright
One of the reasons is that it adds unnecessary complexity for templated code that is working with ranges. Check function prototypes for some algorithms found in std.algorithm package, you're bound to find special treatment for autodecoding strings. It also messes up user expectation when suddenly applying a range function on a string instead of front char you're getting dchar.
imo this is a contrived problem. string contains chars, not in meaning "char" as type but runes or codepoints. and world is not perfect so chars/runes are stored as utf8 codepoints. in world where "char" is alias for "byte"/"ubyte" such vision was a problem: is this buffer string(seq of chars) or just raw bytes? how it should be enumerated? but we have better world with different bytes and chars. probably better was naming for "char" as "utf8cp"/orSomething (don't mix with C/C++ type) and when u/anybody see string from that point everything falls into place. I don't see problem that str.front returns codepoint from 0..0x10ffff and when str.length returns 21 and str.count=12. but somebody see problem here, so again this is a contrived problem. and for now this vision problem will recreate/recheck tons of code. I thought that WB don't want change code peremptorily. Should be BIG problem when he does.
Code points are almost always the wrong level to be operating at. Many algorithms can operate at the code unit level with no problem, whereas those that require decoding usually need to operate at the grapheme level so that the actual, conceptual characters are being compared. Just like code units aren't necessarily full characters, code points aren't necessarily full characters.

Auto-decoding was introduced, because at the time, Andrei did not have a solid enough understanding of Unicode and thought that code points were always entire characters and didn't know about graphemes.

Having auto-decoding has caused us tons of problems. It's inefficient, gives a false sense of code correctness, requires special-casing all over the place, and the whole "narrow string" concept causes all kinds of grief where algorithms don't work properly with strings, because they don't consider them to be random access, have a different type for their range element type than for their actual element type, etc.

Pretty much all of the big D contributors have thought for years now that auto-decoding was a mistake, and we've wanted to get rid of it. Many of us actually thought that autodecoding was a good idea at first, but we've all come to understand how terrible it is. Walter is one of the few that understood from the get-go, but he wasn't paying much attention to Phobos (since he usually focuses on the compiler) and didn't catch Andrei's mistake. If he had, autodecoding would never have been a thing in Phobos. The only reason that auto-decoding still exists in Phobos is because of how hard it is to remove without breaking code.

Making Phobos not rely on autodecoding and making it so that it will work regardless of whether the character type for a range is char, wchar, dchar, or a grapheme is exactly what we need to be doing. Some work has been done in that direction already but nowhere near enough. Once that's done, then we can look at how to fully remove autodecoding, be it Phobos v2 (which Andrei has already proposed) or some other clever solution. But regardless of how we go about removing auto-decoding - or even if we ultimately end up leaving it in place - we need to make Phobos autodecoding-agnostic so that it's not forced on everything.

- Jonathan M Davis
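
A small sketch of the special treatment described above: the current range traits deliberately refuse to treat a narrow string as the random-access array it physically is.

```
import std.range.primitives : hasSlicing, isRandomAccessRange;

static assert( isRandomAccessRange!(int[]));  // ordinary arrays qualify
static assert(!isRandomAccessRange!string);   // narrow strings are demoted to bidirectional
static assert(!hasSlicing!string);            // even though s[i .. j] compiles fine

void main() {}
```
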
Aug 13 2019
parent a11e99z <black80 bk.ru> writes:
On Tuesday, 13 August 2019 at 09:15:30 UTC, Jonathan M Davis 
wrote:
 On Tuesday, August 13, 2019 2:52:58 AM MDT a11e99z via 
 Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 07:51:23 UTC, Alexandru Ermicioi
we've wanted to get rid of it. Many of us actually thought that autodecoding was a good idea at first, but we've all come to
thx for the explanations. probably I am at this stage too. ok, I can live with .byRunes and .byBytes
Aug 13 2019
prev sibling next sibling parent Daniel Kozak <kozzi11 gmail.com> writes:
On Tue, Aug 13, 2019 at 9:35 AM a11e99z via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 imo autodecoding is one of right thing.
 maybe will be better to leave it as is and just to add
 immutable(ubyte)[] bytes( string str ) @nogc nothrow {
     return *cast( immutable(ubyte)[]* )&str;
 }
and use it as
 foreach( b; "Привет, Мир!".bytes) // Hello world in RU
     writefln( "%x", b );          // 21 bytes, 12 runes
? why u decide to fight with autodecoding?
I hate autodecoding for many reasons; one of them is that it is not done right: https://run.dlang.io/is/IHECPf

```
import std.stdio;

void main()
{
    string strd = "é🜢🜢࠷❻𐝃";
    size_t cnt;
    foreach(i, wchar c; strd)
    {
        write(i);
    }
    writeln("");
    foreach(i, char c; strd)
    {
        write(i);
    }
    writeln("");
    foreach(i, dchar c; strd)
    {
        write(i);
    }
}
```
Aug 13 2019
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Aug 13, 2019 at 07:31:28AM +0000, a11e99z via Digitalmars-d wrote:
[...]
 imo autodecoding is one of right thing.
[...]
 why u decide to fight with autodecoding?
Because it *appears* to be right, but it's actually wrong. For example:

    import std.range : retro;
    import std.stdio;

    void main() {
        writeln("привет".retro);
        writeln("приве́т".retro);
    }

Expected output:
    тевирп
    те́вирп

Actual output:
    тевирп
    т́евирп

The problem is that autodecoding makes the assumption that Unicode code point == grapheme, but this is not true. It's usually true for European languages, but it fails for many other languages. So auto-decoding gives you the illusion of correctness, but when you ship your product to Asia suddenly you get a ton of bug reports.

To guarantee correctness you need to work with graphemes (see .byGrapheme). But we can't make that the default because it's a big performance hit, and many string algorithms don't actually need grapheme segmentation.

Ultimately, the correct solution is to put the onus on the programmer to select the iteration scheme (by code units, code points, or graphemes) depending on what's actually needed at the application level. Arbitrarily choosing one of them to be the default leads to a false sense of security.


T

-- 
That's not a bug; that's a feature!
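
For comparison, a minimal sketch (not part of the original post) of a grapheme-aware reversal using std.uni.graphemeStride, so the combining accent stays attached to its letter:

```
import std.array : join;
import std.range : retro;
import std.stdio : writeln;
import std.uni : graphemeStride;

void main()
{
    auto s = "приве\u0301т";               // "е" followed by a combining acute accent

    writeln(s.retro);                      // code-point reversal: the accent jumps onto "т"

    // Split the original UTF-8 into grapheme-sized slices, then reverse the slices.
    string[] pieces;
    for (size_t i = 0; i < s.length; )
    {
        immutable len = graphemeStride(s, i);  // code units in the grapheme starting at i
        pieces ~= s[i .. i + len];
        i += len;
    }
    writeln(pieces.retro.join);            // те́вирп - the accent stays on its "е"
}
```
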
Aug 13 2019
next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
 [snip]

 Because it *appears* to be right, but it's actually wrong. For 
 example:

 	import std.range : retro;
 	import std.stdio;

 	void main() {
 		writeln("привет".retro);
 		writeln("приве́т".retro);
 	}

 Expected output:
 	тевирп
 	те́вирп

 Actual output:
 	тевирп
 	т́евирп
Huh, those two look the same.
Aug 13 2019
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Aug 13, 2019 at 04:29:33PM +0000, jmh530 via Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
 [snip]
 
 Because it *appears* to be right, but it's actually wrong. For example:
 
 	import std.range : retro;
 	import std.stdio;
 
 	void main() {
 		writeln("привет".retro);
 		writeln("приве́т".retro);
 	}
 
 Expected output:
 	тевирп
 	те́вирп
 
 Actual output:
 	тевирп
 	т́евирп
 
Huh, those two look the same.
The location of the acute accent on the second line is wrong.


T

-- 
GEEK = Gatherer of Extremely Enlightening Knowledge
Aug 13 2019
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
 [snip]

 The location of the acute accent on the second line is wrong.


 T
I'm still confused... What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
Aug 13 2019
next sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, August 13, 2019 10:51:57 AM MDT jmh530 via Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
 [snip]

 The location of the acute accent on the second line is wrong.


 T
I'm still confused... What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
It's not on the e in both of them. It's on the e on the second line of the "expected" output, but it's on the T in the second line of the "actual" output.

- Jonathan M Davis
Aug 13 2019
next sibling parent reply Gregor Mückl <gregormueckl gmx.de> writes:
On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis 
wrote:
 On Tuesday, August 13, 2019 10:51:57 AM MDT jmh530 via 
 Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
 [snip]

 The location of the acute accent on the second line is wrong.


 T
I'm still confused... What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
It's not on the e in both of them. It's on the e on the second line of the "expected" output, but it's on the T in the second line of the "actual" output. - Jonathan M Davis
We must be seeing different things then. I've taken a screenshot of how the post looks to me: http://www.gregor-mueckl.de/~gmueckl/unicode_confusion.png
Aug 13 2019
next sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, August 13, 2019 11:43:19 AM MDT Gregor Mückl via Digitalmars-d
wrote:
 On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis

 wrote:
 On Tuesday, August 13, 2019 10:51:57 AM MDT jmh530 via

 Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
 [snip]

 The location of the acute accent on the second line is wrong.


 T
I'm still confused... What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
It's not on the e in both of them. It's on the e on the second line of the "expected" output, but it's on the T in the second line of the "actual" output. - Jonathan M Davis
We must be seeing different things then. I've taken a screenshot of how the post looks to me: http://www.gregor-mueckl.de/~gmueckl/unicode_confusion.png
I suspect that some clients are not handling the text correctly (probably due to bugs in their Unicode handling). If I view this thread on forum.dlang.org in firefox, then the text ends up with the accent on the T in the code, with it being on the B in the expected output and on the e in the actual output. If I view it in chrome, the code has it on the e, the expected output has it on the e, and the actual output has it on the T - which is exactly what happens in my e-mail client.

If I run the program on run.dlang.io in either firefox or chrome, it does the same thing as chrome and my e-mail client do with the forum post, putting the accent on the e in the code and putting it on the T in the output. The same thing happens when I run it locally in my console on FreeBSD.

In no case do I see the accent on the e in the actual output, but it probably wouldn't be hard for a bug in a program's Unicode handling to put it on the e. Unicode is stupidly hard to process correctly, and the correct output of this program isn't something that you would normally see in real text.

- Jonathan M Davis
Aug 13 2019
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Aug 13, 2019 at 05:43:19PM +0000, Gregor Mückl via Digitalmars-d wrote:
[...]
 We must be seeing different things then. I've taken a screenshot of
 how the post looks to me:
 
 http://www.gregor-mueckl.de/~gmueckl/unicode_confusion.png
Did you copy-n-paste the code and run it? If you did, the browser may have done some Unicode processing on the string literal and munged the results. Maybe spelling out the second string literal might help:

    writeln("приве\u0301т".retro);

Basically, the issue here is that "е\u0301" should be processed as a single grapheme, but since it's two separate code points, auto-decoding splits the grapheme, and when .retro is applied to it, the \u0301 is now attached to the wrong code point.

This is probably not the best example, since е\u0301 isn't really how Russian is normally written (it could be used in some learner dictionaries to indicate stress, but it's non-standard and most printed material doesn't do that). Perhaps a better example might be Hangul Jamo or Arabic ligatures, but I'm unfamiliar with those languages so I don't know how to come up with a realistic example.

But the point is that according to Unicode, a grapheme consists of a base character followed by zero or more combining diacritics. Auto-decoding treats the base character separately from any combining diacritics, because it iterates over code points rather than graphemes, thus when the application is logically dealing with graphemes, you'll get incorrect results. But if you're working only with code points, then auto-decoding works.

The problem is that most of the time, either (1) you're working with "characters" ("visual" characters, i.e. graphemes), or (2) you don't actually care about the string contents but just need to copy / move / erase a substring. For (1), auto-decoding gives the wrong results. For (2), auto-decoding wastes time decoding code units: you could have just used a straight memcpy / memcmp / etc..

Unless you're implementing Unicode algorithms, you rarely need to work with code points directly. And if you're implementing Unicode algorithms, you already know (or should already know) at which level you need to be working (code units, code points, or graphemes), so you hardly need the default iteration to be code points (just write .byCodePoint for clarity). It doesn't make sense to have Phobos iterate over code points *by default* when it's not the common use case, represents a hidden performance hit, and in spite of that is still not 100% correct anyway.


T

-- 
Век живи - век учись. А дураком помрёшь. (Live a century, learn a century - and you'll still die a fool.)
Aug 13 2019
parent "Nick Sabalausky (Abscissa)" <SeeWebsiteToContactMe semitwist.com> writes:
On 8/13/19 2:11 PM, H. S. Teoh wrote:
 But if you're working only with code points,
 then auto-decoding works.
Albeit much slower than necessary in most cases...
Aug 15 2019
prev sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis 
wrote:
 [snip]

 It's not on the e in both of them. It's on the e on the second 
 line of the "expected" output, but it's on the T in the second 
 line of the "actual" output.

 - Jonathan M Davis
On my machine & browser, it looks like it is on the e on both.
Aug 13 2019
next sibling parent Dukc <ajieskola gmail.com> writes:
On Tuesday, 13 August 2019 at 18:24:23 UTC, jmh530 wrote:
 On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis 
 wrote:
 [snip]

 It's not on the e in both of them. It's on the e on the second 
 line of the "expected" output, but it's on the T in the second 
 line of the "actual" output.

 - Jonathan M Davis
On my machine & browser, it looks like it is on the e on both.
And for me, both are on the e on Windows but the bottom one is on the T on Linux, in the same browser (Firefox)!
Aug 13 2019
prev sibling next sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Tuesday, 13 August 2019 at 18:24:23 UTC, jmh530 wrote:
 On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis 
 wrote:
 [snip]

 It's not on the e in both of them. It's on the e on the second 
 line of the "expected" output, but it's on the T in the second 
 line of the "actual" output.

 - Jonathan M Davis
On my machine & browser, it looks like it is on the e on both.
You're not alone, on my firefox on windows 10 pro the accents are both on the e.
Aug 13 2019
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Aug 13, 2019 at 06:24:23PM +0000, jmh530 via Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 16:58:38 UTC, Jonathan M Davis wrote:
 [snip]
 
 It's not on the e in both of them. It's on the e on the second line
 of the "expected" output, but it's on the T in the second line of
 the "actual" output.
 
 - Jonathan M Davis
On my machine & browser, it looks like it is on the e on both.
Probably what Jonathan said about the browser munging the Unicode.

Unicode is notoriously hard to process correctly, and I wouldn't be surprised if the majority of applications out there actually don't handle it correctly in all cases. The whole auto-decoding deal is a prime example of this: even an expert programmer like Andrei fell into the wrong assumption that code point == grapheme. I have no confidence that less capable programmers, who form the majority of today's programmers and write the bulk of the industry's code, are any more likely to get it right. (For years I myself didn't even know there was such a thing as "graphemes".)

In fact, almost every day I see "enterprise" code that commits atrocities against Unicode -- because QA hasn't thought to pass a *real* Unicode string as test input yet. The day the idea occurs to them, a LOT of code (and I mean a LOT) will need to be rewritten, probably from scratch.


T

-- 
"Real programmers can write assembly code in any language. :-)" -- Larry Wall
Aug 13 2019
prev sibling next sibling parent a11e99z <black80 bk.ru> writes:
On Tuesday, 13 August 2019 at 16:51:57 UTC, jmh530 wrote:
 On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
 [snip]

 The location of the acute accent on the second line is wrong.
I'm still confused... What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
An accent in a wchar array can look like приве'т - an accent on the vowel 'е', two glyphs combined into one. But in reverse it can become т'евирп - an accent on the consonant 'т', which is wrong: accents can only go on vowels (for RU). It's the Russian language, which has no additions to glyphs, but other languages do, and one letter can be represented as 2 wchars or just 1, depending on the editor, and it looks the same with the same meaning.

OT: about Russian (Cyrillic) letters with additions:
Е and Ё are 2 different letters, both vowels, sometimes interchangeable (when you cannot find the letter Ё on the keyboard you can use Е).
И and Й are 2 different letters, a vowel and a consonant, not interchangeable; they are totally different.
Ё (upper) ё (lower), Й (upper) й (lower) are letters where the addition is part of the letter itself; it cannot be separated.
Transcriptions: Е is "jˈe", Ё is "jˈɵ", И is "ˈi", Й is "j".
Russian has no additions to glyphs except the accent used to show the right reading of an unknown word, or in dictionaries like Oxford, Wiki etc. Many European languages - Latin or Cyrillic - can have additions to letters; idk their meanings.
Aug 13 2019
prev sibling parent dangbinghoo <dangbinghoo gmail.com> writes:
On Tuesday, 13 August 2019 at 16:51:57 UTC, jmh530 wrote:
 On Tuesday, 13 August 2019 at 16:36:16 UTC, H. S. Teoh wrote:
 [snip]

 The location of the acute accent on the second line is wrong.


 T
I'm still confused... What I was first confused about was that the second line of the expected output looks exactly the same as the second line of the actual output. However, you seemed to have indicated that is a problem. From your follow-up post, I'm still confused because the accent seems to be on the "e" on both of them. Isn't that where it's supposed to be?
we can take Chinese characters as an example; then it's clear:

```
writeln("汉语&中国🇨🇳".retro);
writeln("汉字🐠中国🇨🇳".retro);
```

expected:

🇨🇳国中&语汉
🇨🇳国中🐠字汉

actual:

🇳🇨国中&语汉
🇳🇨国中🐠字汉

--
binghoo dang
Aug 13 2019
prev sibling parent reply matheus <matheus gmail.com> writes:
On Tuesday, 13 August 2019 at 16:29:33 UTC, jmh530 wrote:
 On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
 ...
 Expected output:
 	тевирп
 	те́вирп

 Actual output:
 	тевирп
 	т́евирп
Huh, those two look the same.
Copy and paste the Expected and Actual output into notepad and you will see the difference, or just take a look at the HTML page source in your browser (search for "Expected output"):

<span class="forum-quote-prefix">&gt; </span>Expected output:
<span class="forum-quote-prefix">&gt; </span>	тевирп
<span class="forum-quote-prefix">&gt; </span>	те́вирп
<span class="forum-quote-prefix">&gt;</span>
<span class="forum-quote-prefix">&gt; </span>Actual output:
<span class="forum-quote-prefix">&gt; </span>	тевирп
<span class="forum-quote-prefix">&gt; </span>	т́евирп

For me it shows the difference pretty clearly.

Matheus.
Aug 13 2019
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 13 August 2019 at 19:10:17 UTC, matheus wrote:
 [snip]

 Copy and paste Expected and Actual output on notepad and you 
 will see the difference, or just take a look at the HTML page 
 source on your browser (Search for Expected Output):

 <span class="forum-quote-prefix">&gt; </span>Expected output:
 <span class="forum-quote-prefix">&gt; </span>	тевирп
 <span class="forum-quote-prefix">&gt; </span>	те́вирп
 <span class="forum-quote-prefix">&gt;</span>
 <span class="forum-quote-prefix">&gt; </span>Actual output:
 <span class="forum-quote-prefix">&gt; </span>	тевирп
 <span class="forum-quote-prefix">&gt; </span>	т́евирп

 For me it shows the difference pretty clear.

 Matheus.
Interestingly enough, what you have there does not look any different. However, if I actually do what you say and paste it into notepad or something, then it does look different.
Aug 13 2019
prev sibling parent reply matheus <matheus gmail.com> writes:
On Tuesday, 13 August 2019 at 19:10:17 UTC, matheus wrote:
 ...
Like others said, you may not be able to see it through the browser, because the renderer may "fix" this.

Here is how it looks through HTML code inspection:

https://i.imgur.com/e57wCZp.png

Notice the position of the character '´'.

Matheus.
Aug 13 2019
parent "Nick Sabalausky (Abscissa)" <SeeWebsiteToContactMe semitwist.com> writes:
On 8/13/19 3:17 PM, matheus wrote:
 On Tuesday, 13 August 2019 at 19:10:17 UTC, matheus wrote:
 ...
Like others said you may not be able to see through the Browser, because the render may "fix" this.
Jesus, haven't browser devs learned *ANYTHING* from their very own, INFAMOUS, "Let's completely fuck up 'the reliability principle'" debacle? I guess not. Cult of the amateurs wins out again...
Aug 15 2019
prev sibling parent reply Argolis <argolis gmail.com> writes:
On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:

 But we can't make that the default because it's a big 
 performance hit, and many string algorithms don't actually need 
 grapheme segmentation.
Can you provide examples of algorithms and use cases that don't need grapheme segmentation? Are they really SO common that the correct default is to go for code points? Is it not better to have grapheme segmentation, the correct way of handling a string, as the default instead?
Aug 14 2019
next sibling parent reply Gregor Mückl <gregormueckl gmx.de> writes:
On Wednesday, 14 August 2019 at 07:15:54 UTC, Argolis wrote:
 On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:

 But we can't make that the default because it's a big 
 performance hit, and many string algorithms don't actually 
 need grapheme segmentation.
Can you provide example of algorithms and use cases that don't need grapheme segmentation? Are they really SO common that the correct default is go for code points? Is it not better to have as a default the grapheme segmentation, the correct way of handling a string, instead?
There is no single universally correct way to segment a string. Grapheme segmentation requires a correct assumption of the text encoding in the string and also the assumption that the encoding is flawless. Neither may be guaranteed in general. There are a lot of ways to corrupt UTF-8 strings, for example.

And then there is the question of the length of a grapheme: IIRC they can consist of up to 6 or 7 code points, with each of them encoded in a varying number of bytes in UTF-8, UTF-16 or UCS-2. So what data type do you use for representing graphemes that is both not wasteful and doesn't require dynamic memory management?

Then there are other nasty quirks around graphemes: their encoding is not unique. This Unicode TR gives a good impression of how complex this single aspect is: https://unicode.org/reports/tr15/

So if you want to use graphemes, do you want to keep the original encoding or do you implicitly convert them to NFC or NFD? NFC tends to be better for language processing, NFD tends to be better for text rendering (with exceptions). If you don't normalize, semantically equivalent graphemes may not be equal under comparison.

At this point you're probably approaching the complexity of libraries like ICU. You can take a look at it if you want a good scare. ;)
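
A small sketch of that last point, assuming std.uni's byGrapheme: without normalization, two encodings of the same character do not compare equal even after grapheme segmentation.

```
import std.algorithm : equal;
import std.uni : byGrapheme;

void main()
{
    // The same text in two encodings: precomposed ö vs. o + combining diaeresis.
    auto a = "\u00f6";
    auto b = "\u006f\u0308";

    // Grapheme segmentation alone does not make them compare equal,
    // because without normalization the underlying code points differ.
    assert(!equal(a.byGrapheme, b.byGrapheme));
    assert(!equal(a, b));   // code point comparison differs as well
}
```
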
Aug 14 2019
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Aug 14, 2019 at 09:29:30AM +0000, Gregor Mückl via Digitalmars-d wrote:
[...]
 At this point you're probably approaching the complexity of libraries
 like ICU. You can take a look at it if you want a good scare. ;)
Or, instead of homebrewing your own string-handling algorithms and probably getting it all wrong, actually *use* ICU to handle Unicode strings for you instead. Saves you from writing more code, and from unintentional bugs.


T

-- 
Truth, Sir, is a cow which will give [skeptics] no more milk, and so they are gone to milk the bull. -- Sam. Johnson
Aug 14 2019
prev sibling parent Argolis <argolis gmail.com> writes:
On Wednesday, 14 August 2019 at 09:29:30 UTC, Gregor Mückl wrote:

 There is no single universally correct way to segment a string. 
 Grapheme segmentation requires a correct assumption of the text 
 encoding in the string and also the assumption that the 
 encoding is flawless. Neither may be guaranteed in general. 
 There is a lot of ways to corrupt UTF-8 strings, for example.
Do you mean that there's no way to verify those assumptions? Sorting algorithms in Phobos return a SortedRange.
 And then there is a question of the length of a grapheme: IIRC 
 they can consist of up to 6 or 7 code points with each of them 
 encoded in a varying number of bytes in UTF-8, UTF-16 or UCS-2. 
 So what data type do you use for representing graphemes then 
 that is both not wasteful and doesn't require dynamic memory 
 management?
Is performance the rationale for not using dynamic memory management, if that is unavoidable for correct behaviour?
 Then there are other nasty quirks around graphemes: their 
 encoding is not unique. This Unicode TR gives a good impression 
 of how complex this single aspect is: 
 https://unicode.org/reports/tr15/
 So if you want to use graphemes, do you want to keep the 
 original encoding or do you implicitly convert them to NFC or 
 NFD? NFC tends to be better for language processing, NFD tends 
 to be better for text rendering (with exceptions). If you don't 
 normalize, semantically equivalent graphemes may not be equal 
 under comparison.
Is performance the rationale for not using normalisation, which solves all the problems you have mentioned above?
 At this point you're probably approaching the complexity of 
 libraries like ICU. You can take a look at it if you want a 
 good scare. ;)
The original question is still not answered: can you provide examples of algorithms and use cases that don't need grapheme segmentation?
Aug 15 2019
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Aug 14, 2019 at 07:15:54AM +0000, Argolis via Digitalmars-d wrote:
 On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
 
 But we can't make that the default because it's a big performance
 hit, and many string algorithms don't actually need grapheme
 segmentation.
Can you provide example of algorithms and use cases that don't need grapheme segmentation?
Most cases of string processing involve:

- Taking substrings: does not need grapheme segmentation; you just slice the string.

- Copying one string to another: does not need grapheme segmentation, you just use memcpy (or equivalent).

- Concatenating n strings: does not need grapheme segmentation, you just use memcpy (or equivalent). In D, you just use array append, or std.array.appender if you get fancy.

- Comparing one string to another: does not need grapheme segmentation; you either use strcmp/memcmp, or if you need more delicate semantics, call one of the standard Unicode string collation algorithms (std.uni, meaning, your code does not need to worry about grapheme segmentation, and besides, Unicode collation algorithms operate at the code point level, not at the grapheme level).

- Matching a substring: does not need grapheme segmentation; most applications just need subarray matching, i.e., treat the substring as an opaque blob of bytes, and match it against the target. If you need more delicate semantics, there are standard Unicode algorithms for substring matching (i.e., user code does not need to worry about the low-level details -- the inputs are basically opaque Unicode strings whose internal structure is unimportant).

You really only need grapheme segmentation when:

- Implementing a text layout algorithm where you need to render glyphs to some canvas. Usually, this is already taken care of by the GUI framework or the terminal emulator, so user code rarely has to worry about this.

- Measuring the size of some piece of text for output alignment purposes: in this case, grapheme segmentation isn't enough; you need font size information and other such details (like kerning, spacing parameters, etc.). Usually, you wouldn't write this yourself, but use a text rendering library. So most user code doesn't actually have to worry about this. (Note that iterating by graphemes does NOT give you the correct value for width even with a fixed-width font in a text mode terminal emulator, because there are such things as double-width characters in Unicode, which occupy two cells each. And also zero-width characters which count as distinct (empty) graphemes, but occupy no space.)

And as an appendix, the way most string processing code is done in C/C++ (iterate over characters) is actually wrong w.r.t. Unicode, because it's really only reliable for ASCII inputs. For "real" Unicode strings, you can't really get away with the "character by character" approach, even if you use grapheme segmentation: in some writing systems like Arabic, breaking up a string like this can cause incorrect behaviour like breaking ligatures, which may not be intended. For this sort of operation the application really needs to be using the standard Unicode algorithms that depend on the *purpose* of the function, not the mechanics of iterating over characters, e.g., find suitable line breaks, find suitable hyphenation points, etc. There's a reason the Unicode Consortium defines standard algorithms for these operations: it's because naïvely iterating over graphemes, in general, does *not* yield the correct results in all cases.

Ultimately, the whole point behind removing autodecoding is to put the onus on the user code to decide what kind of iteration it wants: code units, code points, or graphemes. (Or just use one of the standard algorithms and don't reinvent the square wheel.)
 Are they really SO common that the correct default is go for code
 points?
The whole point behind removing autodecoding is so that we do NOT default to code points, which is currently the default. We want to put the choice in the user's hand, not silently default to iteration by code point under the illusion of correctness, which is actually incorrect for non-trivial inputs.
 Is it not better to have as a default the grapheme segmentation, the
 correct way of handling a string, instead?
Grapheme segmentation is very complex, and therefore, very slow. Most string processing doesn't actually need grapheme segmentation. Setting that as the default would mean D string processing would be excruciatingly slow by default, and furthermore all that extra work would be mostly for nothing because most of the time we don't need it anyway.

Not to repeat that most naïve iterations over graphemes actually do *not* yield what one might think is the correct result. For example, measuring the size of a piece of text in a fixed-width font in a text-mode terminal by counting graphemes is actually wrong, due to double-width and zero-width characters.


T

-- 
The most powerful one-line C program: #include "/dev/tty" -- IOCCC
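
A short sketch (not part of the original post) of the three different "lengths" in play for one such string - and note that none of them is the on-screen width:

```
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;

void main()
{
    auto s = "приве\u0301т";   // six Cyrillic letters plus one combining accent

    writefln("code units:  %s", s.length);                // 14 (UTF-8 bytes)
    writefln("code points: %s", s.walkLength);            // 7 (the autodecoded default)
    writefln("graphemes:   %s", s.byGrapheme.walkLength); // 6 (user-perceived characters)
    // None of these counts is the number of terminal columns the text occupies.
}
```
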
Aug 14 2019
parent reply Argolis <argolis gmail.com> writes:
On Wednesday, 14 August 2019 at 17:12:00 UTC, H. S. Teoh wrote:

 - Taking substrings: does not need grapheme segmentation; you 
 just slice the string.
What is the use case of slicing some multi-codeunit encoded grapheme in the middle?
 - Copying one string to another: does not need grapheme 
 segmentation, - you just use memcpy (or equivalent).
 - Concatenating n strings: does not need grapheme segmentation, 
 you just use memcpy (or equivalent).  In D, you just use array 
 append,  or  std.array.appender if you get fancy.
That use case is not string processing, but general memory handling of an opaque type
 - Comparing one string to another: does not need grapheme  
 segmentation;
   you either use strcmp/memcmp
That use case is not string processing, but general memory comparison of an opaque type
, or if you need more delicate semantics,
 call one of the standard Unicode string collation algorithms 
 (std.uni, meaning, your code does not need to worry about 
 grapheme segmentation, and besides, Unicode collation 
 algorithms operate at the code point  level, not at the 
 grapheme level).
So this use case needs proper handling of encoded code units, and can't be satisfied simply by removing auto decoding
 - Matching a substring: does not need grapheme segmentation;  
 most
   applications just need subarray matching, i.e., treat the  
 substring as
   an opaque blob of bytes, and match it against the target.
That use case is not string processing, but general memory comparison of an opaque type
 If  you need more delicate semantics, there are standard 
 Unicode  algorithms for
 substring matching (i.e., user code does not need to worry 
 about the low-level details -- the inputs are basically opaque 
 Unicode strings whose internal structure is unimportant).
Again, removing auto decoding does not change anything for that.
 You really only need grapheme segmentation when:
 - Implementing a text layout algorithm where you need to render 
 glyphs
 to some canvas.
 - Measuring the size of some piece of text for output alignment
   purposes: in this case, grapheme segmentation isn't enough; 
 you need font size information and other such details (like 
 kerning, spacing parameters, etc.).
What about all the examples above in the thread about the ways auto decoding currently works incorrectly? retro, correct substring slicing, correct indexing, et cetera
 Ultimately, the whole point behind removing autodecoding is to 
 put the onus on the user code to decide what kind of iteration 
 it wants: code units, code points, or graphemes. (Or just use 
 one of the standard algorithms and don't reinvent the square 
 wheel.)
There will always be a default way to iterate, see below
 Are they really SO common that the correct default is go for 
 code points?
The whole point behind removing autodecoding is so that we do NOT default to code points, which is currently the default. We want to put the choice in the user's hand, not silently default to iteration by code point under the illusion of correctness, which is actually incorrect for non-trivial inputs.
The illusion of correctness should be turned into correctness, then.
 Is it not better to have as a default the grapheme 
 segmentation, the correct way of handling a string, instead?
Grapheme segmentation is very complex, and therefore, very slow. Most string processing doesn't actually need grapheme segmentation.
Can you provide a string processing example that doesn't need grapheme segmentation? The examples listed above are not string processing examples.
 Setting that as the default would mean D string processing will 
 be excruciatingly slow by default, and furthermore all that 
 extra work will be mostly for nothing because most of the time 
 we don't need it anyway.
From the examples above, most of the time you simply need opaque memory management, decaying the string/dstring/wstring to a binary blob, but that's not string processing.

My (refined) point still stands: can you provide examples of (text processing) algorithms and use cases that don't need grapheme segmentation?
Aug 15 2019
next sibling parent nkm1 <t4nk074 openmailbox.org> writes:
On Thursday, 15 August 2019 at 11:02:54 UTC, Argolis wrote:

 My (refined) point still stands: can you provide example of 
 (text processing) algorithms and use cases that don't need 
 grapheme segmentation?
Parsing XML, HTML and other such things is what people usually have in mind. In general, all sorts of text where human-readable parts are interleaved with (easier to handle) machine instructions.
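
A small sketch of why that works at the code-unit level (the XML snippet and tag name here are made up for illustration): in UTF-8, no code unit of a multi-byte sequence can equal an ASCII byte, so ASCII delimiters like '<' and '>' can be located without decoding the human-readable payload.

```
import std.algorithm : find, until;
import std.array : array;
import std.stdio : writeln;
import std.utf : byCodeUnit;

void main()
{
    // Extract the text between the first '>' and the next '<' by scanning
    // code units only; the Cyrillic payload never needs to be decoded.
    auto src = "<greeting>Привет, Мир!</greeting>";

    auto rest = src.byCodeUnit.find('>');
    rest.popFront();                        // skip the '>' itself
    auto content = rest.until('<').array;   // the element text, as code units
    writeln(content);                       // Привет, Мир!
}
```
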
Aug 15 2019
prev sibling parent reply Gregor Mückl <gregormueckl gmx.de> writes:
On Thursday, 15 August 2019 at 11:02:54 UTC, Argolis wrote:
 From the examples above, most of the time you simply need 
 opaque memory management, so decaying the 
 string/dstring/wstring to a binary blob, but that's not string 
 processing
This is the point we're trying to get across to you: this isn't sufficient. Depending on the context and the script/language, you need access to the string at various levels. E.g. a font renderer sometimes needs to iterate code points, not graphemes, in order to compose the correct glyphs.

Binary blob comparisons for comparing strings are *also* not sufficient, again depending on both the script/language of the text in the string and the context in which the comparison is performed. If the comparison is to be purely semantic, the following strings should be equal: "\u00f6" and "\u006f\u0308". They both represent the same "Latin Small Letter O with Diaeresis". Their in-memory representations clearly aren't equal, so a memcmp won't yield the correct result. The same applies to sorting.

If you decide to force a specific string normalization internally, you put the burden on the user to explicitly select a different normalization when they require it. Plus, there is no way to perfectly reconstruct the input binary representation of a string, e.g. when it was given in a non-normalized form (e.g. a mix of NFC and NFD). Once such a string is through a normalization algorithm, the exact input is unrecoverable. This makes interfacing with other code that has idiosyncrasies around all of this hard to impossible to achieve. One such system that I worked on in the past was a small embedded microcontroller-driven HCI module with very limited capabilities, but with the requirement to be multilingual. I carefully worked out that for the languages that were required, a UTF-8 encoding with a very specific normalization would just about work. This choice was viable because the user interface was created in a custom tool where I could control the code and data generation just enough to make it work.

Another case where normalization is troublesome is ligatures. Ligatures that are purely stylistic like "ff", "ffi", "fft", "st", "ct" etc. have their own code points. Yet it is a purely stylistic choice whether to use them. So in terms of the contained text, the ligature \ufb00 is equal to the string "ff", but it is not the same grapheme. Whether you can normalize this depends on the context. The user may have selected the ligature representation deliberately to have it appear as such on screen. If you want to do spell checking on the other hand, you would need to resolve the ligature to its individual letters.

And then there is Hangul: this is a prime example of a writing system that is "weird" to westerners. It is based on 40 symbols (19 consonants, 21 vowels) which aren't written individually, but merged syllable by syllable into rectangular blocks of two or three such symbols. These symbols get arranged in different layouts depending on which symbols there are in a syllable. As far as I understand, this follows a clear algorithm. This results in approximately 6500 individual graphemes that are actually written. Yet each of these is a group of two or three letters and parsed as such. So depending on whether you're interested in individual letters or syllables, you need to use a different string representation for processing that language.

OK, these are all just examples that come to my mind while brainstorming the question a little bit. However, none of us are experts in language processing, so whatever examples we can come up with are very likely just the very tip of the iceberg.

There is a reason why libraries like ICU give the user a lot of control over string handling and expose a lot of variants of functions depending on the user intent and context. This design rests on a lot of expert knowledge that we don't have, but we know that it is sound. Going against that wisdom is inviting trouble. Autodecoding is an example of doing just that.
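
A small sketch of that comparison problem and the normalization answer, assuming std.uni's normalize and its NFC/NFD normalization forms:

```
import std.stdio : writeln;
import std.uni;   // normalize, NFC, NFD

void main()
{
    string precomposed = "\u00f6";        // ö as one code point
    string decomposed  = "\u006f\u0308";  // o + combining diaeresis

    assert(precomposed != decomposed);    // bytewise they differ...
    assert(normalize!NFC(precomposed) == normalize!NFC(decomposed)); // ...NFC makes them comparable

    writeln(normalize!NFD(precomposed).length);  // 3: NFD re-expands the ö into two code points
}
```
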
Aug 15 2019
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
In my not-so-humble opinion, the introduction of "normalization" to Unicode was 
a huge mistake. It's not necessary and causes nothing but grief. They should 
have consulted with me first :-)
Aug 15 2019
next sibling parent reply Gregor Mückl <gregormueckl gmx.de> writes:
On Thursday, 15 August 2019 at 19:11:14 UTC, Walter Bright wrote:
 In my not-so-humble opinion, the introduction of 
 "normalization" to Unicode was a huge mistake. It's not 
 necessary and causes nothing but grief. They should have 
 consulted with me first :-)
I am not sure that you can go entirely without normalization for all languages in existence. But Unicode conflates semantic representation and rendering in ways that are effectively layering violations. The LTR and RTL control characters are nice examples of that. Why should a Unicode string be able to specify the displayed direction of the script? The same goes for the stylistic ligatures I pointed out. These should be handled exclusively by the font rendering subsystem. There's a substitution table in OpenType for that, FFS!

Well, I guess that Unicode is the best we have despite all this maddening cruft. Attempting to do better would just result in text encoding "standard" N+1. And we know how much the world needs that. ;)
Aug 15 2019
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 12:38 PM, Gregor Mückl wrote:
 I am not sure that you can go entirely without normalization for all languages 
 in existence. But Unicode conflates semantic representation and rendering in 
 ways that are effectively layering violations. The LTR and RTL control 
 characters are nice examples of that. Why should a Unicode string be able to 
 specify the displayed direction of the script? The same goes for the stylistic 
 ligatures I pointed out. These should be handled exclusively by the font 
 rendering subsystem. There's a substitution table in OpenType for that, FFS!
Unicode also fouled up by adding semantic information that is invisible to the rendering. It should have stuck with the Unicode<=>print round-trip not losing information. Naturally, people have already used such to trick people, track people, etc.

Another thing I hate about Unicode is there are articles about how people can get their vanity symbol into Unicode! And they do. They invent glyphs, and get them in. This goes on all the time.

Unicode started out as a cool idea, and turned rather quickly into a cesspool.
Aug 15 2019
parent reply ag0aep6g <anonymous example.com> writes:
On 15.08.19 21:54, Walter Bright wrote:
 Unicode also fouled up by adding semantic information that is invisible 
 to the rendering. It should have stuck with the Unicode<=>print 
 round-trip not losing information.
 
 Naturally, people have already used such to trick people, track people, 
 etc.
'I' and 'l' are (virtually) identical in many fonts.
Aug 15 2019
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 15, 2019 at 11:38:08PM +0200, ag0aep6g via Digitalmars-d wrote:
 On 15.08.19 21:54, Walter Bright wrote:
 Unicode also fouled up by adding semantic information that is
 invisible to the rendering. It should have stuck with the
 Unicode<=>print round-trip not losing information.
 
 Naturally, people have already used such to trick people, track
 people, etc.
'I' and 'l' are (virtually) identical in many fonts.
And 0 and O are also identical in many fonts. But none of us would seriously entertain the idea that O and 0 ought to be the same character.


T

-- 
Indifference will certainly be the downfall of mankind, but who cares? -- Miquel van Smoorenburg
Aug 15 2019
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 2:38 PM, ag0aep6g wrote:
 On 15.08.19 21:54, Walter Bright wrote:
 Unicode also fouled up by adding semantic information that is invisible to the 
 rendering. It should have stuck with the Unicode<=>print round-trip not losing 
 information.

 Naturally, people have already used such to trick people, track people, etc.
'I' and 'l' are (virtually) identical in many fonts.
That's a problem with some fonts, not the concept. When such fonts are used, the distinguishment comes from the context, not the symbol itself. On the other hand, the Unicode spec itself routinely shows identical glyphs for different code points.

Consider also:

    (800)555-1212

You know it's a phone number, because of the context. The digits used in it are NOT actually numbers; they do not have any mathematical properties. Should Unicode have a separate code point for these?

The point is, the meaning of the symbol comes from its context, not the symbol itself. This is the fundamental error Unicode made.
Aug 15 2019
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 15, 2019 at 03:21:32PM -0700, Walter Bright via Digitalmars-d wrote:
 On 8/15/2019 2:38 PM, ag0aep6g wrote:
 On 15.08.19 21:54, Walter Bright wrote:
 Unicode also fouled up by adding semantic information that is
 invisible to the rendering. It should have stuck with the
 Unicode<=>print round-trip not losing information.
 
 Naturally, people have already used such to trick people, track
 people, etc.
'I' and 'l' are (virtually) identical in many fonts.
That's a problem with some fonts, not the concept. When such fonts are used, the distinguishment comes from the context, not the symbol itself.
And there you go: you're basically saying that "symbol" is different from "glyph", and therefore, you're contradicting your own axiom that character == glyph. "Symbol" is basically an abstract notion of a character that exists *apart from the glyph used to render it*.

And now that you agree that character encoding should be based on "symbol" rather than "glyph", the next step is the realization that, in the wide world of international languages out there, there exist multiple "symbols" that are rendered with the *same* glyph. This is a hard fact of reality, and no matter how you wish it to be otherwise, it simply ain't so. Your ideal of "character == glyph" simply doesn't work in real life.


T

-- 
There's light at the end of the tunnel. It's the oncoming train.
Aug 15 2019
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 3:56 PM, H. S. Teoh wrote:
 And now that you agree that character encoding should be based on
 "symbol" rather than "glyph", the next step is the realization that, in
 the wide world of international languages out there, there exist
 multiple "symbols" that are rendered with the *same* glyph.  This is a
 hard fact of reality, and no matter how you wish it to be otherwise, it
 simply ain't so.  Your ideal of "character == glyph" simply doesn't
 work in real life.
Splitting semantic hares is pointless, as the fact remains it worked just fine in real life before Unicode, it's called "printing" on paper. As for not working in real life, that's Unicode.
Aug 15 2019
parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 16 August 2019 at 06:28:30 UTC, Walter Bright wrote:
 On 8/15/2019 3:56 PM, H. S. Teoh wrote:
 And now that you agree that character encoding should be based 
 on
 "symbol" rather than "glyph", the next step is the realization 
 that, in
 the wide world of international languages out there, there 
 exist
 multiple "symbols" that are rendered with the *same* glyph.  
 This is a
 hard fact of reality, and no matter how you wish it to be 
 otherwise, it
 simply ain't so.  Your ideal of "character == glyph" simply 
 doesn't
 work in real life.
Splitting semantic hares is pointless, as the fact remains it worked just fine in real life before Unicode, it's called "printing" on paper.
Sorry, no, it didn't work in reality before Unicode. Multi-language systems were a mess.

My job is on the biggest translation memory in the world, the Euramis system of the European Union, and when I started there in 2002, the system supported only 11 languages. The data in the Oracle database was already in Unicode, but the whole supporting translation chain was codepage based. It was a catastrophe and the amount of crap, especially in Greek data, was staggering. The issues H.S.Teoh described above were indeed a real pain point. In Greek text it was very frequent to have Latin characters mixed with Greek characters from codepage 1253. Was that A an alpha or a \x41? This crap made a lot of algorithms that were used downstream from the database (CAT tools, automatic translation etc.) go completely bonkers.

For the 2004 extension of the EU we had to support one more alphabet (Cyrillic for Bulgarian) and 4 more codepages (CP-1250 Latin-2 Extended-A, CP-1251 Cyrillic, CP-1257 Baltic and ISO-8859-3 Maltese). It would have been such a mess that we decided to convert everything to Unicode. We don't have that crap data anymore. Our code is not perfect, far from it, but adopting Unicode through and through and dropping all support for the old coding crap simplified our lives tremendously.

When we got the request in 2010 from the EEAS (European External Action Service) to also support languages other than the 24 official EU languages, i.e. Russian, Arabic and Chinese, we didn't break a sweat to implement it, thanks to Unicode.
 As for not working in real life, that's Unicode.
Unicode works much, much better than anything that existed before. The issue is that not a lot of people work in a multi-language environment, so they have no clue of the unholy mess it was before.
Aug 16 2019
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2019 2:20 AM, Patrick Schluter wrote:
 Sorry, no it didn't work in reality before Unicode. Multi language system were
a 
 mess.
I have several older books that move facilely between multiple languages. It's not a mess. Since the reader can figure all this out without invisible semantic information in the glyphs, that invisible information is not necessary. Once you print/display the Unicode string, all that semantic information is gone. It is not needed.
 Unicode works much, much better than anything that existed before. The issue
is 
 that not a lot of people work in a multi-language environment and don't have a 
 clue of the unholy mess it was before.
Actually, I do. Zortech C++ supported multiple code pages, multiple multibyte encodings, and had error messages in 4 languages. Unicode, in its original vision, solved those problems.
Aug 16 2019
next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 16 August 2019 at 09:34:21 UTC, Walter Bright wrote:
 On 8/16/2019 2:20 AM, Patrick Schluter wrote:
 Sorry, no it didn't work in reality before Unicode. Multi 
 language system were a mess.
I have several older books that move facilely between multiple languages. It's not a mess. Since the reader can figure all this out without invisible semantic information in the glyphs, that invisible information is not necessary.
Unicode's purpose is not limited to the output at the end of the processing chain. It's the whole processing chain that is the point.
 Once you print/display the Unicode string, all that semantic 
 information is gone. It is not needed.
As said, printing is only a minor part of language processing. To give an example from the EU again, just to illustrate: we have exactly three laser printers (one is a photocopier) on each floor of our offices. You may say, "oh, you're the IT guys, you don't need to print that much", to which I respond that half of the floor is populated with the English translation unit, and while they indeed use the printers more than us, printing is not a significant part of their workflow.
 Unicode works much, much better than anything that existed 
 before. The issue is that not a lot of people work in a 
 multi-language environment and don't have a clue of the unholy 
 mess it was before.
Actually, I do. Zortech C++ supported multiple code pages, multiple multibyte encodings, and had error messages in 4 languages.
Each string was in its own language. We have to deal with texts that are in mixed languages. Sentences in Bulgarian with an office address in Greece, embedded in an XML file. Codepages don't work in that case, or you have to introduce an escaping scheme much more brittle and annoying than UTF-8 or UTF-16 encoding. The European Parliament's session logs are what are called panaché documents, i.e. the transcripts are in the native language of the intervening MEPs. So, completely mixed documents.
 Unicode, in its original vision, solved those problems.
Unicode is not perfect, and indeed the emoji stuff is crap, but Unicode is better than what was used before. And to insist again, Unicode is mostly about "DATA PROCESSING". Sometimes it might produce a human-readable result, but that is only one part of its purpose.
Aug 16 2019
next sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Friday, August 16, 2019 4:32:06 AM MDT Patrick Schluter via Digitalmars-d 
wrote:
 Unicode, in its original vision, solved those problems.
Unicode is not perfect and indeed the crap with emoji is crap, but Unicode is better than what was used before. And to insist again, Unicode is mostly about "DATA PROCESSING". Sometime it might result to a human readable result, but that is only one part of its purpose.
I don't think that anyone is arguing that Unicode is worse than what we had before. The problem is that there are aspects of Unicode that are screwed up, making it far worse to deal with than it should be. We'd be way better off if those mistakes had not been made. So, we're better off than we were but also definitely worse off than we should be. - Jonathan M Davis
Aug 16 2019
prev sibling next sibling parent reply Abdulhaq <alynch4047 gmail.com> writes:
On Friday, 16 August 2019 at 10:32:06 UTC, Patrick Schluter wrote:
 On Friday, 16 August 2019 at 09:34:21 UTC, Walter Bright wrote:
 [...]
Unicode's purpose is not limited to the output at the end the processing chain. It's the whole processing chain that is the point.
 [...]
As said, printing is only a minor part of language processing. To give an example from the EU again, and just to illustrate, we have exactly three laser printer (one is a photocopier) on each floor of our offices. You may say; o you're the IT guys, you don't need to print that much, to which I respond, half of the floor is populated with the english translation unit and while they indeed use the printers more than us, it is not a significant part of their workflow.
 [...]
Each string was in its own language. We have to deal with texts that are mixed languages. Sentences in Bulgarian with an office address in Greece, embedded in a xml file. Codepages don't work in that case, or you have to introduce an escaping scheme much more brittle and annoying than utf-8 or utf-16 encoding. European Parliament's session logs are what is called panaché documents, i.e. the transcripts are in native language of intervening MEP's. So completely mixed documents.
 [...]
Unicode is not perfect and indeed the crap with emoji is crap, but Unicode is better than what was used before. And to insist again, Unicode is mostly about "DATA PROCESSING". Sometime it might result to a human readable result, but that is only one part of its purpose.
These are great examples and I totally agree with you (and HS Teoh). It's no coincidence that those people who can read, write and speak more than one language with more than one script are those who think Unicode is beneficial. It seems that those who are stuck in the world of anglo/latin characters just don't have the experience required to understand why their simpler schemes won't work.
Aug 16 2019
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Aug 16, 2019 at 04:41:01PM +0000, Abdulhaq via Digitalmars-d wrote:
[...]
 It's no coincidence that those people who can read, write and speak
 more than one language with more than one script are those who think
 Unicode is beneficial.
To be clear, there are aspects of Unicode that I don't agree with. But what Walter is proposing (1 glyph == 1 character) simply does not work. It fails to handle the inherent complexities of working with multi-lingual strings.
 It seems that those who are stuck in the world of anglo/latin
 characters just don't have the experience required to understand why
 their simpler schemes won't work.
Walter claims to have experience working with code translated into 4 languages. I suspect (Walter please correct me if I'm wrong) that it mostly just involved selecting a language at the beginning of the program, and substituting strings with translations into said language during output. If this is the case, his stance of 1 glyph == 1 character makes sense, because that's all that's needed to support this limited functionality. Where this scheme falls down is when you need to perform automatic processing of multi-lingual strings -- an unavoidable inevitability in this day and age of global communications. It makes no sense for a single letter to have two different encodings just because your user decided to use a different font, but that's exactly what Walter is proposing -- I wonder if he realizes that. T -- Written on the window of a clothing store: No shirt, no shoes, no service.
Aug 16 2019
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2019 3:32 AM, Patrick Schluter wrote:
 Unicode is not perfect and indeed the crap with emoji is crap, but Unicode is 
 better than what was used before.
I'm not arguing otherwise.
 And to insist again, Unicode is mostly about "DATA PROCESSING". Sometime it 
 might result to a human readable result, but that is only one part of its
purpose.
And that's mission creep, which came later and should not have occurred. With such mission creep, there will be no end of intractable problems. People assign new semantic meanings to characters all the time. Trying to embed that into Unicode is beyond impractical. To repeat an example:

    a + b = c

Why not have special Unicode code points for when letters are used as mathematical symbols?

    18004775555

Maybe some special Unicode code points for phone numbers? How about Social Security digits? Credit card digits?
Aug 16 2019
parent reply lithium iodate <whatdoiknow doesntexist.net> writes:
On Friday, 16 August 2019 at 20:14:33 UTC, Walter Bright wrote:
 To repeat an example:

     a + b = c

 Why not have special Unicode code points for when letters are 
 used as mathematical symbols?
Uhm, well, the Unicode block "Mathematical Alphanumeric Symbols" already exists and is basically that.
Aug 16 2019
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2019 1:26 PM, lithium iodate wrote:
 On Friday, 16 August 2019 at 20:14:33 UTC, Walter Bright wrote:
 To repeat an example:

     a + b = c

 Why not have special Unicode code points for when letters are used as 
 mathematical symbols?
Uhm, well, the Unicode block "Mathematical Alphanumeric Symbols" already exists and is basically that.
ye gawds: https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols I see they forgot the phone number code points.
Aug 16 2019
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Aug 16, 2019 at 02:34:21AM -0700, Walter Bright via Digitalmars-d wrote:
[...]
 Once you print/display the Unicode string, all that semantic
 information is gone. It is not needed.
[...] So in other words, we should encode 1, I, |, and l with exactly the same value, because in print, they aII look about the same anyway, and the user is well able to figure out from context which one is meant. After a11, once you print the string the semantic distinction is gone anyway, and human beings are very good at te||ing what was actually intended in spite of the ambiguity. Bye-bye unambiguous D lexer, we hardly knew you; now we need to rewrite you with a context-sensitive algorithm that figures out whether we meant 11, ||, II, or ll in our source code encoded in Walter Encoding. T -- Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
Aug 16 2019
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2019 10:52 AM, H. S. Teoh wrote:
 So in other words, we should encode 1, I, |, and l with exactly the same
 value, because in print, they aII look about the same anyway, and the
 user is well able to figure out from context which one is meant. After
 a11, once you print the string the semantic distinction is gone anyway,
 and human beings are very good at te||ing what was actually intended in
 spite of the ambiguity.
 
 Bye-bye unambiguous D lexer, we hardly knew you; now we need to rewrite
 you with a context-sensitive algorithm that figures out whether we meant
 11, ||, II, or ll in our source code encoded in Walter Encoding.
Fonts people use for programming take pains to distinguish them.
Aug 16 2019
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Aug 16, 2019 at 01:18:54PM -0700, Walter Bright via Digitalmars-d wrote:
 On 8/16/2019 10:52 AM, H. S. Teoh wrote:
 So in other words, we should encode 1, I, |, and l with exactly the
 same value, because in print, they aII look about the same anyway,
 and the user is well able to figure out from context which one is
 meant. After a11, once you print the string the semantic distinction
 is gone anyway, and human beings are very good at te||ing what was
 actually intended in spite of the ambiguity.
 
 Bye-bye unambiguous D lexer, we hardly knew you; now we need to
 rewrite you with a context-sensitive algorithm that figures out
 whether we meant 11, ||, II, or ll in our source code encoded in
 Walter Encoding.
Fonts people use for programming take pains to distinguish them.
So you're saying that what constitutes a "character" should be determined by fonts?? T -- Programming is not just an act of telling a computer what to do: it is also an act of telling other programmers what you wished the computer to do. Both are important, and the latter deserves care. -- Andrew Morton
Aug 16 2019
prev sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Thursday, August 15, 2019 1:11:14 PM MDT Walter Bright via Digitalmars-d 
wrote:
 In my not-so-humble opinion, the introduction of "normalization" to
 Unicode was a huge mistake. It's not necessary and causes nothing but
 grief. They should have consulted with me first :-)
IMHO, the fact that Unicode normalization is a thing is one of those things that proves that Unicode is unnecessarily complex. There should only be a single way to represent a given character. Unfortunately, that's definitely not the way they went, and we suffer that much more because of it. Honestly, I question that very many applications exist which actually handle Unicode fully correctly. Its level of complexity is way past the point that the average programmer has much chance of getting it right. - Jonathan M Davis
Aug 15 2019
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 12:44 PM, Jonathan M Davis wrote:
 There should only be a
 single way to represent a given character.
Exactly. And two glyphs that render identically should be the same code point. After all, when we write:

    a + b = c

we don't use a separate code point for the letters. Also,

    a) one item
    b) another item

we don't use a separate code point, either. I've debated this point with Unicode people, and their arguments for separate glyphs fall to pieces when I point this out.
Aug 15 2019
next sibling parent reply a11e99z <black80 bk.ru> writes:
On Thursday, 15 August 2019 at 19:59:34 UTC, Walter Bright wrote:
 On 8/15/2019 12:44 PM, Jonathan M Davis wrote:
 There should only be a
 single way to represent a given character.
Exactly. And two glyphs that render identically should be the same code point.
If it was not sarcasm: different code points can refer to the same glyph, not vice versa: A(EN,\u0041), A(RU,\u0410), A(EL,\u0391); otherwise sorting for non-English text will not work. Even the ordering (A<B) will be wrong. For example, the RU glyphs ABCEHKMOPTXacepuxy correspond to the following English letters by sound or meaning: AVSENKMORTHaserihu. As you can see, the upper and lower case forms don't even exist as pairs, and they have different meanings.
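A quick D check of that point (the variable names and the sample letters are just for illustration):

import std.algorithm : sort;
import std.stdio : writeln;

void main()
{
    dchar latinA = '\u0041';     // LATIN CAPITAL LETTER A
    dchar cyrillicA = '\u0410';  // CYRILLIC CAPITAL LETTER A
    dchar greekAlpha = '\u0391'; // GREEK CAPITAL LETTER ALPHA

    // Same glyph in most fonts, but three distinct characters:
    assert(latinA != cyrillicA && latinA != greekAlpha);

    // Sorting by code point keeps each alphabet's letters in their own order:
    dchar[] letters = ['\u0411' /* Б */, 'B', '\u0410' /* А */, 'A'];
    sort(letters);
    writeln(letters); // ABАБ
}

If the lookalikes shared one code point, the last two elements could not sort into the Cyrillic sequence at all.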
Aug 15 2019
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 2:26 PM, a11e99z wrote:
 On Thursday, 15 August 2019 at 19:59:34 UTC, Walter Bright wrote:
 On 8/15/2019 12:44 PM, Jonathan M Davis wrote:
 There should only be a
 single way to represent a given character.
Exactly. And two glyphs that render identically should be the same code point.
if it was not sarcasm: different code points can ref to same glyphs not vice verse: A(EN,\u0041), A(RU,\u0410), A(EL,\u0391) else sorting for non English will not work. even order(A<B) will be wrong for example such RU glyphs ABCEHKMOPTXacepuxy corresponds to next English letters by sound or meaning AVSENKMORTHaserihu as u can see even uppers and lowers don't exists as pairs and have different meanings
Yes, I've heard this argument before. The answer is that language should not be embedded in Unicode. It will lead to nothing but problems. The language is something externally assigned to a block of text, not the text itself, just like in printed text. Again,

    a + b = c

Should those be separate code points? How about:

    a) one thing
    b) another
Aug 15 2019
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 15, 2019 at 02:42:50PM -0700, Walter Bright via Digitalmars-d wrote:
 On 8/15/2019 2:26 PM, a11e99z wrote:
[...]
 if it was not sarcasm:
 different code points can ref to same glyphs not vice verse:
 A(EN,\u0041), A(RU,\u0410), A(EL,\u0391)
 else sorting for non English will not work.
 
 even order(A<B) will be wrong for example such RU glyphs
 ABCEHKMOPTXacepuxy
 corresponds to next English letters by sound or meaning
 AVSENKMORTHaserihu
 as u can see even uppers and lowers don't exists as pairs and have
 different meanings
Yes, I've heard this argument before. The answer is that language should not be embedded in Unicode. It will lead to nothing but problems. The language is something externally assigned to a block of text, not the text itself, just like in printed text.
[...] You cannot avoid conveying language in a string. Certain characters only exist in certain languages, and the existence of the character itself already encodes language. But that's a peripheral issue. The more pertinent point is that *different* languages may reuse the *same* glyphs for different (often completely unrelated) purposes. And because of these different purposes, it changes the way the *same* glyph is printed / laid out, and may affect other things in the surrounding context as well. Put it this way: you agree that the encoding of a character ought not to change depending on font, right? If so, consider your proposal to identify characters by glyph shape. A letter with the shape 'u', by your argument, ought to be represented by one, and only one, Unicode code point -- because, after all, it has the same glyph shape. Correct? If so, now you have a problem: the shape 'u' in Cyrillic is the cursive lowercase form of и. So now you're essentially saying that all occurrences of 'u' in Cyrillic text must be substituted with и when you change the font from cursive to non-cursive. Which is a contradiction of the initial axiom that character encoding should not be font-dependent. Please explain how you solve this problem. T -- Real men don't take backups. They put their source on a public FTP-server and let the world mirror it. -- Linus Torvalds
Aug 15 2019
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 3:16 PM, H. S. Teoh wrote:
 Please explain how you solve this problem.
The same way printers solved the problem for the last 500 years.
Aug 15 2019
next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 16 August 2019 at 06:29:50 UTC, Walter Bright wrote:
 On 8/15/2019 3:16 PM, H. S. Teoh wrote:
 Please explain how you solve this problem.
The same way printers solved the problem for the last 500 years.
They didn't have to do automatic processing of the represented data, i.e. it was for pure human consumption. When the data is to be processed automatically, it is a whole other problem. I'm quite sure that you sometimes appreciate the results of automatic translation (Google Translate, Yandex, Systran etc.). While the results are far from perfect, they would be absolutely impossible if we used what you propose here.
Aug 16 2019
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2019 2:27 AM, Patrick Schluter wrote:
 While the results are far from 
 perfect, they would be absolutely impossible if we used what you propose here.
Google translate can (and does) figure it out from the context, just like a human reader would. Sentences written in mixed languages *are* written for human consumption. I have many books written that way. They are quite readable, and don't have any need to clue in the reader "the next word is in French/Latin/Greek/German".

And frankly, if data processing software is totally reliant on using the correct language-specific glyph, it will fail, because people will not type in the correct one, and visually they cannot proof it for correctness. Anything that does OCR is going to completely fail at this. Robust data processing software is going to be forced to accept and allow for multiple encodings of the same glyph, pretty much rendering the semantic difference meaningless. I bet in 10 or 20 years of being clobbered by experience you'll reluctantly agree with me that assigning semantics to individual code points was a mistake. :-)

BTW, I was a winner in the 1986 Obfuscated C Code Contest with:
-------------------------
#include <stdio.h>
#define O1O printf
#define OlO putchar
#define O10 exit
#define Ol0 strlen
#define QLQ fopen
#define OlQ fgetc
#define O1Q abs
#define QO0 for
typedef char lOL;
lOL*QI[] = {"Use:\012\011dump file\012","Unable to open file '\x25s'\012",
"\012"," ",""};
main(I,Il)
lOL*Il[];
{
    FILE *L;
    unsigned lO;
    int Q,OL[' '^'0'],llO = EOF, O=1,l=0,lll=O+O+O+l,OQ=056;
    lOL*llL="%2x ";
    (I != 1<<1&&(O1O(QI[0]),O10(1011-1010))),
    ((L = QLQ(Il[O],"r"))==0&&(O1O(QI[O],Il[O]),O10(O)));
    lO = I-(O<<l<<O);
    while (L-l,1)
    {
        QO0(Q = 0L;((Q &~(0x10-O))== l); OL[Q++] = OlQ(L));
        if (OL[0]==llO) break;
        O1O("\0454x: ",lO);
        if (I == (1<<1))
        {
            QO0(Q=Ol0(QI[O<<O<<1]);Q<Ol0(QI[0]); Q++)O1O((OL[Q]!=llO)?llL:QI[lll],OL[Q]);/*" O10(QI[1O])*/
            O1O(QI[lll]);{}
        }
        QO0 (Q=0L;Q<1<<1<<1<<1<<1;Q+=Q<0100)
        {
            (OL[Q]!=llO)? /* 0010 10lOQ 000LQL */
            ((D(OL[Q])==0&&(*(OL+O1Q(Q-l))=OQ)), OlO(OL[Q])):
            OlO(1<<(1<<1<<1)<<1);
        }
        O1O(QI[01^10^9]);
        lO+=Q+0+l;
    }
}
D(l) { return l>=' '&&l<='\~'; }
-------------------------
I am indeed aware of the problems with confusing O0l1|. D does take steps to be more tolerant of bad fonts, such as 10l being allowed in C, but not D. I seriously considered banning the identifiers l and O. Perhaps I should have. | is not a problem because the grammar (i.e. the context) detects errors with it.
Aug 16 2019
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Aug 16, 2019 at 01:44:20PM -0700, Walter Bright via Digitalmars-d wrote:
[...]
 Google translate can (and does) figure it out from the context, just
 like a human reader would.
Ha! Actually, IME, randomly substituting lookalike characters from other languages in the input to Google Translate often transmutes the result from passably-understandable to outright hilarious (and ridiculous). Or the poor befuddled software just gives up and spits the input back at you verbatim. [...]
 And frankly, if data processing software is totally reliant on using
 the correct language-specific glyph, it will fail, because people will
 not type in the correct one, and visually they cannot proof it for
 correctness.  Anything that does OCR is going to completely fail at
 this.
 
 Robust data processing software is going to be forced to accept and
 allow for multiple encodings of the same glyph, pretty much rendering
 the semantic difference meaningless.
It's not a hard problem. You just need a preprocessing stage to normalize such stray glyphs into the correct language-specific code points, and all subsequent stages in your software pipeline will Just Work(tm). Think of it as a rudimentary "OCR" stage to sanitize your inputs. This option would be unavailable if you used an encoding scheme that *cannot* encode language as part of the string.
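A minimal sketch of what such a preprocessing stage could look like in D. The lookalike table and the "mostly Cyrillic" heuristic below are made up for illustration only; this is not an existing Phobos facility, though it leans on the real std.string.translate:

import std.algorithm : count;
import std.stdio : writeln;
import std.string : translate;
import std.utf : byDchar;

// Remap stray Latin lookalikes to Cyrillic when the surrounding text is
// predominantly Cyrillic. Deliberately incomplete and simplistic.
string normalizeCyrillicLookalikes(string s)
{
    static immutable dstring latin    = "ACEOPXaceopx"d;
    static immutable dstring cyrillic = "АСЕОРХасеорх"d;

    dchar[dchar] latinToCyrillic;
    foreach (i, c; latin)
        latinToCyrillic[c] = cyrillic[i];

    // Crude context check: only remap if the text is mostly Cyrillic already.
    auto total = s.byDchar.count;
    auto inCyrillicBlock = s.byDchar.count!(c => c >= '\u0400' && c <= '\u04FF');
    if (inCyrillicBlock * 2 < total)
        return s;

    return translate(s, latinToCyrillic);
}

void main()
{
    // "Привет" typed with a stray Latin 'e' comes back all Cyrillic.
    writeln(normalizeCyrillicLookalikes("Привeт"));
}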
 I bet in 10 or 20 years of being clobbered by experience you'll
 reluctantly agree with me that assigning semantics to individual code
 points was a mistake. :-)
That remains to be seen. :-)
 BTW, I was a winner in the 1986 Obfuscated C Code Contest with:
[...]
 I am indeed aware of the problems with confusing O0l1|. D does take
 steps to be more tolerant of bad fonts, such as 10l being allowed in
 C, but not D. I seriously considered banning the identifiers l and O.
 Perhaps I should have.  | is not a problem because the grammar (i.e.
 the context) detects errors with it.
I also won an IOCCC award once, albeit anonymously (see 2005/anon)... though it had nothing to do with lookalike characters, but more to do with what I call M.A.S.S. (Memory Allocated by Stack-Smashing), in which the program does not declare any variables (besides the two parameters to main()) nor calls any memory allocation functions, but happily manipulates arrays of data. :-D T -- The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5
Aug 16 2019
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 15, 2019 at 11:29:50PM -0700, Walter Bright via Digitalmars-d wrote:
 On 8/15/2019 3:16 PM, H. S. Teoh wrote:
 Please explain how you solve this problem.
The same way printers solved the problem for the last 500 years.
Please elaborate. Because you appear to be saying that Unicode should encode the specific glyph, i.e., every font will have unique encodings for its glyphs, because every unique glyph corresponds to a unique encoding. This is patently absurd, since your string encoding becomes dependent on font selection. How do you reconcile these two things: (1) The encoding of a character should not be font-dependent. I.e., it should encode the abstract "symbol" rather than the physical rendering of said symbol. (2) In the real world, there exist different symbols that share the same glyph shape. T -- Customer support: the art of getting your clients to pay for your own incompetence.
Aug 16 2019
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Aug 16, 2019 at 10:01:57AM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
 How do you reconcile these two things:
 
 (1) The encoding of a character should not be font-dependent. I.e., it
     should encode the abstract "symbol" rather than the physical
     rendering of said symbol.
 
 (2) In the real world, there exist different symbols that share the same
     glyph shape.
[...] Or, to use a different example that stem from the same underlying issue, let's say we take a Russian string: Я тебя люблю. In a cursive font, it might look something like this: Я mеδя ∧юδ∧ю. (I'm deliberately substituting various divergent Unicode characters to make a point.) According to your proposal, т and m ought to be encoded differently. So that means that Cyrillic lowercase т has *two* different encodings (and ditto with the other lookalikes). This is obviously absurd, because it's the SAME LETTER in Cyrillic. Insisting that they be encoded differently means your string encoding depends on font, which is in itself already ridiculous, and worse yet, it means that if you're writing a web script that accepts input from users, you have no idea which encoding they will use when they want to write Cyrillic lowercase т. You end up with two strings that are logically identical, but bitwise different because the user happened to have a font where т is displayed as m. Goodbye, sane substring search, goodbye sane automatic string processing, goodbye, consistent string rendering code. This is equivalent to saying that English capital A in serif ought to have a different encoding from English capital A in sans serif, because their glyph shapes are different. If you follow that route, pretty soon you'll have a different encoding for bolded A, another encoding for slanted A (which is different from italic A), and the combinatorial explosion of useless redundant encodings thereof. It simply does not make any sense. The only sane way out of this mess is the way Unicode has taken: you encode *not* the glyph, but the logical entity behind the glyph, i.e., the "symbol" as you call it, or in Unicode parlance, the code point. Cyrillic lowercase т is a unique entity that should correspond with exactly one code point, notwithstanding that some of its forms are lookalikes to Latin lowercase m. Even if the font ultimately uses literally the same glyph to render them, they remain distinct entities in the encoding because they are *logically different things*. In today's age of international communications and multilingual strings, the fact of different logical characters sharing the same rendered form is an unavoidable, harsh reality. You either face it and deal with it in a sane way, or you can hold on to broken old approaches that don't work and fade away in the rearview mirror. Your choice. :-D T -- Без труда не выловишь и рыбку из пруда.
Aug 16 2019
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 15, 2019 at 12:59:34PM -0700, Walter Bright via Digitalmars-d wrote:
 On 8/15/2019 12:44 PM, Jonathan M Davis wrote:
 There should only be a single way to represent a given character.
Exactly. And two glyphs that render identically should be the same code point.
[...] It's not as simple as you imagine. Letter shapes across different languages can look alike, but have zero correspondence with each other. Conflating two distinct letter forms just because they happen to look alike is the beginning of the road to madness. First and foremost, the exact glyph shape depends on the font -- a cursive M is a different shape from a serif upright M which is different from a sans-serif bolded M. They are logically the exact same character, but they are rendered differently depending on the font. What's the problem with that, you say? Here's the problem: if we follow your suggestion of identifying characters by rendered glyph, that means a lowercase English 'u' ought to be the same character as the cursive form of Cyrillic и (because that's how it's written in cursive). However, non-cursive Cyrillic и is printed as и (i.e., the equivalent of a "backwards" small-caps English N). You cannot be seriously suggesting that и and u should be the same character, right?! The point is that this changes *based on the font*; Russian speakers recognize the two *distinct* glyphs as the SAME letter. They also recognize that it's a DIFFERENT letter from English u, in spite of the fact the glyphs are identical. This is just one of many such examples. Yet another Cyrillic example: lowercase cursive Т is written with a glyph that, for all practical purposes, is identical to the glyph for English 'm'. Again, conflating the two based on your idea is outright ridiculous. Just because the user changes the font, should not mean that now the character becomes a different letter! (Or that the program needs to rewrite all и's into lowercase u's!) How a letter is rendered is a question of *font*, and I'm sure you'll agree that it doesn't make sense to make decisions on character identity based on which font you happen to be using. Then take an example from Chinese: the character for "one" is, once you strip away the stylistic embellishments (which is an issue of font, and ought not to come into play with a character encoding), basically the same shape as a hyphen. You cannot seriously be telling me that we should treat the two as the same thing. Basically, there is no sane way to avoid detaching the character encoding from the physical appearance of the character. It simply makes no sense to have a different character for every variation of glyph across a set of fonts. You *have* to work on a more abstract level, at the level of the *logical* identity of the character, not its specific physical appearance per font. But that *inevitably* means you'll end up with multiple distinct characters that happen to share the same glyph (again, modulo which font the user selected for displaying the text). See the Cyrillic examples above. There are many other examples of logically-distinct characters from different languages that happen to share the same glyph shape with some English letter in some cases, which you cannot possibly conflate without ending up with nonsensical results. You cannot eliminate dependence on the specific font if you insist on identifying characters by shape. The only sane solution is to work on the abstract level, where the same logical character (e.g., Cyrillic letter N) can have multiple different glyphs depending on the font (in cursive, for example, capital И looks like English U). But once you work at the abstract level, you cannot avoid some logically-distinct letters coinciding in glyph shape (e.g., English lowercase u vs. Cyrillic и). 
And once you start on that slippery slope, you're not very far from descending into the "chaos" of the current Unicode standard -- because inevitably you'll have to make distinctions like "lowercase Greek mu as used in mathematics" vs. "lowercase Greek mu as used by Greeks to write their language" -- because although historically the two were identical, over time their usage has diverged and now there exists some contexts where you have to differentiate between the two. The fact of the matter is that human language is inherently complex (not to mention *changes over time* -- something many people don't consider), and no amount of cleverness is going to surmount that without producing an inherently-complex solution. T -- Why ask rhetorical questions? -- JC
Aug 15 2019
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 3:04 PM, H. S. Teoh wrote:
 [...]
And yet somehow people manage to read printed material without all these problems.
Aug 15 2019
parent reply xenon325 <anm programmer.net> writes:
On Thursday, 15 August 2019 at 22:23:13 UTC, Walter Bright wrote:
 And yet somehow people manage to read printed material without 
 all these problems.
If the same glyphs had the same codes, what would you do with these:

1) Sort strings. In my phone's contact list there are entries in Russian, in English, and mixed. Now they are sorted as: A (Latin), B (Latin), C, А (ru), Б, В (ru). Which is pretty easy to search/navigate. What would the order be if Unicode worked the way you want?

2) Convert cases:
- in English: 'B'.toLower == 'b'
- in Russian: 'В'.toLower == 'в'
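For reference, std.uni's per-code-point case mapping already handles both, precisely because the two capitals are distinct code points (a small check, written with escapes to avoid any lookalike confusion):

import std.stdio : writeln;
import std.uni : toLower;

void main()
{
    dchar latinB = '\u0042';     // LATIN CAPITAL LETTER B
    dchar cyrillicVe = '\u0412'; // CYRILLIC CAPITAL LETTER VE, looks like B

    // Distinct code points, so each lowercases within its own script:
    assert(latinB.toLower == 'b');          // U+0062
    assert(cyrillicVe.toLower == '\u0432'); // Cyrillic small ve, 'в'
    writeln("case conversion stays inside each alphabet");
}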
Aug 16 2019
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/16/2019 9:32 AM, xenon325 wrote:
 On Thursday, 15 August 2019 at 22:23:13 UTC, Walter Bright wrote:
 And yet somehow people manage to read printed material without all these 
 problems.
If same glyphs had same codes, what will you do with these: 1) Sort string. In my phone's contact lists there are entries in russian, in english and mixed. Now they are sorted as: A (latin), B (latin), C, А (ru), Б, В (ru). Wich is pretty easy to search/navigate.
Except that there's no guarantee that whoever entered the data used the right code point. The pragmatic solution, again, is to use context. I.e. if a glyph is surrounded by Russian characters, it's likely a Russian glyph. If it is surrounded by characters that form a common Russian word, it's likely a Russian glyph. Of course it isn't perfect, but I bet using context will work better than expecting the code points to have been entered correctly. I note that you had to tag В with (ru), because otherwise no human reader or OCR would know what it was. This is exactly the problem I'm talking about. Writing software that relies on invisible semantic information is never going to work.
Aug 16 2019
next sibling parent Gregor Mückl <gregormueckl gmx.de> writes:
On Friday, 16 August 2019 at 21:05:44 UTC, Walter Bright wrote:
 On 8/16/2019 9:32 AM, xenon325 wrote:
 On Thursday, 15 August 2019 at 22:23:13 UTC, Walter Bright 
 wrote:
 And yet somehow people manage to read printed material 
 without all these problems.
If same glyphs had same codes, what will you do with these: 1) Sort string. In my phone's contact lists there are entries in russian, in english and mixed. Now they are sorted as: A (latin), B (latin), C, А (ru), Б, В (ru). Wich is pretty easy to search/navigate.
Except that there's no guarantee that whoever entered the data used the right code point.
Depends. On smartphones, switching the keyboard language is easy (just a swipe on Android), so users that are regularly multilingual should be fine there. Windows also offers keyboard layout switching on the fly with an awkward keyboard shortcut, but it is pretty well hidden. So again, users that are multilingual in their daily routines should really be fine. But taking a step back and trying to take a bird's eye view on this discussion, it becomes clear to me that the argument could be solved if there was a clear separation of text representations for processing (sorting, spell checking, whatever other NLP you can think of) and a completely separate one for display. The transformation to the latter would naturally be lossy and not perfectly reversible. The funny thing about that part is that text rendering with OpenType fonts is *already* doing exactly this transformation to derive the font-specific glyph indices from the text. But all the bells and whistles in Unicode blur this boundary way too much. And this is what we are getting hung up over, I think. Man, we really managed to go off track in this thread, didn't we? ;)
Aug 16 2019
prev sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 16 August 2019 at 21:05:44 UTC, Walter Bright wrote:
 On 8/16/2019 9:32 AM, xenon325 wrote:
 On Thursday, 15 August 2019 at 22:23:13 UTC, Walter Bright 
 wrote:
 And yet somehow people manage to read printed material 
 without all these problems.
If same glyphs had same codes, what will you do with these: 1) Sort string. In my phone's contact lists there are entries in russian, in english and mixed. Now they are sorted as: A (latin), B (latin), C, А (ru), Б, В (ru). Wich is pretty easy to search/navigate.
Except that there's no guarantee that whoever entered the data used the right code point.
From my experience, that was an issue we encountered often before Unicode: uppercase letters in Greek texts that were mixes of ASCII (A 0x41) and Greek (Α 0xC1 in CP-1253). It was so bad that the Greek translation department didn't use Euramis for a significant amount of time. It was only when we got completely rid of this crap (and also the RTF file format) and embraced Unicode that we got rid of this issue of misused encodings. While I get that Unicode is (over-)complicated and in some aspects silly, it has nonetheless 2 essential virtues that all other encoding schemes were never able to achieve: - it is a norm that is widely used, almost universal. - it is a norm that is widely used, almost universal. Yeah, I'm lame, I repeated it twice :-) The fact that it is widely adopted even in the Far East makes it really something essential. Could they have defined things differently or more simply? Maybe, but I doubt it, as the complexity of Unicode comes from the complexity of languages themselves.
 The pragmatic solution, again, is to use context. I.e. if a 
 glyphy is surrounded by russian characters, it's likely a 
 russian glyph. If it is surrounded by characters that form a 
 common russian word, it's likely a russian glyph.
No, that doesn't work for panaché documents; we've been there, we had that, and it sucks. UTF was such a relief. Here's a little example from our configuration. The regular expression used to detect a document reference in a text as a replaceable:

0:UN:EC_N:((№|č.|nr.|št.|αριθ.|No|nr|N:o|Uimh.|br.|n.|Nr.|Nru|[Nn][º o]|[Nn].[º°o])[  ][0-9]+/[0-9]+/(EC|ES|EF|EG|EK|EΚ|CE|EÜ|EY|CE|EZ|EB|KE|WE))

What is the context here? Btw, the EC is Cyrillic and the first EK is Greek, and here are their substitution expressions:

T:BG:EC_N:№\2/ЕС
T:CS:EC_N:č.\2/ES
T:DA:EC_N:nr.\2/EF
T:DE:EC_N:Nr.\2/EG
T:EL:EC_N:αριθ.\2/EΚ
T:EN:EC_N:No\2/EC
T:ES:EC_N:nº\2/CE
T:ET:EC_N:nr\2/EÜ
T:FI:EC_N:N:o\2/EY
T:FR:EC_N:nº\2/CE
T:GA:EC_N:Uimh.\2/CE
T:HR:EC_N:br.\2/EZ
T:IT:EC_N:n.\2/CE
T:LT:EC_N:Nr.\2/EB
T:LV:EC_N:Nr.\2/EK
T:MT:EC_N:Nru\2/KE
T:NL:EC_N:nr.\2/EG
T:PL:EC_N:nr\2/WE
T:PT:EC_N:n.º\2/CE
T:RO:EC_N:nr.\2/CE
T:SK:EC_N:č.\2/ES
T:SL:EC_N:št.\2/ES
T:SV:EC_N:nr\2/EG

And as said before, such a number can appear in a citation in the language of the citation, not in the language of the document.
 Of course it isn't perfect, but I bet using context will work 
 better than expecting the code points to have been entered 
 correctly.

 I note that you had to tag В with (ru), because otherwise no 
 human reader or OCR would know what it was. This is exactly the 
 problem I'm talking about.
Yeah, but what you propose makes it even worse not better.
 Writing software that relies on invisible semantic information 
 is never going to work.
Invisible to your eyes, not invisible to the machines, that's the whole point. Why do we need to annotate all the functions in D with these annoying attributes if the compiler can detect them automagically via context? Because in general it can't, the semantic information must be provided somehow.
Aug 17 2019
prev sibling parent sarn <sarn theartofmachinery.com> writes:
On Friday, 16 August 2019 at 16:32:05 UTC, xenon325 wrote:
 If same glyphs had same codes, what will you do with these:
 ...
 2) Convert cases:
 - in english: 'B'.toLower == 'b'
 - in russian: 'В'.toLower == 'в'
FWIW, we have that problem today with Unicode and the letter i: https://en.wikipedia.org/wiki/Dotted_and_dotless_I#In_computing
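A small illustration of that with std.uni's default (language-independent) case mappings; the round trip silently turns dotless ı into dotted i, which is exactly the Turkish complaint:

import std.stdio : writefln;
import std.uni : toLower, toUpper;

void main()
{
    dchar dotlessSmall = '\u0131'; // LATIN SMALL LETTER DOTLESS I, 'ı'

    assert(dotlessSmall.toUpper == 'I'); // uppercases to plain ASCII I...
    assert('I'.toLower == 'i');          // ...which lowercases to dotted i

    writefln("%c -> %c -> %c", dotlessSmall,
             dotlessSmall.toUpper, dotlessSmall.toUpper.toLower);
}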
Aug 16 2019
prev sibling parent reply Gregor Mückl <gregormueckl gmx.de> writes:
On Thursday, 15 August 2019 at 22:04:01 UTC, H. S. Teoh wrote:
 Basically, there is no sane way to avoid detaching the 
 character encoding from the physical appearance of the 
 character.  It simply makes no sense to have a different 
 character for every variation of glyph across a set of fonts.  
 You *have* to work on a more abstract level, at the level of 
 the *logical* identity of the character, not its specific 
 physical appearance per font.
OK, but Unicode also does the inverse: it has multiple legal representations of characters that are logically the same, mean the same and should appear exactly the same (the latter doesn't necessarily happen because of font rendering deficiencies). E.g. the word "schön" can be encoded two different ways while using only code points intended for German. So you can get the situation that "schön" != "schön". This is unnecessary duplication.
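That particular example is easy to check in D with std.uni.normalize (which defaults to NFC); the two spellings below are the two encodings mentioned above:

import std.stdio : writeln;
import std.uni : normalize;

void main()
{
    string precomposed = "sch\u00F6n";  // ö as a single code point
    string decomposed  = "scho\u0308n"; // 'o' followed by a combining diaeresis

    assert(precomposed != decomposed);                     // bitwise different
    assert(precomposed.normalize == decomposed.normalize); // same after NFC, the default

    writeln(precomposed, " and ", decomposed, " normalize to the same string");
}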
Aug 15 2019
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 15, 2019 at 10:37:57PM +0000, Gregor Mückl via Digitalmars-d wrote:
 On Thursday, 15 August 2019 at 22:04:01 UTC, H. S. Teoh wrote:
 Basically, there is no sane way to avoid detaching the character
 encoding from the physical appearance of the character.  It simply
 makes no sense to have a different character for every variation of
 glyph across a set of fonts.  You *have* to work on a more abstract
 level, at the level of the *logical* identity of the character, not
 its specific physical appearance per font.
 OK, but Unicode also does the inverse: it has multiple legal representations of characters that are logically the same, mean the same and should appear exactly the same (the latter doesn't necessarily happen because of font rendering deficiencies). E.g. the word "schön" can be encoded two different ways while using only code points intended for German. So you can get the situation that "schön" != "schön". This is unnecessary duplication.

Unicode does have some dark corners like that.[*]

[...]

[*] And some worse-than-dark-corners, like the whole codepage dedicated to emoji *and* combining marks for said emoji that changes their *appearance* -- something that ought not to have any place in a character encoding scheme! Talk about scope creep...
prev sibling next sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Thursday, August 15, 2019 4:59:45 PM MDT H. S. Teoh via Digitalmars-d 
wrote:
 On Thu, Aug 15, 2019 at 10:37:57PM +0000, Gregor Mückl via Digitalmars-d 
wrote:
 On Thursday, 15 August 2019 at 22:04:01 UTC, H. S. Teoh wrote:
 Basically, there is no sane way to avoid detaching the character
 encoding from the physical appearance of the character.  It simply
 makes no sense to have a different character for every variation of
 glyph across a set of fonts.  You *have* to work on a more abstract
 level, at the level of the *logical* identity of the character, not
 its specific physical appearance per font.
 OK, but Unicode also does the inverse: it has multiple legal representations of characters that are logically the same, mean the same and should appear exactly the same (the latter doesn't necessarily happen because of font rendering deficiencies). E.g. the word "schön" can be encoded two different ways while using only code points intended for German. So you can get the situation that "schön" != "schön". This is unnecessary duplication.

[...]

Considering that emojis are supposed to be pictures formed with letters (simple ASCII art basically), they have no business being part of an encoding scheme in the first place - but having combining marks to change their appearance definitely makes it that much worse.
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Aug 15, 2019 at 05:06:57PM -0600, Jonathan M Davis via Digitalmars-d
wrote:
 On Thursday, August 15, 2019 4:59:45 PM MDT H. S. Teoh via Digitalmars-d 
 wrote:
[...]
 Unicode does have some dark corners like that.[*]
[...]
 [*] And some worse-than-dark-corners, like the whole codepage
 dedicated to emoji *and* combining marks for said emoji that changes
 their *appearance* -- something that ought not to have any place in
 a character encoding scheme!  Talk about scope creep...
Considering that emojis are supposed to be pictures formed with letters (simple ASCII art basically), they have no business being part of an encoding scheme in the first place - but having combining marks to change their appearance definitely makes it that much worse.
[...] It's not just emojis; GUI icons are already a thing in Unicode. If this trend of encoding graphics in a string continues, in about a decade's time we'll be able to reinvent Nethack with graphical tiles inside a text mode terminal, using Unicode RPG icon "characters" which you can animate by attaching various "combining diacritics". It would be kewl. But also utterly pointless and ridiculous. (In fact, I wouldn't be surprised if you can already do this to some extent using emojis and GUI icon "characters". Just add a few more Unicode "characters" for in-game objects and a few more "diacritics" for animation frames, and we're already there. Throw in a zero-width, non-spacing "animation frame variant selector" "character", and we could have an entire animation sequence encoded as a string. Who even needs PNGs and animated SVGs anymore?!) T -- First Rule of History: History doesn't repeat itself -- historians merely repeat each other.
Aug 15 2019
prev sibling parent Argolis <argolis gmail.com> writes:
On Thursday, 15 August 2019 at 19:05:32 UTC, Gregor Mückl wrote:
 On Thursday, 15 August 2019 at 11:02:54 UTC, Argolis wrote:
 [...]
This is the point we're trying to get across to you: this isn't sufficient. Depending on the context and the script/language, you need access to the string at various levels. E.g. a font renderer needs to sometimes iterate code points, not graphemes in order to compose the correct glyphs. [...]
I want to thank you, that was really inspiring to me in trying to dig harder into the problem!
Aug 16 2019
prev sibling next sibling parent reply GreatSam4sure <greatsam4sure gmail.com> writes:
On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright wrote:
 We don't yet have a good plan on how to remove autodecoding and 
 yet provide backward compatibility with autodecoding-reliant 
 projects, but one thing we can do is make Phobos work properly 
 with and without autodecoding.

 To that end, I created a build of Phobos that disables 
 autodecoding:

 https://github.com/dlang/phobos/pull/7130

 Of course, it fails. If people want impactful things to work 
 on, fixing each failure is worthwhile (each in separate PRs).

 Note that this is neither trivial nor mindless code editing. 
 Each case has to be examined as to why it is doing 
 autodecoding, is autodecoding necessary, and deciding to 
 replace it with byChar, byDchar, or simply hardcoding the 
 decoding logic.
Thanks for your effort in this direction. I remember the massive discussion on autodecoding. Recently I have witnessed a massive effort from you, Andrei and the entire community on the D language. I must confess you have a beautiful language already. The D language promises a lot with its elegance, compilation speed, runtime speed, and the generic and multi-paradigm programming techniques it supports. I don't have a problem with the language so much as with the libraries, tutorials, documentation and IDE support. Each time I download a library from the package repository, almost 90% of the time there is one error or another. I will be happy if the tools and libraries just work out of the box. The tools and libraries should be set up so that a novice like me can use them. I don't have much expertise in programming, so I can't contribute to D for now.
Aug 13 2019
parent Andre Pany <andre s-e-a-p.de> writes:
On Tuesday, 13 August 2019 at 11:01:30 UTC, GreatSam4sure wrote:
 On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright wrote:
 [...]
Thanks for your effort toward this direction I once a massive this discussion on auto decoding. Recently I have witnessed a massive effort from you, Andrei and the entire community on the D language. I must confess you have a beautiful language already. The D language promises a lot by its elegance, compilation speed, speed, generic and multiple programming techniques supported. I don't have a problem with the language that much but with the libraries, tutorial, documentation, ide. Each time I download the library from fun packages almost 90% there must be one error or another. I will be happy if the tools and library just work out of the box. The tools, the library should be set up that a novice like me can use them. I don't have much expertise in programming so I can contribute to D for the now
I started to create GitHub issues every time I see errors in libraries. This already helps a lot. What would really be useful is to see the build status of libraries on code.dlang.org. With the new CI/CD functionality of GitHub (free for open source projects), this becomes a lot more feasible and easy to set up. Kind regards Andre
Aug 13 2019
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/13/2019 12:08 AM, Walter Bright wrote:
 If people want impactful things to work on, fixing each 
 failure is worthwhile (each in separate PRs).
First fix: https://github.com/dlang/phobos/pull/7133
Aug 14 2019
prev sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright wrote:
 https://github.com/dlang/phobos/pull/7130
Thank you for working on this! Surprisingly, the amount of breakage this causes seems rather small. I sent a few PRs for the modules that I am listed as a code owner of. However, I noticed that one kind of the breakage is silent (the code compiles and runs, but behaves differently). This makes me uneasy, as it would be difficult to ensure that programs are fully and correctly updated for a (hypothetical) transition to no-autodecode. I found two cases of such silent breakage. One was in std.stdio: https://github.com/dlang/phobos/pull/7140 If there was a warning or it was an error to implicitly convert char to dchar, the breakage would have been detected during compilation. I'm sure we discussed this before. (Allowing a char, which might have a value >=0x80, to implicitly convert to dchar, which would be nonsense, is problematic, etc.) The other instance of silent breakage was in std.regex. This unittest assert started failing: https://github.com/dlang/phobos/blob/5cb4d927e56725a38b0b1ea1548d9954083d3290/std/regex/package.d#L629 I haven't looked into that, perhaps someone more familiar with std.regex and std.uni could have a look.
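To illustrate the silent part (this assumes the hypothetical no-autodecode build from the PR above; the string content is just an example):

import std.range.primitives : front;
import std.stdio : writefln;

void main()
{
    string s = "\u00E9"; // 'é', encoded in UTF-8 as 0xC3 0xA9

    // With autodecoding, s.front is a dchar equal to U+00E9.
    // With it disabled, s.front is a char: the code unit 0xC3.
    // Either way this compiles, because char implicitly converts to dchar:
    dchar c = s.front;

    writefln("U+%04X", c); // U+00E9 today; would silently become U+00C3
}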
Aug 15 2019
next sibling parent Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 15 August 2019 at 12:09:02 UTC, Vladimir Panteleev 
wrote:
 I haven't looked into that, perhaps someone more familiar with 
 std.regex and std.uni could have a look.
In std.uni, there is genericDecodeGrapheme, which needs to:

1. Work with strings of any width
2. Work with input ranges of dchars
3. Advance the given range by ref

With autodecoding on, the first case is handled by .front / .popFront. With autodecoding off, there is no direct equivalent any more. The problem is that the function needs to peek ahead (which can be multiple range elements for ranges of narrow char types, which is not possible for input ranges).

- Replacing .front / .popFront with std.utf.decodeFront does not work because the function does not do these operations in the same place, so we need to save the range before decodeFront advances it, but we can't .save() input ranges from case 2 above.

- Using byDchar does not work because .byDchar does not take its range by ref, so advancing the byDchar range will not advance the range passed by ref to genericDecodeGrapheme. I tried to use std.range.refRange for this but hit a compiler ICE ("precedence not defined for token 'cantexp'").

Perhaps there is already a construct in Phobos that can solve this?
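For reference, the byDchar point boils down to the wrapper holding its own copy of the range; a minimal standalone example (not from std.uni itself):

import std.stdio : writeln;
import std.utf : byDchar;

void main()
{
    string s = "héllo";

    // byDchar wraps a *copy* of the range, so consuming the wrapper
    // does not advance the original string held by the caller.
    auto decoded = s.byDchar;
    decoded.popFront();  // consumes 'h' from the wrapper...
    writeln(s);          // ...but s still starts with 'h': prints "héllo"
}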
Aug 15 2019
prev sibling next sibling parent Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 15 August 2019 at 12:09:02 UTC, Vladimir Panteleev 
wrote:
 I haven't looked into that, perhaps someone more familiar with 
 std.regex and std.uni could have a look.
I should add that the std.uni "silent" breakage also was due to `dchar c = str.front`, and would have been found by disallowing char->dchar implicit conversion.
Aug 15 2019
prev sibling next sibling parent reply Les De Ridder <les lesderid.net> writes:
On Thursday, 15 August 2019 at 12:09:02 UTC, Vladimir Panteleev 
wrote:
 On Tuesday, 13 August 2019 at 07:08:03 UTC, Walter Bright wrote:
 https://github.com/dlang/phobos/pull/7130
[...] However, I noticed that one kind of the breakage is silent (the code compiles and runs, but behaves differently). This makes me uneasy, as it would be difficult to ensure that programs are fully and correctly updated for a (hypothetical) transition to no-autodecode.
I remembered this article from the wiki where you pointed this out back in 2014: https://wiki.dlang.org/Element_type_of_string_ranges See also the forum thread that it links to.
Aug 15 2019
next sibling parent Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 15 August 2019 at 15:01:22 UTC, Les De Ridder wrote:
 I remembered this article from the wiki where you pointed this 
 out back
 in 2014:

 https://wiki.dlang.org/Element_type_of_string_ranges
I completely forgot about that. Thanks for bringing it up, looks like it's still relevant :)
Aug 15 2019
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
I ran into that as well with the 3 PRs I did:

fix array(String) to work with no autodecode
https://github.com/dlang/phobos/pull/7133

fix assocArray() unittests for no autodecode
https://github.com/dlang/phobos/pull/7134

fix unittests for array.join() for no autodecode
https://github.com/dlang/phobos/pull/7135

More specifically, the ElementType template returns a dchar for an
autodecodable 
string, and char/wchar for a non-autodecodable string. I suspect that most 
people are not aware of this, and code that uses ElementType may already be 
subtly broken.

Note that the documentation for ElementType is also wrong,

   https://github.com/dlang/phobos/pull/

because isNarrowString is NOT THE SAME THING as an autodecoding string! The 
difference is isNarrowString excludes stringish aggregates and enums with a 
string base type, while autodecoding types include them.

Does this confuse anyone? It confuses me. I can never remember which is which.

Autodecoding is not only a conceptual mistake; the way it is implemented is a 
buggy, confusing disaster. (isNarrowString is often incorrectly used instead of 
isAutodecodableString in Phobos.)

I think the only solution is to "rip the band-aid off" and have ElementType
give 
the code unit type when autodecoding is disabled.
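
To make the distinction concrete, here is roughly what the traits report today (a small sketch; it assumes a Phobos recent enough to expose isAutodecodableString from std.traits):

import std.range.primitives : ElementEncodingType, ElementType;
import std.traits : isAutodecodableString, isNarrowString;

// With autodecoding in place, the element type of a narrow string is dchar,
// while the underlying code unit type is what ElementEncodingType reports:
static assert(is(ElementType!string == dchar));
static assert(is(ElementEncodingType!string == char));

// The difference described above: an enum with a string base type is
// autodecodable, but it is not a "narrow string".
enum StringEnum : string { a = "abc" }
static assert(isAutodecodableString!StringEnum);
static assert(!isNarrowString!StringEnum);

void main() {}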
Aug 15 2019
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 5:09 AM, Vladimir Panteleev wrote:
 I sent a few PRs for the modules that I am listed as a code owner of.
Can you please add a link to those PRs in https://github.com/dlang/phobos/pull/7130 ? I think such references to how Phobos fixed its dependencies on autodecode will be valuable to programmers who want to fix theirs.
Aug 15 2019
parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 15 August 2019 at 21:21:33 UTC, Walter Bright wrote:
 On 8/15/2019 5:09 AM, Vladimir Panteleev wrote:
 I sent a few PRs for the modules that I am listed as a code 
 owner of.
Can you please add a link to those PRs in https://github.com/dlang/phobos/pull/7130 ?
it. Noticed you added some comments just now too doing the same.
Aug 15 2019
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/15/2019 2:25 PM, Vladimir Panteleev wrote:
 On Thursday, 15 August 2019 at 21:21:33 UTC, Walter Bright wrote:
 On 8/15/2019 5:09 AM, Vladimir Panteleev wrote:
 I sent a few PRs for the modules that I am listed as a code owner of.
Can you please add a link to those PRs in https://github.com/dlang/phobos/pull/7130 ?
added some comments just now too doing the same.
I went one better, I added a [no autodecode] label!
Aug 15 2019