digitalmars.D - The Case Against Autodecode

reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
 I am as unclear about the problems of autodecoding as I am about the necessity
 to remove curl. Whenever I ask I hear some arguments that work well emotionally
 but are scant on reason and engineering. Maybe it's time to rehash them? I just
 did so about curl, no solid argument seemed to come together. I'd be curious of
 a crisp list of grievances about autodecoding. -- Andrei
Here are some that are not matters of opinion.

1. Ranges of characters do not autodecode, but arrays of characters do. This is a glaring inconsistency.

2. Every time one wants an algorithm to work with both strings and ranges, you wind up special casing the strings to defeat the autodecoding, or to decode the ranges. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unittesting because it is convenient to unittest only with arrays.

3. Wrapping an array in a struct with an alias this to an array turns off autodecoding, another special case.

4. Autodecoding is slow and has no place in high speed string processing.

5. Very few algorithms require decoding.

6. Autodecoding has two choices when encountering invalid code units - throw or produce an error dchar. Currently, it throws, meaning no algorithms using autodecode can be made nothrow.

7. Autodecode cannot be used with unicode path/filenames, because it is legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested, and autodecode does not play well with that.

8. In my work with UTF-8 streams, dealing with autodecode has caused me considerable extra work every time. A convenient timesaver it ain't.

9. Autodecode cannot be turned off, i.e. it isn't practical to avoid importing std.array one way or another, and then autodecode is there.

10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of being arrays in the first place.

11. Indexing an array produces different results than autodecoding, another glaring special case.
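A minimal sketch of points 1, 10 and 11 in code, assuming current Phobos behaviour:

import std.range : ElementType, isRandomAccessRange;

void main()
{
    string s = "hello";
    static assert(is(typeof(s[0]) == immutable(char))); // indexing yields a code unit
    static assert(is(ElementType!string == dchar));     // range iteration autodecodes to code points
    static assert(!isRandomAccessRange!string);         // so the array is not treated as a random-access range
}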
May 12 2016
next sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
 I am as unclear about the problems of autodecoding as I am
about the necessity
 to remove curl. Whenever I ask I hear some arguments that
work well emotionally
 but are scant on reason and engineering. Maybe it's time to
rehash them? I just
 did so about curl, no solid argument seemed to come together.
I'd be curious of
 a crisp list of grievances about autodecoding. -- Andrei
[...]
12. The result of autodecoding, a range of Unicode code points, is rarely actually useful, and code that relies on autodecoding is rarely actually universally correct. Graphemes are occasionally useful for a subset of scripts, and a subset of that subset has all graphemes mapped to single code points, but this only applies to some scripts/languages.

In the majority of cases, autodecoding provides only the illusion of correctness.
May 12 2016
next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, May 12, 2016 at 08:24:23PM +0000, Vladimir Panteleev via Digitalmars-d
wrote:
 On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
[...]
1. Ranges of characters do not autodecode, but arrays of characters
do.  This is a glaring inconsistency.

2. Every time one wants an algorithm to work with both strings and
ranges, you wind up special casing the strings to defeat the
autodecoding, or to decode the ranges. Having to constantly special
case it makes for more special cases when plugging together
components. These issues often escape detection when unittesting
because it is convenient to unittest only with arrays.
Example of string special-casing leading to bugs: https://issues.dlang.org/show_bug.cgi?id=15972

This particular issue highlights the problem quite well: how could a single char possibly need to be "auto-decoded" to a dchar? Unfortunately, due to Phobos algorithms assuming autodecoding, the resulting range of char is not recognized as "string-like" data by .joiner, thus causing a compile error.

The workaround (as described in the bug comments) also illustrates the inconsistency in handling ranges of char vs. ranges of dchar: writing .joiner("\n".byCodeUnit) will actually fix the problem, basically by explicitly disabling autodecoding.

We can, of course, fix .joiner to recognize this case and handle it correctly, but the fact that using .byCodeUnit works perfectly proves that autodecoding is not necessary here. Which begs the question: why have autodecoding at all, and then require .byCodeUnit to work around issues it causes?

T

-- 
It is widely believed that reinventing the wheel is a waste of time; but I disagree: without wheel reinventers, we would still be stuck with wooden horse-cart wheels.
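A minimal sketch of the workaround shape described above, assuming current std.utf and std.algorithm (the actual failing code is in the bug report):

import std.algorithm : joiner, map;
import std.utf : byCodeUnit;

void main()
{
    auto lines = ["foo", "bar"];
    // Joining ranges of char trips the autodecoding special cases in .joiner;
    // pushing both the elements and the separator through .byCodeUnit sidesteps them.
    auto joined = lines.map!(l => l.byCodeUnit).joiner("\n".byCodeUnit);
}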
May 12 2016
prev sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, May 12, 2016 at 08:24:23PM +0000, Vladimir Panteleev via Digitalmars-d
wrote:
[...]
 12. The result of autodecoding, a range of Unicode code points, is
 rarely actually useful, and code that relies on autodecoding is rarely
 actually, universally correct. Graphemes are occasionally useful for a
 subset of scripts, and a subset of that subset has all graphemes
 mapped to single code points, but this only applies to some
 scripts/languages.
 
 In the majority of cases, autodecoding provides only the illusion of
 correctness.
A range of Unicode code points is not the same as a range of graphemes (a grapheme is what a layperson would consider to be a "character"). Autodecoding returns dchar, a code point, rather than a grapheme.

Therefore, autodecoding actually only produces intuitively correct results when your string has a 1-to-1 correspondence between grapheme and code point. In general, this is only true for a small subset of languages, mainly a few common European languages and a handful of others. It doesn't work for Korean, and doesn't work for any language that uses combining diacritics or other modifiers. You need byGrapheme to have the correct results.

So basically autodecoding, as currently implemented, fails to meet its goal of segmenting a string by "character" (i.e., grapheme), and yet imposes a performance penalty that is difficult to "turn off" (you have to sprinkle your code with byCodeUnit everywhere, and many Phobos algorithms just return a range of dchar anyway). Not to mention that a good number of string algorithms don't actually *need* autodecoding at all.

(One could make a case for auto-segmenting by grapheme, but that's even worse in terms of performance: it requires a non-trivial Unicode algorithm involving lookup tables, and may need memory allocation. At the end of the day, we're back to square one: iterate by code unit, and explicitly ask for byGrapheme where necessary.)

T

-- 
"I'm running Windows '98." "Yes." "My computer isn't working now." "Yes, you already said that." -- User-Friendly
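A small sketch of the code point vs. grapheme distinction, assuming std.uni and std.range as they are today:

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "noe\u0308l";              // "noël" with a combining diaeresis (NFD)
    assert(s.length == 6);                // 6 UTF-8 code units
    assert(s.walkLength == 5);            // autodecoded: 5 code points
    assert(s.byGrapheme.walkLength == 4); // 4 graphemes, i.e. what a reader sees
}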
May 12 2016
parent reply Marc Schütz <schuetzm gmx.net> writes:
On Thursday, 12 May 2016 at 23:16:23 UTC, H. S. Teoh wrote:
 Therefore, autodecoding actually only produces intuitively 
 correct results when your string has a 1-to-1 correspondence 
 between grapheme and code point. In general, this is only true 
 for a small subset of languages, mainly a few common European 
 languages and a handful of others.  It doesn't work for Korean, 
 and doesn't work for any language that uses combining 
 diacritics or other modifiers.  You need byGrapheme to have the 
 correct results.
In fact, even most European languages are affected if NFD normalization is used, which is the default on MacOS X. And this is actually the main problem with it: It was introduced to make unicode string handling correct. Well, it doesn't, therefore it has no justification.
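A small sketch of the NFC/NFD point, assuming std.uni's normalize:

import std.uni;

void main()
{
    string nfc = "\u00F6";   // 'ö' precomposed (NFC)
    string nfd = "o\u0308";  // 'o' plus a combining diaeresis (NFD), as in Mac OS X file names
    assert(nfc != nfd);                 // different code units and code points
    assert(normalize!NFC(nfd) == nfc);  // but the same text after normalization
}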
May 13 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Fri, 13 May 2016 10:49:24 +0000, Marc Schütz <schuetzm gmx.net> wrote:

 In fact, even most European languages are affected if NFD
 normalization is used, which is the default on MacOS X.

 And this is actually the main problem with it: It was introduced
 to make unicode string handling correct. Well, it doesn't,
 therefore it has no justification.
+1 for leaning back and contemplating exactly what auto-decode was aiming for and how it missed that goal. You'll see that an ö may still be cut between the o and the ¨. Hangul symbols are composed of pieces that go in different corners. Those would also be split up by auto-decode. Can we handle real world text AT ALL?

Are graphemes good enough to find the column in a fixed width display of some string (e.g. line+column of an error)? No, there may still be full-width characters in there that take 2 columns. :p

-- 
Marco
May 13 2016
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, May 13, 2016 at 09:26:40PM +0200, Marco Leise via Digitalmars-d wrote:
 On Fri, 13 May 2016 10:49:24 +0000, Marc Schütz <schuetzm gmx.net> wrote:
 
 In fact, even most European languages are affected if NFD 
 normalization is used, which is the default on MacOS X.
 
 And this is actually the main problem with it: It was introduced 
 to make unicode string handling correct. Well, it doesn't, 
 therefore it has no justification.
+1 for leaning back and contemplating exactly what auto-decode was aiming for and how it missed that goal. You'll see that an ö may still be cut between the o and the ¨. Hangul symbols are composed of pieces that go in different corners. Those would also be split up by auto-decode. Can we handle real world text AT ALL? Are graphemes good enough to find the column in a fixed width display of some string (e.g. line+column of an error)? No, there may still be full-width characters in there that take 2 columns. :p
[...]

A simple lookup table ought to fix this. Preferably in std.uni so that it doesn't get reinvented by every other project.

T

-- 
Don't modify spaghetti code unless you can eat the consequences.
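A rough sketch of such a lookup; the ranges below cover only a few of the East Asian Wide blocks and are purely illustrative, a real table would be generated from the Unicode data files:

// Number of terminal columns a code point occupies (illustrative subset only).
int displayWidth(dchar c)
{
    if ((c >= 0x1100 && c <= 0x115F) ||  // Hangul Jamo leading consonants
        (c >= 0x4E00 && c <= 0x9FFF) ||  // CJK Unified Ideographs
        (c >= 0xAC00 && c <= 0xD7A3) ||  // Hangul syllables
        (c >= 0xFF01 && c <= 0xFF60))    // Fullwidth forms
        return 2;
    return 1;
}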
May 13 2016
prev sibling next sibling parent reply Daniel Kozak <kozzi11 gmail.com> writes:
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
 I am as unclear about the problems of autodecoding as I am
about the necessity
 to remove curl. Whenever I ask I hear some arguments that
work well emotionally
 but are scant on reason and engineering. Maybe it's time to
rehash them? I just
 did so about curl, no solid argument seemed to come together.
I'd be curious of
 a crisp list of grievances about autodecoding. -- Andrei
[...]
For me it is not about autodecoding. I would like to have something like a String type which does that. But what I am really pissed off about is that the current string type is an alias to immutable(char)[] (so it is not usable at all). This is a real problem for me, because it makes working on arrays of chars almost impossible.

Even char[] is unusable, so I am forced to use ubyte[], but this is really not an array of chars.

ATM D does not support even full Unicode strings, nor even a basic array of chars :(.

I hope this will be fixed one day, so I could start to expand D in Czech; until then I am unable to do that.
May 12 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/12/2016 4:23 PM, Daniel Kozak wrote:
 But what I am really pissed off about is that the current string type is
 an alias to immutable(char)[] (so it is not usable at all). This is a real
 problem for me, because it makes working on arrays of chars almost impossible.

 Even char[] is unusable, so I am forced to use ubyte[], but this is really not
 an array of chars.

 ATM D does not support even full Unicode strings, nor even a basic array of chars
 :(.

 I hope this will be fixed one day, so I could start to expand D in Czech; until
 then I am unable to do that.
I can't find any actionable information in this.
May 12 2016
prev sibling next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Thu, 12 May 2016 13:15:45 -0700, Walter Bright <newshound2 digitalmars.com> wrote:

 7. Autodecode cannot be used with unicode path/filenames, because it is legal 
 (at least on Linux) to have invalid UTF-8 as filenames.
More precisely they are byte strings with '/' reserved to separate path elements. While on an out-of-the-box Linux nowadays everything is typically presented as UTF-8, there are still die-hards that use code pages, corrupted file systems or incorrectly bound network shares displaying with the wrong charset. It is safer to work with them as a ubyte[] and that also bypasses auto-decoding. I'd like 'string' to mean valid UTF-8 in D as far as the encoding goes. A filename should not be a 'string'. -- Marco
May 12 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/12/2016 4:52 PM, Marco Leise wrote:
 I'd like 'string' to mean valid UTF-8 in D as far as the
 encoding goes. A filename should not be a 'string'.
I would have agreed with you in the past, but more and more it just doesn't seem practical. UTF-8 is dirty in the real world, and D code will have to deal with it. By dealing with it I mean not crashing, throwing exceptions, or other tantrums when encountering it. Unless it matters, it should pass the invalid encodings along unmolested and without comment.

For example, if you're searching for 'a' in a UTF-8 string, what does it matter if there are invalid encodings in that string? For filenames/paths in particular, having redone the file/path code in Phobos, I realized that invalid encodings are completely immaterial.
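A small sketch of that point, assuming std.utf.byCodeUnit (0xFF is never a valid UTF-8 code unit):

import std.algorithm : canFind;
import std.utf : byCodeUnit;

void main()
{
    // A "dirty" string containing invalid UTF-8 bytes.
    auto dirty = "path\xFFto\xFFa-file";
    // Searching at the code-unit level neither throws nor cares about the bad bytes.
    assert(dirty.byCodeUnit.canFind('a'));
}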
May 12 2016
prev sibling next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 Here are some that are not matters of opinion.
If you're serious about removing auto-decoding, which I think you and others have shown has merits, you have to have THE SIMPLEST migration path ever, or you will kill D. I'm talking a simple press of a button.

I'm not exaggerating here. Python, a language which was much more popular than D at the time, came out with two versions in 2008: Python 2.7 which had numerous unicode problems, and Python 3.0 which fixed those problems. Almost eight years later, and Python 2 is STILL the more popular version despite Py3 having five major point releases since and Python 2 only getting security patches. Think the tango vs phobos problem, only a little worse.

D is much less popular now than was Python at the time, and Python 2 problems were more straight forward than the auto-decoding problem. You'll need a very clear migration path, years long deprecations, and automatic tools in order to make the transition work, or else D's usage will be permanently damaged.
May 12 2016
next sibling parent Jack Stouffer <jack jackstouffer.com> writes:
On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:
 I'm not exaggerating here. Python, a language which was much 
 more popular than D at the time, came out with two versions in 
 2008: Python 2.7 which had numerous unicode problems, and 
 Python 3.0 which fixed those problems. Almost eight years 
 later, and Python 2 is STILL the more popular version despite 
 Py3 having five major point releases since and Python 2 only 
 getting security patches. Think the tango vs phobos problem, 
 only a little worse.
To hammer this home a little more, Python 3 had a really useful library in order to abstract most of the differences automatically. But despite that, here is a list of the top 200 Python packages in 2011, three years after the fork, and whether they supported Python 3 or not: https://web.archive.org/web/20110215214547/http://python3wos.appspot.com/

This is _three years_ later, and only 18 out of the top 200 supported Python 3. And here it is now, eight years later, at 174 out of 200: https://python3wos.appspot.com/
May 12 2016
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/12/2016 5:47 PM, Jack Stouffer wrote:
 D is much less popular now than was Python at the time, and Python 2 problems
 were more straight forward than the auto-decoding problem.  You'll need a very
 clear migration path, years long deprecations, and automatic tools in order to
 make the transition work, or else D's usage will be permanently damaged.
I agree, if it is possible at all.
May 12 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Friday, 13 May 2016 at 01:00:54 UTC, Walter Bright wrote:
 On 5/12/2016 5:47 PM, Jack Stouffer wrote:
 D is much less popular now than was Python at the time, and 
 Python 2 problems
 were more straight forward than the auto-decoding problem.  
 You'll need a very
 clear migration path, years long deprecations, and automatic 
 tools in order to
 make the transition work, or else D's usage will be 
 permanently damaged.
I agree, if it is possible at all.
I don't know to which extent my problems with string handling are related to autodecode. However, I had to write some utility functions to get around issues with code points, graphemes and the like. While it is not a huge issue in terms of programming time, it does slow down my program, because even simple operations may be deferred to a utility function to make sure the result is correct (.length for example). But that might be an issue related to Unicode in general (or D's handling of it).

If autodecode is killed, could we have a test version asap? I'd be willing to test my programs with autodecode turned off and see what happens. Others should do likewise and we could come up with a transition strategy based on what happened.
May 13 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/13/2016 2:12 AM, Chris wrote:
 If autodecode is killed, could we have a test version asap? I'd be willing to
 test my programs with autodecode turned off and see what happens. Others should
 do likewise and we could come up with a transition strategy based on what
happened.
You can avoid autodecode by using .byChar
May 13 2016
parent reply Chris <wendlec tcd.ie> writes:
On Friday, 13 May 2016 at 13:17:44 UTC, Walter Bright wrote:
 On 5/13/2016 2:12 AM, Chris wrote:
 If autodecode is killed, could we have a test version asap? 
 I'd be willing to
 test my programs with autodecode turned off and see what 
 happens. Others should
 do likewise and we could come up with a transition strategy 
 based on what happened.
You can avoid autodecode by using .byChar
Hm. It would be difficult to make sure that my whole code base doesn't do something, somewhere, that triggers auto decode.

PS Why do I get a "StopForumSpam error" every time I post today? Has anyone else experienced the same problem:

"StopForumSpam error: Socket error: Lookup error: getaddrinfo error: Name or service not known. Please solve a CAPTCHA to continue."
May 13 2016
parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Friday, 13 May 2016 at 13:41:30 UTC, Chris wrote:
 PS Why do I get a "StopForumSpam error" every time I post 
 today? Has anyone else experienced the same problem:

 "StopForumSpam error: Socket error: Lookup error: getaddrinfo 
 error: Name or service not known. Please solve a CAPTCHA to 
 continue."
https://twitter.com/StopForumSpam
May 13 2016
parent Chris <wendlec tcd.ie> writes:
On Friday, 13 May 2016 at 14:06:28 UTC, Vladimir Panteleev wrote:
 On Friday, 13 May 2016 at 13:41:30 UTC, Chris wrote:
 PS Why do I get a "StopForumSpam error" every time I post 
 today? Has anyone else experienced the same problem:

 "StopForumSpam error: Socket error: Lookup error: getaddrinfo 
 error: Name or service not known. Please solve a CAPTCHA to 
 continue."
https://twitter.com/StopForumSpam
I don't understand. Does that mean we have to solve CAPTCHAs every time we post? Annoying CAPTCHAs at that.
May 13 2016
prev sibling parent Iakh <iaktakh gmail.com> writes:
On Friday, 13 May 2016 at 01:00:54 UTC, Walter Bright wrote:
 On 5/12/2016 5:47 PM, Jack Stouffer wrote:
 D is much less popular now than was Python at the time, and 
 Python 2 problems
 were more straight forward than the auto-decoding problem.  
 You'll need a very
 clear migration path, years long deprecations, and automatic 
 tools in order to
 make the transition work, or else D's usage will be 
 permanently damaged.
I agree, if it is possible at all.
A plan:
1. Mark as deprecated the places where auto-decoding is used. I think it's all "range" functions for strings (front, popFront, back, ...). Force using byChar & co.
2. Introduce a new String type in Phobos.
3. After ages, make immutable(char)[] an ordinary array.

Is it OK? Profit?
May 13 2016
prev sibling next sibling parent Ola Fosheim Grøstad writes:
On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:
 D is much less popular now than was Python at the time, and 
 Python 2 problems were more straight forward than the 
 auto-decoding problem.  You'll need a very clear migration 
 path, years long deprecations, and automatic tools in order to 
 make the transition work, or else D's usage will be permanently 
 damaged.
Python 2 is/was deployed at a much larger scale and with far more library dependencies, so I don't think it is comparable. It is easier for D to get away with breaking changes.

I am still using Python 2.7 exclusively, but now I use:

from __future__ import division, absolute_import, with_statement, unicode_literals

D can do something similar. C++ is using a comparable solution. Use switches to turn on different compatibility levels.
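In D terms, a rough sketch of that opt-in idea; the version identifier is hypothetical, not an existing compiler switch:

// Compile with -version=NoAutodecode to opt in to code-unit iteration early.
version (NoAutodecode)
{
    import std.utf : byCodeUnit;
    auto chars(string s) { return s.byCodeUnit; } // range of char
}
else
{
    auto chars(string s) { return s; } // today's autodecoded range of dchar
}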
May 13 2016
prev sibling next sibling parent reply Nick Treleaven <ntrel-pub mybtinternet.com> writes:
On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:
 If you're serious about removing auto-decoding, which I think 
 you and others have shown has merits, you have to the THE 
 SIMPLEST migration path ever, or you will kill D. I'm talking a 
 simple press of a button.
char[] is always going to be unsafe for UTF-8. I don't think we can remove it or auto-decoding, only discourage use of it. We need a String struct IMO, without length or indexing. Its front can do autodecoding, and it has a ubyte[] raw() property too. (Possibly the byte length of front can be cached for use in popFront, assuming it was faster). This would be a gradual transition.
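A minimal sketch of the kind of String struct described above; the names and details are illustrative, not a concrete proposal:

import std.utf : decode;

struct String
{
    private immutable(char)[] data;   // raw UTF-8; no length or indexing exposed

    @property bool empty() const { return data.length == 0; }

    // front decodes one code point, like autodecoding does today.
    @property dchar front() const
    {
        immutable(char)[] s = data;
        size_t i = 0;
        return decode(s, i);
    }

    void popFront()
    {
        size_t i = 0;
        decode(data, i);              // find the byte length of the current code point
        data = data[i .. $];
    }

    // Escape hatch: the raw code units, no decoding.
    @property immutable(ubyte)[] raw() const
    {
        return cast(immutable(ubyte)[]) data;
    }
}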
May 13 2016
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, May 13, 2016 at 12:16:30PM +0000, Nick Treleaven via Digitalmars-d
wrote:
 On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:
If you're serious about removing auto-decoding, which I think you and
others have shown has merits, you have to the THE SIMPLEST migration
path ever, or you will kill D. I'm talking a simple press of a
button.
char[] is always going to be unsafe for UTF-8. I don't think we can remove it or auto-decoding, only discourage use of it. We need a String struct IMO, without length or indexing. Its front can do autodecoding, and it has a ubyte[] raw() property too. (Possibly the byte length of front can be cached for use in popFront, assuming it was faster). This would be a gradual transition.
alias String = typeof(std.uni.byGrapheme(immutable(char)[].init));

:-)

Well, OK, perhaps you could wrap this in a struct that allows extraction of .raw, etc. But basically this isn't hard to implement today. We already have all of the tools necessary.

T

-- 
Dogs have owners ... cats have staff. -- Krista Casada
May 13 2016
prev sibling parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 05/12/2016 08:47 PM, Jack Stouffer wrote:
 If you're serious about removing auto-decoding, which I think you and
 others have shown has merits, you have to the THE SIMPLEST migration
 path ever, or you will kill D. I'm talking a simple press of a button.

 I'm not exaggerating here. Python, a language which was much more
 popular than D at the time, came out with two versions in 2008: Python
 2.7 which had numerous unicode problems, and Python 3.0 which fixed
 those problems. Almost eight years later, and Python 2 is STILL the more
 popular version despite Py3 having five major point releases since and
 Python 2 only getting security patches. Think the tango vs phobos
 problem, only a little worse.

 D is much less popular now than was Python at the time, and Python 2
 problems were more straight forward than the auto-decoding problem.
 You'll need a very clear migration path, years long deprecations, and
 automatic tools in order to make the transition work, or else D's usage
 will be permanently damaged.
As much as I agree on the importance of a good smooth migration path, I don't think the "Python 2 vs 3" situation is really all that comparable here. Unlike Python, we wouldn't be maintaining a "with auto-decoding" fork for years and years and years, ensuring nobody ever had a pressing reason to bother migrating. And on top of that, we don't have a culture and design philosophy that promotes "do the lazy thing first and the robust thing never". D users are more likely than dynamic language users to be willing to make a few changes for the sake of improvement.

Heck, we weather breaking fixes enough anyway. There was even one point within the last couple years where something (forget offhand what it was) was removed from std.datetime and its replacement was added *in the very same compiler release*. No transition period. It was an annoying pain (at least to me), but I got through it fine and never even entertained the thought of just sticking with the old compiler. Not sure most people even noticed it. Point is, in D, even when something does need to change, life goes on fine. As long as we don't maintain a long-term fork ;)

Naturally, minimizing breakage is important here, but I really don't think Python's UTF migration situation is all that comparable.
May 29 2016
next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Sunday, 29 May 2016 at 17:35:35 UTC, Nick Sabalausky wrote:
 Unlike Python, we wouldn't be maintaining a "with 
 auto-decoding" fork for years and years and years, ensuring 
 nobody ever had a pressing reason to bother migrating.
If it happens, they better. The D1 fork was maintained for almost three years for a good reason.
 Heck, we weather breaking fixes enough anyway.
Not nearly on a scale similar to changing how strings are iterated; not since the D1/D2 split.
 It was an annoying pain (at least to me), but I got through it 
 fine and never even entertained the thought of just sticking 
 with the old compiler.
 Not sure most people even noticed it. Point is, in D, even when 
 something does need to change, life goes on fine. As long as we 
 don't maintain a long-term fork ;)
The problem is not active users. The problem is companies who have > 10K LOC and libraries that are no longer maintained. E.g. It took Sociomantic eight years after D2's release to switch only a few parts of their projects to D2. With the loss of old libraries/old code (even old answers on SO), all of a sudden you lose a lot of the network effect that makes programming languages much more useful.
May 29 2016
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 05/29/2016 09:58 PM, Jack Stouffer wrote:
 The problem is not active users. The problem is companies who have > 10K
 LOC and libraries that are no longer maintained. E.g. It took
 Sociomantic eight years after D2's release to switch only a few parts of
 their projects to D2. With the loss of old libraries/old code (even old
 answers on SO), all of a sudden you lose a lot of the network effect
 that makes programming languages much more useful.
D1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.
May 30 2016
next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
 D1 -> D2 was a vastly more disruptive change than getting rid 
 of auto-decoding would be.
Don't be so sure. All string handling code would become broken, even if it appears to work at first.
May 30 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 12:34 PM, Jack Stouffer wrote:
 On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
 D1 -> D2 was a vastly more disruptive change than getting rid of
 auto-decoding would be.
Don't be so sure. All string handling code would become broken, even if it appears to work at first.
That kind of makes this thread less productive than "How to improve autodecoding?" -- Andrei
May 30 2016
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 30-May-2016 21:24, Andrei Alexandrescu wrote:
 On 05/30/2016 12:34 PM, Jack Stouffer wrote:
 On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
 D1 -> D2 was a vastly more disruptive change than getting rid of
 auto-decoding would be.
Don't be so sure. All string handling code would become broken, even if it appears to work at first.
That kind of makes this thread less productive than "How to improve autodecoding?" -- Andrei
1. Generalize to all ranges of code units, i.e. ranges of char/wchar.

2. Operating on code units explicitly would then always involve a step through ubyte/byte.

-- 
Dmitry Olshansky
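A rough sketch of what such a generalized constraint could look like (illustrative only):

import std.range : ElementEncodingType, isInputRange;
import std.traits : isSomeChar;

// True for strings, byCodeUnit ranges, and any other range of char/wchar/dchar.
enum isCodeUnitRange(R) = isInputRange!R && isSomeChar!(ElementEncodingType!R);

static assert(isCodeUnitRange!string);
static assert(isCodeUnitRange!(wchar[]));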
May 30 2016
prev sibling next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 30 May 2016 at 18:24:23 UTC, Andrei Alexandrescu wrote:
 That kind of makes this thread less productive than "How to 
 improve autodecoding?" -- Andrei
Please don't misunderstand, I'm for fixing string behavior. But, let's not pretend that this wouldn't be one of the (if not the) largest breaking change since D2. As I said, straight up removing auto-decoding would break all string handling code.
May 30 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 03:00 PM, Jack Stouffer wrote:
 On Monday, 30 May 2016 at 18:24:23 UTC, Andrei Alexandrescu wrote:
 That kind of makes this thread less productive than "How to improve
 autodecoding?" -- Andrei
Please don't misunderstand, I'm for fixing string behavior.
Surely the misunderstanding is not on this side of the table :o). By "that" I meant your assertion at face value (i.e. assuming it's a fact) "All string handling code would become broken, even if it appears to work at first". -- Andrei
May 30 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Monday, May 30, 2016 14:24:23 Andrei Alexandrescu via Digitalmars-d wrote:
 On 05/30/2016 12:34 PM, Jack Stouffer wrote:
 On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
 D1 -> D2 was a vastly more disruptive change than getting rid of
 auto-decoding would be.
Don't be so sure. All string handling code would become broken, even if it appears to work at first.
That kind of makes this thread less productive than "How to improve autodecoding?" -- Andrei
I think that the first step is getting Phobos to work with all ranges of character types - be they char, wchar, dchar, or graphemes. Then the algorithms themselves will work whether we have auto-decoding or not. With that done, we can at minimum tell folks to use byCodeUnit, byChar!T, byGrapheme, etc. to get the correct, efficient behavior. Right now, if you try to use ranges like byCodeUnit, they work with some of Phobos but not enough to really work as a viable replacement to auto-decoding strings.

With all that done, at least it should be reasonably easy for folks to sanely get around auto-decoding, though the question still remains at that point how possible it will be to remove auto-decoding and treat ranges of char the same way that byCodeUnit would. But at bare minimum, it's what we need to do to make it possible and reasonable to work around auto-decoding when you need to while specifying the level of Unicode that you actually want to operate at.

- Jonathan M Davis
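A small sketch of explicitly picking the iteration level, assuming the current std.utf and std.uni names:

import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byUTF;

void main()
{
    string s = "résumé";                   // NFC: 6 characters, 8 UTF-8 bytes
    assert(s.byCodeUnit.walkLength == 8);  // code units
    assert(s.byUTF!dchar.walkLength == 6); // code points
    assert(s.byGrapheme.walkLength == 6);  // graphemes
}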
May 31 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/16 2:21 PM, Jonathan M Davis via Digitalmars-d wrote:
 I think that the first step is getting Phobos to work with all ranges of
 character types - be they char, wchar, dchar, or graphemes. Then the
 algorithms themselves will work whether we have auto-decoding or not. With
 that done, we can at minimum tell folks to use byCodeUnit, byChar!T,
 byGrapheme, etc. to get the correct, efficient behavior. Right now, if you
 try to use ranges like byCodeUnit, they work with some of Phobos but not
 enough to really work as a viable replacement to auto-decoding strings.
Great. Could you put together a sample PR so we understand the implications better? Thanks! -- Andrei
May 31 2016
prev sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Monday, 30 May 2016 at 16:34:49 UTC, Jack Stouffer wrote:
 On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
 D1 -> D2 was a vastly more disruptive change than getting rid 
 of auto-decoding would be.
Don't be so sure. All string handling code would become broken, even if it appears to work at first.
Assuming silent breakage is on the table, what would be broken, really? Code that must intentionally count or otherwise operate on code points, sure. But how much of all string handling code is like that?

Perhaps it would be worth trying to silently remove autodecoding and seeing how much of Phobos breaks, as an experiment. Has this been tried before?

(Not saying this is a route we should take, but it doesn't seem to me that it will break "all string handling code" either.)
May 30 2016
next sibling parent reply Seb <seb wilzba.ch> writes:
On Monday, 30 May 2016 at 21:39:14 UTC, Vladimir Panteleev wrote:
 On Monday, 30 May 2016 at 16:34:49 UTC, Jack Stouffer wrote:
 On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
 D1 -> D2 was a vastly more disruptive change than getting rid 
 of auto-decoding would be.
Don't be so sure. All string handling code would become broken, even if it appears to work at first.
Assuming silent breakage is on the table, what would be broken, really? Code that must intentionally count or otherwise operate code points, sure. But how much of all string handling code is like that? Perhaps it would be worth trying to silently remove autodecoding and seeing how much of Phobos breaks, as an experiment. Has this been tried before? (Not saying this is a route we should take, but it doesn't seem to me that it will break "all string handling code" either.)
132 lines in Phobos use auto-decoding - that should be fixable ;-)

See them: http://sprunge.us/hUCL

More details: https://github.com/dlang/phobos/pull/4384
May 30 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/30/16 7:52 PM, Seb wrote:
 On Monday, 30 May 2016 at 21:39:14 UTC, Vladimir Panteleev wrote:
 On Monday, 30 May 2016 at 16:34:49 UTC, Jack Stouffer wrote:
 On Monday, 30 May 2016 at 16:25:20 UTC, Nick Sabalausky wrote:
 D1 -> D2 was a vastly more disruptive change than getting rid of
 auto-decoding would be.
Don't be so sure. All string handling code would become broken, even if it appears to work at first.
Assuming silent breakage is on the table, what would be broken, really? Code that must intentionally count or otherwise operate code points, sure. But how much of all string handling code is like that? Perhaps it would be worth trying to silently remove autodecoding and seeing how much of Phobos breaks, as an experiment. Has this been tried before? (Not saying this is a route we should take, but it doesn't seem to me that it will break "all string handling code" either.)
132 lines in Phobos use auto-decoding - that should be fixable ;-) See them: http://sprunge.us/hUCL More details: https://github.com/dlang/phobos/pull/4384
Thanks for this investigation! Results are about as I'd have speculated. -- Andrei
May 30 2016
prev sibling parent Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 30 May 2016 at 21:39:14 UTC, Vladimir Panteleev wrote:
 Perhaps it would be worth trying to silently remove 
 autodecoding and seeing how much of Phobos breaks, as an 
 experiment. Has this been tried before?
Did it; the result is that a large number of Phobos modules fail to compile because of template constraints that test for is(Unqual!(ElementType!S2) == dchar). As a result, anything that imports std.format or std.uni fails to compile. Also, I see some errors caused by the fact that is(string.front == immutable) now holds.

It's hard to find specifics because D halts execution after one test failure.
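For illustration, the kind of constraint that breaks; the signature is hypothetical, not a specific Phobos function:

import std.range : ElementType, isInputRange;
import std.traits : Unqual;

// Assumes that iterating a string yields dchar, which is exactly what
// autodecoding provides and what its removal silently invalidates.
void process(S)(S str)
    if (isInputRange!S && is(Unqual!(ElementType!S) == dchar))
{
    // ...
}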
May 30 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 12:25 PM, Nick Sabalausky wrote:
 On 05/29/2016 09:58 PM, Jack Stouffer wrote:
 The problem is not active users. The problem is companies who have > 10K
 LOC and libraries that are no longer maintained. E.g. It took
 Sociomantic eight years after D2's release to switch only a few parts of
 their projects to D2. With the loss of old libraries/old code (even old
 answers on SO), all of a sudden you lose a lot of the network effect
 that makes programming languages much more useful.
D1 -> D2 was a vastly more disruptive change than getting rid of auto-decoding would be.
It was also made at a time when the community was smaller by a couple orders of magnitude. -- Andrei
May 30 2016
prev sibling parent reply Chris <wendlec tcd.ie> writes:
On Sunday, 29 May 2016 at 17:35:35 UTC, Nick Sabalausky wrote:
 On 05/12/2016 08:47 PM, Jack Stouffer wrote:

 As much as I agree on the importance of a good smooth migration 
 path, I don't think the "Python 2 vs 3" situation is really all 
 that comparable here. Unlike Python, we wouldn't be maintaining 
 a "with auto-decoding" fork for years and years and years, 
 ensuring nobody ever had a pressing reason to bother migrating. 
 And on top of that, we don't have a culture and design 
 philosophy that promotes "do the lazy thing first and the 
 robust thing never". D users are more likely than dynamic 
 language users to be willing to make a few changes for the sake 
 of improvement.

 Heck, we weather breaking fixes enough anyway. There was even 
 one point within the last couple years where something (forget 
 offhand what it was) was removed from std.datetime and its 
 replacement was added *in the very same compiler release*. No 
 transition period. It was an annoying pain (at least to me), 
 but I got through it fine and never even entertained the 
 thought of just sticking with the old compiler. Not sure most 
 people even noticed it. Point is, in D, even when something 
 does need to change, life goes on fine. As long as we don't 
 maintain a long-term fork ;)

 Naturally, minimizing breakage is important here, but I really 
 don't think Python's UTF migration situation is all that 
 comparable.
I suggest providing an automatic tool (either within the compiler or as a separate program like dfix) to help with the transition. Ideally the tool would advise the user where potential problems are and how to fix them. If it's true that auto decode is unnecessary in many cases, then it shouldn't affect the whole code base. But I might be mistaken here. Maybe we should make a list of the functions where auto decode does make a difference, see how common they are, and work out a strategy from there. Destroy.
May 30 2016
parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Mon, 30 May 2016 09:26:09 +0000, Chris <wendlec tcd.ie> wrote:

 If it's true that auto decode is unnecessary in many cases, then
 it shouldn't affect the whole code base. But I might be mistaken
 here. Maybe we should make a list of the functions where auto
 decode does make a difference, see how common they are, and work
 out a strategy from there. Destroy.
It makes a difference for every function. But it still isn't necessary in many cases. It's fairly simple:

code unit == bytes/chars
code point == auto-decode
grapheme* == .byGrapheme

So if for now you used auto-decode you iterated code-points, which works correctly for most scripts in NFC**. And here lies the rub and why people say auto-decoding is unnecessary most of the time: If you are working with XML, CSV or JSON or another structured text format, these all use ASCII characters for their syntax elements. Code unit, code point and graphemes become all the same and auto-decoding just slows you down.

When on the other hand you work with real world international text, you'll want to work with graphemes. One example is putting an ellipsis in long text: "Alle Segeltörns im Überblick" (in NFD, e.g. OS X file name) may display as this with auto-decode: "Alle Segelto…¨berblick" and this with byGrapheme: "Alle Segeltö…Überblick".

But at that point you are likely also in need of localized sorting of strings, a set of algorithms that may change with the rise and fall of nations or reformations. So you'll use the platform's go-to Unicode library instead of what Phobos offers. For Java and Linux that would be ICU***. That last point makes me think we should not bother much with decoding in Phobos at all. Odds are we miss other capabilities to make good use of it. Users of auto-decode should review their code to see if code-points is really what they want and potentially switch to no-decoding or .byGrapheme.

* What we typically perceive as one unit in written text.
** A normalization form where e.g. 'ö' is a single code-point, as opposed to NFD, where 'ö' would be assembled from the two 'o' and '¨' code-points as in OS X file names.
*** http://site.icu-project.org/home#TOC-What-is-ICU-

-- 
Marco
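A small sketch of that ellipsis example in code, assuming std.uni's byGrapheme and byCodePoint (the full-width column issue mentioned earlier still applies):

import std.array : array;
import std.conv : to;
import std.range : take, walkLength;
import std.uni : byCodePoint, byGrapheme;

// Truncate to at most n user-perceived characters, keeping combining marks
// attached to their base characters.
string ellipsize(string s, size_t n)
{
    if (s.byGrapheme.walkLength <= n)
        return s;
    return s.byGrapheme.take(n).byCodePoint.array.to!string ~ "…";
}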
May 30 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:

 *** http://site.icu-project.org/home#TOC-What-is-ICU-
I was actually talking about ICU with a colleague today. Could it be that Unicode itself is broken? I've often heard criticism of Unicode but never looked into it.
May 30 2016
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Mon, 30 May 2016 17:35:36 +0000, Chris <wendlec tcd.ie> wrote:

 I was actually talking about ICU with a colleague today. Could it 
 be that Unicode itself is broken? I've often heard criticism of 
 Unicode but never looked into it.
You have to compare to the situation before, when every operating system with every localization had its own encoding. Have some text file with ASCII art in a DOS code page? Doesn't render on Windows with the same locale. Open Cyrillic text on a Latin system? Indigestible. Someone wrote a website on Windows and incorrectly tagged it with an ISO charset? The browser has to fix it up for them.

One objection I remember was the Han Unification: https://en.wikipedia.org/wiki/Han_unification
Not everyone liked how Chinese, Japanese, Korean were represented with a common set of ideograms. At the time Unicode was still 16-bit and the unified symbols would already make up 32% of all code points.

In my eyes many of the perceived problems of Unicode stem from the fact that it raises awareness of different writing systems all over the globe, in a way we didn't have to deal with when software was developed locally instead of globally on GitHub, when the target was Windows instead of cross-platform and mobile, when we were lucky if we localized for a couple of Latin languages, but Asia was a real barrier.

I don't know what you and your colleague discussed about ICU, but likely whether you should add another dependency and what alternatives there are. In Linux user space, almost everything is an outside project, an extra library, most of them with alternatives. My own research led me to the point where I came to think that there was one set of libraries without real alternatives:

ICU -> HarfBuzz -> Pango

That's the go-to chain for Unicode text: from text processing over rendering to layouting. Moreover many successful open-source projects make use of it: LibreOffice, sqlite, Qt, libxml2, WebKit to name a few. Unicode is here to stay, no matter what could have been done better in the past, and I think it is perfectly safe to bet on ICU on Linux for what e.g. Windows has built-in. Otherwise just do as Adam Ruppe said:
 Don't mess with strings. Get them from the user, store them
 without modification, spit them back out again.
:p -- Marco
May 30 2016
prev sibling parent reply Joakim <dlang joakim.fea.st> writes:
On Monday, 30 May 2016 at 17:35:36 UTC, Chris wrote:
 On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:

 *** http://site.icu-project.org/home#TOC-What-is-ICU-
I was actually talking about ICU with a colleague today. Could it be that Unicode itself is broken? I've often heard criticism of Unicode but never looked into it.
Part of it is the complexity of written language, part of it is bad technical decisions. Building the default string type in D around the horrible UTF-8 encoding was a fundamental mistake, both in terms of efficiency and complexity. I noted this in one of my first threads in this forum, and as Andrei said at the time, nobody agreed with me, with a lot of hand-waving about how efficiency wasn't an issue or that UTF-8 arrays were fine. Fast-forward years later and exactly the issues I raised are now causing pain.

UTF-8 is an antiquated hack that needs to be eradicated. It forces all other languages than English to be twice as long, for no good reason, have fun with that when you're downloading text on a 2G connection in the developing world. It is unnecessarily inefficient, which is precisely why auto-decoding is a problem. It is only a matter of time till UTF-8 is ditched.

D devs should lead the way in getting rid of the UTF-8 encoding, not bickering about how to make it more palatable. I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_

The common string-handling use case, by far, is strings with only one language, with a distant second some substrings in a second language, yet here we are putting the overhead into every character to allow inserting characters from an arbitrary language! This is madness.

Yes, the complexity of diacritics and combining characters will remain, but that is complexity that is inherent to the variety of written language. UTF-8 is not: it is just a bad technical decision, likely chosen for ASCII compatibility and some misguided notion that being able to combine arbitrary language strings with no other metadata was worthwhile. It is not.
May 31 2016
next sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 16:29:33 Joakim via Digitalmars-d wrote:
 UTF-8 is an antiquated hack that needs to be eradicated.  It
 forces all other languages than English to be twice as long, for
 no good reason, have fun with that when you're downloading text
 on a 2G connection in the developing world.  It is unnecessarily
 inefficient, which is precisely why auto-decoding is a problem.
 It is only a matter of time till UTF-8 is ditched.
Considering that *nix land uses UTF-8 almost exclusively, and many C libraries do even on Windows, I very much doubt that UTF-8 is going anywhere anytime soon - if ever. The Win32 API is UTF-16, but the vast sea of code that is C or C++ generally uses UTF-8, as do plenty of other programming languages.

And even aside from English, most European languages are going to be more efficient with UTF-8, because they're still primarily ASCII even if they contain characters that are not. Stuff like Chinese is definitely worse in UTF-8 than it would be in UTF-16, but there are a lot of languages other than English which are going to encode better with UTF-8 than UTF-16 - let alone UTF-32.

Regardless, UTF-8 isn't going anywhere anytime soon. _Way_ too much uses it for it to be going anywhere, and most folks have no problem with that. Any attempt to get rid of it would be a huge, uphill battle. But D supports UTF-8, UTF-16, _and_ UTF-32 natively - even without involving the standard library - so anyone who wants to avoid UTF-8 is free to do so.

- Jonathan M Davis
May 31 2016
parent Joakim <dlang joakim.fea.st> writes:
On Tuesday, 31 May 2016 at 18:34:54 UTC, Jonathan M Davis wrote:
 On Tuesday, May 31, 2016 16:29:33 Joakim via Digitalmars-d 
 wrote:
 UTF-8 is an antiquated hack that needs to be eradicated.  It 
 forces all other languages than English to be twice as long, 
 for no good reason, have fun with that when you're downloading 
 text on a 2G connection in the developing world.  It is 
 unnecessarily inefficient, which is precisely why 
 auto-decoding is a problem. It is only a matter of time till 
 UTF-8 is ditched.
 Considering that *nix land uses UTF-8 almost exclusively, and many C libraries do even on Windows, I very much doubt that UTF-8 is going anywhere anytime soon - if ever. The Win32 API is UTF-16, but the vast sea of code that is C or C++ generally uses UTF-8, as do plenty of other programming languages.
I agree that both UTF encodings are somewhat popular now.
 And even aside from English, most European languages are going 
 to be more efficient with UTF-8, because they're still 
 primarily ASCII even if they contain characters that are not. 
 Stuff like Chinese is definitely worse in UTF-8 than it would 
 be in UTF-16, but there are a lot of languages other than 
 English which are going to encode better with UTF-8 than UTF-16 
 - let alone UTF-32.
And there are a lot more languages that will be twice as long as English, i.e. ASCII.
 Regardless, UTF-8 isn't going anywhere anytime soon. _Way_ too 
 much uses it for it to be going anywhere, and most folks have 
 no problem with that. Any attempt to get rid of it would be a 
 huge, uphill battle.
I disagree, it is inevitable. Any tech so complex and inefficient cannot last long.
 But D supports UTF-8, UTF-16, _and_ UTF-32 natively - even 
 without involving the standard library - so anyone who wants to 
 avoid UTF-8 is free to do so.
Yes, but not by using UTF-16/32, which use too much memory. I've suggested a single-byte encoding for most languages instead, both in my last post and the earlier thread. D could use this new encoding internally, while keeping its current UTF-8/16 strings around for any outside UTF-8/16 data passed in. Any of that data run through algorithms that don't require decoding could be kept in UTF-8, but the moment any decoding is required, D would translate UTF-8 to the new encoding, which would be much easier for programmers to understand and manipulate. If UTF-8 output is needed, you'd have to encode back again. Yes, this translation layer would be a bit of a pain, but the new encoding would be so much more efficient and understandable that it would be worth it, and you're already decoding and encoding back to UTF-8 for those algorithms now. All that's changing is that you're using a new and different encoding than dchar as the default. If it succeeds for D, it could then be sold more widely as a replacement for UTF-8/16. I think this would be the right path forward, not navigating this UTF-8/16 mess further.
May 31 2016
prev sibling next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Tue, 31 May 2016 16:29:33 +0000, Joakim <dlang joakim.fea.st> wrote:

 Part of it is the complexity of written language, part of it is
 bad technical decisions.  Building the default string type in D
 around the horrible UTF-8 encoding was a fundamental mistake,
 both in terms of efficiency and complexity.  I noted this in one
 of my first threads in this forum, and as Andrei said at the
 time, nobody agreed with me, with a lot of hand-waving about how
 efficiency wasn't an issue or that UTF-8 arrays were fine.
 Fast-forward years later and exactly the issues I raised are now
 causing pain.
Maybe you can dig up your old post and we can look at each of your complaints in detail.
 UTF-8 is an antiquated hack that needs to be eradicated.  It
 forces all other languages than English to be twice as long, for
 no good reason, have fun with that when you're downloading text
 on a 2G connection in the developing world.  It is unnecessarily
 inefficient, which is precisely why auto-decoding is a problem.
 It is only a matter of time till UTF-8 is ditched.
You don't download twice the data. First of all, some languages had two-byte encodings before UTF-8, and second web content is full of HTML syntax and gzip compressed afterwards.

Take this Thai Wikipedia entry for example:
https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2

The download of the gzipped html is 11% larger in UTF-8 than in Thai TIS-620 single-byte encoding. And that is dwarfed by the size of JS + images. (I don't have the numbers, but I expect the effective overhead to be ~2%).

Ironically a lot of symbols we take for granted would then have to be implemented as HTML entities using their Unicode code points(sic!). Amongst them basic stuff like dashes, degree (°) and minute (′), accents in names, non-breaking space or footnotes (↑).
 D devs should lead the way in getting rid of the UTF-8 encoding,
 not bickering about how to make it more palatable.  I suggested a
 single-byte encoding for most languages, with double-byte for the
 ones which wouldn't fit in a byte.  Use some kind of header or
 other metadata to combine strings of different languages, _rather
 than encoding the language into every character!_
That would have put D on an island. "Some kind of header" would be a horrible mess to have in strings, because you have to account for it when concatenating strings and scan for them all the time to see if there is some interspersed 2 byte encoding in the stream. That's hardly better than UTF-8. And yes, a huge amount of websites mix scripts and a lot of other text uses the available extra symbols like ° or α,β,γ.
 The common string-handling use case, by far, is strings with only
 one language, with a distant second some substrings in a second
 language, yet here we are putting the overhead into every
 character to allow inserting characters from an arbitrary
 language!  This is madness.
No thx, madness was when we couldn't reliably open text files, because nowhere was the encoding stored and when you had to compile programs for each of a dozen codepages, so localized text would be rendered correctly. And your retro codepage system won't convince the world to drop Unicode either.
 Yes, the complexity of diacritics and combining characters will
 remain, but that is complexity that is inherent to the variety of
 written language.  UTF-8 is not: it is just a bad technical
 decision, likely chosen for ASCII compatibility and some
 misguided notion that being able to combine arbitrary language
 strings with no other metadata was worthwhile.  It is not.
The web proves you wrong. Scripts do get mixed often. Be it Wikipedia, a foreign language learning site or mathematical symbols. -- Marco
May 31 2016
next sibling parent Joakim <dlang joakim.fea.st> writes:
On Tuesday, 31 May 2016 at 20:20:46 UTC, Marco Leise wrote:
 Am Tue, 31 May 2016 16:29:33 +0000
 schrieb Joakim <dlang joakim.fea.st>:

 Part of it is the complexity of written language, part of it 
 is bad technical decisions.  Building the default string type 
 in D around the horrible UTF-8 encoding was a fundamental 
 mistake, both in terms of efficiency and complexity.  I noted 
 this in one of my first threads in this forum, and as Andrei 
 said at the time, nobody agreed with me, with a lot of 
 hand-waving about how efficiency wasn't an issue or that UTF-8 
 arrays were fine. Fast-forward years later and exactly the 
 issues I raised are now causing pain.
Maybe you can dig up your old post and we can look at each of your complaints in detail.
Not interested. I believe you were part of that thread then. Google it if you want to read it again.
 UTF-8 is an antiquated hack that needs to be eradicated.  It 
 forces all other languages than English to be twice as long, 
 for no good reason, have fun with that when you're downloading 
 text on a 2G connection in the developing world.  It is 
 unnecessarily inefficient, which is precisely why 
 auto-decoding is a problem. It is only a matter of time till 
 UTF-8 is ditched.
You don't download twice the data. First of all, some languages had two-byte encodings before UTF-8, and second web content is full of HTML syntax and gzip compressed afterwards.
The vast majority can be encoded in a single byte, and are unnecessarily forced to two bytes by the inefficient UTF-8/16 encodings. HTML syntax is a non sequitur; compression helps but isn't as efficient as a proper encoding.
 Take this Thai Wikipedia entry for example:
 https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2
 The download of the gzipped html is 11% larger in UTF-8 than
 in Thai TIS-620 single-byte encoding. And that is dwarfed by
 the size of JS + images. (I don't have the numbers, but I
 expect the effective overhead to be ~2%).
Nobody on a 2G connection is waiting minutes to download such massive web pages. They are mostly sending text to each other on their favorite chat app, and waiting longer and using up more of their mobile data quota if they're forced to use bad encodings.
 Ironically a lot of symbols we take for granted would then
 have to be implemented as HTML entities using their Unicode
 code points(sic!). Amongst them basic stuff like dashes, degree
 (°) and minute (′), accents in names, non-breaking space or
 footnotes (↑).
No, they just don't use HTML, opting for much superior mobile apps instead. :)
 D devs should lead the way in getting rid of the UTF-8 
 encoding, not bickering about how to make it more palatable.  
 I suggested a single-byte encoding for most languages, with 
 double-byte for the ones which wouldn't fit in a byte.  Use 
 some kind of header or other metadata to combine strings of 
 different languages, _rather than encoding the language into 
 every character!_
That would have put D on an island. "Some kind of header" would be a horrible mess to have in strings, because you have to account for it when concatenating strings and scan for them all the time to see if there is some interspersed 2 byte encoding in the stream. That's hardly better than UTF-8. And yes, a huge amount of websites mix scripts and a lot of other text uses the available extra symbols like ° or α,β,γ.
Let's see: a constant-time addition to a header or constantly decoding every character every time I want to manipulate the string... I wonder which is a better choice?! You would not "intersperse" any other encodings, unless you kept track of those substrings in the header. My whole point is that such mixing of languages or "extra symbols" is an extreme minority use case: the vast majority of strings are a single language.
 The common string-handling use case, by far, is strings with 
 only one language, with a distant second some substrings in a 
 second language, yet here we are putting the overhead into 
 every character to allow inserting characters from an 
 arbitrary language!  This is madness.
No thx, madness was when we couldn't reliably open text files, because nowhere was the encoding stored and when you had to compile programs for each of a dozen codepages, so localized text would be rendered correctly. And your retro codepage system won't convince the world to drop Unicode either.
Unicode _is_ a retro codepage system, they merely standardized a bunch of the most popular codepages. So that's not going away no matter what system you use. :)
 Yes, the complexity of diacritics and combining characters 
 will remain, but that is complexity that is inherent to the 
 variety of written language.  UTF-8 is not: it is just a bad 
 technical decision, likely chosen for ASCII compatibility and 
 some misguided notion that being able to combine arbitrary 
 language strings with no other metadata was worthwhile.  It is 
 not.
The web proves you wrong. Scripts do get mixed often. Be it Wikipedia, a foreign language learning site or mathematical symbols.
Those are some of the least-trafficked parts of the web, which itself is dying off as the developing world comes online through mobile apps, not the bloated web stack. Anyway, I'm not interested in rehashing this dumb argument again. The UTF-8/16 encodings are a horrible mess, and D made a big mistake by baking them in.
May 31 2016
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 31.05.2016 22:20, Marco Leise wrote:
 Am Tue, 31 May 2016 16:29:33 +0000
 schrieb Joakim<dlang joakim.fea.st>:

Part of it is the complexity of written language, part of it is
bad technical decisions.  Building the default string type in D
around the horrible UTF-8 encoding was a fundamental mistake,
both in terms of efficiency and complexity.  I noted this in one
of my first threads in this forum, and as Andrei said at the
time, nobody agreed with me, with a lot of hand-waving about how
efficiency wasn't an issue or that UTF-8 arrays were fine.
Fast-forward years later and exactly the issues I raised are now
causing pain.
Maybe you can dig up your old post and we can look at each of your complaints in detail.
It is probably this one. Not sure what "exactly the issues" are though. http://forum.dlang.org/thread/bwbuowkblpdxcpysejpb forum.dlang.org
May 31 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/31/2016 1:20 PM, Marco Leise wrote:
 [...]
I agree. I dealt with the madness of code pages, Shift-JIS, EBCDIC, locales, etc., in the pre-Unicode days. Despite its problems, Unicode (and UTF-8) is a major improvement, and I mean major. 16 years ago, I bet that Unicode was the future, and events have shown that to be correct. But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so D bet on all three. If I had a do-over, I'd just support UTF-8. UTF-16 is useful pretty much only as a transitional encoding to talk with Windows APIs. Nobody uses UCS-2 (it consumes far too much memory).
May 31 2016
next sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/01/2016 12:47 AM, Walter Bright wrote:
 But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so
 D bet on all three. If I had a do-over, I'd just support UTF-8. UTF-16
 is useful pretty much only as a transitional encoding to talk with
 Windows APIs. Nobody uses UCS-2 (it consumes far too much memory).
Wikipedia says [1] that UCS-2 is essentially UTF-16 without surrogate pairs. I suppose you mean UTF-32/UCS-4. [1] https://en.wikipedia.org/wiki/UTF-16
May 31 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/31/2016 4:00 PM, ag0aep6g wrote:
 Wikipedia says [1] that UCS-2 is essentially UTF-16 without surrogate pairs. I
 suppose you mean UTF-32/UCS-4.
 [1] https://en.wikipedia.org/wiki/UTF-16
Thanks for the correction.
May 31 2016
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Tue, 31 May 2016 15:47:02 -0700
schrieb Walter Bright <newshound2 digitalmars.com>:

 But I didn't know which encoding would win - UTF-8, UTF-16, or UCS-2, so D bet 
 on all three. If I had a do-over, I'd just support UTF-8. UTF-16 is useful 
 pretty much only as a transitional encoding to talk with Windows APIs.
I think so too, although more APIs than just Windows use UTF-16. Think of Java or ICU. Aside from their Java heritage they found that it is the fastest encoding for transcoding from and to Unicode as UTF-16 codepoints cover most 8-bit codepages. Also Qt defined a char as UTF-16 code point, but they probably regret it as the 'charmap' program KCharSelect is now unable to show Unicode characters >= 0x10000. -- Marco
May 31 2016
prev sibling next sibling parent reply ag0aep6g <anonymous example.com> writes:
On 05/31/2016 06:29 PM, Joakim wrote:
 D devs should lead the way in getting rid of the UTF-8 encoding, not
 bickering about how to make it more palatable.  I suggested a
 single-byte encoding for most languages, with double-byte for the ones
 which wouldn't fit in a byte.  Use some kind of header or other metadata
 to combine strings of different languages, _rather than encoding the
 language into every character!_
Guys, may I ask you to move this discussion to a new thread? I'd like to follow the (already crowded) autodecode thing, and this is really a separate topic.
May 31 2016
parent Joakim <dlang joakim.fea.st> writes:
On Tuesday, 31 May 2016 at 20:28:32 UTC, ag0aep6g wrote:
 On 05/31/2016 06:29 PM, Joakim wrote:
 D devs should lead the way in getting rid of the UTF-8 
 encoding, not
 bickering about how to make it more palatable.  I suggested a
 single-byte encoding for most languages, with double-byte for 
 the ones
 which wouldn't fit in a byte.  Use some kind of header or 
 other metadata
 to combine strings of different languages, _rather than 
 encoding the
 language into every character!_
Guys, may I ask you to move this discussion to a new thread? I'd like to follow the (already crowded) autodecode thing, and this is really a separate topic.
No, this is the root of the problem, but I'm not interested in debating it, so you can go back to discussing how to avoid the elephant in the room.
May 31 2016
prev sibling parent reply Marc Schütz <schuetzm gmx.net> writes:
On Tuesday, 31 May 2016 at 16:29:33 UTC, Joakim wrote:
 UTF-8 is an antiquated hack that needs to be eradicated.  It 
 forces all other languages than English to be twice as long, 
 for no good reason, have fun with that when you're downloading 
 text on a 2G connection in the developing world.
I assume you're talking about the web here. In this case, plain text makes up only a minor part of the entire traffic, the majority of which is images (binary data), javascript and stylesheets (almost pure ASCII), and HTML markup (ditto). It's likely not significant even without taking compression into account, which is ubiquitous.
 It is unnecessarily inefficient, which is precisely why 
 auto-decoding is a problem.
No, inefficiency is the least of the problems with auto-decoding.
 It is only a matter of time till UTF-8 is ditched.
This is ridiculous, even if your other claims were true.
 D devs should lead the way in getting rid of the UTF-8 
 encoding, not bickering about how to make it more palatable.  I 
 suggested a single-byte encoding for most languages, with 
 double-byte for the ones which wouldn't fit in a byte.  Use 
 some kind of header or other metadata to combine strings of 
 different languages, _rather than encoding the language into 
 every character!_
I think I remember that post, and - sorry to be so blunt - it was one of the worst things I've ever seen proposed regarding text encoding.
 The common string-handling use case, by far, is strings with 
 only one language, with a distant second some substrings in a 
 second language, yet here we are putting the overhead into 
 every character to allow inserting characters from an arbitrary 
 language!  This is madness.
No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.
Jun 01 2016
parent reply Joakim <dlang joakim.fea.st> writes:
On Wednesday, 1 June 2016 at 10:04:42 UTC, Marc Schütz wrote:
 On Tuesday, 31 May 2016 at 16:29:33 UTC, Joakim wrote:
 UTF-8 is an antiquated hack that needs to be eradicated.  It 
 forces all other languages than English to be twice as long, 
 for no good reason, have fun with that when you're downloading 
 text on a 2G connection in the developing world.
I assume you're talking about the web here. In this case, plain text makes up only a minor part of the entire traffic, the majority of which is images (binary data), javascript and stylesheets (almost pure ASCII), and HTML markup (ditto). It's likely not significant even without taking compression into account, which is ubiquitous.
No, I explicitly said not the web in a subsequent post. The ignorance here of what 2G speeds are like is mind-boggling.
 It is unnecessarily inefficient, which is precisely why 
 auto-decoding is a problem.
No, inefficiency is the least of the problems with auto-decoding.
Right... that's why this 200-post thread was spawned with that as the main reason.
 It is only a matter of time till UTF-8 is ditched.
This is ridiculous, even if your other claims were true.
The UTF-8 encoding is what's ridiculous.
 D devs should lead the way in getting rid of the UTF-8 
 encoding, not bickering about how to make it more palatable.  
 I suggested a single-byte encoding for most languages, with 
 double-byte for the ones which wouldn't fit in a byte.  Use 
 some kind of header or other metadata to combine strings of 
 different languages, _rather than encoding the language into 
 every character!_
I think I remember that post, and - sorry to be so blunt - it was one of the worst things I've ever seen proposed regarding text encoding.
Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.
 The common string-handling use case, by far, is strings with 
 only one language, with a distant second some substrings in a 
 second language, yet here we are putting the overhead into 
 every character to allow inserting characters from an 
 arbitrary language!  This is madness.
No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.
Lol, this may be the dumbest argument put forth yet. I don't think anyone here even understands what a good encoding is and what it's for, which is why there's no point in debating this.
Jun 01 2016
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Wed, 01 Jun 2016 13:57:27 +0000
schrieb Joakim <dlang joakim.fea.st>:

 No, I explicitly said not the web in a subsequent post.  The 
 ignorance here of what 2G speeds are like is mind-boggling.
I've used 56k and had a phone conversation with my sister while she was downloading a 800 MiB file over 2G. You just learn to be patient (or you already are when the next major city is hundreds of kilometers away) and load only what you need. Your point about the costs convinced me more. Here is one article spiced up with numbers and figures: http://www.thequint.com/technology/2016/05/30/almost-every-indian-may-be-online-if-data-cost-cut-to-one-third But even if you could prove with a study that UTF-8 caused a notable bandwidth cost in real life, it would - I think - be a matter of regional ISPs to provide special servers and apps that reduce data volume. There is also the overhead of key exchange when establishing a secure connection: http://stackoverflow.com/a/20306907/4038614 Something every app should do, but will increase bandwidth use. Then there is the overhead of using XML in applications like WhatsApp, which I presume is quite popular around the world. I'm just trying to broaden the view a bit here. This note from the XMPP that WhatsApp and Jabber use will make you cringe: https://tools.ietf.org/html/rfc6120#section-11.6 -- Marco
Jun 01 2016
parent reply Joakim <dlang joakim.fea.st> writes:
On Wednesday, 1 June 2016 at 14:58:47 UTC, Marco Leise wrote:
 Am Wed, 01 Jun 2016 13:57:27 +0000
 schrieb Joakim <dlang joakim.fea.st>:

 No, I explicitly said not the web in a subsequent post.  The 
 ignorance here of what 2G speeds are like is mind-boggling.
I've used 56k and had a phone conversation with my sister while she was downloading a 800 MiB file over 2G. You just learn to be patient (or you already are when the next major city is hundreds of kilometers away) and load only what you need. Your point about the costs convinced me more.
I see that max 2G speeds are 100-200 kbits/s. At that rate, it would have taken her more than 10 hours to download such a large file, that's nuts. The worst part is when the download gets interrupted and you have to start over again because most download managers don't know how to resume, including the stock one on Android. Also, people in these countries buy packs of around 100-200 MB for 30-60 US cents, so they would never download such a large file. They use messaging apps like Whatsapp or WeChat, which nobody in the US uses, to avoid onerous SMS charges.
 Here is one article spiced up with numbers and figures: 
 http://www.thequint.com/technology/2016/05/30/almost-every-indian-may-be-online-if-data-cost-cut-to-one-third
Yes, only the middle class, which are at most 10-30% of the population in these developing countries, can even afford 2G. The way to get costs down even further is to make the tech as efficient as possible. Of course, much of the rest of the population are illiterate, so there are bigger problems there.
 But even if you could prove with a study that UTF-8 caused a
 notable bandwidth cost in real life, it would - I think - be a
 matter of regional ISPs to provide special servers and apps
 that reduce data volume.
Yes, by ditching UTF-8.
 There is also the overhead of
 key exchange when establishing a secure connection:
 http://stackoverflow.com/a/20306907/4038614
 Something every app should do, but will increase bandwidth use.
That's not going to happen, even HTTP/2 ditched that requirement. Also, many of those countries' govts will not allow it: google how Blackberry had to give up their keys for "secure" BBM in many countries. It's not just Canada and the US spying on their citizens.
 Then there is the overhead of using XML in applications
 like WhatsApp, which I presume is quite popular around the
 world. I'm just trying to broaden the view a bit here.
I didn't know they used XML. Googling it now, I see mention that they switched to an "internally developed protocol" at some point, so I doubt they're using XML now.
 This note from the XMPP that WhatsApp and Jabber use will make
 you cringe: https://tools.ietf.org/html/rfc6120#section-11.6
Haha, no wonder Jabber is dead. :) I jumped on Jabber for my own messages a decade ago, as it seemed like an open way out of that proprietary messaging mess, then I read that they're using XML and gave up on it. On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
 On Wednesday, 1 June 2016 at 13:57:27 UTC, Joakim wrote:
 No, I explicitly said not the web in a subsequent post.  The 
 ignorance here of what 2G speeds are like is mind-boggling.
It's not hard. I think a lot of us remember when a 14.4 modem was cutting-edge.
Well, then apparently you're unaware of how bloated web pages are nowadays. It used to take me minutes to download popular web pages _back then_ at _top speed_, and those pages were a _lot_ smaller.
 Codepages and incompatible encodings were terrible then, too.

 Never again.
This only shows you probably don't know the difference between an encoding and a code page, which are orthogonal concepts in Unicode. It's not surprising, as Walter and many others responding show the same ignorance. I explained this repeatedly in the previous thread, but it depends on understanding the tech, and I can't spoon-feed that to everyone.
 Well, when you _like_ a ludicrous encoding like UTF-8, not 
 sure your opinion matters.
It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.
I think we can do a lot better.
 No. The common string-handling use case is code that is 
 unaware which script (not language, btw) your text is in.
Lol, this may be the dumbest argument put forth yet.
This just makes it feel like you're trolling. You're not just trolling, right?
Are you trolling? Because I was just calling it like it is. The vast majority of software is written for _one_ language, the local one. You may think otherwise because the software that sells the most and makes the most money is internationalized software like Windows or iOS, because it can be resold into many markets. But as a percentage of lines of code written, such international code is almost nothing.
 I don't think anyone here even understands what a good 
 encoding is and what it's for, which is why there's no point 
 in debating this.
And I don't think you realise how backwards you sound to people who had to live through the character encoding hell of the past. This has been an ongoing headache for the better part of a century (it still comes up in old files, sites, and systems) and you're literally the only person I've ever seen seriously suggest we turn back now that the madness has been somewhat tamed.
No, I have never once suggested "turning back." I have suggested a new scheme that retains one technical aspect of the prior schemes, ie constant-width encoding for each language, with a single byte sufficing for most. _You and several others_, including Walter, see that and automatically translate that to, "He wants EBCDIC to come back!," as though that were the only possible single-byte encoding and largely ignoring the possibilities of the header scheme I suggested. I could call that "trolling" by all of you, :) but I'll instead call it what it likely is, reactionary thinking, and move on.
 If you have to deal with delivering the fastest possible i18n 
 at GSM data rates, well, that's a tough problem and it sounds 
 like you might need to do something pretty special. Turning the 
 entire ecosystem into your special case is not the answer.
I don't think you understand: _you_ are the special case. The 5 billion people outside the US and EU are _not the special case_. Yes, they have not mattered so far, because they were too poor to buy computers. But the "computers" with the most sales these days are smartphones, and Motorola just launched their new Moto G4 in India and Samsung their new C5 and C7 in China. They didn't bother announcing release dates for these mid-range phones- well, they're high-end in those countries- in the US. That's because "computer" sales in all these non-ASCII countries now greatly outweighs the US. Now, a large majority of people in those countries don't have smartphones or text each other, so a significant chunk of the minority who do buy mostly ~$100 smartphones over there can likely afford a fatter text encoding and I don't know what encodings these developing markets are commonly using now. The problem is all the rest, and those just below who cannot afford it at all, in part because the tech is not as efficient as it could be yet. Ditching UTF-8 will be one way to make it more efficient. On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:
 Indeed, Joakim's proposal is so insane it beggars belief (why 
 not go back to baudot encoding, it's only 5 bit, hurray, it's 
 so much faster when used with flag semaphores).
I suspect you don't understand my proposal.
 As a programmer in the European Commission translation unit, 
 working on the probably biggest translation memory in the world 
 for 14 years, I can attest that Unicode is a blessing. When I 
 remember the shit we had in our documents because of the code 
 pages before most programs could handle utf-8 or utf-16 (and 
 before 2004 we only had 2 alphabets to take care of, Western 
 and Greek). What Joakim does not understand, is that there are 
 huge, huge quantities of documents that are multi-lingual.
Oh, I'm well aware of this. I just think a variable-length encoding like UTF-8 or UTF-16 is a bad design. And what you have to realize is that most strings in most software will only have one language. Anyway, the scheme I sketched out handles multiple languages: it just doesn't optimize for completely random jumbles of characters from every possible language, which is what UTF-8 is optimized for and is a ridiculous decision.
 Translators of course handle nearly exclusively with at least 
 bi-lingual documents. Any document encountered by a translator 
 must at least be able to present the source and the target 
 language. But even outside of that specific population, 
 multilingual documents are very, very common.
You are likely biased by the fact that all your documents are bilingual: they're _not_ common for the vast majority of users. Even if they were, UTF-8 is as suboptimal, compared to the constant-width encoding scheme I've sketched, for bilingual or even trilingual documents as it is for a single language, so even if I were wrong about their frequency, it wouldn't matter.
Jun 01 2016
parent reply Wyatt <wyatt.epp gmail.com> writes:
On Wednesday, 1 June 2016 at 16:45:04 UTC, Joakim wrote:
 On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
 It's not hard.  I think a lot of us remember when a 14.4 modem 
 was cutting-edge.
Well, then apparently you're unaware of how bloated web pages are nowadays. It used to take me minutes to download popular web pages _back then_ at _top speed_, and those pages were a _lot_ smaller.
It's telling that you think the encoding of the text is anything but the tiniest fraction of the problem. You should look at where the actual weight of a "modern" web page comes from.
 Codepages and incompatible encodings were terrible then, too.

 Never again.
This only shows you probably don't know the difference between an encoding and a code page,
"I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_" Yeah, that? That's codepages. And your exact proposal to put encodings in the header was ALSO tried around the time that Unicode was getting hashed out. It sucked. A lot. (Not as bad as storing it in the directory metadata, though.)
 Well, when you _like_ a ludicrous encoding like UTF-8, not 
 sure your opinion matters.
It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.
I think we can do a lot better.
Maybe. But no one's done it yet.
 The vast majority of software is written for _one_ language, 
 the local one.  You may think otherwise because the software 
 that sells the most and makes the most money is 
 internationalized software like Windows or iOS, because it can 
 be resold into many markets.  But as a percentage of lines of 
 code written, such international code is almost nothing.
I'm surprised you think this even matters after talking about web pages. The browser is your most common string processing situation. Nothing else even comes close.
 largely ignoring the possibilities of the header scheme I 
 suggested.
"Possibilities" that were considered and discarded decades ago by people with way better credentials. The era of single-byte encodings is gone, it won't come back, and good riddance to bad rubbish.
 I could call that "trolling" by all of you, :) but I'll instead 
 call it what it likely is, reactionary thinking, and move on.
It's not trolling to call you out for clearly not doing your homework.
 I don't think you understand: _you_ are the special case.
Oh, I understand perfectly. _We_ (whoever "we" are) can handle any sequence of glyphs and combining characters (correctly-formed or not) in any language at any time, so we're the special case...? Yeah, it sounds funny to me, too.
 The 5 billion people outside the US and EU are _not the special 
 case_.
Fortunately, it works for them too.
 The problem is all the rest, and those just below who cannot 
 afford it at all, in part because the tech is not as efficient 
 as it could be yet.  Ditching UTF-8 will be one way to make it 
 more efficient.
All right, now you've found the special case; the case where the generic, unambiguous encoding may need to be lowered to something else: people for whom that encoding is suboptimal because of _current_ network constraints. I fully acknowledge it's a couple billion people and that's nothing to sneeze at, but I also see that it's a situation that will become less relevant over time. -Wyatt
Jun 01 2016
parent Joakim <dlang joakim.fea.st> writes:
On Wednesday, 1 June 2016 at 18:30:25 UTC, Wyatt wrote:
 On Wednesday, 1 June 2016 at 16:45:04 UTC, Joakim wrote:
 On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
 It's not hard.  I think a lot of us remember when a 14.4 
 modem was cutting-edge.
Well, then apparently you're unaware of how bloated web pages are nowadays. It used to take me minutes to download popular web pages _back then_ at _top speed_, and those pages were a _lot_ smaller.
It's telling that you think the encoding of the text is anything but the tiniest fraction of the problem. You should look at where the actual weight of a "modern" web page comes from.
I'm well aware that text is a small part of it. My point is that they're not downloading those web pages, they're using mobile instead, as I explicitly said in a prior post. My only point in mentioning the web bloat to you is that _your perception_ is off because you seem to think they're downloading _current_ web pages over 2G connections, and comparing it to your downloads of _past_ web pages with modems. Not only did it take minutes for us back then, it takes _even longer_ now. I know the text encoding won't help much with that. Where it will help is the mobile apps they're actually using, not the bloated websites they don't use.
 Codepages and incompatible encodings were terrible then, too.

 Never again.
This only shows you probably don't know the difference between an encoding and a code page,
"I suggested a single-byte encoding for most languages, with double-byte for the ones which wouldn't fit in a byte. Use some kind of header or other metadata to combine strings of different languages, _rather than encoding the language into every character!_" Yeah, that? That's codepages. And your exact proposal to put encodings in the header was ALSO tried around the time that Unicode was getting hashed out. It sucked. A lot. (Not as bad as storing it in the directory metadata, though.)
You know what's also codepages? Unicode. The UCS is a standardized set of code pages for each language, often merely picking the most popular code page at that time. I don't doubt that what I'm saying has been tried in some form before. The question is whether that alternate form would be better if designed and implemented properly, not if a botched design/implementation has ever been attempted.
 Well, when you _like_ a ludicrous encoding like UTF-8, not 
 sure your opinion matters.
It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.
I think we can do a lot better.
Maybe. But no one's done it yet.
That's what people said about mobile devices for a long time, until about a decade ago. It's time we got this right.
 The vast majority of software is written for _one_ language, 
 the local one.  You may think otherwise because the software 
 that sells the most and makes the most money is 
 internationalized software like Windows or iOS, because it can 
 be resold into many markets.  But as a percentage of lines of 
 code written, such international code is almost nothing.
I'm surprised you think this even matters after talking about web pages. The browser is your most common string processing situation. Nothing else even comes close.
No, it's certainly popular software, but at the scale we're talking about, ie all string processing in all software, it's fairly small. And the vast majority of webapps that handle strings passed from a browser are written to only handle one language, the local one.
 largely ignoring the possibilities of the header scheme I 
 suggested.
"Possibilities" that were considered and discarded decades ago by people with way better credentials. The era of single-byte encodings is gone, it won't come back, and good riddance to bad rubbish.
Lol, credentials. :D If you think that matters at all in the face of the blatant stupidity embodied by UTF-8, I don't know what to tell you.
 I could call that "trolling" by all of you, :) but I'll 
 instead call it what it likely is, reactionary thinking, and 
 move on.
It's not trolling to call you out for clearly not doing your homework.
That's funny, because it's precisely you and others who haven't done your homework. So are you all trolling me? By your definition of trolling, which btw is not the standard one, _you_ are the one doing it.
 I don't think you understand: _you_ are the special case.
Oh, I understand perfectly. _We_ (whoever "we" are) can handle any sequence of glyphs and combining characters (correctly-formed or not) in any language at any time, so we're the special case...?
And you're doing so by mostly using a single-byte encoding for _your own_ Euro-centric languages, ie ASCII, while imposing unnecessary double-byte and triple-byte encodings on everyone else, despite their outnumbering you 10 to 1. That is the very definition of a special case.
 Yeah, it sounds funny to me, too.
I'm happy to hear you find your privilege "funny," but I'm sorry to tell you, it won't last.
 The 5 billion people outside the US and EU are _not the 
 special case_.
 Fortunately, it works for them too.
At a higher and unnecessary cost, which is why it won't last.
 The problem is all the rest, and those just below who cannot 
 afford it at all, in part because the tech is not as efficient 
 as it could be yet.  Ditching UTF-8 will be one way to make it 
 more efficient.
All right, now you've found the special case; the case where the generic, unambiguous encoding may need to be lowered to something else: people for whom that encoding is suboptimal because of _current_ network constraints. I fully acknowledge it's a couple billion people and that's nothing to sneeze at, but I also see that it's a situation that will become less relevant over time.
I continue to marvel at your calling a couple billion people "the special case," presumably thinking ~700 million people in the US and EU primarily using the single-byte encoding of ASCII are the general case. As for the continued relevance of such constrained use, I suggest you read the link Marco provided above. The vast majority of the worldwide literate population doesn't have a smartphone or use a cellular data plan, whereas the opposite is true if you include featurephones, largely because they can be used only for voice. As that article notes, costs for smartphones and 2G data plans will have to come down for them to go wider. That will take decades to roll out, though the basic tech design will mostly be done now. The costs will go down by making the tech more efficient, and ditching UTF-8 will be one of the ways the tech will be made more efficient.
Jun 01 2016
prev sibling parent reply Wyatt <wyatt.epp gmail.com> writes:
On Wednesday, 1 June 2016 at 13:57:27 UTC, Joakim wrote:
 No, I explicitly said not the web in a subsequent post.  The 
 ignorance here of what 2G speeds are like is mind-boggling.
It's not hard. I think a lot of us remember when a 14.4 modem was cutting-edge. Codepages and incompatible encodings were terrible then, too. Never again.
 Well, when you _like_ a ludicrous encoding like UTF-8, not sure 
 your opinion matters.
It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.
 No. The common string-handling use case is code that is 
 unaware which script (not language, btw) your text is in.
Lol, this may be the dumbest argument put forth yet.
This just makes it feel like you're trolling. You're not just trolling, right?
 I don't think anyone here even understands what a good encoding 
 is and what it's for, which is why there's no point in debating 
 this.
And I don't think you realise how backwards you sound to people who had to live through the character encoding hell of the past. This has been an ongoing headache for the better part of a century (it still comes up in old files, sites, and systems) and you're literally the only person I've ever seen seriously suggest we turn back now that the madness has been somewhat tamed. If you have to deal with delivering the fastest possible i18n at GSM data rates, well, that's a tough problem and it sounds like you might need to do something pretty special. Turning the entire ecosystem into your special case is not the answer. -Wyatt
Jun 01 2016
next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
 On Wednesday, 1 June 2016 at 13:57:27 UTC, Joakim wrote:
 No, I explicitly said not the web in a subsequent post.  The 
 ignorance here of what 2G speeds are like is mind-boggling.
It's not hard. I think a lot of us remember when a 14.4 modem was cutting-edge. Codepages and incompatible encodings were terrible then, too. Never again.
 Well, when you _like_ a ludicrous encoding like UTF-8, not 
 sure your opinion matters.
It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.
 No. The common string-handling use case is code that is 
 unaware which script (not language, btw) your text is in.
Lol, this may be the dumbest argument put forth yet.
This just makes it feel like you're trolling. You're not just trolling, right?
 I don't think anyone here even understands what a good 
 encoding is and what it's for, which is why there's no point 
 in debating this.
And I don't think you realise how backwards you sound to people who had to live through the character encoding hell of the past. This has been an ongoing headache for the better part of a century (it still comes up in old files, sites, and systems) and you're literally the only person I've ever seen seriously suggest we turn back now that the madness has been somewhat tamed.
Indeed, Joakim's proposal is so insane it beggars belief (why not go back to baudot encoding, it's only 5 bit, hurray, it's so much faster when used with flag semaphores). As a programmer in the European Commission translation unit, working on the probably biggest translation memory in the world for 14 years, I can attest that Unicode is a blessing. When I remember the shit we had in our documents because of the code pages before most programs could handle utf-8 or utf-16 (and before 2004 we only had 2 alphabets to take care of, Western and Greek). What Joakim does not understand, is that there are huge, huge quantities of documents that are multi-lingual. Translators of course handle nearly exclusively with at least bi-lingual documents. Any document encountered by a translator must at least be able to present the source and the target language. But even outside of that specific population, multilingual documents are very, very common.
 If you have to deal with delivering the fastest possible i18n 
 at GSM data rates, well, that's a tough problem and it sounds 
 like you might need to do something pretty special. Turning the 
 entire ecosystem into your special case is not the answer.
Jun 01 2016
parent reply deadalnix <deadalnix gmail.com> writes:
On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:
 What Joakim does not understand, is that there are huge, huge 
 quantities of documents that are multi-lingual.
That should be obvious to anyone living outside the USA.
Jun 01 2016
next sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 06/01/2016 12:26 PM, deadalnix wrote:
 On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:
 What Joakim does not understand, is that there are huge, huge
 quantities of documents that are multi-lingual.
That should be obvious to anyone living outside the USA.
Or anyone in the USA who's ever touched a product that includes a manual or a safety warning, or gone to high school (a foreign language class is pretty much universally mandatory, even in the US).
Jun 01 2016
prev sibling parent Kagamin <spam here.lot> writes:
On Wednesday, 1 June 2016 at 16:26:36 UTC, deadalnix wrote:
 On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter 
 wrote:
 What Joakim does not understand, is that there are huge, huge 
 quantities of documents that are multi-lingual.
That should be obvious to anyone living outside the USA.
https://msdn.microsoft.com/th-th inside too :)
Jun 01 2016
prev sibling parent Kagamin <spam here.lot> writes:
On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:
 If you have to deal with delivering the fastest possible i18n 
 at GSM data rates, well, that's a tough problem and it sounds 
 like you might need to do something pretty special. Turning the 
 entire ecosystem into your special case is not the answer.
UTF-8 encoded SMS work fine for me in GSM network, didn't notice any problem.
Jun 01 2016
prev sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 30 May 2016 at 16:03:03 UTC, Marco Leise wrote:
 When on the other hand you work with real world international 
 text, you'll want to work with graphemes.
Actually, my main rule of thumb is: don't mess with strings. Get them from the user, store them without modification, spit them back out again. Wherever possible, don't do anything more. But if you do have to implement the rest, eh, it depends on what you're doing still. If I want an ellipsis, for example, I like to take font size into account too - basically, I do a dry-run of the whole font render to get the length in pixels, then slice off the partial grapheme... So yeah that's kinda complicated...
May 30 2016
prev sibling next sibling parent Jack Stouffer <jack jackstouffer.com> writes:
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 2. Every time one wants an algorithm to work with both strings 
 and ranges, you wind up special casing the strings to defeat 
 the autodecoding, or to decode the ranges. Having to constantly 
 special case it makes for more special cases when plugging 
 together components. These issues often escape detection when 
 unittesting because it is convenient to unittest only with 
 arrays.
This is a great example of special casing in Phobos that someone showed me: https://github.com/dlang/phobos/blob/master/std/algorithm/searching.d#L1714
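(Not the linked Phobos code itself, just a rough sketch of the pattern it points at; the helper name countElements and its exact behavior are made up for illustration. A generic algorithm ends up special-casing narrow strings just to bypass auto-decoding:)

import std.range.primitives : empty, popFront, isInputRange;
import std.traits : isNarrowString;

// Hypothetical generic algorithm: to avoid auto-decoding it must
// special-case narrow strings, the kind of duplication that shows
// up all over Phobos.
size_t countElements(R)(R r) if (isInputRange!R)
{
    static if (isNarrowString!R)
    {
        // Bypass auto-decoding: count code units directly.
        return r.length;
    }
    else
    {
        size_t n;
        for (; !r.empty; r.popFront())
            ++n;
        return n;
    }
}

unittest
{
    assert(countElements("héllo") == 6);   // code units, not code points
    assert(countElements([1, 2, 3]) == 3);
}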
May 12 2016
prev sibling next sibling parent reply Bill Hicks <billhicks reality.com> writes:
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 Here are some that are not matters of opinion.

 1. Ranges of characters do not autodecode, but arrays of 
 characters do. This is a glaring inconsistency.

 2. Every time one wants an algorithm to work with both strings 
 and ranges, you wind up special casing the strings to defeat 
 the autodecoding, or to decode the ranges. Having to constantly 
 special case it makes for more special cases when plugging 
 together components. These issues often escape detection when 
 unittesting because it is convenient to unittest only with 
 arrays.

 3. Wrapping an array in a struct with an alias this to an array 
 turns off autodecoding, another special case.

 4. Autodecoding is slow and has no place in high speed string 
 processing.

 5. Very few algorithms require decoding.

 6. Autodecoding has two choices when encountering invalid code 
 units - throw or produce an error dchar. Currently, it throws, 
 meaning no algorithms using autodecode can be made nothrow.

 7. Autodecode cannot be used with unicode path/filenames, 
 because it is legal (at least on Linux) to have invalid UTF-8 
 as filenames. It turns out in the wild that pure Unicode is not 
 universal - there's lots of dirty Unicode that should remain 
 unmolested, and autocode does not play with that.

 8. In my work with UTF-8 streams, dealing with autodecode has 
 caused me considerably extra work every time. A convenient 
 timesaver it ain't.

 9. Autodecode cannot be turned off, i.e. it isn't practical to 
 avoid importing std.array one way or another, and then 
 autodecode is there.

 10. Autodecoded arrays cannot be RandomAccessRanges, losing a 
 key benefit of being arrays in the first place.

 11. Indexing an array produces different results than 
 autodecoding, another glaring special case.
Wow, that's eleven things wrong with just one tiny element of D, with the potential to cause problems, whether fixed or not. And I get called a troll and other names when I list half a dozen things wrong with D, my posts get removed/censored, etc, all because I try to inform people not to waste time with D because it's a broken and failed language. *sigh* Phobos, a piece of useless rock orbiting a dead planet ... the irony.
May 12 2016
next sibling parent Ethan Watson <gooberman gmail.com> writes:
On Friday, 13 May 2016 at 06:50:49 UTC, Bill Hicks wrote:
 *rant*
Actually, chap, it's the attitude that's the turn-off in your post there. Listing problems in order to improve them, and listing problems to convince people something is a waste of time are incompatible mindsets around here.
May 13 2016
prev sibling next sibling parent reply poliklosio <poliklosio happypizza.com> writes:
On Friday, 13 May 2016 at 06:50:49 UTC, Bill Hicks wrote:
 On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 (...)
Wow, that's eleven things wrong with just one tiny element of D, with the potential to cause problems, whether fixed or not. And I get called a troll and other names when I list half a dozen things wrong with D, my posts get removed/censored, etc, all because I try to inform people not to waste time with D because it's a broken and failed language. *sigh* Phobos, a piece of useless rock orbiting a dead planet ... the irony.
You get banned because there is a difference between torpedoing a project and having constructive criticism. Also, you are missing the point by claiming that a technical problem is sure to kill D. Note that very successful languages like C++, Python and so on also have undergone heated discussions about various features, and often live with design mistakes for many years. The real reason why languages are successful is what they enable, not how many quirks they have. Quirks are why they get replaced by others 20 years later. :)
May 13 2016
parent Ola Fosheim Grøstad writes:
On Sunday, 15 May 2016 at 01:45:25 UTC, Bill Hicks wrote:
 From a technical point, D is not successful, for the most part.
  C/C++ at least can use the excuse that they were created 
 during a time when we didn't have the experience and the 
 knowledge that we do now.
Not really. The dominating precursor to C, BCPL, was a bootstrapping language for CPL. C was a quick hack to implement Unix. C++ has always been viewed as a hack and was heavily criticised since its inception as an ugly bastardized language that got many things wrong. Reality is, current mainstream programming languages draw on theory that has been well understood for 40+ years. There is virtually no innovation, but a lot of repeated mistakes. Some esoteric languages draw on more modern concepts and innovate, but I can't think of a single mainstream language that does that.
 If by successful you mean the size of the user base, then D 
 doesn't have that either.  The number of D users is most 
 definitely less than 10k.  The number of people who have tried 
 D is no doubt greater than that, but that's the thing with D, 
 it has a low retention rate, for obvious reasons.
Yes, but D can make breaking changes, something C++ cannot do. Unfortunately there is no real willingness to clean up the language, so D is moving way too slowly to become competitive. But that is more of a cultural issue than a language issue. I am personally increasingly involved with C++, but unfortunately, there is no single C++ language. The C/C++ committees have unfortunately tried to make the C-languages more performant and high level at the cost of correctness. So, now you either have to do heavy code reviews or carefully select compiler options to get a sane C++ environment. Like, in modern C/C++ the compiler assumes that there is no aliasing between pointers to different types. So if I cast a scalar float pointer to a simd pointer I either have to:

1. make sure that I turn off that assumption by using the compiler switch "-fno-strict-aliasing" and add "__restrict__" where I know there is no aliasing, or
2. put __may_alias__ on my simd pointers, or
3. carefully place memory barriers between pointer type casts, or
4. dig into the compiler internals to figure out what it does.

C++ is trying way too hard to become a high level language, without the foundation to support it. This is an area where D could do well, but it isn't doing enough to get there, neither on the theoretical level nor the implementation level. Rust seems to try, but I don't think they will make it as they don't seem to have a broad view of programming. Maybe someone will build a new language over the Rust mid-level IR (MIR) that will be successful. I'm hopeful, but hey, it won't happen in less than 5 years. Until then there are only three options for C++ish programming: C++, D and Loci. Currently C++ is the path of least resistance (but with very high initial investment, 1+ year for an experienced educated programmer). So clearly a language comparable to D _could_ make headway, but not without a philosophical change that makes it a significant improvement over C++ and systematically addresses the C++ shortcomings one by one (while retaining the application area and basic programming model).
May 15 2016
prev sibling next sibling parent Chris <wendlec tcd.ie> writes:
On Friday, 13 May 2016 at 06:50:49 UTC, Bill Hicks wrote:
 Wow, that's eleven things wrong with just one tiny element of 
 D, with the potential to cause problems, whether fixed or not.  
 And I get called a troll and other names when I list half a 
 dozen things wrong with D, my posts get removed/censored, etc, 
 all because I try to inform people not to waste time with D 
 because it's a broken and failed language.

 *sigh*

 Phobos, a piece of useless rock orbiting a dead planet ... the 
 irony.
Is there any PL that doesn't have multiple issues? Look at Swift. They keep changing it, although it started out as _the_ big alternative to the chronically ill C++. There is no such thing as the perfect PL, and as hardware is changing, PLs are outdated anyway and have to catch up. The question is not whether a language sucks or not, the question is which language sucks the least for the task at hand. PS I wonder does Bill Hicks know you're using his name? But I guess he's lost interest in this planet and happily lives on Mars now.
May 13 2016
prev sibling next sibling parent Kagamin <spam here.lot> writes:
On Friday, 13 May 2016 at 06:50:49 UTC, Bill Hicks wrote:
 not to waste time with D because it's a broken and failed 
 language.
D is a better broken thing among all the broken things in this broken world, so it's to be expected to be preferred to spend time on.
May 13 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/12/2016 11:50 PM, Bill Hicks wrote:
 And I get called a troll and
 other names when I list half a dozen things wrong with D, my posts get
 removed/censored, etc, all because I try to inform people not to waste time
with
 D because it's a broken and failed language.
Posts that engage in personal attacks and bring up personal issues about other forum members get removed. You're welcome to post here in a reasonably professional manner.
May 13 2016
prev sibling next sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, May 12, 2016 13:15:45 Walter Bright via Digitalmars-d wrote:
 On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
  > I am as unclear about the problems of autodecoding as I am about the
  > necessity to remove curl. Whenever I ask I hear some arguments that work
  > well emotionally but are scant on reason and engineering. Maybe it's
  > time to rehash them? I just did so about curl, no solid argument seemed
  > to come together. I'd be curious of a crisp list of grievances about
  > autodecoding. -- Andrei

 Here are some that are not matters of opinion.

 1. Ranges of characters do not autodecode, but arrays of characters do. This
 is a glaring inconsistency.

 2. Every time one wants an algorithm to work with both strings and ranges,
 you wind up special casing the strings to defeat the autodecoding, or to
 decode the ranges. Having to constantly special case it makes for more
 special cases when plugging together components. These issues often escape
 detection when unittesting because it is convenient to unittest only with
 arrays.

 3. Wrapping an array in a struct with an alias this to an array turns off
 autodecoding, another special case.

 4. Autodecoding is slow and has no place in high speed string processing.

 5. Very few algorithms require decoding.

 6. Autodecoding has two choices when encountering invalid code units - throw
 or produce an error dchar. Currently, it throws, meaning no algorithms
 using autodecode can be made nothrow.

 7. Autodecode cannot be used with unicode path/filenames, because it is
 legal (at least on Linux) to have invalid UTF-8 as filenames. It turns out
 in the wild that pure Unicode is not universal - there's lots of dirty
 Unicode that should remain unmolested, and autocode does not play with
 that.

 8. In my work with UTF-8 streams, dealing with autodecode has caused me
 considerably extra work every time. A convenient timesaver it ain't.

 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
 importing std.array one way or another, and then autodecode is there.

 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key benefit of
 being arrays in the first place.

 11. Indexing an array produces different results than autodecoding, another
 glaring special case.
It also results in constantly special-casing algorithms for narrow strings in order to avoid auto-decoding. Phobos does this all over the place. We have a ridiculous amount of code in Phobos just to avoid auto-decoding, and anyone who wants high performance will have to do the same.

And it's not like auto-decoding is even correct. It would be one thing if auto-decoding were fully correct but slow, but to be fully correct, it would need to operate at the grapheme level, not the code point level. So, by default, we get slower code without actually getting fully correct code. So, we're neither fast nor correct. We _are_ correct in more cases than we'd be if we simply acted like ASCII was all there was, but what we end up with is the illusion that we're correct when we're not.

IIRC, Andrei talked in TDPL about how Java's choice to go with UTF-16 was worse than the choice to go with UTF-8, because it was correct in many more cases to operate on the code unit level as if a code unit were a character, and it was therefore harder to realize that what you were doing was wrong, whereas with UTF-8, it's obvious very quickly. We currently have that same problem with auto-decoding except that it's treating UTF-32 code units as if they were full characters rather than treating UTF-16 code units as if they were full characters.

Ideally, algorithms would be Unicode aware as appropriate, but the default would be to operate on code units with wrappers to handle decoding by code point or grapheme. Then it's easy to write fast code while still allowing for full correctness. Granted, it's not necessarily easy to get correct code that way, but anyone who wants full correctness without caring about efficiency can just use ranges of graphemes. Ranges of code points are rare regardless.

Based on what I've seen in previous conversations on auto-decoding over the past few years (be it in the newsgroup, on github, or at dconf), most of the core devs think that auto-decoding was a major blunder that we continue to pay for. But unfortunately, even if we all agree that it was a huge mistake and want to fix it, the question remains of how to do that without breaking tons of code - though since AFAIK, Andrei is still in favor of auto-decoding, we'd have a hard time going forward with plans to get rid of it even if we had come up with a good way of doing so. But I would love it if we could get rid of auto-decoding and clean up string handling in D.

- Jonathan M Davis
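(To make the code unit / code point / grapheme distinction concrete, here is a minimal sketch using Phobos as it exists today; the sample string is an illustrative example chosen because it contains a combining accent:)

import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "é" spelled as 'e' followed by a combining acute accent (U+0301).
    string s = "cafe\u0301";

    assert(s.length == 6);                 // UTF-8 code units
    assert(s.walkLength == 5);             // code points (what auto-decoding gives you)
    assert(s.byGrapheme.walkLength == 4);  // user-perceived characters
    assert(s.byCodeUnit.walkLength == 6);  // explicit code-unit view, no decoding at all
}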
May 13 2016
next sibling parent Chris <wendlec tcd.ie> writes:
On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
 Based on what I've seen in previous conversations on 
 auto-decoding over the past few years (be it in the newsgroup, 
 on github, or at dconf), most of the core devs think that 
 auto-decoding was a major blunder that we continue to pay for. 
 But unfortunately, even if we all agree that it was a huge 
 mistake and want to fix it, the question remains of how to do 
 that without breaking tons of code - though since AFAIK, Andrei 
 is still in favor of auto-decoding, we'd have a hard time going 
 forward with plans to get rid of it even if we had come up with 
 a good way of doing so. But I would love it if we could get rid 
 of auto-decoding and clean up string handling in D.

 - Jonathan M Davis
Why not just try it in a separate test release? Only then can we know to what extent it actually breaks code, and what remedies we could come up with.
May 13 2016
prev sibling next sibling parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
 Ideally, algorithms would be Unicode aware as appropriate, but 
 the default would be to operate on code units with wrappers to 
 handle decoding by code point or grapheme. Then it's easy to 
 write fast code while still allowing for full correctness. 
 Granted, it's not necessarily easy to get correct code that 
 way, but anyone who wants fully correctness without caring 
 about efficiency can just use ranges of graphemes. Ranges of 
 code points are rare regardless.
char[], wchar[] etc. can simply be made non-ranges, so that the user has to choose between .byCodePoint, .byCodeUnit (or .representation as it already exists), .byGrapheme, or even higher-level units like .byLine or .byWord. Ranges of char, wchar however stay as they are today. That way it's harder to accidentally get it wrong.
 Based on what I've seen in previous conversations on 
 auto-decoding over the past few years (be it in the newsgroup, 
 on github, or at dconf), most of the core devs think that 
 auto-decoding was a major blunder that we continue to pay for. 
 But unfortunately, even if we all agree that it was a huge 
 mistake and want to fix it, the question remains of how to do 
 that without breaking tons of code - though since AFAIK, Andrei 
 is still in favor of auto-decoding, we'd have a hard time going 
 forward with plans to get rid of it even if we had come up with 
 a good way of doing so. But I would love it if we could get rid 
 of auto-decoding and clean up string handling in D.
There is a simple deprecation path that's already been suggested. `isInputRange` and friends can output a helpful deprecation warning when they're called with a range that currently triggers auto-decoding.
May 13 2016
prev sibling parent reply Kagamin <spam here.lot> writes:
On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
 IIRC, Andrei talked in TDPL about how Java's choice to go with 
 UTF-16 was worse than the choice to go with UTF-8, because it 
 was correct in many more cases
UTF-16 was a migration from UCS-2, and UCS-2 was superior at the time.
May 13 2016
parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, May 13, 2016 12:52:13 Kagamin via Digitalmars-d wrote:
 On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
 IIRC, Andrei talked in TDPL about how Java's choice to go with
 UTF-16 was worse than the choice to go with UTF-8, because it
 was correct in many more cases
UTF-16 was a migration from UCS-2, and UCS-2 was superior at the time.
The history of why UTF-16 was chosen isn't really relevant to my point (Win32 has the same problem as Java and for similar reasons).

My point was that if you use UTF-8, then it's obvious _really_ fast when you screwed up Unicode-handling by treating a code unit as a character, because anything beyond ASCII is going to fall flat on its face. But with UTF-16, a _lot_ more code points are representable as a single code unit - as well as a single grapheme - so it's far easier to write code that treats a code unit as if it were a full character without realizing that you're screwing it up. UTF-8 is fail-fast in this regard, whereas UTF-16 is not.

UTF-32 takes that problem to a new level, because now you'll only notice problems when you're dealing with a grapheme constructed of multiple code points. So, odds are that even if you test with Unicode strings, you won't catch the bugs. It'll work 99% of the time, and you'll get subtle bugs the rest of the time.

There are reasons to operate at the code point level, but in general, you either want to be operating at the code unit level or the grapheme level, not the code point level, and if you don't know what you're doing, then anything other than the grapheme level is likely going to be wrong if you're manipulating individual characters. Fortunately, a lot of string processing doesn't need to operate on individual characters, and as long as the standard library functions get it right, you'll tend to be okay, but still, operating at the code point level is almost always wrong, and it's even harder to catch when it's wrong than when treating UTF-16 code units as characters.

- Jonathan M Davis
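A tiny sketch of that fail-fast property, assuming a naive "length == number of characters" mindset; the literals are arbitrary examples:

    void main()
    {
        string  u8  = "\u00e9";         // é, a single code point
        wstring u16 = "\u00e9"w;
        assert(u8.length  == 2);        // wrong immediately for a very common character
        assert(u16.length == 1);        // looks right...

        wstring emoji = "\U0001F600"w;  // outside the BMP
        assert(emoji.length == 2);      // ...until a surrogate pair shows up
    }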
May 13 2016
parent reply Kagamin <spam here.lot> writes:
On Friday, 13 May 2016 at 21:46:28 UTC, Jonathan M Davis wrote:
 The history of why UTF-16 was chosen isn't really relevant to 
 my point (Win32 has the same problem as Java and for similar 
 reasons).

 My point was that if you use UTF-8, then it's obvious _really_ 
 fast when you screwed up Unicode-handling by treating a code 
 unit as a character, because anything beyond ASCII is going to 
 fall flat on its face.
On the other hand, if you deal with UTF-16 text, you can't interpret it as anything other than UTF-16: people either get it correct or give up, even for ASCII, even with casts; it's that resilient. With UTF-8, problems happened on a massive scale in LAMP setups: MySQL used latin1 as a default encoding and almost everything worked fine.
May 17 2016
parent sarn <sarn theartofmachinery.com> writes:
On Tuesday, 17 May 2016 at 09:53:17 UTC, Kagamin wrote:
 With UTF-8 problems happened on a massive scale in LAMP setups: 
 mysql used latin1 as a default encoding and almost everything 
 worked fine.
^ latin-1 with Swedish collation rules.

And even if you set the encoding to "utf8", almost everything works fine until you discover that you need to set the encoding to "utf8mb4" to get real utf8.

Also, MySQL has per-connection character encoding settings, so even if your application is properly set up to use utf8, you can break things by accidentally connecting with a client using the default pretty-much-latin1 encoding. With MySQL's "silently ram the square peg into the round hole" design philosophy, this can cause data corruption. But, of course, almost everything works fine.

Just some examples of why broken utf8 exists (and some venting of MySQL trauma).
May 17 2016
prev sibling next sibling parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 7. Autodecode cannot be used with unicode path/filenames, 
 because it is legal (at least on Linux) to have invalid UTF-8 
 as filenames. It turns out in the wild that pure Unicode is not 
 universal - there's lots of dirty Unicode that should remain 
 unmolested, and autocode does not play with that.
This just means that filenames mustn't be represented as strings; it's unrelated to auto decoding.
May 13 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/13/2016 3:43 AM, Marc Schütz wrote:
 On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 7. Autodecode cannot be used with unicode path/filenames, because it is legal
 (at least on Linux) to have invalid UTF-8 as filenames. It turns out in the
 wild that pure Unicode is not universal - there's lots of dirty Unicode that
 should remain unmolested, and autocode does not play with that.
This just means that filenames mustn't be represented as strings; it's unrelated to auto decoding.
It means much more than that; filenames are just an example. I recently fixed MicroEmacs (my text editor) to assume the source is UTF-8 and display Unicode characters. But it still needs to work with dirty UTF-8 without throwing exceptions, modifying the text in-place, or other tantrums.
May 13 2016
prev sibling next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/12/16 4:15 PM, Walter Bright wrote:

 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
 benefit of being arrays in the first place.
I'll repeat what I said in the other thread.

The problem isn't auto-decoding. The problem is hijacking the char[] and wchar[] (and variants) array type to mean autodecoding non-arrays.

If you think this code makes sense, then my definition of sane varies slightly from yours:

static assert(!hasLength!R && is(typeof(R.init.length)));
static assert(!is(ElementType!R == typeof(R.init[0])));
static assert(!isRandomAccessRange!R && is(typeof(R.init[0])) && is(typeof(R.init[0 .. $])));

I think D would be fine if string meant some auto-decoding struct with an immutable(char)[] array backing. I can accept and work with that. I can transform that into a char[] that makes sense if I have no use for auto-decoding.

As of today, I have to use byCodePoint, or .representation, etc. and it's very unwieldy.

If I ran D, that's what I would do.

-Steve
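A rough sketch of what such a struct could look like - purely hypothetical, not a proposed Phobos type - with the raw code units still reachable through the data member:

    struct String
    {
        immutable(char)[] data;   // the raw code units, always reachable

        // range primitives that decode on the fly
        bool empty() const { return data.length == 0; }

        dchar front() const
        {
            import std.utf : decodeFront;
            auto tmp = data;          // decodeFront advances its argument
            return decodeFront(tmp);
        }

        void popFront()
        {
            import std.utf : stride;
            data = data[stride(data) .. $];
        }
    }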
May 13 2016
parent reply Alex Parrill <initrd.gz gmail.com> writes:
On Friday, 13 May 2016 at 16:05:21 UTC, Steven Schveighoffer 
wrote:
 On 5/12/16 4:15 PM, Walter Bright wrote:

 10. Autodecoded arrays cannot be RandomAccessRanges, losing a 
 key
 benefit of being arrays in the first place.
I'll repeat what I said in the other thread. The problem isn't auto-decoding. The problem is hijacking the char[] and wchar[] (and variants) array type to mean autodecoding non-arrays. If you think this code makes sense, then my definition of sane varies slightly from yours: static assert(!hasLength!R && is(typeof(R.init.length))); static assert(!is(ElementType!R == R.init[0])); static assert(!isRandomAccessRange!R && is(typeof(R.init[0])) && is(typeof(R.init[0 .. $]))); I think D would be fine if string meant some auto-decoding struct with an immutable(char)[] array backing. I can accept and work with that. I can transform that into a char[] that makes sense if I have no use for auto-decoding. As of today, I have to use byCodePoint, or .representation, etc. and it's very unwieldy. If I ran D, that's what I would do. -Steve
Well, the "auto" part of autodecoding means "automatically doing it for plain strings", right? If you explicitly do decoding, I think it would just be "decoding"; there's no "auto" part. I doubt anyone is going to complain if you add in a struct wrapper around a string that iterates over code units or graphemes. The issue most people have, as you say, is the fact that the default for strings is to decode.
May 13 2016
parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/13/16 5:25 PM, Alex Parrill wrote:
 On Friday, 13 May 2016 at 16:05:21 UTC, Steven Schveighoffer wrote:
 On 5/12/16 4:15 PM, Walter Bright wrote:

 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
 benefit of being arrays in the first place.
I'll repeat what I said in the other thread. The problem isn't auto-decoding. The problem is hijacking the char[] and wchar[] (and variants) array type to mean autodecoding non-arrays. If you think this code makes sense, then my definition of sane varies slightly from yours: static assert(!hasLength!R && is(typeof(R.init.length))); static assert(!is(ElementType!R == R.init[0])); static assert(!isRandomAccessRange!R && is(typeof(R.init[0])) && is(typeof(R.init[0 .. $]))); I think D would be fine if string meant some auto-decoding struct with an immutable(char)[] array backing. I can accept and work with that. I can transform that into a char[] that makes sense if I have no use for auto-decoding. As of today, I have to use byCodePoint, or .representation, etc. and it's very unwieldy. If I ran D, that's what I would do.
Well, the "auto" part of autodecoding means "automatically doing it for plain strings", right? If you explicitly do decoding, I think it would just be "decoding"; there's no "auto" part.
No, the problem isn't the auto-decoding. The problem is having *arrays* do that. Sometimes.

I would be perfectly fine with a custom string type that all string literals were typed as, as long as I can get a sanely behaving array out of it.
 I doubt anyone is going to complain if you add in a struct wrapper
 around a string that iterates over code units or graphemes. The issue
 most people have, as you say, is the fact that the default for strings
 is to decode.
I want to clarify that I don't really care if strings by default auto-decode. I think that's fine. What I dislike is that immutable(char)[] auto-decodes. -Steve
May 13 2016
prev sibling next sibling parent reply Jon D <jond noreply.com> writes:
On Thursday, 12 May 2016 at 20:15:45 UTC, Walter Bright wrote:
 On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
 I am as unclear about the problems of autodecoding as I am
about the necessity
 to remove curl. Whenever I ask I hear some arguments that
work well emotionally
 but are scant on reason and engineering. Maybe it's time to
rehash them? I just
 did so about curl, no solid argument seemed to come together.
I'd be curious of
 a crisp list of grievances about autodecoding. -- Andrei
Given the importance of performance in the auto-decoding topic, it seems reasonable to quantify it. I took a stab at this. It would of course be prudent to have others conduct similar analysis rather than rely on my numbers alone.

Measurements were done using an artificial scenario, counting lower-case ascii letters. This had the effect of calling front/popFront many times on a long block of text. Runs were done both treating the text as char[] and ubyte[] and comparing the run times. (char[] performs auto-decoding, ubyte[] does not.)

Timings were done with DMD and LDC, and on two different data sets. One data set was a mix of latin languages (e.g. German, English, Finnish, etc.), the other non-Latin languages (e.g. Japanese, Chinese, Greek, etc.). The goal being to distinguish between scenarios with high and low Ascii character content.

The result: For DMD, auto-decoding showed a 1.6x to 2.6x cost. For LDC, a 12.2x to 12.9x cost.

Details:
- Test program: https://dpaste.dzfl.pl/67c7be11301f
- DMD 2.071.0. Options: -release -O -boundscheck=off -inline
- LDC 1.0.0-beta1 (based on DMD v2.070.2). Options: -release -O -boundscheck=off
- Machine: Macbook Pro (2.8 GHz Intel I7, 16GB ram)

Runs for each combination were done five times and the median times used. The median times and the char[] to ubyte[] ratio are below:

|          |           |    char[] |   ubyte[] |       |
| Compiler | Text type | time (ms) | time (ms) | ratio |
|----------+-----------+-----------+-----------+-------|
| DMD      | Latin     |      7261 |      4513 |   1.6 |
| DMD      | Non-latin |     10240 |      3928 |   2.6 |
| LDC      | Latin     |     11773 |       913 |  12.9 |
| LDC      | Non-latin |     10756 |       883 |  12.2 |

Note: The numbers above don't provide enough info to derive a front/popFront rate. The program artificially makes multiple loops to increase the run-times. (For these runs, the program's repeat-count was set to 20).

Characteristics of the two data sets:

|           |         |         |             | Bytes per |           |
| Text type |   Bytes |  DChars | Ascii Chars |     DChar | Pct Ascii |
|-----------+---------+---------+-------------+-----------+-----------|
| Latin     | 4156697 | 4059016 |     3965585 |     1.024 |     97.7% |
| Non-latin | 4061554 | 1949290 |      348164 |     2.084 |     17.9% |

Run-to-run variability - The run times recorded were quite stable. The largest delta between minimum and median time for any group was 17 milliseconds.
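For orientation, a minimal sketch of the kind of loop such a benchmark times; this is not the actual test program (which is at the dpaste link above), just an illustration of char[] versus ubyte[] iteration:

    import std.range.primitives : empty, front, popFront;

    size_t countLowerAscii(T)(const(T)[] text)
    {
        size_t n = 0;
        auto r = text;
        while (!r.empty)
        {
            auto c = r.front;            // dchar for char[] (decoded), ubyte for ubyte[]
            if (c >= 'a' && c <= 'z')
                ++n;
            r.popFront();                // char[]: skips a whole code point
        }
        return n;
    }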
May 15 2016
next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Sunday, 15 May 2016 at 23:10:38 UTC, Jon D wrote:
 Given the importance of performance in the auto-decoding topic, 
 it seems reasonable to quantify it. I took a stab at this. It 
 would of course be prudent to have others conduct similar 
 analysis rather than rely on my numbers alone.
Here is another benchmark (see the above comment for the code to apply the patch to) that measures the iteration time difference:

http://forum.dlang.org/post/ndj6dm$a6c$1 digitalmars.com

The result is a 756% slow down
May 15 2016
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, May 16, 2016 at 12:31:04AM +0000, Jack Stouffer via Digitalmars-d wrote:
 On Sunday, 15 May 2016 at 23:10:38 UTC, Jon D wrote:
Given the importance of performance in the auto-decoding topic, it
seems reasonable to quantify it. I took a stab at this. It would of
course be prudent to have others conduct similar analysis rather than
rely on my numbers alone.
Here is another benchmark (see the above comment for the code to apply the patch to) that measures the iteration time difference: http://forum.dlang.org/post/ndj6dm$a6c$1 digitalmars.com The result is a 756% slow down
I decided to do my own benchmarking too. Here's the code:

/**
 * Simple-minded benchmark for measuring performance degradation caused by
 * autodecoding.
 */
import std.typecons : Flag, Yes, No;

size_t countNewlines(Flag!"autodecode" autodecode)(const(char)[] input)
{
    size_t count = 0;
    static if (autodecode)
    {
        import std.array;
        foreach (dchar ch; input)
        {
            if (ch == '\n') count++;
        }
    }
    else // !autodecode
    {
        import std.utf : byCodeUnit;
        foreach (char ch; input.byCodeUnit)
        {
            if (ch == '\n') count++;
        }
    }
    return count;
}

void main(string[] args)
{
    import std.datetime : benchmark;
    import std.file : read;
    import std.stdio : writeln, writefln;

    string input = (args.length >= 2) ? args[1]
                                      : "/usr/src/d/phobos/std/datetime.d";
    uint n = 50;
    auto data = cast(char[]) read(input);
    writefln("Input: %s (%d bytes)", input, data.length);

    size_t count;
    writeln("With autodecoding:");
    auto result = benchmark!({
        count = countNewlines!(Yes.autodecode)(data);
    })(n);
    writefln("Newlines: %d Time: %s msecs", count, result[0].msecs);

    writeln("Without autodecoding:");
    result = benchmark!({
        count = countNewlines!(No.autodecode)(data);
    })(n);
    writefln("Newlines: %d Time: %s msecs", count, result[0].msecs);
}

// vim:set sw=4 ts=4 et:

Just for fun, I decided to use std/datetime.d, one of the largest modules in Phobos, as a test case. For comparison, I compiled with dmd (latest git head) and gdc 5.3.1. The compile commands were:

    dmd -O -inline bench.d -ofbench.dmd
    gdc -O3 bench.d -o bench.gdc

Here are the results from bench.dmd:

    Input: /usr/src/d/phobos/std/datetime.d (1464089 bytes)
    With autodecoding:
    Newlines: 35398 Time: 331 msecs
    Without autodecoding:
    Newlines: 35398 Time: 254 msecs

And the results from bench.gdc:

    Input: /usr/src/d/phobos/std/datetime.d (1464089 bytes)
    With autodecoding:
    Newlines: 35398 Time: 253 msecs
    Without autodecoding:
    Newlines: 35398 Time: 25 msecs

These results are pretty typical across multiple runs. There is a variance of about 20 msecs or so between bench.dmd runs, but the bench.gdc runs vary only by about 1-2 msecs.

So for bench.dmd, autodecoding adds about a 30% overhead to running time, whereas for bench.gdc, autodecoding costs an order of magnitude increase in running time.

As an interesting aside, compiling with dmd without -O -inline causes the non-autodecoding case to be actually consistently *slower* than the autodecoding case. Apparently in this case the performance is dominated by the cost of calling non-inlined range primitives on byCodeUnit, whereas a manual for-loop over the array of chars produces similar results to the -O -inline case. I find this interesting, because it shows that the cost of autodecoding is relatively small compared to the cost of unoptimized range primitives. Nevertheless, it does make a big difference when range primitives are properly optimized. It is especially poignant in the case of gdc that, given a superior optimizer, the non-autodecoding case can be made an order of magnitude faster, whereas the autodecoding case is presumably complex enough to defeat the optimizer.


T

-- 
Democracy: The triumph of popularity over principle. -- C.Bond
May 15 2016
prev sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Sunday, 15 May 2016 at 23:10:38 UTC, Jon D wrote:
 Runs for each combination were done five times and the median 
 times used. The median times and the char[] to ubyte[] ratio 
 are below:
 |          |           |    char[] |   ubyte[] |
 | Compiler | Text type | time (ms) | time (ms) | ratio |
 |----------+-----------+-----------+-----------+-------|
 | DMD      | Latin     |      7261 |      4513 |   1.6 |
 | DMD      | Non-latin |     10240 |      3928 |   2.6 |
 | LDC      | Latin     |     11773 |       913 |  12.9 |
 | LDC      | Non-latin |     10756 |       883 |  12.2 |
Interesting that LDC is slower than DMD for char[].
May 16 2016
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
This might be a good time to discuss this a tad further. I'd appreciate 
if the debate stayed on point going forward. Thanks!

My thesis: the D1 design decision to represent strings as char[] was 
disastrous and probably one of the largest weaknesses of D1. The 
decision in D2 to use immutable(char)[] for strings is a vast 
improvement but still has a number of issues. The approach to 
autodecoding in Phobos is an improvement on that decision. The insistent 
shunning of a user-defined type to represent strings is not good and we 
need to rid ourselves of it.

On 05/12/2016 04:15 PM, Walter Bright wrote:
 On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
  > I am as unclear about the problems of autodecoding as I am about the
 necessity
  > to remove curl. Whenever I ask I hear some arguments that work well
 emotionally
  > but are scant on reason and engineering. Maybe it's time to rehash
 them? I just
  > did so about curl, no solid argument seemed to come together. I'd be
 curious of
  > a crisp list of grievances about autodecoding. -- Andrei

 Here are some that are not matters of opinion.

 1. Ranges of characters do not autodecode, but arrays of characters do.
 This is a glaring inconsistency.
Agreed. At the point of that decision, the party line was "arrays of characters are strings, nothing else is or should be". Now it is apparent that shouldn't have been the case.
 2. Every time one wants an algorithm to work with both strings and
 ranges, you wind up special casing the strings to defeat the
 autodecoding, or to decode the ranges. Having to constantly special case
 it makes for more special cases when plugging together components. These
 issues often escape detection when unittesting because it is convenient
 to unittest only with arrays.
This is a consequence of 1. It is at least partially fixable.
 3. Wrapping an array in a struct with an alias this to an array turns
 off autodecoding, another special case.
This is also a consequence of 1.
 4. Autodecoding is slow and has no place in high speed string processing.
I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D. Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing.

Also allow me to point out that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and we should arrange code to minimize impact.
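A sketch of that kind of tactical fast path - not Phobos code, just an illustration of the c < 0x80 test being hoisted in front of the full decode:

    import std.utf : decode;

    dchar frontFast(const(char)[] s)
    {
        // assumes s is non-empty, well-formed UTF-8
        if (s[0] < 0x80)
            return s[0];        // ASCII: one highly predictable compare, no decoding
        size_t i = 0;
        return decode(s, i);    // multi-byte case: full UTF-8 decode
    }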
 5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug):

s.find("abc")
s.findSplit("abc")
s.findSplit('a')
s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However the following do require autodecoding:

s.walkLength
s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
s.count!(c => c >= 32) // non-control characters

Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.
 6. Autodecoding has two choices when encountering invalid code units -
 throw or produce an error dchar. Currently, it throws, meaning no
 algorithms using autodecode can be made nothrow.
Agreed. This is probably the most glaring mistake. I think we should open a discussion on fixing this everywhere in the stdlib, even at the cost of breaking code.
 7. Autodecode cannot be used with unicode path/filenames, because it is
 legal (at least on Linux) to have invalid UTF-8 as filenames. It turns
 out in the wild that pure Unicode is not universal - there's lots of
 dirty Unicode that should remain unmolested, and autocode does not play
 with that.
If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.
 8. In my work with UTF-8 streams, dealing with autodecode has caused me
 considerably extra work every time. A convenient timesaver it ain't.
Objection. Vague.
 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
 importing std.array one way or another, and then autodecode is there.
Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
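For example (a small sketch, not from the post), .representation from std.string exposes the raw code units, so nothing decodes:

    import std.algorithm.searching : count;
    import std.string : representation;

    void main()
    {
        string s = "h\u00ebllo";                    // "hëllo"
        immutable(ubyte)[] raw = s.representation;  // same bytes, ubyte element type
        assert(raw.count!(b => b == 'l') == 2);     // iterates code units, nothing decodes
    }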
 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
 benefit of being arrays in the first place.
First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width.

Second, it's as it should. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[] and why do we need char as a separate type from ubyte? This is a fundamental question for which we need a rigorous answer.

What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce such, except it does special things in random places like for loops (any other)?

If char is to be distinct from ubyte, and char[] is to be distinct from ubyte[], then autodecoding does the right thing: it makes sure they are distinguished in behavior and embodies the assumption that char is, in fact, a UTF8 code point.
 11. Indexing an array produces different results than autodecoding,
 another glaring special case.
This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.

Overall, I think the one way to make real steps forward in improving string processing in the D language is to give a clear answer of what char, wchar, and dchar mean.


Andrei
May 26 2016
next sibling parent Jack Stouffer <jack jackstouffer.com> writes:
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu 
wrote:
 instead, it should use standard library algorithms for 
 searching,
 matching etc. When needed, iterating every code unit is 
 trivially
 done through indexing.
For an example where the std.algorithm/range functions don't cut it, my random format date string parser first breaks up the given character range into tokens. Once it has the tokens, it checks several known formats.

One piece of that is checking if some of the tokens are in AAs of month and day names for fast tests of presence. Because the AAs are int[string], and it's unknowable the encoding of string (it's complicated), during tokenization, the character range must be forced to UTF-8 with byChar with all isSomeString!R == true inputs to avoid the auto-decoding and subsequent AA key mismatch.
 Agreed. This is probably the most glaring mistake. I think we 
 should open a discussion no fixing this everywhere in the 
 stdlib, even at the cost of breaking code.
See the discussion here: https://issues.dlang.org/show_bug.cgi?id=14519 I think some of the proposals there are interesting.
 Overall, I think the one way to make real steps forward in 
 improving string processing in the D language is to give a 
 clear answer of what char, wchar, and dchar mean.
If you agree that iterating over code units and code points isn't what people want/need most of the time, then I will quote something from my article on the subject:

"I really don't see the benefit of the automatic behavior fulfilling this one specific corner case when you're going to make everyone else call a range generating function when they want to iterate over code units or graphemes. Just make everyone call a range generating function to specify the type of iteration and save a lot of people the trouble!"

I think the only clear way forward is to not make strings ranges and force people to make a decision when passing them to range functions. The HUGE problem is the code this will break, which is just about all of it.
May 26 2016
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, May 26, 2016 at 12:00:54PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
[...]
 On 05/12/2016 04:15 PM, Walter Bright wrote:
[...]
 4. Autodecoding is slow and has no place in high speed string processing.
I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D. Also, little code should deal with one code unit or code point at a time; instead, it should use standard library algorithms for searching, matching etc. When needed, iterating every code unit is trivially done through indexing. Also allow me to point that much of the slowdown can be addressed tactically. The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore easily speculated. We can and we should arrange code to minimize impact.
General Unicode strings have a lot of non-ASCII characters. Why are we only optimizing for the ASCII case?
 5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug): s.find("abc") s.findSplit("abc") s.findSplit('a') s.count!(c => "!()-;:,.?".canFind(c)) // punctuation However the following do require autodecoding: s.walkLength s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation s.count!(c => c >= 32) // non-control characters
Question: what should count return, given a string containing (1) combining diacritics, or (2) Korean text? Or (3) zero-width spaces?
 Currently the standard library operates at code point level even
 though inside it may choose to use code units when admissible. Leaving
 such a decision to the library seems like a wise thing to do.
The problem is that often such decisions can only be made by the user, because it depends on what the user wants to accomplish. What should count return, given some Unicode string? If the user wants to determine the size of a buffer (e.g., to store a string minus some characters to be stripped), then count should return the byte count. If the user wants to count the number of matching visual characters, then count should return the number of graphemes. If the user wants to determine the visual width of the (filtered) string, then count should not be used at all, but instead a font metric algorithm. (I can't think of a practical use case where you'd actually need to count code points(!).)

Having the library arbitrarily choose one use case over the others (especially one that seems the least applicable to practical situations) just doesn't seem right to me at all. Rather, the user ought to specify what exactly is to be counted, i.e., s.byCodeUnit.count(), s.byCodePoint.count(), or s.byGrapheme.count().

[...]
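A small example of that ambiguity, where the three plausible answers all differ for the same string (a combining accent chosen deliberately):

    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "e\u0301";                   // 'e' plus a combining acute accent
        assert(s.byCodeUnit.walkLength == 3);   // code units
        assert(s.walkLength == 2);              // code points (today's autodecoded default)
        assert(s.byGrapheme.walkLength == 1);   // graphemes - one visible character
    }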
 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
 importing std.array one way or another, and then autodecode is there.
Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
Therefore, instead of:

	myString.splitter!"abc".joiner!"def".count;

we have to write:

	myString.representation
		.splitter!("abc".representation)
		.joiner!("def".representation)
		.count;

Great.

[...]
 Second, it's as it should. The entire scaffolding rests on the notion
 that char[] is distinguished from ubyte[] by having UTF8 code units,
 not arbitrary bytes. It seems that many arguments against autodecoding
 are in fact arguments in favor of eliminating virtually all
 distinctions between char[] and ubyte[].
That is a strawman. We are not arguing for eliminating the distinction between char[] and ubyte[]. Rather, the complaint is that autodecoding represents a constant overhead in string processing that's often *unnecessary*. Many string operations don't *need* to autodecode, and even those that may seem like they do, are often better implemented differently.

For example, filtering a string by a non-ASCII character can actually be done via substring search -- expand the non-ASCII character into 1 to 6 code units, and then do the equivalent of C's strstr(). This will not have false positives thanks to the way UTF-8 is designed. It eliminates the overhead of decoding every single character -- in implementational terms, it could, for example, first scan for the 1st byte by linear scan through the string without decoding, which is a lot faster than decoding every single character and then comparing with the target. Only when the first byte matches does it need to do the slightly more expensive operation of substring comparison.

Similarly, splitter does not need to operate on code points at all. It's unnecessarily slow that way. Most use cases of splitter have lots of data in between delimiters, which means most of the work done by autodecoding is wasted. Instead, splitter should just scan for the substring to split on -- again the design of UTF-8 guarantees there will be no false positives -- and only put in the effort where it's actually needed: at the delimiters, not the data in between.

The same could be said of joiner, and many other common string algorithms. There aren't many algorithms that actually need to decode; decoding should be restricted to them, rather than an overhead applied across the board.

[...]
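A short sketch of that property - a plain byte-level search over the representation, with no decoding anywhere; the strings are arbitrary examples:

    import std.algorithm.searching : find;
    import std.string : representation;

    void main()
    {
        auto hay    = "se\u00f1al de tr\u00e1fico".representation;
        auto needle = "\u00f1al".representation;
        // plain byte comparison; continuation bytes can never masquerade as start bytes
        assert(hay.find(needle).length != 0);
    }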
 Overall, I think the one way to make real steps forward in improving
 string processing in the D language is to give a clear answer of what
 char, wchar, and dchar mean.
[...]

We already have a clear definition: char, wchar, and dchar are Unicode code units, and the latter is also Unicode code points. That's all there is to it.

If we want Phobos to truly be able to take advantage of the fact that char[], wchar[], dchar[] contain Unicode strings, we need to stop the navel gazing at what byte representations and bits mean, and look at the bigger picture. Consider char[] as a unit in itself, a complete Unicode string -- the actual code units don't really matter, as they are just an implementation detail. What you want to be able to do is for a Phobos algorithm to decide, OK, in order to produce output X, it's faster to do substring scanning, and in order to produce output Y, it's better to decode first. In other words, decoding or not decoding ought to be a decision made at the algorithm level (or higher), depending on the need at hand. It should not be hard-boiled into the lower-level internals of how strings are handled, such that higher-level algorithms are straitjacketed and forced to work with the decoded stream, even when they actually don't *need* decoding to do what they want.

In the cases where Phobos is unable to make a decision (e.g., what should count return -- which depends on what the user is trying to accomplish), it should be left to the user. The user shouldn't have to work against a default setting that only works for a subset of use cases.


T

-- 
Without geometry, life would be pointless. -- VS
May 26 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/26/2016 07:23 PM, H. S. Teoh via Digitalmars-d wrote:
 Therefore, instead of:

 	myString.splitter!"abc".joiner!"def".count;

 we have to write:

 	myString.representation
 		.splitter!("abc".representation)
 		.joiner!("def".representation)
 		.count;
No, that's not necessary (or correct). -- Andrei
May 26 2016
prev sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Thu, 26 May 2016 16:23:16 -0700,
"H. S. Teoh via Digitalmars-d"
<digitalmars-d puremagic.com> wrote:

 On Thu, May 26, 2016 at 12:00:54PM -0400, Andrei Alexandrescu via
Digitalmars-d wrote:
 [...]
 s.walkLength
 s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
 s.count!(c => c >= 32) // non-control characters  
Question: what should count return, given a string containing (1) combining diacritics, or (2) Korean text? Or (3) zero-width spaces?
 Currently the standard library operates at code point level even
 though inside it may choose to use code units when admissible. Leaving
 such a decision to the library seems like a wise thing to do.  
The problem is that often such decisions can only be made by the user, because it depends on what the user wants to accomplish. What should count return, given some Unicode string? If the user wants to determine the size of a buffer (e.g., to store a string minus some characters to be stripped), then count should return the byte count. If the user wants to count the number of matching visual characters, then count should return the number of graphemes. If the user wants to determine the visual width of the (filtered) string, then count should not be used at all, but instead a font metric algorithm. (I can't think of a practical use case where you'd actually need to count code points(!).)
Hey, I was about to answer exactly the same. It reminds me that a few years ago I proposed making string iteration explicit by code-unit, code-point and grapheme in "Rust", and there was virtually no debate about doing it, in the sense that to enable people to write correct code they'd need to understand a bit of Unicode and pick the right primitive. If you don't know what to pick you look it up.

-- 
Marco
May 30 2016
parent reply Andrew Godfrey <X y.com> writes:
I like "make string iteration explicit" but I wonder about other 
constructs. E.g. What about "sort an array of strings"? How would 
you tell a generic sort function whether you want it to interpret 
strings by code unit vs code point vs grapheme?
May 30 2016
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 30 May 2016 at 17:14:47 UTC, Andrew Godfrey wrote:
 I like "make string iteration explicit" but I wonder about 
 other constructs. E.g. What about "sort an array of strings"? 
 How would you tell a generic sort function whether you want it 
 to interpret strings by code unit vs code point vs grapheme?
The comparison predicate does that...

sort!( (string a, string b) {
    /* you interpret a and b here and return the comparison */
})(["hi", "there"]);
May 30 2016
parent Andrew Godfrey <X y.com> writes:
On Monday, 30 May 2016 at 18:26:32 UTC, Adam D. Ruppe wrote:
 On Monday, 30 May 2016 at 17:14:47 UTC, Andrew Godfrey wrote:
 I like "make string iteration explicit" but I wonder about 
 other constructs. E.g. What about "sort an array of strings"? 
 How would you tell a generic sort function whether you want it 
 to interpret strings by code unit vs code point vs grapheme?
The comparison predicate does that... sort!( (string a, string b) { /* you interpret a and b here and return the comparison */ })(["hi", "there"]);
Thanks! You left out some details but I think I see - an example predicate might be "cmp(a.byGrapheme, b.byGrapheme)" and by the looks of it, that code works in D today. (However, "cmp(a, b)" would default to code points today, which is surprising to almost everyone, and that's more what this thread is about).
May 30 2016
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Mon, 30 May 2016 17:14:47 +0000,
Andrew Godfrey <X y.com> wrote:

 I like "make string iteration explicit" but I wonder about other=20
 constructs. E.g. What about "sort an array of strings"? How would=20
 you tell a generic sort function whether you want it to interpret=20
 strings by code unit vs code point vs grapheme?
You are just scratching the surface! Unicode strings are sorted following the Unicode Collation Algorithm, which is described in the 86 pages document here: (http://www.unicode.org/reports/tr10/) which is implemented in the ICU library mentioned before.

Some obvious considerations from the description of the algorithm:

In Sweden z comes before ö, while in Germany it's the reverse.

In Germany, words in a dictionary are sorted differently from lists of names in a phone book. dictionary: of < öf, phone book: öf < of

Spanish sorts 'll' as one character right after 'l'.

The default collation is selected in Windows through the control panel's localization app and on Linux (Posix) using the LC_COLLATE environment variable. The actual string sorting in the user's locale can then be performed with the C library using http://www.cplusplus.com/reference/cstring/strcoll/ or OS specific functions like CompareStringEx on Windows:
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317761%28v=vs.85%29.aspx

TL;DR neither code-points nor grapheme clusters are adequate for string sorting. Also two strings may compare unequal byte for byte, while they are actually the same text in different normalization forms. (E.g. Umlauts on OS X (NFD) vs. rest of the world (NFC)). Admittedly I find myself using str1 == str2 without first normalizing both, because it is frigging convenient and fast.

-- 
Marco
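A minimal sketch of the strcoll route from D, assuming the process locale is the one you want to collate by; the word list is just an example:

    import core.stdc.locale : LC_COLLATE, setlocale;
    import core.stdc.string : strcoll;
    import std.algorithm.sorting : sort;
    import std.string : toStringz;

    void main()
    {
        setlocale(LC_COLLATE, "");   // adopt the user's locale
        auto words = ["\u00f6f", "of", "zebra"];
        // locale-aware ordering through the C runtime
        words.sort!((a, b) => strcoll(a.toStringz, b.toStringz) < 0);
    }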
May 30 2016
prev sibling next sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu 
wrote:
 4. Autodecoding is slow and has no place in high speed string 
 processing.
I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D.
It is completely wasted mental effort.
 5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug): s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
As far as I can see, the language currently does not provide the facilities to implement the above without autodecoding.
 However the following do require autodecoding:

 s.walkLength
Usage of the result of this expression will be incorrect in many foreseeable cases.
 s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
Ditto.
 s.count!(c => c >= 32) // non-control characters
Ditto, with a big red flag. If you are dealing with control characters, the code is likely low-level enough that you need to be explicit in what you are counting. It is likely not what actually needs to be counted. Such confusion can lead to security risks.
 Currently the standard library operates at code point level 
 even though inside it may choose to use code units when 
 admissible. Leaving such a decision to the library seems like a 
 wise thing to do.
It should be explicit.
 7. Autodecode cannot be used with unicode path/filenames, 
 because it is
 legal (at least on Linux) to have invalid UTF-8 as filenames. 
 It turns
 out in the wild that pure Unicode is not universal - there's 
 lots of
 dirty Unicode that should remain unmolested, and autocode does 
 not play
 with that.
If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.
This is not practical. Do you really see changing std.file and std.path to accept ubyte[] for all path arguments?
 8. In my work with UTF-8 streams, dealing with autodecode has 
 caused me
 considerably extra work every time. A convenient timesaver it 
 ain't.
Objection. Vague.
I can confirm this vague subjective observation. For example, DustMite reimplements some std.string functions in order to be able to handle D files with invalid UTF-8 characters.
 9. Autodecode cannot be turned off, i.e. it isn't practical to 
 avoid
 importing std.array one way or another, and then autodecode is 
 there.
Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
This is neither easy nor practical. It makes writing reliable string handling code a chore in D. Because it is difficult to find all places where this must be done, it is not possible to do on a program-wide scale, thus bugs can only be discovered when this or that component fails because it was not tested with Unicode strings.
 10. Autodecoded arrays cannot be RandomAccessRanges, losing a 
 key
 benefit of being arrays in the first place.
First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width. Second, it's as it should. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[] and why do we need char as a separate type from ubyte? This is a fundamental question for which we need a rigorous answer.
Why?
 What is the purpose of char, wchar, and dchar? My current 
 understanding is that they're justified as pretty much 
 indistinguishable in primitives and behavior from ubyte, 
 ushort, and uint respectively, but they reflect a loose 
 subjective intent from the programmer that they hold actual UTF 
 code units. The core language does not enforce such, except it 
 does special things in random places like for loops (any other)?

 If char is to be distinct from ubyte, and char[] is to be 
 distinct from ubyte[], then autodecoding does the right thing: 
 it makes sure they are distinguished in behavior and embodies 
 the assumption that char is, in fact, a UTF8 code point.
I don't follow this line of reasoning at all.
 11. Indexing an array produces different results than 
 autodecoding,
 another glaring special case.
This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.
There is no convincing argument why indexing and slicing should not simply operate on code units.
 Overall, I think the one way to make real steps forward in 
 improving string processing in the D language is to give a 
 clear answer of what char, wchar, and dchar mean.
I don't follow. Though, making char implicitly convertible to wchar and dchar has clearly been a mistake.
May 26 2016
parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, May 27, 2016 04:31:49 Vladimir Panteleev via Digitalmars-d wrote:
 On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu
 9. Autodecode cannot be turned off, i.e. it isn't practical to
 avoid
 importing std.array one way or another, and then autodecode is
 there.
Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
This is neither easy nor practical. It makes writing reliable string handling code a chore in D. Because it is difficult to find all places where this must be done, it is not possible to do on a program-wide scale, thus bugs can only be discovered when this or that component fails because it was not tested with Unicode strings.
In addition, as soon as you have ubyte[], none of the string-related functions work. That's fixable, but as it stands, operating on ubyte[] instead of char[] is a royal pain. - Jonathan M Davis
May 31 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 02:57 PM, Jonathan M Davis via Digitalmars-d wrote:
 In addition, as soon as you have ubyte[], none of the string-related
 functions work. That's fixable, but as it stands, operating on ubyte[]
 instead of char[] is a royal pain.
That'd be nice to fix indeed. Please break the ground? -- Andrei
May 31 2016
prev sibling next sibling parent reply Kagamin <spam here.lot> writes:
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu 
wrote:
 11. Indexing an array produces different results than 
 autodecoding,
 another glaring special case.
This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.
Sounds like you want to say that string should be smarter than an array of code units in dealing with unicode. As I understand, the design rationale behind strings being plain arrays of code units is that it's impractical for the string to be smarter than an array of code units - it just won't cut it, while a plain array provides a simple and easy to understand implementation of string.
May 27 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 6:26 AM, Kagamin wrote:
 As I understand, design rationale
 behind strings being plain arrays of code units is that it's impractical
 for the string to smarter than array of code units - it just won't cut
 it, while plain array provides simple and easy to understand
 implementation of string.
That's my understanding too. And I think the design rationale is wrong. -- Andrei
May 27 2016
prev sibling next sibling parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu 
wrote:
 This might be a good time to discuss this a tad further. I'd 
 appreciate if the debate stayed on point going forward. Thanks!

 My thesis: the D1 design decision to represent strings as 
 char[] was disastrous and probably one of the largest 
 weaknesses of D1. The decision in D2 to use immutable(char)[] 
 for strings is a vast improvement but still has a number of 
 issues. The approach to autodecoding in Phobos is an 
 improvement on that decision.
It is not, which has been shown by various posts in this thread. Iterating by code points is at least as wrong as iterating by code units; it can be argued it is worse because it sometimes makes the fact that it's wrong harder to detect.
 The insistent shunning of a user-defined type to represent 
 strings is not good and we need to rid ourselves of it.
While this may be true, it has nothing to do with auto decoding. I assume you would want such a user-define string type to auto-decode as well, right?
 On 05/12/2016 04:15 PM, Walter Bright wrote:
 5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug): s.find("abc") s.findSplit("abc") s.findSplit('a')
Yes.
 s.count!(c => "!()-;:,.?".canFind(c)) // punctuation
Ideally yes, but this is a special case that cannot be detected by `count`.
 However the following do require autodecoding:

 s.walkLength
 s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
 s.count!(c => c >= 32) // non-control characters
No, they do not need _auto_ decoding, they need a decision _by the user_ what they should be decoded to. Code units? Code points? Graphemes? Words? Lines?
 Currently the standard library operates at code point level
Because it auto decodes.
 even though inside it may choose to use code units when 
 admissible. Leaving such a decision to the library seems like a 
 wise thing to do.
No one wants to take that second part away. For example, the `find` can provide an overload that accepts `const(char)[]` directly, while `walkLength` doesn't, requiring a decision by the caller.
 7. Autodecode cannot be used with unicode path/filenames, 
 because it is
 legal (at least on Linux) to have invalid UTF-8 as filenames. 
 It turns
 out in the wild that pure Unicode is not universal - there's 
 lots of
 dirty Unicode that should remain unmolested, and autocode does 
 not play
 with that.
If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.
I believe a library type would be more appropriate than bare `ubyte[]`. It should provide conversion between the OS encoding (which can be detected automatically) and UTF strings, for example. And it should be used for any "strings" that comes from outside the program, like main's arguments, env variables...
 9. Autodecode cannot be turned off, i.e. it isn't practical to 
 avoid
 importing std.array one way or another, and then autodecode is 
 there.
Turning off autodecoding is as easy as inserting .representation after any string. (Not to mention using indexing directly.)
This would no longer work if char[] and char ranges were to be treated identically.
 10. Autodecoded arrays cannot be RandomAccessRanges, losing a 
 key
 benefit of being arrays in the first place.
First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width. Second, it's as it should. The entire scaffolding rests on the notion that char[] is distinguished from ubyte[] by having UTF8 code units, not arbitrary bytes. It seems that many arguments against autodecoding are in fact arguments in favor of eliminating virtually all distinctions between char[] and ubyte[]. Then the natural question is, what _is_ the difference between char[] and ubyte[] and why do we need char as a separate type from ubyte? This is a fundamental question for which we need a rigorous answer. What is the purpose of char, wchar, and dchar? My current understanding is that they're justified as pretty much indistinguishable in primitives and behavior from ubyte, ushort, and uint respectively, but they reflect a loose subjective intent from the programmer that they hold actual UTF code units. The core language does not enforce such, except it does special things in random places like for loops (any other)?
Agreed.
 If char is to be distinct from ubyte, and char[] is to be 
 distinct from ubyte[], then autodecoding does the right thing: 
 it makes sure they are distinguished in behavior and embodies 
 the assumption that char is, in fact, a UTF8 code point.
Distinguishing them is the right thing to do, but auto decoding is not the way to achieve that, see above.
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 6:56 AM, Marc Schütz wrote:
 It is not, which has been shown by various posts in this thread.
Couldn't quite find strong arguments. Could you please be more explicit on which you found most convincing? -- Andrei
May 27 2016
parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Friday, 27 May 2016 at 13:34:33 UTC, Andrei Alexandrescu wrote:
 On 5/27/16 6:56 AM, Marc Schütz wrote:
 It is not, which has been shown by various posts in this 
 thread.
Couldn't quite find strong arguments. Could you please be more explicit on which you found most convincing? -- Andrei
There are several possibilities of what iteration over a char range can mean. (For the sake of simplicity, let's ignore special cases like `find` and `split`; instead, let's look at `walkLength`, `retro` and similar.)

BEFORE the introduction of auto decoding, it used to iterate over UTF8 code _units_, which is wrong for any non-ASCII data (except for the unlikely case where you really want code units).

AFTER the introduction of auto decoding, it iterates over UTF8 code _points_, which is wrong for combined characters, e.g. äöüéòàñ on MacOS X, more "exotic" ones everywhere (except for the even more unlikely case where you really want code points).

That is, both the BEFORE and AFTER behaviour are wrong, both break for various kinds of input in different ways. So, is AFTER an improvement over BEFORE? The set of inputs where auto decoding produces wrong output is likely smaller, making it slightly less likely to encounter problems in practice; on the other hand, it's still wrong, and it's harder to find these problems during testing. That's like "improving" a bicycle so that it only breaks down after riding it for 30 minutes instead of just after 10 minutes, so you won't notice it during a test ride.

But there are even more possibilities. It could iterate over graphemes, which is expensive, but more likely to produce the results that the user wants. Or it could iterate by lines, or words (and there are different ways to define what a word is), and so on. The fundamental problem is choosing one of those possibilities over the others without knowing what the user actually wants, which is what both BEFORE and AFTER do.

So, what was the original goal when introducing auto decoding? To improve correctness, right? I would argue that this goal has not been achieved. Have a look at the article [1], which IMO gives good criteria for how a _correct_ string type should behave. Both BEFORE and AFTER fail most of them.

[1] https://mortoray.com/2013/11/27/the-string-type-is-broken/
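A concrete case of the AFTER behaviour still being wrong: reversing a string by code points (which is what retro does on an autodecoded string) moves a combining accent onto the wrong letter, even though every code point was handled "correctly":

    import std.array : array;
    import std.range : retro;

    void main()
    {
        string s = "a\u0301b";          // 'a' + combining acute accent, then 'b'
        auto reversed = s.retro.array;  // autodecoding reverses code points
        // the accent now follows 'b' instead of 'a'
        assert(reversed == "b\u0301a"d);
    }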
May 28 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/28/16 6:59 AM, Marc Schütz wrote:
 The fundamental problem is choosing one of those possibilities over the
 others without knowing what the user actually wants, which is what both
 BEFORE and AFTER do.
OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string, and furthermore iterating for each constituent of a string should be fairly rare. Strings and substrings yes, but not individual points/units/graphemes unless expressly asked. (Indeed some languages treat strings as first-class entities and individual characters are mere short substrings.) So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives. Andrei
May 28 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
 So it harkens back to the original mistake: strings should NOT be arrays with
 the respective primitives.
An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required. A string class does not do that (from the article: "I admit the correct answer is not always clear").
May 28 2016
next sibling parent reply Andrew Godfrey <X y.com> writes:
On Saturday, 28 May 2016 at 19:04:14 UTC, Walter Bright wrote:
 On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
 So it harkens back to the original mistake: strings should NOT 
 be arrays with
 the respective primitives.
An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required. A string class does not do that (from the article: "I admit the correct answer is not always clear").
You're right. An "array of code units" is a very useful low-level primitive. I've dealt with a lot of code that uses these (more or less correctly) in various languages. But when providing such a thing, I think it's very important to make it *look* like a low-level primitive, and to use the type system to distinguish it from higher-level ones. E.g. a string literal should not implicitly convert into an array of code units.

What should it implicitly convert to? I'm not sure. Something close to how it looks in the source code, probably. A sequential range of graphemes? From all the detail in this thread, I wonder now if "a grapheme" is even an unambiguous concept across different environments.

But one thing I'm sure of (and this is from other languages/APIs, not from D specifically): a function which converts from one representation to another, but doesn't keep track of the change (e.g. a different compile-time type; e.g. state in a "string" class about whether it is in normalized form), is a "bug farm".
May 28 2016
parent reply Chris <wendlec tcd.ie> writes:
On Saturday, 28 May 2016 at 22:29:12 UTC, Andrew Godfrey wrote:
[snip]
 From all the detail in this thread, I wonder now if "a 
 grapheme" is even an unambiguous concept across different 
 environments.
Unicode graphemes are not always the same as graphemes in natural (written) languages. If <é> is composed in Unicode, it is still one grapheme in a written language, not two distinct characters. However, in natural languages two characters can be one grapheme, as in English <sh>, it represents the sound in `shower, shop, fish`. In German the same sound is represented by three characters <sch> as in `Schaf` ("sheep"). A bit nit-picky but we should make clear that we talk about "Unicode graphemes" that map to single characters on the written page. But is that at all possible across all languages? To avoid confusion and misunderstandings we should agree on the terminology first.
May 29 2016
parent reply Tobias Müller <troplin bluewin.ch> writes:
On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
 Unicode graphemes are not always the same as graphemes in 
 natural (written) languages. If <é> is composed in Unicode, it 
 is still one grapheme in a written language, not two distinct 
 characters. However, in natural languages two characters can be 
 one grapheme, as in English <sh>, it represents the sound in 
 `shower, shop, fish`. In German the same sound is represented 
 by three characters <sch> as in `Schaf` ("sheep"). A bit 
 nit-picky but we should make clear that we talk about "Unicode 
 graphemes" that map to single characters on the written page. 
 But is that at all possible across all languages?

 To avoid confusion and misunderstandings we should agree on the 
 terminology first.
No, this is well established terminology, you are confusing several things here:

- A grapheme is a "character" as written on the page
- A phoneme is a spoken "character"
- A codepoint is the fundamental "unit" of unicode

Graphemes are built from one or more codepoints. Phonemes are a different topic and not really covered by the unicode standard AFAIK. Except for the IPA notation, but these are again graphemes that represent phonemes.
May 29 2016
next sibling parent reply default0 <Kevin.Labschek gmx.de> writes:
On Sunday, 29 May 2016 at 11:47:30 UTC, Tobias Müller wrote:
 On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
 Unicode graphemes are not always the same as graphemes in 
 natural (written) languages. If <é> is composed in Unicode, it 
 is still one grapheme in a written language, not two distinct 
 characters. However, in natural languages two characters can 
 be one grapheme, as in English <sh>, it represents the sound 
 in `shower, shop, fish`. In German the same sound is 
 represented by three characters <sch> as in `Schaf` ("sheep"). 
 A bit nit-picky but we should make clear that we talk about 
 "Unicode graphemes" that map to single characters on the 
 written page. But is that at all possible across all languages?

 To avoid confusion and misunderstandings we should agree on 
 the terminology first.
No, this is well established terminology, you are confusing several things here: - A grapheme is a "character" as written on the page - A phoneme is a spoken "character" - A codepoint is the fundamental "unit" of unicode Graphemes are built from one or more codepoints. Phonemes are a different topic and not really covered by the unicode standard AFAIK. Except for the IPA notation, but these are again graphemes that represent phonemes.
I am pretty sure that a single grapheme in unicode does not correspond to your notion of "character". I am pretty sure that what you think of as a "character" is officially called "Grapheme Cluster" not "Grapheme". See here: http://www.unicode.org/glossary/#grapheme_cluster
May 29 2016
parent reply Tobias M <troplin bluewin.ch> writes:
On Sunday, 29 May 2016 at 12:08:52 UTC, default0 wrote:
 I am pretty sure that a single grapheme in unicode does not 
 correspond to your notion of "character". I am pretty sure that 
 what you think of as a "character" is officially called 
 "Grapheme Cluster" not "Grapheme".
Grapheme is a linguistic term. AFAIUI, a grapheme cluster is a cluster of codepoints representing a grapheme. It's called "cluster" in the unicode spec, because there is no dedicated grapheme unit. I put "character" into quotes, because the term is not really well defined. I just used it for a short and concise answer. I'm sure there's a better/more correct definition of grapheme/phoneme, but it's probably also much longer and more complicated.
May 29 2016
parent Chris <wendlec tcd.ie> writes:
On Sunday, 29 May 2016 at 13:04:18 UTC, Tobias M wrote:
 On Sunday, 29 May 2016 at 12:08:52 UTC, default0 wrote:
 I am pretty sure that a single grapheme in unicode does not 
 correspond to your notion of "character". I am pretty sure 
 that what you think of as a "character" is officially called 
 "Grapheme Cluster" not "Grapheme".
Grapheme is a linguistic term. AFAIUI, a grapheme cluster is a cluster of codepoints representing a grapheme. It's called "cluster" in the unicode spec, because there there is no dedicated grapheme unit.
 I put "character" into quotes, because the term is not really 
 well defined. I just used it for a short and pregnant answer. 
 I'm sure there's a better/more correct definition of 
 graphem/phoneme, but it's probably also much longer and 
 complicated.
Which is why we need to agree on a terminology, i.e. be clear when we use linguistic terms and when we use Unicode specific terminology.
May 29 2016
prev sibling next sibling parent reply Chris <wendlec tcd.ie> writes:
On Sunday, 29 May 2016 at 11:47:30 UTC, Tobias Müller wrote:
 On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
 Unicode graphemes are not always the same as graphemes in 
 natural (written) languages. If <é> is composed in Unicode, it 
 is still one grapheme in a written language, not two distinct 
 characters. However, in natural languages two characters can 
 be one grapheme, as in English <sh>, it represents the sound 
 in `shower, shop, fish`. In German the same sound is 
 represented by three characters <sch> as in `Schaf` ("sheep"). 
 A bit nit-picky but we should make clear that we talk about 
 "Unicode graphemes" that map to single characters on the 
 written page. But is that at all possible across all languages?

 To avoid confusion and misunderstandings we should agree on 
 the terminology first.
No, this is well established terminology, you are confusing several things here: - A grapheme is a "character" as written on the page - A phoneme is a spoken "character" - A codepoint is the fundamental "unit" of unicode Graphemes are built from one or more codepoints. Phonemes are a different topic and not really covered by the unicode standard AFAIK. Except for the IPA notation, but these are again graphemes that represent phonemes.
Ok, you have a point there; to be precise, <sh> is a multigraph (a digraph) (cf. [1]). In French you can have multigraphs consisting of three or more characters, <eau> /o/, as in Irish <aoi> => /i:/. However, a phoneme is not necessarily a spoken "character", as <sh> represents one phoneme but consists of two "characters" or graphemes. <th> can represent two different phonemes (voiced and unvoiced "th" as in `this` vs. `thorough`).

My point was that we have to be _very_ careful not to mix our cultural experience with written text with machine representations. There's bound to be confusion. That's why we should always make clear what we refer to when we use the words grapheme, character, code point etc.

[1] https://en.wikipedia.org/wiki/Grapheme
May 29 2016
parent reply Tobias M <troplin bluewin.ch> writes:
On Sunday, 29 May 2016 at 12:41:50 UTC, Chris wrote:
 Ok, you have a point there, to be precise <sh> is a multigraph 
 (a digraph)(cf. [1]). In French you can have multigraphs 
 consisting of three or more characters <eau> /o/, as in Irish 
 <aoi> => /i:/. However, a phoneme is not necessarily a spoken 
 "character" as <sh> represents one phoneme but consists of two 
 "characters" or graphemes. <th> can represent two different 
 phonemes (voiced and unvoiced "th" as in `this` vs. `thorough`).
What I meant was, a phoneme is the "character" (smallest unit) in a spoken language, not that it corresponds to a character (whatever that means).
 My point was that we have to be _very_ careful not to mix our 
 cultural experience with written text with machine 
 representations. There's bound to be confusion. That's why we 
 should always make clear what we refer to when we use the words 
 grapheme, character, code point etc.
I used 'character' in quotes, because it's not a well defined term. Code point, grapheme and phoneme are well defined.
May 29 2016
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sun, May 29, 2016 at 01:13:36PM +0000, Tobias M via Digitalmars-d wrote:
 On Sunday, 29 May 2016 at 12:41:50 UTC, Chris wrote:
 Ok, you have a point there, to be precise <sh> is a multigraph (a
 digraph)(cf. [1]). In French you can have multigraphs consisting of
 three or more characters <eau> /o/, as in Irish <aoi> => /i:/.
 However, a phoneme is not necessarily a spoken "character" as <sh>
 represents one phoneme but consists of two "characters" or
 graphemes. <th> can represent two different phonemes (voiced and
 unvoiced "th" as in `this` vs. `thorough`).
What I meant was, a phoneme is the "character" (smallest unit) in a spoken language, not that it corresponds to a character (whatever that means).
[...]

Calling a phoneme a "character" is misleading. A phoneme is a logical sound unit in a spoken language, whereas a "character" is a unit of written language. The two do not necessarily have a direct correspondence (or even any correspondence whatsoever).

In a language like English, whose writing system was codified many hundreds of years ago, the spoken language has sufficiently diverged from the written language (specifically, in the way words are spelt) that the correspondence between the two is complex at best, downright arbitrary at worst. For example, the 'o' in "women" and the 'i' in "fish" map to the same phoneme, the short /i/, in (common dialects of) spoken English, in spite of being two completely different characters. Therefore conflating "character" and "phoneme" is misleading and is only confusing the issue.

As far as Unicode is concerned, it is a standard for representing *written* text, not spoken language, so concepts like phonemes aren't even relevant in the first place. Let's not get derailed from the present discussion by confusing the two.

T

--
What are you when you run out of Monet? Baroque.
May 29 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/29/2016 5:56 PM, H. S. Teoh via Digitalmars-d wrote:
 As far as Unicode is concerned, it is a standard for representing
 *written* text, not spoken language, so concepts like phonemes aren't
 even relevant in the first place.  Let's not get derailed from the
 present discussion by confusing the two.
As far as D is concerned, we are not going to invent our own concepts around text that is different from Unicode or redefine Unicode terms. Unicode is what it is, and D is going to work with it.
May 29 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/29/2016 4:47 AM, Tobias Müller wrote:
 No, this is well established terminology, you are confusing several things
here:
For D, we should stick with the terminology as defined by Unicode.
May 29 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/28/2016 03:04 PM, Walter Bright wrote:
 On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
 So it harkens back to the original mistake: strings should NOT be
 arrays with
 the respective primitives.
An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.
Nope. Not buying it.
 A string class does not do that
Buying it. -- Andrei
May 30 2016
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 30.05.2016 18:01, Andrei Alexandrescu wrote:
 On 05/28/2016 03:04 PM, Walter Bright wrote:
 On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
 So it harkens back to the original mistake: strings should NOT be
 arrays with
 the respective primitives.
An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.
Nope. Not buying it.
I'm buying it. IMO alias string=immutable(char)[] is the most useful choice, and auto-decoding ideally wouldn't exist.
May 30 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 03:04 PM, Timon Gehr wrote:
 On 30.05.2016 18:01, Andrei Alexandrescu wrote:
 On 05/28/2016 03:04 PM, Walter Bright wrote:
 On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
 So it harkens back to the original mistake: strings should NOT be
 arrays with
 the respective primitives.
An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.
Nope. Not buying it.
I'm buying it. IMO alias string=immutable(char)[] is the most useful choice, and auto-decoding ideally wouldn't exist.
Wouldn't D then be seen (and rightfully so) as largely not supporting Unicode, seeing as its many many core generic algorithms seem to randomly work or not on arrays of characters? Wouldn't ranges - the most important artifact of D's stdlib - default for strings on the least meaningful approach to strings (dumb code units)? Would a smattering of Unicode primitives in std.utf and friends entitle us to claim D had dyed Unicode in its wool? (All are not rhetorical.)

I.e. wouldn't we be in a worse place than now? (This is rhetorical.) The best argument for autodecoding is to contemplate where we'd be without it: the ghetto of Unicode string handling.

I'm not going to debate this further (though I'll look for meaningful answers to the questions above). But this thread has been informative in that it did little to change my conviction that autodecoding is a good thing for D, all things considered (i.e. the wrong decision to not encapsulate string as a separate type distinct from bare array of code units). I'd lie if I said it did nothing. It did, but only a little.

Funny thing is that's not even what's important. What's important is that autodecoding is here to stay - there's no realistic way to eliminate it from D. So the focus should be making autodecoding the best it could ever be.

Andrei
May 30 2016
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, May 30, 2016 at 03:28:38PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 05/30/2016 03:04 PM, Timon Gehr wrote:
 On 30.05.2016 18:01, Andrei Alexandrescu wrote:
 On 05/28/2016 03:04 PM, Walter Bright wrote:
 On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
 So it harkens back to the original mistake: strings should NOT
 be arrays with the respective primitives.
An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.
Nope. Not buying it.
I'm buying it. IMO alias string=immutable(char)[] is the most useful choice, and auto-decoding ideally wouldn't exist.
Wouldn't D then be seen (and rightfully so) as largely not supporting Unicode, seeing as its many many core generic algorithms seem to randomly work or not on arrays of characters?
They already randomly work or not work on ranges of dchar. I hope we don't have to rehash all the examples of why things that seem to work, like count, filter, map, etc., actually *don't* work outside of a very narrow set of languages. The best of all this is that they *both* don't work properly *and* make your program pay for the performance overhead, even when you're not even using them -- thanks to ubiquitous autodecoding.
 Wouldn't ranges - the most important artifact of D's stdlib - default
 for strings on the least meaningful approach to strings (dumb code
 units)?
No, ideally there should *not* be a default range type -- the user needs to specify what he wants to iterate by, whether code unit, code point, or grapheme, etc..
 Would a smattering of Unicode primitives in std.utf and friends
 entitle us to claim D had dyed Unicode in its wool? (All are not
 rhetorical.)
I have no idea what this means.
 I.e. wouldn't be in a worse place than now? (This is rhetorical.) The
 best argument for autodecoding is to contemplate where we'd be without
 it: the ghetto of Unicode string handling.
I've no idea what you're talking about. Without autodecoding we'd actually have faster string handling, and forcing the user to specify the unit of iteration would actually bring more Unicode-awareness which would improve the quality of string handling code, instead of proliferating today's wrong code that just happens to work in some languages but makes a hash of things everywhere else.
 I'm not going to debate this further (though I'll look for meaningful
 answers to the questions above). But this thread has been informative
 in that it did little to change my conviction that autodecoding is a
 good thing for D, all things considered (i.e. the wrong decision to
 not encapsulate string as a separate type distinct from bare array of
 code units). I'd lie if I said it did nothing. It did, but only a
 little.
 
 Funny thing is that's not even what's important. What's important is
 that autodecoding is here to stay - there's no realistic way to
 eliminate it from D. So the focus should be making autodecoding the
 best it could ever be.
[...] If I ever had to write string-heavy code, I'd probably fork Phobos just so I can get decent performance. Just sayin'. T -- People walk. Computers run.
May 30 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2016 12:52 PM, H. S. Teoh via Digitalmars-d wrote:
 If I ever had to write string-heavy code, I'd probably fork Phobos just
 so I can get decent performance. Just sayin'.
When I wrote Warp, the only point of which was speed, I couldn't use phobos because of autodecoding. I have since recoded a number of phobos functions so they didn't autodecode, so the situation is better.
May 30 2016
parent reply Chris <wendlec tcd.ie> writes:
On Monday, 30 May 2016 at 21:39:00 UTC, Walter Bright wrote:
 On 5/30/2016 12:52 PM, H. S. Teoh via Digitalmars-d wrote:
 If I ever had to write string-heavy code, I'd probably fork 
 Phobos just
 so I can get decent performance. Just sayin'.
When I wrote Warp, the only point of which was speed, I couldn't use phobos because of autodecoding. I have since recoded a number of phobos functions so they didn't autodecode, so the situation is better.
Two questions:

1. Given your experience with Warp, how hard would it be to clean Phobos up?

2. After recoding a number of Phobos functions, how much code did actually break (yours or someone else's)?
May 31 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/31/2016 1:57 AM, Chris wrote:
 1. Given you experience with Warp, how hard would it be to clean Phobos up?
It's not hard, it's just a bit tedious.
 2. After recoding a number of Phobos functions, how much code did actually
break
 (yours or someone else's)?.
It's been a while so I don't remember exactly, but as I recall if the API had to change, I created a new overload or a new name, and left the old one as it is. For the std.path functions, I just changed them. While that technically changed the API, I'm not aware of any actual problems it caused. (Decoding file strings is a latent bug anyway, as pointed out elsewhere in this thread. It's a change that had to be made sooner or later.)
May 31 2016
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 30.05.2016 21:28, Andrei Alexandrescu wrote:
 On 05/30/2016 03:04 PM, Timon Gehr wrote:
 On 30.05.2016 18:01, Andrei Alexandrescu wrote:
 On 05/28/2016 03:04 PM, Walter Bright wrote:
 On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
 So it harkens back to the original mistake: strings should NOT be
 arrays with
 the respective primitives.
An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.
Nope. Not buying it.
I'm buying it. IMO alias string=immutable(char)[] is the most useful choice, and auto-decoding ideally wouldn't exist.
Wouldn't D then be seen (and rightfully so) as largely not supporting Unicode, seeing as its many many core generic algorithms seem to randomly work or not on arrays of characters?
In D, enum does not mean enumeration, const does not mean constant, pure is not pure, lazy is not lazy, and char does not mean character.
 Wouldn't ranges - the most
 important artifact of D's stdlib - default for strings on the least
 meaningful approach to strings (dumb code units)?
I don't see how that's the least meaningful approach. It's the data that you actually have sitting in memory. It's the data that you can slice and index and get a length for in constant time.
 Would a smattering of
 Unicode primitives in std.utf and friends entitle us to claim D had dyed
 Unicode in its wool? (All are not rhetorical.)
...
We should support Unicode by having all the required functionality and properly documenting the data formats used. What is the goal here? I.e. what does a language that has "Unicode dyed in its wool" have that other languages do not? Why isn't it enough to provide data types for UTF8/16/32 and Unicode algorithms operating on them?
 I.e. wouldn't be in a worse place than now? (This is rhetorical.) The
 best argument for autodecoding is to contemplate where we'd be without
 it: the ghetto of Unicode string handling.
 ...
Those questions seem to be mostly marketing concerns. I'm more concerned with whether I find it convenient to use. Autodecoding does not improve Unicode support.
 I'm not going to debate this further (though I'll look for meaningful
 answers to the questions above). But this thread has been informative in
 that it did little to change my conviction that autodecoding is a good
 thing for D, all things considered (i.e. the wrong decision to not
 encapsulate string as a separate type distinct from bare array of code
 units). I'd lie if I said it did nothing. It did, but only a little.

 Funny thing is that's not even what's important. What's important is
 that autodecoding is here to stay - there's no realistic way to
 eliminate it from D. So the focus should be making autodecoding the best
 it could ever be.


 Andrei
Sure, I didn't mean to engage in a debate (it seems there is no decision to be made here that might affect me in the future).
May 30 2016
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 05/30/2016 04:30 PM, Timon Gehr wrote:
 In D, enum does not mean enumeration, const does not mean constant, pure
 is not pure, lazy is not lazy, and char does not mean character.
My new favorite quote :)
May 30 2016
prev sibling next sibling parent Jack Stouffer <jack jackstouffer.com> writes:
On Saturday, 28 May 2016 at 12:04:20 UTC, Andrei Alexandrescu 
wrote:
 OK, that's a fair argument, thanks. So it seems there should be 
 no "default" way to iterate a string
Yes!
 So it harkens back to the original mistake: strings should NOT 
 be arrays with the respective primitives.
If you're proposing a library type, a la RCStr, as an alternative then yeah.
May 28 2016
prev sibling next sibling parent Dicebot <public dicebot.lv> writes:
On 05/28/2016 03:04 PM, Andrei Alexandrescu wrote:
 On 5/28/16 6:59 AM, Marc Schütz wrote:
 The fundamental problem is choosing one of those possibilities over the
 others without knowing what the user actually wants, which is what both
 BEFORE and AFTER do.
OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string.
Ideally there should not be a way to iterate a (unicode) string at all without explicitly stating the mode of operation, i.e.

struct String
{
    private void[] data;

    CodeUnitRange  byCodeUnit  ( );
    CodePointRange byCodePoint ( );
    GraphemeRange  byGrapheme  ( );
    bool normalize ( );
}

(byGrapheme and normalize have rather expensive dependencies so probably better to provide those via UFCS on demand)
May 29 2016
prev sibling parent reply Marc Schütz <schuetzm gmx.net> writes:
On Saturday, 28 May 2016 at 12:04:20 UTC, Andrei Alexandrescu 
wrote:
 On 5/28/16 6:59 AM, Marc Schütz wrote:
 The fundamental problem is choosing one of those possibilities 
 over the
 others without knowing what the user actually wants, which is 
 what both
 BEFORE and AFTER do.
OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string, and furthermore iterating for each constituent of a string should be fairly rare. Strings and substrings yes, but not individual points/units/graphemes unless expressly asked. (Indeed some languages treat strings as first-class entities and individual characters are mere short substrings.) So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.
I think this is going too far. It's sufficient if they (= char slices, not ranges) can't be iterated over directly, i.e. aren't input ranges (and maybe don't work with foreach). That would force the user to append .byCodeUnit etc. as needed. This provides a very nice deprecation path, by the way, it's just not clear whether it can be implemented with the way `deprecated` currently works. I.e. deprecate/warn every time auto decoding kicks in, print a nice message to the user, and later remove auto decoding and make isInputRange!string return false.
May 30 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 07:58 AM, Marc Schütz wrote:
 On Saturday, 28 May 2016 at 12:04:20 UTC, Andrei Alexandrescu wrote:
 On 5/28/16 6:59 AM, Marc Schütz wrote:
 The fundamental problem is choosing one of those possibilities over the
 others without knowing what the user actually wants, which is what both
 BEFORE and AFTER do.
OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string, and furthermore iterating for each constituent of a string should be fairly rare. Strings and substrings yes, but not individual points/units/graphemes unless expressly asked. (Indeed some languages treat strings as first-class entities and individual characters are mere short substrings.) So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.
I think this is going too far. It's sufficient if they (= char slices, not ranges) can't be iterated over directly, i.e. aren't input ranges (and maybe don't work with foreach).
That's... what I said. -- Andrei
May 30 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 30 May 2016 at 12:45:27 UTC, Andrei Alexandrescu wrote:
 That's... what I said. -- Andrei
You said "not arrays", he said "not ranges". So that just means making the std.range.primitives.popFront and front add a constraint if(!isSomeString()). Language built-ins still work, but the library rejects them. Indeed, we could add a deprecated overload then that points people to the other range getter methods (byCodeUnit, byCodePoint, byGrapheme, etc.)... this might be our migration path.
May 30 2016
parent reply Seb <seb wilzba.ch> writes:
On Monday, 30 May 2016 at 12:59:08 UTC, Adam D. Ruppe wrote:
 On Monday, 30 May 2016 at 12:45:27 UTC, Andrei Alexandrescu 
 wrote:
 That's... what I said. -- Andrei
You said "not arrays", he said "not ranges". So that just means making the std.range.primitives.popFront and front add a constraint if(!isSomeString()). Language built-ins still work, but the library rejects them. Indeed, we could add a deprecated overload then that points people to the other range getter methods (byCodeUnit, byCodePoint, byGrapheme, etc.)... this might be our migration path.
That's a great idea - the compiler should also issue deprecation warnings when I try to do things like:

string a = "你好";

a[1]; // deprecation: direct access to a Unicode string is highly error-prone. Please specify the type of access. More details (shortlink)

a[1] = 'b'; // deprecation: direct index assignment to a Unicode string is ...

a.length; // deprecation: a Unicode string has multiple definitions of length. Please specify your iteration (...). More details (shortlink)

...

Btw should a[] be an alias for `byCodeUnit` or also trigger a warning?
May 30 2016
next sibling parent reply ag0aep6g <anonymous example.com> writes:
On 05/30/2016 04:35 PM, Seb wrote:
 That's a great idea - the compiler should also issue deprecation
 warnings when I try to do things like:

 string a  = "你好";

 a[1]; // deprecation: direct access to a Unicode string is highly
 error-prone. Please specify the type of access. More details (shortlink)

 a[1] = "b"; // deprecation: direct index assignment to a Unicode string
 is ...

 a.length; // deprecation: a Unicode string has multiple definitions of
 length. Please specify your iteration (...). More details (shortlink)

 ...

 Btw should a[] be an alias for `byCodeUnit` or also trigger a warning?
All this is only sensible when we move to a dedicated string type that's not just an alias of `immutable(char)[]`. `immutable(char)[]` explicitly is an array of code units. It would not be acceptable, in my opinion, if the normal array syntax got broken for it.
May 30 2016
parent reply Marc Schütz <schuetzm gmx.net> writes:
On Monday, 30 May 2016 at 14:56:36 UTC, ag0aep6g wrote:
 All this is only sensible when we move to a dedicated string 
 type that's not just an alias of `immutable(char)[]`.

 `immutable(char)[]` explicitly is an array of code units. It 
 would not be acceptable, in my opinion, if the normal array 
 syntax got broken for it.
I agree; most of the troubles have been with auto-decoding. In an ideal world, we'd also want to change the way `length` and `opIndex` work, but if we only fix the range primitives, we've achieved almost as much with fewer compatibility problems.
May 30 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2016 8:34 AM, Marc Schütz wrote:
 In an ideal world, we'd also want to change the way `length` and `opIndex`
work,
Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.
May 30 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/30/16 5:51 PM, Walter Bright wrote:
 On 5/30/2016 8:34 AM, Marc Schütz wrote:
 In an ideal world, we'd also want to change the way `length` and
 `opIndex` work,
Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.
That's not an argument. Objects are arrays of bytes, or tuples of their fields, etc. The whole point of encapsulation is superimposing a more structured view on top of the representation. Operating on open-heart representation is risky, and strings are no exception. -- Andrei
May 30 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2016 9:16 PM, Andrei Alexandrescu wrote:
 On 5/30/16 5:51 PM, Walter Bright wrote:
 On 5/30/2016 8:34 AM, Marc Schütz wrote:
 In an ideal world, we'd also want to change the way `length` and
 `opIndex` work,
Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.
That's not an argument.
Consistency is a factual argument, and autodecode is not consistent.
 Objects are arrays of bytes, or tuples of their fields,
 etc. The whole point of encapsulation is superimposing a more structured view
on
 top of the representation. Operating on open-heart representation is risky, and
 strings are no exception.
If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it. Autodecoding is not it.
May 31 2016
next sibling parent deadalnix <deadalnix gmail.com> writes:
On Tuesday, 31 May 2016 at 07:56:54 UTC, Walter Bright wrote:
 On 5/30/2016 9:16 PM, Andrei Alexandrescu wrote:
 On 5/30/16 5:51 PM, Walter Bright wrote:
 On 5/30/2016 8:34 AM, Marc Schütz wrote:
 In an ideal world, we'd also want to change the way `length` 
 and
 `opIndex` work,
Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.
That's not an argument.
Consistency is a factual argument, and autodecode is not consistent.
+1
 Objects are arrays of bytes, or tuples of their fields,
 etc. The whole point of encapsulation is superimposing a more 
 structured view on
 top of the representation. Operating on open-heart 
 representation is risky, and
 strings are no exception.
If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it. Autodecoding is not it.
Thing is, more info is needed to support unicode properly. Collation for instance.
May 31 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/16 3:56 AM, Walter Bright wrote:
 On 5/30/2016 9:16 PM, Andrei Alexandrescu wrote:
 On 5/30/16 5:51 PM, Walter Bright wrote:
 On 5/30/2016 8:34 AM, Marc Schütz wrote:
 In an ideal world, we'd also want to change the way `length` and
 `opIndex` work,
Why? strings are arrays of code units. All the trouble comes from erratically pretending otherwise.
That's not an argument.
Consistency is a factual argument, and autodecode is not consistent.
Consistency with what? Consistent with what?
 Objects are arrays of bytes, or tuples of their fields,
 etc. The whole point of encapsulation is superimposing a more
 structured view on
 top of the representation. Operating on open-heart representation is
 risky, and
 strings are no exception.
If there is an abstraction for strings that is efficient, consistent, useful, and hides the fact that it is UTF, I am not aware of it.
It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges. -- Andrei
May 31 2016
next sibling parent deadalnix <deadalnix gmail.com> writes:
On Tuesday, 31 May 2016 at 15:07:09 UTC, Andrei Alexandrescu 
wrote:
 Consistency with what? Consistent with what?
It is a slice type. It should work as a slice type. Every other design stinks.
May 31 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 11:07:09 Andrei Alexandrescu via Digitalmars-d wrote:
 On 5/31/16 3:56 AM, Walter Bright wrote:
 If there is an abstraction for strings that is efficient, consistent,
 useful, and hides the fact that it is UTF, I am not aware of it.
It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges.
Not exactly. Such a string type does not hide the fact that it's UTF. Rather, it forces you to deal with the fact that it's UTF. I have to agree with Walter in that there really isn't a way to automatically handle Unicode correctly and efficiently while hiding the fact that it's doing all of the stuff that has to be done for UTF.

That being said, while an array of code units is really what a string should be underneath the hood, having a string type that provides byCodeUnit, byCodePoint, and byGrapheme is an improvement over treating immutable(char)[] as string, even if byCodeUnit returns immutable(char)[], because it forces the programmer to decide what they want to do rather than blindly operate on immutable(char)[] as if a char were a full character. And as long as it provides access to each level of Unicode, then it's possible for programmers who know what they're doing to efficiently operate on Unicode while simultaneously making it much more obvious to those who don't know what they're doing that they don't know what they're doing, rather than having them blindly act like char is a full character.

There's really no reason why we couldn't define a string type that operated that way while continuing to treat arrays of char the way that we do now in the language, though transitioning to such a scheme is not at all straightforward in terms of avoiding code breakage. Defining a String type would be simple enough, and any function in Phobos which accepted a string could be changed to accept a String, but we'd have problems with many functions which currently return string, since changing what they returned would break code. But even if Phobos were somehow completely changed over to use a new String type, and even if the string alias were deprecated/removed, we'd still have to deal with arrays of char, wchar, and dchar and run the risk of someone using those and having problems, because they didn't treat them as arrays of code units. We can't really prevent that, just make it so that string/String is something else that makes the Unicode issue obvious so that folks are less likely to blindly treat chars as full characters. But even then, it's not like it would be hard for folks to just use the wrong Unicode level. All we'd really be doing is shoving the issue in their face so that they'd have to acknowledge it on some level and maybe then actually learn enough to operate on Unicode strings correctly.

But then again, since all you're really doing at that point is shoving the Unicode issues in folks' faces by not treating strings as ranges or indexable and forcing them to call byCodeUnit, byCodePoint, byGrapheme, etc., I don't know that it actually solves much over treating immutable(char)[] as string. Programmers still have to learn Unicode enough to handle it correctly, just like they do now (whether we have autodecoding or not). And such a string type really doesn't make the Unicode handling any easier. It just makes it harder to ignore the Unicode issues.

The Unicode problem is a lot like the floating point problems that have been discussed recently. Programmers want it to "just work" without them having to worry about the details, but that really doesn't work, and while the average programmer may not understand either floating point operations or Unicode properly, the average programmer does actually have to work with both on a regular basis.
I'm not at all convinced that having string be an alias of immutable(char)[] was a mistake, but having a struct that's not a range may very well be an improvement. It _would_ at least make some of the Unicode issues more obvious, but it doesn't really solve much from what I can see. - Jonathan M Davis
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 12:45 PM, Jonathan M Davis via Digitalmars-d wrote:
 On Tuesday, May 31, 2016 11:07:09 Andrei Alexandrescu via Digitalmars-d wrote:
 On 5/31/16 3:56 AM, Walter Bright wrote:
 If there is an abstraction for strings that is efficient, consistent,
 useful, and hides the fact that it is UTF, I am not aware of it.
It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges.
Not exactly. Such a string type does not hide the fact that it's UTF. Rather, it forces you to deal with the fact that its UTF.
How is that different from what I said? -- Andrei
May 31 2016
parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 13:01:11 Andrei Alexandrescu via Digitalmars-d wrote:
 On 05/31/2016 12:45 PM, Jonathan M Davis via Digitalmars-d wrote:
 On Tuesday, May 31, 2016 11:07:09 Andrei Alexandrescu via Digitalmars-d 
wrote:
 On 5/31/16 3:56 AM, Walter Bright wrote:
 If there is an abstraction for strings that is efficient, consistent,
 useful, and hides the fact that it is UTF, I am not aware of it.
It's been mentioned several times: a string type that does not offer range primitives; instead it offers explicit primitives (such as byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges.
Not exactly. Such a string type does not hide the fact that it's UTF. Rather, it forces you to deal with the fact that its UTF.
How is that different from what I said? -- Andrei
My point was that Walter was stating that you can't have a type that hides the fact that it's dealing with Unicode while still being efficient, whereas you mentioned a proposal for a type that does not hide the fact that it's dealing with Unicode. So, you weren't really responding with a type that rebutted Walter's statement. Rather, you responded with a type that attempts to make its Unicode nature more explicit than immutable(char)[]. - Jonathan M Davis
May 31 2016
prev sibling parent reply Marc Schütz <schuetzm gmx.net> writes:
On Monday, 30 May 2016 at 21:51:36 UTC, Walter Bright wrote:
 On 5/30/2016 8:34 AM, Marc Schütz wrote:
 In an ideal world, we'd also want to change the way `length` 
 and `opIndex` work,
Why? strings are arrays of code units.
So, strings are _implemented_ as arrays of code units. But indiscriminately treating them as such in all situations leads to wrong results (just like arrays of code points would). In an ideal world, the programs someone intuitively writes will do the right thing, and if they can't, they at least refuse to compile. If we agree that it's up to the user whether to iterate over a string by code unit or code points or graphemes, and that we shouldn't arbitrarily choose one of those (except when we know that it's what the user wants), then the same applies to indexing, slicing and counting. On the other hand, changing such low-level things will likely be impractical, that's why I said "In an ideal world".
 All the trouble comes from erratically pretending otherwise.
For me, the trouble comes from pretending otherwise _without being told to_.

To make sure there are no misunderstandings, here is what is suggested as an alternative to the current situation:

* `char[]`, `wchar[]` (and `dchar[]`?) no longer pass `isInputRange`.
* Ranges with element type `char`, `wchar`, and `dchar` do pass `isInputRange`.
* A bunch of rangeifying helpers are added to `std.string` (I believe they are already there): `byCodePoint`, `byCodeUnit`, `byChar`, `byWchar`, `byDchar`, ...
* Algorithms like `find`, `join(er)` get overloads that accept char slices directly.
* Built-in operators and `length` of char slices are unchanged.

Advantages:

* Algorithms that can work _correctly_ without any kind of decoding will do so.
* Algorithms that would yield incorrect results won't compile, requiring the user to make a decision regarding the desired element type.
* No auto-decoding.
  => Best performance depending on the actual requirements.
  => No results that look correct when tested with only precomposed characters but are wrong in the general case.
* Behaviour of [] and .length is no worse than today.
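A small sketch of what user code could look like under such a scheme (assuming the byCodeUnit/byGrapheme helpers that already exist in std.utf/std.uni; illustrative only):

import std.range : walkLength;
import std.utf : byCodeUnit;
import std.uni : byGrapheme;

void main()
{
    string s = "año";

    // Generic algorithms require an explicitly chosen view of the string:
    assert(s.byCodeUnit.walkLength == 4); // UTF-8 code units
    assert(s.byGrapheme.walkLength == 3); // user-perceived characters

    // Built-in indexing, slicing and .length stay code-unit based:
    assert(s.length == 4);
    assert(s[0 .. 1] == "a");
}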
May 31 2016
next sibling parent reply Seb <seb wilzba.ch> writes:
On Tuesday, 31 May 2016 at 13:33:14 UTC, Marc Schütz wrote:
 On Monday, 30 May 2016 at 21:51:36 UTC, Walter Bright wrote:
 [...]
So, strings are _implemented_ as arrays of code units. But indiscriminately treating them as such in all situations leads to wrong results (just like arrays of code points would). [...]
If we follow Adam's proposal to deprecate front, back, popFront and popBack, we don't even need to touch the compiler and it's trivial to do so. The proof of concept change needs eight lines. https://github.com/dlang/phobos/pull/4384 Explicitly stating the type of iteration in the 132 places with auto-decoding in Phobos doesn't sound that terrible.
May 31 2016
next sibling parent ag0aep6g <anonymous example.com> writes:
On 05/31/2016 04:33 PM, Seb wrote:
 https://github.com/dlang/phobos/pull/4384

 Explicitly stating the type of iteration in the 132 places with
 auto-decoding in Phobos doesn't sound that terrible.
After checking some of those 132 places, they are in generic functions that take ranges. std.algorithm.equal, std.range.take - stuff like that. That's expected, of course, as the range primitives are used there. But those places are not the ones we'd have to fix. We'd have to fix the code that uses those generic functions on strings.
May 31 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/16 10:33 AM, Seb wrote:
 Explicitly stating the type of iteration in the 132 places with
 auto-decoding in Phobos doesn't sound that terrible.
It is terrible, no two ways about it. We've been very very careful with changes that caused a handful of breakages in Phobos. It really means every D project on the planet will be broken. We can't contemplate that, it's suicide. -- Andrei
May 31 2016
prev sibling parent Kagamin <spam here.lot> writes:
On Tuesday, 31 May 2016 at 13:33:14 UTC, Marc Schütz wrote:
 In an ideal world, the programs someone intuitively writes will 
 do the right thing, and if they can't, they at least refuse to 
 compile. If we agree that it's up to the user whether to 
 iterate over a string by code unit or code points or graphemes, 
 and that we shouldn't arbitrarily choose one of those (except 
 when we know that it's what the user wants), then the same 
 applies to indexing, slicing and counting.
If the user doesn't know how he wants to iterate and you leave the decision to the user... erm... it's not going to give correct result :)
May 31 2016
prev sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 30 May 2016 at 14:35:03 UTC, Seb wrote:
 That's a great idea - the compiler should also issue 
 deprecation warnings when I try to do things like:
I don't agree on changing those. Indexing and slicing a char[] is really useful and actually not hard to do correctly (at least with regard to handling code units). Besides, it'd be a much bigger change than the library transition.
May 30 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
 I don't agree on changing those. Indexing and slicing a char[] is really useful
 and actually not hard to do correctly (at least with regard to handling code
 units).
Yup. It isn't hard at all to use arrays of codeunits correctly.
May 30 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/30/16 6:00 PM, Walter Bright wrote:
 On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
 I don't agree on changing those. Indexing and slicing a char[] is
 really useful
 and actually not hard to do correctly (at least with regard to
 handling code
 units).
Yup. It isn't hard at all to use arrays of codeunits correctly.
Trouble is, it isn't hard at all to use arrays of codeunits incorrectly, too. -- Andrei
May 30 2016
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, May 31, 2016 at 12:13:57AM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 5/30/16 6:00 PM, Walter Bright wrote:
 On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
 I don't agree on changing those. Indexing and slicing a char[] is
 really useful and actually not hard to do correctly (at least with
 regard to handling code units).
Yup. It isn't hard at all to use arrays of codeunits correctly.
Trouble is, it isn't hard at all to use arrays of codeunits incorrectly, too. -- Andrei
Neither does autodecoding make code any more correct. It just better hides the fact that the code is wrong. T -- I've been around long enough to have seen an endless parade of magic new techniques du jour, most of which purport to remove the necessity of thought about your programming problem. In the end they wind up contributing one or two pieces to the collective wisdom, and fade away in the rearview mirror. -- Walter Bright
May 30 2016
parent reply default0 <Kevin.Labschek gmx.de> writes:
On Tuesday, 31 May 2016 at 06:45:56 UTC, H. S. Teoh wrote:
 On Tue, May 31, 2016 at 12:13:57AM -0400, Andrei Alexandrescu 
 via Digitalmars-d wrote:
 On 5/30/16 6:00 PM, Walter Bright wrote:
 On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
 I don't agree on changing those. Indexing and slicing a 
 char[] is really useful and actually not hard to do 
 correctly (at least with regard to handling code units).
Yup. It isn't hard at all to use arrays of codeunits correctly.
Trouble is, it isn't hard at all to use arrays of codeunits incorrectly, too. -- Andrei
Neither does autodecoding make code anymore correct. It just better hides the fact that the code is wrong. T
Thinking about this a bit more - what algorithms are actually correct when implemented on the level of code units? Off the top of my head I can only really think of copying and hashing, since you want to do that on the byte level anyways. I would also think that if you know your strings are normalized in the same normalization form (for example because they come from the same normalized source), you can check two strings for equality on the code unit level, but my understanding of unicode is still quite lacking, so I'm not sure on that.
May 31 2016
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
On Tue, 31 May 2016 07:17:03 +0000, default0 <Kevin.Labschek gmx.de> wrote:

 Thinking about this a bit more - what algorithms are actually 
 correct when implemented on the level of code units?
Calculating the buffer size of a string, validation and fast versions of general algorithms that can be defined in terms of ASCII, like skipAsciiWhitespace(), splitByComma(), splitByLineAscii().
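For illustration, a sketch of one such helper (skipAsciiWhitespace is just an illustrative name from the list above, not an existing Phobos function); it can operate purely on code units because every ASCII byte stands for itself in UTF-8 and never occurs inside a multi-byte sequence:

inout(char)[] skipAsciiWhitespace(inout(char)[] s)
{
    size_t i;
    while (i < s.length &&
           (s[i] == ' ' || s[i] == '\t' || s[i] == '\n' || s[i] == '\r'))
        ++i;
    return s[i .. $];
}

unittest
{
    assert(skipAsciiWhitespace("  \tпривет") == "привет");
}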
 I would also think that if you know your strings are normalized 
 in the same normalization form (for example because they come 
 from the same normalized source), you can check two strings for 
 equality on the code unit level, but my understanding of unicode 
 is still quite lacking, so I'm not sure on that.
That's correct. -- Marco
May 31 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 07:17:03 default0 via Digitalmars-d wrote:
 Thinking about this a bit more - what algorithms are actually
 correct when implemented on the level of code units?
 Off the top of my head I can only really think of copying and
 hashing, since you want to do that on the byte level anyways.
 I would also think that if you know your strings are normalized
 in the same normalization form (for example because they come
 from the same normalized source), you can check two strings for
 equality on the code unit level, but my understanding of unicode
 is still quite lacking, so I'm not sure on that.
Equality does not require decoding. Similarly, functions like find don't either. Something like filter generally would, but it's also not particularly normal to filter a string on a by-character basis. You'd probably want to get to at least the word level in that case.

To make matters worse, functions like find or splitter are frequently used to look for ASCII delimiters, even when the strings themselves contain Unicode characters. So, even if decoding were necessary when looking for a Unicode character, it's utterly wasteful when the character you're looking for is ASCII. But searching generally does not require decoding so long as the same character is always encoded the same way.

So, Unicode normalization _can_ be a problem, but that's a problem with code points as well as code units (since the normalization has to do with the order of code points when multiple code points make up a single grapheme). You'd have to go to the grapheme level to avoid that problem. And that's why, at least some of the time, string-processing code is going to need to normalize its strings before doing searches. But the searches themselves can then operate at the code unit level.

- Jonathan M Davis
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 12:54 PM, Jonathan M Davis via Digitalmars-d wrote:
 Equality does not require decoding. Similarly, functions like find don't
 either. Something like filter generally would, but it's also not
 particularly normal to filter a string on a by-character basis. You'd
 probably want to get to at least the word level in that case.
It's nice that the stdlib takes care of that.
 To make matters worse, functions like find or splitter are frequently used
 to look for ASCII delimiters, even when the strings themselves contain
 Unicode characters. So, even if decoding were necessary when looking for a
 Unicode character, it's utterly wasteful when the character you're looking
 for is ASCII.
Good idea. We could overload functions such as find on char, wchar, and dchar. Jonathan, could you look into a PR to do that?
 But searching generally does not require decoding so long as
 the same character is always encoded the same way.
Yah, a good rule of thumb is to get the same (consistent, heh) results for a given string (including a given normalization) regardless of the encoding used. So e.g. it's nice that walkLength yields the same number for the string whether it's UTF8/16/32.

Andrei
May 31 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
On Tue, 31 May 2016 13:06:16 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 On 05/31/2016 12:54 PM, Jonathan M Davis via Digitalmars-d wrote:
 Equality does not require decoding. Similarly, functions like find don't
 either. Something like filter generally would, but it's also not
 particularly normal to filter a string on a by-character basis. You'd
probably want to get to at least the word level in that case.
It's nice that the stdlib takes care of that.
Both "equality" and "find" require byGrapheme.

⇰ The equivalence algorithm first brings both strings to a common normalization form (NFD or NFC), which works on one grapheme cluster at a time and afterwards does the binary comparison.
http://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence

⇰ Find would yield false positives for the start of grapheme clusters. I.e. will match 'o' in an NFD "ö" (simplified example).
http://www.unicode.org/reports/tr10/#Searching

--
Marco
May 31 2016
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 31-May-2016 01:00, Walter Bright wrote:
 On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
 I don't agree on changing those. Indexing and slicing a char[] is
 really useful
 and actually not hard to do correctly (at least with regard to
 handling code
 units).
Yup. It isn't hard at all to use arrays of codeunits correctly.
Ehm as long as all you care for is operating on substrings I'd say. Working with individual characters requires either decoding or clever tricks like operating on encoded UTF directly. -- Dmitry Olshansky
May 31 2016
next sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 22:47:56 Dmitry Olshansky via Digitalmars-d wrote:
 On 31-May-2016 01:00, Walter Bright wrote:
 On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
 I don't agree on changing those. Indexing and slicing a char[] is
 really useful
 and actually not hard to do correctly (at least with regard to
 handling code
 units).
Yup. It isn't hard at all to use arrays of codeunits correctly.
Ehm as long as all you care for is operating on substrings I'd say. Working with individual characters requires either decoding or clever tricks like operating on encoded UTF directly.
Yeah, but Phobos provides the tools to do that reasonably easily even when autodecoding isn't involved. Sure, it's slightly more tedious to call std.utf.decode or std.utf.encode yourself rather than letting autodecoding take care of it, but it's easy enough to do and allows you to control when it's done. And we have stuff like byChar!dchar or byGrapheme for the cases where you don't want to actually operate on arrays of code units. - Jonathan M Davis
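A minimal sketch of doing that decoding by hand with std.utf.decode (the counting loop is only for illustration):

import std.utf : decode;

void main()
{
    string s = "aé漢";   // 1 + 2 + 3 UTF-8 code units, 3 code points
    size_t i, count;
    while (i < s.length)
    {
        dchar c = decode(s, i);  // decodes one code point, advances i past its code units
        ++count;
    }
    assert(count == 3);
}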
May 31 2016
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, May 31, 2016 at 10:47:56PM +0300, Dmitry Olshansky via Digitalmars-d
wrote:
 On 31-May-2016 01:00, Walter Bright wrote:
 On 5/30/2016 11:25 AM, Adam D. Ruppe wrote:
 I don't agree on changing those. Indexing and slicing a char[] is
 really useful and actually not hard to do correctly (at least with
 regard to handling code units).
Yup. It isn't hard at all to use arrays of codeunits correctly.
Ehm as long as all you care for is operating on substrings I'd say. Working with individual characters requires either decoding or clever tricks like operating on encoded UTF directly.
[...] Working on individual characters needs byGrapheme, unless you know beforehand that the character(s) you're working with are ASCII, or fit in a single code unit.

About "clever tricks", it's not really that hard. I was thinking that things like s.canFind('Ш') should translate the 'Ш' into a UTF-8 byte sequence, and then do a substring search directly on the encoded string. This way, a large number of single-character algorithms don't even need to decode. The way UTF-8 is designed guarantees that there will not be any false positives. This will eliminate a lot of the current overhead of autodecoding.

T

--
Klein bottle for rent ... inquire within. -- Stephen Mulraney
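A minimal sketch of that idea (the helper name canFindNoDecode is made up, not a Phobos function): encode the needle once, then search the haystack's raw bytes.

import std.algorithm.searching : canFind;
import std.string : representation;
import std.utf : encode;

// Made-up helper: look for a single dchar without decoding the haystack.
bool canFindNoDecode(string haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle);   // needle as UTF-8 code units
    return haystack.representation.canFind(buf[0 .. len].representation);
}

void main()
{
    assert(canFindNoDecode("Привет, Шура!", 'Ш'));
    assert(!canFindNoDecode("hello", 'Ш'));
}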
May 31 2016
prev sibling next sibling parent reply Chris <wendlec tcd.ie> writes:
On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu 
wrote:
[snip]
 I would agree only with the amendment "...if used naively", 
 which is important. Knowledge of how autodecoding works is a 
 prerequisite for writing fast string code in D. Also, little 
 code should deal with one code unit or code point at a time; 
 instead, it should use standard library algorithms for 
 searching, matching etc. When needed, iterating every code unit 
 is trivially done through indexing.
I disagree. "if used naively" shouldn't be the default. A user (naively) expects string algorithms to work as efficiently as possible without overheads. To tell the user later that s/he shouldn't _naively_ have used a certain algorithm provided by the library is a bit cynical. Having to redesign a code base because of hidden behavior is a big turn off, having to go through Phobos to determine where the hidden pitfalls are is not the user's job.
 Also allow me to point that much of the slowdown can be 
 addressed tactically. The test c < 0x80 is highly predictable 
 (in ASCII-heavy text) and therefore easily speculated. We can 
 and we should arrange code to minimize impact.
And what if you deal with non-ASCII heavy text? Does the user have to guess and micro-optimize for simple use cases?
 5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug):

s.find("abc")
s.findSplit("abc")
s.findSplit('a')
s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However the following do require autodecoding:

s.walkLength
s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
s.count!(c => c >= 32) // non-control characters

Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.
But how is the user supposed to know without being a core contributor to Phobos? If using a library method that works well in one case can slow down your code in a slightly different case, something is wrong with the language/library design. For simple cases the burden shouldn't be on the user, or, if it is, s/he should be informed about it in order to be able to make well-informed decisions. Personally I wouldn't mind having to decide in each case what I want (provided I have a best practices cheat sheet :)), so I can get the best out of it. But to keep guessing, testing and benchmarking each string handling library function is not good at all. [snip]
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 7:19 AM, Chris wrote:
 On Thursday, 26 May 2016 at 16:00:54 UTC, Andrei Alexandrescu wrote:
 [snip]
 I would agree only with the amendment "...if used naively", which is
 important. Knowledge of how autodecoding works is a prerequisite for
 writing fast string code in D. Also, little code should deal with one
 code unit or code point at a time; instead, it should use standard
 library algorithms for searching, matching etc. When needed, iterating
 every code unit is trivially done through indexing.
I disagree.
Misunderstanding.
 "if used naively" shouldn't be the default. A user (naively)
 expects string algorithms to work as efficiently as possible without
 overheads.
That's what happens with autodecoding.
 Also allow me to point that much of the slowdown can be addressed
 tactically. The test c < 0x80 is highly predictable (in ASCII-heavy
 text) and therefore easily speculated. We can and we should arrange
 code to minimize impact.
And what if you deal with non-ASCII heavy text? Does the user have to guess and micro-optimize for simple use cases?
Misunderstanding.
 5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug):

s.find("abc")
s.findSplit("abc")
s.findSplit('a')
s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However the following do require autodecoding:

s.walkLength
s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
s.count!(c => c >= 32) // non-control characters

Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.
But how is the user supposed to know without being a core contributor to Phobos?
Misunderstanding. All examples work properly today because of autodecoding. -- Andrei
May 27 2016
parent reply ag0aep6g <anonymous example.com> writes:
On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:
 However the following do require autodecoding:

 s.walkLength
 s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
 s.count!(c => c >= 32) // non-control characters

 Currently the standard library operates at code point level even
 though inside it may choose to use code units when admissible. Leaving
 such a decision to the library seems like a wise thing to do.
But how is the user supposed to know without being a core contributor to Phobos?
Misunderstanding. All examples work properly today because of autodecoding. -- Andrei
They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2
May 27 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Friday, 27 May 2016 at 13:47:32 UTC, ag0aep6g wrote:
 Misunderstanding. All examples work properly today because of
 autodecoding. -- Andrei
They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2
I agree. It has happened to me that characters like "é" return length == 2, which has been the cause of some bugs in my code. I'm wiser now, of course, but you wouldn't expect this if you write

if (input.length == 1)
  speakCharacter(input);  // e.g. when spelling a word
else
  processInput(input);

The worst thing is that you never know what's going on under the hood and where autodecode slows you down, unbeknownst to yourself.
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 10:15 AM, Chris wrote:
 It has happened to me that characters like "é" return length == 2
Would normalization make length 1? -- Andrei
May 27 2016
next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Friday, 27 May 2016 at 18:11:22 UTC, Andrei Alexandrescu wrote:
 Would normalization make length 1? -- Andrei
In some, but not all cases.
May 27 2016
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 27-May-2016 21:11, Andrei Alexandrescu wrote:
 On 5/27/16 10:15 AM, Chris wrote:
 It has happened to me that characters like "é" return length == 2
Would normalization make length 1? -- Andrei
No, this is not the point of normalization. -- Dmitry Olshansky
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
 On 27-May-2016 21:11, Andrei Alexandrescu wrote:
 On 5/27/16 10:15 AM, Chris wrote:
 It has happened to me that characters like "é" return length == 2
Would normalization make length 1? -- Andrei
No, this is not the point of normalization.
What is? -- Andrei
May 27 2016
next sibling parent Minas Mina <minas_0 hotmail.co.uk> writes:
On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:
 On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
 On 27-May-2016 21:11, Andrei Alexandrescu wrote:
 On 5/27/16 10:15 AM, Chris wrote:
 It has happened to me that characters like "é" return length 
 == 2
Would normalization make length 1? -- Andrei
No, this is not the point of normalization.
What is? -- Andrei
This video will be helpful :) https://www.youtube.com/watch?v=n0GK-9f4dl8 It talks about Unicode in C++, but also explains how Unicode works.
May 27 2016
prev sibling next sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:
 On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
 No, this is not the point of normalization.
What is? -- Andrei
1) A grapheme may include several combining characters (such as diacritics) whose order is not supposed to be semantically significant. Normalization sorts them in a standardized way so that string comparisons return the expected result for graphemes which differ only by the internal order of their constituent combining code points.

2) Some graphemes (like accented latin letters) can be represented by a single code point OR a letter followed by a combining diacritic. Normalization either splits them all apart (NFD), or combines them whenever possible (NFC). Again, this is primarily intended to make things like string comparisons work as expected, and perhaps to simplify low-level tasks like graphical rendering of text.

(Disclaimer: This is an oversimplification, because nothing about Unicode is ever simple.)
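A small sketch of both points, using std.uni.normalize on dstring literals (as elsewhere in this thread):

import std.uni : normalize, NFC, NFD;

void main()
{
    dstring composed   = "\u00E9";    // é as a single precomposed code point
    dstring decomposed = "e\u0301";   // e followed by a combining acute accent

    assert(composed != decomposed);                              // raw code points differ
    assert(composed.normalize!NFD == decomposed.normalize!NFD);  // equal once normalized
    assert(composed.normalize!NFC == decomposed.normalize!NFC);
    assert(composed.normalize!NFD.length == 2);                  // NFD splits them apart
    assert(decomposed.normalize!NFC.length == 1);                // NFC combines them
}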
May 27 2016
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 28-May-2016 01:04, tsbockman wrote:
 On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:
 On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
 No, this is not the point of normalization.
What is? -- Andrei
1) A grapheme may include several combining characters (such as diacritics) whose order is not supposed to be semantically significant. Normalization sorts them in a standardized way so that string comparisons return the expected result for graphemes which differ only by the internal order of their constituent combining code points.

2) Some graphemes (like accented latin letters) can be represented by a single code point OR a letter followed by a combining diacritic. Normalization either splits them all apart (NFD), or combines them whenever possible (NFC). Again, this is primarily intended to make things like string comparisons work as expected, and perhaps to simplify low-level tasks like graphical rendering of text.
Quite accurate statement of the goals. Normalization is all about having canonical order of combining code points.
 (Disclaimer: This is an oversimplification, because nothing about
 Unicode is ever simple.)
-- Dmitry Olshansky
May 28 2016
prev sibling parent reply Minas Mina <minas_0 hotmail.co.uk> writes:
On Friday, 27 May 2016 at 20:42:13 UTC, Andrei Alexandrescu wrote:
 On 05/27/2016 03:39 PM, Dmitry Olshansky wrote:
 On 27-May-2016 21:11, Andrei Alexandrescu wrote:
 On 5/27/16 10:15 AM, Chris wrote:
 It has happened to me that characters like "é" return length 
 == 2
Would normalization make length 1? -- Andrei
No, this is not the point of normalization.
What is? -- Andrei
Here is an example about normalization.

In Unicode, the grapheme Ä is composed of two code points: A (the ASCII A) and the ¨ character. However, one of the goals of Unicode was to be backwards compatible with earlier encodings that extended ASCII (codepages). In some codepages, Ä was an actual code point. So in some cases you would have the Unicode one, which is two code points, and the one from some codepages, which would be one. Those should be the same though, i.e. compare the same. In order to do that, there is normalization. What it does is to _expand_ the single code point Ä into A + ¨
May 27 2016
parent reply David Nadlinger <code klickverbot.at> writes:
On Friday, 27 May 2016 at 22:12:57 UTC, Minas Mina wrote:
 Those should be the same though, i.e compare the same. In order 
 to do that, there is normalization. What is does is to _expand_ 
 the single codepoint Ä into A + ¨
Unless I'm mistaken, this depends on the form used. For example, in NFKC you'd get the single codepoint Ä. — David
May 27 2016
parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, May 27, 2016 23:16:58 David Nadlinger via Digitalmars-d wrote:
 On Friday, 27 May 2016 at 22:12:57 UTC, Minas Mina wrote:
 Those should be the same though, i.e compare the same. In order
 to do that, there is normalization. What is does is to _expand_
the single codepoint Ä into A + ¨
Unless I'm mistaken, this depends on the form used. For example, in NFKC you'd get the single codepoint Ä.
Yeah. For better or worse, there are different normalization schemes for Unicode. A normalization scheme makes the encodings consistent, but that doesn't mean that each of the different normalization schemes does the same thing, just that if you apply the same normalization scheme to two strings, then all graphemes within those strings will be encoded identically. - Jonathan M Davis
May 31 2016
prev sibling parent Chris <wendlec tcd.ie> writes:
On Friday, 27 May 2016 at 18:11:22 UTC, Andrei Alexandrescu wrote:
 On 5/27/16 10:15 AM, Chris wrote:
 It has happened to me that characters like "é" return length 
 == 2
Would normalization make length 1? -- Andrei
No, I've tried it. I think dchar[] returns one, or you check by grapheme.
May 28 2016
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, May 27, 2016 at 03:47:32PM +0200, ag0aep6g via Digitalmars-d wrote:
 On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:
 However the following do require autodecoding:
 
 s.walkLength
 s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
 s.count!(c => c >= 32) // non-control characters
 
 Currently the standard library operates at code point level even
 though inside it may choose to use code units when admissible.
 Leaving such a decision to the library seems like a wise thing
 to do.
But how is the user supposed to know without being a core contributor to Phobos?
Misunderstanding. All examples work properly today because of autodecoding. -- Andrei
They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2
Exactly. And we just keep getting stuck on this point. It seems that the message just isn't getting through. The unfounded assumption continues to be made that iterating by code point is somehow "correct" by definition and nobody can challenge it.

String handling, especially in the standard library, ought to be (1) efficient where possible, and (2) be as correct as possible (meaning, most corresponding to user expectations -- principle of least surprise). If we can't have both, we should at least have one, right? However, the way autodecoding is currently implemented, we have neither.

Firstly, it is beyond clear that autodecoding adds a significant amount of overhead, and because it's automatic, it applies to ALL string processing in D. The only way around it is to fight against the standard library and use workarounds to bypass all that meticulously-crafted autodecoding code, begging the question of why we're even spending the effort on said code in the first place.

Secondly, it violates the principle of least surprise when the user, given a string of, say, Korean text, discovers that s.count() *doesn't* return the correct answer. Oh, it's "correct", all right, if your definition of correct is "number of Unicode code points", but to a Korean user, such an answer is completely meaningless because it has little correspondence with what he would perceive as the number of "characters" in the string. It might as well be a random number and it would be just as meaningful. It is just as wrong as s.count() returning the number of code units, except that in the current Euro-centric D community the wrong instances are less often encountered and so are often overlooked. But that doesn't change the fact that code that assumes s.count() returns anything remotely meaningful to the user is buggy. Autodecoding into code points only serves to hide the bugs.

As has been said before already countless times, autodecoding, as currently implemented, is neither "correct" nor efficient. Iterating by code point is much faster, but more prone to user mistakes; whereas iterating by grapheme more often corresponds with user expectations but performs quite poorly. The current implementation of autodecoding represents the worst of both worlds: it is both inefficient *and* prone to user mistakes, and worse yet, it serves to conceal such user mistakes by giving the false sense of security that because we're iterating by code points we're somehow magically "correct" by definition.

The fact of the matter is that if you're going to write Unicode string processing code, you're gonna hafta know the dirty nitty gritty of Unicode strings, including the fine distinctions between code units, code points, grapheme clusters, etc.. Since this is required knowledge anyway, why not just let the user worry about how to iterate over the string? Let the user choose what best suits his application, whether it's working directly with code units for speed, or iterating over grapheme clusters for correctness (in terms of visual "characters"), instead of choosing the pessimal middle ground that's neither efficient nor correct?

T

--
Do not reason with the unreasonable; you lose by definition.
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 12:40 PM, H. S. Teoh via Digitalmars-d wrote:
 Exactly. And we just keep getting stuck on this point. It seems that the
 message just isn't getting through. The unfounded assumption continues
 to be made that iterating by code point is somehow "correct" by
 definition and nobody can challenge it.
Which languages are covered by code points, and which languages require graphemes consisting of multiple code points? How does normalization play into this? -- Andrei
May 27 2016
next sibling parent reply ag0aep6g <anonymous example.com> writes:
On 05/27/2016 08:42 PM, Andrei Alexandrescu wrote:
 Which languages are covered by code points, and which languages require
 graphemes consisting of multiple code points? How does normalization
 play into this? -- Andrei
I don't think there is value in distinguishing by language. The point of Unicode is that you shouldn't need to do that. I think there are scripts that use combining characters extensively, but Unicode also has stuff like combining arrows. Those can make sense in an otherwise plain English text. For example: 'a' + U+20D7 = a⃗. There is no combined character for that, so normalization can't do anything here.
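In code, a small sketch of that (using a dstring so lengths are in code points):

import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    // 'a' + U+20D7 (combining right arrow above): no precomposed code point exists.
    dstring s = "a\u20D7";

    assert(s.normalize!NFC.length == 2);   // still two code points after NFC
    assert(s.byGrapheme.walkLength == 1);  // but a single grapheme
}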
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 3:10 PM, ag0aep6g wrote:
 I don't think there is value in distinguishing by language. The point of
 Unicode is that you shouldn't need to do that.
It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
May 27 2016
next sibling parent ag0aep6g <anonymous example.com> writes:
On 05/27/2016 09:30 PM, Andrei Alexandrescu wrote:
 It seems code points are kind of useless because they don't really mean
 anything, would that be accurate? -- Andrei
I think so, yeah. Due to combining characters, code points are similar to code units: a Unicode thing that you need to know about of when working below the human-perceived character (grapheme) level.
May 27 2016
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, May 27, 2016 at 03:30:53PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 5/27/16 3:10 PM, ag0aep6g wrote:
 I don't think there is value in distinguishing by language. The
 point of Unicode is that you shouldn't need to do that.
It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
That's what we've been trying to say all along! :-P They're a kind of low-level Unicode construct used for building "real" characters, i.e., what a layperson would consider to be a "character". T -- English is useful because it is a mess. Since English is a mess, it maps well onto the problem space, which is also a mess, which we call reality. Similarly, Perl was designed to be a mess, though in the nicest of all possible ways. -- Larry Wall
May 27 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/27/2016 03:43 PM, H. S. Teoh via Digitalmars-d wrote:
 That's what we've been trying to say all along!
If that's the case things are pretty dire, autodecoding or not. -- Andrei
May 27 2016
next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, May 27, 2016 at 04:41:09PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 05/27/2016 03:43 PM, H. S. Teoh via Digitalmars-d wrote:
 That's what we've been trying to say all along!
If that's the case things are pretty dire, autodecoding or not. -- Andrei
Like it or not, Unicode ain't merely some glorified form of C's ASCII char arrays. It's about time we faced the reality and dealt with it accordingly. Trying to sweep the complexities of Unicode under the rug is not doing us any good. T -- The fact that anyone still uses AOL shows that even the presence of options doesn't stop some people from picking the pessimal one. - Mike Ellis
May 27 2016
prev sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, May 27, 2016 16:41:09 Andrei Alexandrescu via Digitalmars-d wrote:
 On 05/27/2016 03:43 PM, H. S. Teoh via Digitalmars-d wrote:
 That's what we've been trying to say all along!
If that's the case things are pretty dire, autodecoding or not. -- Andrei
True enough. Correctly handling Unicode in the general case is ridiculously hard - especially if you want to be efficient. We could do everything at the grapheme level to get the correctness, but we'd be so slow that it would be ridiculous. Fortunately, many string algorithms really don't need to care much about Unicode so long as the strings involved are normalized. For instance, a function like find can usually compare code units without decoding anything (though even then, depending on the normalization, you run the risk of finding a part of a character if it involves combining code points - e.g. searching for e could give you the first part of é if it's encoded with the e followed by the accent).

But ultimately, fully correct string handling requires having a far better understanding of Unicode than most programmers have. Even the percentage of programmers here that have that level of understanding isn't all that great - though the fact that D supports UTF-8, UTF-16, and UTF-32 the way that it does has led a number of us to dig further into Unicode and learn it better in ways that we probably wouldn't have if all it had was char. It highlights that there is something that needs to be learned to get this right in a way that most languages don't.

- Jonathan M Davis
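A small sketch of that pitfall, assuming a decomposed é in the input:

import std.algorithm.searching : canFind;

void main()
{
    // "café" written with a decomposed é (e + combining acute accent).
    string s = "cafe\u0301";

    // A code-point-level search happily "finds" an e, even though the
    // visible text ends in é, not e.
    assert(s.canFind('e'));
}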
May 31 2016
prev sibling parent reply Tobias M <troplin bluewin.ch> writes:
On Friday, 27 May 2016 at 19:43:16 UTC, H. S. Teoh wrote:
 On Fri, May 27, 2016 at 03:30:53PM -0400, Andrei Alexandrescu 
 via Digitalmars-d wrote:
 On 5/27/16 3:10 PM, ag0aep6g wrote:
 I don't think there is value in distinguishing by language. 
 The point of Unicode is that you shouldn't need to do that.
It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
That's what we've been trying to say all along! :-P They're a kind of low-level Unicode construct used for building "real" characters, i.e., what a layperson would consider to be a "character".
Code points are *the fundamental unit* of Unicode. AFAIK most (all?) algorithms in the Unicode spec are defined in terms of code points. Sure, some algorithms also work on the code unit level. That can be used as an optimization, but they are still defined on code points. Code points are also abstracting over the different representations (UTF-...), providing a uniform "interface".
May 29 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/29/2016 09:42 AM, Tobias M wrote:
 On Friday, 27 May 2016 at 19:43:16 UTC, H. S. Teoh wrote:
 On Fri, May 27, 2016 at 03:30:53PM -0400, Andrei Alexandrescu via
 Digitalmars-d wrote:
 On 5/27/16 3:10 PM, ag0aep6g wrote:
 I don't think there is value in distinguishing by language. > The
point of Unicode is that you shouldn't need to do that. It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
That's what we've been trying to say all along! :-P They're a kind of low-level Unicode construct used for building "real" characters, i.e., what a layperson would consider to be a "character".
Code points are *the fundamental unit* of unicode. AFAIK most (all?) algorithms in the unicode spec are defined in terms of code points. Sure, some algorithms also work on the code unit level. That can be used as an optimization, but they are still defined on code points. Code points are also abstracting over the different representations (UTF-...), providing a uniform "interface".
So now code points are good? -- Andrei
May 29 2016
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sun, May 29, 2016 at 03:55:22PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 05/29/2016 09:42 AM, Tobias M wrote:
 On Friday, 27 May 2016 at 19:43:16 UTC, H. S. Teoh wrote:
 On Fri, May 27, 2016 at 03:30:53PM -0400, Andrei Alexandrescu via
 Digitalmars-d wrote:
 On 5/27/16 3:10 PM, ag0aep6g wrote:
 I don't think there is value in distinguishing by language.
 The point of Unicode is that you shouldn't need to do that.
It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
That's what we've been trying to say all along! :-P They're a kind of low-level Unicode construct used for building "real" characters, i.e., what a layperson would consider to be a "character".
Code points are *the fundamental unit* of unicode. AFAIK most (all?) algorithms in the unicode spec are defined in terms of code points. Sure, some algorithms also work on the code unit level. That can be used as an optimization, but they are still defined on code points. Code points are also abstracting over the different representations (UTF-...), providing a uniform "interface".
So now code points are good? -- Andrei
It depends on what you're trying to accomplish. That's the point we're trying to get at. For some operations, working with code points makes the most sense. But for other operations, it does not. There is no one representation that is best for all situations; it needs to be decided on a case-by-case basis. Which is why forcing everything to decode to code points eventually leads to problems. T -- Customer support: the art of getting your clients to pay for your own incompetence.
May 29 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/29/2016 04:47 PM, H. S. Teoh via Digitalmars-d wrote:
 It depends on what you're trying to accomplish. That's the point we're
 trying to get at.  For some operations, working with code points makes
 the most sense. But for other operations, it does not.  There is no one
 representation that is best for all situations; it needs to be decided
 on a case-by-case basis.  Which is why forcing everything to decode to
 code points eventually leads to problems.
I see. Again this all to me sounds like "naked arrays of characters are the wrong choice and should have been encapsulated in a dedicated string type". -- Andrei
May 30 2016
prev sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Sunday, May 29, 2016 13:47:32 H. S. Teoh via Digitalmars-d wrote:
 On Sun, May 29, 2016 at 03:55:22PM -0400, Andrei Alexandrescu via 
Digitalmars-d wrote:
 So now code points are good? -- Andrei
It depends on what you're trying to accomplish. That's the point we're trying to get at. For some operations, working with code points makes the most sense. But for other operations, it does not. There is no one representation that is best for all situations; it needs to be decided on a case-by-case basis. Which is why forcing everything to decode to code points eventually leads to problems.
Exactly. And even a given function can't necessarily always be defined to use a specific level of Unicode, because whether that's correct or not depends on what the programmer is actually trying to do with the function. And then there are cases where the programmer knows enough about the data that they're dealing with that they're able to operate at a different level of Unicode than would normally be correct. The most obvious example of that is when you know that your strings are pure ASCII, but it's not the only case. We should strive to make Phobos operate correctly on strings by default where we can, but there are cases where the programmer needs to know enough to specify the behavior that they want, and deciding for them is just going to lead to behavior that happens to be right some of the time while making it hard for code using Phobos to have the correct behavior the rest of the time. And the default behavior that we currently have is inefficient to boot. - Jonathan M Davis
May 31 2016
prev sibling next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Friday, 27 May 2016 at 19:30:53 UTC, Andrei Alexandrescu wrote:
 It seems code points are kind of useless because they don't 
 really mean anything, would that be accurate? -- Andrei
It might help to think of code points as being a kind of byte code for a text-representing VM. It's not meaningless, but it also isn't trivial, and relevant metrics can only be seen in application.

BTW you don't even have to get into Unicode to hit complications. Tab, backspace, carriage return: these are part of ASCII but already complicate questions. http://stackoverflow.com/questions/6792812/the-backspace-escape-character-b-in-c-unexpected-behavior came up on a quick search. Does the backspace character reduce the length of a string? In some contexts, maybe.
May 27 2016
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, May 27, 2016 at 07:53:30PM +0000, Adam D. Ruppe via Digitalmars-d wrote:
 On Friday, 27 May 2016 at 19:30:53 UTC, Andrei Alexandrescu wrote:
 It seems code points are kind of useless because they don't really
 mean anything, would that be accurate? -- Andrei
It might help to think of code points as being a kind of byte code for a text-representing VM. It's not meaningless, but it also isn't trivial and relevant metrics can only be seen in application. BTW you don't even have to get into unicode to hit complications. Tab, backspace, carriage return, these are part of ASCII but already complicate questions. http://stackoverflow.com/questions/6792812/the-backspace-escape-character-b-in-c-unexpected-behavior came up on a quick search. Does the backspace character reduce the length of a string? In some contexts, maybe.
Fun fact: on some old Unix boxen, Backspace + underscore was interpreted to mean "underline the previous character". Probably inherited from the old typewriter days. Scarily enough, some Posix terminals may still interpret this sequence this way! An early precursor of Unicode combining diacritics, perhaps? :-D T -- Everybody talks about it, but nobody does anything about it! -- Mark Twain
May 27 2016
prev sibling parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/27/16 3:30 PM, Andrei Alexandrescu wrote:
 On 5/27/16 3:10 PM, ag0aep6g wrote:
 I don't think there is value in distinguishing by language. The point of
 Unicode is that you shouldn't need to do that.
It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
The only unmistakably correct use I can think of is transcoding from one UTF representation to another. That is, in order to transcode from UTF8 to UTF16, I don't need to know anything about character composition. -Steve
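For example, a minimal sketch of that using std.utf's toUTF16/toUTF8:

import std.utf : toUTF16, toUTF8;

void main()
{
    string  s8  = "κόσμε";      // UTF-8
    wstring s16 = s8.toUTF16;   // pure code-unit transcoding, no notion of graphemes
    assert(s16.toUTF8 == s8);   // round-trips losslessly
}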
May 27 2016
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, May 27, 2016 at 02:42:27PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 5/27/16 12:40 PM, H. S. Teoh via Digitalmars-d wrote:
 Exactly. And we just keep getting stuck on this point. It seems that
 the message just isn't getting through. The unfounded assumption
 continues to be made that iterating by code point is somehow
 "correct" by definition and nobody can challenge it.
Which languages are covered by code points, and which languages require graphemes consisting of multiple code points? How does normalization play into this? -- Andrei
This is a complicated issue; for a full explanation you'll probably want to peruse the Unicode codices. For example:

http://www.unicode.org/faq/char_combmark.html

But in brief, it's mostly the case that a number of common European languages have a 1-to-1 code point to character mapping, as well as Chinese writing. Outside of this narrow set, you're on shaky ground. Examples (that I can think of, there are many others):

- Almost all Korean characters are composed of multiple code points.
- The Indic languages (which cover quite a good number of Unicode code pages) have ligatures that require multiple code points.
- The Thai block contains a series of combining diacritics for vowels and tones.
- Hebrew vowel points require multiple code points;
- A good number of native American scripts require combining marks, e.g., Navajo.
- International Phonetic Alphabet (primarily only for linguistic uses, but could be widespread because it's relevant everywhere language is spoken).
- Classical Greek accents (though this is less common, mostly being used only in academic circles).

Even within the realm of European languages and languages that use some version of the Latin script, there is an entire block of code points in Unicode (the U+0300 block) dedicated to combining diacritics. A good number of combinations do not have precomposed characters.

Now as far as normalization is concerned, it only helps if a particular combination of diacritics on a base glyph have a precomposed form. A large number of the above languages do not have precomposed characters simply because of the sheer number of combinations. The only reason the CJK block actually includes a huge number of precomposed characters was because the rules for combining the base forms are too complex to encode compositionally. Otherwise, most languages with combining diacritics would not have precomposed characters assigned to their respective blocks. In fact, a good number (all?) of precomposed Latin characters were included in Unicode only because they existed in pre-Unicode days and some form of compatibility was desired back when Unicode was still not yet widely adopted.

So basically, besides a small number of languages, the idea of 1 code point == 1 character is pretty unworkable. Especially in this day and age of worldwide connectivity.

T

--
The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!
May 27 2016
prev sibling next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Fri, 27 May 2016 15:47:32 +0200
schrieb ag0aep6g <anonymous example.com>:

 On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:
 However the following do require autodecoding:

 s.walkLength
 s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
 s.count!(c => c >= 32) // non-control characters

 Currently the standard library operates at code point level even
 though inside it may choose to use code units when admissible. Leaving
 such a decision to the library seems like a wise thing to do.  
But how is the user supposed to know without being a core contributor to Phobos?
Misunderstanding. All examples work properly today because of autodecoding. -- Andrei
They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2
1: Auto-decoding shall ALWAYS do the proper thing
2: Therefor humans shall read text in units of code points
3: OS X is an anomaly and must be purged from this planet
4: Indonesians shall be converted to a sane alphabet
5: He who useth combining diacritics shall burn in hell
6: We shall live in peace and harmony forevermore

Let's give this a rest.

--
Marco
May 30 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
 4: Indonesians* shall be converted to a sane alphabet
*Correction: Koreans (2-4 Hangul syllables (code points) form each letter) -- Marco
May 30 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, May 27, 2016 09:40:21 H. S. Teoh via Digitalmars-d wrote:
 On Fri, May 27, 2016 at 03:47:32PM +0200, ag0aep6g via Digitalmars-d wrote:
 On 05/27/2016 03:32 PM, Andrei Alexandrescu wrote:
 However the following do require autodecoding:

 s.walkLength
 s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
 s.count!(c => c >= 32) // non-control characters

 Currently the standard library operates at code point level even
 though inside it may choose to use code units when admissible.
 Leaving such a decision to the library seems like a wise thing
 to do.
But how is the user supposed to know without being a core contributor to Phobos?
Misunderstanding. All examples work properly today because of autodecoding. -- Andrei
They only work "properly" if you define "properly" as "in terms of code points". But working in terms of code points is usually wrong. If you want to count "characters", you need to work with graphemes. https://dpaste.dzfl.pl/817dec505fd2
Exactly. And we just keep getting stuck on this point. It seems that the message just isn't getting through. The unfounded assumption continues to be made that iterating by code point is somehow "correct" by definition and nobody can challenge it. String handling, especially in the standard library, ought to be (1) efficient where possible, and (2) be as correct as possible (meaning, most corresponding to user expectations -- principle of least surprise). If we can't have both, we should at least have one, right? However, the way autodecoding is currently implemented, we have neither.
Exactly. Saying that operating at the code point level - UTF-32 - is correct is like saying that operating at UTF-16 instead of UTF-8 is correct. More full characters fit in a single code unit, but they still don't all fit. You have to go to the grapheme level for that.

IIRC, Andrei talked in TDPL about how UTF-8 was better than UTF-16, because you figured out when you screwed up Unicode handling more quickly, because very few Unicode characters fit in a single UTF-8 code unit, whereas many more fit in a single UTF-16 code unit, making it harder to catch errors with UTF-16. Well, we're making the same mistake but with UTF-32 instead of UTF-16. The code is still wrong, but it's that much harder to catch that it's wrong.
 Firstly, it is beyond clear that autodecoding adds a significant amount
 of overhead, and because it's automatic, it applies to ALL string
 processing in D.  The only way around it is to fight against the
 standard library and use workarounds to bypass all that
 meticulously-crafted autodecoding code, begging the question of why
 we're even spending the effort on said code in the first place.
The standard library has to fight against itself because of autodecoding! The vast majority of the algorithms in Phobos are special-cased on strings in an attempt to get around autodecoding. That alone should highlight the fact that autodecoding is problematic.
 The fact of the matter is that if you're going to write Unicode string
 processing code, you're gonna hafta to know the dirty nitty gritty of
 Unicode strings, including the fine distinctions between code units,
 code points, grapheme clusters, etc.. Since this is required knowledge
 anyway, why not just let the user worry about how to iterate over the
 string? Let the user choose what best suits his application, whether
 it's working directly with code units for speed, or iterating over
 grapheme clusters for correctness (in terms of visual "characters"),
 instead of choosing the pessimal middle ground that's neither efficient
 nor correct?
There is no solution here that's going to be both correct and efficient. Ideally, we either need to provide a fully correct solution that's dog slow, or we need to provide a solution that's efficient but requires that the programmer understand Unicode to write correct code. Right now, we have a slow solution that's incorrect. - Jonathan M Davis
May 31 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
 Saying that operating at the code point level - UTF-32 - is correct
 is like saying that operating at UTF-16 instead of UTF-8 is correct.
Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei
May 31 2016
next sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
 On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
 Saying that operating at the code point level - UTF-32 - is correct
 is like saying that operating at UTF-16 instead of UTF-8 is correct.
Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei
Okay. If you have the letter A, it will fit in one UTF-8 code unit, one UTF-16 code unit, and one UTF-32 code unit (so, one code point). assert("A"c.length == 1); assert("A"w.length == 1); assert("A"d.length == 1); If you have 月, then you get assert("月"c.length == 3); assert("月"w.length == 1); assert("月"d.length == 1); whereas if you have 𐀆, then you get assert("𐀆"c.length == 4); assert("𐀆"w.length == 2); assert("𐀆"d.length == 1); So, with these characters, it's clear that UTF-8 and UTF-16 don't cut it for holding an entire character, but it still looks like UTF-32 does. However, what about characters like é or שׂ? Notice that שׂ takes up more than one code point. assert("שׂ"c.length == 4); assert("שׂ"w.length == 2); assert("שׂ"d.length == 2); It's ש with some sort of dot marker on it that they have in Hebrew, but it's a single character in spite of the fact that it's multiple code points. é is in a similar, though more complicated boat. With D, you'll get assert("é"c.length == 2); assert("é"w.length == 1); assert("é"d.length == 1); because the compiler decides to use the version of é that's a single code point. However, Unicode is set up so that that accent can be its own code point and be applied to any other code point - be it an e, an a, or even something like the number 0. If we normalize é, we can see other versions of it that take up more than one code point. e.g. assert("é"d.normalize!NFC.length == 1); assert("é"d.normalize!NFD.length == 2); assert("é"d.normalize!NFKC.length == 1); assert("é"d.normalize!NFKD.length == 2); And you can even put that accent on 0 by doing something like assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d); One or more code units combine to make a single code point, but one or more code points also combine to make a grapheme. So, while there is a definite layer of separation between code units and code points, it's still the case that a single code point is not guaranteed to be a single character. You do indeed have encodings with code units and not code points (though those still have different normalizations, which is kind of like having different encodings), but in terms of correctness, you have the same problem with treating code points as characters that you have as treating code units as characters. You're still not guaranteed that you're operating on full characters and risk chopping them up. It's just that at the code point level, you're generally chopping something up that is visually separable (like an accent from a letter or a superscript on a symbol), whereas with code units, you end up with utter garbage when you chop them incorrectly. By operating at the code point level, we're correct for _way_ more characters than we would be than if we treated char like a full character, but we're still not fully correct, and it's a lot harder to notice when you screw it up, because the number of characters which are handled incorrectly is far smaller. - Jonathan M Davis
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
 On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
 On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
 Saying that operating at the code point level - UTF-32 - is correct
 is like saying that operating at UTF-16 instead of UTF-8 is correct.
Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei
Okay. If you have the letter A, it will fit in one UTF-8 code unit, one UTF-16 code unit, and one UTF-32 code unit (so, one code point).

assert("A"c.length == 1);
assert("A"w.length == 1);
assert("A"d.length == 1);

If you have 月, then you get

assert("月"c.length == 3);
assert("月"w.length == 1);
assert("月"d.length == 1);

whereas if you have 𐀆, then you get

assert("𐀆"c.length == 4);
assert("𐀆"w.length == 2);
assert("𐀆"d.length == 1);

So, with these characters, it's clear that UTF-8 and UTF-16 don't cut it for holding an entire character, but it still looks like UTF-32 does.
Does walkLength yield the same number for all representations?
 However,
 what about characters like é or שׂ? Notice that שׂ takes up more than one
code
 point.

 assert("שׂ"c.length == 4);
 assert("שׂ"w.length == 2);
 assert("שׂ"d.length == 2);

 It's ש with some sort of dot marker on it that they have in Hebrew, but it's
 a single character in spite of the fact that it's multiple code points. é is
 in a similar, though more complicated boat. With D, you'll get

 assert("é"c.length == 2);
 assert("é"w.length == 1);
 assert("é"d.length == 1);

 because the compiler decides to use the version of é that's a single code
 point.
Does walkLength yield the same number for all representations?
 However, Unicode is set up so that that accent can be its own code
 point and be applied to any other code point - be it an e, an a, or even
 something like the number 0. If we normalize é, we can see other
 versions of it that take up more than one code point. e.g.

 assert("é"d.normalize!NFC.length == 1);
 assert("é"d.normalize!NFD.length == 2);
 assert("é"d.normalize!NFKC.length == 1);
 assert("é"d.normalize!NFKD.length == 2);
Does walkLength yield the same number for all representations?
 And you can even put that accent on 0 by doing something like

 assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d);

 One or more code units combine to make a single code point, but one or more
 code points also combine to make a grapheme.
That's right. D's handling of UTF is at the code unit level (like all of Unicode is portably defined). If you want graphemes use byGrapheme. It seems you destroyed your own argument, which was:
 Saying that operating at the code point level - UTF-32 - is correct
 is like saying that operating at UTF-16 instead of UTF-8 is correct.
You can't claim code units are just a special case of code points. Andrei
May 31 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 31.05.2016 20:30, Andrei Alexandrescu wrote:
 D's
Phobos'
 handling of UTF is at the code unit
code point
 level (like all of Unicode is portably defined).
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 02:46 PM, Timon Gehr wrote:
 On 31.05.2016 20:30, Andrei Alexandrescu wrote:
 D's
Phobos'
foreach, too. -- Andrei
May 31 2016
parent reply ZombineDev <petar.p.kirov gmail.com> writes:
On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu 
wrote:
 On 05/31/2016 02:46 PM, Timon Gehr wrote:
 On 31.05.2016 20:30, Andrei Alexandrescu wrote:
 D's
Phobos'
foreach, too. -- Andrei
Incorrect. https://dpaste.dzfl.pl/ba7a65d59534
Jun 01 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/01/2016 01:35 PM, ZombineDev wrote:
 On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu wrote:
 On 05/31/2016 02:46 PM, Timon Gehr wrote:
 On 31.05.2016 20:30, Andrei Alexandrescu wrote:
 D's
Phobos'
foreach, too. -- Andrei
Incorrect. https://dpaste.dzfl.pl/ba7a65d59534
Try typing the iteration variable with "dchar". -- Andrei
Jun 01 2016
next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Wednesday, 1 June 2016 at 17:57:15 UTC, Andrei Alexandrescu 
wrote:
 Try typing the iteration variable with "dchar". -- Andrei
Or you can type it as wchar... But important to note: that's opt in, not automatic.
Jun 01 2016
prev sibling parent reply ZombineDev <petar.p.kirov gmail.com> writes:
On Wednesday, 1 June 2016 at 17:57:15 UTC, Andrei Alexandrescu 
wrote:
 On 06/01/2016 01:35 PM, ZombineDev wrote:
 On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu 
 wrote:
 On 05/31/2016 02:46 PM, Timon Gehr wrote:
 On 31.05.2016 20:30, Andrei Alexandrescu wrote:
 D's
Phobos'
foreach, too. -- Andrei
Incorrect. https://dpaste.dzfl.pl/ba7a65d59534
Try typing the iteration variable with "dchar". -- Andrei
I think you are not getting my point. This is not autodecoding. There is nothing auto-magic w.r.t. strings in plain foreach. Typing char, wchar or dchar is the same as using byChar, byWchar or byDchar - it is opt-in. The only problems are the front, empty and popFront overloads for narrow strings.
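To illustrate the opt-in nature, a small sketch:

void main()
{
    string s = "øy";   // 3 UTF-8 code units, 2 code points

    size_t units, points;
    foreach (char c; s)  ++units;   // iterates code units, no decoding
    foreach (dchar c; s) ++points;  // decodes on the fly, because you asked for dchar
    assert(units == 3);
    assert(points == 2);
}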
Jun 01 2016
next sibling parent ZombineDev <petar.p.kirov gmail.com> writes:
On Wednesday, 1 June 2016 at 19:07:26 UTC, ZombineDev wrote:
 On Wednesday, 1 June 2016 at 17:57:15 UTC, Andrei Alexandrescu 
 wrote:
 On 06/01/2016 01:35 PM, ZombineDev wrote:
 On Tuesday, 31 May 2016 at 19:33:03 UTC, Andrei Alexandrescu 
 wrote:
 On 05/31/2016 02:46 PM, Timon Gehr wrote:
 On 31.05.2016 20:30, Andrei Alexandrescu wrote:
 D's
Phobos'
foreach, too. -- Andrei
Incorrect. https://dpaste.dzfl.pl/ba7a65d59534
Try typing the iteration variable with "dchar". -- Andrei
I think you are not getting my point. This is not autodecoding. There is nothing auto-magic w.r.t. strings in plain foreach. Typing char, wchar or dchar is the same as using byChar, byWchar or byDchar - it is opt-in. The only problems are the front, empty and popFront overloads for narrow strings...
in std.range.primitives.
Jun 01 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/01/2016 03:07 PM, ZombineDev wrote:
 This is not autodecoding. There is nothing auto-magic w.r.t. strings in
 plain foreach.
I understand where you're coming from, but it actually is autodecoding. Consider:

byte[] a;
foreach (byte x; a) {}
foreach (short x; a) {}
foreach (int x; a) {}

That works by means of a conversion short->int. However:

char[] a;
foreach (char x; a) {}
foreach (wchar x; a) {}
foreach (dchar x; a) {}

The latter two do autodecoding, not conversion as the rest of the language.

Andrei
Jun 01 2016
next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu 
wrote:
 foreach (dchar x; a) {}
 The latter two do autodecoding, not coversion as the rest of 
 the language.
This seems to be a miscommunication with semantics. This is not auto-decoding at all; you're decoding, but there is nothing "auto" about it. This code is an explicit choice by the programmer to do something. On the other hand, using std.range.primitives.front for narrow strings is auto-decoding because the programmer has not made a choice, the choice is made for the programmer.
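In code, that difference looks roughly like this (a small sketch):

import std.range.primitives : front;

void main()
{
    string s = "é";   // two UTF-8 code units

    static assert(is(typeof(s.front) == dchar));  // front decodes for you...
    assert(s.front == 'é');
    assert(s.length == 2);                        // ...while length/indexing stay at code units
}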
Jun 01 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/01/2016 05:30 PM, Jack Stouffer wrote:
 On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu wrote:
 foreach (dchar x; a) {}
 The latter two do autodecoding, not coversion as the rest of the
 language.
This seems to be a miscommunication with semantics. This is not auto-decoding at all; you're decoding, but there is nothing "auto" about it. This code is an explicit choice by the programmer to do something.
No, this is autodecoding pure and simple. We can't move the goalposts whenever we don't like where the ball lands. The usual language rules are not applied for strings - they are autodecoded (i.e. there's code generated that magically decodes UTF surprisingly for beginners, in apparent violation of the language rules, and without any user-visible request) by the foreach statement. -- Andrei
Jun 01 2016
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 01.06.2016 23:48, Andrei Alexandrescu wrote:
 On 06/01/2016 05:30 PM, Jack Stouffer wrote:
 On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu wrote:
 foreach (dchar x; a) {}
 The latter two do autodecoding, not coversion as the rest of the
 language.
This seems to be a miscommunication with semantics. This is not auto-decoding at all; you're decoding, but there is nothing "auto" about it. This code is an explicit choice by the programmer to do something.
No, this is autodecoding pure and simple. We can't move the goalposts whenever we don't like where the ball lands.
It does not share most of the characteristics that make Phobos' autodecoding painful in practice.
 The usual language rules are
 not applied for strings - they are autodecoded (i.e. there's code
 generated that magically decodes UTF surprisingly for beginners, in
 apparent violation of the language rules, and without any user-visible
 request) by the foreach statement. -- Andrei
Agreed. (But implicit conversion from char to dchar is a bad language rule.)
Jun 02 2016
prev sibling next sibling parent reply ZombineDev <petar.p.kirov gmail.com> writes:
On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu 
wrote:
 On 06/01/2016 03:07 PM, ZombineDev wrote:
 This is not autodecoding. There is nothing auto-magic w.r.t. 
 strings in
 plain foreach.
I understand where you're coming from, but it actually is autodecoding. Consider: byte[] a; foreach (byte x; a) {} foreach (short x; a) {} foreach (int x; a) {} That works by means of the usual implicit conversions (byte -> short, byte -> int). However: char[] a; foreach (char x; a) {} foreach (wchar x; a) {} foreach (dchar x; a) {} The latter two do autodecoding, not conversion as in the rest of the language. Andrei
Regardless of what different people may call it, it's not what this thread is about. Deprecating front, popFront and empty for narrow strings is what we are talking about here. This has little to do with explicit string transcoding in foreach. I don't think anyone has a problem with it, because it is **opt-in** and easy to change to get the desired behavior. On the other hand, trying to prevent Phobos from autodecoding without typesystem-defeating hacks like .representation is an uphill battle right now.

Removing range autodecoding will also be beneficial for library writers. For example, instead of writing find specializations for char, wchar and dchar needles, it would be much more productive to focus on optimising searching for T in T[] and specializing on element size and other type properties that generic code should care about. Having to specialize for all the char and string types, instead of just any type of that size that can be compared bitwise, is like programming in a language with no support for generic programming.

And like many others have pointed out, it is also about correctness. Only the users can decide if searching at code unit, code point or grapheme level (or something else) is right for their needs. A library that pretends that a single interpretation (i.e. code point) is right for every case is a false friend.
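To make the three levels concrete, a minimal sketch (my own illustration, assuming the 'ö' literals are in precomposed NFC form; 0xC3 is just the first UTF-8 code unit of 'ö'):

import std.algorithm.searching : canFind;
import std.string : representation;
import std.uni : byGrapheme, Grapheme;

void main()
{
    string s = "cöde";
    assert(s.representation.canFind(0xC3));      // code unit level: raw UTF-8 bytes
    assert(s.canFind('ö'));                      // code point level: today's autodecoding default
    assert(s.byGrapheme.canFind(Grapheme("ö"))); // grapheme level: explicit opt-in
}

Which of the three is "right" depends entirely on what the caller is trying to do.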
Jun 01 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/01/2016 06:09 PM, ZombineDev wrote:
 Regardless of how different people may call it, it's not what this
 thread is about.
Yes, definitely - but then again we can't, after each invalidated claim, go "yeah well, but that other point stands".
 Deprecating front, popFront and empty for narrow
 strings is what we are talking about here.
That will not happen. Walter and I consider the cost excessive and the benefit too small.
 This has little to do with
 explicit string transcoding in foreach.
It is implicit, not explicit.
 I don't think anyone has a
 problem with it, because it is **opt-in** and easy to change to get the
 desired behavior.
It's not opt-in. There is no way to tell foreach "iterate this array by converting char to dchar by the usual language rules, no autodecoding". You can if you e.g. use uint for the iteration variable. Same deal as with .representation.
 On the other hand, trying to prevent Phobos from autodecoding without
 typesystem defeating hacks like .representation is an uphill battle
 right now.
Characterizing .representation as a typesystem defeating hack is a stretch. What memory safety issues is it introducing? Andrei
Jun 01 2016
next sibling parent Kagamin <spam here.lot> writes:
On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu 
wrote:
 Deprecating front, popFront and empty for narrow
 strings is what we are talking about here.
That will not happen. Walter and I consider the cost excessive and the benefit too small.
 This has little to do with
 explicit string transcoding in foreach.
It is implicit, not explicit.
Do you mean you agree that range primitives for strings can be changed to stay (auto)decoding to dchar, but require some form of explicit opt-in?
Jun 02 2016
prev sibling next sibling parent reply ZombineDev <petar.p.kirov gmail.com> writes:
On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu 
wrote:
 On 06/01/2016 06:09 PM, ZombineDev wrote:
 Regardless of how different people may call it, it's not what 
 this
 thread is about.
Yes, definitely - but then again we can't after each invalidated claim to go "yeah well but that other point stands".
My claim was not invalidated. I just didn't want to waste time arguing about it, because it is off topic. My point was that foreach is purely a language construct that doesn't know about the std.range.primitives module, therefore doesn't use it, and therefore foreach doesn't perform **auto**decoding. It does perform explicit decoding, because you need to specify a different type of iteration variable to trigger the behavior. If the variable type is not specified, you won't get any decoding (it will instead iterate over the code units).
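A tiny sketch of that last point (my own illustration; std.traits.Unqual is used only to sidestep the exact qualifier the compiler infers):

import std.traits : Unqual;

void main()
{
    string s = "ö";
    foreach (c; s)                                      // no type specified for c
        static assert(is(Unqual!(typeof(c)) == char));  // inferred element is a code unit, not dchar
}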
 Deprecating front, popFront and empty for narrow
 strings is what we are talking about here.
That will not happen. Walter and I consider the cost excessive and the benefit too small.
On the other hand, many people think that the cost of using a language (like C++) that has accumulated an excessive number of bad design decisions and pitfalls is too high. Keeping bad design decisions alienates existing users and repulses new ones.

I know you are in a difficult decision-making position, but imagine telling people ten years from now:

A) For the last ten years we worked on fixing every bad design and improving all the good ones. That's why we managed to expand our market share/mind share 10x-100x compared to what we had before.

B) This strange feature you need to know about is here because we chose compatibility with old code over building the best language possible. The language managed to continue growing (but not as fast as we hoped) only because of the other good features. You should use this feature and here's a long list of things you need to consider when avoiding it.

The majority of D users ten years from now are not yet D users. That's the target group you need to consider. And given the overwhelming support for fixing this problem among the existing users, you need to reevaluate your cost vs benefit metrics.

This theme (breaking code) has come up many times before, and I think that instead of complaining about the cost, we should focus on lowering it with tooling. The problem I currently see is that there is not enough support for building and improving tools like dfix and leveraging them for the language/std lib design process.
 This has little to do with
 explicit string transcoding in foreach.
It is implicit, not explicit.
 I don't think anyone has a
 problem with it, because it is **opt-in** and easy to change 
 to get the
 desired behavior.
It's not opt-in.
You need to opt in by specifying the type of the iteration variable, and that type needs to be different from typeof(array[0]). That's opt-in in my book.
 There is no way to tell foreach "iterate this array by 
 converting char to dchar by the usual language rules, no 
 autodecoding". You can if you e.g. use uint for the iteration 
 variable. Same deal as with .representation.
Again, off topic. No sane person wants automatic conversion (bitcast) from char to dchar, because dchar gives the impression of a fully decoded code point, which the result of such a cast would certainly not provide.
 On the other hand, trying to prevent Phobos from autodecoding 
 without
 typesystem defeating hacks like .representation is an uphill 
 battle
 right now.
Characterizing .representation as a typesystem defeating hack is a stretch. What memory safety issues is it introducing?
Memory safety is not the only benefit of a type system. This goal is only a small subset of the larger goal of preventing logical errors and allowing greater expressiveness. You may as well invent a memory-safe subset of D that works only with ubyte, ushort, uint, ulong and arrays of those types, but I don't think anyone would want to use such a language. Using .representation in parts of your code makes those parts like the aforementioned language that no one wants to use.
Jun 02 2016
next sibling parent ZombineDev <petar.p.kirov gmail.com> writes:
 ...

 B) This strange feature you need to know about is here because 
 we chose compatibility with old code, over building the best 
 language possible. The language managed to continue growing 
 (but not as fast as we hoped) only because of the other good 
 features. You should use this feature and here's a long list of 
 things you need to consider when avoiding it.
B) This strange feature is here because we chose compatibility with old code, over building the best language possible. The language managed to continue growing (but not as fast as we hoped) only because of the other good features. You shouldn't use this feature because of this and that potential pitfall, and here's a long list of things you need to consider when avoiding it.
 ...
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 06:42 AM, ZombineDev wrote:
 On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu wrote:
 On 06/01/2016 06:09 PM, ZombineDev wrote:
 Regardless of how different people may call it, it's not what this
 thread is about.
Yes, definitely - but then again we can't after each invalidated claim to go "yeah well but that other point stands".
My claim was not invalidated. I just didn't want to waste time arguing about it, because it is off topic. My point was that foreach is a purely language construct that doesn't know about the std.range.primitives module, therefore doesn't use it and therefore foreach doesn't perform **auto**decoding. It does perform explicit decoding because you need to specify a different type of iteration variable to trigger the behavior. If the variable type is not specified, you won't get any decoding (it will instead iterate over the code units).
Your claim was obliterated, and now you continue arguing it by adjusting term definitions on the fly, while at the same time awesomely claiming to choose the high road by not wasting time to argue it. I should remember the trick :o). Stand with the points that stand, own those that don't.
 Deprecating front, popFront and empty for narrow
 strings is what we are talking about here.
That will not happen. Walter and I consider the cost excessive and the benefit too small.
On the other hand many people think that the cost of using a language (like C++) that has accumulated excessive number of bad design decisions and pitfalls is too high. Keeping bad design decisions alienates existing users and repulses new ones.
Definitely. It's a fine line to walk; this particular decision is not that much on the edge at all. We must stay with autodecoding.
 I know you are in a difficult decision making position, but imagine
 telling people ten years from now:

 A) For the last ten years we worked on fixing every bad design and
 improving all the good ones. That's why we managed to expand our market
 share/mind share 10x-100x to what we had before.
I think we have underperformed and we need to do radically better. I'm on the lookout for radical new approaches to things all the time. This is for another discussion though.
 B) This strange feature you need to know about is here because we chose
 compatibility with old code, over building the best language possible.
 The language managed to continue growing (but not as fast as we hoped)
 only because of the other good features. You should use this feature and
 here's a long list of things you need to consider when avoiding it.
There are many components to the decision, not only compatibility with old code.
 The majority of D users ten years from now are not yet D users. That's
 the target group you need to consider. And given the overwhelming
 support for fixing this problem by the existing users, you need to
 reevaluate your cost vs benefit metrics.
It's funny that evidence for the "overwhelming" support is the vote of 35 voters, which was cast in terms of percentages. Math is great.

ZombineDev, I've been at the top level in the C++ community for many many years, even after I wanted to exit :o). I'm familiar with how the committee that steers C++ works, a perspective that is unique in our community - even Walter lacks it. I see trends and patterns. It is interesting how easily a small but very influential priesthood can alienate itself from the needs of the larger community and get into a frenzy over matters that are simply missing the point.

This is what's happening here. We worked ourselves to a foam because the creator of the language started a thread entitled "The Case Against Autodecode", whilst fully understanding there is no way to actually eliminate autodecode. The very definition of a useless debate, the kind he and I had agreed to not initiate anymore. It was a mistake. I'm still metaphorically angry at him for it. I admit I started it by asking the question, but Walter shouldn't have answered. Following that, there was blood in the water; any of us loves to improve something by 2% by completely rewiring the thing. A proneness to doing that is why we self-select to be in this community and forum.

Meanwhile, I go to conferences. Train and consult at large companies. Dozens every year, cumulatively thousands of people. I talk about D and ask people what it would take for them to use the language. Invariably I hear a surprisingly small number of reasons:

* The garbage collector eliminates probably 60% of potential users right off.

* Tooling is immature and of poorer quality compared to the competition.

* Safety has holes and bugs.

* Hiring people who know D is a problem.

* Documentation and tutorials are weak.

* There's no web services framework (by this time many folks know of D, but of those a shockingly small fraction has even heard of vibe.d). I strongly argued with Sönke to bundle vibe.d with dmd over one year ago, and also in this forum. There wasn't enough interest.

* (On Windows) if it doesn't have a compelling Visual Studio plugin, it doesn't exist.

* Let's wait for the "herd effect" (corporate support) to start.

* Not enough advantages over the competition to make up for the weaknesses above.

There is a second echelon of arguments related to language proper issues, but those collectively count as much less than the above. And "inefficient/poor/error-prone string handling" has NEVER come up. Literally NEVER, even among people who had some familiarity with D and would otherwise make very informed comments about it.

Look at reddit and hackernews, too - admittedly other self-selected communities. Language debates often spring up. How often is the point being made that D is wanting because of its string support? Nada.
 This theme (breaking code) has come up many times before and I think
 that instead of complaining about the cost, we should focus on lower it
 with tooling. The problem I currently see is that there is not enough
 support for building and improving tools like dfix and leveraging them
 for language/std lib design process.
Currently dfix is weak because it doesn't do lookup. So we need to make the front end into a library. Daniel said he wants to be on it, but he has two jobs to worry about so he's short on time. There's only so many hours in the day, and I think the right focus is on attacking the matters above.
 This has little to do with
 explicit string transcoding in foreach.
It is implicit, not explicit.
 I don't think anyone has a
 problem with it, because it is **opt-in** and easy to change to get the
 desired behavior.
It's not opt-in.
You need to opt in by specifying the type of the iteration variable, and that type needs to be different from typeof(array[0]). That's opt-in in my book.
Taking exception to language rules for iteration with dchar is not opt-in.
 There is no way to tell foreach "iterate this array by converting char
 to dchar by the usual language rules, no autodecoding". You can if you
 e.g. use uint for the iteration variable. Same deal as with
 .representation.
Again, off topic.
It's very on-topic. These are surprising semantics compared to the rest of the language, which the user needs to be informed about.
 No sane person wants automatic conversion (bitcast)
 from char to dchar, because dchar gives the impression of a fully
 decoded code point, which the result of such cast would certainly not
 provide.
void fun(char c)
{
    if (c < 0x80)
    {
        // Look ma I'm not a sane person
        dchar d = c; // conversion is implicit, too
        ...
    }
}
 On the other hand, trying to prevent Phobos from autodecoding without
 typesystem defeating hacks like .representation is an uphill battle
 right now.
Characterizing .representation as a typesystem defeating hack is a stretch. What memory safety issues is it introducing?
Memory safety is not the only benefit of a type system. This goal is only a small subset of the larger goal of preventing logical errors and allowing greater expressiveness.
This sounds like "no comeback here so let's insert a filler". Care to substantiate?
 You may as well invent a memory safe subset of D that works only ubyte,
 ushort, uint, ulong and arrays of those types, but I don't think anyone
 would want to use such language. Using .representation in parts of your
 code, makes those parts like the aforementioned language that no one
 wants to use.
I disagree. Andrei
Jun 02 2016
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 15:06, Andrei Alexandrescu wrote:
 On 06/02/2016 06:42 AM, ZombineDev wrote:
 On Wednesday, 1 June 2016 at 22:24:49 UTC, Andrei Alexandrescu wrote:
 On 06/01/2016 06:09 PM, ZombineDev wrote:
 Regardless of how different people may call it, it's not what this
 thread is about.
Yes, definitely - but then again we can't after each invalidated claim to go "yeah well but that other point stands".
My claim was not invalidated. I just didn't want to waste time arguing about it, because it is off topic. My point was that foreach is a purely language construct that doesn't know about the std.range.primitives module, therefore doesn't use it and therefore foreach doesn't perform **auto**decoding. It does perform explicit decoding because you need to specify a different type of iteration variable to trigger the behavior. If the variable type is not specified, you won't get any decoding (it will instead iterate over the code units).
Your claim was obliterated, and now you continue arguing it by adjusting term definitions on the fly, while at the same time awesomely claiming to choose the high road by not wasting time to argue it. I should remember the trick :o). Stand with the points that stand, own those that don't.
It's not "on the fly". You two were presumably using different definitions of terms all along.
Jun 02 2016
prev sibling next sibling parent reply cym13 <cpicard openmailbox.org> writes:
On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu 
wrote:
 Your claim was obliterated, and now you continue arguing it by 
 adjusting term definitions on the fly, while at the same time 
 awesomely claiming to choose the high road by not wasting time 
 to argue it. I should remember the trick :o). Stand with the 
 points that stand, own those that don't.
 Definitely. It's a fine line to walk; this particular decision 
 is not that much on the edge at all. We must stay with 
 autodecoding.
If you are to stay with autodecoding (and I hope you won't) then please, *please*, at least make it decode to graphemes, so that it decodes to something that actually has some kind of meaning of its own.
 I think we have underperformed and we need to do radically 
 better. I'm on lookout for radical new approaches to things all 
 the time. This is for another discussion though.

 There are many components to the decision, not only 
 compatibility with old code.

 It's funny that evidence for the "overwhelming" support is the 
 vote of 35 voters, which was cast in terms of percentages. Math 
 is great.

 ZombineDev, I've been at the top level in the C++ community for 
 many many years, even after I wanted to exit :o). I'm familiar 
 with how the committee that steers C++ works, perspective that 
 is unique in our community - even Walter lacks it. I see trends 
 and patterns. It is interesting how easily a small but very 
 influential priesthood can alienate itself from the needs of 
 the larger community and get into a frenzy over matters that 
 are simply missing the point.

 This is what's happening here. We worked ourselves to a foam 
 because the creator of the language started a thread entitled 
 "The Case Against Autodecode", whilst fully understanding there 
 is no way to actually eliminate autodecode. The very definition 
 of a useless debate, the kind he and I had agreed to not 
 initiate anymore. It was a mistake. I'm still metaphorically 
 angry at him for it. I admit I started it by asking the 
 question, but Walter shouldn't have answered. Following that, 
 there was blood in the water; any of us loves to improve 
 something by 2% by completely rewiring the thing. A proneness 
 to doing that is why we self-select to be in this community and 
 forum.

 Meanwhile, I go to conferences. Train and consult at large 
 companies. Dozens every year, cumulatively thousands of people. 
 I talk about D and ask people what it would take for them to 
 use the language. Invariably I hear a surprisingly small number 
 of reasons:

 * The garbage collector eliminates probably 60% of potential 
 users right off.

 * Tooling is immature and of poorer quality compared to the 
 competition.

 * Safety has holes and bugs.

 * Hiring people who know D is a problem.

 * Documentation and tutorials are weak.

 * There's no web services framework (by this time many folks 
 know of D, but of those a shockingly small fraction has even 
 heard of vibe.d). I have strongly argued with Sönke to bundle 
 vibe.d with dmd over one year ago, and also in this forum. 
 There wasn't enough interest.

 * (On Windows) if it doesn't have a compelling Visual Studio 
 plugin, it doesn't exist.

 * Let's wait for the "herd effect" (corporate support) to start.

 * Not enough advantages over the competition to make up for the 
 weaknesses above.

 There is a second echelon of arguments related to language 
 proper issues, but those collectively count as much less than 
 the above. And "inefficient/poor/error-prone string handling" 
 has NEVER come up. Literally NEVER, even among people who had 
 some familiarity with D and would otherwise make very informed 
 comments about it.

 Look at reddit and hackernews, too - admittedly other 
 self-selected communities. Language debates often spring about. 
 How often is the point being made that D is wanting because of 
 its string support? Nada.
I think the real reason why this isn't mentioned in the criticisms you list is that people don't know about it. Most people don't even imagine it can be as broken as it is. Heck, it even took Walter by surprise after years! This thread is the first real discussion we've had about it with proper deconstruction and very reasonable arguments against it. The only unreasonable thing here has been your own arguments. I'd like not to point a finger at you, but the fact is that you are the only one defending autodecoding, and not with good arguments.

Currently autodecoding relies on chance only. (Yes, I call “hoping the text we're manipulating can be represented by dchars” chance.) This cannot go on anymore.
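For what it's worth, a minimal sketch of why "fits in a dchar" is a gamble (the string is 'e' plus a combining acute accent, i.e. one visible character):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301";                  // one user-perceived character
    assert(s.length == 3);                 // UTF-8 code units
    assert(s.walkLength == 2);             // autodecoded code points - still not one "character"
    assert(s.byGrapheme.walkLength == 1);  // graphemes
}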
 Currently dfix is weak because it doesn't do lookup. So we need 
 to make the front end into a library. Daniel said he wants to 
 be on it, but he has two jobs to worry about so he's short on 
 time. There's only so many hours in the day, and I think the 
 right focus is on attacking the matters above.
...
 Andrei
Jun 02 2016
next sibling parent tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 13:55:28 UTC, cym13 wrote:
 If you are to stay with autodecoding (and I hope you won't) then
 please, *please*, at least make it decode to graphemes so that
 it decodes to something that actually have some kind of meaning
 of its own.
That would cause just as much - if not more - code breakage as ditching auto-decoding entirely. It would also be considerably slower and more memory-hungry.
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 09:55 AM, cym13 wrote:
 On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu wrote:
 Your claim was obliterated, and now you continue arguing it by
 adjusting term definitions on the fly, while at the same time
 awesomely claiming to choose the high road by not wasting time to
 argue it. I should remember the trick :o). Stand with the points that
 stand, own those that don't.
 Definitely. It's a fine line to walk; this particular decision is not
 that much on the edge at all. We must stay with autodecoding.
If you are to stay with autodecoding (and I hope you won't) then please, *please*, at least make it decode to graphemes so that it decodes to something that actually have some kind of meaning of its own.
That's not going to work. A false impression created in this thread has been that code points are useless and graphemes are da bomb. That's not the case even if we ignore the overwhelming issue of changing semantics of existing code.
 I think the real reason about why this isn't mentioned in the
 critics you mention is that people don't know about it. Most people
 don't even imagine it can be as broken as it is.
This should be taken at face value - rampant speculation. From my experience that's not how these things work.
 Heck, it even
 took Walter by surprise after years! This thread is the first real
 discussion we've had about it with proper deconstruction and
 very reasonnable arguments against it. The only unreasonnable thing
 here has been your own arguments. I'd like not to point a finger at
 you but the fact is that you are the only single one defending
 autodecoding and not with good arguments.
Fair enough. I accept continuous scrutiny of my competency - it comes with the territory.
 Currently autodecoding relies on chance only. (Yes, I call “hoping
 the text we're manipulating can be represented by dchars” chance.)
 This cannot be anymore.
The real ticket out of this is RCStr. It solves a major problem in the language (compulsive GC) and also a minor occasional annoyance (autodecoding). Andrei
Jun 02 2016
parent reply Marc Schütz <schuetzm gmx.net> writes:
On Thursday, 2 June 2016 at 14:28:44 UTC, Andrei Alexandrescu 
wrote:
 That's not going to work. A false impression created in this 
 thread has been that code points are useless
They _are_ useless for almost anything you can do with strings. The only places where they should be used are std.uni and std.regex. Again: What is the justification for using code points, in your opinion? Which practical tasks are made possible (and work _correctly_) if you decode to code points, that don't already work with code units?
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 01:54 PM, Marc Schütz wrote:
 On Thursday, 2 June 2016 at 14:28:44 UTC, Andrei Alexandrescu wrote:
 That's not going to work. A false impression created in this thread
 has been that code points are useless
They _are_ useless for almost anything you can do with strings. The only places where they should be used are std.uni and std.regex. Again: What is the justification for using code points, in your opinion? Which practical tasks are made possible (and work _correctly_) if you decode to code points, that don't already work with code units?
Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16).

* s.all!(c => c == 'ö') works only with autodecoding. It returns always false without.

* s.any!(c => c == 'ö') works only with autodecoding. It returns always false without.

* s.balancedParens('〈', '〉') works only with autodecoding.

* s.canFind('ö') works only with autodecoding. It returns always false without.

* s.commonPrefix(s1) works only if they both use the same encoding; otherwise it still compiles but silently produces an incorrect result.

* s.count('ö') works only with autodecoding. It returns always zero without.

* s.countUntil(s1) is really odd - without autodecoding, whether it works at all, and the result it returns, depends on both encodings. With autodecoding it always works and returns a number independent of the encodings.

* s.endsWith('ö') works only with autodecoding. It returns always false without.

* s.endsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.

* s.find('ö') works only with autodecoding. It never finds it without.

* s.findAdjacent is a very interesting one. It works with autodecoding, but without it it just does odd things.

* s.findAmong(s1) is also interesting. It works only with autodecoding.

* s.findSkip(s1) works only if s and s1 have the same encoding. Otherwise it compiles and runs but produces incorrect results.

* s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) work only if s and s1 have the same encoding. Otherwise they compile and run but produce incorrect results.

* s.minCount, s.maxCount are unlikely to be terribly useful but with autodecoding it consistently returns the extremum numeric code unit regardless of representation. Without, they just return encoding-dependent and meaningless numbers.

* s.minPos, s.maxPos follow a similar semantics.

* s.skipOver(s1) only works with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.

* s.startsWith('ö') works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.

* s.startsWith(s1) works only with autodecoding. Otherwise it compiles and runs but produces incorrect results if s and s1 have different encodings.

* s.until!(c => c == 'ö') works only with autodecoding. Otherwise, it will span the entire range.

===

The intent of autodecoding was to make std.algorithm work meaningfully with strings. As it's easy to see I just went through std.algorithm.searching alphabetically and found issues literally with every primitive in there. It's an easy exercise to go forth with the others.

Andrei
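For anyone who wants to try one of these, a minimal sketch of the count case ("öl" is an arbitrary example string, written with an explicit precomposed ö; .representation is the ubyte view discussed earlier in the thread):

import std.algorithm.searching : count;
import std.string : representation;

void main()
{
    string s = "\u00F6l";                     // "öl" with a precomposed ö
    assert(s.count('ö') == 1);                // autodecoding: compares decoded code points
    assert(s.representation.count('ö') == 0); // raw code units: no single byte equals 'ö'
}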
Jun 02 2016
next sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns always
 false without.
Doesn't work with autodecoding (to code points) when a combining diaeresis (U+0308) is used in s. Would actually work with UTF-16 and only combined 'ö's in s, because the combined character fits in a single UTF-16 code unit.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
ag0aep6g <anonymous example.com> wrote:
 On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with possibly
 different encodings (UTF8/UTF16).
 
 * s.all!(c => c == 'ö') works only with autodecoding. It returns always
 false without.
Doesn't work with autodecoding (to code points) when a combining diaeresis (U+0308) is used in s.
Works if s is normalized appropriately. No?
Jun 02 2016
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 21:26, Andrei Alexandrescu wrote:
 ag0aep6g <anonymous example.com> wrote:
 On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns always
 false without.
Doesn't work with autodecoding (to code points) when a combining diaeresis (U+0308) is used in s.
Works if s is normalized appropriately. No?
No.

assert(!"ö̶".normalize!NFC.any!(c => c == 'ö'));
Jun 02 2016
prev sibling parent ag0aep6g <anonymous example.com> writes:
On 06/02/2016 09:26 PM, Andrei Alexandrescu wrote:
 ag0aep6g <anonymous example.com> wrote:
 On 06/02/2016 09:05 PM, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns always
 false without.
Doesn't work with autodecoding (to code points) when a combining diaeresis (U+0308) is used in s.
Works if s is normalized appropriately. No?
Works when normalized to precomposed characters, yes. That's not a given, of course. When the user is aware enough to normalize their strings that way, then they should be able to call byDchar explicitly. And of course you can't do s.all!(c => c == 'a⃗'), despite a⃗ looking like one character. Need byGrapheme for that.
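A small sketch of the byGrapheme route for exactly that case (my own illustration; the needle is 'a' plus a combining arrow, U+20D7, which no single dchar can represent):

import std.algorithm.searching : canFind;
import std.uni : byGrapheme, Grapheme;

void main()
{
    string s = "xa\u20D7y";                            // contains the grapheme 'a' + U+20D7
    assert(s.byGrapheme.canFind(Grapheme("a\u20D7"))); // found only at the grapheme level
}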
Jun 02 2016
prev sibling next sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu 
wrote:
 Pretty much everything. Consider s and s1 string variables with 
 possibly different encodings (UTF8/UTF16).
 ...
Your 'ö' examples will NOT work reliably with auto-decoded code points, and for nearly the same reason that they won't work with code units; you would have to use byGrapheme. The fact that you still don't get that, even after a dozen plus attempts by the community to explain the difference, makes you unfit to direct Phobos' Unicode support. Please, either go study Unicode until you really understand it, or delegate this issue to someone else.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 03:34 PM, tsbockman wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with
 possibly different encodings (UTF8/UTF16).
 ...
Your 'ö' examples will NOT work reliably with auto-decoded code points, and for nearly the same reason that they won't work with code units; you would have to use byGrapheme.
They do work per spec: find this code point. It would be surprising if 'ö' were found but the string were positioned at a different code point.
 The fact that you still don't get that, even after a dozen plus attempts
 by the community to explain the difference, makes you unfit to direct
 Phobos' Unicode support.
Well there's gotta be a reason why my basic comprehension is under constant scrutiny whereas yours is safe.
 Please, either go study Unicode until you
 really understand it, or delegate this issue to someone else.
Would be happy to. To whom would I delegate? Andrei
Jun 02 2016
next sibling parent Brad Anderson <eco gnuk.net> writes:
On Thursday, 2 June 2016 at 20:13:14 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 03:34 PM, tsbockman wrote:
 [...]
They do work per spec: find this code point. It would be surprising if 'ö' were found but the string were positioned at a different code point.
 [...]
Well there's gotta be a reason why my basic comprehension is under constant scrutiny whereas yours is safe.
 [...]
Would be happy to. To whom would I delegate? Andrei
If there were to be a unicode lieutenant, Dmitry seems to be the obvious choice (if he's interested).
Jun 02 2016
prev sibling next sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 10:13 PM, Andrei Alexandrescu wrote:
 They do work per spec: find this code point. It would be surprising if
 'ö' were found but the string were positioned at a different code point.
The "spec" here is how the range primitives for narrow strings are defined, right? I.e., the spec says auto-decode code units to code points. The discussion is about whether the spec is good or bad. No one is arguing that there are bugs in the decoding to code points. People are arguing that auto-decoding to code points is not useful.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:23 PM, ag0aep6g wrote:
 People are arguing that auto-decoding to code points is not useful.
And want to return to the point where char[] is but an indiscriminated array, which would take std.algorithm back to the stone age. -- Andrei
Jun 02 2016
next sibling parent reply default0 <Kevin.Labschek gmx.de> writes:
On Thursday, 2 June 2016 at 20:30:34 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 04:23 PM, ag0aep6g wrote:
 People are arguing that auto-decoding to code points is not 
 useful.
And want to return to the point where char[] is but an indiscriminated array, which would take std.algorithm back to the stone age. -- Andrei
Just make RCStr the most amazing string type of any standard library ever and everyone will be happy :o)
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:37 PM, default0 wrote:
 On Thursday, 2 June 2016 at 20:30:34 UTC, Andrei Alexandrescu wrote:
 On 06/02/2016 04:23 PM, ag0aep6g wrote:
 People are arguing that auto-decoding to code points is not useful.
And want to return to the point where char[] is but an indiscriminated array, which would take std.algorithm back to the stone age. -- Andrei
Just make RCStr the most amazing string type of any standard library ever and everyone will be happy :o)
Soon as this thread ends. -- Andrei
Jun 02 2016
prev sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 10:30 PM, Andrei Alexandrescu wrote:
 And want to return to the point where char[] is but an indiscriminated
 array, which would take std.algorithm back to the stone age. -- Andrei
I think you'd have to substantiate how that would be worse than auto-decoding. Your examples only show that treating code points as characters falls apart at a higher level than treating code units as characters. But it still falls apart. Failing early is a quality.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:47 PM, ag0aep6g wrote:
 On 06/02/2016 10:30 PM, Andrei Alexandrescu wrote:
 And want to return to the point where char[] is but an indiscriminated
 array, which would take std.algorithm back to the stone age. -- Andrei
I think you'd have to substantiate how that would be worse than auto-decoding.
I gave a long list of std.algorithm uses that perform virtually randomly on char[].
 Your examples only show that treating code points as characters falls
 apart at a higher level than treating code units as characters. But it
 still falls apart. Failing early is a quality.
It does not fall apart for code points. Andrei
Jun 02 2016
parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 10:50 PM, Andrei Alexandrescu wrote:
 It does not fall apart for code points.
Yes it does. You've been given plenty of examples where it falls apart. Your answer to that was that it operates on code points, not graphemes. Well, duh. Comparing UTF-8 code units against each other works, too. That's not an argument for doing that by default.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:01 PM, ag0aep6g wrote:
 On 06/02/2016 10:50 PM, Andrei Alexandrescu wrote:
 It does not fall apart for code points.
 Yes it does. You've been given plenty of examples where it falls apart.
There weren't any.
 Your answer to that was that it operates on code points, not graphemes.
That is correct.
 Well, duh. Comparing UTF-8 code units against each other works, too.
 That's not an argument for doing that by default.
Nope, that's a radically different matter. As the examples show, the examples would be entirely meaningless at code unit level. Andrei
Jun 02 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 23:06, Andrei Alexandrescu wrote:
 As the examples show, the examples would be entirely meaningless at code
 unit level.
So far, I needed to count the number of characters 'ö' inside some string exactly zero times, but I wanted to chain or join strings relatively often.
Jun 02 2016
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 23:16, Timon Gehr wrote:
 On 02.06.2016 23:06, Andrei Alexandrescu wrote:
 As the examples show, the examples would be entirely meaningless at code
 unit level.
So far, I needed to count the number of characters 'ö' inside some string exactly zero times,
(Obviously this isn't even what the example would do. I predict I will never need to count the number of code points 'ö' by calling some function from std.algorithm directly.)
 but I wanted to chain or join strings
 relatively often.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:19 PM, Timon Gehr wrote:
 On 02.06.2016 23:16, Timon Gehr wrote:
 On 02.06.2016 23:06, Andrei Alexandrescu wrote:
 As the examples show, the examples would be entirely meaningless at code
 unit level.
So far, I needed to count the number of characters 'ö' inside some string exactly zero times,
(Obviously this isn't even what the example would do. I predict I will never need to count the number of code points 'ö' by calling some function from std.algorithm directly.)
You may look for a specific dchar, and it'll work. How about findAmong("...") with a bunch of ASCII and Unicode punctuation symbols? -- Andrei
Jun 02 2016
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 23:23, Andrei Alexandrescu wrote:
 On 6/2/16 5:19 PM, Timon Gehr wrote:
 On 02.06.2016 23:16, Timon Gehr wrote:
 On 02.06.2016 23:06, Andrei Alexandrescu wrote:
 As the examples show, the examples would be entirely meaningless at
 code
 unit level.
So far, I needed to count the number of characters 'ö' inside some string exactly zero times,
(Obviously this isn't even what the example would do. I predict I will never need to count the number of code points 'ö' by calling some function from std.algorithm directly.)
You may look for a specific dchar, and it'll work. How about findAmong("...") with a bunch of ASCII and Unicode punctuation symbols? -- Andrei
.̂ ̪.̂

(Copy-paste it somewhere else, I think it might not be rendered correctly on the forum.)

The point is that if I do:

".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."), Grapheme(",")])

no match is returned.

If I use your method with dchars, I will get spurious matches. I.e. the suggested method to look for punctuation symbols is incorrect:

writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂"

(Also, do you have a use case for this?)
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:43 PM, Timon Gehr wrote:
 .̂ ̪.̂

 (Copy-paste it somewhere else, I think it might not be rendered
 correctly on the forum.)

 The point is that if I do:

 ".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")])

 no match is returned.

 If I use your method with dchars, I will get spurious matches. I.e. the
 suggested method to look for punctuation symbols is incorrect:

 writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂"
Nice example.
 (Also, do you have an use case for this?)
Count delimited words. Did you also look at balancedParens? Andrei
Jun 02 2016
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 23:46, Andrei Alexandrescu wrote:
 On 6/2/16 5:43 PM, Timon Gehr wrote:
 .̂ ̪.̂

 (Copy-paste it somewhere else, I think it might not be rendered
 correctly on the forum.)

 The point is that if I do:

 ".̂ ̪.̂".normalize!NFC.byGrapheme.findAmong([Grapheme("."),Grapheme(",")])

 no match is returned.

 If I use your method with dchars, I will get spurious matches. I.e. the
 suggested method to look for punctuation symbols is incorrect:

 writeln(".̂ ̪.̂".findAmong(",.")); // ".̂ ̪.̂"
Nice example. ...
Thanks! :o)
 (Also, do you have an use case for this?)
Count delimited words. Did you also look at balancedParens? Andrei
On 02.06.2016 22:01, Timon Gehr wrote:
 * s.balancedParens('〈', '〉') works only with autodecoding.
 ...
Doesn't work, e.g. s="⟨⃖". Shouldn't compile.
assert("⟨⃖".normalize!NFC.byGrapheme.balancedParens(Grapheme("⟨"), Grapheme("⟩")));
writeln("⟨⃖".balancedParens('⟨', '⟩')); // false
Jun 02 2016
prev sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:
 Nope, that's a radically different matter. As the examples show, the
 examples would be entirely meaningless at code unit level.
They're simply not possible. Won't compile. There is no single UTF-8 code unit for 'ö', so you can't (easily) search for it in a range of code units. Just like there is no single code point for 'a⃗', so you can't search for it in a range of code points. You can still search for 'a', and 'o', and the rest of ASCII in a range of code units.
Jun 02 2016
next sibling parent ag0aep6g <anonymous example.com> writes:
On 06/02/2016 11:24 PM, ag0aep6g wrote:
 They're simply not possible. Won't compile. There is no single UTF-8
 code unit for 'ö', so you can't (easily) search for it in a range for
 code units. Just like there is no single code point for 'a⃗' so you can't
 search for it in a range of code points.

 You can still search for 'a', and 'o', and the rest of ASCII in a range
 of code units.
I'm ignoring combining characters there. You can search for 'a' in code units in the same way that you can search for 'ä' in code points. I.e., more or less, depending on how serious you are about combining characters.
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:24 PM, ag0aep6g wrote:
 On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:
 Nope, that's a radically different matter. As the examples show, the
 examples would be entirely meaningless at code unit level.
They're simply not possible. Won't compile.
They do compile.
 There is no single UTF-8
 code unit for 'ö', so you can't (easily) search for it in a range for
 code units.
Of course you can. Can you search for an int in a short[]? Oh yes you can. Can you search for a dchar in a char[]? Of course you can. Autodecoding also gives it meaning.
 Just like there is no single code point for 'a⃗' so you can't
 search for it in a range of code points.
Of course you can.
 You can still search for 'a', and 'o', and the rest of ASCII in a range
 of code units.
You can search for a dchar in a char[] because you can compare an individual dchar with either another dchar (correct, autodecoding) or with a char (incorrect, no autodecoding). As I said: this thread produces an unpleasant amount of arguments in favor of autodecoding. Even I don't like that :o). Andrei
Jun 02 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:27 PM, Andrei Alexandrescu wrote:
 On 6/2/16 5:24 PM, ag0aep6g wrote:
 Just like there is no single code point for 'a⃗' so you can't
 search for it in a range of code points.
Of course you can.
Correction, indeed you can't. -- Andrei
Jun 02 2016
prev sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 11:27 PM, Andrei Alexandrescu wrote:
 On 6/2/16 5:24 PM, ag0aep6g wrote:
 On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:
 Nope, that's a radically different matter. As the examples show, the
 examples would be entirely meaningless at code unit level.
They're simply not possible. Won't compile.
They do compile.
Yes, you're right, of course they do. char implicitly converts to dchar. I didn't think of that anti-feature.
 As I said: this thread produces an unpleasant amount of arguments in
 favor of autodecoding. Even I don't like that :o).
It's more of an argument against char : dchar, I'd say.
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:35 PM, ag0aep6g wrote:
 On 06/02/2016 11:27 PM, Andrei Alexandrescu wrote:
 On 6/2/16 5:24 PM, ag0aep6g wrote:
 On 06/02/2016 11:06 PM, Andrei Alexandrescu wrote:
 Nope, that's a radically different matter. As the examples show, the
 examples would be entirely meaningless at code unit level.
They're simply not possible. Won't compile.
They do compile.
Yes, you're right, of course they do. char implicitly converts to dchar. I didn't think of that anti-feature.
 As I said: this thread produces an unpleasant amount of arguments in
 favor of autodecoding. Even I don't like that :o).
It's more of an argument against char : dchar, I'd say.
I do think that's an interesting option in PL design space, but that would be super disruptive. -- Andrei
Jun 02 2016
prev sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 20:13:14 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 03:34 PM, tsbockman wrote:
 Your 'ö' examples will NOT work reliably with auto-decoded 
 code points,
 and for nearly the same reason that they won't work with code 
 units; you
 would have to use byGrapheme.
They do work per spec: find this code point. It would be surprising if 'ö' were found but the string were positioned at a different code point.
Your examples will pass or fail depending on how (and whether) the 'ö' grapheme is normalized. They only ever succeed because 'ö' happens to be one of the privileged graphemes that *can* be (but often isn't!) represented as a single code point. Many other graphemes have no such representation.

Working directly with code points is sometimes useful anyway - but then, working with code units can be, also. Neither will lead to inherently "correct" Unicode processing, and in the absence of a compelling context, your examples fall completely flat as an argument for the inherent superiority of processing at the code point level.
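A minimal sketch of that normalization dependence (same string, same algorithm, different normalization form; "öl" is written with an explicit precomposed ö):

import std.algorithm.searching : canFind;
import std.uni : normalize, NFC, NFD;

void main()
{
    string s = "\u00F6l";                  // "öl" with a precomposed ö
    assert(s.normalize!NFC.canFind('ö'));  // NFC keeps the single code point: found
    assert(!s.normalize!NFD.canFind('ö')); // NFD splits it into 'o' + U+0308: not found
}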
 The fact that you still don't get that, even after a dozen 
 plus attempts
 by the community to explain the difference, makes you unfit to 
 direct
 Phobos' Unicode support.
Well there's gotta be a reason why my basic comprehension is under constant scrutiny whereas yours is safe.
Who said mine is safe? I *know* that I'm not qualified to be in charge of this. Your comprehension is under greater scrutiny because you are proposing to overrule nearly all other active contributors combined.
 Please, either go study Unicode until you
 really understand it, or delegate this issue to someone else.
Would be happy to. To whom would I delegate?
If you're serious, I would suggest Dmitry Olshansky. He seems to be our top Unicode expert, based on his contributions to `std.uni` and `std.regex`. But, if he is unwilling/unsuitable for some reason there are other candidates participating in this thread (not me).
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:36 PM, tsbockman wrote:
 Your examples will pass or fail depending on how (and whether) the 'ö'
 grapheme is normalized.
And that's fine. Want graphemes, .byGrapheme wags its tail in that corner. Otherwise, you work on code points which is a completely meaningful way to go about things. What's not meaningful is the random results you get from operating on code units.
 They only ever succeeds because 'ö' happens to
 be one of the privileged graphemes that *can* be (but often isn't!)
 represented as a single code point. Many other graphemes have no such
 representation.
Then there's no dchar for them so no problem to start with. s.find(c) ----> "Find code unit c in string s" Andrei
Jun 02 2016
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jun 02, 2016 at 04:38:28PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 06/02/2016 04:36 PM, tsbockman wrote:
 Your examples will pass or fail depending on how (and whether) the
 'ö' grapheme is normalized.
And that's fine. Want graphemes, .byGrapheme wags its tail in that corner. Otherwise, you work on code points which is a completely meaningful way to go about things. What's not meaningful is the random results you get from operating on code units.
 They only ever succeeds because 'ö' happens to be one of the
 privileged graphemes that *can* be (but often isn't!) represented as
 a single code point. Many other graphemes have no such
 representation.
Then there's no dchar for them so no problem to start with. s.find(c) ----> "Find code unit c in string s"
[...]

This is a ridiculous argument. We might as well say, "there's no single-byte UTF-8 that can represent Ш, so that's no problem to start with" -- since we can just define it away by saying s.find(c) == "find byte c in string s", and thereby justify using ASCII as our standard string representation.

The point is that dchar is NOT ENOUGH TO REPRESENT A SINGLE CHARACTER in the general case. It is adequate for a subset of characters -- just like ASCII is also adequate for a subset of characters. If you only need to work with ASCII, it suffices to work with ubyte[]. Similarly, if your work is restricted to only languages without combining diacritics, then a range of dchar suffices. But a range of dchar is NOT good enough in the general case, and arguing that it is only makes you look like a fool.

Appealing to normalization doesn't change anything either, since only a subset of base character + diacritic combinations will normalize to a single code point. If the string has a base character + diacritic combination that doesn't have a precomposed code point, it will NOT fit in a dchar. (And keep in mind that the notion of diacritic is still very Euro-centric. In Korean, for example, a single character is composed of multiple parts, each of which occupies 1 code point. While some precomposed combinations do exist, they don't cover all of the possibilities, so normalization won't help you there.)

T

-- 
Frank disagreement binds closer than feigned agreement.
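A short sketch of the normalization point (é has a precomposed code point; x plus a combining circumflex, as far as I know, does not):

import std.range : walkLength;
import std.uni : normalize, NFC;

void main()
{
    assert("e\u0301".normalize!NFC.walkLength == 1); // composes to the single code point é
    assert("x\u0302".normalize!NFC.walkLength == 2); // no precomposed form; stays two code points
}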
Jun 02 2016
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu 
wrote:
 Pretty much everything. Consider s and s1 string variables with 
 possibly different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It 
 returns always false without.
False. Many characters can be represented by different sequences of code points. For instance, ê can be a single code point, or an 'e' followed by a combining circumflex modifier. ö is one such character.
 * s.any!(c => c == 'ö') works only with autodecoding. It 
 returns always false without.
False. (While this is pretty much the same as 1, one can come up with as many examples as desired by tweaking the same one to produce endless variations.)
 * s.balancedParens('〈', '〉') works only with autodecoding.
Not sure, so I'll say OK.
 * s.canFind('ö') works only with autodecoding. It returns 
 always false without.
False.
 * s.commonPrefix(s1) works only if they both use the same 
 encoding; otherwise it still compiles but silently produces an 
 incorrect result.
False.
 * s.count('ö') works only with autodecoding. It returns always 
 zero without.
False.
 * s.countUntil(s1) is really odd - without autodecoding, 
 whether it works at all, and the result it returns, depends on 
 both encodings. With autodecoding it always works and returns a 
 number independent of the encodings.
False.
 * s.endsWith('ö') works only with autodecoding. It returns 
 always false without.
False.
 * s.endsWith(s1) works only with autodecoding. Otherwise it 
 compiles and runs but produces incorrect results if s and s1 
 have different encodings.
False.
 * s.find('ö') works only with autodecoding. It never finds it 
 without.
False.
 * s.findAdjacent is a very interesting one. It works with 
 autodecoding, but without it it just does odd things.
Not sure so I'll say OK, while I strongly suspect that, like for other, this will only work if string are normalized.
 * s.findAmong(s1) is also interesting. It works only with 
 autodecoding.
False.
 * s.findSkip(s1) works only if s and s1 have the same encoding. 
 Otherwise it compiles and runs but produces incorrect results.
False.
 * s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) 
 work only if s and s1 have the same encoding. Otherwise they 
 compile and run but produce incorrect results.
False.
 * s.minCount, s.maxCount are unlikely to be terribly useful but 
 with autodecoding it consistently returns the extremum numeric 
 code unit regardless of representation. Without, they just 
 return encoding-dependent and meaningless numbers.
Note sure, so I'll say ok.
 * s.minPos, s.maxPos follow a similar semantics.
Note sure, so I'll say ok.
 * s.skipOver(s1) only works with autodecoding. Otherwise it 
 compiles and runs but produces incorrect results if s and s1 
 have different encodings.
False.
 * s.startsWith('ö') works only with autodecoding. Otherwise it 
 compiles and runs but produces incorrect results if s and s1 
 have different encodings.
False.
 * s.startsWith(s1) works only with autodecoding. Otherwise it 
 compiles and runs but produces incorrect results if s and s1 
 have different encodings.
False.
 * s.until!(c => c == 'ö') works only with autodecoding. 
 Otherwise, it will span the entire range.
False.
 ===

 The intent of autodecoding was to make std.algorithm work 
 meaningfully with strings. As it's easy to see I just went 
 through std.algorithm.searching alphabetically and found issues 
 literally with every primitive in there. It's an easy exercise 
 to go forth with the others.


 Andrei
I mean, what a trainwreck. Your examples say it all, don't they? Almost none of them would work without normalizing the string first. And that is the point you've been refusing to hear so far: autodecoding doesn't pay for itself, as it is unable to do what it is supposed to do in the general case.

Really, there is not much you can do with anything Unicode related without first going through normalization. If you want anything more than substring search or the like, you'll also need a collation, which is locale dependent (for sorting, for instance). Supporting Unicode, IMO, would mean providing facilities to normalize (preferably lazily, as a range), to manage collations, and so on. Decoding to code points just doesn't cut it.

As a result, any algorithm that needs to support strings has to either fight against the language because it doesn't need decoding, use decoding and accept being incorrect without normalization, or do the correct thing by itself (which is also going to require working against the language).
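As a sketch of what "do the correct thing by itself" can look like with what Phobos already provides (sameText here is a hypothetical helper, not an existing API):

import std.uni : normalize, NFC;

// Hypothetical helper: compare two strings as text rather than as arrays of
// code units or code points, by bringing both to the same normalization form.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    assert(sameText("\u00F6", "o\u0308")); // 'ö' in both spellings
    assert(!sameText("o", "\u00F6"));
}

Collation-aware ordering, as noted above, would additionally need locale data that Phobos does not currently ship.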
Jun 02 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 03:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with
 possibly different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns
 always false without.
False.
True. "Are all code points equal to this one?" -- Andrei
Jun 02 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 22:13, Andrei Alexandrescu wrote:
 On 06/02/2016 03:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with
 possibly different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns
 always false without.
False.
True. "Are all code points equal to this one?" -- Andrei
I.e. you are saying that 'works' means 'operates on code points'.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:17 PM, Timon Gehr wrote:
 I.e. you are saying that 'works' means 'operates on code points'.
Affirmative. -- Andrei
Jun 02 2016
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jun 02, 2016 at 04:28:45PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 06/02/2016 04:17 PM, Timon Gehr wrote:
 I.e. you are saying that 'works' means 'operates on code points'.
Affirmative. -- Andrei
Again, a ridiculous position. I can use exactly the same line of argument for why we should just standardize on ASCII. All I have to do is to define "work" to mean "operates on an ASCII character", and then every ASCII algorithm "works" by definition, so nobody can argue with me. Unfortunately, everybody else's definition of "work" is different from mine, so the argument doesn't hold water.

Similarly, you are the only one whose definition of "work" means "operates on code points". Basically nobody else here uses that definition, so while you may be right according to your own made-up tautological arguments, none of your conclusions actually have any bearing in the real world of Unicode handling.

Give it up. It is beyond reasonable doubt that autodecoding is a liability. D should be moving away from autodecoding instead of clinging to historical mistakes in the face of overwhelming evidence. (And note, I said *auto*-decoding; decoding by itself obviously is very relevant. But it needs to be opt-in because of its performance and correctness implications. The user needs to be able to choose whether to decode, and how to decode.)

T

-- 
Freedom: (n.) Man's self-given right to be enslaved by his own depravity.
Jun 02 2016
prev sibling next sibling parent reply cym13 <cpicard openmailbox.org> writes:
On Thursday, 2 June 2016 at 20:13:52 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 03:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu 
 wrote:
 Pretty much everything. Consider s and s1 string variables 
 with
 possibly different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It 
 returns
 always false without.
False.
True. "Are all code points equal to this one?" -- Andrei
A:“We should decode to code points”
B:“No, decoding to code points is a stupid idea.”
A:“No it's not!”
B:“Can you show a concrete example where it does something useful?”
A:“Sure, look at that!”
B:“This isn't working at all, look at all those counter-examples!”
A:“It may not work for your examples but look how easy it is to find code points!”

*Sigh*
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:22 PM, cym13 wrote:
 A:“We should decode to code points”
 B:“No, decoding to code points is a stupid idea.”
 A:“No it's not!”
 B:“Can you show a concrete example where it does something useful?”
 A:“Sure, look at that!”
 B:“This isn't working at all, look at all those counter-examples!”
 A:“It may not work for your examples but look how easy it is to
     find code points!”
With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- Andrei
Jun 02 2016
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 22:29, Andrei Alexandrescu wrote:
 On 06/02/2016 04:22 PM, cym13 wrote:
 A:“We should decode to code points”
 B:“No, decoding to code points is a stupid idea.”
 A:“No it's not!”
 B:“Can you show a concrete example where it does something useful?”
 A:“Sure, look at that!”
 B:“This isn't working at all, look at all those counter-examples!”
 A:“It may not work for your examples but look how easy it is to
     find code points!”
With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- Andrei
No, without it, it operates correctly on code units.
Jun 02 2016
prev sibling next sibling parent reply cym13 <cpicard openmailbox.org> writes:
On Thursday, 2 June 2016 at 20:29:48 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 04:22 PM, cym13 wrote:
 A:“We should decode to code points”
 B:“No, decoding to code points is a stupid idea.”
 A:“No it's not!”
 B:“Can you show a concrete example where it does something 
 useful?”
 A:“Sure, look at that!”
 B:“This isn't working at all, look at all those 
 counter-examples!”
 A:“It may not work for your examples but look how easy it is to
     find code points!”
With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- Andrei
Allow me to try another angle:

- There are different levels of unicode support and you don't want to support them all transparently. That's understandable.

- The level you choose to support is the code point level. There are many good arguments about why this isn't a good default but you won't change your mind. I don't like that at all and I'm not alone, but let's forget the entirety of the vocal D community for a moment.

- A huge part of unicode chars can be normalized to fit your definition. That way not everything works (far from it) but a sufficiently big subset works.

- On the other hand, without normalization it just doesn't make any sense from a user perspective. The ö example has clearly shown that much; you even admitted it yourself by stating that many counter arguments would have worked had the string been normalized.

- The most prominent problem is with graphemes that can have different representations, as those that can't be normalized can't be searched as dchars either.

- If autodecoding to code points is to stay, then in an effort to find a compromise normalizing should be done by default. Sure it would take some more time but it wouldn't break any code (I think) and would actually make things more correct. They still wouldn't be correct, but I feel that something as crazy as unicode cannot be tackled generically anyway.
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:38 PM, cym13 wrote:
 Allow me to try another angle:

 - There are different levels of unicode support and you don't want to
 support them all transparently. That's understandable.
Cool.
 - The level you choose to support is the code point level. There are
 many good arguments about why this isn't a good default but you won't
 change your mind. I don't like that at all and I'm not alone but let's
 forget the entirety of the vocal D community for a moment.
You mean all 35 of them? It's not about changing my mind! The massive thing is that code-point-level handling is the incumbent, and that changing it would need to be an absolutely Earth-shattering improvement to be worth it!
 - A huge part of unicode chars can be normalized to fit your
 definition. That way not everything work (far from it) but a
 sufficiently big subset works.
Cool.
 - On the other hand without normalization it just doesn't make any
 sense from a user perspective.The ö example has clearly shown that
 much, you even admitted it yourself by stating that many counter
 arguments would have worked had the string been normalized).
Yah, operating at code point level does not come free of caveats. It is vastly superior to operating on code units, and did I mention it's the incumbent.
 - The most proeminent problem is with graphems that can have different
 representations as those that can't be normalized can't be searched as
 dchars as well.
Yah, I'd say if the program needs graphemes the option is there. Phobos by default deals with code points which are not perfect but are independent of representation, produce meaningful and consistent results with std.algorithm etc.
 - If autodecoding to code points is to stay and in an effort to find a
 compromise then normalizing should be done by default. Sure it would
 take some more time but it wouldn't break any code (I think) and would
 actually make things more correct. They still wouldn't be correct but
 I feel that something as crazy as unicode cannot be tackled
 generically anyway.
Some more work on normalization at strategic points in Phobos would be interesting! Andrei
Jun 02 2016
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jun 02, 2016 at 04:29:48PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 06/02/2016 04:22 PM, cym13 wrote:
 
 A:“We should decode to code points”
 B:“No, decoding to code points is a stupid idea.”
 A:“No it's not!”
 B:“Can you show a concrete example where it does something useful?”
 A:“Sure, look at that!”
 B:“This isn't working at all, look at all those counter-examples!”
 A:“It may not work for your examples but look how easy it is to
     find code points!”
With autodecoding all of std.algorithm operates correctly on code points. Without it all it does for strings is gibberish. -- Andrei
With ASCII strings, all of std.algorithm operates correctly on ASCII bytes. So let's standardize on ASCII strings. What a vacuous argument!

Basically you're saying "I define code points to be correct. Therefore, I conclude that decoding to code points is correct." Well, duh. Unfortunately such vacuous conclusions have no bearing in the real world of Unicode handling.

T

-- 
I am Ohm of Borg. Resistance is voltage over current.
Jun 02 2016
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 2 June 2016 at 20:13:52 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 03:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu 
 wrote:
 Pretty much everything. Consider s and s1 string variables 
 with
 possibly different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It 
 returns
 always false without.
False.
True. "Are all code points equal to this one?" -- Andrei
The good thing, when you define "works" as whatever it does right now, is that everything always works and there is literally never any bug. The bad thing is that this is a completely useless definition of "works".

The sample code won't count every instance of the grapheme 'ö', as some of its encodings won't be counted, which definitely counts as "doesn't work".

When your point needs to redefine words in ways that nobody agrees with, it is time to admit the point is bogus.
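For the record, a small D sketch of the counting problem being described (assuming std.uni.normalize behaves as documented):

import std.algorithm.searching : count;
import std.stdio : writeln;
import std.uni : normalize, NFC;

void main()
{
    // The grapheme 'ö' appears twice: once precomposed, once decomposed.
    string s = "\u00F6 o\u0308";

    writeln(s.count('\u00F6'));                // 1 -- the decomposed spelling is missed
    writeln(normalize!NFC(s).count('\u00F6')); // 2 -- only correct after normalizing first
}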
Jun 02 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:20 PM, deadalnix wrote:
 The good thing when you define works by whatever it does right now
No, it works as it was designed. -- Andrei
Jun 02 2016
parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu 
wrote:
 On 6/2/16 5:20 PM, deadalnix wrote:
 The good thing when you define works by whatever it does right 
 now
No, it works as it was designed. -- Andrei
Nobody says it doesn't. Everybody says the design is crap.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:35 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:
 On 6/2/16 5:20 PM, deadalnix wrote:
 The good thing when you define works by whatever it does right now
No, it works as it was designed. -- Andrei
Nobody says it doesn't. Everybody says the design is crap.
I think I like it more after this thread. -- Andrei
Jun 02 2016
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 2 June 2016 at 21:37:11 UTC, Andrei Alexandrescu 
wrote:
 On 6/2/16 5:35 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu 
 wrote:
 On 6/2/16 5:20 PM, deadalnix wrote:
 The good thing when you define works by whatever it does 
 right now
No, it works as it was designed. -- Andrei
Nobody says it doesn't. Everybody says the design is crap.
I think I like it more after this thread. -- Andrei
You're starting to remind me of the joke about the guy complaining that everybody else is going backward on the highway.
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:38 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 21:37:11 UTC, Andrei Alexandrescu wrote:
 On 6/2/16 5:35 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:
 On 6/2/16 5:20 PM, deadalnix wrote:
 The good thing when you define works by whatever it does right now
No, it works as it was designed. -- Andrei
Nobody says it doesn't. Everybody says the design is crap.
I think I like it more after this thread. -- Andrei
You start reminding me of the joke with that guy complaining that everybody is going backward on the highway.
Touché. (Get it?) -- Andrei
Jun 02 2016
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:37 PM, Andrei Alexandrescu wrote:
 On 6/2/16 5:35 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:
 On 6/2/16 5:20 PM, deadalnix wrote:
 The good thing when you define works by whatever it does right now
No, it works as it was designed. -- Andrei
Nobody says it doesn't. Everybody says the design is crap.
I think I like it more after this thread. -- Andrei
Meh, thinking of it again: I don't like it more, I'd still do it differently given a clean slate (viz. RCStr). But let's say I didn't get many compelling reasons to remove autodecoding from this thread. -- Andrei
Jun 02 2016
prev sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 06/02/2016 05:37 PM, Andrei Alexandrescu wrote:
 On 6/2/16 5:35 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 21:24:15 UTC, Andrei Alexandrescu wrote:
 On 6/2/16 5:20 PM, deadalnix wrote:
 The good thing when you define works by whatever it does right now
No, it works as it was designed. -- Andrei
Nobody says it doesn't. Everybody says the design is crap.
I think I like it more after this thread. -- Andrei
Well there's a fantastic argument.
Jun 03 2016
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 23:20, deadalnix wrote:
 The sample code won't count the instance of the grapheme 'ö' as some of
 its encoding won't be counted, which definitively count as doesn't work.
It also has false positives (you can combine 'ö' with some combining character in order to get some strange character that is not an 'ö', and not even NFC helps with that).
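A quick D sketch of such a false positive (the combining low line is just one arbitrary choice of combining character):

import std.algorithm.searching : canFind;
import std.stdio : writeln;

void main()
{
    // Precomposed 'ö' followed by a combining low line: a single grapheme
    // that a reader would not consider to be an 'ö'.
    string s = "\u00F6\u0332";

    writeln(s.canFind('\u00F6')); // true at the code point level -- a false positive
}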
Jun 02 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 12:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns always false
 without.
False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.
There are 3 levels of Unicode support. What Andrei is talking about is Level 1.

http://unicode.org/reports/tr18/tr18-5.1.html

I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.
Jun 02 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:27 PM, Walter Bright wrote:
 On 6/2/2016 12:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns
 always false
 without.
False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.
There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.html
Apparently I'm not the only idiot. -- Andrei
Jun 02 2016
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:
 On 6/2/2016 12:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu 
 wrote:
 Pretty much everything. Consider s and s1 string variables 
 with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It 
 returns always false
 without.
False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.
There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.html I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.
To be able to convert back and forth from/to unicode in a lossless manner.
Jun 02 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 2:25 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:
 I wonder what rationale there is for Unicode to have two different sequences
 of codepoints be treated as the same. It's madness.
To be able to convert back and forth from/to unicode in a lossless manner.
Sorry, that makes no sense, as it is saying "they're the same, only different."
Jun 02 2016
prev sibling next sibling parent reply John Colvin <john.loughran.colvin gmail.com> writes:
On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:
 On 6/2/2016 12:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu 
 wrote:
 Pretty much everything. Consider s and s1 string variables 
 with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It 
 returns always false
 without.
False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.
There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.html I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.
There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character? It's an interesting idea, but it's not how things are.
Jun 02 2016
next sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, June 02, 2016 22:27:16 John Colvin via Digitalmars-d wrote:
 On Thursday, 2 June 2016 at 20:27:27 UTC, Walter Bright wrote:
 I wonder what rationale there is for Unicode to have two
 different sequences of codepoints be treated as the same. It's
 madness.
There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character? It's an interesting idea, but it's not how things are.
Yeah. I'm inclined to think that the fact that there are multiple normalizations was a huge mistake on the part of the Unicode folks, but we're stuck dealing with it. And as horrible as it is for most cases, maybe it _does_ ultimately make sense because of certain use cases; I don't know. But bad idea or not, we're stuck. :( - Jonathan M Davis
Jun 02 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 3:27 PM, John Colvin wrote:
 I wonder what rationale there is for Unicode to have two different sequences
 of codepoints be treated as the same. It's madness.
There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character?
I didn't say ordering, I said there should be no such thing as "normalization" in Unicode, where two codepoints are considered to be identical to some other codepoint.
Jun 02 2016
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jun 02, 2016 at 05:19:48PM -0700, Walter Bright via Digitalmars-d wrote:
 On 6/2/2016 3:27 PM, John Colvin wrote:
 I wonder what rationale there is for Unicode to have two different
 sequences of codepoints be treated as the same. It's madness.
There are languages that make heavy use of diacritics, often several on a single "character". Hebrew is a good example. Should there be only one valid ordering of any given set of diacritics on any given character?
I didn't say ordering, I said there should be no such thing as "normalization" in Unicode, where two codepoints are considered to be identical to some other codepoint.
I think it was a combination of historical baggage and trying to accommodate unusual but still valid use cases.

The historical baggage was that Unicode was trying to unify all of the various already-existing codepages out there, and many of those codepages already come with various precomposed characters. To maximize compatibility with existing codepages, Unicode tried to preserve as much of the original mappings as possible within each 256-point block, so these precomposed characters became part of the standard.

However, there weren't enough of them -- some people demanded less common character + diacritic combinations, and some languages had writing so complex their characters had to be composed from more basic parts. The original Unicode range was 16-bit, so there wasn't enough room to fit all of the precomposed characters people demanded, plus there were other things people wanted, like multiple diacritics (e.g., in IPA). So the concept of combining diacritics was invented, in part to prevent combinatorial explosion from soaking up the available code point space, in part to allow for novel combinations of diacritics that somebody out there somewhere might want to represent. However, this meant that some precomposed characters were "redundant": they represented character + diacritic combinations that could equally well be expressed separately. Normalization was the inevitable consequence. (Normalization, of course, also subsumes a few other things, such as collation, but this is one of the factors behind it.)

(This is a greatly over-simplified description, of course. At the time Unicode also had to grapple with tricky issues like what to do with lookalike characters that served different purposes or had different meanings, e.g., the mu sign in the math block vs. the real letter mu in the Greek block, or the Cyrillic A which looks and behaves exactly like the Latin A, yet the Cyrillic Р, which looks like the Latin P, does *not* mean the same thing (it's the equivalent of R), or the Cyrillic В whose lowercase is в not b, and also had a different sound, but lowercase Latin b looks very similar to Cyrillic ь, which serves a completely different purpose (the uppercase is Ь, not B, you see). Then you have the wonderful Indic and Arabic cursive writings, where letterforms mutate depending on the surrounding context, which, if you were to include all variants as distinct code points, would occupy many more pages than they currently do. And also sticky issues like the oft-mentioned Turkish i, which is encoded as a Latin i but behaves differently w.r.t. upper/lowercasing when in Turkish locale -- some cases of this, IIRC, are unfixable bugs in Phobos because we currently do not handle locales. So you see, imagining that code points == the solution to Unicode string handling is a joke. Writing correct Unicode handling is *hard*.)

As with all sufficiently complex software projects, Unicode represents a compromise between many contradictory factors -- writing systems in the world being the complex, not-very-consistent beasts they are -- so such "dirty" details are somewhat inevitable.

T

-- 
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
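The Turkish-i point is easy to demonstrate; std.uni.toUpper takes no locale argument, so this minimal sketch shows the locale-blind result:

import std.stdio : writeln;
import std.uni : toUpper;

void main()
{
    // Locale-independent mapping: always 'i' -> 'I'. For Turkish text the
    // expected result would be "İSTANBUL" (with a dotted capital I).
    writeln("istanbul".toUpper); // ISTANBUL
}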
Jun 03 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 However, this
 meant that some precomposed characters were "redundant": they
 represented character + diacritic combinations that could equally well
 be expressed separately. Normalization was the inevitable consequence.
It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead. There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple. I.e. have the normalization up front when the text is created rather than everywhere else.
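The precedent mentioned here can be checked directly; the sketch below assumes std.utf.validate rejects overlong sequences, which are invalid UTF-8 by definition:

import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // 0xC0 0xAF is an overlong (hence invalid) encoding of '/'.
    char[] overlong = [cast(char) 0xC0, cast(char) 0xAF];
    assertThrown!UTFException(validate(overlong));
}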
Jun 03 2016
parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
 On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 However, this
 meant that some precomposed characters were "redundant": they
 represented character + diacritic combinations that could 
 equally well
 be expressed separately. Normalization was the inevitable 
 consequence.
It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead. There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple. I.e. have the normalization up front when the text is created rather than everywhere else.
I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposed characters, because that means that previously valid sequences are now invalid.
Jun 03 2016
next sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via Digitalmars-d wrote:
 On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
 On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 However, this
 meant that some precomposed characters were "redundant": they
 represented character + diacritic combinations that could
 equally well
 be expressed separately. Normalization was the inevitable
 consequence.
It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead. There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple. I.e. have the normalization up front when the text is created rather than everywhere else.
I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposited characters, because that means that previously valid sequences are now invalid.
I would have argued that no precomposed characters should ever have existed, regardless of what was done in previous encodings, since they're redundant, and you need the non-precomposed characters to avoid a combinatorial explosion of characters, so you can't have characters that exist only in a precomposed version and be consistent. However, the Unicode folks obviously didn't go that route. But given where we sit now, even though we're stuck with some precomposed characters, I'd argue that we should at least never add any new ones. But who knows what the Unicode folks are actually going to do.

As it is, you probably should normalize strings in many cases where they enter the program, just like ideally, you'd validate them when they enter the program. But regardless, you have to deal with the fact that multiple normalization schemes exist and that there's no guarantee that you're even going to get valid Unicode, let alone Unicode that's normalized the way you want.

- Jonathan M Davis
Jun 03 2016
parent reply Chris <wendlec tcd.ie> writes:
On Friday, 3 June 2016 at 11:46:50 UTC, Jonathan M Davis wrote:
 On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via 
 Digitalmars-d wrote:
 On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
 On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 However, this
 meant that some precomposed characters were "redundant": 
 they
 represented character + diacritic combinations that could
 equally well
 be expressed separately. Normalization was the inevitable
 consequence.
It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead. There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple. I.e. have the normalization up front when the text is created rather than everywhere else.
I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposited characters, because that means that previously valid sequences are now invalid.
I would have argued that no composited characters should have ever existed regardless of what was done in previous encodings, since they're redundant, and you need the non-composited characters to avoid a combinatorial explosion of characters, so you can't have characters that just have a composited version and be consistent. However, the Unicode folks obviously didn't go that route. But given where we sit now, even though we're stuck with some composited characters, I'd argue that we should at least never add any new ones. But who knows what the Unicode folks are actually going to do. As it is, you probably should normalize strings in many cases where they enter the program, just like ideally, you'd validate them when they enter the program. But regardless, you have to deal with the fact that multiple normalization schemes exist and that there's no guarantee that you're even going to get valid Unicode, let alone Unicode that's normalized the way you want. - Jonathan M Davis
I do exactly this. Validate and normalize.
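A sketch of that boundary step with today's Phobos (sanitizeInput is a hypothetical name; validate throws std.utf.UTFException on bad input):

import std.uni : normalize, NFC;
import std.utf : validate;

// Hypothetical boundary helper: reject invalid UTF-8, then bring the text
// into a single normalization form before the rest of the program sees it.
auto sanitizeInput(string raw)
{
    validate(raw);            // throws UTFException if raw is not valid UTF-8
    return normalize!NFC(raw);
}

unittest
{
    assert(sanitizeInput("o\u0308") == "\u00F6");
}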
Jun 03 2016
parent deadalnix <deadalnix gmail.com> writes:
On Friday, 3 June 2016 at 12:04:39 UTC, Chris wrote:
 I do exactly this. Validate and normalize.
And once you've done this, auto decoding is useless because the same character has the same representation anyway.
Jun 05 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 3:10 AM, Vladimir Panteleev wrote:
 I don't think it would work (or at least, the analogy doesn't hold). It would
 mean that you can't add new precomposited characters, because that means that
 previously valid sequences are now invalid.
So don't add new precomposed characters when a recognized existing sequence exists.
Jun 03 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 At the time
 Unicode also had to grapple with tricky issues like what to do with
 lookalike characters that served different purposes or had different
 meanings, e.g., the mu sign in the math block vs. the real letter mu in
 the Greek block, or the Cyrillic A which looks and behaves exactly like
 the Latin A, yet the Cyrillic Р, which looks like the Latin P, does
 *not* mean the same thing (it's the equivalent of R), or the Cyrillic В
 whose lowercase is в not b, and also had a different sound, but
 lowercase Latin b looks very similar to Cyrillic ь, which serves a
 completely different purpose (the uppercase is Ь, not B, you see).
I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs. They should have put me in charge of Unicode. I'd have put a stop to much of the madness :-)
Jun 03 2016
next sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Friday, 3 June 2016 at 10:08:43 UTC, Walter Bright wrote:
 On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 At the time
 Unicode also had to grapple with tricky issues like what to do 
 with
 lookalike characters that served different purposes or had 
 different
 meanings, e.g., the mu sign in the math block vs. the real 
 letter mu in
 the Greek block, or the Cyrillic A which looks and behaves 
 exactly like
 the Latin A, yet the Cyrillic Р, which looks like the Latin P, 
 does
 *not* mean the same thing (it's the equivalent of R), or the 
 Cyrillic В
 whose lowercase is в not b, and also had a different sound, but
 lowercase Latin b looks very similar to Cyrillic ь, which 
 serves a
 completely different purpose (the uppercase is Ь, not B, you 
 see).
I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs.
That's not right either. Cyrillic letters can look slightly different from their Latin lookalikes in some circumstances. I'm sure there are extremely good reasons for not using the Latin lookalikes in the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use separate codes for the lookalikes. It's not restricted to Unicode.
Jun 03 2016
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jun 03, 2016 at 10:14:15AM +0000, Vladimir Panteleev via Digitalmars-d
wrote:
 On Friday, 3 June 2016 at 10:08:43 UTC, Walter Bright wrote:
 On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 At the time Unicode also had to grapple with tricky issues like
 what to do with lookalike characters that served different
 purposes or had different meanings, e.g., the mu sign in the math
 block vs. the real letter mu in the Greek block, or the Cyrillic A
 which looks and behaves exactly like the Latin A, yet the Cyrillic
 Р, which looks like the Latin P, does *not* mean the same thing
 (it's the equivalent of R), or the Cyrillic В whose lowercase is в
 not b, and also had a different sound, but lowercase Latin b looks
 very similar to Cyrillic ь, which serves a completely different
 purpose (the uppercase is Ь, not B, you see).
I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs.
That's not right either. Cyrillic letters can look slightly different from their latin lookalikes in some circumstances. I'm sure there are extremely good reasons for not using the latin lookalikes in the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use separate codes for the lookalikes. It's not restricted to Unicode.
Yeah, lowercase Cyrillic П is п, which looks like lowercase Greek π in some fonts, but in cursive form it looks more like Latin lowercase n. It wouldn't make sense to encode Cyrillic п the same as Greek π or Latin lowercase n just by appearance, since logically it stands as its own character despite its various appearances. But it wouldn't make sense to encode it differently just because you're using a different font!

Similarly, lowercase Cyrillic т in some cursive fonts looks like lowercase Latin m. I don't think it would make sense to encode lowercase т as Latin m just because of that.

Eventually you have no choice but to encode by logical meaning rather than by appearance, since there are many lookalikes between different languages that actually mean something completely different, and often behave completely differently.

T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
Jun 03 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:
 Eventually you have no choice but to encode by logical meaning rather
 than by appearance, since there are many lookalikes between different
 languages that actually mean something completely different, and often
 behaves completely differently.
It's almost as if printed documents and books have never existed!
Jun 03 2016
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jun 03, 2016 at 11:43:07AM -0700, Walter Bright via Digitalmars-d wrote:
 On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:
 Eventually you have no choice but to encode by logical meaning
 rather than by appearance, since there are many lookalikes between
 different languages that actually mean something completely
 different, and often behaves completely differently.
It's almost as if printed documents and books have never existed!
But if we were to encode appearance instead of logical meaning, that would mean the *same* lowercase Cyrillic ь would have multiple, different encodings depending on which font was in use. That doesn't seem like the right solution either. Do we really want Unicode strings to encode font information too?? 'Cos by that argument, serif and sans serif letters should have different encodings, because in languages like Hebrew, a tiny little serif could mean the difference between two completely different letters.

And what of the Arabic and Indic scripts? They would need to encode the same letter multiple times, each being a variation of the physical form that changes depending on the surrounding context. Even the Greek sigma has two forms depending on whether it's at the end of a word or not -- so should it be two code points or one? If you say two, then you'd have a problem with how to search for sigma in Greek text, and you'd have to search for either medial sigma or final sigma. But if you say one, then you'd have a problem with having two different letterforms for a single codepoint.

Besides, that still doesn't solve the problem of what "i".uppercase() should return. In most languages, it should return "I", but in Turkish it should not. And if we really went the route of encoding Cyrillic letters the same as their Latin lookalikes, we'd have a problem with what "m".uppercase() should return, because now it depends on which font is in effect (if it's a Cyrillic cursive font, the correct answer is "Т", if it's a Latin font, the correct answer is "M" -- the other combinations: who knows). That sounds far worse than what we have today.

T

-- 
Let's eat some disquits while we format the biskettes.
Jun 03 2016
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 1:53 PM, H. S. Teoh via Digitalmars-d wrote:
 But if we were to encode appearance instead of logical meaning, that
 would mean the *same* lowercase Cyrillic ь would have multiple,
 different encodings depending on which font was in use.
I don't see that consequence at all.
 That doesn't
 seem like the right solution either.  Do we really want Unicode strings
 to encode font information too??
No.
  'Cos by that argument, serif and sans
 serif letters should have different encodings, because in languages like
 Hebrew, a tiny little serif could mean the difference between two
 completely different letters.
If they are different letters, then they should have a different code point. I don't see why this is such a hard concept.
 And what of the Arabic and Indic scripts? They would need to encode the
 same letter multiple times, each being a variation of the physical form
 that changes depending on the surrounding context. Even the Greek sigma
 has two forms depending on whether it's at the end of a word or not --
 so should it be two code points or one?
Two. Again, why is this hard to grasp? If there is meaning in having two different visual representations, then they are two codepoints. If the visual representation is the same, then it is one codepoint. If the difference is only due to font selection, then it is the same codepoint.
 Besides, that still doesn't solve the problem of what "i".uppercase()
 should return. In most languages, it should return "I", but in Turkish
 it should not.
 And if we really went the route of encoding Cyrillic
 letters the same as their Latin lookalikes, we'd have a problem with
 what "m".uppercase() should return, because now it depends on which font
 is in effect (if it's a Cyrillic cursive font, the correct answer is
 "Т", if it's a Latin font, the correct answer is "M" -- the other
 combinations: who knows).  That sounds far worse than what we have
 today.
The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.
Jun 03 2016
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jun 03, 2016 at 03:35:18PM -0700, Walter Bright via Digitalmars-d wrote:
 On 6/3/2016 1:53 PM, H. S. Teoh via Digitalmars-d wrote:
[...]
 'Cos by that argument, serif and sans serif letters should have
 different encodings, because in languages like Hebrew, a tiny little
 serif could mean the difference between two completely different
 letters.
If they are different letters, then they should have a different code point. I don't see why this is such a hard concept.
It's not a hard concept, except that these different letters have lookalike forms with completely unrelated letters. Again:

- Lowercase Latin m looks visually the same as lowercase Cyrillic Т in cursive form. In some font renderings the two are IDENTICAL glyphs, in spite of being completely different, unrelated letters. However, in non-cursive form, Cyrillic lowercase т is visually distinct.

- Similarly, lowercase Cyrillic П in cursive font looks like lowercase Latin n, and in some fonts they are identical glyphs. Again, completely unrelated letters, yet they have the SAME VISUAL REPRESENTATION. However, in non-cursive font, lowercase Cyrillic П is п, which is visually distinct from Latin n.

- These aren't the only ones, either. Other Cyrillic false friends include cursive Д, which in some fonts looks like lowercase Latin g. But in non-cursive font, it's д.

Just given the above, it should be clear that going by visual representation is NOT enough to disambiguate between these different letters. By your argument, since lowercase Cyrillic Т is, visually, just m, it should be encoded the same way as lowercase Latin m. But this is untenable, because the letterform changes with a different font. So you end up with the unworkable idea of a font-dependent encoding. Similarly, since lowercase Cyrillic П is n (in cursive font), we should encode it the same way as Latin lowercase n. But again, the letterform changes based on font.

Your criteria of "same visual representation" does not work outside of English. What you imagine to be a simple, straightforward concept is far from being simple once you're dealing with the diverse languages and writing systems of the world.

Or, to use an example closer to home, uppercase Latin O and the digit 0 are visually identical. Should they be encoded as a single code point or two? Worse, in some fonts, the digit 0 is rendered like Ø (to differentiate it from uppercase O). Does that mean that it should be encoded the same way as the Danish letter Ø? Obviously not, but according to your "visual representation" idea, the answer should be yes.

The bottomline is that uppercase O and the digit 0 represent different LOGICAL entities, in spite of their sharing the same visual representation. Eventually you have to resort to representing *logical* entities ("characters") rather than visual appearance, which is a property of the font, and has no place in a digital text encoding.
 Besides, that still doesn't solve the problem of what
 "i".uppercase() should return. In most languages, it should return
 "I", but in Turkish it should not.
 And if we really went the route of encoding Cyrillic letters the
 same as their Latin lookalikes, we'd have a problem with what
 "m".uppercase() should return, because now it depends on which font
 is in effect (if it's a Cyrillic cursive font, the correct answer is
 "Т", if it's a Latin font, the correct answer is "M" -- the other
 combinations: who knows).  That sounds far worse than what we have
 today.
The notion of 'case' should not be part of Unicode, as that is semantic information that is beyond the scope of Unicode.
But what should "i".toUpper return? Or are you saying the standard library should not include such a basic function as a case-changing function?

T

-- 
Customer support: the art of getting your clients to pay for your own incompetence.
Jun 03 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:
 It's not a hard concept, except that these different letters have
 lookalike forms with completely unrelated letters. Again:

 - Lowercase Latin m looks visually the same as lowercase Cyrillic Т in
   cursive form. In some font renderings the two are IDENTICAL glyphs, in
   spite of being completely different, unrelated letters.  However, in
   non-cursive form, Cyrillic lowercase т is visually distinct.

 - Similarly, lowercase Cyrillic П in cursive font looks like lowercase
   Latin n, and in some fonts they are identical glyphs. Again,
   completely unrelated letters, yet they have the SAME VISUAL
   REPRESENTATION.  However, in non-cursive font, lowercase Cyrillic П is
   п, which is visually distinct from Latin n.

 - These aren't the only ones, either.  Other Cyrillic false friends
   include cursive Д, which in some fonts looks like lowercase Latin g.
   But in non-cursive font, it's д.

 Just given the above, it should be clear that going by visual
 representation is NOT enough to disambiguate between these different
 letters.
It works for books.

Unicode invented a problem, and came up with a thoroughly wretched "solution" that we'll be stuck with for generations. One of those bad solutions is having the reader not know what a glyph actually is without pulling back the cover to read the codepoint. It's madness.
 By your argument, since lowercase Cyrillic Т is, visually,
 just m, it should be encoded the same way as lowercase Latin m. But this
 is untenable, because the letterform changes with a different font. So
 you end up with the unworkable idea of a font-dependent encoding.
Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.
 Or, to use an example closer to home, uppercase Latin O and the digit 0
 are visually identical. Should they be encoded as a single code point or
 two?  Worse, in some fonts, the digit 0 is rendered like Ø (to
 differentiate it from uppercase O). Does that mean that it should be
 encoded the same way as the Danish letter Ø?  Obviously not, but
 according to your "visual representation" idea, the answer should be
 yes.
Don't confuse fonts with code points. It'd be adequate if Unicode defined a canonical glyph for each code point, and let the font makers do what they wish.
 The notion of 'case' should not be part of Unicode, as that is
 semantic information that is beyond the scope of Unicode.
But what should "i".toUpper return?
Not relevant to my point that Unicode shouldn't decide what "upper case" for all languages means, any more than Unicode should specify a font. Now when you argue that Unicode should make such decisions, note what a spectacularly hopeless job of it they've done.
Jun 03 2016
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d wrote:
 On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:
 It's not a hard concept, except that these different letters have
 lookalike forms with completely unrelated letters. Again:
 
 - Lowercase Latin m looks visually the same as lowercase Cyrillic Т
 in cursive form. In some font renderings the two are IDENTICAL
 glyphs, in spite of being completely different, unrelated letters.
 However, in non-cursive form, Cyrillic lowercase т is visually
 distinct.
 
 - Similarly, lowercase Cyrillic П in cursive font looks like
 lowercase Latin n, and in some fonts they are identical glyphs.
 Again, completely unrelated letters, yet they have the SAME VISUAL
 REPRESENTATION.  However, in non-cursive font, lowercase Cyrillic П
 is п, which is visually distinct from Latin n.
 
 - These aren't the only ones, either.  Other Cyrillic false friends
 include cursive Д, which in some fonts looks like lowercase Latin g.
 But in non-cursive font, it's д.
 
 Just given the above, it should be clear that going by visual
 representation is NOT enough to disambiguate between these different
 letters.
It works for books.
Because books don't allow their readers to change the font.
 Unicode invented a problem, and came up with a thoroughly wretched
 "solution" that we'll be stuck with for generations. One of those bad
 solutions is have the reader not know what a glyph actually is without
 pulling back the cover to read the codepoint. It's madness.
This madness already exists *without* Unicode. If you have a page with a single glyph 'm' printed on it and show it to an English speaker, he will say it's lowercase M. Show it to a Russian speaker, and he will say it's lowercase Т. So which letter is it, M or Т?

The fundamental problem is that writing systems for different languages interpret the same letter forms differently. In English, lowercase g has at least two different forms that we recognize as the same letter. However, to a Cyrillic reader the two forms are distinct, because one of them looks like a Cyrillic letter but the other one looks foreign. So should g be encoded as a single point or two different points? In a similar vein, to a Cyrillic reader the glyphs т and m represent the same letter, but to an English reader they are clearly two different things.

If you're going to represent both languages, you cannot get away from needing to represent letters abstractly, rather than visually.
 By your argument, since lowercase Cyrillic Т is, visually, just m,
 it should be encoded the same way as lowercase Latin m. But this is
 untenable, because the letterform changes with a different font. So
 you end up with the unworkable idea of a font-dependent encoding.
Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.
It's not a bad font. It's standard practice to print Cyrillic cursive letters with different glyphs. Russian readers can read both without any problem. The same letter is represented by different glyphs, and therefore the abstract letter is a more fundamental unit of meaning than the glyph itself.
 Or, to use an example closer to home, uppercase Latin O and the
 digit 0 are visually identical. Should they be encoded as a single
 code point or two?  Worse, in some fonts, the digit 0 is rendered
 like Ø (to differentiate it from uppercase O). Does that mean that
 it should be encoded the same way as the Danish letter Ø?  Obviously
 not, but according to your "visual representation" idea, the answer
 should be yes.
Don't confuse fonts with code points. It'd be adequate if Unicode defined a canonical glyph for each code point, and let the font makers do what they wish.
So should O and 0 share the same glyph or not? They're visually the same thing, even though some fonts render them differently. What should be the canonical shape of O vs. 0? If they are the same shape, then by your argument they must be the same code point, regardless of what font makers do to disambiguate them. Good luck writing a parser that can't tell the difference between an identifier that begins with O and a number literal that begins with 0.

The very fact that we distinguish between O and 0, independently of what Unicode did/does, is already proof enough that going by visual representation is inadequate.
 The notion of 'case' should not be part of Unicode, as that is
 semantic information that is beyond the scope of Unicode.
But what should "i".toUpper return?
Not relevant to my point that Unicode shouldn't decide what "upper case" for all languages means, any more than Unicode should specify a font. Now when you argue that Unicode should make such decisions, note what a spectacularly hopeless job of it they've done.
In other words, toUpper and toLower do not belong in the standard library. Great.
Jun 03 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 11:17 PM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via Digitalmars-d
wrote:
 It works for books.
Because books don't allow their readers to change the font.
Unicode is not the font.
 This madness already exists *without* Unicode. If you have a page with a
 single glyph 'm' printed on it and show it to an English speaker, he
 will say it's lowercase M. Show it to a Russian speaker, and he will say
 it's lowercase Т.  So which letter is it, M or Т?
It's not a problem that Unicode can solve. As you said, the meaning is in the context. Unicode has no context, and tries to solve something it cannot. ('m' doesn't always mean m in English, either. It depends on the context.) Ya know, if Unicode actually solved these problems, you'd have a case. But it doesn't, and so you don't :-)
 If you're going to represent both languages, you cannot get away from
 needing to represent letters abstractly, rather than visually.
Books do visually just fine!
 So should O and 0 share the same glyph or not? They're visually the same
 thing,
No, they're not. Not even on old typewriters where every key was expensive. Even without the slash, the O tends to be fatter than the 0.
 The very fact that we distinguish between O and 0, independently of what
 Unicode did/does, is already proof enough that going by visual
 representation is inadequate.
Except that you right now are using a font where they are different enough that you have no trouble at all distinguishing them without bothering to look it up. And so am I.
 In other words toUpper and toLower does not belong in the standard
 library. Great.
Unicode and the standard library are two different things.
Jun 04 2016
parent docandrew <x x.com> writes:
On Saturday, 4 June 2016 at 08:12:47 UTC, Walter Bright wrote:
 On 6/3/2016 11:17 PM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Jun 03, 2016 at 08:03:16PM -0700, Walter Bright via 
 Digitalmars-d wrote:
 It works for books.
Because books don't allow their readers to change the font.
Unicode is not the font.
 This madness already exists *without* Unicode. If you have a 
 page with a
 single glyph 'm' printed on it and show it to an English 
 speaker, he
 will say it's lowercase M. Show it to a Russian speaker, and 
 he will say
 it's lowercase Т.  So which letter is it, M or Т?
It's not a problem that Unicode can solve. As you said, the meaning is in the context. Unicode has no context, and tries to solve something it cannot. ('m' doesn't always mean m in english, either. It depends on the context.) Ya know, if Unicode actually solved these problems, you'd have a case. But it doesn't, and so you don't :-)
 If you're going to represent both languages, you cannot get 
 away from
 needing to represent letters abstractly, rather than visually.
Books do visually just fine!
 So should O and 0 share the same glyph or not? They're 
 visually the same
 thing,
No, they're not. Not even on old typewriters where every key was expensive. Even without the slash, the O tends to be fatter than the 0.
 The very fact that we distinguish between O and 0, 
 independently of what
 Unicode did/does, is already proof enough that going by visual
 representation is inadequate.
Except that you right now are using a font where they are different enough that you have no trouble at all distinguishing them without bothering to look it up. And so am I.
 In other words toUpper and toLower does not belong in the 
 standard
 library. Great.
Unicode and the standard library are two different things.
Even if characters in different languages share a glyph or look identical though, it makes sense to duplicate them with different code points/units/whatever. Simple functions like isCyrillicLetter() can then do a simple less-than / greater-than comparison instead of having a lookup table to check different numeric representations scattered throughout the Unicode table. Functions like toUpper and toLower become easier to write as well (for SOME languages anyhow): it's simply myletter +/- numlettersinalphabet. Redundancy here is very helpful. Maybe instead of Unicode they should have called it Babel... :) "The Lord said, “If as one people speaking the same language they have begun to do this, then nothing they plan to do will be impossible for them. Come, let us go down and confuse their language so they will not understand each other.”" -Jon
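To make that concrete, here is a minimal D sketch of the idea (the helper names are made up for illustration, not Phobos functions). It assumes only the basic Russian block U+0410..U+044F, where each lowercase letter sits exactly 0x20 above its uppercase form; Ё/ё (U+0401/U+0451) and the other Cyrillic extensions fall outside this range.

    // Range check instead of a lookup table (basic Russian block only).
    bool isBasicCyrillicLetter(dchar c)
    {
        return c >= '\u0410' && c <= '\u044F'; // А .. я
    }

    // Case conversion by a fixed offset within the block.
    dchar toUpperBasicCyrillic(dchar c)
    {
        return (c >= '\u0430' && c <= '\u044F') ? cast(dchar)(c - 0x20) : c;
    }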
Jun 05 2016
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Saturday, 4 June 2016 at 03:03:16 UTC, Walter Bright wrote:
 Oh rubbish. Let go of the idea that choosing bad fonts should 
 drive Unicode codepoint decisions.
Interestingly enough, I've mentioned earlier here that only people from the US would believe that documents with mixed languages aren't commonplace. I wasn't expecting to be proven right that fast.
Jun 05 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/5/2016 1:07 AM, deadalnix wrote:
 On Saturday, 4 June 2016 at 03:03:16 UTC, Walter Bright wrote:
 Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode
 codepoint decisions.
Interestingly enough, I've mentioned earlier here that only people from the US would believe that documents with mixed languages aren't commonplace. I wasn't expecting to be proven right that fast.
You'd be in error. I've been casually working on my grandfather's thesis trying to make a web version of it, and it is mixed German, French, and English. I've also made a digital version of an old history book that is mixed English, old English, German, French, Greek, old Greek, and Egyptian hieroglyphs (available on Amazons in your neighborhood!). I've also lived in Germany for 3 years, though that was before computers took over the world.
Jun 05 2016
prev sibling next sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 3 June 2016 at 20:53:32 UTC, H. S. Teoh wrote:

 Even the Greek sigma has two forms depending on whether it's at 
 the end of a word or not -- so should it be two code points or 
 one? If you say two, then you'd have a problem with how to 
 search for sigma in Greek text, and you'd have to search for 
 either medial sigma or final sigma. But if you say one, then 
 you'd have a problem with having two different letterforms for 
 a single codepoint.
In Unicode there are 2 different codepoints for lower case sigma, ς U+03C2 and σ U+03C3, but only one uppercase sigma, Σ U+03A3. Codepoint U+03A2 is undefined. So your objection is not hypothetical, it is actually an issue for uppercase() and lowercase() functions. Another difficulty, besides the dotless and dotted i of Turkic, is the double letters used in Latin transcription of Cyrillic text in eastern and southern Europe: dž, lj, nj and dz, which have an uppercase form (DŽ, LJ, NJ, DZ) and a titlecase form (Dž, Lj, Nj, Dz).
 Besides, that still doesn't solve the problem of what 
 "i".uppercase() should return. In most languages, it should 
 return "I", but in Turkish it should not.  And if we really 
 went the route of encoding Cyrillic letters the same as their 
 Latin lookalikes, we'd have a problem with what "m".uppercase() 
 should return, because now it depends on which font is in 
 effect (if it's a Cyrillic cursive font, the correct answer is 
 "Т", if it's a Latin font, the correct answer is "M" -- the 
 other combinations: who knows).  That sounds far worse than 
 what we have today.
As an anecdote I can tell the story of the accession to the European Union of Romania and Bulgaria in 2007. The issue was that 3 letters used by Romanian and Bulgarian had been forgotten by the Unicode consortium (Ș U+0218, ș U+0219, Ț U+021A, ț U+021B and 2 Cyrillic letters that I do not remember). The Romanians used as a replacement Ş, ş, Ţ and ţ (the cedilla forms U+015E, U+015F, U+0162 and U+0163), which look a little bit alike. When the Commission finally managed to force Microsoft to correct the fonts to include them, we could start to correct the data. The transition was finished in 2012 and was only possible because no other language we deal with uses the "wrong" codepoints (Turkish does, but fortunately we only have a handful of them in our db's). So 5 years of ad hoc processing for the substitution of 4 codepoints. BTW: using combining diacritics was out of the question at the time, simply because Microsoft Word didn't support them and many documents we encountered still only used codepages (one also has to remember that in a big institution like the EC, the IT is always several years behind the open market, which means that when a product is at release X, the Institution still might use release X-5 years).
Jun 04 2016
prev sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
One also has to take into consideration that Unicode is the way 
it is because it was not invented in an empty space. It had to 
take into account what already existed and find compromises allowing 
its adoption. Even if they had invented the perfect encoding, NO 
ONE WOULD HAVE USED IT, as it would have fubar'd everything existing.
As it was invented, it allowed a (relatively smooth) transition. 
Here are some points that made it even possible for Unicode to be 
adopted at all:
- 16 bits: while that choice was a bit shortsighted, 16 bits is a 
good compromise between compactness and richness (the BMP suffices to 
express nearly all living languages).
- Using more or less the same arrangement of codepoints as in the 
different codepages. This made it possible to transform legacy 
documents with simple scripts (as a matter of fact, I wrote a script 
to repair misencoded Greek documents; it consisted mainly of 
unich = ch > 0x80 ? ch + 0x2D0 : ch; see the rough sketch at the 
end of this post).
- UTF-8: this was the stroke of genius, the encoding that allowed 
mixing it all without requiring awful acrobatics (Joakim is 
completely out to lunch on that one; shifting encodings without 
self-synchronisation are hellish, which is why the Chinese and 
Japanese adopted Unicode without hesitation: they had enough 
experience with their legacy encodings).
- Allowing time for the transition.

So all the points that people here criticize were in fact the 
reason why Unicode could even become the standard it is now.
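The Greek repair mentioned above, rendered as a small D sketch (an illustration of the described mapping, not the original script): bytes above 0x80 from an ISO 8859-7 style Greek codepage are shifted by 0x2D0 into the Unicode Greek block (for example 0xC1, capital Alpha, becomes U+0391), while plain ASCII passes through unchanged.

    dstring repairGreek(const(ubyte)[] legacy)
    {
        dstring result;
        foreach (ch; legacy)
            result ~= cast(dchar)(ch > 0x80 ? ch + 0x2D0 : ch); // same mapping as above
        return result;
    }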
Jun 04 2016
prev sibling next sibling parent reply ketmar <ketmar ketmar.no-ip.org> writes:
On Friday, 3 June 2016 at 18:43:07 UTC, Walter Bright wrote:
 It's almost as if printed documents and books have never 
 existed!
some old xUSSR books which had some English text sometimes used a Cyrillic font to represent the English. it was awful, and barely readable. this was done to ease the work of compositors, and the result was unacceptable. do you feel a recognizable pattern here? ;-)
Jun 03 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 5:42 PM, ketmar wrote:
 sometimes used Cyrillic font to represent English.
Nobody here suggested using the wrong font, it's completely irrelevant.
Jun 03 2016
parent ketmar <ketmar ketmar.no-ip.org> writes:
On Saturday, 4 June 2016 at 02:46:31 UTC, Walter Bright wrote:
 On 6/3/2016 5:42 PM, ketmar wrote:
 sometimes used Cyrillic font to represent English.
Nobody here suggested using the wrong font, it's completely irrelevant.
you suggested that unicode designers should make similar-looking glyphs share the same code, and it reminds me of this little story. maybe i misunderstood you, though.
Jun 03 2016
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Friday, 3 June 2016 at 18:43:07 UTC, Walter Bright wrote:
 On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:
 Eventually you have no choice but to encode by logical meaning 
 rather
 than by appearance, since there are many lookalikes between 
 different
 languages that actually mean something completely different, 
 and often
 behaves completely differently.
It's almost as if printed documents and books have never existed!
TIL: books are read by computers.
Jun 05 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/5/2016 1:05 AM, deadalnix wrote:
 TIL: books are read by computers.
I should introduce you to a fabulous technology called OCR. :-)
Jun 05 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 3:14 AM, Vladimir Panteleev wrote:
 That's not right either. Cyrillic letters can look slightly different from
their
 latin lookalikes in some circumstances.

 I'm sure there are extremely good reasons for not using the latin lookalikes in
 the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use
 separate codes for the lookalikes. It's not restricted to Unicode.
How did people ever get by with printed books and documents?
Jun 03 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 03.06.2016 20:41, Walter Bright wrote:
 On 6/3/2016 3:14 AM, Vladimir Panteleev wrote:
 That's not right either. Cyrillic letters can look slightly different
 from their
 latin lookalikes in some circumstances.

 I'm sure there are extremely good reasons for not using the latin
 lookalikes in
 the Cyrillic alphabets, because most (all?) 8-bit Cyrillic encodings use
 separate codes for the lookalikes. It's not restricted to Unicode.
How did people ever get by with printed books and documents?
They can disambiguate the letters based on context well enough.
Jun 03 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 11:54 AM, Timon Gehr wrote:
 On 03.06.2016 20:41, Walter Bright wrote:
 How did people ever get by with printed books and documents?
They can disambiguate the letters based on context well enough.
Characters do not have semantic meaning. Their meaning is always inferred from the context. Unicode's troubles started the moment they stepped beyond their charter.
Jun 03 2016
prev sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Friday, 3 June 2016 at 18:41:36 UTC, Walter Bright wrote:
 How did people ever get by with printed books and documents?
Printed books pick one font and one layout, then are read by people. They don't have to be represented in some format where end users can change the font and size etc.
Jun 03 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, June 03, 2016 03:08:43 Walter Bright via Digitalmars-d wrote:
 On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
 At the time
 Unicode also had to grapple with tricky issues like what to do with
 lookalike characters that served different purposes or had different
 meanings, e.g., the mu sign in the math block vs. the real letter mu in
 the Greek block, or the Cyrillic A which looks and behaves exactly like
 the Latin A, yet the Cyrillic Р, which looks like the Latin P, does
 *not* mean the same thing (it's the equivalent of R), or the Cyrillic В
 whose lowercase is в not b, and also had a different sound, but
 lowercase Latin b looks very similar to Cyrillic ь, which serves a
 completely different purpose (the uppercase is Ь, not B, you see).
I don't see that this is tricky at all. Adding additional semantic meaning that does not exist in printed form was outside of the charter of Unicode. Hence there is no justification for having two distinct characters with identical glyphs. They should have put me in charge of Unicode. I'd have put a stop to much of the madness :-)
Actually, I would argue that the moment that Unicode is concerned with what the character actually looks like rather than what character it logically is that it's gone outside of its charter. The way that characters actually look is far too dependent on fonts, and aside from display code, code does not care one whit what the character looks like. For instance, take the capital letter I, the lowercase letter l, and the number one. In some fonts that are feeling cruel towards folks who actually want to read them, two of those characters - or even all three of them - look identical. But I think that you'll agree that those characters should be represented as distinct characters in Unicode regardless of what they happen to look like in a particular font. Now, take a cyrllic letter that looks similar to a latin letter. If they're logically equivalent such that no code would ever want to distinguish between the two and such that no font would ever even consider representing them differently, then they're truly the same letter, and they should only have one Unicode representation. But if anyone would ever consider them to be logically distinct, then it makes no sense for them to be considered to be the same character by Unicode, because they don't have the same identity. And that distinction is quite clear if any font would ever consider representing the two characters differently, no matter how slight that difference might be. Really, what a character looks like has nothing to do with Unicode. The exact same Unicode is used regardless of how the text is displayed. Rather, what Unicode is doing is providing logical identifiers for characters so that code can operate on them, and display code can then do whatever it does to display those characters, whether they happen to look similar or not. I would think that the fact that non-display code does not care one whit about what a character looks like and that display code can have drastically different visual representations for the same character would make it clear that Unicode is concerned with having identifiers for logical characters and that that is distinct from any visual representation. - Jonathan M Davis
Jun 03 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/3/2016 2:10 PM, Jonathan M Davis via Digitalmars-d wrote:
 Actually, I would argue that the moment that Unicode is concerned with what
 the character actually looks like rather than what character it logically is
 that it's gone outside of its charter. The way that characters actually look
 is far too dependent on fonts, and aside from display code, code does not
 care one whit what the character looks like.
What I meant was pretty clear. Font is an artistic style that does not change context nor semantic meaning. If a font choice changes the meaning then it is not a font.
Jun 03 2016
next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Friday, 3 June 2016 at 22:38:38 UTC, Walter Bright wrote:
 If a font choice changes the meaning then it is not a font.
Nah, then it is an Awesome Font that is totally Web Scale! i wish i was making that up http://fontawesome.io/ i hate that thing But, it is kinda legal: gotta love the Unicode private use area!
Jun 03 2016
prev sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Friday, June 03, 2016 15:38:38 Walter Bright via Digitalmars-d wrote:
 On 6/3/2016 2:10 PM, Jonathan M Davis via Digitalmars-d wrote:
 Actually, I would argue that the moment that Unicode is concerned with
 what
 the character actually looks like rather than what character it logically
 is that it's gone outside of its charter. The way that characters
 actually look is far too dependent on fonts, and aside from display code,
 code does not care one whit what the character looks like.
What I meant was pretty clear. Font is an artistic style that does not change context nor semantic meaning. If a font choice changes the meaning then it is not a font.
Well, maybe I misunderstood what was being argued, but it seemed like you've been arguing that two characters should be considered the same just because they look similar, whereas H. S. Teoh is arguing that two characters can be logically distinct while still looking similar and that they should be treated as distinct in Unicode because they're logically distinct. And if that's what's being argued, then I agree with H. S. Teoh. I expect - at least ideally - for Unicode to contain identifiers for characters that are distinct from whatever their visual representation might be. Stuff like fonts then worries about how to display them, and hopefully don't do stupid stuff like make a capital I look like a lowercase l (though they often do, unfortunately). But if two characters in different scripts - be they latin and cyrillic or whatever - happen to often look the same but would be considered two different characters by humans, then I would expect Unicode to consider them to be different, whereas if no one would reasonably consider them to be anything but exactly the same character, then there should only be one character in Unicode. However, if we really have crazy stuff where subtly different visual representations of the letter g are considered to be one character in English and two in Russian, then maybe those should be three different characters in Unicode so that the English text can clearly be operating on g, whereas the Russian text is doing whatever it does with its two characters that happen to look like g. I don't know. That sort of thing just gets ugly. But I definitely think that Unicode characters should be made up of what the logical characters are and leave the visual representation up to the fonts and the like. Now, how to deal with uppercase vs lowercase and all of that sort of stuff is a completely separate issue IMHO, and that comes down to how the characters are somehow logically associated with one another, and it's going to be very locale-specific such that it's not really part of the core of Unicode's charter IMHO (though I'm not sure that it's bad if there's a set of locale rules that go along with Unicode for those looking to correctly apply such rules - they just have nothing to do with code points and graphemes and how they're represented in code). - Jonathan M Davis
Jun 05 2016
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 02-Jun-2016 23:27, Walter Bright wrote:
 On 6/2/2016 12:34 PM, deadalnix wrote:
 On Thursday, 2 June 2016 at 19:05:44 UTC, Andrei Alexandrescu wrote:
 Pretty much everything. Consider s and s1 string variables with possibly
 different encodings (UTF8/UTF16).

 * s.all!(c => c == 'ö') works only with autodecoding. It returns
 always false
 without.
False. Many characters can be represented by different sequences of codepoints. For instance, ê can be ê as one codepoint or ^ as a modifier followed by e. ö is one such character.
There are 3 levels of Unicode support. What Andrei is talking about is Level 1. http://unicode.org/reports/tr18/tr18-5.1.html I wonder what rationale there is for Unicode to have two different sequences of codepoints be treated as the same. It's madness.
Yeah, Unicode was not meant to be easy, it seems. Or this is what happens with evolutionary design that started with "everything is a 16-bit character". -- Dmitry Olshansky
Jun 03 2016
parent Alix Pexton <alix.pexton gmail.com> writes:
On 03/06/2016 20:12, Dmitry Olshansky wrote:
 On 02-Jun-2016 23:27, Walter Bright wrote:
 I wonder what rationale there is for Unicode to have two different
 sequences of codepoints be treated as the same. It's madness.
Yeah, Unicode was not meant to be easy it seems. Or this is whatever happens with evolutionary design that started with "everything is a 16-bit character".
Typing as someone who has spent some time creating typefaces, having two representations makes sense, and it didn't start with Unicode, it started with movable type. It is much easier for a font designer to create the two codepoint versions of characters for most instances, i.e. make the base letters and the diacritics once. Then what I often do is make single codepoint versions of the ones I'm likely to use, but only if they need more tweaking than the kerning options of the font format allow. I'll omit the history lesson on how this was similar in the case of movable type. Keyboards for different languages mean that a character that is a single keystroke in one case is two together or in sequence in another. This means that Unicode not only represents completed strings, but also those that are mid-composition. The ordering that it uses to ensure that graphemes have a single canonical representation is based on the order that those multi-key characters are entered. I wouldn't call it elegant, but it's not inelegant either. Trying to represent all sufficiently similar glyphs with the same codepoint would lead to a layout problem. How would you order them so that strings of any language can be sorted by their local sorting rules, without having to special case algorithms? Also consider ligatures, such as those for "ff", "fi", "ffi", "fl", "ffl" and many, many more. Typographers create these glyphs whenever available kerning tools do a poor job of combining them from the individual glyphs. From the point of view of meaning they should still be represented as individual codepoints, but for display (electronic or print) that sequence needs to be replaced with the single codepoint for the ligature. I think that in order to understand the decisions of the Unicode committee, one has to consider that they are trying to unify the concerns of representing written information from two sides. One side prioritises storage and manipulation, while the other considers aesthetics and design workflow more important. My experience of using Unicode from both sides gives me a different appreciation for the difficulties of reconciling the two. A... P.S. Then they started adding emojis, and I lost all faith in humanity ;)
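A small D sketch of the two-representation point made above, using std.uni.normalize (assuming current Phobos, where NFC and NFD are the composed and decomposed canonical forms): the precomposed glyph and the base-letter-plus-diacritic sequence are different code point sequences, yet each normalization form maps one onto the other.

    import std.uni : normalize, NFC, NFD;

    void main()
    {
        string composed   = "\u00E9";  // é as a single precomposed code point
        string decomposed = "e\u0301"; // base letter + COMBINING ACUTE ACCENT
        assert(composed != decomposed);                // raw sequences differ
        assert(normalize!NFC(decomposed) == composed); // compose
        assert(normalize!NFD(composed) == decomposed); // decompose
    }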
Jun 04 2016
prev sibling next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 21:05, Andrei Alexandrescu wrote:
 On 06/02/2016 01:54 PM, Marc Schütz wrote:
 On Thursday, 2 June 2016 at 14:28:44 UTC, Andrei Alexandrescu wrote:
 That's not going to work. A false impression created in this thread
 has been that code points are useless
They _are_ useless for almost anything you can do with strings. The only places where they should be used are std.uni and std.regex. Again: What is the justification for using code points, in your opinion? Which practical tasks are made possible (and work _correctly_) if you decode to code points, that don't already work with code units?
Pretty much everything. Consider s and s1 string variables with possibly different encodings (UTF8/UTF16). * s.all!(c => c == 'ö') works only with autodecoding. It returns always false without. ...
Doesn't work. Shouldn't compile. (char and wchar shouldn't be comparable.) assert("ö".all!(c => c == 'ö')); // fails
 * s.any!(c => c == 'ö') works only with autodecoding. It returns always
 false without.
 ...
Doesn't work. Shouldn't compile. assert("ö".any!(c => c == 'ö")); // fails assert(!"̃ö⃖".any!(c => c== 'ö')); // fails
 * s.balancedParens('〈', '〉') works only with autodecoding.
 ...
Doesn't work, e.g. s="⟨⃖". Shouldn't compile.
 * s.canFind('ö') works only with autodecoding. It returns always false
 without.
 ...
Doesn't work. Shouldn't compile. assert("ö".canFind!(c => c == 'ö")); // fails
 * s.commonPrefix(s1) works only if they both use the same encoding;
 otherwise it still compiles but silently produces an incorrect result.
 ...
Doesn't work. Shouldn't compile.
 * s.count('ö') works only with autodecoding. It returns always zero
 without.
 ....
Doesn't work. Shouldn't compile.
 * s.countUntil(s1) is really odd - without autodecoding, whether it
 works at all, and the result it returns, depends on both encodings.  With
 autodecoding it always works and returns a number independent of the
 encodings.
 ...
Doesn't work. Shouldn't compile.
 * s.endsWith('ö') works only with autodecoding. It returns always false
 without.
 ...
Doesn't work. Shouldn't compile.
 * s.endsWith(s1) works only with autodecoding.
Doesn't work.
 Otherwise it compiles and
 runs but produces incorrect results if s and s1 have different encodings.
...
Shouldn't compile.
 * s.find('ö') works only with autodecoding. It never finds it without.
 ...
Doesn't work. Shouldn't compile.
 * s.findAdjacent is a very interesting one. It works with autodecoding,
 but without it it just does odd things.
 ....
Doesn't work. Shouldn't compile.
 * s.findAmong(s1) is also interesting. It works only with autodecoding.
 ...
Doesn't work. Shouldn't compile.
 * s.findSkip(s1) works only if s and s1 have the same encoding.
 Otherwise it compiles and runs but produces incorrect results.
 ...
Doesn't work. Shouldn't compile.
 * s.findSplit(s1), s.findSplitAfter(s1), s.findSplitBefore(s1) work only
 if s and s1 have the same encoding.
Doesn't work.
 Otherwise they compile and run but produce incorrect results.
 ...
Shouldn't compile.
 * s.minCount, s.maxCount are unlikely to be terribly useful but with
 autodecoding it consistently returns the extremum numeric code unit
 regardless of representation. Without, they just return
 encoding-dependent and meaningless numbers.

 * s.minPos, s.maxPos follow a similar semantics.
 ...
Hardly a point in favour of autodecoding.
 * s.skipOver(s1) only works with autodecoding.
Doesn't work. Shouldn't compile.
 Otherwise it compiles and
 runs but produces incorrect results if s and s1 have different encodings.
 ...
Shouldn't compile.
 * s.startsWith('ö') works only with autodecoding. Otherwise it compiles
 and runs but produces incorrect results if s and s1 have different
 encodings.
 ...
Doesn't work. Shouldn't compile.
 * s.startsWith(s1) works only with autodecoding. Otherwise it compiles
 and runs but produces incorrect results if s and s1 have different
 encodings.
 ...
Doesn't work. Shouldn't compile.
 * s.until!(c => c == 'ö') works only with autodecoding. Otherwise, it
 will span the entire range.
 ...
Doesn't work. Shouldn't compile.
 ===

 The intent of autodecoding was to make std.algorithm work meaningfully
 with strings. As it's easy to see I just went through
 std.algorithm.searching alphabetically and found issues literally with
 every primitive in there. It's an easy exercise to go forth with the
 others.
 ...
Basically all of those still don't work with UTF-32 (assuming your goal is to operate on characters). You need to normalize and possibly iterate on graphemes. Also, many of those functions actually have valid uses intentionally operating on code units. The "shouldn't compile" remarks ideally would be handled at the language level: char/wchar/dchar should be incompatible types and char[], wchar[] and dchar[] should be handled like all arrays.
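To illustrate that point (a sketch relying on current Phobos behaviour, std.algorithm.searching.canFind and std.uni.normalize): even with auto-decoding to code points, a search for 'ö' misses the decomposed spelling, and normalizing first is what actually matches user expectations.

    import std.algorithm.searching : canFind;
    import std.uni : normalize;

    void main()
    {
        string s = "o\u0308";              // "ö" written as 'o' + COMBINING DIAERESIS
        assert(!s.canFind('ö'));           // code-point comparison does not find it
        assert(normalize(s).canFind('ö')); // found after NFC normalization
    }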
Jun 02 2016
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 2 June 2016 at 20:01:54 UTC, Timon Gehr wrote:
 Doesn't work. Shouldn't compile. (char and wchar shouldn't be 
 comparable.)
In Andrei's original post, he says that s is a string variable. He doesn't say it's a char. I find the weirder thing to be that t below is false, per deadalnix's point.

    import std.algorithm : all;
    import std.stdio : writeln;

    void main() {
        string s = "ö";
        auto t = s.all!(c => c == 'ö');
        writeln(t); //prints false
    }

I could imagine getting frustrated that something like the code below throws errors.

    import std.algorithm : all;
    import std.stdio : writeln;

    void main() {
        import std.uni : byGrapheme;
        string s = "ö";
        auto s2 = s.byGrapheme;
        auto t2 = s2.all!(c => c == 'ö');
        writeln(t2);
    }
Jun 02 2016
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:01 PM, Timon Gehr wrote:
 Doesn't work. Shouldn't compile. (char and wchar shouldn't be comparable.)
That would be another language design option, which we don't have the luxury to explore. -- Andrei
Jun 02 2016
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:01 PM, Timon Gehr wrote:
 assert("ö".all!(c => c == 'ö')); // fails
As expected. Different code units for different folks. That's a different matter than walking blindly through code units. -- Andrei
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:01 PM, Timon Gehr wrote:
 Basically all of those still don't work with UTF-32 (assuming your goal
 is to operate on characters).
The goal is to operate on code units. -- Andrei
Jun 02 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:26 PM, Andrei Alexandrescu wrote:
 On 06/02/2016 04:01 PM, Timon Gehr wrote:
 Basically all of those still don't work with UTF-32 (assuming your goal
 is to operate on characters).
The goal is to operate on code units. -- Andrei
s/units/points/
Jun 02 2016
prev sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 10:26 PM, Andrei Alexandrescu wrote:
 The goal is to operate on code units. -- Andrei
You sure you got the right word there? The code unit is the smallest building block. A code point is encoded with one or more code units. Also, if you mean code points, that's where people disagree. Operating on code points by default is seen as not particularly useful.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:33 PM, ag0aep6g wrote:
 Operating on code points by default is seen as not particularly useful.
By whom? The "support level 1" folks yonder at the Unicode standard? :o) -- Andrei
Jun 02 2016
next sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 20:36:12 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 04:33 PM, ag0aep6g wrote:
 Operating on code points by default is seen as not 
 particularly useful.
By whom? The "support level 1" folks yonder at the Unicode standard? :o) -- Andrei
From the standard:
 Level 1 support works well in many circumstances. However, it 
 does not handle more complex languages or extensions to the 
 Unicode Standard very well. Particularly important cases are 
 surrogates, canonical equivalence, word boundaries, grapheme 
 boundaries, and loose matches. (For more information about 
 boundary conditions, see The Unicode Standard, Section 5-15.)

 Level 2 support matches much more what user expectations are 
 for sequences of Unicode characters. It is still locale 
 independent and easily implementable. However, the 
 implementation may be slower when supporting Level 2, and some 
 expressions may require Level 1 matches. Thus it is usually 
 required to have some sort of syntax that will turn Level 2 
 support on and off.
That doesn't sound like much of an endorsement for defaulting to only level 1 support to me - "it does not handle more complex languages or extensions to the Unicode Standard very well".
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:47 PM, tsbockman wrote:
 That doesn't sound like much of an endorsement for defaulting to only
 level 1 support to me - "it does not handle more complex languages or
 extensions to the Unicode Standard very well".
Code point/Level 1 support sounds like a sweet spot between efficiency/complexity and conviviality. Level 2 is opt-in with byGrapheme. -- Andrei
Jun 02 2016
parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 20:49:52 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 04:47 PM, tsbockman wrote:
 That doesn't sound like much of an endorsement for defaulting 
 to only
 level 1 support to me - "it does not handle more complex 
 languages or
 extensions to the Unicode Standard very well".
Code point/Level 1 support sounds like a sweet spot between efficiency/complexity and conviviality. Level 2 is opt-in with byGrapheme. -- Andrei
Actually, according to the document Walter Bright linked level 1 does NOT operate at the code point level:
 Level 1: Basic Unicode Support. At this level, the regular 
 expression engine provides support for Unicode characters as 
 basic 16-bit logical units. (This is independent of the actual 
 serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, or 
 UTF-32.)
 ...
 Level 1 support works well in many circumstances. However, it 
 does not handle more complex languages or extensions to the 
 Unicode Standard very well. Particularly important cases are 
 **surrogates** ...
So, level 1 appears to be UTF-16 code units, not code points. To do code points it would have to recognize surrogates, which are specifically mentioned as not supported. Level 2 skips straight to graphemes, and there is no code point level. However, this document is very old - from Unicode 3.0 and the year 2000:
 While there are no surrogate characters in Unicode 3.0 (outside 
 of private use characters), future versions of Unicode will 
 contain them...
Perhaps level 1 has since been redefined?
Jun 02 2016
parent tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 21:00:17 UTC, tsbockman wrote:
 However, this document is very old - from Unicode 3.0 and the 
 year 2000:

 While there are no surrogate characters in Unicode 3.0 
 (outside of private use characters), future versions of 
 Unicode will contain them...
Perhaps level 1 has since been redefined?
I found the latest (unofficial) draft version: http://www.unicode.org/reports/tr18/tr18-18.html Relevant changes: * Level 1 is to be redefined as working on code points, not code units:
 A fundamental requirement is that Unicode text be interpreted 
 semantically by code point, not code units.
* Level 2 (graphemes) is explicitly described as a "default level":
 This is still a default level—independent of country or 
 language—but provides much better support for end-user 
 expectations than the raw level 1...
* All mention of level 2 being slow has been removed. The only reason given for making it toggle-able is for compatibility with level 1 algorithms:
 Level 2 support matches much more what user expectations are 
 for sequences of Unicode characters. It is still 
 locale-independent and easily implementable. However, for 
 compatibility with Level 1, it is useful to have some sort of 
 syntax that will turn Level 2 support on and off.
Jun 02 2016
prev sibling parent reply ag0aep6g <anonymous example.com> writes:
On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:
 By whom? The "support level 1" folks yonder at the Unicode standard? :o)
 -- Andrei
Do they say that level 1 should be the default, and do they give a rationale for that? Would you kindly link or quote that?
Jun 02 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:52 PM, ag0aep6g wrote:
 On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:
 By whom? The "support level 1" folks yonder at the Unicode standard? :o)
 -- Andrei
Do they say that level 1 should be the default, and do they give a rationale for that? Would you kindly link or quote that?
No, but that sounds agreeable to me, especially since it breaks no code of ours. We really should document this better. Kudos to Walter for finding all that Level 1 support. Andrei
Jun 02 2016
prev sibling parent reply default0 <Kevin.Labschek gmx.de> writes:
On Thursday, 2 June 2016 at 20:52:29 UTC, ag0aep6g wrote:
 On 06/02/2016 10:36 PM, Andrei Alexandrescu wrote:
 By whom? The "support level 1" folks yonder at the Unicode 
 standard? :o)
 -- Andrei
Do they say that level 1 should be the default, and do they give a rationale for that? Would you kindly link or quote that?
The level 2 support description noted that it should be opt-in because it's slow. Arguably it should be easier to operate on code units if you know it's safe to do so, but either always working on code units or always working on graphemes as the default seems to be either too broken too often or too slow too often. Now one can argue either consistency for code units (because then we can treat char[] and friends as a slice) or correctness for graphemes, but really the more I think about it the more I think there is no good default and you need to learn unicode anyways. The only sad parts here are that 1) we hijacked an array type for strings, which sucks, and 2) that we don't have an API that is actually good at teaching the user what it does and doesn't do. The consequence of 1 is that generic code that also wants to deal with strings will want to special-case to get rid of auto-decoding, the consequence of 2 is that we will have tons of not-actually-correct string handling. I would assume that almost all string handling code that is out in the wild is broken anyways (in code I have encountered I have never seen attempts to normalize or do other things before or after comparisons, searching, etc), unless of course, YOU or one of your colleagues wrote it (consider that checking the length of characters is often done and wrong, because .Length is the number of UTF-16 code units in those languages) :o) So really, as bad and alarming as "incorrect string handling" by default seems, in practice it has not prevented people from writing working (internationalized!) applications in other languages that get used way more than D. One could say we should do it better than them, but I would be inclined to believe that RCStr provides our opportunity to do so. Having char[] be what it is is an annoying wart, and maybe at some point we can deprecate/remove that behaviour, but for now I'd rather see if RCStr is viable than attempt to change the semantics of all string handling code in D.
Jun 02 2016
parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 21:07:19 UTC, default0 wrote:
 The level 2 support description noted that it should be opt-in 
 because its slow.
1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default. 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have). 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then.
Jun 02 2016
parent reply default0 <Kevin.Labschek gmx.de> writes:
On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:
 On Thursday, 2 June 2016 at 21:07:19 UTC, default0 wrote:
 The level 2 support description noted that it should be opt-in 
 because its slow.
1) It does not say that level 2 should be opt-in; it says that level 2 should be toggle-able. Nowhere does it say which of level 1 and 2 should be the default. 2) It says that working with graphemes is slower than UTF-16 code UNITS (level 1), but says nothing about streaming decoding of code POINTS (what we have). 3) That document is from 2000, and its claims about performance are surely extremely out-dated, anyway. Computers and the Unicode standard have both changed much since then.
1) Right, because a special toggleable syntax is definitely not "opt-in". 2) Several people in this thread noted that working on graphemes is way slower (which makes sense, because it's yet another processing step you need to do after you decode - therefore more work - therefore slower) than working on code points. 3) Not an argument - doing more work makes code slower. The only thing that changes is what specific operations have what cost (for instance, memory access has a much higher cost now than it had then). Considering the way the process works and judging from what others in this thread have said about it, I will stick with "always decoding to graphemes for all operations is very slow" and indulge in being too lazy to write benchmarks for it to show just how bad it is.
Jun 02 2016
parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 21:38:02 UTC, default0 wrote:
 On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:
 1) It does not say that level 2 should be opt-in; it says that 
 level 2 should be toggle-able. Nowhere does it say which of 
 level 1 and 2 should be the default.

 2) It says that working with graphemes is slower than UTF-16 
 code UNITS (level 1), but says nothing about streaming 
 decoding of code POINTS (what we have).

 3) That document is from 2000, and its claims about 
 performance are surely extremely out-dated, anyway. Computers 
 and the Unicode standard have both changed much since then.
1) Right because a special toggleable syntax is definitely not "opt-in".
It is not "opt-in" unless it is toggled off by default. The only reason it doesn't talk about toggling in the level 1 section, is because that section is written with the assumption that many programs will *only* support level 1.
 2) Several people in this thread noted that working on 
 graphemes is way slower (which makes sense, because its yet 
 another processing you need to do after you decoded - therefore 
 more work - therefore slower) than working on code points.
And working on code points is way slower than working on code units (the actual level 1).
 3) Not an argument - doing more work makes code slower.
What do you think I'm arguing for? It's not graphemes-by-default. What I actually want to see: permanently deprecate the auto-decoding range primitives. Force the user to explicitly specify whichever of `by!dchar`, `byCodePoint`, or `byGrapheme` their specific algorithm actually needs. Removing the implicit conversions between `char`, `wchar`, and `dchar` would also be nice, but isn't really necessary I think. That would be a standards-compliant solution (one of several possible). What we have now is non-standard, at least going by the old version Walter linked.
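For reference, a minimal sketch of what that explicit choice looks like with today's Phobos names (byCodeUnit and byDchar from std.utf, byGrapheme from std.uni); the `by!dchar` spelling above is part of the proposal, not an existing function.

    import std.uni : byGrapheme;
    import std.utf : byCodeUnit, byDchar;

    void main()
    {
        string s = "ö and more";
        auto units  = s.byCodeUnit; // range of char: UTF-8 code units
        auto points = s.byDchar;    // range of dchar: code points
        auto graphs = s.byGrapheme; // range of Grapheme: user-perceived characters
    }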
Jun 02 2016
parent reply default0 <Kevin.Labschek gmx.de> writes:
On Thursday, 2 June 2016 at 21:51:51 UTC, tsbockman wrote:
 On Thursday, 2 June 2016 at 21:38:02 UTC, default0 wrote:
 On Thursday, 2 June 2016 at 21:30:51 UTC, tsbockman wrote:
 1) It does not say that level 2 should be opt-in; it says 
 that level 2 should be toggle-able. Nowhere does it say which 
 of level 1 and 2 should be the default.

 2) It says that working with graphemes is slower than UTF-16 
 code UNITS (level 1), but says nothing about streaming 
 decoding of code POINTS (what we have).

 3) That document is from 2000, and its claims about 
 performance are surely extremely out-dated, anyway. Computers 
 and the Unicode standard have both changed much since then.
1) Right because a special toggleable syntax is definitely not "opt-in".
It is not "opt-in" unless it is toggled off by default. The only reason it doesn't talk about toggling in the level 1 section, is because that section is written with the assumption that many programs will *only* support level 1.
*sigh* reading comprehension. Needing to write .byGrapheme or similar to enable the behaviour qualifies for what that description was arguing for. I hope you understand that now that I am repeating this for you.
 2) Several people in this thread noted that working on 
 graphemes is way slower (which makes sense, because its yet 
 another processing you need to do after you decoded - 
 therefore more work - therefore slower) than working on code 
 points.
And working on code points is way slower than working on code units (the actual level 1).
Never claimed the opposite. Do note however that it's specifically talking about UTF-16 code units.
 3) Not an argument - doing more work makes code slower.
What do you think I'm arguing for? It's not graphemes-by-default.
Unrelated. I was refuting the point you made about the relevance of the performance claims of the unicode level 2 support description, not evaluating your hypothetical design. Please do not take what I say out of context, thank you.
Jun 02 2016
parent tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 22:03:01 UTC, default0 wrote:
 *sigh* reading comprehension.
 ...
 Please do not take what I say out of context, thank you.
Earlier you said:
 The level 2 support description noted that it should be opt-in 
 because its slow.
My main point is simply that you mischaracterized what the standard says. Making level 1 opt-in, rather than level 2, would be just as compliant as the reverse. The standard makes no suggestion as to which should be default.
Jun 02 2016
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:
 * s.all!(c => c == 'ö') works only with autodecoding. It returns always false
 without.
The o is inferred as a wchar. The lambda then is inferred to return a wchar. The algorithm can check that the input is char[], and is being tested against a wchar. Therefore, the algorithm can specialize to do the decoding itself. No autodecoding necessary, and it does the right thing.
Jun 02 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 22:07, Walter Bright wrote:
 On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:
 * s.all!(c => c == 'ö') works only with autodecoding. It returns
 always false
 without.
The o is inferred as a wchar. The lamda then is inferred to return a wchar.
No, the lambda returns a bool.
 The algorithm can check that the input is char[], and is being
 tested against a wchar. Therefore, the algorithm can specialize to do
 the decoding itself.

 No autodecoding necessary, and it does the right thing.
It still would not be the right thing. The lambda shouldn't compile. It is not meaningful to compare utf-8 and utf-16 code units directly.
Jun 02 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:12 PM, Timon Gehr wrote:
 It is not meaningful to compare utf-8 and utf-16 code units directly.
But it is meaningful to compare Unicode code points. -- Andrei
Jun 02 2016
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 22:28, Andrei Alexandrescu wrote:
 On 06/02/2016 04:12 PM, Timon Gehr wrote:
 It is not meaningful to compare utf-8 and utf-16 code units directly.
But it is meaningful to compare Unicode code points. -- Andrei
It is also meaningful to compare two utf-8 code units or two utf-16 code units.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:50 PM, Timon Gehr wrote:
 On 02.06.2016 22:28, Andrei Alexandrescu wrote:
 On 06/02/2016 04:12 PM, Timon Gehr wrote:
 It is not meaningful to compare utf-8 and utf-16 code units directly.
But it is meaningful to compare Unicode code points. -- Andrei
It is also meaningful to compare two utf-8 code units or two utf-16 code units.
By decoding them of course. -- Andrei
Jun 02 2016
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 22:51, Andrei Alexandrescu wrote:
 On 06/02/2016 04:50 PM, Timon Gehr wrote:
 On 02.06.2016 22:28, Andrei Alexandrescu wrote:
 On 06/02/2016 04:12 PM, Timon Gehr wrote:
 It is not meaningful to compare utf-8 and utf-16 code units directly.
But it is meaningful to compare Unicode code points. -- Andrei
It is also meaningful to compare two utf-8 code units or two utf-16 code units.
By decoding them of course. -- Andrei
That makes no sense, I cannot decode single code units. BTW, I guess the reason why char converts to wchar converts to dchar is that the lower half of code units in char and the lower half of code units in wchar are code points. Maybe code units and code points with low numerical values should have distinct types.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:23 PM, Timon Gehr wrote:
 On 02.06.2016 22:51, Andrei Alexandrescu wrote:
 On 06/02/2016 04:50 PM, Timon Gehr wrote:
 On 02.06.2016 22:28, Andrei Alexandrescu wrote:
 On 06/02/2016 04:12 PM, Timon Gehr wrote:
 It is not meaningful to compare utf-8 and utf-16 code units directly.
But it is meaningful to compare Unicode code points. -- Andrei
It is also meaningful to compare two utf-8 code units or two utf-16 code units.
By decoding them of course. -- Andrei
That makes no sense, I cannot decode single code units. BTW, I guess the reason why char converts to wchar converts to dchar is that the lower half of code units in char and the lower half of code units in wchar are code points. Maybe code units and code points with low numerical values should have distinct types.
Then you lost me. (I'm sure you're making a good point.) -- Andrei
Jun 02 2016
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 23:29, Andrei Alexandrescu wrote:
 On 6/2/16 5:23 PM, Timon Gehr wrote:
 On 02.06.2016 22:51, Andrei Alexandrescu wrote:
 On 06/02/2016 04:50 PM, Timon Gehr wrote:
 On 02.06.2016 22:28, Andrei Alexandrescu wrote:
 On 06/02/2016 04:12 PM, Timon Gehr wrote:
 It is not meaningful to compare utf-8 and utf-16 code units directly.
But it is meaningful to compare Unicode code points. -- Andrei
It is also meaningful to compare two utf-8 code units or two utf-16 code units.
By decoding them of course. -- Andrei
That makes no sense, I cannot decode single code units. BTW, I guess the reason why char converts to wchar converts to dchar is that the lower half of code units in char and the lower half of code units in wchar are code points. Maybe code units and code points with low numerical values should have distinct types.
Then you lost me. (I'm sure you're making a good point.) -- Andrei
Basically:

    bool bad(char c, dchar d){ return c==d; }  // ideally shouldn't compile
    bool good(char c, char d){ return c==d; }  // should compile
Jun 02 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 1:12 PM, Timon Gehr wrote:
 On 02.06.2016 22:07, Walter Bright wrote:
 On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:
 * s.all!(c => c == 'ö') works only with autodecoding. It returns
 always false
 without.
The o is inferred as a wchar. The lamda then is inferred to return a wchar.
No, the lambda returns a bool.
Thanks for the correction.
 The algorithm can check that the input is char[], and is being
 tested against a wchar. Therefore, the algorithm can specialize to do
 the decoding itself.

 No autodecoding necessary, and it does the right thing.
It still would not be the right thing. The lambda shouldn't compile. It is not meaningful to compare utf-8 and utf-16 code units directly.
Yes, you have a good point. But we do allow things like:

    byte b;
    if (b == 10000) ...
Jun 02 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 23:56, Walter Bright wrote:
 On 6/2/2016 1:12 PM, Timon Gehr wrote:
 ...
 It is not
 meaningful to compare utf-8 and utf-16 code units directly.
Yes, you have a good point. But we do allow things like: byte b; if (b == 10000) ...
Well, this is a somewhat different case, because 10000 is just not representable as a byte. Every value that fits in a byte fits in an int though. It's different for code units. They are incompatible both ways. E.g. dchar obviously does not fit in a char, and while the lower half of char is compatible with dchar, the upper half is specific to the encoding. dchar cannot represent upper half char code units. You get the code points with the corresponding values instead. E.g.:

    void main(){
        import std.stdio, std.utf;
        foreach(dchar d; "ö".byCodeUnit)
            writeln(d); // "Ã", "¶"
    }
Jun 02 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 3:11 PM, Timon Gehr wrote:
 Well, this is a somewhat different case, because 10000 is just not
representable
 as a byte. Every value that fits in a byte fits in an int though.

 It's different for code units. They are incompatible both ways.
Not exactly. (c == 'ö') is always false for the same reason that (b == 10000) is always false. I'm not sure what the right answer is here.
Jun 02 2016
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 03.06.2016 00:26, Walter Bright wrote:
 On 6/2/2016 3:11 PM, Timon Gehr wrote:
 Well, this is a somewhat different case, because 10000 is just not
 representable
 as a byte. Every value that fits in a byte fits in an int though.

 It's different for code units. They are incompatible both ways.
Not exactly. (c == 'ö') is always false for the same reason that (b == 1000) is always false. ...
Yes. And _additionally_, some other concerns apply that are not there for byte vs. int. I.e. if b == 10000 is disallowed, then c == d should be disallowed too, but b == 10000 can be allowed even if c == d is disallowed.
 I'm not sure what the right answer is here.
char to dchar is a lossy conversion, so it shouldn't happen. byte to int is a lossless conversion, so there is no problem a priori.
Jun 02 2016
prev sibling parent Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 2 June 2016 at 21:56:10 UTC, Walter Bright wrote:
 Yes, you have a good point. But we do allow things like:

    byte b;
    if (b == 10000) ...
Why allowing char/wchar/dchar comparisons is wrong:

    void main()
    {
        string s = "Привет";
        foreach (c; s)
            assert(c != 'Ñ');
    }

The assert fails: c iterates over UTF-8 code units (char), and the lead byte 0xD1 of several of those Cyrillic letters compares equal to 'Ñ' (U+00D1) after promotion, even though the text contains no Ñ. From my post from 2014: http://forum.dlang.org/post/knrwiqxhlvqwxqshyqpy forum.dlang.org
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 04:07 PM, Walter Bright wrote:
 On 6/2/2016 12:05 PM, Andrei Alexandrescu wrote:
 * s.all!(c => c == 'ö') works only with autodecoding. It returns
 always false
 without.
The o is inferred as a wchar. The lambda then is inferred to return a wchar.
The lambda returns bool. -- Andrei
Jun 02 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:
 The lambda returns bool. -- Andrei
Yes, I was wrong about that. But the point still stands with:
 * s.balancedParens('〈', '〉') works only with autodecoding.
 * s.canFind('ö') works only with autodecoding. It returns always false
without.
Can be made to work without autodecoding.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 05:58 PM, Walter Bright wrote:
 On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:
 The lambda returns bool. -- Andrei
Yes, I was wrong about that. But the point still stands with: > * s.balancedParens('〈', '〉') works only with autodecoding. > * s.canFind('ö') works only with autodecoding. It returns always false without. Can be made to work without autodecoding.
By special casing? Perhaps. I seem to recall though that one major issue with autodecoding was that it special-cases certain algorithms. So you'd need to go through all of std.algorithm and make sure you can special-case your way out of situations that work today. Andrei
Jun 02 2016
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 03.06.2016 00:23, Andrei Alexandrescu wrote:
 On 06/02/2016 05:58 PM, Walter Bright wrote:
 On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:
 The lambda returns bool. -- Andrei
Yes, I was wrong about that. But the point still stands with: > * s.balancedParens('〈', '〉') works only with autodecoding. > * s.canFind('ö') works only with autodecoding. It returns always false without. Can be made to work without autodecoding.
By special casing? Perhaps. I seem to recall though that one major issue with autodecoding was that it special-cases certain algorithms.
The major issue is that it special-cases when different, more natural semantics are available.
Jun 02 2016
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 3:23 PM, Andrei Alexandrescu wrote:
 On 06/02/2016 05:58 PM, Walter Bright wrote:
  > * s.balancedParens('〈', '〉') works only with autodecoding.
  > * s.canFind('ö') works only with autodecoding. It returns always
 false without.

 Can be made to work without autodecoding.
By special casing? Perhaps.
The argument to canFind() can be detected as not being a char, then decoded into a sequence of char's, then forwarded to a substring search.
 I seem to recall though that one major issue with
 autodecoding was that it special-cases certain algorithms. So you'd need to go
 through all of std.algorithm and make sure you can special-case your way out of
 situations that work today.
That's right. A side effect of that is that the algorithms will go even faster! So it's good. (Searching for a substring of code units is faster than decoding the input stream.)
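A rough sketch of that specialization (the helper name is made up; only std.utf.encode, std.string.representation and a plain substring search are assumed):

import std.algorithm.searching : canFind;
import std.string : representation;
import std.utf : encode;

// Encode the dchar needle once, then search raw code units - no per-element decoding.
bool canFindNoDecode(const(char)[] haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle); // UTF-8 encode the needle
    return haystack.representation.canFind(buf[0 .. len].representation);
}

unittest
{
    assert(canFindNoDecode("blöd", 'ö'));
    assert(!canFindNoDecode("blod", 'ö'));
}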
Jun 02 2016
parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, June 02, 2016 15:48:03 Walter Bright via Digitalmars-d wrote:
 On 6/2/2016 3:23 PM, Andrei Alexandrescu wrote:
 On 06/02/2016 05:58 PM, Walter Bright wrote:
  > * s.balancedParens('〈', '〉') works only with autodecoding.
  > * s.canFind('ö') works only with autodecoding. It returns always

 false without.

 Can be made to work without autodecoding.
By special casing? Perhaps.
The argument to canFind() can be detected as not being a char, then decoded into a sequence of char's, then forwarded to a substring search.
How do you suggest that we handle the normalization issue? Should we just assume NFC like std.uni.normalize does and provide an optional template argument to indicate a different normalization (like normalize does)? Since without providing a way to deal with the normalization, we're not actually making the code fully correct, just faster. - Jonathan M Davis
Jun 02 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote:
 How do you suggest that we handle the normalization issue?
Started a new thread for that one.
Jun 02 2016
prev sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, June 02, 2016 18:23:19 Andrei Alexandrescu via Digitalmars-d 
wrote:
 On 06/02/2016 05:58 PM, Walter Bright wrote:
 On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:
 The lambda returns bool. -- Andrei
Yes, I was wrong about that. But the point still stands with: > * s.balancedParens('〈', '〉') works only with autodecoding. > * s.canFind('ö') works only with autodecoding. It returns always false without. Can be made to work without autodecoding.
By special casing? Perhaps. I seem to recall though that one major issue with autodecoding was that it special-cases certain algorithms. So you'd need to go through all of std.algorithm and make sure you can special-case your way out of situations that work today.
Yeah, I believe that you do have to do some special casing, though it would be special casing on ranges of code units in general and not strings specifically, and a lot of those functions are already special cased on string in an attempt be efficient. In particular, with a function like find or canFind, you'd take the needle and encode it to match the haystack it was passed so that you can do the comparisons via code units. So, you incur the encoding cost once when encoding the needle rather than incurring the decoding cost of each code point or grapheme as you iterate over the haystack. So, you end up with something that's correct and efficient. It's also much friendlier to code that only operates on ASCII. The one issue that I'm not quite sure how we'd handle in that case is normalization (which auto-decoding doesn't handle either), since you'd need to normalize the needle to match the haystack (which also assumes that the haystack was already normalized). Certainly, it's the sort of thing that makes it so that you kind of wish you were dealing with a string type that had the normalization built into it rather than either an array of code units or an arbitrary range of code units. But maybe we could assume the NFC normalization like std.uni.normalize does and provide an optional template argument for the normalization scheme. In any case, while it's not entirely straightforward, it is quite possible to write some algorithms in a way which works on arbitrary ranges of code units and deals with Unicode correctly without auto-decoding or requiring that the user convert it to a range of code points or graphemes in order to properly handle the full range of Unicode. And even if we keep auto-decoding, we pretty much need to fix it so that std.algorithm and friends are Unicode-aware in this manner so that ranges of code units work in general without requiring that you use byGrapheme. So, this sort of thing could have a large impact on RCStr, even if we keep auto-decoding for narrow strings. Other algorithms, however, can't be made to work automatically with Unicode - at least not with the current range paradigm. filter, for instance, really needs to operate on graphemes to filter on characters, but with a range of code units, that would mean operating on groups of code units as a single element, which you can't do with something like a range of char, since that essentially becomes a range of ranges. It has to be wrapped in a range that's going to provide graphemes - and of course, if you know that you're operating only on ASCII, then you wouldn't want to deal with graphemes anyway, so automatically converting to graphemes would be undesirable. So, for a function like filter, it really does have to be up to the programmer to indicate what level of Unicode they want to operate at. But if we don't make functions Unicode-aware where possible, then we're going to take a performance hit by essentially forcing everyone to use explicit ranges of code points or graphemes even when they should be unnecessary. So, I think that we're stuck with some level of special casing, but it would then be for ranges of code units and code points and not strings. So, it would work efficiently for stuff like RCStr, which the current scheme does not. 
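To sketch the needle-normalization idea (a sketch only, assuming the haystack is already NFC and using nothing beyond std.uni and std.utf.byCodeUnit):

import std.algorithm.searching : canFind;
import std.uni;
import std.utf : byCodeUnit;

// Normalize the needle to the form the haystack is assumed to use (NFC here),
// then compare plain code units - no autodecoding involved.
bool canFindNormalized(string haystack, string needle)
{
    return haystack.byCodeUnit.canFind(normalize!NFC(needle).byCodeUnit);
}

unittest
{
    assert(canFindNormalized("blöd", "o\u0308")); // decomposed needle is still found
}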
I think that the reality of the matter is that regardless of whether we keep auto-decoding for narrow strings in place, we need to make Phobos operate on arbitrary ranges of code units and code points, since even stuff like RCStr won't work efficiently otherwise, and stuff like byCodeUnit won't be usable in as many cases otherwise, because if a generic function isn't Unicode-aware, then in many cases, byCodeUnit will be very wrong, just like byCodePoint would be wrong. So, as far as Phobos goes, I'm not sure that the question of auto-decoding matters much for what we need to do at this point. If we do what we need to do, then Phobos will work whether we have auto-decoding or not (working in a Unicode-aware manner where possible and forcing the user to decide the correct level of Unicode to work at where not), and then it just becomes a question of whether we can or should deprecate auto-decoding once all that's done. - Jonathan M Davis
Jun 02 2016
prev sibling next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 2 Jun 2016 15:05:44 -0400
schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:

 On 06/02/2016 01:54 PM, Marc Schütz wrote:
 Which practical tasks are made possible (and work _correctly_) if you
 decode to code points, that don't already work with code units?
Pretty much everything. s.all!(c => c == 'ö')
Andrei, your ignorance is really starting to grind on everyone's nerves. If after 350 posts you still don't see why this is incorrect: s.any!(c => c == 'o'), you must be actively skipping the informational content of this thread. You are in error, no one agrees with you, and you refuse to see it, and in the end we have to assume you will make a decisive vote against any PR with the intent to remove auto-decoding from Phobos. Your so called vocal minority is actually D's panel of Unicode experts who understand that auto-decoding is a false ally and should be on the deprecation track. Remember final-by-default? You promised that your objection about breaking code means that D2 will only continue to be fixed in a backwards compatible way, be it the implementation of shared or whatever else. Yet months later you opened a thread with the title "inout must go". So that must have been an appeasement back then. People don't forget these things easily and RCStr seems to be a similar distraction, considering we haven't looked into borrowing/scoped enough and you promise wonders from it. -- Marco
Jun 02 2016
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 3:10 PM, Marco Leise wrote:
 we haven't looked into borrowing/scoped enough
That's my fault. As for scoped, the idea is to make scope work analogously to DIP25's 'return ref'. I don't believe we need borrowing, we've worked out another solution that will work for ref counting. Please do not reply to this in this thread - start a new one if you wish to continue with this topic.
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 06:10 PM, Marco Leise wrote:
 Am Thu, 2 Jun 2016 15:05:44 -0400
 schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:

 On 06/02/2016 01:54 PM, Marc Schütz wrote:
 Which practical tasks are made possible (and work _correctly_) if you
 decode to code points, that don't already work with code units?
Pretty much everything. s.all!(c => c == 'ö')
 Andrei, your ignorance is really starting to grind on everyone's nerves.
Indeed there seem to be serious questions about my competence, basic comprehension, and now knowledge. I understand it is tempting to assume that a disagreement is caused by the other simply not understanding the matter. Even if that were true it's not worth sacrificing civility over it.
 If after 350 posts you still don't see
 why this is incorrect: s.any!(c => c == 'o'), you must be
 actively skipping the informational content of this thread.
Is it 'o' with an umlaut or without? At any rate, consider s of type string and x of type dchar. The dchar type is defined as "a Unicode code point", or at least my understanding that has been a reasonable definition to operate with in the D language ever since its first release. Also in the D language, the various string types char[], wchar[] etc. with their respective qualified versions are meant to hold Unicode strings with one of the UTF8, UTF16, and UTF32 encodings. Following these definitions, it stands to reason to infer that the call s.find(c => c == x) means "find the code point x in string s and return the balance of s positioned there". It's prima facie application of the definitions of the entities involved. Is this the only possible or recommended meaning? Most likely not, viz. the subtle cases in which a given grapheme is represented via either one or multiple code points by means of combining characters. Is it the best possible meaning? It's even difficult to define what "best" means (fastest, covering most languages, etc). I'm not claiming that meaning is the only possible, the only recommended, or the best possible. All I'm arguing is that it's not retarded, and within a certain universe confined to operating at code point level (which is reasonable per the definitions of the types involved) it can be considered correct. If at any point in the reasoning above some rampant ignorance comes about, please point it out.
 You are in error, no one agrees with you, and you refuse to see
 it and in the end we have to assume you will make a decisive
 vote against any PR with the intent to remove auto-decoding
 from Phobos.
This seems to assume I have some vesting in the position that makes it independent of facts. That is not the case. I do what I think is right to do, and you do what you think is right to do.
 Your so called vocal minority is actually D's panel of Unicode
 experts who understand that auto-decoding is a false ally and
 should be on the deprecation track.
They have failed to convince me. But I am more convinced than before that RCStr should not offer a default mode of iteration. I think its impact is lost in this discussion, because once it's understood RCStr will become D's recommended string type, the entire matter becomes moot.
 Remember final-by-default? You promised, that your objection
 about breaking code means that D2 will only continue to be
 fixed in a backwards compatible way, be it the implementation
 of shared or whatever else. Yet months later you opened a
 thread with the title "inout must go". So that must have been
 an appeasement back then. People don't forget these things
 easily and RCStr seems to be a similar distraction,
 considering we haven't looked into borrowing/scoped enough and
 you promise wonders from it.
What the hell is this, digging dirt on me? Paying back debts? Please stop that crap. Andrei
Jun 02 2016
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Thu, 2 Jun 2016 18:54:21 -0400
schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:

 On 06/02/2016 06:10 PM, Marco Leise wrote:
 Am Thu, 2 Jun 2016 15:05:44 -0400
 schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:
 On 06/02/2016 01:54 PM, Marc Schütz wrote:
 Which practical tasks are made possible (and work _correctly_) if you
 decode to code points, that don't already work with code units?
 Pretty much everything. s.all!(c => c == 'ö')
 Andrei, your ignorance is really starting to grind on everyone's nerves.
Indeed there seem to be serious questions about my competence, basic comprehension, and now knowledge.
That's not my general impression, but something is different with this thread.
 I understand it is tempting to assume that a disagreement is caused by
 the other simply not understanding the matter. Even if that were true
 it's not worth sacrificing civility over it.
Civility has had us caught in a 36-page-long, tiresome debate with us mostly talking past each other. I was being impolite and can't say I regret it, because I prefer this answer over the rest of the thread. It's more informed, elaborate and conclusive.
 If after 350 posts you still don't see
 why this is incorrect: s.any!(c => c == 'o'), you must be
 actively skipping the informational content of this thread.
Is it 'o' with an umlaut or without? At any rate, consider s of type string and x of type dchar. The dchar type is defined as "a Unicode code point", or at least my understanding that has been a reasonable definition to operate with in the D language ever since its first release. Also in the D language, the various string types char[], wchar[] etc. with their respective qualified versions are meant to hold Unicode strings with one of the UTF8, UTF16, and UTF32 encodings. Following these definitions, it stands to reason to infer that the call s.find(c => c == x) means "find the code point x in string s and return the balance of s positioned there". It's prima facie application of the definitions of the entities involved.

Is this the only possible or recommended meaning? Most likely not, viz. the subtle cases in which a given grapheme is represented via either one or multiple code points by means of combining characters. Is it the best possible meaning? It's even difficult to define what "best" means (fastest, covering most languages, etc).

I'm not claiming that meaning is the only possible, the only recommended, or the best possible. All I'm arguing is that it's not retarded, and within a certain universe confined to operating at code point level (which is reasonable per the definitions of the types involved) it can be considered correct.

If at any point in the reasoning above some rampant ignorance comes about, please point it out.
No, it's pretty close now. We can all agree that there is no "best" way, only different use cases. Just defining Phobos to work on code points gives the illusion that it does the correct thing in all use cases - after all, D claims to support Unicode. But if you want to iterate over visual letters it is incorrect, and it is otherwise slow when you work on ASCII-structured formats (JSON, XML, paths, Warp, ...). Then there is explaining the different default iteration schemes when using foreach vs. range API (no big deal, just not easily justified) and the cost of implementation when dealing with char[]/wchar[]. From this observation we concluded that decoding should be opt-in and that when we need it, it should be a conscious decision. Unicode is quite complex, and learning about the difference between code points and grapheme clusters when segmenting strings will benefit code quality. As for the question, do multi-code-point graphemes ever appear in the wild? OS X is known to use NFD on its native file system and there is a hint on Wikipedia that some symbols from Thai or Hindi's Devanagari need them: https://en.wikipedia.org/wiki/UTF-8#Disadvantages Some form of Lithuanian seems to have a use for them, too: http://www.unicode.org/L2/L2012/12026r-n4191-lithuanian.pdf Aside from those there is nothing generally wrong about decomposed letters appearing in strings, even though the use of NFC is encouraged.
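A small sketch of that NFD point (only std.uni, std.range and std.algorithm are assumed; the word is made up):

import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.uni;

void main()
{
    string composed   = normalize!NFC("blöd"); // 'ö' as one code point
    string decomposed = normalize!NFD("blöd"); // 'o' + combining diaeresis

    // Code-point-level search only sees the composed form...
    assert(composed.canFind('ö'));
    assert(!decomposed.canFind('ö'));

    // ...while grapheme-level iteration treats both as four letters.
    assert(composed.byGrapheme.walkLength == 4);
    assert(decomposed.byGrapheme.walkLength == 4);
}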
 […harsh tone removed…] in the end we have to assume you
 will make a decisive vote against any PR with the intent
 to remove auto-decoding from Phobos.
This seems to assume I have some vesting in the position that makes it independent of facts. That is not the case. I do what I think is right to do, and you do what you think is right to do.
Your vote outweighs that of many others for better or worse. When a decision needs to be made and the community is divided, we need you or Walter or anyone who is invested in the matter to cast a ruling vote. However, when several dozen people support an idea after discussion, having heard everyone's arguments with practically no objections, and you overrule everyone, tensions build up. I welcome the idea to delegate some of the tasks to smaller groups. No single person is knowledgeable in every area of CS, and both a bus factor of 1 and too big a group can hinder decision making. It would help to know for the future whether you understand your role as one with veto powers, or whether you could see yourself giving up responsibility for some decisions to the community, and if so under what conditions.
 Your so called vocal minority is actually D's panel of Unicode
 experts who understand that auto-decoding is a false ally and
 should be on the deprecation track.
They have failed to convince me. But I am more convinced than before that RCStr should not offer a default mode of iteration. I think its impact is lost in this discussion, because once it's understood RCStr will become D's recommended string type, the entire matter becomes moot.
 Remember final-by-default? You promised, that your objection
 about breaking code means that D2 will only continue to be
 fixed in a backwards compatible way, be it the implementation
 of shared or whatever else. Yet months later you opened a
 thread with the title "inout must go". So that must have been
 an appeasement back then. People don't forget these things
 easily and RCStr seems to be a similar distraction,
 considering we haven't looked into borrowing/scoped enough and
 you promise wonders from it.
What the hell is this, digging dirt on me? Paying back debts? Please stop that crap.
No, that was my actual impression. I must apologize for generalizing it to other people though. I welcome the RCStr project and hope it will be good. At this time though it is not yet fleshed out and we can't tell how fast its adoption will be. Remember that DIPs on scope and RC have tended in the past to go into long debates with unclear outcomes. Unlike this thread, which may be the first in D's forum history with such a high agreement across the board.
 Andrei
-- Marco
Jun 03 2016
prev sibling parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, June 02, 2016 15:05:44 Andrei Alexandrescu via Digitalmars-d 
wrote:
 The intent of autodecoding was to make std.algorithm work meaningfully
 with strings. As it's easy to see I just went through
 std.algorithm.searching alphabetically and found issues literally with
 every primitive in there. It's an easy exercise to go forth with the others.
It comes down to the question of whether it's better to fail quickly when Unicode is handled incorrectly so that it's obvious that you're doing it wrong, or whether it's better for it to work in a large number of cases so that for a lot of code it "just works" but is still wrong in the general case, and it's a lot less obvious that that's the case, so many folks won't realize that they need to do more in order to have their string handling be Unicode-correct. With code units - especially UTF-8 - it becomes obvious very quickly that treating each element of the string/range as a character is wrong. With code points, you have to work far harder to find examples that are incorrect. So, it's not at all obvious (especially to the lay programmer) that the Unicode handling is incorrect and that their code is wrong - but their code will end up working a large percentage of the time in spite of it being wrong in the general case. So, yes, it's trivial to show how operating on ranges of code units as if they were characters gives incorrect results far more easily than operating on ranges of code points does. But operating on code points as if they were characters is still going to give incorrect results in the general case. Regardless of auto-decoding, the answer is that the programmer needs to understand the Unicode issues and use ranges of code units or code points where appropriate and use ranges of graphemes where appropriate. It's just that if we default to handling code points, then a lot of code will be written which treats those as characters, and it will provide the correct result more often than it would if it treated code units as characters. In any case, I've probably posted too much in this thread already. It's clear that the first step to solving this problem is to improve Phobos so that it handles ranges of code units, code points, and graphemes correctly whether auto-decoding is involved or not, and only then can we consider the possibility of removing auto-decoding (and even then, the answer may still be that we're stuck, because we consider the resulting code breakage to be too great). But whether Phobos retains auto-decoding or not, the Unicode handling stuff in general is the same, and what we need to do to improve the situation is the same. So, clearly, I need to do a much better job of finding time to work on D so that I can create some PRs to help the situation. Unfortunately, it's far easier to find a few minutes here and there while waiting on other stuff to shoot off a post or two in the newsgroup than it is to find time to substantively work on code. :| - Jonathan M Davis
Jun 03 2016
prev sibling next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu 
wrote:
 Look at reddit and hackernews, too - admittedly other 
 self-selected communities. Language debates often spring about. 
 How often is the point being made that D is wanting because of 
 its string support? Nada.
I've been lurking on this thread for a while and was convinced by the arguments that autodecoding should go. Nevertheless, I think this is really the strongest argument you've made against using the community's resources to fix it now. If your position from the beginning were this clear, then I think the thread might not have gone on so long. As someone trained in economics, I get convinced by arguments about scarce resources. It makes more sense to focus on higher value issues. However, the case against autodecoding is clearly popular. At a minimum, it has resulted in a significant amount of time dedicated to forum discussion and has made you metaphorically angry at Walter. Resources spent grumbling about it could be better spent elsewhere. One way to deal with the problem of scarce resources is by reducing the cost of whatever action you want to take. For instance, Adam Ruppe just put up a good post in the Dealing with Autodecode thread https://forum.dlang.org/post/ksasfwpuvpwxjfniupiv forum.dlang.org noting that a compiler switch could easily be added to phobos. Combined with a long deprecation timeline, the cost that it would impose on D users who are not active forum members and might want to complain about the issue would be relatively small. Another problem related to scarce resources is that there is a division of labor in the community. People like yourself and Walter have fewer substitutes for your labor. It makes sense that the top contributors should be focusing on higher value issues where fewer people have the ability to contribute. I don't dispute that. However, there seem to be a number of people who can contribute on this issue and want to contribute. Scarcity of resources seems to be less of an issue here. Finally, when you discussed things people complain about with D, you mentioned tooling. In the time I've been following this forum, I haven't seen a single thread focusing on this issue. I don't mean a few comments like "oh D should improve its tooling." I mean a thread dedicated to D's tooling strengths and weaknesses with a goal of creating a plan on what to do to improve things.
 Currently dfix is weak because it doesn't do lookup. So we need 
 to make the front end into a library. Daniel said he wants to 
 be on it, but he has two jobs to worry about so he's short on 
 time. There's only so many hours in the day, and I think the 
 right focus is on attacking the matters above.
On a somewhat tangential basis, I was reading about Microsoft's Roslyn a week or so ago. They do something similar where they have a compiler API. I don't have a very good sense of how it works from their overview, but it seems to be an interesting approach.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 10:14 AM, jmh530 wrote:
 However, the case against autodecoding is clearly popular. At a minimum,
 it has resulted in a significant amount of time dedicated to forum
 discussion and has made you metaphorically angry at Walter. Resources
 spent grumbling about it could be better spent elsewhere.
Yah, this is a bummer and one of the larger issues of our community: there's too much talking about doing things and too little doing things. On one hand I want to empower people (as I said at DConf: please get me fired!), and on the other I need to prevent silly things from happening. The quality of some of the code that gets into Phobos when I look the other way is sadly sub-par. Cumulatively that has reduced its quality over time. That (improving the time * talent torque) is the real solution to Phobos' technical debt, of which autodecoding is negligible.
 One way to deal with the problem of scarce resources is by reducing the
 cost of whatever action you want to take. For instance, Adam Ruppe just
 put up a good post in the Dealing with Autodecode thread
 https://forum.dlang.org/post/ksasfwpuvpwxjfniupiv forum.dlang.org
 noting that a compiler switch could easily be added to phobos. Combined
 with a long deprecation timeline, the cost that it would impose on D
 users who are not active forum members and might want to complain about
 the issue would be relatively small.
This is a very costly solution to a very small problem. I'm here to prevent silly things like this from happening and to bring back perspective. We've had huge issues with language changes that were much more important and brought much less breakage. The fact that people talk about 132 breakages in Phobos with a straight face is a good sign that the heat of the debate has taken perspective away. I'm sure it will come back in a few weeks. Just need to keep the dam until then. The real ticket out of this is RCStr. It solves a major problem in the language (compulsive GC) and also a minor occasional annoyance (autodecoding). This is what I need to work on, instead of writing long messages to put back sense into people. Many don't realize that the only reason current strings ever work in safe code is because of the GC. char[] is too little encapsulation, so it needs GC as a crutch to be safe. That's the problem with D's strings, not autodecoding. That's why we need to change things. That's what keeps me awake at night. Andrei
Jun 02 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 2 June 2016 at 15:02:13 UTC, Andrei Alexandrescu 
wrote:
 Yah, this is a bummer and one of the larger issues of our 
 community: there's too much talking about doing things and too 
 little doing things.
We wrote a PR to implement the first step in the autodecode deprecation cycle. Granted, it wasn't ready to merge, but you just closed it with a flippant "not gonna happen" despite the *unanimous* agreement that the status quo sucks, and now complain that there's too much talking and too little doing! When we do something, you just shut it down then blame us. What's even the point of trying anymore?
Jun 02 2016
next sibling parent deadalnix <deadalnix gmail.com> writes:
On Thursday, 2 June 2016 at 15:38:46 UTC, Adam D. Ruppe wrote:
 On Thursday, 2 June 2016 at 15:02:13 UTC, Andrei Alexandrescu 
 wrote:
 Yah, this is a bummer and one of the larger issues of our 
 community: there's too much talking about doing things and too 
 little doing things.
We wrote a PR to implement the first step in the autodecode deprecation cycle. Granted, it wasn't ready to merge, but you just closed it with a flippant "not gonna happen" despite the *unanimous* agreement that the status quo sucks, and now complain that there's too much talking and too little doing! When we do something, you just shut it down then blame us. What's even the point of trying anymore?
https://www.youtube.com/watch?v=MJiBjfvltQw
Jun 02 2016
prev sibling next sibling parent reply Kagamin <spam here.lot> writes:
On Thursday, 2 June 2016 at 15:38:46 UTC, Adam D. Ruppe wrote:
 We wrote a PR to implement the first step in the autodecode 
 deprecation cycle.
It outright deprecated popFront - that's not the first step in the migration.
Jun 02 2016
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 2 June 2016 at 15:50:54 UTC, Kagamin wrote:
 It outright deprecated popFront - that's not the first step in 
 the migration.
Which gave us the list of places inside Phobos to fix, only about two hours of work, and proved that the version() method was viable (and REALLY easy to implement).
Jun 02 2016
next sibling parent reply Kagamin <spam here.lot> writes:
On Thursday, 2 June 2016 at 16:02:18 UTC, Adam D. Ruppe wrote:
 Which gave us the list of places inside Phobos to fix, only 
 about two hours of work, and proved that the version() method 
 was viable (and REALLY easy to implement).
Yes, it was a research PR that was never meant to be an implementation of the first step. You used the wrong wording, which just unnecessarily freaked Andrei out.
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 12:45 PM, Kagamin wrote:
 On Thursday, 2 June 2016 at 16:02:18 UTC, Adam D. Ruppe wrote:
 Which gave us the list of places inside Phobos to fix, only about two
 hours of work, and proved that the version() method was viable (and
 REALLY easy to implement).
Yes, it was a research PR that was never meant to be an implementation of the first step. You used wrong wording that just unnecessarily freaked Andrei out.
I closed it because it wasn't an actual implementation, in full understanding that the discussion in it could continue. -- Andrei
Jun 02 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 9:02 AM, Adam D. Ruppe wrote:
 Which gave us the list of places inside Phobos to fix, only about two hours of
 work, and proved that the version() method was viable (and REALLY easy to
 implement).
Nothing prevents anyone from doing that on their own (it's trivial) in order to find Phobos problems, and pick one or three to fix.
Jun 02 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 8:50 AM, Kagamin wrote:
 It outright deprecated popFront - that's not the first step in the migration.
That's right. It's going about things backwards. The first step is to adjust Phobos implementations and documentation so they do not rely on autodecoding. This will take some time and care, particularly with algorithms that support mixed codeunit argument types. (Or perhaps mixed codeunit argument types can be deprecated.) This is not so simple, as they have to be dealt with one by one.
Jun 02 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 2 June 2016 at 20:32:39 UTC, Walter Bright wrote:
 The first step is to adjust Phobos implementations and 
 documentation so they do not rely on autodecoding.
The compiler can help you with that. That's the point of the do not merge PR: it got an actionable list out of the compiler and proved the way forward was viable.
Jun 02 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 1:46 PM, Adam D. Ruppe wrote:
 The compiler can help you with that. That's the point of the do not merge PR:
it
 got an actionable list out of the compiler and proved the way forward was
viable.
What is supposed to be done with "do not merge" PRs other than close them?
Jun 02 2016
next sibling parent Jack Stouffer <jack jackstouffer.com> writes:
On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:
 What is supposed to be done with "do not merge" PRs other than 
 close them?
Experimentally iterate until something workable comes about. This way it's done publicly and people can collaborate.
Jun 02 2016
prev sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:
 What is supposed to be done with "do not merge" PRs other than 
 close them?
Occasionally people need to try something on the auto tester (not sure if that's relevant to that particular PR, though). Presumably if someone marks their own PR as "do not merge", it means they're planning to either close it themselves after it has served its purpose, or they plan to fix/finish it and then remove the "do not merge" label. Either way, they shouldn't be closed just because they say "do not merge" (unless they're abandoned or something, obviously).
Jun 02 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/2/16 5:05 PM, tsbockman wrote:
 On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:
 What is supposed to be done with "do not merge" PRs other than close
 them?
Occasionally people need to try something on the auto tester (not sure if that's relevant to that particular PR, though). Presumably if someone marks their own PR as "do not merge", it means they're planning to either close it themselves after it has served its purpose, or they plan to fix/finish it and then remove the "do not merge" label.
Feel free to reopen if it helps, it wasn't closed in anger. -- Andrei
Jun 02 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/2/2016 2:05 PM, tsbockman wrote:
 On Thursday, 2 June 2016 at 20:56:26 UTC, Walter Bright wrote:
 What is supposed to be done with "do not merge" PRs other than close them?
Occasionally people need to try something on the auto tester (not sure if that's relevant to that particular PR, though).
I've done that, but that doesn't apply here.
 Presumably if someone marks their own
 PR as "do not merge", it means they're planning to either close it themselves
 after it has served its purpose, or they plan to fix/finish it and then remove
 the "do not merge" label.
That doesn't seem to apply here, either.
 Either way, they shouldn't be closed just because they say "do not merge"
 (unless they're abandoned or something, obviously).
Something like that could not be merged until 132 other PRs are done to fix Phobos. It doesn't belong as a PR.
Jun 02 2016
parent tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 2 June 2016 at 22:20:49 UTC, Walter Bright wrote:
 On 6/2/2016 2:05 PM, tsbockman wrote:
 Presumably if someone marks their own
 PR as "do not merge", it means they're planning to either 
 close it themselves
 after it has served its purpose, or they plan to fix/finish it 
 and then remove
 the "do not merge" label.
That doesn't seem to apply here, either.
 Either way, they shouldn't be closed just because they say "do 
 not merge"
 (unless they're abandoned or something, obviously).
Something like that could not be merged until 132 other PRs are done to fix Phobos. It doesn't belong as a PR.
I was just responding to the general question you posed about "do not merge" PRs, not really arguing for that one, in particular, to be re-opened. I'm sure wilzbach is willing to explain if anyone cares to ask him why he did it as a PR, though.
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 11:38 AM, Adam D. Ruppe wrote:
 On Thursday, 2 June 2016 at 15:02:13 UTC, Andrei Alexandrescu wrote:
 Yah, this is a bummer and one of the larger issues of our
 community: there's too much talking about doing things and too
 little doing things.
We wrote a PR to implement the first step in the autodecode deprecation cycle. Granted, it wasn't ready to merge, but you just closed it with a flippant "not gonna happen" despite the *unanimous* agreement that the status quo sucks, and now complain that there's too much talking and too little doing!
You mean https://github.com/dlang/phobos/pull/4384, the one with "[do not merge]" in the title? Would you realistically have advised me to merge it? I spent time writing what I thought was a reasonable and reasonably long answer. Allow me to quote it below:
  wilzbach thanks for running this experiment.

 Andrei is wrong.
Definitely wouldn't be the first time and not the last.
 We can all see it, and maybe if we demonstrate that a migration
 path is possible, even actually pretty easy following a simple
 deprecation path, maybe he can see it too.
I'm not sure who "all" is but that's beside the point. Taking a step back, we'd take in a change that breaks Phobos in 132 places only if it was a major language overhaul bringing dramatic improvements to the quality of life for D programmers. An artifact as earth shattering as ranges, or an ownership system that was massively simple and beneficial. For comparison, the recent changes in name lookup broke Phobos in fewer places (I don't have an exact number, but I think they were at most a couple dozen.) Those changes were closing an enormous hole in the language and mark a huge step forward. I'd be really hard pressed to characterize the elimination of autodecoding as enough of an improvement to warrant this kind of breakage. (I do realize there's a difference between breakage and deprecation, but for many folks the distinction is academic.) The better end game here is to improve efficiency of code that uses autodecoding (e.g. per the recent `find()` work), and to make sure `RCStr` is the right design. A string that manages its own memory _and_ does the right things with regard to Unicode is the ticket. Let's focus future efforts on that.
Could you please point me at the parts you found flippant in it, or merely unreasonable?
 When we do something, you just shut it down then blame us. What's
 even the point of trying anymore?
At some point I need to stick with what I think is the better course for D, even if that means disagreeing with you. But I hope you understand this is not "flippant" or teasing people then shutting down their good work. Andrei
Jun 02 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 2 June 2016 at 16:12:01 UTC, Andrei Alexandrescu 
wrote:
 Would you realistically have advised me to merge it?
Not at this time, no, but I also wouldn't advise you to close it and tell us to stop trying if you were actually open to a chance. You closed that and posted this at about the same time: http://forum.dlang.org/post/nii497$2p79$1 digitalmars.com "I'm not going to debate this further" "What's important is that autodecoding is here to stay - there's no realistic way to eliminate it from D." So, what do you seriously expect us to think? We had a migration plan and enough excitement to start working on the code, then within about 15 minutes of each other, you close the study PR and post that the discussion is over and your mistake is here to stay.
 I'm not sure who "all" is but that's beside the point.
This sentence makes me pretty mad too. This topic has come up many times and nobody, NOBODY, with the exception of yourself agrees with the current behavior anymore. It is a very frequently asked question among new users, and we have no real justification because there is no technical merit to it.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 02:36 PM, Adam D. Ruppe wrote:
 We had a migration plan and enough excitement to start working on the code
I don't think the plan is realistic. How can I tell you this without you getting mad at me? Apparently the only way to go is do as you say. -- Andrei
Jun 02 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 2 June 2016 at 18:43:54 UTC, Andrei Alexandrescu 
wrote:
 I don't think the plan is realistic. How can I tell you this 
 without you getting mad at me?
You get out of the way and let the community get to work. Actually delegate, let people take ownership of problems, success and failure alike. If we fail then, at least it will be from our own experience instead of from executive meddling.
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 03:13 PM, Adam D. Ruppe wrote:
 On Thursday, 2 June 2016 at 18:43:54 UTC, Andrei Alexandrescu wrote:
 I don't think the plan is realistic. How can I tell you this without
 you getting mad at me?
You get out of the way and let the community get to work. Actually delegate, let people take ownership of problems, success and failure alike.
That's a good point. We plan to do more of that in the future.
 If we fail then, at least it will be from our own experience instead of
 from executive meddling.
This applies to high-risk work that is also of commensurately extraordinary value. My assessment is this is not it. If you were in my position you'd also do what you think is the best thing to do, and nobody should feel offended by that. Andrei
Jun 02 2016
prev sibling next sibling parent reply Kagamin <spam here.lot> writes:
On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu 
wrote:
 This is what's happening here. We worked ourselves to a foam 
 because the creator of the language started a thread entitled 
 "The Case Against Autodecode", whilst fully understanding there 
 is no way to actually eliminate autodecode.
Autodecode doesn't need to be removed from phobos completely, it only needs to be more bearable, like it is in the foreach statement. E.g. byDchar will stay, initial idea is to actually put it to more intensive usage in phobos and user code, no need to remove it.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 10:53 AM, Kagamin wrote:
 On Thursday, 2 June 2016 at 13:06:44 UTC, Andrei Alexandrescu wrote:
 This is what's happening here. We worked ourselves to a foam because
 the creator of the language started a thread entitled "The Case
 Against Autodecode", whilst fully understanding there is no way to
 actually eliminate autodecode.
Autodecode doesn't need to be removed from phobos completely, it only needs to be more bearable, like it is in the foreach statement. E.g. byDchar will stay, initial idea is to actually put it to more intensive usage in phobos and user code, no need to remove it.
Yah, and then such code will work with RCStr. -- Andrei
Jun 02 2016
parent reply Kagamin <spam here.lot> writes:
On Thursday, 2 June 2016 at 15:06:20 UTC, Andrei Alexandrescu 
wrote:
 Autodecode doesn't need to be removed from phobos completely, 
 it only
 needs to be more bearable, like it is in the foreach 
 statement. E.g.
 byDchar will stay, initial idea is to actually put it to more 
 intensive
 usage in phobos and user code, no need to remove it.
Yah, and then such code will work with RCStr. -- Andrei
Yes, do consider Walter's proposal, it will be an enabling technology for RCStr too: the more phobos works with string-like ranges the more it is usable for RCStr.
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 12:14 PM, Kagamin wrote:
 On Thursday, 2 June 2016 at 15:06:20 UTC, Andrei Alexandrescu wrote:
 Autodecode doesn't need to be removed from phobos completely, it only
 needs to be more bearable, like it is in the foreach statement. E.g.
 byDchar will stay, initial idea is to actually put it to more intensive
 usage in phobos and user code, no need to remove it.
Yah, and then such code will work with RCStr. -- Andrei
Yes, do consider Walter's proposal, it will be an enabling technology for RCStr too: the more phobos works with string-like ranges the more it is usable for RCStr.
Walter and I have a unified view on this. Although I'd need to raise the issue that the primitive should be by!dchar, not byDchar. -- Andrei
Jun 02 2016
parent ZombineDev <petar.p.kirov gmail.com> writes:
On Thursday, 2 June 2016 at 16:21:33 UTC, Andrei Alexandrescu 
wrote:
 On 06/02/2016 12:14 PM, Kagamin wrote:
 On Thursday, 2 June 2016 at 15:06:20 UTC, Andrei Alexandrescu 
 wrote:
 Autodecode doesn't need to be removed from phobos 
 completely, it only
 needs to be more bearable, like it is in the foreach 
 statement. E.g.
 byDchar will stay, initial idea is to actually put it to 
 more intensive
 usage in phobos and user code, no need to remove it.
Yah, and then such code will work with RCStr. -- Andrei
Yes, do consider Walter's proposal, it will be an enabling technology for RCStr too: the more phobos works with string-like ranges the more it is usable for RCStr.
Walter and I have a unified view on this. Although I'd need to raise the issue that the primitive should be by!dchar, not byDchar. -- Andrei
The primitive is byUTF!dchar:
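A minimal usage sketch (byUTF lives in std.utf and transcodes lazily, without allocating):

import std.algorithm.searching : canFind;
import std.utf : byUTF;

void main()
{
    string s = "blöd";
    // Lazily decode UTF-8 code units to dchars on the fly.
    assert(s.byUTF!dchar.canFind('ö'));
    // The same primitive can transcode to other code unit widths, too.
    assert(s.byUTF!wchar.canFind(cast(wchar) 'ö'));
}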
Jun 02 2016
prev sibling next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jun 02, 2016 at 09:06:44AM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
[...]
 ZombineDev, I've been at the top level in the C++ community for many
 many years, even after I wanted to exit :o). I'm familiar with how the
 committee that steers C++ works, perspective that is unique in our
 community - even Walter lacks it. I see trends and patterns. It is
 interesting how easily a small but very influential priesthood can
 alienate itself from the needs of the larger community and get into a
 frenzy over matters that are simply missing the point.
Appeal to authority.
 This is what's happening here. We worked ourselves to a foam because
 the creator of the language started a thread entitled "The Case
 Against Autodecode", whilst fully understanding there is no way to
 actually eliminate autodecode.
I think that's a misrepresentation of the situation. I was getting increasingly unhappy with autodecoding myself, completely independently of Walter, and in fact have filed bugs and posted complaints about it long before Walter started his thread. I used to be a supporter of autodecoding, but over time it has become increasingly clear to me that it was a mistake. The fact that you continue to deny this and write it off in the face of similar complaints raised by many active D users is very off-putting, to say the least, and does not inspire confidence. Not to mention the fact that you started this thread yourself with a question about what it is we dislike about autodecoding, yet after having received a multitude of complaints, corroborated by many forum members, you simply write off the whole thing like it was nothing. If you want D to succeed, you need to raise the morale of the community, and this is not the way to raise morale.
 The very definition of a useless debate, the kind he and I had agreed
 to not initiate anymore. It was a mistake. I'm still metaphorically
 angry at him for it.
On the contrary, I found that Walter's willingness to admit past mistakes very refreshing, even if practically speaking we can't actually get rid of autodecoding today. What he proposed in the other thread is actually a workable step towards reversing the wrong decision behind autodecoding, that doesn't leave existing users out in the cold, and that we might actually be able to pull off if done carefully. I know you probably won't see it the same way, since you still seem convinced that autodecoding was a good idea, but you need to understand that your opinion is not representative in this case. [...]
 Meanwhile, I go to conferences. Train and consult at large companies.
 Dozens every year, cumulatively thousands of people. I talk about D
 and ask people what it would take for them to use the language.
 Invariably I hear a surprisingly small number of reasons:
 
 * The garbage collector eliminates probably 60% of potential users
 right off.
At least we have begun to do something about this. That's good news.
 * Tooling is immature and of poorer quality compared to the
 competition.
And what have we done about it? How long has it been since dfix existed, yet we still haven't really integrated it into the dmd toolchain?
 * Safety has holes and bugs.
And what have we done about it?
 * Hiring people who know D is a problem.
There are many willing candidates right here. :-P
 * Documentation and tutorials are weak.
And what have we done about this?
 * There's no web services framework (by this time many folks know of
 D, but of those a shockingly small fraction has even heard of vibe.d).
 I have strongly argued with Sönke to bundle vibe.d with dmd over one
 year ago, and also in this forum. There wasn't enough interest.
What about linking to it in a prominent place on dlang.org? This isn't a big problem, AFAICT. I don't think it takes months and years to put up a big prominent banner promoting vibe.d on, say, the download page of dlang.org.
 * (On Windows) if it doesn't have a compelling Visual Studio plugin,
 it doesn't exist.
And what have we done about this? One of the things that I have found a little disappointing with D is that while it has many very promising features, it lacks polish in many small details. Such as the way features interact with each other in corner cases. E.g., the whole can't-use-gc from dtor debacle, the semantics of closures over aggregate members, holes in safe, holes in const/immutable in unions, the whole import mess that took oh-how-many-years to clean up that thankfully was finally improved recently, can't use nogc with Phobos, can't use const/pure/etc. in Object.toString, Object.opEqual, et al (which we've been trying to get of since how many years ago now?), and a whole long list of small irritations that in themselves are nothing, but together add up like a dustball to an overall perception of lack of polish. I'm more sympathetic to Walter's stance of improving the language for *current* users, instead of bending over backwards to please would-be adopters who may never actually adopt the language -- they'd just come back with new excuses of why they can't adopt D yet. If you make existing users happier, they will do all the work of evangelism for you, instead of you having to fight the uphill battle by yourself while bleeding away current users due to poor morale. T -- Why ask rhetorical questions? -- JC
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 10:48 AM, H. S. Teoh via Digitalmars-d wrote:
 On Thu, Jun 02, 2016 at 09:06:44AM -0400, Andrei Alexandrescu via
Digitalmars-d wrote:
 [...]
 ZombineDev, I've been at the top level in the C++ community for many
 many years, even after I wanted to exit :o). I'm familiar with how the
 committee that steers C++ works, perspective that is unique in our
 community - even Walter lacks it. I see trends and patterns. It is
 interesting how easily a small but very influential priesthood can
 alienate itself from the needs of the larger community and get into a
 frenzy over matters that are simply missing the point.
Appeal to authority.
You cut the context, which was rampant speculation.
 This is what's happening here. We worked ourselves to a foam because
 the creator of the language started a thread entitled "The Case
 Against Autodecode", whilst fully understanding there is no way to
 actually eliminate autodecode.
I think that's a misrepresentation of the situation. I was getting increasingly unhappy with autodecoding myself, completely independently of Walter, and in fact have filed bugs and posted complaints about it long before Walter started his thread. I used to be a supporter of autodecoding, but over time it has become increasingly clear to me that it was a mistake. The fact that you continue to deny this and write it off in the face of similar complaints raised by many active D users is very off-putting, to say the least, and does not inspire confidence. Not to mention the fact that you started this thread yourself with a question about what it is we dislike about autodecoding, yet after having received a multitude of complaints, corrobated by many forum members, you simply write off the whole thing like it was nothing. If you want D to succeed, you need to raise the morale of the community, and this is not the way to raise morale.
There is no denying. If I did things all over again, autodecoding would not be in. But also string would not be immutable(char)[] which is the real mistake. Some of the arguments in here have been good, but many (probably the majority) of them were not so much. A good one didn't even come up, Walter told it to me over the phone: the reality of invalid UTF strings forces you to mind the representation more often than you'd want in an ideal world. There is no "writing off". Again, the real solution here is RCStr. We can't continue with immutable(char)[] as our flagship string. Autodecoding is the least of its problems.
 The very definition of a useless debate, the kind he and I had agreed
 to not initiate anymore. It was a mistake. I'm still metaphorically
 angry at him for it.
On the contrary, I found Walter's willingness to admit past mistakes very refreshing, even if practically speaking we can't actually get rid of autodecoding today. What he proposed in the other thread is actually a workable step towards reversing the wrong decision behind autodecoding, one that doesn't leave existing users out in the cold, and that we might actually be able to pull off if done carefully. I know you probably won't see it the same way, since you still seem convinced that autodecoding was a good idea, but you need to understand that your opinion is not representative in this case.
I don't see it the same way. Yes, I agree my opinion is not representative. I'd also say I'm glad I can do something about this.
 [...]
 Meanwhile, I go to conferences. Train and consult at large companies.
 Dozens every year, cumulatively thousands of people. I talk about D
 and ask people what it would take for them to use the language.
 Invariably I hear a surprisingly small number of reasons:

 * The garbage collector eliminates probably 60% of potential users
 right off.
At least we have begun to do something about this. That's good news.
I've been working on RCStr for the past few days. I'd get a lot more work done if I didn't need to talk sense into people in this thread.
 * Tooling is immature and of poorer quality compared to the
 competition.
And what have we done about it? How long has it been since dfix existed, yet we still haven't really integrated it into the dmd toolchain?
I've spoken to Brian about it. Dfix does not do lookup, which sadly makes it unsuitable for meaningful use.
 * Safety has holes and bugs.
And what have we done about it?
Walter and I are working on safe RC.
 * Hiring people who know D is a problem.
There are many willing candidates right here. :-P
Nice.
 * Documentation and tutorials are weak.
And what have we done about this?
http://tour.dlang.org is a good start.
 * There's no web services framework (by this time many folks know of
 D, but of those a shockingly small fraction has even heard of vibe.d).
 I have strongly argued with Sönke to bundle vibe.d with dmd over one
 year ago, and also in this forum. There wasn't enough interest.
What about linking to it in a prominent place on dlang.org? This isn't a big problem, AFAICT. I don't think it takes months and years to put up a big prominent banner promoting vibe.d on, say, the download page of dlang.org.
PR please. I can't babysit everything. I'm preparing for a conference where I'll evangelize for D next week (http://ndcoslo.com/speaker/andrei-alexandrescu/). As I mentioned at DConf, for better or worse this is the kind of stuff I cannot delegate. That kind of work is where the community would really make an impact, not a large debate that I need to worry will lead to some silly rash decision.
 * (On Windows) if it doesn't have a compelling Visual Studio plugin,
 it doesn't exist.
And what have we done about this?
I'm actively looking for a collaboration.
 One of the things that I have found a little disappointing with D is
 that while it has many very promising features, it lacks polish in many
 small details. Such as the way features interact with each other in
 corner cases. E.g., the whole can't-use-gc from dtor debacle, the
 semantics of closures over aggregate members, holes in @safe, holes in
 const/immutable in unions, the whole import mess that took
 oh-how-many-years to clean up that thankfully was finally improved
 recently, can't use @nogc with Phobos, can't use const/pure/etc. in
 Object.toString, Object.opEquals, et al. (which we've been trying to get
 rid of since how many years ago now?), and a whole long list of small
 irritations that in themselves are nothing, but together add up like a
 dustball to an overall perception of lack of polish.
It's a fair perspective. Those annoy me as well. I'll also note that every language has such matters, including the mainstream ones. At some point we need to acknowledge they're there but they're small enough to live with. (Some of those you enumerated aren't small, e.g. the holes in @safe.)
 I'm more sympathetic to Walter's stance of improving the language for
 *current* users, instead of bending over backwards to please would-be
 adopters who may never actually adopt the language -- they'd just come
 back with new excuses of why they can't adopt D yet. If you make
 existing users happier, they will do all the work of evangelism for you,
 instead of you having to fight the uphill battle by yourself while
 bleeding away current users due to poor morale.
We want to improve the language for current AND future users. RCStr is part of that. Andrei
Jun 02 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Thursday, June 02, 2016 09:06:44 Andrei Alexandrescu via Digitalmars-d 
wrote:
 Meanwhile, I go to conferences. Train and consult at large companies.
 Dozens every year, cumulatively thousands of people. I talk about D and
 ask people what it would take for them to use the language. Invariably I
 hear a surprisingly small number of reasons:
Are folks going to not start using D because of auto-decoding? No, because they won't know anything about it. Many of them don't even know anything about ranges. But it _will_ result in a WTF moment for pretty much everyone. It happens all the time and results in plenty of questions on D.Learn and stackoverflow, because no one expects it, and it causes them problems. Can we sanely remove auto-decoding from Phobos? I don't know. It's entrenched enough that doing so without breaking code is going to be very difficult. But at minimum, we need to mitigate its effects, and I'm sure that we're going to be sorry in the long run if we don't figure out how to actually excise it. It's already a major wart that causes frequent problems, and it's the sort of thing that's going to make a number of folks unhappy with D in the long run, even if you can convince them to switch to it now while auto-decoding is still in place. Will it make them unhappy enough to switch away from D? Probably not. But it is going to be a constant pain point of the sort that folks frequently complain about with C++ - only this is one that we'll have, and C++ won't. - Jonathan M Davis
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 11:58 AM, Jonathan M Davis via Digitalmars-d wrote:
 On Thursday, June 02, 2016 09:06:44 Andrei Alexandrescu via Digitalmars-d
 wrote:
 Meanwhile, I go to conferences. Train and consult at large companies.
 Dozens every year, cumulatively thousands of people. I talk about D and
 ask people what it would take for them to use the language. Invariably I
 hear a surprisingly small number of reasons:
Are folks going to not start using D because of auto-decoding? No, because they won't know anything about it. Many of them don't even know anything about ranges.
Actually, ranges are a major reason why people look into D. -- Andrei
Jun 02 2016
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/1/16 6:24 PM, Andrei Alexandrescu wrote:
 On 06/01/2016 06:09 PM, ZombineDev wrote:
 Deprecating front, popFront and empty for narrow
 strings is what we are talking about here.
That will not happen. Walter and I consider the cost excessive and the benefit too small.
If this doesn't happen, then all this push to change anything in Phobos is completely wasted effort. As long as arrays aren't treated like arrays, we will have to deal with auto-decoding. You can change string literals to be something other than arrays, and then we have a path forward. But as long as char[] is not an array, we have lost the battle of sanity. -Steve
Jun 02 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 09:05 AM, Steven Schveighoffer wrote:
 On 6/1/16 6:24 PM, Andrei Alexandrescu wrote:
 On 06/01/2016 06:09 PM, ZombineDev wrote:
 Deprecating front, popFront and empty for narrow
 strings is what we are talking about here.
That will not happen. Walter and I consider the cost excessive and the benefit too small.
If this doesn't happen, then all this push to change anything in Phobos is completely wasted effort.
Really? "Anything"?
 As long as arrays aren't treated like
 arrays, we will have to deal with auto-decoding.

 You can change string literals to be something other than arrays, and
 then we have a path forward. But as long as char[] is not an array, we
 have lost the battle of sanity.
Yeah, it's a miracle the language stays glued eh. Your post is a prime example that this thread has lost the battle of sanity. I'll destroy you in person tonight. Andrei
Jun 02 2016
next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/2/16 9:09 AM, Andrei Alexandrescu wrote:
 On 06/02/2016 09:05 AM, Steven Schveighoffer wrote:
 On 6/1/16 6:24 PM, Andrei Alexandrescu wrote:
 On 06/01/2016 06:09 PM, ZombineDev wrote:
 Deprecating front, popFront and empty for narrow
 strings is what we are talking about here.
That will not happen. Walter and I consider the cost excessive and the benefit too small.
If this doesn't happen, then all this push to change anything in Phobos is completely wasted effort.
Really? "Anything"?
The push to make Phobos only use byDchar (or any other band-aid fix for this issue) is what I meant by anything. Not "anything" anything :)
 As long as arrays aren't treated like
 arrays, we will have to deal with auto-decoding.

 You can change string literals to be something other than arrays, and
 then we have a path forward. But as long as char[] is not an array, we
 have lost the battle of sanity.
Yeah, it's a miracle the language stays glued eh.
I mean as far as narrow strings are concerned. To have the language tell me, yes, char[] is an array with a .length member, but hasLength is false? What, str[4] works, but isRandomAccessRange is false? Maybe it's more Orwellian than insane: Phobos is saying 2 + 2 = 5 ;)
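For the record, that inconsistency is easy to check against the range traits. A minimal sketch, assuming the current definitions in std.range:

import std.range;

void main()
{
    string s = "hello";
    assert(s.length == 5);    // it has a length...
    assert(s[4] == 'o');      // ...and it supports indexing...
    static assert(!hasLength!string);               // ...yet Phobos says there is no usable length
    static assert(!isRandomAccessRange!string);     // ...and no random access
    static assert(is(ElementType!string == dchar)); // because the "elements" are decoded dchars
}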
 Your post is a prime example that this thread has lost the battle of
 sanity. I'll destroy you in person tonight.
It's the cynicism of talking/debating about this for years and years and not seeing any progress. We can discuss of course, and see who gets destroyed :) And yes, I'm about to kill this thread from my newsreader, since it's wasting too much of my time... -Steve
Jun 02 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/02/2016 09:25 AM, Steven Schveighoffer wrote:
 And yes, I'm about to kill this thread from my newsreader, since it's
 wasting too much of my time...
A good idea for all of us. Could you also please look at my post on our meetup page? Thx! -- Andrei
Jun 02 2016
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 15:09, Andrei Alexandrescu wrote:
 You can change string literals to be something other than arrays, and
 then we have a path forward. But as long as char[] is not an array, we
 have lost the battle of sanity.
Yeah, it's a miracle the language stays glued eh. ...
It's not a language problem. Just avoid Phobos.
 Your post is a prime example that this thread has lost the battle of
 sanity.
He is just saying that the fundamental reason why autodecoding is bad is that it denies that T[] is an array for any T.
Jun 02 2016
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Wednesday, 1 June 2016 at 19:52:01 UTC, Andrei Alexandrescu 
wrote:
 On 06/01/2016 03:07 PM, ZombineDev wrote:
 This is not autodecoding. There is nothing auto-magic w.r.t. 
 strings in
 plain foreach.
I understand where you're coming from, but it actually is autodecoding. Consider:

byte[] a;
foreach (byte x; a) {}
foreach (short x; a) {}
foreach (int x; a) {}

That works by means of a conversion short->int. However:

char[] a;
foreach (char x; a) {}
foreach (wchar x; a) {}
foreach (dchar x; a) {}

The latter two do autodecoding, not conversion as the rest of the language.

Andrei
This, deep down, points at the fact that conversions from/to char types are ill-defined. One should be able to convert from char to byte/ubyte but not the other way around. One should be able to convert from byte to short but not from char to wchar. Once you disable the naive conversions, then the autodecoding in foreach isn't inconsistent anymore.
Jun 02 2016
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 12:38, deadalnix wrote:

 This, deep down, points at the fact that conversions from/to char types
 are ill-defined.

 One should be able to convert from char to byte/ubyte but not the other
 way around.
 One should be able to convert from byte to short but not from char to
 wchar.

 Once you disable the naive conversions, then the autodecoding in foreach
 isn't inconsistent anymore.
The current situation is bad:

void main(){
    import std.utf, std.stdio;
    foreach(dchar d; "∑") writeln(d);            // "∑"
    foreach(dchar d; "∑".byCodeUnit) writeln(d); // "â", "\210", "\221"
}

Implicit conversion should not happen, and I'd prefer both of them to behave the same. (I.e. make both a compile-time error or decode for both).
Jun 02 2016
prev sibling next sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:
 On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
 On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d 
wrote:
 On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
 Saying that operating at the code point level - UTF-32 - is correct
 is like saying that operating at UTF-16 instead of UTF-8 is correct.
Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei
Does walkLength yield the same number for all representations?
walkLength treats a code point like it's a character. My point is that that's incorrect behavior. It will not result in correct string processing in the general case, because a code point is not guaranteed to be a full character. walkLength does not report the length of a character as one in all cases just like length does not report the length of a character as one in all cases. walkLength is counting bigger units than length, but it's still counting pieces of a character rather than counting full characters.
 And you can even put that accent on 0 by doing something like

 assert("0"d ~ "é"d.normalize!NFKD[1] == "0́"d);

 One or more code units combine to make a single code point, but one or
 more
 code points also combine to make a grapheme.
That's right. D's handling of UTF is at the code unit level (like all of Unicode is portably defined). If you want graphemes use byGrapheme. It seems you destroyed your own argument, which was:
 Saying that operating at the code point level - UTF-32 - is correct
 is like saying that operating at UTF-16 instead of UTF-8 is correct.
You can't claim code units are just a special case of code points.
The point is that treating a code point like it's a full character is just as wrong as treating a code unit as if it were a full character. It's _not_ guaranteed to be a full character. Treating code points as full characters does give you the correct result in more cases than treating a code unit as a full character gives you the correct result, but it still gives you the wrong result in many cases. If we want to have fully correct behavior without making the programmer deal with all of the Unicode issues themselves, then we need to operate at the grapheme level so that we are operating on full characters (though that obviously comes at a high cost to efficiency). Treating code points as characters like we do right now does not give the correct result in the general case just like treating code units as characters doesn't give the correct result in the general case. Both work some of the time, but neither works all of the time. Autodecoding attempts to hide the fact that it's operating on Unicode but does not actually go far enough to result in correct behavior. So, we pay the cost of decoding without getting the benefit of correctness. - Jonathan M Davis
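To make the different counts concrete, here is a minimal sketch (assuming std.range.walkLength and std.uni.byGrapheme behave as documented) using a single user-visible character built from two code points:

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301";                  // 'e' followed by a combining acute accent: one character
    assert(s.length == 3);                 // UTF-8 code units
    assert(s.walkLength == 2);             // code points -- what autodecoding counts
    assert(s.byGrapheme.walkLength == 1);  // graphemes, i.e. full characters
}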
May 31 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 31.05.2016 20:53, Jonathan M Davis via Digitalmars-d wrote:
 On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:
On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d
wrote:
On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
Saying that operating at the code point level - UTF-32 - is correct
is like saying that operating at UTF-16 instead of UTF-8 is correct.
Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei
Does walkLength yield the same number for all representations?
walkLength treats a code point like it's a character. My point is that that's incorrect behavior. It will not result in correct string processing in the general case, because a code point is not guaranteed to be a full character. ...
What's "correct"? Maybe the user intended to count the number of code points in order to pre-allocate a dchar[] of the correct size. Generally, I don't see how algorithms become magically "incorrect" when applied to utf code units.
 walkLength does not report the length of a character as one in all cases
 just like length does not report the length of a character as one in all
 cases. walkLength is counting bigger units than length, but it's still
 counting pieces of a character rather than counting full characters.
The 'length' of a character is not one in all contexts. The following text takes six columns in my terminal: 日本語 123456
May 31 2016
next sibling parent reply Wyatt <wyatt.epp gmail.com> writes:
On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:
 The 'length' of a character is not one in all contexts.
 The following text takes six columns in my terminal:

 日本語
 123456
That's a property of your font and font rendering engine, not Unicode. (Also, it's probably not quite six columns; in most fonts I've tested, 漢字 are rendered as something like 1.5 characters wide, assuming your terminal doesn't overlap them.) -Wyatt
May 31 2016
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 31.05.2016 21:40, Wyatt wrote:
 On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:
 The 'length' of a character is not one in all contexts.
 The following text takes six columns in my terminal:

 日本語
 123456
That's a property of your font and font rendering engine, not Unicode.
Sure. Hence "context". If you are e.g. trying to manually underline some text in console output, for example in a compiler error message, counting the number of characters will not actually be what you want, even though it works reliably for ASCII text.
 (Also, it's probably not quite six columns; most fonts I've tested, 漢字
 are rendered as something like 1.5 characters wide, assuming your
 terminal doesn't overlap them.)

 -Wyatt
It's precisely six columns in my terminal (also in emacs and in gedit). My point was, how can std.algorithm ever guess correctly what you /actually/ intended to do?
May 31 2016
parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 21:48:36 Timon Gehr via Digitalmars-d wrote:
 On 31.05.2016 21:40, Wyatt wrote:
 On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:
 The 'length' of a character is not one in all contexts.
 The following text takes six columns in my terminal:

 日本語
 123456
That's a property of your font and font rendering engine, not Unicode.
Sure. Hence "context". If you are e.g. trying to manually underline some text in console output, for example in a compiler error message, counting the number of characters will not actually be what you want, even though it works reliably for ASCII text.
 (Also, it's probably not quite six columns; most fonts I've tested, 漢字
 are rendered as something like 1.5 characters wide, assuming your
 terminal doesn't overlap them.)

 -Wyatt
It's precisely six columns in my terminal (also in emacs and in gedit). My point was, how can std.algorithm ever guess correctly what you /actually/ intended to do?
It can't, which is precisely why having it select for you was a bad design decision. The programmer needs to be making that decision. And the fact that Phobos currently makes that decision for you means that it's often doing the wrong thing, and the fact that it chose to decode code points by default means that it's often eating up unnecessary cycles to boot. - Jonathan M Davis
May 31 2016
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, May 31, 2016 at 07:40:13PM +0000, Wyatt via Digitalmars-d wrote:
 On Tuesday, 31 May 2016 at 19:20:19 UTC, Timon Gehr wrote:
 
 The 'length' of a character is not one in all contexts.
 The following text takes six columns in my terminal:
 
 日本語
 123456
That's a property of your font and font rendering engine, not Unicode. (Also, it's probably not quite six columns; most fonts I've tested, 漢字 are rendered as something like 1.5 characters wide, assuming your terminal doesn't overlap them.)
[...] I believe he was talking about a console terminal that uses 2 columns to render the so-called "double width" characters. The CJK block does contain "double-width" versions of selected blocks (e.g., the ASCII block), to be used with said characters. Of course, using string length to measure string width is a risky venture fraught with pitfalls, because your terminal may not actually render them the way you think it should. Nevertheless, it does serve to highlight why a construct like s.walkLength is essentially buggy, because there is not enough information to determine which length it should return -- length of the buffer in bytes, or the number of code points, or the number of graphemes, or the width of the string. No matter which choice you make, it only works for a subset of cases and is wrong for the other cases. This is a prime illustration of why forcing autodecoding on every string in D is a wrong design. T -- Не дорог подарок, дорога любовь.
May 31 2016
prev sibling parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 21:20:19 Timon Gehr via Digitalmars-d wrote:
 On 31.05.2016 20:53, Jonathan M Davis via Digitalmars-d wrote:
 On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d 
wrote:
On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via
Digitalmars-d
wrote:
On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
Saying that operating at the code point level - UTF-32 - is
correct
is like saying that operating at UTF-16 instead of UTF-8 is
correct.
Could you please substantiate that? My understanding is that code unit is a higher-level Unicode notion independent of encoding, whereas code point is an encoding-dependent representation detail. -- Andrei
Does walkLength yield the same number for all representations?
walkLength treats a code point like it's a character. My point is that that's incorrect behavior. It will not result in correct string processing in the general case, because a code point is not guaranteed to be a full character. ...
What's "correct"? Maybe the user intended to count the number of code points in order to pre-allocate a dchar[] of the correct size. Generally, I don't see how algorithms become magically "incorrect" when applied to utf code units.
In the vast majority of cases what folks care about is full characters, which is not what code points are. But the fact that they want different things in different situation just highlights the fact that just converting everything to code points by default is a bad idea. And even worse, code points are usually the worst choice. Many operations don't require decoding and can be done at the code unit level, meaning that operating at the code point level is just plain inefficient. And the vast majority of the operations that can't operate at the code point level, then need to operate on full characters, which means that they need to be operating at the grapheme level. Code points are in this weird middle ground that's useful in some cases but usually isn't what you want or need. We need to be able to operate at the code unit level, the code point level, and the grapheme level. But defaulting to the code point level really makes no sense.
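All three levels are already reachable explicitly; the complaint is only about which one is forced on you by default. A minimal sketch, assuming std.utf.byCodeUnit and std.uni.byGrapheme from current Phobos:

import std.range : walkLength;
import std.utf : byCodeUnit;
import std.uni : byGrapheme;

void main()
{
    string s = "ma\u00F1ana";                // "mañana" with a precomposed ñ
    assert(s.byCodeUnit.walkLength == 7);    // code unit level: no decoding at all
    assert(s.walkLength == 6);               // code point level: the forced default
    assert(s.byGrapheme.walkLength == 6);    // grapheme level: full characters
}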
 walkLength does not report the length of a character as one in all cases
 just like length does not report the length of a character as one in all
 cases. walkLength is counting bigger units than length, but it's still
 counting pieces of a character rather than counting full characters.
The 'length' of a character is not one in all contexts. The following text takes six columns in my terminal: 日本語 123456
Well, that's getting into displaying characters which is a whole other can of worms, but it also highlights that assuming that the programmer wants a particular level of unicode is not a particularly good idea and that we should avoid converting for them without being asked, since it risks being inefficient to no benefit. - Jonathan M Davis
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:
 In the vast majority of cases what folks care about is full character
How are you so sure? -- Andrei
May 31 2016
next sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
Am Tue, 31 May 2016 16:56:43 -0400
schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:

 On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:
 In the vast majority of cases what folks care about is full character  
How are you so sure? -- Andrei
Because a full character is the typical unit of a written language. It's what we visualize in our heads when we think about finding a substring or counting characters. A special case of this is the reduction to ASCII where we can use code units in place of grapheme clusters. -- Marco
May 31 2016
parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 23:36:20 Marco Leise via Digitalmars-d wrote:
 Am Tue, 31 May 2016 16:56:43 -0400

 schrieb Andrei Alexandrescu <SeeWebsiteForEmail erdani.org>:
 On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:
 In the vast majority of cases what folks care about is full character
How are you so sure? -- Andrei
Because a full character is the typical unit of a written language. It's what we visualize in our heads when we think about finding a substring or counting characters. A special case of this is the reduction to ASCII where we can use code units in place of grapheme clusters.
Exactly. How many folks here have written code where the correct thing to do is to search on code points? Under what circumstances is that even useful? Code points are a mid-level abstraction between UTF-8/16 and graphemes that are not particularly useful on their own. Yes, by using code points, we eliminate the differences between the encodings, but how much code even operates on multiple string types? Having all of your strings have the same encoding fixes the consistency problem just as well as autodecoding to dchar everywhere does - and without the efficiency hit. Typically, folks operate on string or char[] unless they're talking to the Windows API, in which case, they need wchar[]. Our general recommendation is that D code operate on UTF-8 except when it needs to operate on a different encoding because of other stuff it has to interact with (like the Win32 API), in which case, ideally it converts those strings to UTF-8 once they get into the D code and operates on them as UTF-8, and anything that has to be output in a different encoding is operated on as UTF-8 until it needs to be output, in which case, it's converted to UTF-16 or whatever the target encoding is. Not much of anyone is recommending that you use dchar[] everywhere, but that's essentially what the range API is trying to force. I think that it's very safe to say that the vast majority of string processing either is looking to operate on strings as a whole or on individual, full characters within a string. Code points are neither. While code may play tricks with Unicode to be efficient (e.g. operating at the code unit level where it can rather than decoding to either code points or graphemes), or it might make assumptions about its data being ASCII-only, aside from explicit Unicode processing code, I have _never_ seen code that was actually looking to logically operate on only pieces of characters. While it may operate on code units for efficiency, it's always looking to be logically operating on string as a unit or on whole characters. Anyone looking to operate on code points is going to need to take into account the fact that they're not full characters, just like anyone who operates on code units needs to take into account the fact that they're not whole characters. Operating on code points as if they were characters - which is exactly what D currently does with ranges - is just plain wrong. We need to support operating at the code point level for those rare cases where it's actually useful, but autodecoding makes no sense. It incurs a performance penalty without actually giving correct results except in those rare cases where you want code points instead of full characters. And only Unicode experts are ever going to want that. The average programmer who is not super Unicode savvy doesn't even know what code points are. They're clearly going to be looking to operate on strings as sequences of characters, not sequences of code points. I don't see how anyone could expect otherwise. Code points are a mid-level, Unicode abstraction that only those who are Unicode savvy are going to know or care about, let alone want to operate on. - Jonathan M Davis
May 31 2016
parent Jack Stouffer <jack jackstouffer.com> writes:
On Wednesday, 1 June 2016 at 02:17:21 UTC, Jonathan M Davis wrote:
 ...
This thread is going in circles; the against crowd has stated each of their arguments very clearly at least five times in different ways. The cost/benefit problems with auto decoding are as clear as day. If the evidence already presented in this thread (and in the many others) isn't enough to convince people of that, then I don't think anything else said will have an impact. I don't want to sound like someone telling people not to discuss this anymore, but honestly, what is continuing this thread going to accomplish?
May 31 2016
prev sibling parent Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Tuesday, 31 May 2016 at 20:56:43 UTC, Andrei Alexandrescu 
wrote:
 On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d 
 wrote:
 In the vast majority of cases what folks care about is full 
 character
How are you so sure? -- Andrei
He doesn't need to be sure. You are the one advocating for code points, so the burden is on you to present evidence that it's the correct choice.
Jun 01 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 02:53 PM, Jonathan M Davis via Digitalmars-d wrote:
 walkLength treats a code point like it's a character.
No, it treats a code point like it's a code point. -- Andrei
May 31 2016
parent reply Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 15:33:38 Andrei Alexandrescu via Digitalmars-d wrote:
 On 05/31/2016 02:53 PM, Jonathan M Davis via Digitalmars-d wrote:
 walkLength treats a code point like it's a character.
No, it treats a code point like it's a code point. -- Andrei
Wasn't the whole point of operating at the code point level by default to make it so that code would be operating on full characters by default instead of chopping them up as is so easy to do when operating at the code unit level? Thanks to how Phobos treats strings as ranges of dchar, most D code treats code points as if they were characters. So, whether it's correct or not, a _lot_ of D code is treating walkLength like it returns the number of characters in a string. And if walkLength doesn't provide the number of characters in a string, why would I want to use it under normal circumstances? Why would I want to be operating at the code point level in my code? It's not necessarily a full character, since it's not necessarily a grapheme. So, by using walkLength and front and popFront and whatnot with strings, I'm not getting full characters. I'm still only getting pieces of characters - just like would happen if strings were treated as ranges of code units. I'm just getting bigger pieces of the characters out of the deal. But if they're not full characters, what's the point? I am sure that there is code that is going to want to operate at the code point level, but your average program is either operating on strings as a whole or individual characters. As long as strings are being operated on as a whole, code units are generally plenty, and careful encoding of characters into code units for comparisons means that much of the time that you want to operate on individual characters, you can still operate at the code unit level. But if you can't, then you need the grapheme level, because a code point is not necessarily a full character. So, what is the point of operating on code points in your average D program? walkLength will not always tell me the number of characters in a string. front risks giving me a partial character rather than a whole one. Slicing dchar[] risks chopping up characters just like slicing char[] does. Operating on code points by default does not result in correct string processing. I honestly don't see how autodecoding is defensible. We may not be able to get rid of it due to the breakage that doing that would cause, but I fail to see how it is at all desirable that we have autodecoded strings. I can understand how we got it if it's based on a misunderstanding on your part about how Unicode works. We all make mistakes. But I fail to see how autodecoding wasn't a mistake. It's the worst of both worlds - inefficient while still incorrect. At least operating at the code unit level would be fast while being incorrect, and it would be obviously incorrect once you did anything with non-ASCII values, whereas it's easy to miss that ranges of dchar are doing the wrong thing too - Jonathan M Davis
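As a minimal illustration of both failure modes (assuming std.utf.validate and the usual range primitives): front hands back a lone code point with the accent stripped, and slicing, whether by code unit or by code point, can still cut a character apart.

import std.range : front;
import std.utf : validate;
import std.exception : assertThrown;

void main()
{
    string s = "e\u0301";                // 'e' + combining acute: one character, two code points
    assert(s.front == 'e');              // the accent is silently dropped
    assertThrown(validate(s[0 .. 2]));   // slicing by code units can even split the accent's UTF-8
    dstring d = "e\u0301"d;
    assert(d[0 .. 1] == "e"d);           // slicing dchar[] still chops the character apart
}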
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote:
 Wasn't the whole point of operating at the code point level by default to
 make it so that code would be operating on full characters by default
 instead of chopping them up as is so easy to do when operating at the code
 unit level?
The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units). That's the contract, and it seems meaningful seeing how Unicode is defined in terms of code points as its abstract building block. If user code needs to go lower at the code unit level, they can do so. If user code needs to go upper at the grapheme level, they can do so. If anything this thread strengthens my opinion that autodecoding is a sweet spot. -- Andrei
May 31 2016
next sibling parent Max Samukha <maxsamukha gmail.com> writes:
On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu 
wrote:

 If user code needs to go upper at the grapheme level, they can 
 If anything this thread strengthens my opinion that 
 autodecoding is a sweet spot. -- Andrei
Unicode FAQ disagrees (http://unicode.org/faq/utf_bom.html): "Q: How about using UTF-32 interfaces in my APIs? A: Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. With UTF-16 APIs the low level indexing is at the storage or code unit level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the code units. This provides efficiency at the low levels, and the required functionality at the high levels."
May 31 2016
prev sibling next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, May 31, 2016 at 05:01:17PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote:
 Wasn't the whole point of operating at the code point level by
 default to make it so that code would be operating on full
 characters by default instead of chopping them up as is so easy to
 do when operating at the code unit level?
The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units).
This is basically saying that we operate on dchar[] by default, except that we disguise its detrimental memory usage consequences by compressing to UTF-8/UTF-16 and incurring the cost of decompression every time we access its elements. Perhaps you love the idea of running an OS that stores all files in compressed form and always decompresses upon every syscall to read(), but I prefer a higher-performance system.
 That's the contract, and it seems meaningful
 seeing how Unicode is defined in terms of code points as its abstract
 building block.
Where's this contract stated, and when did we sign up for this?
 If user code needs to go lower at the code unit level, they can do so.
 If user code needs to go upper at the grapheme level, they can do so.
Only with much pain by using workarounds to bypass meticulously-crafted autodecoding algorithms in Phobos.
 If anything this thread strengthens my opinion that autodecoding is a
 sweet spot. -- Andrei
No, autodecoding is a stalemate that's neither fast nor correct. T -- "Real programmers can write assembly code in any language. :-)" -- Larry Wall
May 31 2016
prev sibling parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu 
wrote:
 On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d 
 wrote:
 Wasn't the whole point of operating at the code point level by 
 default to
 make it so that code would be operating on full characters by 
 default
 instead of chopping them up as is so easy to do when operating 
 at the code
 unit level?
The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units).
_Both_ are low-level representation-specific artifacts.
Jun 01 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/01/2016 06:25 AM, Marc Schütz wrote:
 On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:
 On 05/31/2016 04:01 PM, Jonathan M Davis via Digitalmars-d wrote:
 Wasn't the whole point of operating at the code point level by
 default to
 make it so that code would be operating on full characters by default
 instead of chopping them up as is so easy to do when operating at the
 code
 unit level?
The point is to operate on representation-independent entities (Unicode code points) instead of low-level representation-specific artifacts (code units).
_Both_ are low-level representation-specific artifacts.
Maybe this is a misunderstanding. Representation = how things are laid out in memory. What does associating numbers with various Unicode symbols have to do with representation? -- Andrei
Jun 01 2016
next sibling parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 06/01/2016 10:29 AM, Andrei Alexandrescu wrote:
 On 06/01/2016 06:25 AM, Marc Schütz wrote:
 On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:
 The point is to operate on representation-independent entities
 (Unicode code points) instead of low-level representation-specific
 artifacts (code units).
_Both_ are low-level representation-specific artifacts.
Maybe this is a misunderstanding. Representation = how things are laid out in memory. What does associating numbers with various Unicode symbols have to do with representation? -- Andrei
As has been explained countless times already, code points are a non-1:1 internal representation of graphemes. Code points don't exist for their own sake, their entire existence is purely as a way to encode graphemes. Whether that technically qualifies as "memory representation" or not is irrelevant: it's still a low-level implementation detail of text.
Jun 01 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/01/2016 12:41 PM, Nick Sabalausky wrote:
 As has been explained countless times already, code points are a non-1:1
 internal representation of graphemes. Code points don't exist for their
 own sake, their entire existence is purely as a way to encode graphemes.
Of course, thank you.
 Whether that technically qualifies as "memory representation" or not is
 irrelevant: it's still a low-level implementation detail of text.
The relevance is meandering across the discussion, and it's good to have the same definitions for terms. Unicode code points are abstract notions with meanings attached to them, whereas UTF8/16/32 are concerned with their representation. Andrei
Jun 01 2016
prev sibling parent Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Wednesday, 1 June 2016 at 14:29:58 UTC, Andrei Alexandrescu 
wrote:
 On 06/01/2016 06:25 AM, Marc Schütz wrote:
 On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu 
 wrote:
 The point is to operate on representation-independent entities
 (Unicode code points) instead of low-level 
 representation-specific
 artifacts (code units).
_Both_ are low-level representation-specific artifacts.
Maybe this is a misunderstanding. Representation = how things are laid out in memory. What does associating numbers with various Unicode symbols have to do with representation? --
Ok, if you define it that way, sure. I was thinking in terms of the actual text: Unicode is a way to represent that text using a variety of low-level representations: UTF8/NFC, UTF8/NFD, unnormalized UTF8, UTF16 big/little endian x normalization, UTF32 x normalization, some other more obscure ones. From that viewpoint, auto decoded char[] (= UTF8) is equivalent to dchar[] (= UTF32). Neither of them is the actual text. Both writing and the memory representation consist of fundamental units. But there is no 1:1 relationship between the units of char[] (UTF8 code units) or auto decoded strings (Unicode code points) on the one hand, and the units of writing (graphemes) on the other.
Jun 02 2016
prev sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via Digitalmars-d
wrote:
[...]
 Does walkLength yield the same number for all representations?
Let's put the question this way. Given the following string, what do *you* think walkLength should return? şŭt̥ḛ́k̠ I think any reasonable person would have to say it should return 5, because there are 5 visual "characters" here. Otherwise, what is even the meaning of walkLength?! For it to return anything other than 5 means that it's a leaky abstraction, because it's leaking low-level "implementation details" of the Unicode representation of this string. However, with the current implementation of autodecoding, walkLength returns 11. Can anyone reasonably argue that it's reasonable for "şŭt̥ḛ́k̠".walkLength to equal 11? What difference does this make if we get rid of autodecoding, and walkLength returns 17 instead? *Both* are wrong. 17 is actually the right answer if you're looking to allocate a buffer large enough to hold this string, because that's the number of bytes it occupies. 5 is the right answer to an end user who knows nothing about Unicode. 11 is an answer to a question that only makes sense to a Unicode specialist, and that no layperson understands. 11 is the answer we currently give. And that, at the cost of across-the-board performance degradation. Yet you're seriously arguing that 11 should be the right answer, by insisting that the current implementation of autodecoding is "correct". It boggles the mind. T -- Today's society is one of specialization: as you grow, you learn more and more about less and less. Eventually, you know everything about nothing.
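Those three numbers are easy to check. A minimal sketch, assuming one plausible decomposition of that string into five ASCII letters plus six combining marks (the exact marks chosen do not change the counts):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "s\u0327u\u0306t\u0325e\u0301\u0330k\u0320";
    assert(s.length == 17);                // UTF-8 code units -- the buffer size
    assert(s.walkLength == 11);            // code points -- today's answer
    assert(s.byGrapheme.walkLength == 5);  // graphemes -- what a reader calls characters
}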
May 31 2016
next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:
 On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via
Digitalmars-d wrote:
 [...]
 Does walkLength yield the same number for all representations?
Let's put the question this way. Given the following string, what do *you* think walkLength should return?
Compiler error. -Steve
May 31 2016
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 31.05.2016 21:51, Steven Schveighoffer wrote:
 On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:
 On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via
 Digitalmars-d wrote:
 [...]
 Does walkLength yield the same number for all representations?
Let's put the question this way. Given the following string, what do *you* think walkLength should return?
Compiler error. -Steve
What about e.g. joiner?
May 31 2016
next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Tue, May 31, 2016 at 10:38:03PM +0200, Timon Gehr via Digitalmars-d wrote:
 On 31.05.2016 21:51, Steven Schveighoffer wrote:
 On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:
 On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via
 Digitalmars-d wrote:
 [...]
 Does walkLength yield the same number for all representations?
Let's put the question this way. Given the following string, what do *you* think walkLength should return?
Compiler error. -Steve
What about e.g. joiner?
joiner is one of those algorithms that can work perfectly fine *without* autodecoding anything at all. The only time it'd actually need to decode would be if you're joining a set of UTF-8 strings with a UTF-16 delimiter, or some other such combination, which should be pretty rare. After all, within the same application you'd usually only be dealing with a single encoding rather than mixing UTF-8, UTF-16, and UTF-32 willy-nilly. (Unless the code is specifically written for transcoding, in which case decoding is part of the job description, so it should be expected that the programmer ought to know how to do it properly without needing Phobos to do it for him.) Even in the case of s.joiner('Ш'), joiner could easily convert that dchar into a short UTF-8 string and then operate directly on UTF-8. T -- Just because you survived after you did it, doesn't mean it wasn't stupid!
May 31 2016
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 5/31/16 4:38 PM, Timon Gehr wrote:
 On 31.05.2016 21:51, Steven Schveighoffer wrote:
 On 5/31/16 3:32 PM, H. S. Teoh via Digitalmars-d wrote:
 On Tue, May 31, 2016 at 02:30:08PM -0400, Andrei Alexandrescu via
 Digitalmars-d wrote:
 [...]
 Does walkLength yield the same number for all representations?
Let's put the question this way. Given the following string, what do *you* think walkLength should return?
Compiler error.
What about e.g. joiner?
Compiler error. Better than what it does now. -Steve
May 31 2016
parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Wednesday, 1 June 2016 at 01:13:17 UTC, Steven Schveighoffer 
wrote:
 On 5/31/16 4:38 PM, Timon Gehr wrote:
 What about e.g. joiner?
Compiler error. Better than what it does now.
I believe everything that does only concatenation will work correctly. That's why joiner() is one of those algorithms that should accept strings directly without going through any decoding (but it may need to recode the joining element itself, of course).
Jun 01 2016
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/1/16 6:31 AM, Marc Schütz wrote:
 On Wednesday, 1 June 2016 at 01:13:17 UTC, Steven Schveighoffer wrote:
 On 5/31/16 4:38 PM, Timon Gehr wrote:
 What about e.g. joiner?
Compiler error. Better than what it does now.
I believe everything that does only concatenation will work correctly. That's why joiner() is one of those algorithms that should accept strings directly without going through any decoding (but it may need to recode the joining element itself, of course).
This means that a string is a range. What is it a range of? If you want to make it a range of code units, I think you will lose that battle. If you want to special-case joiner for strings, that's always possible. Or string could be changed to be a range of dchar struct explicitly. Then at least joiner makes sense, and I can reasonably explain why it behaves the way it does. -Steve
Jun 02 2016
next sibling parent reply Marc =?UTF-8?B?U2Now7x0eg==?= <schuetzm gmx.net> writes:
On Thursday, 2 June 2016 at 13:11:10 UTC, Steven Schveighoffer 
wrote:
 On 6/1/16 6:31 AM, Marc Schütz wrote:
 I believe everything that does only concatenation will work 
 correctly.
 That's why joiner() is one of those algorithms that should 
 accept
 strings directly without going through any decoding (but it 
 may need to
 recode the joining element itself, of course).
This means that a string is a range. What is it a range of? If you want to make it a range of code units, I think you will lose that battle.
No, I don't want to make string a range of anything, I want to provide an additional overload for joiner() that accepts a const(char)[], and returns a range of chars. The remark about the joining element is that ["abc", "xyz"].joiner(","d) should convert ","d to "," first, to match the element type of the elements. But this is purely a convenience; it can also be pushed to the user.
 If you want to special-case joiner for strings, that's always 
 possible.
Yes, that's what I want. Sorry if it wasn't clear.
 Or string could be changed to be a range of dchar struct 
 explicitly. Then at least joiner makes sense, and I can 
 reasonably explain why it behaves the way it does.

 -Steve
Jun 02 2016
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02.06.2016 15:48, Marc Schütz wrote:

 No, I don't want to make string a range of anything, I want to provide
 an additional overload for joiner() that accepts a const(char)[], and
 returns a range of chars.
If strings are not ranges, returning a range of chars is inconsistent.
Jun 02 2016
prev sibling parent Kagamin <spam here.lot> writes:
On Thursday, 2 June 2016 at 13:11:10 UTC, Steven Schveighoffer 
wrote:
 This means that a string is a range. What is it a range of? If 
 you want to make it a range of code units, I think you will 
 lose that battle.
After the first migration step, joiner will return a decoded dchar range just like it does now; only the code will change internally, and there will be no observable semantic difference to the user. Anyway, read Walter's proposal in the thread about dealing with autodecode.
Jun 02 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:
 Let's put the question this way. Given the following string, what do
 *you*  think walkLength should return?

 	şŭt̥ḛ́k̠
The number of code units in the string. That's the contract promised and honored by Phobos. -- Andrei
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
 On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:
 Let's put the question this way. Given the following string, what do
 *you*  think walkLength should return?

     şŭt̥ḛ́k̠
The number of code units in the string. That's the contract promised and honored by Phobos. -- Andrei
Code points I mean. -- Andrei
May 31 2016
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
 On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
 On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:
 Let's put the question this way. Given the following string, what do
 *you*  think walkLength should return?

     şŭt̥ḛ́k̠
The number of code units in the string. That's the contract promised and honored by Phobos. -- Andrei
Code points I mean. -- Andrei
Yes, we know it's the contract. ***That's the problem.*** As everybody is saying, it *SHOULDN'T* be the contract. Why shouldn't it be the contract? Because it's proven itself, both logically (as presented by pretty much everybody other than you in both this and other threads) and empirically (in phobos, warp, and other user code) to be both the least useful and most PITA option.
May 31 2016
parent Jonathan M Davis via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Tuesday, May 31, 2016 20:38:14 Nick Sabalausky via Digitalmars-d wrote:
 On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
 On 05/31/2016 04:55 PM, Andrei Alexandrescu wrote:
 On 05/31/2016 03:32 PM, H. S. Teoh via Digitalmars-d wrote:
 Let's put the question this way. Given the following string, what do
 *you*  think walkLength should return?

     şŭt̥ḛ́k̠
The number of code units in the string. That's the contract promised and honored by Phobos. -- Andrei
Code points I mean. -- Andrei
Yes, we know it's the contract. ***That's the problem.*** As everybody is saying, it *SHOULDN'T* be the contract. Why shouldn't it be the contract? Because it's proven itself, both logically (as presented by pretty much everybody other than you in both this and other threads) and empirically (in phobos, warp, and other user code) to be both the least useful and most PITA option.
Exactly. Operating at the code point level rarely makes sense. What sorts of algorithms purposefully do that in a typical program? Unless you're doing very specific Unicode stuff or somehow know that your strings don't contain any graphemes that are made up of multiple code points, operating at the code point level is just bug-prone, and unless you're using dchar[] everywhere, it's slow to boot, because your strings have to be decoded whether the algorithm needs to or not. I think that it's very safe to say that the vast majority of string algorithms are either able to operate at the code unit level without decoding (though possibly encoding another string to match - e.g. with a string comparison or search), or they have to operate at the grapheme level in order to deal with full characters. A code point is borderline useless on its own. It's just a step above the different UTF encodings without actually getting to proper characters. - Jonathan M Davis
May 31 2016
prev sibling parent reply ag0aep6g <anonymous example.com> writes:
On 05/31/2016 07:21 PM, Andrei Alexandrescu wrote:
 Could you please substantiate that? My understanding is that code unit
 is a higher-level Unicode notion independent of encoding, whereas code
 point is an encoding-dependent representation detail. -- Andrei
You got the terms mixed up. Code unit is lower level. Code point is higher level. One code point is encoded with one or more code units. char is a UTF-8 code unit. wchar is a UTF-16 code unit. dchar is both a UTF-32 code unit and a code point, because in UTF-32 it's a 1-to-1 relation.
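A minimal sketch of that relation, using literal suffixes to pick the encoding:

void main()
{
    // The same code point takes a different number of code units in each encoding.
    assert("\u2211".length  == 3);  // '∑': three UTF-8 code units (char)
    assert("\u2211"w.length == 1);  // one UTF-16 code unit (wchar)
    assert("\u2211"d.length == 1);  // one UTF-32 code unit (dchar) == one code point
    assert("\U0001F600".length  == 4);  // an emoji: four chars,
    assert("\U0001F600"w.length == 2);  // a surrogate pair of wchars,
    assert("\U0001F600"d.length == 1);  // and still exactly one code point
}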
May 31 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 03:34 PM, ag0aep6g wrote:
 On 05/31/2016 07:21 PM, Andrei Alexandrescu wrote:
 Could you please substantiate that? My understanding is that code unit
 is a higher-level Unicode notion independent of encoding, whereas code
 point is an encoding-dependent representation detail. -- Andrei
You got the terms mixed up. Code unit is lower level. Code point is higher level.
Apologies and thank you. -- Andrei
May 31 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
 The standard library has to fight against itself because of autodecoding!
 The vast majority of the algorithms in Phobos are special-cased on strings
 in an attempt to get around autodecoding. That alone should highlight the
 fact that autodecoding is problematic.
The way I see it is it's specialization to speed things up without giving up the higher level abstraction. -- Andrei
May 31 2016
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 05/31/2016 01:23 PM, Andrei Alexandrescu wrote:
 On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
 The standard library has to fight against itself because of autodecoding!
 The vast majority of the algorithms in Phobos are special-cased on
 strings
 in an attempt to get around autodecoding. That alone should highlight the
 fact that autodecoding is problematic.
The way I see it is it's specialization to speed things up without giving up the higher level abstraction. -- Andrei
Problem is, that "higher"[1] level abstraction you don't want to give up (i.e. working on code points) is rarely useful, and yet by default everyone pays its price. [1] It's really the mid-level abstraction - grapheme is the high-level one (and the more likely to be useful).
May 31 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/26/2016 9:00 AM, Andrei Alexandrescu wrote:
 My thesis: the D1 design decision to represent strings as char[] was disastrous
 and probably one of the largest weaknesses of D1. The decision in D2 to use
 immutable(char)[] for strings is a vast improvement but still has a number of
 issues.
The mutable vs immutable has nothing to do with autodecoding.
 On 05/12/2016 04:15 PM, Walter Bright wrote:
 On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
 2. Every time one wants an algorithm to work with both strings and
 ranges, you wind up special casing the strings to defeat the
 autodecoding, or to decode the ranges. Having to constantly special case
 it makes for more special cases when plugging together components. These
 issues often escape detection when unittesting because it is convenient
 to unittest only with arrays.
This is a consequence of 1. It is at least partially fixable.
It's a consequence of autodecoding, not arrays.
 4. Autodecoding is slow and has no place in high speed string processing.
I would agree only with the amendment "...if used naively", which is important. Knowledge of how autodecoding works is a prerequisite for writing fast string code in D.
Having written high-speed string processing code in D that also deals with Unicode (i.e. Warp), the only knowledge of autodecoding I needed was how to keep it from happening. Autodecoding made the code slower than necessary in every case it was used. I found no place in Warp where autodecoding was desirable.
 Also, little code should deal with one code unit or code point at a
 time; instead, it should use standard library algorithms for searching,
matching
 etc.
That doesn't work so well. There always seems to be a need for custom string processing. Worse, when pipelining strings, the autodecoding changes the element type to dchar, which then needs to be re-encoded into the result. The std.string algorithms I wrote all work much better (i.e. faster) without autodecoding, while maintaining proper Unicode support. I.e. the autodecoding did not benefit the algorithms at all, and if the user is to use standard algorithms instead of custom ones, then autodecoding is not necessary.
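A minimal sketch of that effect (the variable names are mine): filtering a string yields dchars because of autodecoding, so getting a string back out requires re-encoding.

    import std.algorithm : filter;
    import std.conv : to;

    void main()
    {
        string s = "hëllo";
        auto piped = s.filter!(c => c != 'l');

        // Autodecoding makes the pipeline's element type dchar, not char:
        static assert(is(typeof(piped.front) == dchar));

        // To end up with a string again, the dchars must be re-encoded:
        string result = piped.to!string;
        assert(result == "hëo");
    }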
 When needed, iterating every code unit is trivially done through indexing.
This implies replacing pipelining with loops, and also falls apart if indexing is redone to index by code points.
 Also allow me to point that much of the slowdown can be addressed tactically.
 The test c < 0x80 is highly predictable (in ASCII-heavy text) and therefore
 easily speculated. We can and we should arrange code to minimize impact.
I.e. special case the code to avoid autodecoding. The trouble is that the low level code cannot avoid autodecoding, as it happens before the low level code gets it. This is conceptually backwards, and winds up requiring every algorithm to special case strings, even when completely unnecessary. (The 'copy' algorithm is an example of utterly unnecessary decoding.) When teaching people how to write algorithms, having to write every one twice, once for ranges and arrays, and a specialization for strings even when decoding is never necessary (such as for 'copy'), is embarrassing.
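A hedged sketch of the special-casing pattern being criticized (the function name is mine, not Phobos'): the generic branch would autodecode, so strings get a dedicated branch that works on code units.

    import std.string : representation;
    import std.traits : isSomeString;

    // Count the elements of a range. Strings need their own branch just to
    // step around autodecoding; everything else takes the generic path.
    size_t elementCount(R)(R r)
    {
        static if (isSomeString!R)
        {
            return r.representation.length; // code units, no decoding
        }
        else
        {
            size_t n;
            foreach (e; r) ++n;             // generic range path
            return n;
        }
    }

    void main()
    {
        assert(elementCount("héllo") == 6); // 6 UTF-8 code units
        assert(elementCount([1, 2, 3]) == 3);
    }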
 5. Very few algorithms require decoding.
The key here is leaving it to the standard library to do the right thing instead of having the user wonder separately for each case. These uses don't need decoding, and the standard library correctly doesn't involve it (or if it currently does it has a bug):

    s.find("abc")
    s.findSplit("abc")
    s.findSplit('a')
    s.count!(c => "!()-;:,.?".canFind(c)) // punctuation

However the following do require autodecoding:

    s.walkLength
    s.count!(c => !"!()-;:,.?".canFind(c)) // non-punctuation
    s.count!(c => c >= 32) // non-control characters

Currently the standard library operates at code point level even though inside it may choose to use code units when admissible. Leaving such a decision to the library seems like a wise thing to do.
Running my char[] through a pipeline and having it come out sometimes as char[] and sometimes dchar[] and sometimes ubyte[] is hidden and surprising behavior.
 6. Autodecoding has two choices when encountering invalid code units -
 throw or produce an error dchar. Currently, it throws, meaning no
 algorithms using autodecode can be made nothrow.
Agreed. This is probably the most glaring mistake. I think we should open a discussion on fixing this everywhere in the stdlib, even at the cost of breaking code.
A third option is to pass the invalid code units through unmolested, which won't work if autodecoding is used.
 7. Autodecode cannot be used with unicode path/filenames, because it is
 legal (at least on Linux) to have invalid UTF-8 as filenames. It turns
 out in the wild that pure Unicode is not universal - there's lots of
 dirty Unicode that should remain unmolested, and autocode does not play
 with that.
If paths are not UTF-8, then they shouldn't have string type (instead use ubyte[] etc). More on that below.
Requiring code units to be all 100% valid is not workable, nor is redoing them to be ubytes. More on that below.
 8. In my work with UTF-8 streams, dealing with autodecode has caused me
 considerably extra work every time. A convenient timesaver it ain't.
Objection. Vague.
Sorry I didn't log the time I spent on it.
 9. Autodecode cannot be turned off, i.e. it isn't practical to avoid
 importing std.array one way or another, and then autodecode is there.
Turning off autodecoding is as easy as inserting .representation after any string.
.representation changes the type to ubyte[]. All knowledge that this is a Unicode string then gets lost for the rest of the pipeline.
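To make that concrete, a minimal sketch of the type change:

    import std.string : representation;

    void main()
    {
        string s = "héllo";
        auto r = s.representation;

        // The element type is now a plain integer; that this is UTF-8 text
        // is no longer visible to the rest of the pipeline.
        static assert(is(typeof(r) == immutable(ubyte)[]));
    }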
 (Not to mention using indexing directly.)
Doesn't work if you're pipelining.
 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
 benefit of being arrays in the first place.
First off, you always have the option with .representation. That's a great name because it gives you the type used to represent the string - i.e. an array of integers of a specific width.
I found .representation to be unworkable because it changed the type.
 11. Indexing an array produces different results than autodecoding,
 another glaring special case.
This is a direct consequence of the fact that string is immutable(char)[] and not a specific type. That error predates autodecoding.
Even if it is made a special type, the problem of what an index means will remain. Of course, indexing by code point is an O(n) operation, which I submit is surprising and shouldn't be supported as [i] even by a special type (for the same reason that indexing of linked lists is frowned upon). Giving up indexing means giving up efficient slicing, which would be a major downgrade for D.
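A small sketch of the mismatch in point 11, using only the language and std.range.primitives: indexing yields code units, while the range primitives autodecode to code points.

    import std.range.primitives : front;

    void main()
    {
        string s = "öx";       // 'ö' is 2 UTF-8 code units, 'x' is 1

        assert(s.length == 3); // length counts code units
        assert(s[2] == 'x');   // indexing is by code unit, O(1)

        static assert(is(typeof(s[0])    == immutable(char))); // a code unit
        static assert(is(typeof(s.front) == dchar));           // a decoded code point
    }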
 Overall, I think the one way to make real steps forward in improving string
 processing in the D language is to give a clear answer of what char, wchar, and
 dchar mean.
They mean code units. This is not ambiguous.

How a code unit is different from a ubyte:

A. I know you hate bringing up my personal experience, but here goes. I've programmed in C forever. In C, char is used for both small integers and characters. It's always been a source of confusion, and sometimes bugs, to conflate the two:

    struct S { char field; };

Which is it, a character or a small integer? I have to rely on reading the code. It's a definite improvement in D that they are distinguished, and I feel that improvement every time I have to deal with C/C++ code and see 'char' used as a small integer instead of a character.

B. Overloading is different, and that's good. For example, writeln(T[]) produces different results for char[] and ubyte[], and this is unsurprising and expected. It "just works".

C. More overloading:

    writeln('a');

Does anyone want that to print 97? Does anyone really want 'a' to be of type dchar? (The trouble with that is type inference when building up more complex types, as you'll wind up with hidden dchar[] if not careful. My experience with dchar[] is it is almost never desirable, as it is too memory hungry.)
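To illustrate point B above, a minimal sketch of the overloading difference (values chosen arbitrarily):

    import std.stdio : writeln;

    void main()
    {
        char[]  c = "hi".dup;
        ubyte[] b = [104, 105];

        writeln(c);   // prints: hi
        writeln(b);   // prints: [104, 105]
        writeln('a'); // prints: a  (not 97, because char is a character type)
    }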
May 27 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 1:11 PM, Walter Bright wrote:
 They mean code units.
Always valid or potentially invalid as well? -- Andrei
May 27 2016
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2016 11:27 AM, Andrei Alexandrescu wrote:
 On 5/27/16 1:11 PM, Walter Bright wrote:
 They mean code units.
Always valid or potentially invalid as well? -- Andrei
Some years ago I would have said always valid. Experience, however, says that Unicode is often dirty and code should be tolerant of that. Consider Unicode in a text editor. You can't have it throwing exceptions, silently changing things to replacement characters, etc., when there are a few invalid sequences in it. You also can't just say "the file isn't Unicode" and refuse to display the Unicode in it. It isn't hard to deal with invalid Unicode in a user-friendly manner.
May 27 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 1:11 PM, Walter Bright wrote:
 The std.string algorithms I wrote all work much better (i.e. faster)
 without autodecoding, while maintaining proper Unicode support.
Violent agreement is occurring here. We have plenty of those and need more. -- Andrei
May 27 2016
prev sibling next sibling parent Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 05/12/2016 10:15 PM, Walter Bright wrote:
 On 5/12/2016 9:29 AM, Andrei Alexandrescu wrote:
 I am as unclear about the problems of autodecoding as I am about the
necessity
 to remove curl. Whenever I ask I hear some arguments that work well
emotionally
 but are scant on reason and engineering. Maybe it's time to rehash
them? I just
 did so about curl, no solid argument seemed to come together. I'd be
curious of
 a crisp list of grievances about autodecoding. -- Andrei
Here are some that are not matters of opinion. 6. Autodecoding has two choices when encountering invalid code units - throw or produce an error dchar. Currently, it throws, meaning no algorithms using autodecode can be made nothrow.
There are more than two choices here; see the related discussion on avoiding redundant Unicode validation: https://issues.dlang.org/show_bug.cgi?id=14519#c32.
May 29 2016
prev sibling parent Marco Leise <Marco.Leise gmx.de> writes:
A relevant thread in the Rust bug tracker I remember from
three years ago: https://github.com/rust-lang/rust/issues/7043
May it be of inspiration.

-- 
Marco
May 30 2016