
digitalmars.D - Major performance problem with std.array.front()

reply Walter Bright <newshound2 digitalmars.com> writes:
In "Lots of low hanging fruit in Phobos" the issue came up about the automatic 
encoding and decoding of char ranges.

Throughout D's history, there are regular and repeated proposals to redesign D's view of char[] to pretend it is not UTF-8, but UTF-32. I.e. so D will automatically generate code to decode and encode on every attempt to index char[].

I have strongly objected to these proposals on the grounds that:

1. It is a MAJOR performance problem to do this.

2. Very, very few manipulations of strings ever actually need decoded values.

3. D is a systems/native programming language, and systems/native programming 
languages must not hide the underlying representation (I make similar arguments 
about proposals to make ints issue errors on overflow, etc.).

4. Users should choose when decode/encode happens, not the language.

and I have been successful at heading these off. But one slipped by me. See this in std.array:

    @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
    {
        assert(a.length, "Attempting to fetch the front of an empty array of " ~
               T.stringof);
        size_t i = 0;
        return decode(a, i);
    }

What that means is that if I implement an algorithm that accepts, as input, an 
InputRange of char's, it will ALWAYS try to decode it. This means that even:

    from.copy(to)

will decode 'from', and then re-encode it for 'to'. And it will do it SILENTLY. 
The user won't notice, and he'll just assume that D performance sux. Even if he 
does notice, his options to make his code run faster are poor.
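A minimal sketch of the effect, assuming current Phobos semantics (the snippet is illustrative, not from Phobos):

    import std.array : front;

    void main()
    {
        string s = "abc";
        // front() on a narrow string auto-decodes: the element type is
        // dchar, not char, so every step through the range pays the
        // UTF-8 decode cost.
        static assert(is(typeof(s.front) == dchar));
    }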

If the user wants decoding, it should be explicit, as in:

     from.decode.copy(encode!to)

The USER should decide where and when the decoding goes. 'decode' should be just another algorithm.

(Yes, I know that std.algorithm.copy() has some specializations to take care of 
this. But these specializations would have to be written for EVERY algorithm, 
which is thoroughly unreasonable. Furthermore, copy()'s specializations only 
apply if BOTH source and destination are arrays. If just one is, the 
decode/encode penalty applies.)

Is there any hope of fixing this?
Mar 06 2014
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 systems/native programming languages must not hide the 
 underlying representation (I make similar arguments about 
 proposals to make ints issue errors on overflow, etc.).
But it's good to have in Phobos a compiler-intrinsics-based efficient overflow detection on a user-defined struct type that behaves like built-in ints in all other aspects.
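For instance, a minimal sketch of such a type (the struct name is hypothetical; core.checkedint in present-day druntime provides the intrinsic-backed helpers):

    import core.checkedint : adds;

    struct CheckedInt
    {
        int value;
        alias value this; // behaves like built-in int in other aspects

        CheckedInt opBinary(string op : "+")(CheckedInt rhs)
        {
            bool overflow;
            int r = adds(value, rhs.value, overflow); // sets the flag on wrap
            if (overflow)
                throw new Exception("integer overflow");
            return CheckedInt(r);
        }
    }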
 Is there any hope of fixing this?
I don't think we can change that in D2. You can change it in D3.

Bye,
bearophile
Mar 06 2014
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 6:54 PM, bearophile wrote:
 Walter Bright:

 systems/native programming languages must not hide the underlying
 representation (I make similar arguments about proposals to make ints issue
 errors on overflow, etc.).
But it's good to have in Phobos a compiler-intrinsics-based efficient overflow detection on a user-defined struct type that behaves like built-in ints in all other aspects.
Yes, so that the user selects it, rather than having it wired in everywhere and the user has to figure out how to defeat it.
Mar 06 2014
next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 But it's good to have in Phobos a compiler-intrinsics-based 
 efficient overflow
 detection on a user-defined struct type that behaves like 
 built-in ints in all
 other aspects.
Yes, so that the user selects it, rather than having it wired in everywhere and the user has to figure out how to defeat it.
I don't think people have ever suggested that. In a recent discussion you seemed against the idea of special compiler support for that user-defined type.

Bye,
bearophile
Mar 06 2014
prev sibling parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 7 March 2014 at 02:57:38 UTC, Walter Bright wrote:
 Yes, so that the user selects it, rather than having it wired 
 in everywhere and the user has to figure out how to defeat it.
BTW you know what would help this? A pragma we can attach to a struct which makes it a very thin value type.

    pragma(thin_struct)
    struct A {
        int a;
        int foo() { return a; }
        static A get() { return A(10); }
    }

    void test() {
        A a = A.get();
        printf("%d", a.foo());
    }

With the pragma, A would be completely indistinguishable from int in all ways.

What do I mean?

    $ dmd -release -O -inline test56 -c

Let's look at A.foo:

    A.foo:
       0:   55                      push   ebp
       1:   8b ec                   mov    ebp,esp
       3:   50                      push   eax
       4:   8b 00                   mov    eax,DWORD PTR [eax] ; waste!
       6:   8b e5                   mov    esp,ebp
       8:   5d                      pop    ebp
       9:   c3                      ret

It is line four that bugs me: the struct is passed as a *pointer*, but its only contents are an int, which could just as well be passed as a value. Let's compare it to an identical function in operation:

    int identity(int a) { return a; }

    00000000 <_D6test568identityFiZi>:
       0:   55                      push   ebp
       1:   8b ec                   mov    ebp,esp
       3:   83 ec 04                sub    esp,0x4
       6:   c9                      leave
       7:   c3                      ret

lol it *still* wastes time, setting up a stack frame for nothing. But we could just as well write asm { naked; ret; } and it would work as expected: the argument is passed in EAX and the return value is expected in EAX. The function doesn't actually have to do anything.

Anywho, the struct could work the same way. Now, I understand that we can't just change this unilaterally since it would break interaction with the C ABI, but we could opt in to some thinner stuff with a pragma.

Ideally, the thin struct would generate this code:

    void A.get() {
        naked {
            // no need for stack frame here
            mov EAX, 10;
            ret;
        }
    }

return A(10); when A is thin should be equal to return 10;. No need for NRVO, the object is super thin.

    void A.foo() {
        naked {
            // no locals, no stack frame
            ret;
            // the last argument (this) is passed in EAX
            // and the return value goes in EAX
            // so we don't have to do anything
        }
    }

Without the thin_struct thing, this would minimally look like

    mov EAX, [EAX];
    ret;

Having to load the value from the this pointer. But since it is thin, it is generated identically to an int, like the identity function above, so the value is already in the register!

Then, test:

    void test() {
        naked {
            // don't need a stack frame here either!
            call A.get; // a is now in EAX, the value loaded right up
            call A.foo; // the this is an int and already
                        // where it needs to be, so just go
            // and finally, go ahead and call printf
            push EAX;
            push "%d".ptr;
            call printf;
            ret;
        }
    }

Then, naturally, inlining A.get and A.foo might be possible (though I'd love to write them in assembly myself* and the compiler prolly can't inline them) but call/ret is fairly cheap, especially when compared to push/pop, so just keeping all the relevant stuff right in registers with no need to reference can really help us.

    pragma(thin_struct)
    struct RangedInt {
        int a;
        RangedInt opBinary(string op : "+")(int rhs) {
            asm {
                naked;
                add EAX, [rhs]; // or RDI on 64 bit! Don't even need to touch the stack! **
                jo throw_exception;
                ret;
            }
        }
    }

Might still not be as perfect as intrinsics like bearophile is thinking of... but we'd be getting pretty close. And this kind of thing would be good for other thin wrappers too, we could magically make smart pointers too! (This can't be done now since returning a struct is done via hidden pointer argument instead of by register like a naked pointer.)

** i'd kinda love it if we had an all-register calling convention on 32 bit too.... but eh oh well
Mar 06 2014
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 8:01 PM, Adam D. Ruppe wrote:
 BTW you know what would help this? A pragma we can attach to a struct which
 makes it a very thin value type.
I'd rather fix the compiler's codegen than add a pragma.
Mar 06 2014
next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 I'd rather fix the compiler's codegen than add a pragma.
But a standard common intrinsic to detect the overflow efficiently could be useful.

Bye,
bearophile
Mar 06 2014
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Mar 06, 2014 at 08:19:18PM -0800, Walter Bright wrote:
 On 3/6/2014 8:01 PM, Adam D. Ruppe wrote:
BTW you know what would help this? A pragma we can attach to a struct
which makes it a very thin value type.
I'd rather fix the compiler's codegen than add a pragma.
[...]
From what I understand, structs are *supposed* to be thin value types. I would say that if a struct is under a certain size (determined by the compiler), and doesn't have complicated semantics like dtors and stuff like that, then it should be treated like a POD (passed in registers, etc).

T

--
Ruby is essentially Perl minus Wall.
Mar 06 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 10:12 PM, H. S. Teoh wrote:
 From what I understand, structs are *supposed* to be thin value types. I
 would say that if a struct is under a certain size (determined by the
 compiler), and doesn't have complicated semantics like dtors and stuff
 like that, then it should be treated like a POD (passed in registers,
 etc).
Yes, that's right.
Mar 06 2014
prev sibling parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 7 March 2014 at 04:19:16 UTC, Walter Bright wrote:
 I'd rather fix the compiler's codegen than add a pragma.
The codegen isn't broken; the current this pointer behavior is needed for full compatibility with the C ABI. It would be opt-in to an ABI tweak that the caller needs to be aware of, rather than a traditional optimization where the outside world would never know.
Mar 07 2014
next sibling parent reply "Dicebot" <public dicebot.lv> writes:
On Friday, 7 March 2014 at 13:56:48 UTC, Adam D. Ruppe wrote:
 On Friday, 7 March 2014 at 04:19:16 UTC, Walter Bright wrote:
 I'd rather fix the compiler's codegen than add a pragma.
The codegen isn't broken; the current this pointer behavior is needed for full compatibility with the C ABI. It would be opt-in to an ABI tweak that the caller needs to be aware of, rather than a traditional optimization where the outside world would never know.
We don't need C ABI compatibility for stuff that is not extern(C), do we?
Mar 07 2014
parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 7 March 2014 at 14:04:53 UTC, Dicebot wrote:
 We don't need C ABI compatibility for stuff that is not 
 extern(C), do we?
That's a good point, though personally I'd still like some way to magic it up, even in extern(C). Consider the example of library typedef. If C did:

    typedef void* HANDLE;

and D did

    struct HANDLE { void* foo; alias foo this; }

it is almost the same, but then when you declare

    HANDLE OpenFile(...);

it won't work since the compiler will pass a hidden struct pointer (which is exactly what C would expect if it was a typedef struct { void* } on its side too) instead of expecting the value in the accumulator as it would with the void*.
Mar 07 2014
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 5:56 AM, Adam D. Ruppe wrote:
 On Friday, 7 March 2014 at 04:19:16 UTC, Walter Bright wrote:
 I'd rather fix the compiler's codegen than add a pragma.
The codegen isn't broken; the current this pointer behavior is needed for full compatibility with the C ABI. It would be opt-in to an ABI tweak that the caller needs to be aware of, rather than a traditional optimization where the outside world would never know.
Oh, I see what you mean. But I think it does generate the same code, if you use it the same way. There is no 'get' function for ints; you aren't using it the same way.
Mar 07 2014
prev sibling parent reply "Kagamin" <spam here.lot> writes:
On Friday, 7 March 2014 at 04:01:15 UTC, Adam D. Ruppe wrote:
 BTW you know what would help this? A pragma we can attach to a 
 struct which makes it a very thin value type.

 pragma(thin_struct)
 struct A {
    int a;
    int foo() { return a; }
    static A get() { return A(10); }
 }

 void test() {
     A a = A.get();
     printf("%d", a.foo());
 }

 With the pragma, A would be completely indistinguishable from 
 int in all ways.

 What do I mean?
 $ dmd -release -O -inline test56 -c

 Let's look at A.foo:

 A.foo:
    0:   55                      push   ebp
    1:   8b ec                   mov    ebp,esp
    3:   50                      push   eax
    4:   8b 00                   mov    eax,DWORD PTR [eax] ; 
 waste!
    6:   8b e5                   mov    esp,ebp
    8:   5d                      pop    ebp
    9:   c3                      ret


 It is line four that bugs me: the struct is passed as a 
 *pointer*, but its only contents are an int, which could just 
 as well be passed as a value. Let's compare it to an identical 
 function in operation:

 int identity(int a) { return a; }

 00000000 <_D6test568identityFiZi>:
    0:   55                      push   ebp
    1:   8b ec                   mov    ebp,esp
    3:   83 ec 04                sub    esp,0x4
    6:   c9                      leave
    7:   c3                      ret

 lol it *still* wastes time, setting up a stack frame for 
 nothing. But we could just as well write asm { naked; ret; } 
 and it would work as expected: the argument is passed in EAX 
 and the return value is expected in EAX. The function doesn't 
 actually have to do anything.
struct A {
    int a;
    //int foo() { return a; }
    static A get() { return A(10); }
}

int foo(A a) { return a.a; }

printf("%d", a.foo());

Now it's passed by value.

Though, I needed checked arithmetic only twice: for a cast from long to int and for a cast from double to long. If you expect your number type to overflow, you probably chose the wrong type.
Mar 07 2014
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 7 March 2014 at 10:44:46 UTC, Kagamin wrote:
 Now it's passed by value.
That won't work for operator overloading though (which is the really interesting case here).
 Though, I needed checked arithmetic only twice: for cast from 
 long to int and for cast from double to long. If you expect 
 your number type to overflow, you probably chose wrong type.
I very rarely need it too, but it is nice to have in a convenient package that is fairly efficient at the same time.
Mar 07 2014
parent reply "Kagamin" <spam here.lot> writes:
On Friday, 7 March 2014 at 14:13:54 UTC, Adam D. Ruppe wrote:
 On Friday, 7 March 2014 at 10:44:46 UTC, Kagamin wrote:
 Now it's passed by value.
That won't work for operator overloading though (which is the really interesting case here).
Alternatively for small methods you can rely on inlining, which dereferences the argument. If the method is big, the reference is probably unimportant.
Mar 07 2014
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Friday, 7 March 2014 at 14:44:43 UTC, Kagamin wrote:
 Alternatively for small methods you can rely on inlining, which 
 dereferences the argument.
Yeah, that's usually the way to go; inlining can also avoid pushing other arguments to the stack on 32 bit, which is a big win too. But you can't inline an asm function, and checking the overflow flag needs asm (or a compiler intrinsic).

For the library typedef case, this also means wrapping any function that returns a struct, which is annoying if nothing else.
Mar 07 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 7:24 AM, Adam D. Ruppe wrote:
 But you can't inline asm function,
I intend to fix that for dmd, but haven't had the time.
 and checking the overflow flag needs asm. (or a compiler intrinsic.)
For that, I was thinking of having the compiler recognize one of the common coding patterns for detecting overflow, and then generating efficient overflow checks. Then documenting the pattern as being specially detected. This means the code will still be successful for compilers that don't detect the pattern, and no language changes would be required.
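One such pattern might look like this (an illustrative sketch; the post doesn't specify which patterns would be recognized):

    // Canonical unsigned-overflow idiom: the sum wrapped around iff it
    // is smaller than an operand. A compiler that recognizes this shape
    // can emit a plain ADD plus a jump-on-carry instead of a separate
    // comparison.
    uint addChecked(uint a, uint b)
    {
        uint sum = a + b;
        if (sum < a)
            throw new Exception("overflow");
        return sum;
    }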
Mar 07 2014
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 6:54 PM, bearophile wrote:
 Walter Bright:
 Is there any hope of fixing this?
I don't think we can change that in D2. You can change it in D3.
You use ranges a lot. Would it break any of your code?
Mar 06 2014
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 You use ranges a lot. Would it break any of your code?
I need to try the changes to be sure. But the magnitude of this change is so large that I guess some code will surely break.

One advantage of your change is that this code will work:

    auto s = "hello".dup;
    s.sort();

Bye,
bearophile
Mar 06 2014
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 7:22 PM, bearophile wrote:
 One advantage of your change is that this code will work:

 auto s = "hello".dup;
 s.sort();
Yes, I hadn't thought of that. The auto-decoding front() introduces all kinds of asymmetry in how ranges work, and asymmetry is bad as it negatively impacts composability.
Mar 06 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/6/14, 7:55 PM, Walter Bright wrote:
 On 3/6/2014 7:22 PM, bearophile wrote:
 One advantage of your change is that this code will work:

 auto s = "hello".dup;
 s.sort();
Yes, I hadn't thought of that. The auto-decoding front() introduces all kinds of asymmetry in how ranges work, and asymmetry is bad as it negatively impacts composability.
There's no asymmetry, and decoding helps composability as I demonstrated.

Andrei
Mar 07 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 11:59 AM, Andrei Alexandrescu wrote:
 On 3/6/14, 7:55 PM, Walter Bright wrote:
 On 3/6/2014 7:22 PM, bearophile wrote:
 One advantage of your change is that this code will work:

 auto s = "hello".dup;
 s.sort();
Yes, I hadn't thought of that. The auto-decoding front() introduces all kinds of asymmetry in how ranges work, and asymmetry is bad as it negatively impacts composability.
There's no asymmetry, and decoding helps composability as I demonstrated.
Here's one asymmetry:
-----------------------------
alias int T;    // compiles
//alias char T; // fails to compile

struct Input(T)  { T front(); bool empty(); void popFront(); }
struct Output(T) { void put(T); }

import std.array;

void copy(F,T)(F f, T t)
{
    while (!f.empty)
    {
        t.put(f.front);
        f.popFront();
    }
}

void main()
{
    immutable(T)[] from;
    Output!T to;
    from.copy(to);
}
-------------------------------
Mar 07 2014
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Mar-2014 07:22, bearophile writes:
 Walter Bright:

 You use ranges a lot. Would it break any of your code?
I need to try the changes to be sure. But the magnitude of this change is so large that I guess some code will surely break.

One advantage of your change is that this code will work:

    auto s = "hello".dup;
    s.sort();
Which it shouldn't unless there is an ascii type or some such.

--
Dmitry Olshansky
Mar 07 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 1:56 AM, Dmitry Olshansky wrote:
 07-Mar-2014 07:22, bearophile writes:
 Walter Bright:

 You use ranges a lot. Would it break any of your code?
I need to try the changes to be sure. But the magnitude of this change is so large that I guess some code will surely break.

One advantage of your change is that this code will work:

    auto s = "hello".dup;
    s.sort();
Which it shouldn't unless there is an ascii type or some such.
Correct. This is a win, not a failure, of the current approach. To sort the bytes in "hello" write:

    s.representation.sort();

which is indicative to the human and technically correct.

Andrei
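As a usage sketch (an illustrative example assuming std.string.representation and std.algorithm.sort):

    import std.algorithm : sort;
    import std.string : representation;

    void main()
    {
        auto s = "hello".dup;    // char[]
        s.representation.sort(); // ubyte[] view: sorts code units explicitly
        assert(s == "ehllo");
    }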
Mar 07 2014
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Mar 06, 2014 at 06:59:36PM -0800, Walter Bright wrote:
 On 3/6/2014 6:54 PM, bearophile wrote:
Walter Bright:
Is there any hope of fixing this?
I don't think we can change that in D2. You can change it in D3.
You use ranges a lot. Would it break any of your code?
Whoa. You're not serious about changing this now, are you? Because even though I would support such a change, you have to realize the magnitude of code breakage that will happen. A lot of code that iterates over narrow strings will break, and worse yet, they will break *silently*. Calling count() on a narrow string will not return the expected value, for example. And existing code that iterates over narrow strings expecting dchars to come out of it will suddenly silently convert to char, and may pass by unnoticed until somebody runs the program with a multibyte character in the input.

This is a very high-risk change IMO.

You're welcome to create a (temporary) Phobos fork that reverts narrow string auto-decoding, of course, and people can try it out to see how much actual breakage is happening. If you really want to push for this, that might be the safest way to test the waters before committing to such a major change. Silent breakage is not easy to test for, unfortunately. :(

T

--
Truth, Sir, is a cow which will give [skeptics] no more milk, and so they are gone to milk the bull. -- Sam. Johnson
Mar 06 2014
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 7:31 PM, H. S. Teoh wrote:
 Whoa. You're not serious about changing this now, are you? Because even
 though I would support such a change, you have to realize the magnitude
 of code breakage that will happen. A lot of code that iterates over
 narrow strings will break, and worse yet, they will break *silently*.
 Calling count() on a narrow string will not return the expected value,
 for example. And existing code that iterates over narrow strings
 expecting dchars to come out of it will suddenly silently convert to
 char, and may pass by unnoticed until somebody runs the program with a
 multibyte character in the input.
I understand this all too well. (Note that we currently have a different silent problem: unnoticed large performance problems.)
 This is a very high-risk change IMO.

 You're welcome to create a (temporary) Phobos fork that reverts narrow
 string auto-decoding, of course, and people can try it out to see how
 much actual breakage is happening. If you really want to push for this,
 that might be the safest way to test the waters before committing to
 such a major change. Silent breakage is not easy to test for,
 unfortunately. :(
I posted a plan in another message in this thread. It'll be a long process, but I think it's doable.
Mar 06 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 I understand this all too well. (Note that we currently have a 
 different silent problem: unnoticed large performance problems.)
On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).

Bye,
bearophile
Mar 06 2014
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 7:59 PM, bearophile wrote:
 Walter Bright:

 I understand this all too well. (Note that we currently have a different
 silent problem: unnoticed large performance problems.)
On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).
This comes up repeatedly as justification for D trying to hide the UTF-8 nature of strings that I discussed upthread. To my mind it's like trying to pretend that floating point doesn't have roundoff issues, integers have infinite range, memory is infinite, etc. That has a place in other languages, but not in a systems/native language.
Mar 06 2014
parent Shammah Chancellor <anonymous coward.com> writes:
On 2014-03-07 04:17:34 +0000, Walter Bright said:

 On 3/6/2014 7:59 PM, bearophile wrote:
 Walter Bright:
 
 I understand this all too well. (Note that we currently have a different
 silent problem: unnoticed large performance problems.)
On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).
This comes up repeatedly as justification for D trying to hide the UTF-8 nature of strings that I discussed upthread. To my mind it's like trying to pretend that floating point doesn't have roundoff issues, integers have infinite range, memory is infinite, etc. That has a place in other languages, but not in a systems/native language.
Is it possible to add a warning notice when .front() is used on char? I would say fix it now, add a warning, and then remove the warning later.

-S.
Mar 07 2014
prev sibling parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-07 03:59:55 +0000, "bearophile" <bearophileHUGS lycos.com> said:

 Walter Bright:
 
 I understand this all too well. (Note that we currently have a 
 different silent problem: unnoticed large performance problems.)
On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).
The way Phobos works isn't any more correct than dealing with code units. Many graphemes span multiple code points -- because of combining diacritics or character variant modifiers -- and decoding at the code-point level is thus often insufficient for correctness.

The problem with Unicode strings is that the representation you must work with depends on the things you want to do. If you want to count the characters then you need graphemes; if you want to parse XML then you'll need to work with code points (in theory; in practice you might still want direct access to code units for performance reasons); and if you want to slice or copy a string then you need to deal with code units. Because of this multiple-representation-for-different-purpose thing, generic algorithms for arrays don't map very well to strings.

From my experience, I'd suggest these basic operations for a "string range" instead of the regular range interface:

    .empty
    .frontCodeUnit
    .frontCodePoint
    .frontGrapheme
    .popFrontCodeUnit
    .popFrontCodePoint
    .popFrontGrapheme
    .codeUnitLength (aka length)
    .codePointLength (for dchar[] only)
    .codePointLengthLinear
    .graphemeLengthLinear

Someone should be able to mix all three 'front' and 'pop' function variants above in any code dealing with a string type. In my XML parser for instance I regularly use frontCodeUnit to avoid the decoding penalty when matching the next character with an ASCII one such as '<' or '&'. An API like the one above forces you to be aware of the level you're working on, making bugs and inefficiencies stand out (as long as you're familiar with each representation).

If someone wants to use a generic array/range algorithm with a string, my opinion is that he should have to wrap it in a range type that maps front and popFront to one of the above variants. Having to do that should make it obvious that there's an inefficiency there, as you're using an algorithm that wasn't tailored to work with strings and that more decoding than strictly necessary is being done.

--
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca
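A rough sketch of the first few of these operations for a UTF-8 string (an illustrative implementation; only the names come from the list above):

    import std.utf : decodeFront, stride;

    struct StringRange
    {
        string data;

        @property bool empty() { return data.length == 0; }

        // Code-unit level: no decoding at all.
        @property char frontCodeUnit() { return data[0]; }
        void popFrontCodeUnit() { data = data[1 .. $]; }

        // Code-point level: pays the UTF-8 decode cost explicitly.
        @property dchar frontCodePoint()
        {
            auto tmp = data; // decodeFront advances its argument, so copy first
            return tmp.decodeFront;
        }
        void popFrontCodePoint() { data = data[stride(data, 0) .. $]; }
    }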
Mar 07 2014
next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 13:40:31 UTC, Michel Fortin wrote:
 if you want to parse XML then you'll need to work with code 
 points
Why is this?
Mar 07 2014
prev sibling next sibling parent reply "Kagamin" <spam here.lot> writes:
On Friday, 7 March 2014 at 13:40:31 UTC, Michel Fortin wrote:
 if you want to parse XML then you'll need to work with code 
 points (in theory, in practice you might still want direct 
 access to code units for performance reasons)
AFAIK, xml control characters are all ascii, and what's between them you can slice or dup without consideration, so code units should be more than enough.
Mar 07 2014
parent Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-07 14:47:26 +0000, "Kagamin" <spam here.lot> said:

 On Friday, 7 March 2014 at 13:40:31 UTC, Michel Fortin wrote:
 if you want to parse XML then you'll need to work with code points (in 
 theory, in practice you might still want direct access to code units 
 for performance reasons)
AFAIK, xml control characters are all ascii, and what's between them you can slice or dup without consideration, so code units should be more than enough.
If you don't fully check for well-formedness (as XML parsers ought to do according to the XML spec) then sure, you can limit yourself to ASCII. You'll let through illegal characters in element and attribute names though.

--
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca
Mar 07 2014
prev sibling parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/7/2014 8:40 AM, Michel Fortin wrote:
 On 2014-03-07 03:59:55 +0000, "bearophile" <bearophileHUGS lycos.com> said:

 Walter Bright:

 I understand this all too well. (Note that we currently have a
 different silent problem: unnoticed large performance problems.)
On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).
The way Phobos works isn't any more correct than dealing with code units. Many graphemes span multiple code points -- because of combining diacritics or character variant modifiers -- and decoding at the code-point level is thus often insufficient for correctness.
Well, it is *more* correct, as text in many western languages is more likely to "just work" in current Phobos in most cases. It's just that things still aren't completely correct overall.
  From my experience, I'd suggest these basic operations for a "string
 range" instead of the regular range interface:

 .empty
 .frontCodeUnit
 .frontCodePoint
 .frontGrapheme
 .popFrontCodeUnit
 .popFrontCodePoint
 .popFrontGrapheme
 .codeUnitLength (aka length)
 .codePointLength (for dchar[] only)
 .codePointLengthLinear
 .graphemeLengthLinear

 Someone should be able to mix all the three 'front' and 'pop' function
 variants above in any code dealing with a string type. In my XML parser
 for instance I regularly use frontCodeUnit to avoid the decoding penalty
 when matching the next character with an ASCII one such as '<' or '&'.
 An API like the one above forces you to be aware of the level you're
 working on, making bugs and inefficiencies stand out (as long as you're
 familiar with each representation).

 If someone wants to use a generic array/range algorithm with a string,
 my opinion is that he should have to wrap it in a range type that maps
 front and popFront to one of the above variant. Having to do that should
 make it obvious that there's an inefficiency there, as you're using an
 algorithm that wasn't tailored to work with strings and that more
 decoding than strictly necessary is being done.
I actually like this suggestion quite a bit.
Mar 10 2014
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 10 Mar 2014 17:44:22 -0400
schrieb Nick Sabalausky <SeeWebsiteToContactMe semitwist.com>:

 On 3/7/2014 8:40 AM, Michel Fortin wrote:
 On 2014-03-07 03:59:55 +0000, "bearophile" <bearophileHUGS lycos.com> said:

 Walter Bright:

 I understand this all too well. (Note that we currently have a
 different silent problem: unnoticed large performance problems.)
On the other hand your change could introduce Unicode-related bugs in future code (that the current Phobos avoids) (and here I am not talking about code breakage).
The way Phobos works isn't any more correct than dealing with code units. Many graphemes span multiple code points -- because of combining diacritics or character variant modifiers -- and decoding at the code-point level is thus often insufficient for correctness.
Well, it is *more* correct, as text in many western languages is more likely to "just work" in current Phobos in most cases. It's just that things still aren't completely correct overall.
  From my experience, I'd suggest these basic operations for a "string
 range" instead of the regular range interface:

 .empty
 .frontCodeUnit
 .frontCodePoint
 .frontGrapheme
 .popFrontCodeUnit
 .popFrontCodePoint
 .popFrontGrapheme
 .codeUnitLength (aka length)
 .codePointLength (for dchar[] only)
 .codePointLengthLinear
 .graphemeLengthLinear

 Someone should be able to mix all the three 'front' and 'pop' function
 variants above in any code dealing with a string type. In my XML parser
 for instance I regularly use frontCodeUnit to avoid the decoding penalty
 when matching the next character with an ASCII one such as '<' or '&'.
 An API like the one above forces you to be aware of the level you're
 working on, making bugs and inefficiencies stand out (as long as you're
 familiar with each representation).

 If someone wants to use a generic array/range algorithm with a string,
 my opinion is that he should have to wrap it in a range type that maps
 front and popFront to one of the above variant. Having to do that should
 make it obvious that there's an inefficiency there, as you're using an
 algorithm that wasn't tailored to work with strings and that more
 decoding than strictly necessary is being done.
I actually like this suggestion quite a bit.
+1

Reminds me of my proposal for Rust (https://github.com/mozilla/rust/issues/7043#issuecomment-19187984)

--
Marco
Mar 18 2014
prev sibling next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Friday, 7 March 2014 at 03:32:50 UTC, H. S. Teoh wrote:
 On Thu, Mar 06, 2014 at 06:59:36PM -0800, Walter Bright wrote:
 On 3/6/2014 6:54 PM, bearophile wrote:
Walter Bright:
Is there any hope of fixing this?
I don't think we can change that in D2. You can change it in D3.
You use ranges a lot. Would it break any of your code?
 This is a very high-risk change IMO.
+1

This will be the most disruptive change in D's history...
Mar 07 2014
prev sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 03:32:50 UTC, H. S. Teoh wrote:
 Calling count() on a narrow string will not return the expected 
 value, for example.
I would argue that, unless it's been made clear that the program is expected to work only for certain languages, code that relied on this was wrong in the first place.
Mar 07 2014
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 6:37 PM, Walter Bright wrote:
 Is there any hope of fixing this?
Is there any way we can provide an upgrade path for this? Silent breakage is terrible. Any ideas?
Mar 06 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/6/2014 7:06 PM, Walter Bright wrote:
 On 3/6/2014 6:37 PM, Walter Bright wrote:
 Is there any hope of fixing this?
Is there any way we can provide an upgrade path for this? Silent breakage is terrible. Any ideas?
Ok, I have a plan. Each step will be separated by at least one version:

1. implement decode() as an algorithm for string types, so one can write:

    string s;
    s.decode.algorithm...

suggest that people start doing that instead of:

    s.algorithm...

2. Emit warning when people use std.array.front(s) with strings.

3. Deprecate std.array.front for strings.

4. Error for std.array.front for strings.

5. Implement new std.array.front for strings that doesn't decode.
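As an illustration of step 1, a sketch of explicit, user-chosen decoding in an algorithm chain (hypothetical usage; present-day Phobos ended up spelling such adapters std.utf.byDchar/byCodeUnit rather than 'decode'):

    import std.algorithm : count;
    import std.utf : byDchar;

    void main()
    {
        string s = "héllo";
        assert(s.length == 6);        // 6 UTF-8 code units, no decoding
        assert(s.byDchar.count == 5); // 5 code points, decoded on request
    }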
Mar 06 2014
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Mar-2014 07:52, Walter Bright пишет:
 On 3/6/2014 7:06 PM, Walter Bright wrote:
 On 3/6/2014 6:37 PM, Walter Bright wrote:
 Is there any hope of fixing this?
Is there any way we can provide an upgrade path for this? Silent breakage is terrible. Any ideas?
Ok, I have a plan. Each step will be separated by at least one version:

1. implement decode() as an algorithm for string types, so one can write:

    string s;
    s.decode.algorithm...

suggest that people start doing that instead of:

    s.algorithm...
This would also be a great fit in cases where 'decode' is decoding some other encoding.
 2. Emit warning when people use std.array.front(s) with strings.

 3. Deprecate std.array.front for strings.

 4. Error for std.array.front for strings.
This sounds fine to me. I would even prefer to only offer explicit wrappers:

    .raw    - ubyte/ushort for UTF-8/UTF-16 etc.
    .decode - dchars as Nick suggests.

Then there is also the horrible ElementEncodingType vs ElementType. I would love to see ElementEncodingType die.
 5. Implement new std.array.front for strings that doesn't decode.
It would make it simple to think that strings are arrays of characters. This illusion was broken (and a good thing it was); there's no point in reestablishing it to save a couple of keystrokes for those "who really know what they are doing".

--
Dmitry Olshansky
Mar 07 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 2:11 AM, Dmitry Olshansky wrote:
 Then there is also the horrible ElementEncodingType vs ElementType.
 I would love to see ElementEncodingType die.
I agree. ElementEncodingType is a giant red flag saying we screwed things up.
Mar 07 2014
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 03:52:42 UTC, Walter Bright wrote:
 Ok, I have a plan. Each step will be separated by at least one 
 version:

 1. implement decode() as an algorithm for string types, so one 
 can write:

     string s;
     s.decode.algorithm...

 suggest that people start doing that instead of:

     s.algorithm...
I think .decode should be something more explicit (byCodePoint OSLT), just so it's clear that it's not magical and does not solve all problems.
 2. Emit warning when people use std.array.front(s) with strings.

 3. Deprecate std.array.front for strings.

 4. Error for std.array.front for strings.

 5. Implement new std.array.front for strings that doesn't 
 decode.
Until then, how will people use strings with algorithms when they mean to use them per-byte? A .raw property which casts to ubyte[]?
Mar 07 2014
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Mar 07, 2014 at 05:24:59PM +0000, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 03:52:42 UTC, Walter Bright wrote:
Ok, I have a plan. Each step will be separated by at least one
version:

1. implement decode() as an algorithm for string types, so one can
write:

    string s;
    s.decode.algorithm...

suggest that people start doing that instead of:

    s.algorithm...
I think .decode should be something more explicit (byCodePoint OSLT), just so it's clear that it's not magical and does not solve all problems.
+1. I think "byCodePoint" is far more self-documenting and less misleading than "decode".

    string s;
    s.byCodePoint.algorithm...

I'm already starting to like it.

T

--
It always amuses me that Windows has a Safe Mode during bootup. Does that mean that Windows is normally unsafe?
Mar 07 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Mar-2014 21:34, H. S. Teoh writes:
 On Fri, Mar 07, 2014 at 05:24:59PM +0000, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 03:52:42 UTC, Walter Bright wrote:
 Ok, I have a plan. Each step will be separated by at least one
 version:

 1. implement decode() as an algorithm for string types, so one can
 write:

     string s;
     s.decode.algorithm...

 suggest that people start doing that instead of:

     s.algorithm...
I think .decode should be something more explicit (byCodePoint OSLT), just so it's clear that it's not magical and does not solve all problems.
+1. I think "byCodePoint" is far more self-documenting and less misleading than "decode".

    string s;
    s.byCodePoint.algorithm...

I'm already starting to like it.
And there is precedent, see std.uni.byCodePoint ;)

--
Dmitry Olshansky
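For reference, a usage sketch of that precedent (std.uni's byGrapheme/byCodePoint pair; an illustrative example):

    import std.algorithm : equal;
    import std.uni : byCodePoint, byGrapheme;

    void main()
    {
        // "noël": 'e' plus a combining diaeresis form a single grapheme.
        auto gs = "noe\u0308l".byGrapheme;
        // byCodePoint lowers a grapheme range back to code points.
        assert(gs.byCodePoint.equal("noe\u0308l"d));
    }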
Mar 07 2014
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 9:24 AM, Vladimir Panteleev wrote:
 5. Implement new std.array.front for strings that doesn't decode.
Until then, how will people use strings with algorithms when they mean to use them per-byte? A .raw property which casts to ubyte[]?
There's no "until then". A current ".representation" property already exists that casts all string types appropriately.

Andrei
Mar 07 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Mar-2014 23:11, Andrei Alexandrescu writes:
 On 3/7/14, 9:24 AM, Vladimir Panteleev wrote:
 5. Implement new std.array.front for strings that doesn't decode.
Until then, how will people use strings with algorithms when they mean to use them per-byte? A .raw property which casts to ubyte[]?
There's no "until then". A current ".representation" property already exists that casts all string types appropriately.
There is however a big glaring failure: std.algorithm is specialized for char[] and wchar[], but not for any RandomAccessRange!char or RandomAccessRange!wchar.

So if I for instance get a custom slice type (e.g. a ring buffer), then I'm out of luck w/o both "auto-magic dchar range" and special code in std.algorithm that works with chars as code units.

If there is a way to exploit the duality of a RA range of code units being "is a" BD range of code points, we certainly have failed to make it work (first of all doing a horrible job at generic-ness, as mentioned).

--
Dmitry Olshansky
Mar 07 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 11:28 AM, Dmitry Olshansky wrote:
 07-Mar-2014 23:11, Andrei Alexandrescu пишет:
 On 3/7/14, 9:24 AM, Vladimir Panteleev wrote:
 5. Implement new std.array.front for strings that doesn't decode.
Until then, how will people use strings with algorithms when they mean to use them per-byte? A .raw property which casts to ubyte[]?
There's no "until then". A current ".representation" property already exists that casts all string types appropriately.
There is however a big glaring failure: std.algorithm specialized for char[], wchar[] but not for any RandomAccessRange!char or RandomAccessRange!wchar.
I agree that's an issue. Back in the day when this was a choice I decided to consider only char[] and friends "UTF strings". There was room for more generality but I didn't know of any use cases that would ask for them. It's possible I was wrong, but the option to generalize is still open today.

Andrei
Mar 07 2014
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 9:24 AM, Vladimir Panteleev wrote:
 I think .decode should be something more explicit (byCodePoint OSLT), just so
 it's clear that it's not magical and does not solve all problems.
Good point. Perhaps "decodeUTF". "decode" is too generic.
 Until then, how will people use strings with algorithms when they mean to use
 them per-byte?
The way they do it now, i.e. they can't. That's the whole problem.
Mar 07 2014
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/6/14, 7:52 PM, Walter Bright wrote:
 On 3/6/2014 7:06 PM, Walter Bright wrote:
 On 3/6/2014 6:37 PM, Walter Bright wrote:
 Is there any hope of fixing this?
Is there any way we can provide an upgrade path for this? Silent breakage is terrible. Any ideas?
Ok, I have a plan. Each step will be separated by at least one version:

1. implement decode() as an algorithm for string types, so one can write:

    string s;
    s.decode.algorithm...

suggest that people start doing that instead of:

    s.algorithm...

2. Emit warning when people use std.array.front(s) with strings.

3. Deprecate std.array.front for strings.

4. Error for std.array.front for strings.

5. Implement new std.array.front for strings that doesn't decode.
This would kill D. I am not exaggerating.

Andrei
Mar 07 2014
parent "Nicholas Londey" <londey gmail.com> writes:
 This would kill D. I am not exaggerating.
I don't know about kill, but it certainly feels awfully similar to the Python 2/3 split over strings and Unicode, which still doesn't seem to be resolved.
Mar 08 2014
prev sibling next sibling parent "Nicholas Londey" <londey gmail.com> writes:
 1. implement decode() as an algorithm for string types,
Decode is an incredibly generic name. What about byGlyph similar to byLine?
Mar 08 2014
prev sibling parent "Chris" <wendlec tcd.ie> writes:
On Friday, 7 March 2014 at 03:52:42 UTC, Walter Bright wrote:
 Ok, I have a plan. Each step will be separated by at least one 
 version:

 1. implement decode() as an algorithm for string types, so one 
 can write:

     string s;
     s.decode.algorithm...

 suggest that people start doing that instead of:

     s.algorithm...

 2. Emit warning when people use std.array.front(s) with strings.

 3. Deprecate std.array.front for strings.

 4. Error for std.array.front for strings.

 5. Implement new std.array.front for strings that doesn't 
 decode.
What about this:

[as above]

1. implement decode() as an algorithm for string types, so one can write:

    string s;
    s.decode.algorithm...

suggest that people start doing that instead of:

    s.algorithm...

[as above]

2. Emit warning when people use std.array.front(s) with strings.

3. Implement new std.array.front for strings that doesn't decode, but keep the old one either forever(ish) or until way into D3 (3.03).

4. Deprecate std.array.front for strings (see 3.)

5. Error for std.array.front for strings. (see 3)

I know that one of the rules of D is "warnings should eventually become errors", but there is nothing wrong with waiting longer than a few months before something is an error or removed from the library, especially if it would cause loads of code to break (my own too, I suppose). As long as users are aware of it, they can start to make the transition in their own code little by little. In this case they will make the transition rather sooner than later, because nobody wants to suffer constant performance penalties. So for this particular change I'd suggest to wait patiently until it can finally be deprecated. Is this feasible?
Mar 11 2014
prev sibling next sibling parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
What about this?:

Anywhere we currently have a front() that decodes, such as your example:

    @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
    {
        assert(a.length, "Attempting to fetch the front of an empty array of " ~
               T.stringof);
        size_t i = 0;
        return decode(a, i);
    }
We rip out that front() entirely. The result is *not* technically a range...yet! We could call it a protorange.

Then we provide two functions:

    auto decode(someStringProtoRange) {...}
    auto raw(someStringProtoRange) {...}

These convert the protoranges into actual ranges by adding the missing front() function. The 'decode' adds a front() which decodes into dchar, while the 'raw' adds a front() which simply returns the raw underlying type. I imagine the decode/raw would probably also handle any "length" property (if it exists in the protorange) accordingly.

This way, the user is forced to specify "myStringRange.decode" or "myStringRange.raw" as appropriate, otherwise myStringRange can't be used since it isn't technically a range, only a protorange.

(Naturally, ranges of dchar would always have front, since no decoding is ever needed for them anyway. For these ranges, the decode/raw funcs above would simply be no-ops.)
Mar 06 2014
next sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/6/2014 11:11 PM, Nick Sabalausky wrote:
 What about this?:

 Anywhere we currently have a front() that decodes, such as your example:

    @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
    {
        assert(a.length, "Attempting to fetch the front of an empty array of " ~
               T.stringof);
        size_t i = 0;
        return decode(a, i);
    }
We rip out that front() entirely. The result is *not* technically a range...yet! We could call it a protorange.

Then we provide two functions:

    auto decode(someStringProtoRange) {...}
    auto raw(someStringProtoRange) {...}

These convert the protoranges into actual ranges by adding the missing front() function. The 'decode' adds a front() which decodes into dchar, while the 'raw' adds a front() which simply returns the raw underlying type. I imagine the decode/raw would probably also handle any "length" property (if it exists in the protorange) accordingly.

This way, the user is forced to specify "myStringRange.decode" or "myStringRange.raw" as appropriate, otherwise myStringRange can't be used since it isn't technically a range, only a protorange.

(Naturally, ranges of dchar would always have front, since no decoding is ever needed for them anyway. For these ranges, the decode/raw funcs above would simply be no-ops.)
Of course, I just realized that these front()s can't be added unless there's already a front to be called in the first place...

So instead of ripping out the current front() functions entirely, we replace "front" with some sort of "rawFront" which the raw/decode versions of front() can query in order to provide actual decoding/non-decoding ranges.
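A sketch of that corrected scheme (hypothetical types and names; only the 'raw' side is shown, and a 'decode' counterpart could use std.utf.decodeFront analogously):

    // The "protorange": has empty/popFront/rawFront but deliberately no front.
    struct ProtoString
    {
        string data;
        @property bool empty() { return data.length == 0; }
        @property char rawFront() { return data[0]; }
        void popFront() { data = data[1 .. $]; }
    }

    // 'raw' completes it into a real range of code units.
    auto raw(ProtoString p)
    {
        struct Raw
        {
            ProtoString p;
            @property bool empty() { return p.empty; }
            @property char front() { return p.rawFront; }
            void popFront() { p.popFront(); }
        }
        return Raw(p);
    }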
Mar 06 2014
prev sibling parent reply "Marc Schütz" <schuetzm gmx.net> writes:
On Friday, 7 March 2014 at 04:11:15 UTC, Nick Sabalausky wrote:
 What about this?:

 Anywhere we currently have a front() that decodes, such as your 
 example:

    @property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
    {
        assert(a.length, "Attempting to fetch the front of an empty array of " ~
               T.stringof);
        size_t i = 0;
        return decode(a, i);
    }
We rip out that front() entirely. The result is *not* technically a range...yet! We could call it a protorange.

Then we provide two functions:

    auto decode(someStringProtoRange) {...}
    auto raw(someStringProtoRange) {...}

These convert the protoranges into actual ranges by adding the missing front() function. The 'decode' adds a front() which decodes into dchar, while the 'raw' adds a front() which simply returns the raw underlying type. I imagine the decode/raw would probably also handle any "length" property (if it exists in the protorange) accordingly.

This way, the user is forced to specify "myStringRange.decode" or "myStringRange.raw" as appropriate, otherwise myStringRange can't be used since it isn't technically a range, only a protorange.

(Naturally, ranges of dchar would always have front, since no decoding is ever needed for them anyway. For these ranges, the decode/raw funcs above would simply be no-ops.)
Strings can be iterated over by code unit, code point, grapheme, grapheme cluster (?), words, sentences, lines, paragraphs, and potentially other things. Therefore, it makes sense to require the same for ranges of dchar, too.

Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw` and `decode`, to match the already existing `byGrapheme` in std.uni.
Mar 09 2014
next sibling parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Sunday, 9 March 2014 at 13:08:05 UTC, Marc Schütz wrote:
 Also, `byCodeUnit` and `byCodePoint` would probably be better 
 names than `raw` and `decode`, to match the already existing 
 `byGrapheme` in std.uni.
There already is a std.uni.byCodePoint. It is a higher order range that accepts ranges of graphemes and ranges of code points (such as strings). `byCodeUnit` is essentially std.string.representation.
Mar 09 2014
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 6:34 AM, Jakob Ovrum wrote:
 On Sunday, 9 March 2014 at 13:08:05 UTC, Marc Schütz wrote:
 Also, `byCodeUnit` and `byCodePoint` would probably be better names
 than `raw` and `decode`, to match the already existing `byGrapheme` in
 std.uni.
There already is a std.uni.byCodePoint. It is a higher order range that accepts ranges of graphemes and ranges of code points (such as strings).
noice
 `byCodeUnit` is essentially std.string.representation.
Actually not, because for reasons that are unclear to me people really want the individual type to be char, not ubyte.

Andrei
Mar 09 2014
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/9/2014 1:26 PM, Andrei Alexandrescu wrote:
 On 3/9/14, 6:34 AM, Jakob Ovrum wrote:

 `byCodeUnit` is essentially std.string.representation.
Actually not because for reasons that are unclear to me people really want the individual type to be char, not ubyte.
Probably because char *is* D's type for UTF-8 code units.
Mar 09 2014
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/9/2014 6:34 AM, Jakob Ovrum wrote:
 `byCodeUnit` is essentially std.string.representation.
Not at all. std.string.representation takes a string and casts it to the corresponding ubyte, ushort, or uint string. It doesn't work at all with InputRange!char.
Mar 09 2014
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm gmx.net> wrote:
 Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw`
 and `decode`, to match the already existing `byGrapheme` in std.uni.
I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc.
Mar 09 2014
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/9/2014 6:31 PM, Walter Bright wrote:
 On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm gmx.net> wrote:
 Also, `byCodeUnit` and `byCodePoint` would probably be better names
 than `raw`
  and `decode`, to match the already existing `byGrapheme` in std.uni.
I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc.
'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else:

    string str;
    wstring wstr;
    dstring dstr;

    (str|wchar|dchar).byChar  // Always range of char
    (str|wchar|dchar).byWchar // Always range of wchar
    (str|wchar|dchar).byDchar // Always range of dchar

    str.representation  // Range of ubyte
    wstr.representation // Range of ushort
    dstr.representation // Range of uint

    str.byCodeUnit  // Range of char
    wstr.byCodeUnit // Range of wchar
    dstr.byCodeUnit // Range of dchar
Mar 09 2014
next sibling parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/10/2014 12:19 AM, Nick Sabalausky wrote:
 (str|wchar|dchar).byChar  // Always range of char
 (str|wchar|dchar).byWchar // Always range of wchar
 (str|wchar|dchar).byDchar // Always range of dchar
Erm, naturally I meant "(str|wstr|dstr)"
Mar 09 2014
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/9/2014 9:19 PM, Nick Sabalausky wrote:
 On 3/9/2014 6:31 PM, Walter Bright wrote:
  On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm gmx.net> wrote:
 Also, `byCodeUnit` and `byCodePoint` would probably be better names
 than `raw`
  and `decode`, to match the already existing `byGrapheme` in std.uni.
I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc.
'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else:

    string str;
    wstring wstr;
    dstring dstr;

    (str|wchar|dchar).byChar  // Always range of char
    (str|wchar|dchar).byWchar // Always range of wchar
    (str|wchar|dchar).byDchar // Always range of dchar

    str.representation  // Range of ubyte
    wstr.representation // Range of ushort
    dstr.representation // Range of uint

    str.byCodeUnit  // Range of char
    wstr.byCodeUnit // Range of wchar
    dstr.byCodeUnit // Range of dchar
I don't see much point to the latter 3.
Mar 09 2014
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/10/2014 12:23 AM, Walter Bright wrote:
 On 3/9/2014 9:19 PM, Nick Sabalausky wrote:
 On 3/9/2014 6:31 PM, Walter Bright wrote:
  On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm gmx.net> wrote:
 Also, `byCodeUnit` and `byCodePoint` would probably be better names
 than `raw`
  and `decode`, to match the already existing `byGrapheme` in std.uni.
I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc.
'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else:

    string str;
    wstring wstr;
    dstring dstr;

    (str|wchar|dchar).byChar  // Always range of char
    (str|wchar|dchar).byWchar // Always range of wchar
    (str|wchar|dchar).byDchar // Always range of dchar

    str.representation  // Range of ubyte
    wstr.representation // Range of ushort
    dstr.representation // Range of uint

    str.byCodeUnit  // Range of char
    wstr.byCodeUnit // Range of wchar
    dstr.byCodeUnit // Range of dchar
I don't see much point to the latter 3.
Do you mean:

1. You don't see the point to iterating by code unit?

2. You don't see the point to 'byCodeUnit' if we have 'representation'?

3. You don't see the point to 'byCodeUnit' if we have 'byChar/byWchar/byDchar'?

4. You don't see the point to having 'byCodeUnit' work on UTF-32 dstrings?

Responses:

1. Iterating by code unit: Useful for tweaking performance anytime decoding is unnecessary. For example, parsing a grammar where the bulk of the keywords and operators are ASCII. (Occasional uses of Unicode, like unicode whitespace, can of course be handled easily enough by the lexer FSM.)

2. 'byCodeUnit' if we have 'representation': This one I have trouble answering since I'm still unclear on the purpose of 'representation' (I wasn't even aware of it until a few days ago.) I've been assuming there's some specific use-case I've overlooked where it's useful to iterate by code unit *while* treating the code units as if they weren't UTF-8/16/32 at all. But since 'representation' is called *on* a string/wstring/dstring, they should already be UTF-8/16/32 anyway, not some other encoding that would necessitate using integer types. Or maybe it's just for working around problems with the auto-verification being too eager (I've run into those)? I admit I don't quite get 'representation'.

3. 'byCodeUnit' if we have 'byChar/byWchar/byDchar': To avoid a "static if" chain every time you want to use code units inside generic code; a sketch follows below. Also, so in non-generic code you can change your data type without updating instances of 'by*char'.

4. Having 'byCodeUnit' work on UTF-32 dstrings: So generic code working on code units doesn't have to special-case UTF-32.
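The "static if" chain in response 3 might look like this (an illustrative sketch; byChar/byWchar/byDchar exist in present-day std.utf):

    import std.range : ElementEncodingType;
    import std.traits : Unqual;
    import std.utf : byChar, byDchar, byWchar;

    // Without a single byCodeUnit entry point, generic code must
    // dispatch on the string's code-unit type by hand:
    auto unitRange(S)(S s)
    {
        alias E = Unqual!(ElementEncodingType!S);
        static if (is(E == char))       return s.byChar;
        else static if (is(E == wchar)) return s.byWchar;
        else                            return s.byDchar;
    }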
Mar 10 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/10/2014 12:09 AM, Nick Sabalausky wrote:
 On 3/10/2014 12:23 AM, Walter Bright wrote:
 On 3/9/2014 9:19 PM, Nick Sabalausky wrote:
 On 3/9/2014 6:31 PM, Walter Bright wrote:
 On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm gmx.net> wrote:
 Also, `byCodeUnit` and `byCodePoint` would probably be better names
 than `raw`
 and `decode`, to match the already existing `byGrapheme` in std.uni.
I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc.
'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely
different from anything else:

    string str; wstring wstr; dstring dstr;

    (str|wstr|dstr).byChar    // Always range of char
    (str|wstr|dstr).byWchar   // Always range of wchar
    (str|wstr|dstr).byDchar   // Always range of dchar

    str.representation    // Range of ubyte
    wstr.representation   // Range of ushort
    dstr.representation   // Range of uint

    str.byCodeUnit    // Range of char
    wstr.byCodeUnit   // Range of wchar
    dstr.byCodeUnit   // Range of dchar
I don't see much point to the latter 3.
Do you mean:

1. You don't see the point to iterating by code unit?

2. You don't see the point to 'byCodeUnit' if we have 'representation'?

3. You don't see the point to 'byCodeUnit' if we have 'byChar/byWchar/byDchar'?

4. You don't see the point to having 'byCodeUnit' work on UTF-32 dstrings?
(3)
 3. 'byCodeUnit' if we have 'byChar/byWchar/byDchar': To avoid a "static if"
 chain every time you want to use code units inside generic code. Also, so in
 non-generic code you can change your data type without updating instances of
 'by*char'.
Just not sure I see a use for that.
Mar 10 2014
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 07-Mar-2014 06:37, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up about the
 automatic encoding and decoding of char ranges.

 Throughout D's history, there are regular and repeated proposals to
 redesign D's view of char[] to pretend it is not UTF-8, but UTF-32. I.e.
 so D will automatically generate code to decode and encode on every
 attempt to index char[].
...
 Is there any hope of fixing this?
Where were you when it was introduced? :) -- Dmitry Olshansky
Mar 07 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 2:27 AM, Dmitry Olshansky wrote:
 Where were you when it was introduced? :)
It slipped by me. What can I say? I'm not the only committer :-)

But after spending non-trivial time suffering as auto-decode blasted my
kingdom, I've concluded that it needs to die. Working around it is not easy.

I know that auto-decode has negatively impacted your regex, too.

Basically, auto-decode is like booking a flight from Seattle to San Francisco
with a plane change in Atlanta.
Mar 07 2014
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 07 Mar 2014 05:41:18 -0500, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/7/2014 2:27 AM, Dmitry Olshansky wrote:
 Where were you when it was introduced? :)
It slipped by me. What can I say? I'm not the only committer :-)
No, this is intrinsic to the problem of treating strings as ranges of dchar.
This one function is a symptom, not the problem.

-Steve
Mar 07 2014
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 07-Mar-2014 14:41, Walter Bright wrote:
 On 3/7/2014 2:27 AM, Dmitry Olshansky wrote:
 Where were you when it was introduced? :)
It slipped by me. What can I say? I'm not the only committer :-)

But after spending non-trivial time suffering as auto-decode blasted my
kingdom, I've concluded that it needs to die. Working around it is not easy.
That seems to be the biggest problem: it's an overriding default that is very
hard to "turn off" while retaining a nice, clear generic view of things.
 I know that auto-decode has negatively impacted your regex, too.
No, technically, I knew what I was doing and that decode call was explicit. It
just turned out to set a bar on the minimum time budget for doing X with a
string, and that bar is too high. What really got nasty was the repeated
re-decoding of the same piece as the engine backtracked to try earlier
alternatives. -- Dmitry Olshansky
Mar 07 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 11:43 AM, Dmitry Olshansky wrote:
 No, technically, I knew what I was doing and that decode call was explicit.
Ah right, I misremembered. Thanks for the correction.
Mar 07 2014
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 02:37:11 UTC, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up 
 about the automatic encoding and decoding of char ranges.

 Throughout D's history, there are regular and repeated 
 proposals to redesign D's view of char[] to pretend it is not 
 UTF-8, but UTF-32. I.e. so D will automatically generate code 
 to decode and encode on every attempt to index char[].
I'm glad I'm not the only one who feels this way. Implicit decoding must die.

I strongly believe that implicit decoding to code points in std.range has been
a mistake.

- Algorithms such as "countUntil" will count code points. These numbers are
useless for slicing, and can introduce hard-to-find bugs.

- In lots of places, I've discovered that Phobos did UTF decoding (thus
murdering performance) when it didn't need to. Such cases included format (now
fixed), appender (now fixed), startsWith (now fixed - recently), and skipOver
(still unfixed). These have caused latent bugs in my programs that happened to
be fed non-UTF data. There's no reason why D should fail on non-UTF data if it
has no reason to decode it in the first place! These failures have only served
to identify places in Phobos where redundant decoding was occurring.

Furthermore, it doesn't actually solve anything completely! The only thing it
solves is a subset of cases for a subset of languages!

People want to look at a string "character by character". If a Unicode code
point is a character in your language and alphabet, I'm really happy for you,
but that's not how it is for everyone. Combining marks, complex scripts, etc.
make this notion a fallacy that in the end will cause programmers to make
mistakes that will affect certain users somewhere.

Why do people want to look at individual characters? There are a lot of
misconceptions about Unicode, and I think some of that applies here.

- Do you want to split a string by whitespace? Some languages have no notion
of whitespace. What do you need it for? Line wrapping? Employ the Unicode
line-breaking algorithm instead.

- Do you want to uppercase the first letter of a string? Some languages have
no notion of letter case, and some use it for different reasons. Furthermore,
even languages with a Latin-based alphabet may not have a 1:1 mapping for
case, e.g. the German ß letter.

- Do you want to count how wide a string will be in a fixed-width font?
Wrong... Combining and control characters, zero-width whitespace, etc. will
render this approach futile.

- Do you want to split or flush a stream to a character device at a point so
that there's no garbage? I believe this is the case in TDPL's mention of the
subject. Again, combining characters or complex scripts will still be broken
by this approach.

You need to either go all-out and provide complete implementations of the
relevant Unicode algorithms, so that tasks such as the above work in all
locales, or you need to draw a line somewhere for which languages, alphabets
and locales you want to support in your program. D's line is drawn at the
point where it considers that code points == characters; however, the outcome
of this is clear nowhere in its documentation, and for such an arbitrary
decision (from a cultural point of view), it is embedded too deeply into the
language itself.

With std.ascii, at least, it's clear to the user that the functions there will
only work with English or languages using the same alphabet. This doesn't
apply universally. There are still cases like, e.g., regular expression
ranges: [a-z] makes sense in English, and [а-я] makes sense in Russian, but I
don't think that makes sense for all languages.

However, for the most part, I think implicit decoding must be axed, and
instead we need implementations of the Unicode algorithms and documentation to
instruct users why and how to use them.
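To make the first bullet concrete, a small (illustrative) example of the
slicing hazard:

    import std.algorithm : countUntil;

    void main()
    {
        auto s = "日本語!";
        auto i = s.countUntil('!');  // 3 - code points, not code units
        // The three leading characters occupy 9 code units in UTF-8, so
        // using i to slice the original array picks the wrong position:
        assert(s[i .. $] != "!");
        assert(s[9 .. $] == "!");
    }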
Mar 07 2014
next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 3/7/14, Vladimir Panteleev <vladimir thecybershadow.net> wrote:
 - Do you want to split a string by whitespace?
 - Do you want to uppercase the first letter of a string?
 - Do you want to count how wide a string will be in a fixed-point
 font?
 - Do you want to split or flush a stream to a character device at
 a point so that there's no garbage?
We could later make a page on dlang (or the wiki) describing how to do these common things.
Mar 07 2014
prev sibling next sibling parent Robert Schadek <realburner gmx.de> writes:
On 03/07/2014 12:56 PM, Vladimir Panteleev wrote:
 I'm glad I'm not the only one who feels this way. Implicit decoding
 must die.

 I strongly believe that implicit decoding of character points in
 std.range has been a mistake.

 - Algorithms such as "countUntil" will count code points. These
 numbers are useless for slicing, and can introduce hard-to-find bugs.
+1 see my pull requests for std.string: https://github.com/D-Programming-Language/phobos/pull/1952 https://github.com/D-Programming-Language/phobos/pull/1977
Mar 07 2014
prev sibling parent reply "ponce" <contact gam3sfrommars.fr> writes:
 - In lots of places, I've discovered that Phobos did UTF 
 decoding (thus murdering performance) when it didn't need to. 
 Such cases included format (now fixed), appender (now fixed), 
 startsWith (now fixed - recently), skipOver (still unfixed). 
 These have caused latent bugs in my programs that happened to 
 be fed non-UTF data. There's no reason for why D should fail on 
 non-UTF data if it has no reason to decode it in the first 
 place! These failures have only served to identify places in 
 Phobos where redundant decoding was occurring.
With all due respect, the D string type is exclusively for UTF-8 strings. If
it is not valid UTF-8, it should never have been a D string in the first
place. For the other cases, ubyte[] is there.
Mar 09 2014
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 12:24:11 UTC, ponce wrote:
 - In lots of places, I've discovered that Phobos did UTF 
 decoding (thus murdering performance) when it didn't need to. 
 Such cases included format (now fixed), appender (now fixed), 
 startsWith (now fixed - recently), skipOver (still unfixed). 
 These have caused latent bugs in my programs that happened to 
 be fed non-UTF data. There's no reason for why D should fail 
 on non-UTF data if it has no reason to decode it in the first 
 place! These failures have only served to identify places in 
 Phobos where redundant decoding was occurring.
With all due respect, the D string type is exclusively for UTF-8 strings. If
it is not valid UTF-8, it should never have been a D string in the first
place. For the other cases, ubyte[] is there.
This is an arbitrary self-imposed limitation caused by the choice in how strings are handled in Phobos.
Mar 09 2014
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/9/2014 11:21 AM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 12:24:11 UTC, ponce wrote:
 - In lots of places, I've discovered that Phobos did UTF decoding
 (thus murdering performance) when it didn't need to. Such cases
 included format (now fixed), appender (now fixed), startsWith (now
 fixed - recently), skipOver (still unfixed). These have caused latent
 bugs in my programs that happened to be fed non-UTF data. There's no
 reason for why D should fail on non-UTF data if it has no reason to
 decode it in the first place! These failures have only served to
 identify places in Phobos where redundant decoding was occurring.
With all due respect, the D string type is exclusively for UTF-8 strings. If
it is not valid UTF-8, it should never have been a D string in the first
place. For the other cases, ubyte[] is there.
This is an arbitrary self-imposed limitation caused by the choice in how strings are handled in Phobos.
Yea, I've had problems before - completely unnecessary problems that were *not* helpful or indicative of latent bugs - which were a direct result of Phobos being overly pedantic and eager about UTF validation. And yet the implicit UTF validation has never actually *helped* me in any way.
Mar 09 2014
parent reply "ponce" <contact gam3sfrommars.fr> writes:
On Sunday, 9 March 2014 at 21:14:30 UTC, Nick Sabalausky wrote:
 With all due respect, the D string type is exclusively for UTF-8
 strings.
 If it is not valid UTF-8, it should never have been a D string
 in the
 first place. For the other cases, ubyte[] is there.
This is an arbitrary self-imposed limitation caused by the choice in how strings are handled in Phobos.
Yea, I've had problems before - completely unnecessary problems that were *not* helpful or indicative of latent bugs - which were a direct result of Phobos being overly pedantic and eager about UTF validation. And yet the implicit UTF validation has never actually *helped* me in any way.
 self-imposed limitation
For the greater good. I find this article very telling about why strings
should be converted to UTF-8 as often as possible:

http://www.utf8everywhere.org/

I agree 100% with its content; it's impossibly hard to handle encodings sanely
on Windows (even more so in a team) without following the drastic rules the
article lays out. This happens to be what Phobos gently mandates; UTF
validation is certainly the lesser evil compared to the mess that everything
becomes without it.

How is mandating valid UTF-8 overly pedantic? It is the sanest behaviour. Just
use sanitizeUTF8 (http://vibed.org/api/vibe.utils.string/sanitizeUTF8) or
equivalent.
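For readers without vibe.d at hand, an "equivalent" can be sketched with
std.utf alone (this is not vibe.d's actual implementation):

    import std.utf : decode, UTFException;

    // Replace each invalid UTF-8 sequence with U+FFFD (the replacement
    // character) so that the result is always valid UTF-8.
    string sanitize(const(char)[] s)
    {
        string result;
        size_t i = 0;
        while (i < s.length)
        {
            immutable start = i;
            try
            {
                decode(s, i);            // advances i on success
                result ~= s[start .. i];
            }
            catch (UTFException)
            {
                result ~= '\uFFFD';      // appended as three code units
                i = start + 1;           // skip the offending byte
            }
        }
        return result;
    }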
Mar 10 2014
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/10/2014 6:21 AM, ponce wrote:
 On Sunday, 9 March 2014 at 21:14:30 UTC, Nick Sabalausky wrote:
 Yea, I've had problems before - completely unnecessary problems that
 were *not* helpful or indicative of latent bugs - which were a direct
 result of Phobos being overly pedantic and eager about UTF validation.
 And yet the implicit UTF validation has never actually *helped* me in
 any way.
 self-imposed limitation
 For the greater good. I find this article very telling about why strings
 should be converted to UTF-8 as often as possible:
 http://www.utf8everywhere.org/
 I agree 100% with its content; it's impossibly hard to handle encodings
 sanely on Windows (even more so in a team) without following the drastic
 rules the article lays out.
I may have missed it, but I don't see where it says anything about validation
or immediate sanitization of invalid sequences. It's mostly "UTF-16 sucks and
so does Windows" (not that I'm necessarily disagreeing with it). (ot: Kinda
wish they hadn't used such a hard-to-read font...)
Mar 10 2014
parent "ponce" <contact gam3sfrommars.fr> writes:
On Monday, 10 March 2014 at 11:04:43 UTC, Nick Sabalausky wrote:
 I may have missed it, but I don't see where it says anything 
 about validation or immediate sanitation of invalid sequences. 
 It's mostly "UTF-16 sucks and so does Windows" (not that I'm 
 necessarily disagreeing with it). (ot: Kinda wish they hadn't 
 used such a hard to read font...)
I should have highlighted it: their recommendations for proper encoding
handling on Windows are in section 5 ("How to do text on Windows"). One of
them is "std::strings and char*, anywhere in the program, are considered
UTF-8 (if not said otherwise)."

I find it interesting that D tends to enforce this lesson, learned from
mixed-encoding codebases.
Mar 10 2014
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 06 Mar 2014 21:37:13 -0500, Walter Bright  
<newshound2 digitalmars.com> wrote:

 Is there any hope of fixing this?
Yes: make D strings not char arrays, but a library-defined struct with an
array as backing.

    auto x = "...";

compiles to

    auto x = string(cast(immutable(char)[])"...");

Then define string to be whatever kind of range you want in the library, with
whatever functionality you want. Then if you want by-char traversal,
explicitly use immutable(char)[] as x's type. And in the string range's
members, we can provide whatever access we want.

Note, this also fixes foreach, and many other problems we have. Most likely,
code that works today will continue to work, since it's much more of a bear to
type immutable(char)[] instead of string :)

-Steve
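A minimal sketch of that idea (all names hypothetical; the decoded view is
included just to show the flavor):

    import std.utf : decode;

    struct String
    {
        immutable(char)[] data;  // the backing array

        // Default iteration: code units, no decoding.
        @property bool empty() const { return data.length == 0; }
        @property char front() const { return data[0]; }
        void popFront() { data = data[1 .. $]; }

        // Opt-in decoded view.
        @property auto byDchar() const
        {
            struct ByDchar
            {
                immutable(char)[] s;
                @property bool empty() const { return s.length == 0; }
                @property dchar front()
                {
                    size_t i = 0;
                    return decode(s, i);  // decode the leading code point
                }
                void popFront()
                {
                    size_t i = 0;
                    decode(s, i);         // advance past one code point
                    s = s[i .. $];
                }
            }
            return ByDchar(data);
        }
    }

With this, foreach (c; String("héllo")) walks code units, while
foreach (d; String("héllo").byDchar) decodes - both as explicit choices.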
Mar 07 2014
prev sibling next sibling parent reply "Dicebot" <public dicebot.lv> writes:
I don't like it at all.

1) It is a huge breakage and you have been refusing to do one 
even for more important problems. What brought about this sudden
change of mind?

2) It is a regression back to the C++ days of
no-one-cares-about-Unicode pain. Thinking about strings as 
character arrays is so natural and convenient that if 
language/Phobos won't punish you for that, it will be extremely 
widespread.

Rendering correctness is very application-specific but providing 
basic guarantees that string is not completely broken is useful.

Now real problems I see:

1) stuff like readText() returns char[] instead of requiring 
explicit default encoding

2) lack of convenient .raw property which will effectively do 
cast(ubyte[])

3) the fact that std.string always assumes Unicode and never
forwards to std.ascii for ubyte[]
Mar 07 2014
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
 I don't like it at all.

 1) It is a huge breakage
Can we look at some example situations that this will break?
 and you have been refusing to do one even for more important 
 problems.
This is a fallacy.
 2) It is a regression back to the C++ days of
 no-one-cares-about-Unicode pain. Thinking about strings as 
 character arrays is so natural and convenient that if 
 language/Phobos won't punish you for that, it will be extremely 
 widespread.
Thinking about dstrings as character arrays is less flawed only to a certain extent.
 Now real problems I see:

 1) stuff like readText() returns char[] instead of requiring 
 explicit default encoding

 2) lack of convenient .raw property which will effectively do 
 cast(ubyte[])

 3) the fact that std.string always assumes Unicode and never
 forwards to std.ascii for ubyte[]
I think these are fixable without breaking anything? So why not go for it? The first two sound trivial (.raw can be an UFCS property).
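Indeed, a first cut at such a .raw property is a one-liner (hypothetical name,
sketch only):

    // UFCS property: reinterpret a char array as its raw bytes.
    @property inout(ubyte)[] raw(inout(char)[] s) @trusted
    {
        return cast(inout(ubyte)[]) s;
    }

    unittest
    {
        string s = "cass\u00E9";    // "cassé", precomposed
        assert(s.raw.length == 6);  // code units, not code points
    }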
Mar 07 2014
parent reply "Dicebot" <public dicebot.lv> writes:
On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
 I don't like it at all.

 1) It is a huge breakage
Can we look at some example situations that this will break?
Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string?
 and you have been refusing to do one even for more important 
 problems.
This is a fallacy.
Ok :)
 2) It is a regression back to the C++ days of
 no-one-cares-about-Unicode pain. Thinking about strings as 
 character arrays is so natural and convenient that if 
 language/Phobos won't punish you for that, it will be 
 extremely widespread.
Thinking about dstrings as character arrays is less flawed only to a certain extent.
Sure. But I find this extent practical enough to make the difference. It is a
good compromise between perfectly correct (and very slow) string processing
and having your program unusable with anything but the basic Latin symbol set.
 Now real problems I see:

 1) stuff like readText() returns char[] instead of requiring 
 explicit default encoding

 2) lack of convenient .raw property which will effectively do 
 cast(ubyte[])

 3) the fact that std.string always assumes Unicode and never
 forwards to std.ascii for ubyte[]
I think these are fixable without breaking anything? So why not go for it? The first two sound trivial (.raw can be an UFCS property).
(1) will likely require deprecation (== breakage) of the old interface, but
yes, those are relatively trivial. It just has not been important enough for
me to spend time pushing it. Still struggling to finish my template argument
list proposal :(
Mar 07 2014
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
 On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev 
 wrote:
 Can we look at some example situations that this will break?
Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string?
This is a pretty fragile design in the first place, since we use the same
basic type (integers) to count two different things (code units / code
points). Code that relies on this behavior would need to be explicitly tested
with Unicode data to be sure that it works correctly - if it's only tested
with ASCII, it will merely appear at a glance to work right.

Correct code where these indices never left the equation will not be affected,
e.g.:

    auto s = "日本語";
    auto x = s.countUntil("本語"); // was 1, will be 3
    s = s.drop(x);
    assert(s == "本語"); // still OK
 Thinking about dstrings as character arrays is less flawed 
 only to a certain extent.
Sure. But I find this extent practical enough to make the difference. It is a
good compromise between perfectly correct (and very slow) string processing
and having your program unusable with anything but the basic Latin symbol set.
I think that if we are to draw a line somewhere on what to support and what
not, the decision should not be embedded so deeply in the language. Ideally,
it would be clearly visible in the code that you are counting code points.
Mar 07 2014
next sibling parent reply "Dicebot" <public dicebot.lv> writes:
On Friday, 7 March 2014 at 17:04:30 UTC, Vladimir Panteleev wrote:
 I think that if we are to draw a line somewhere on what to
 support and what not, the decision should not be embedded so
 deeply in the language. Ideally, it would be clearly visible in the
 code that you are counting code points.
Well, if you consider really breaking changes, simply prohibiting plain random
access to char[] and forcing the use of either .raw or .decode is one thing
I'd love to see (with .byGrapheme as the library cherry on top)
Mar 07 2014
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Mar 07, 2014 at 05:08:02PM +0000, Dicebot wrote:
 On Friday, 7 March 2014 at 17:04:30 UTC, Vladimir Panteleev wrote:
I think that if we are to draw a line somewhere on what to support
and not, the decision should not be embedded as deep into the
language. Ideally, it would be clearly visible in the code that
you are counting code points.
Well, if you consider really breaking changes, simply prohibiting plain random
access to char[] and forcing the use of either .raw or .decode is one thing
I'd love to see (with .byGrapheme as the library cherry on top)
I don't understand what advantage this would bring. T -- Frank disagreement binds closer than feigned agreement.
Mar 07 2014
parent "Dicebot" <public dicebot.lv> writes:
On Friday, 7 March 2014 at 17:39:41 UTC, H. S. Teoh wrote:
 Well, if you consider really breaking changes, simply prohibiting
 plain random access to char[] and forcing the use of either .raw or
 .decode is one thing I'd love to see (with .byGrapheme as the library
 cherry on top)
I don't understand what advantage this would bring.
Making sure that whatever interpretation the programmer chooses, it is
actually a conscious choice, and he does not hold any false illusions.
Mar 07 2014
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 9:04 AM, Vladimir Panteleev wrote:
 Ideally, it would be
 clearly visible in the code that you are counting code points.
Yes.
Mar 07 2014
prev sibling parent reply "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
 On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev
 Can we look at some example situations that this will break?
Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string?
This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead.
Mar 09 2014
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 13:51:12 UTC, Marc Schütz wrote:
 On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
 On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev
 Can we look at some example situations that this will break?
Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string?
This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead.
Why? There's no reason why dchar[] would stop being a range. It will be treated as now, like any other array.
Mar 09 2014
parent "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Sunday, 9 March 2014 at 15:23:57 UTC, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 13:51:12 UTC, Marc Schütz wrote:
 On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
 On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev
 Can we look at some example situations that this will break?
Any code that relies on countUntil to count dchar's? Or, to generalize, almost any code that uses std.algorithm functions with string?
This would no longer compile, as dchar[] stops being a range. countUntil(range.byCodePoint) would have to be used instead.
Why? There's no reason why dchar[] would stop being a range. It will be treated as now, like any other array.
This was under the assumption that Nick's proposal (and my "amendment" to
extend it to dchar because of graphemes etc.) would be implemented. But I made
the mistake of replying to posts as I read them, only to notice a few posts
later that someone else had already posted something to the same effect, or
had made my point irrelevant. Sorry for the confusion.
Mar 09 2014
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/7/2014 7:03 AM, Dicebot wrote:
 1) It is a huge breakage and you have been refusing to do one even for more
 important problems. What brought about this sudden change of mind?
1. Performance Performance Performance

2. The current behavior is surprising (it sure surprised me, I didn't notice
it until I looked at the assembler to figure out why the performance sucked)

3. Weirdnesses like ElementEncodingType

4. Strange behavior differences between char[], char*, and InputRange!char
types

5. Funky anomalous issues with writing OutputRange!char (the put(T) must take
a dchar; see the sketch below)
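To illustrate point 5: even a sink whose natural element is char has to accept
dchar, because generic code that walked a char range hands back decoded code
points. A sketch (the sink type is made up):

    import std.utf : encode;

    struct CharSink
    {
        char[] buf;

        // put must take dchar: callers re-encode decoded code points here.
        void put(dchar c)
        {
            char[4] tmp;
            immutable len = encode(tmp, c);  // back to UTF-8
            buf ~= tmp[0 .. len];
        }
    }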
 2) lack of convenient .raw property which will effectively do cast(ubyte[])
I've done the cast as a workaround, but when working with generic code it
turns out the ubyte type becomes viral - you have to use it everywhere. So all
over the place you're having casts between ubyte <=> char in unexpected
places. You also wind up with ugly ubyte <=> dchar casts, with the
commensurate risk that you goofed and have a truncation bug.

Essentially, the auto-decode makes trivial code look better, but if you're
writing a more comprehensive string processing program, and care about
performance, it makes a regular ugly mess of things.
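For reference, the workaround in question looks roughly like this (assuming
std.string's representation/assumeUTF pair); the ubyte view then spreads to
every signature that touches it, which is the virality being described:

    import std.algorithm : countUntil;
    import std.string : assumeUTF, representation;

    void main()
    {
        string s = "cass\u00E9";
        auto bytes = s.representation;   // immutable(ubyte)[], no decoding
        auto i = bytes.countUntil('s');  // a byte index, safe for slicing
        assert(i == 2 && bytes[i] == 's');
        string t = bytes.assumeUTF;      // back to the char view, no copy
        assert(t is s);
    }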
Mar 07 2014
parent reply "Dicebot" <public dicebot.lv> writes:
On Friday, 7 March 2014 at 19:43:57 UTC, Walter Bright wrote:
 On 3/7/2014 7:03 AM, Dicebot wrote:
 1) It is a huge breakage and you have been refusing to do one 
 even for more
 important problems. What brought about this sudden change of mind?
1. Performance Performance Performance
Not important enough. D has always been a "safe by default, fast when asked
to" language, not the other way around. There is no fundamental performance
problem here, only a lack of knowledge about Phobos.
 2. The current behavior is surprising (it sure surprised me, I 
 didn't notice it until I looked at the assembler to figure out 
 why the performance sucked)
That may imply that better documentation is needed. You were only surprised
because of a wrong initial assumption about what the `char[]` type means.
 3. Weirdnesses like ElementEncodingType
ElementEncodingType is extremely annoying, but I think it is just a side
effect of a bigger problem: how string algorithms are currently handled. It
does not need to be that way.
 4. Strange behavior differences between char[], char*, and 
 InputRange!char types
Again, there is nothing strange about it. `char[]` is a special type with
special semantics that is defined in the documentation, and it consistently
follows that definition in everything but raw array indexing/slicing (which is
what I find unfortunate, but also beyond any feasible fix).
 5. Funky anomalous issues with writing OutputRange!char (the 
 put(T) must take a dchar)
Bad but not worth even a small breaking change.
 2) lack of convenient .raw property which will effectively do 
 cast(ubyte[])
I've done the cast as a workaround, but when working with generic code it turns out the ubyte type becomes viral - you have to use it everywhere. So all over the place you're having casts between ubyte <=> char in unexpected places. You also wind up with ugly ubyte <=> dchar casts, with the commensurate risk that you goofed and have a truncation bug.
Of course it is viral, because you never ever want to have char[] at all if
you don't work with Unicode (or work with it on the raw byte level). And in
that case it is your responsibility to do manual decoding when appropriate.
Trying to squeeze out that performance often means going low-level, with all
the associated risks; there is nothing special about char[] here. It is not a
common use case.
 Essentially, the auto-decode makes trivial code look better, 
 but if you're writing a more comprehensive string processing 
 program, and care about performance, it makes a regular ugly 
 mess of things.
And this is how it should be. Again, I am all for creating a language that
favors performance-critical power-programming needs over common/casual needs,
but that is not what D is, and you have been making such choices consistently
for quite a long time now (array literals that allocate, I will never forgive
that).

Suddenly changing your mind only because you have encountered this specific
issue personally, as opposed to just getting reports of it, does not fit the
role of a language author. It does not really matter whether the new approach
itself is good or bad; being unpredictable is reputation damage D simply can't
afford.
Mar 10 2014
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/10/2014 6:47 AM, Dicebot wrote:
 (array literals that allocate, I will never forgive that).
It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature.
Mar 10 2014
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 3/10/2014 6:47 AM, Dicebot wrote:
 (array literals that allocate, I will never forgive that).
It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature.
I think you forget about this:

    void foo(int v, int w)
    {
        auto x = [v, w];
    }

which cannot pre-allocate.

That said, I would not mind if this code broke and you had to use array(v, w)
instead, for the sake of avoiding unnecessary allocations.

-Steve
Mar 10 2014
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/10/14, 7:07 PM, Steven Schveighoffer wrote:
 On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 3/10/2014 6:47 AM, Dicebot wrote:
 (array literals that allocate, I will never forgive that).
It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature.
I think you forget about this:

    void foo(int v, int w)
    {
        auto x = [v, w];
    }

which cannot pre-allocate.
It actually can, seeing as x is a dead assignment :o).
 That said, I would not mind if this code broke and you had to use
 array(v, w) instead, for the sake of avoiding unnecessary allocations.
Fixing that:

    int[] foo(int v, int w)
    {
        return [v, w];
    }

This one would allocate. But analyses of varying complexity may eliminate a
variety of allocation patterns.

Andrei
Mar 10 2014
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 10 Mar 2014 22:56:22 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/10/14, 7:07 PM, Steven Schveighoffer wrote:
 On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 On 3/10/2014 6:47 AM, Dicebot wrote:
 (array literals that allocate, I will never forgive that).
It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature.
I think you forget about this:

    void foo(int v, int w)
    {
        auto x = [v, w];
    }

which cannot pre-allocate.
It actually can, seeing as x is a dead assignment :o).
Actually, it can't do anything, seeing as it's invalid code ;)
 That said, I would not mind if this code broke and you had to use
 array(v, w) instead, for the sake of avoiding unnecessary allocations.
Fixing that:

    int[] foo(int v, int w)
    {
        return [v, w];
    }

This one would allocate. But analyses of varying complexity may eliminate a
variety of allocation patterns.
I think you are missing what I'm saying: I don't want the allocation
eliminated, but if we eliminate some allocations with [] and not others, it
will be confusing. The path I'd always hoped we would go down was to make all
array literals immutable, and make allocation of mutable arrays on the heap
explicit.

Adding eliding of some allocations as an optimization is good, but I (and I
think possibly Dicebot) think all array literals should not allocate.

-Steve
Mar 10 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/10/14, 8:05 PM, Steven Schveighoffer wrote:
 I think you are missing what I'm saying, I don't want the allocation
 eliminated, but if we eliminate some allocations with [] and not others,
 it will be confusing. The path I'd always hoped we would go in was to
 make all array literals immutable, and make allocation of mutable arrays
 on the heap explicit.

 Adding eliding of some allocations for optimization is good, but I (and
 I think possibly Dicebot) think all array literals should not allocate.
I think so too. But that's irrelevant, because arrays do allocate (or at least
behave as if they did), and that's how the cookie crumbles.

D is a wonderful language, and it is getting better literally by the day.
There is a lot more to be gained from using it in new and interesting ways
than from brooding over its inevitable imperfections.

Andrei
Mar 10 2014
prev sibling parent "Sean Kelly" <sean invisibleduck.org> writes:
On Tuesday, 11 March 2014 at 02:07:19 UTC, Steven Schveighoffer 
wrote:
 On Mon, 10 Mar 2014 19:59:07 -0400, Walter Bright 
 <newshound2 digitalmars.com> wrote:

 On 3/10/2014 6:47 AM, Dicebot wrote:
 (array literals that allocate, I will never forgive that).
It was done that way simply to get it up and running quickly. Having them not allocate is an optimization, it doesn't change the nature.
I think you forget about this:

    void foo(int v, int w)
    {
        auto x = [v, w];
    }

which cannot pre-allocate.
The array is small and does not escape. It could be allocated on the stack as an optimization.
Mar 11 2014
prev sibling parent reply "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
 2) It is a regression back to the C++ days of
 no-one-cares-about-Unicode pain. Thinking about strings as 
 character arrays is so natural and convenient that if 
 language/Phobos won't punish you for that, it will be extremely 
 widespread.
Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else.
Mar 09 2014
next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 13:47:26 UTC, Marc Schütz wrote:
 On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
 2) It is a regression back to the C++ days of
 no-one-cares-about-Unicode pain. Thinking about strings as 
 character arrays is so natural and convenient that if 
 language/Phobos won't punish you for that, it will be 
 extremely widespread.
Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else.
Andrei has made it clear that the code breakage this would involve would be unacceptable.
Mar 09 2014
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 6:47 AM, "Marc Schütz" <schuetzm gmx.net> wrote:
 On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
 2) It is a regression back to the C++ days of no-one-cares-about-Unicode
 pain. Thinking about strings as character arrays is so natural and
 convenient that if language/Phobos won't punish you for that, it will
 be extremely widespread.
Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else.
Such as giving up on that crappy language that keeps on breaking their code. Andrei
Mar 09 2014
parent reply "Dicebot" <public dicebot.lv> writes:
On Sunday, 9 March 2014 at 17:27:20 UTC, Andrei Alexandrescu 
wrote:
 On 3/9/14, 6:47 AM, "Marc Schütz" <schuetzm gmx.net> wrote:
 On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
 2) It is a regression back to the C++ days of
 no-one-cares-about-Unicode
 pain. Thinking about strings as character arrays is so 
 natural and
 convenient that if language/Phobos won't punish you for that, 
 it will
 be extremely widespread.
Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else.
Such as giving up on that crappy language that keeps on breaking their code. Andrei
That was more of an "if you are crazy enough to even consider such breakage,
this is closer to my personal idea of perfection" than an actual proposal ;)
Mar 10 2014
parent "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Monday, 10 March 2014 at 13:18:50 UTC, Dicebot wrote:
 On Sunday, 9 March 2014 at 17:27:20 UTC, Andrei Alexandrescu 
 wrote:
 On 3/9/14, 6:47 AM, "Marc Schütz" <schuetzm gmx.net> wrote:
 On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
 2) It is a regression back to the C++ days of
 no-one-cares-about-Unicode
 pain. Thinking about strings as character arrays is so 
 natural and
 convenient that if language/Phobos won't punish you for 
 that, it will
 be extremely widespread.
Not with Nick Sabalausky's suggestion to remove the implementation of front from char arrays. This way, everyone will be forced to decide whether they want code units or code points or something else.
Such as giving up on that crappy language that keeps on breaking their code. Andrei
 That was more of an "if you are crazy enough to even consider such breakage,
 this is closer to my personal idea of perfection" than an actual proposal ;)
BTW, I don't believe it would be that bad, because there's a straightforward
path of deprecation:

First, std.range.front for narrow strings (and dchar, for consistency) can be
marked as deprecated. The deprecation message can say: "Please specify
.byCodePoint()/.byCodeUnit()", guiding the users towards a better style
(assuming one agrees that explicit is indeed better than implicit in this
case).

After some time, the functionality can be moved into a compatibility module,
with the deprecated functions still there, but now additionally telling the
user about the quick fix of importing that module.

The deprecation period can be very long, and even if the functions are never
removed, at least everyone writing new code would do so in the new style.
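The first step could look something like this (a sketch; the message text,
signature and constraint are illustrative only):

    import std.traits : isNarrowString;
    import std.utf : decode;

    // Marking the decoding front as deprecated nudges callers to state
    // their intent explicitly instead of decoding implicitly.
    deprecated("please specify .byCodePoint or .byCodeUnit explicitly")
    @property dchar front(T)(T[] a) if (isNarrowString!(T[]))
    {
        size_t i = 0;
        return decode(a, i);
    }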
Mar 10 2014
prev sibling next sibling parent "Sean Kelly" <sean invisibleduck.org> writes:
I'm with Walter on this, and it's why I don't use char ranges. 
Though converting to ubyte feels weird.
Mar 07 2014
prev sibling next sibling parent "Chris" <wendlec tcd.ie> writes:
I only hope it won't break my code. It mainly deals with string /
character processing, and our project in D is now almost ready for
takeoff (at least for a beta flight). It deals with characters
like "é"; it is not dealing with English input. Hope the landing
will be soft!
Mar 07 2014
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/6/14, 6:37 PM, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up about the
 automatic encoding and decoding of char ranges.
[snip]
 Is there any hope of fixing this?
There's nothing to fix.

Allow me to enumerate the functions of std.algorithm and how they work today
and how they'd work with the proposed change. Let s be a variable of some
string type.

1. s.all!(x => x == 'é') currently works as expected. Proposed: fails silently.

2. s.any!(x => x == 'é') currently works as expected. Proposed: fails silently.

3. s.canFind!(x => x == 'é') currently works as expected. Proposed: fails
silently.

4. s.canFind('é') currently works as expected. Proposed: fails silently.

5. s.count() currently works as expected. Proposed: fails silently.

6. s.count!((a, b) => std.uni.toLower(a) == std.uni.toLower(b))("é") currently
works as expected (with the known issues of lowercase conversion). Proposed:
fails silently.

7. s.count('é') currently works as expected. Proposed: fails silently.

8. s.countUntil("a") currently works as expected. Proposed: fails silently.
This applies to all variations of countUntil.

9. s.endsWith('é') currently works as expected. Proposed: fails silently.

10. s.find('é') currently works as expected. Proposed: fails silently. This
applies to other variations of find that include custom predicates.

11. ...

I went down std.algorithm in the order listed in its documentation and found
pernicious issues with almost every single algorithm.

I designed the range behavior of strings after much thinking and consideration
back in the day when I designed std.algorithm. It was painfully obvious (but
it seems to have been forgotten now that it's working so well) that
approaching strings as arrays of char[] would break almost every single
algorithm, leaving us essentially in the pre-UTF C++aveman era.

Making strings bidirectional ranges has been a very good choice within the
constraints. There was already a string type, and that was immutable(char)[],
and a bunch of code depended on that definition.

Clearly one might argue that their app has no business dealing with
diacriticals or Asian characters. But that's the typical provincial view that
marred many languages' approach to UTF and internationalization. If you know
your string is ASCII, the remedy is simple - don't use char[] and friends.
From day 1, the type "char" was meant to mean "code unit of UTF characters".

So please ponder the above before going to do surgery on the patient -
surgery that's going to kill him.

Andrei
Mar 07 2014
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Mar 07, 2014 at 11:57:23AM -0800, Andrei Alexandrescu wrote:
 On 3/6/14, 6:37 PM, Walter Bright wrote:
In "Lots of low hanging fruit in Phobos" the issue came up about the
automatic encoding and decoding of char ranges.
[snip]
Is there any hope of fixing this?
There's nothing to fix.
:D I knew this was going to happen.
 Allow me to enumerate the functions of std.algorithm and how they
 work today and how they'd work with the proposed change. Let s be a
 variable of some string type.
 
 1.
 
 s.all!(x => x == 'é') currently works as expected. Proposed: fails silently.
 
 2.
 
 s.any!(x => x == 'é') currently works as expected. Proposed: fails silently.
 
 3.
 
 s.canFind!(x => x == 'é') currently works as expected. Proposed:
 fails silently.
 
 4.
 
 s.canFind('é') currently works as expected. Proposed: fails silently.
The problem is that the current implementation of this correct behaviour
leaves a lot to be desired in terms of performance. Ideally, you should not
need to decode every single character in s just to see if it happens to
contain 'é'. Rather, canFind, et al should convert the dchar literal 'é' into
a UTF-8 (resp. UTF-16) sequence and do a substring search instead. Decoding
every character in s, while correct, is also needlessly inefficient.
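Concretely, the needle can be encoded once up front, along these lines (the
helper name is invented):

    import std.algorithm : canFind;
    import std.utf : encode;

    // Search for a dchar in a UTF-8 string without decoding the haystack:
    // encode the needle once, then do a plain substring search.
    bool canFindNoDecode(string haystack, dchar needle)
    {
        char[4] buf;
        immutable len = encode(buf, needle);  // 1 to 4 code units
        return haystack.canFind(buf[0 .. len]);
    }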
 5.
 
 s.count() currently works as expected. Proposed: fails silently.
Wrong. The current behaviour of s.count() does not work as expected; it only
gives an illusion that it does. Its return value is misleading when combining
diacritics and other such Unicode "niceness" are involved. Arguably, such
things should be prohibited altogether, and more semantically transparent
algorithms used, namely s.countCodePoints, s.countGraphemes, etc.
 6.
 
 s.count!((a, b) => std.uni.toLower(a) == std.uni.toLower(b))("é")
 currently works as expected (with the known issues of lowercase
 conversion). Proposed: fails silently.
Again, I don't like this. It sweeps the issues of comparing Unicode strings
under the carpet and gives the programmer a false sense of code correctness.
Users should instead be encouraged to use proper Unicode collation functions
that are actually correct, rather than an illusion of correctness.
 7.
 
 s.count('é') currently works as expected. Proposed: fails silently.
 8.
 
 s.countUntil("a") currently work as expected. Proposed: fails
 silently. This applies to all variations of countUntil.
Whether this is correct or not depends on what the intention is. If you're
looking to slice a string, this most definitely does NOT work as expected. If
you're looking to count graphemes, this doesn't work as expected either. It
only works if you just so happen to be counting code points.

The correct approach, IMO, is to help the user make a conscious choice between
these different semantics:

    s.indexOf("a");                 // for slicing
    s.byCodepoint.countUntil("a");  // count code points
    s.byGrapheme.countUntil("a");   // count graphemes

Things like s.countUntil("a") are misleading and lead to subtle Unicode bugs.
 9.
 
 s.endsWith('é') currently works as expected. Proposed: fails silently.
Arguable, because it imposes a performance hit through needless decoding.
Ideally, you should have 3 overloads:

    bool endsWith(string s, char asciiChar);
    bool endsWith(string s, wchar wideChar);
    bool endsWith(string s, dchar codepoint);

In the wchar and dchar overloads you'd do substring search. There is no need
to decode.
 10.
 
 s.find('é') currently works as expected. Proposed: fails silently.
 This applies to other variations of find that include custom
 predicates.
Not necessarily. Arguably we should be overloading on needle type to eliminate
needless decoding:

    string find(string s, char c);  // ubyte search
    string find(string s, wchar c); // substring search with char[2]
    string find(string s, dchar c); // substring search with char[4]

This makes sense to me because string is immutable(char)[], so from the point
of view of being an array, searching for a wchar is not something that is
obvious (how do you search for a value of type T in an array of elements of
type U?), so explicit overloads for handling those cases make sense. Decoding
every single character in s is a lot of needless work.
 I designed the range behavior of strings after much thinking and
 consideration back in the day when I designed std.algorithm. It was
 painfully obvious (but it seems to have been forgotten now that it's
 working so well) that approaching strings as arrays of char[] would
 break almost every single algorithm leaving us essentially in the
 pre-UTF C++aveman era.
I agree, but it is also painfully obvious that the current implementation is lackluster in terms of performance.
 Making strings bidirectional ranges has been a very good choice
 within the constraints. There was already a string type, and that
 was immutable(char)[], and a bunch of code depended on that
 definition.
 
 Clearly one might argue that their app has no business dealing with
 diacriticals or Asian characters. But that's the typical provincial
 view that marred many languages' approach to UTF and
 internationalization. If you know your string is ASCII, the remedy
 is simple - don't use char[] and friends. From day 1, the type
 "char" was meant to mean "code unit of UTF characters".
Yes, but currently Phobos support for non-UTF strings is rather poor, and requires many explicit casts to/from ubyte[].
 So please ponder the above before going to do surgery on the patient -
 surgery that's going to kill him.
[...] Yeah I was surprised Walter was actually seriously going to pursue this. It's a change of a far vaster magnitude than many of the other DIPs and other proposals that have been rejected because they were deemed to cause too much breakage of existing code. T -- Having a smoking section in a restaurant is like having a peeing section in a swimming pool. -- Edward Burr
Mar 07 2014
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 12:26 PM, H. S. Teoh wrote:
 On Fri, Mar 07, 2014 at 11:57:23AM -0800, Andrei Alexandrescu wrote:
 s.canFind('é') currently works as expected. Proposed: fails silently.
The problem is that the current implementation of this correct behaviour
leaves a lot to be desired in terms of performance. Ideally, you should not
need to decode every single character in s just to see if it happens to
contain 'é'. Rather, canFind, et al should convert the dchar literal 'é' into
a UTF-8 (resp. UTF-16) sequence and do a substring search instead. Decoding
every character in s, while correct, is also needlessly inefficient.
That's an optimization that fits the current design and goes in the library transparently, i.e. the good stuff.
 5.

 s.count() currently works as expected. Proposed: fails silently.
Wrong. The current behaviour of s.count() does not work as expected, it only gives an illusion that it does.
Depends on what one expects :o).
 Its return value is misleading when
 combining diacritics and other such Unicode "niceness" are involved.
 Arguably, such things should be prohibited altogether, and more
 semantically transparent algorithms used, namely s.countCodePoints,
 s.countGraphemes, etc..
I think s.byGrapheme.count is the right way instead of specializing a bunch of algorithms to work with graphemes.
 s.endsWith('é') currently works as expected. Proposed: fails silently.
Arguable, because it imposes a performance hit through needless decoding.
Ideally, you should have 3 overloads:

    bool endsWith(string s, char asciiChar);
    bool endsWith(string s, wchar wideChar);
    bool endsWith(string s, dchar codepoint);
Nice idea. Fits current design. Then interesting complications arise with things like bool endsWith(string, wstring) etc.
 [...]
 I designed the range behavior of strings after much thinking and
 consideration back in the day when I designed std.algorithm. It was
 painfully obvious (but it seems to have been forgotten now that it's
 working so well) that approaching strings as arrays of char[] would
 break almost every single algorithm leaving us essentially in the
 pre-UTF C++aveman era.
I agree, but it is also painfully obvious that the current implementation is lackluster in terms of performance.
It's not painfully obvious to me at all. What is obvious to me is that people
are happy campers with the way D's strings work, including UTF support and
performance. I don't remember people bringing this up in forums and here at
Facebook: "yeah, just look at the crappy way they handle strings..." Silent
approval is easy to forget about.

Walter has been working on an application in which anything slower than 2x
baseline would have been a failure. In that app (which I know very well) the
right option from day 1 would have been ubyte[], which he discovered the hard
way. His incomplete understanding of how D strings work is the single largest
problem there, and indicates an issue with the documentation. He discovered
that, was surprised, and overreacted. No need to amplify that into mass
hysteria.

There are improvements that can be made, in the form of additions, not
breaking changes that would inflict massive breakage on the community. This is
the way in which this discussion can have a positive outcome. (In fact, I've
shared a few ideas with Walter.)
 Clearly one might argue that their app has no business dealing with
 diacriticals or Asian characters. But that's the typical provincial
 view that marred many languages' approach to UTF and
 internationalization. If you know your string is ASCII, the remedy
 is simple - don't use char[] and friends. From day 1, the type
 "char" was meant to mean "code unit of UTF characters".
Yes, but currently Phobos support for non-UTF strings is rather poor, and requires many explicit casts to/from ubyte[].
Non-UTF strings are currently modeled as ubyte[], so I don't see what you'd be casting to and fro. You have absolutely no business representing anything non-UTF with char and char[] etc.
 So please ponder the above before going to do surgery on the patient -
 surgery that's going to kill him.
[...] Yeah I was surprised Walter was actually seriously going to pursue this. It's a change of a far vaster magnitude than many of the other DIPs and other proposals that have been rejected because they were deemed to cause too much breakage of existing code.
Compared with what's going on now with D at Facebook, this agitation is but a little side show. We have way bigger fish to fry. Andrei
Mar 07 2014
prev sibling parent reply "Luís Marques" <luis luismarques.eu> writes:
On Friday, 7 March 2014 at 20:27:38 UTC, H. S. Teoh wrote:
 	s.indexOf("a");			// for slicing
 	s.byCodepoint.countUntil("a");	// count code points
 	s.byGrapheme.countUntil("a");	// count graphemes
(BTW, byGrapheme is currently missing in the std.uni docs)
Mar 08 2014
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/8/2014 9:44 AM, "Luís Marques" <luis luismarques.eu> wrote:
 (BTW, byGrapheme is currently missing in the std.uni docs)
https://github.com/D-Programming-Language/phobos/pull/1985
Mar 08 2014
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu 
wrote:
 Allow me to enumerate the functions of std.algorithm and how 
 they work today and how they'd work with the proposed change. 
 Let s be a variable of some string type.
 s.canFind('é') currently works as expected.
No, it doesn't.

    import std.algorithm;

    void main()
    {
        auto s = "cassé";
        assert(s.canFind('é'));
    }

That's the whole problem - all this hot steam, and it still does not work
properly. Because it can't - not without pulling in all of the Unicode
algorithms implicitly, and that would be much worse.
 I went down std.algorithm in the order listed in its 
 documentation and found pernicious issues with almost every 
 single algorithm.
All of your examples are variations of one and the same case: searching for a non-ASCII dchar or dchar literal. How often does this pattern occur in real programs? I think the only real metric is to try the change and find out.
 Clearly one might argue that their app has no business dealing 
 with diacriticals or Asian characters. But that's the typical 
 provincial view that marred many languages' approach to UTF and 
 internationalization.
So is yours, if you think that making everything magically a dchar is going to solve all problems. The TDPL example only showcases the problem. Yes, it works with Swedish. Now try it again with Sanskrit.
Mar 07 2014
next sibling parent reply "Eyrk" <eyrk hotmail.com> writes:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
     auto s = "cassé";
     assert(s.canFind('é'));
 }
Hm, I'm not following? Works perfectly fine on my system?
Mar 07 2014
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev 
 wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Mar 07 2014
next sibling parent reply "TC" <chalucha gmail.com> writes:
 Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Used a hex view on the referenced file, and it does not seem to be the same symbol. Works for me with the same ones.
Mar 07 2014
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 22:16:58 UTC, TC wrote:
 Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Used hex view on referenced file and it does not seem to be the same symbol.
Define "symbol". :)
Mar 07 2014
parent "TC" <chalucha gmail.com> writes:
On Friday, 7 March 2014 at 22:18:17 UTC, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 22:16:58 UTC, TC wrote:
 Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Used hex view on referenced file and it does not seem to be the same symbol.
Define "symbol". :)
"cassé" - 22 63 61 73 73 65 cc 81 22 vs 'é' - 27 c3 a9 27
Mar 07 2014
prev sibling next sibling parent reply "Eyrk" <eyrk hotmail.com> writes:
On Friday, 7 March 2014 at 21:58:40 UTC, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev 
 wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
   auto s = "cassé";
   assert(s.canFind('é'));
 }
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
ah right, missing normalization, I get your point, thanks.
Mar 07 2014
parent "TC" <chalucha gmail.com> writes:
 ah right, missing normalization, I get your point, thanks.
Oops :)
Mar 07 2014
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Mar 07, 2014 at 09:58:39PM +0000, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
No, it doesn't.

import std.algorithm;

void main()
{
   auto s = "cassé";
   assert(s.canFind('é'));
}
Hm, I'm not following? Works perfectly fine on my system?
Probably because your browser is normalizing the unicode string when you copy-n-paste Vladimir's message? See below:
 Something's messing with your Unicode. Try downloading and compiling
 this file:
 http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
I downloaded the file and looked at it through `od -ctx1`: the first é is encoded as the byte sequence 65 cc 81, that is, [U+65, U+301] (small letter e + combining diacritic acute accent), whereas the second é is encoded as c3 a9, that is, U+E9 (precomposed small letter e with acute accent).

This illustrates one of my objections to Andrei's post: by auto-decoding behind the user's back and hiding the intricacies of unicode from him, it has masked the fact that codepoint-for-codepoint comparison of a unicode string is not guaranteed to always return the correct results, due to the possibility of non-normalized strings.

Basically, to have correct behaviour in all cases, the user must be aware of, and use, the Unicode collation / normalization algorithms prescribed by the Unicode standard. What we have in std.algorithm right now is an incomplete implementation with non-working edge cases (like Vladimir's example) that has poor performance to start with. Its only redeeming factor is that the auto-decoding hack has given it the illusion of being correct, when actually it's not correct according to the Unicode standard. I don't see how this is necessarily superior to Walter's proposal.


T

-- 
Just because you survived after you did it, doesn't mean it wasn't stupid!
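For reference, a minimal sketch of the prescribed remedy, assuming std.uni.normalize (NFC by default) is applied so both sides agree on a form before comparing:

import std.algorithm : canFind;
import std.uni : normalize;

void main()
{
    string s = "casse\u0301";               // 'e' + U+0301, as in test.d above
    assert(!s.canFind('\u00E9'));           // precomposed 'é' is not found as-is
    assert(s.normalize.canFind('\u00E9'));  // NFC composes e + U+0301 into U+00E9
}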
Mar 07 2014
next sibling parent "Eyrk" <eyrk hotmail.com> writes:
On Friday, 7 March 2014 at 22:27:35 UTC, H. S. Teoh wrote:
 This illustrates one of my objections to Andrei's post: by 
 auto-decoding
 behind the user's back and hiding the intricacies of unicode 
 from him,
 it has masked the fact that codepoint-for-codepoint comparison 
 of a
 unicode string is not guaranteed to always return the correct 
 results,
 due to the possibility of non-normalized strings.

 Basically, to have correct behaviour in all cases, the user 
 must be
 aware of, and use, the Unicode collation / normalization 
 algorithms
 prescribed by the Unicode standard. What we have in 
 std.algorithm right
 now is an incomplete implementation with non-working edge cases 
 (like
 Vladimir's example) that has poor performance to start with. 
 Its only
 redeeming factor is that the auto-decoding hack has given it the
 illusion of being correct, when actually it's not correct 
 according to
 the Unicode standard. I don't see how this is necessarily 
 superior to
 Walter's proposal.


 T
Yes, I realised too late. Would it not be beneficial to have different types of literals - one type which is implicitly normalized, and one which is "raw" (like today)? Since typically you'd want to normalize most string literals at compile time, you'd then only have to normalize external input at run time.
Mar 07 2014
prev sibling next sibling parent "TC" <chalucha gmail.com> writes:
 Probably because your browser is normalizing the unicode string 
 when you
 copy-n-paste Vladimir's message? See below:
I tried it in C# and it works like this:

using System;
using System.Diagnostics;

namespace Test
{
    class Program
    {
        static void Main()
        {
            var s = "cassé";
            Debug.Assert(s.IndexOf('é') < 0);
            s = s.Normalize();
            Debug.Assert(s.IndexOf('é') == 4);
        }
    }
}

So it doesn't work by default there either - Normalize has to be used.
Mar 07 2014
prev sibling next sibling parent "Brad Anderson" <eco gnuk.net> writes:
On Friday, 7 March 2014 at 22:27:35 UTC, H. S. Teoh wrote:
 On Fri, Mar 07, 2014 at 09:58:39PM +0000, Vladimir Panteleev 
 wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev 
wrote:
No, it doesn't.

import std.algorithm;

void main()
{
   auto s = "cassé";
   assert(s.canFind('é'));
}
Hm, I'm not following? Works perfectly fine on my system?
Probably because your browser is normalizing the unicode string when you copy-n-paste Vladimir's message? See below:
 Something's messing with your Unicode. Try downloading and 
 compiling
 this file:
 http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
I downloaded the file and looked at it through `od -ctx1`: the first é is encoded as the byte sequence 65 cc 81, that is, [U+65, U+301] (small letter e + combining diacritic acute accent), whereas the second é is encoded as c3 a9, that is, U+E9 (precomposed small letter e with acute accent). This illustrates one of my objections to Andrei's post: by auto-decoding behind the user's back and hiding the intricacies of unicode from him, it has masked the fact that codepoint-for-codepoint comparison of a unicode string is not guaranteed to always return the correct results, due to the possibility of non-normalized strings. Basically, to have correct behaviour in all cases, the user must be aware of, and use, the Unicode collation / normalization algorithms prescribed by the Unicode standard. What we have in std.algorithm right now is an incomplete implementation with non-working edge cases (like Vladimir's example) that has poor performance to start with. Its only redeeming factor is that the auto-decoding hack has given it the illusion of being correct, when actually it's not correct according to the Unicode standard. I don't see how this is necessarily superior to Walter's proposal. T
To me, the status quo feels like an ok compromise between performance and correctness. Everyone is pointing out that working at the code point level is bad because it's not correct, but working at the code unit level as Walter proposes is correct even less often, so that's not really an argument for moving to that. It is, however, an argument for forcing the user to decide what level of correctness and performance they need.

Walter's idea (code unit level) would be fastest but least correct. The current approach is somewhat fast and somewhat correct. The next level, graphemes, would be slowest of all but most correct. It seems like there is just no way to avoid the tradeoff between speed and correctness, so we shouldn't try - only try to force the user to make a decision.

Maybe some more string types are in order (hrm). In order of performance to correctness:

string, wstring (code units)
dstring (code points)
+gstring (graphemes) (do graphemes completely normalize? If not, we probably need another level, say, nstring)

Then if a user needs correctness over performance they just work with gstrings. If they need performance over correctness they work with strings (assuming some of Walter's idea happens, otherwise they'd work with string.representation).
Mar 07 2014
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 2:26 PM, H. S. Teoh wrote:
 This illustrates one of my objections to Andrei's post: by auto-decoding
 behind the user's back and hiding the intricacies of unicode from him,
 it has masked the fact that codepoint-for-codepoint comparison of a
 unicode string is not guaranteed to always return the correct results,
 due to the possibility of non-normalized strings.

 Basically, to have correct behaviour in all cases, the user must be
 aware of, and use, the Unicode collation / normalization algorithms
 prescribed by the Unicode standard.
Which is a reasonable thing to ask for. Andrei
Mar 07 2014
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Yup, the grapheme issue. This should work.

import std.algorithm, std.uni;

void main()
{
    auto s = "cassé";
    assert(s.byGrapheme.canFind('é'));
}

It doesn't compile, seems like a library bug.

Graphemes are the next level of Nirvana above code points, but that doesn't mean it's graphemes or nothing.


Andrei
Mar 07 2014
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 01:23:27 UTC, Andrei Alexandrescu 
wrote:
 Yup, the grapheme issue. This should work.
No. It does not work, because grapheme segmentation is not the same as normalization. Even if you fix the code (it should be: assert(s.byGrapheme.canFind!"a[] == b"("é"))), it will not work, because byGrapheme does not normalize (and not all graphemes can be normalized to a single code point anyway). And there is more than one type of normalization - you need to choose one depending on what you're trying to achieve.
 Graphemes are the next level of Nirvana above code points, but 
 that doesn't mean it's graphemes or nothing.
It's not about types, it's about algorithms. It's never "X or nothing" - unless X is "actually understanding Unicode". Everything else is a compromise. Compromises are acceptable, but not when they are built into the language as the standard way of working with text, thus hiding the problems that come with them.
Mar 07 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Vladimir Panteleev:

 It's not about types, it's about algorithms.
Given sufficiently refined types, it can be about types :-) Bye, bearophile
Mar 07 2014
parent "Eyrk" <eyrk hotmail.com> writes:
On Saturday, 8 March 2014 at 02:04:12 UTC, bearophile wrote:
 Vladimir Panteleev:

 It's not about types, it's about algorithms.
Given sufficiently refined types, it can be about types :-) Bye, bearophile
I think Bear is onto something - we already solved an analogous problem in an elegant way; see SortedRange with assumeSorted etc.

But for this to be convenient to use, I still think we should expand the current 'String Literal Postfix' types to include both normalization and graphemes.

Postfix   Type                  Aka
c         immutable(char)[]     string
w         immutable(wchar)[]    wstring
d         immutable(dchar)[]    dstring
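For reference, the SortedRange precedent in action - a minimal sketch of the existing facility that a hypothetical "assumeNormalized" counterpart for strings could mirror:

import std.range : assumeSorted;

void main()
{
    auto data = [1, 3, 5, 7];
    auto sr = data.assumeSorted; // caller vouches for the invariant; no check is done
    assert(sr.contains(5));      // algorithms can now rely on it (binary search here)
}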
Mar 08 2014
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Mar-2014 05:23, Andrei Alexandrescu writes:
 On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Yup, the grapheme issue. This should work. import std.algorithm, std.uni; void main() { auto s = "cassé"; assert(s.byGrapheme.canFind('é')); } It doesn't compile, seems like a library bug.
Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings.
 Graphemes are the next level of Nirvana above code points, but that
 doesn't mean it's graphemes or nothing.


 Andrei
-- Dmitry Olshansky
Mar 08 2014
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Mar-2014 12:09, Dmitry Olshansky writes:
 08-Mar-2014 05:23, Andrei Alexandrescu writes:
 On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Yup, the grapheme issue. This should work. import std.algorithm, std.uni; void main() { auto s = "cassé"; assert(s.byGrapheme.canFind('é')); } It doesn't compile, seems like a library bug.
 Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings.
 Graphemes are the next level of Nirvana above code points, but that
 doesn't mean it's graphemes or nothing.
Plus it won't help matters - you need both "é" and "cassé" to have the same normalization.

-- 
Dmitry Olshansky
Mar 08 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 12:14 AM, Dmitry Olshansky wrote:
 08-Mar-2014 12:09, Dmitry Olshansky writes:
 08-Mar-2014 05:23, Andrei Alexandrescu writes:
 On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Yup, the grapheme issue. This should work. import std.algorithm, std.uni; void main() { auto s = "cassé"; assert(s.byGrapheme.canFind('é')); } It doesn't compile, seems like a library bug.
 Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings.
 Graphemes are the next level of Nirvana above code points, but that
 doesn't mean it's graphemes or nothing.
 Plus it won't help matters - you need both "é" and "cassé" to have the same normalization.
Why? Couldn't the grapheme compare true with the character? I.e. the byGrapheme iteration normalizes on the fly. Andrei
Mar 08 2014
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 15:33:34 UTC, Andrei Alexandrescu 
wrote:
 Why? Couldn't the grapheme compare true with the character? 
 I.e. the byGrapheme iteration normalizes on the fly.
Grapheme segmentation and normalization are distinct Unicode algorithms: http://www.unicode.org/reports/tr15/ http://www.unicode.org/reports/tr29/ There are also several normalization algorithms. http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
Mar 08 2014
parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 8 March 2014 at 16:00:38 UTC, Vladimir Panteleev 
wrote:
 On Saturday, 8 March 2014 at 15:33:34 UTC, Andrei Alexandrescu 
 wrote:
 Why? Couldn't the grapheme compare true with the character? 
 I.e. the byGrapheme iteration normalizes on the fly.
Grapheme segmentation and normalization are distinct Unicode algorithms: http://www.unicode.org/reports/tr15/ http://www.unicode.org/reports/tr29/ There are also several normalization algorithms. http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
How about this?

s.normalize!NFKD

To return a range of normalized code points? Clearly, no definition of string can handle this natively. As you say, there are multiple algorithms, so there is no one 'right' answer. byGrapheme is useful, but doesn't and cannot solve the normalization issue.

I feel this discussion is tangential to the main debate: whether strings should be ranges of code points or code units. By code unit is faster by default, and simpler to implement in Phobos (no more special code). By code point works better when searching for individual code points, but as you rightly point out this might not be useful in practice, as you rarely search for individual non-ASCII code points, and it isn't a complete solution anyway because of normalization.

There's a few problems with by code unit:

1. Searching string/wstring for dchar fails silently. You have suggested making this a compilation error, but Andrei argues this would break lots of code. You counter that it's possible that people rarely search for dchar anyway, so it may not matter.

2. It's a fundamental change. Regardless of which is better, we need to consider the impact of such a change.

3. Ranges of code units are random access and sliceable, which means they will be accepted by algorithms such as sort, which will just produce garbage strings (see the sketch below). Maybe this isn't an issue.
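Problem 3 can already be demonstrated today through .representation, which exposes the code-unit view that would become the default - a minimal sketch:

import std.algorithm : sort;
import std.exception : assertThrown;
import std.string : representation;
import std.utf : validate;

void main()
{
    ubyte[] units = "h\u00E9llo".representation.dup; // mutable code-unit view
    sort(units);                     // reorders code units with no regard for UTF-8
    auto garbled = cast(string) units.idup;
    assertThrown(validate(garbled)); // the result is no longer valid UTF-8
}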
Mar 08 2014
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Mar-2014 19:33, Andrei Alexandrescu writes:
 On 3/8/14, 12:14 AM, Dmitry Olshansky wrote:
 08-Mar-2014 12:09, Dmitry Olshansky writes:
 08-Mar-2014 05:23, Andrei Alexandrescu writes:
 On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Yup, the grapheme issue. This should work. import std.algorithm, std.uni; void main() { auto s = "cassé"; assert(s.byGrapheme.canFind('é')); } It doesn't compile, seems like a library bug.
 Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings.
 Graphemes are the next level of Nirvana above code points, but that
 doesn't mean it's graphemes or nothing.
 Plus it won't help matters - you need both "é" and "cassé" to have the same normalization.
Why? Couldn't the grapheme compare true with the character?
Iff it consists of one codepoint, it technically may.
 I.e. the
 byGrapheme iteration normalizes on the fly.
Oh crap, please no. It's not only _Slow_ but also horribly complicated (even in the off-line, eager version). Plus there are 4 normalizations, of which 2 are lossy. You simply can't be serious on this one - though seeing that you introduced auto-decoding, by extension you must have proposed to normalize on the fly :)
Mar 08 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 8:08 AM, Dmitry Olshansky wrote:
 08-Mar-2014 19:33, Andrei Alexandrescu writes:
 I.e. the
 byGrapheme iteration normalizes on the fly.
Oh crap, please no. It's not only _Slow_ but it's also horribly complicated (even in off-line, eager version). + there are 4 normalizations, of which 2 are lossy. You simply can't be serious on this one, though seeing that you introduced auto-decoding then by extension you must have proposed to normalize on the fly :)
Yah, just pushing my luck :o). I don't know much about graphemes and normalization, so leaving that stuff to you guys. Andrei
Mar 08 2014
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 12:09 AM, Dmitry Olshansky wrote:
 08-Mar-2014 05:23, Andrei Alexandrescu writes:
 On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Yup, the grapheme issue. This should work. import std.algorithm, std.uni; void main() { auto s = "cassé"; assert(s.byGrapheme.canFind('é')); } It doesn't compile, seems like a library bug.
 Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings.
Yah but I think they should support comparison with individual characters. No? Andrei
Mar 08 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Mar-2014 19:32, Andrei Alexandrescu writes:
 On 3/8/14, 12:09 AM, Dmitry Olshansky wrote:
 08-Mar-2014 05:23, Andrei Alexandrescu writes:
 On 3/7/14, 1:58 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 21:56:45 UTC, Eyrk wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 No, it doesn't.

 import std.algorithm;

 void main()
 {
    auto s = "cassé";
    assert(s.canFind('é'));
 }
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and compiling this file: http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Yup, the grapheme issue. This should work. import std.algorithm, std.uni; void main() { auto s = "cassé"; assert(s.byGrapheme.canFind('é')); } It doesn't compile, seems like a library bug.
 Because Graphemes do not auto-magically convert to dchar and back? After all they are just small strings.
Yah but I think they should support comparison with individual characters. No?
We could add one. I don't think the Grapheme interface is optimal or set in stone. The following should work as is, though:

s.byGrapheme.canFind(Grapheme("é"))
 Andrei
-- Dmitry Olshansky
Mar 08 2014
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 15:56:08 UTC, Dmitry Olshansky wrote:
 The following should work as is though:

 s.byGrapheme.canFind(Grapheme("é"))
Doesn't work here. Not sure why.

Grapheme(1000065, 3, 0, 33554432, [101, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 2) // last byGrapheme

vs.

Grapheme(E9, 0, 0, 16777216, [233, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 1) // Grapheme("é")
Mar 08 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Mar-2014 20:43, Vladimir Panteleev writes:
 On Saturday, 8 March 2014 at 15:56:08 UTC, Dmitry Olshansky wrote:
 The following should work as is though:

 s.byGrapheme.canFind(Grapheme("é"))
Doesn't work here. Not sure why.

Grapheme(1000065, 3, 0, 33554432, [101, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 2) // last byGrapheme

vs.

Grapheme(E9, 0, 0, 16777216, [233, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 1) // Grapheme("é")
Sounds like a bug - file it before we get derailed.

-- 
Dmitry Olshansky
Mar 08 2014
parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 22:42:20 UTC, Dmitry Olshansky wrote:
 Sounds like a bug - file it before we get derailed.
https://d.puremagic.com/issues/show_bug.cgi?id=12324
Mar 08 2014
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/8/2014 12:09 AM, Dmitry Olshansky wrote:
 Because Graphemes do not auto-magically convert to dchar and back? After all
 they are just small strings.
std.uni.Grapheme is a struct, and that struct contains a string of arbitrary length. I don't know if that is the right design or not, or if a Grapheme should instead be an alias for a slice (rather than be a distinct type). Graphemes do not appear to have a 1:1 mapping with dchars, and any attempt to do so would likely be a giant mistake.
Mar 08 2014
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Mar-2014 01:15, Walter Bright writes:
 On 3/8/2014 12:09 AM, Dmitry Olshansky wrote:
 Because Graphemes do not auto-magically convert to dchar and back?
 After all
 they are just small strings.
std.uni.Grapheme is a struct, and that struct contains a string of arbitrary length. I don't know if that is the right design or not, or if a Grapheme should instead be an alias for a slice (rather than be a distinct type).
They use small-string optimization with great success, as indeed plenty of graphemes are just 1 codepoint. Many others are just a couple.
 Graphemes do not appear to have a 1:1 mapping with dchars, and any
 attempt to do so would likely be a giant mistake.
-- Dmitry Olshansky
Mar 08 2014
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 1:15 PM, Walter Bright wrote:
 On 3/8/2014 12:09 AM, Dmitry Olshansky wrote:
 Because Graphemes do not auto-magically convert to dchar and back?
 After all
 they are just small strings.
std.uni.Grapheme is a struct, and that struct contains a string of arbitrary length. I don't know if that is the right design or not, or if a Grapheme should instead be an alias for a slice (rather than be a distinct type).
I think basic encapsulation suggests Grapheme should be a distinct type. It's a restricted slice, not just any slice.
 Graphemes do not appear to have a 1:1 mapping with dchars, and any
 attempt to do so would likely be a giant mistake.
I think they may be comparable to dchar. Andrei
Mar 08 2014
parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-08 23:50:43 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 Graphemes do not appear to have a 1:1 mapping with dchars, and any
 attempt to do so would likely be a giant mistake.
I think they may be comparable to dchar.
Dchars, aka code points, are much more clearly defined than graphemes. A quick search shows me there's more than one way to segment a string into graphemes: there are the legacy and extended boundary algorithms for general processing, and then there are some tailored algorithms that can segment code points differently depending on the locale, or other considerations.

Reference: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

There are three examples of locale-specific graphemes in the table in the section linked above. "Ch" is one of them. Quoting Wikipedia: "Ch is a digraph in the Latin script. It is treated as a letter of its own in Chamorro, Czech, Slovak, Igbo, Quechua, Guarani, Welsh, Cornish, Breton and Belarusian Łacinka alphabets." https://en.wikipedia.org/wiki/Ch_(digraph)

Also, there are some code points that represent ligatures (such as "fl"), which are in theory two graphemes. I'm not sure what the general algorithm does with that, but depending on what you're doing (counting characters? spell checking?) you might want to split it in two.

So basically you just can't make an algorithm capable of counting letters/graphemes/characters in a universal fashion. There's no such thing as a universal grapheme segmentation algorithm, even though there is a general one. It'd be wise for any API to expose this subtlety whenever segmenting graphemes.

Text is an interesting topic for never-ending discussions.

-- 
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca
Mar 08 2014
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/8/2014 9:15 PM, Michel Fortin wrote:
 Text is an interesting topic for never-ending discussions.
It's also a good example of when non-programmers are surprised to hear that I *don't* see the world as binary "black and white" *because* of my programming experience ;) Problems like text handling make it [painfully] obvious to programmers that reality is shades-of-grey - laymen don't often expect that!
Mar 09 2014
prev sibling next sibling parent reply "Sarath Kodali" <sarath dummy.com> writes:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu 
 wrote:
 Allow me to enumerate the functions of std.algorithm and how 
 they work today and how they'd work with the proposed change. 
 Let s be a variable of some string type.
 s.canFind('é') currently works as expected.
No, it doesn't.

import std.algorithm;

void main()
{
    auto s = "cassé";
    assert(s.canFind('é'));
}

That's the whole problem - all this hot steam and it still does not work properly. Because it can't - not without pulling in all of the Unicode algorithms implicitly, and that would be much worse.
 I went down std.algorithm in the order listed in its 
 documentation and found pernicious issues with almost every 
 single algorithm.
All of your examples are variations of one and the same case: searching for a non-ASCII dchar or dchar literal. How often does this pattern occur in real programs? I think the only real metric is to try the change and find out.
 Clearly one might argue that their app has no business dealing 
 with diacriticals or Asian characters. But that's the typical 
 provincial view that marred many languages' approach to UTF 
 and internationalization.
So is yours, if you think that making everything magically a dchar is going to solve all problems. The TDPL example only showcases the problem. Yes, it works with Swedish. Now try it again with Sanskrit.
+1

In Indian languages, a character consists of one or more Unicode code points. For example, the Sanskrit "ddhrya" (http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg) consists of 7 Unicode code points. So to search for this char I have to use string search.

- Sarath
Mar 07 2014
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Mar 07, 2014 at 10:35:46PM +0000, Sarath Kodali wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu
wrote:
[...]
Clearly one might argue that their app has no business dealing
with diacriticals or Asian characters. But that's the typical
provincial view that marred many languages' approach to UTF and
internationalization.
So is yours, if you think that making everything magically a dchar is going to solve all problems. The TDPL example only showcases the problem. Yes, it works with Swedish. Now try it again with Sanskrit.
+1 In Indian languages, a character consists of one or more UNICODE code points. For example, in Sanskrit "ddhrya" http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg consists of 7 UNICODE code points. So to search for this char I have to use string search.
That's what I've been arguing for. The most general form of character searching in Unicode requires substring searching, and similarly many character-based operations on Unicode strings are effectively substring-based operations, because said "character" may be a multibyte code point, or, in your case, multiple code points.

Since that's the case, we might as well just forget about the distinction between "character" and "string", and treat all such operations as substring operations (even if the operand is supposedly "just 1 character long"). This would allow us to get rid of the hackish auto-decoding of narrow strings, and thus eliminate the needless overhead of always decoding.


T

-- 
All men are mortal. Socrates is mortal. Therefore all men are Socrates.
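A minimal sketch of that substring-first style on plain code units (assuming, for the moment, that needle and haystack are in the same normalization form):

import std.algorithm : canFind, find;

void main()
{
    string s = "cass\u00E9";            // precomposed é
    assert(s.canFind("\u00E9"));        // 'character' search as substring search
    assert(s.find("ss") == "ss\u00E9"); // ordinary code-unit substring operations
}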
Mar 07 2014
parent reply "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Friday, 7 March 2014 at 23:13:50 UTC, H. S. Teoh wrote:
 On Fri, Mar 07, 2014 at 10:35:46PM +0000, Sarath Kodali wrote:
 On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev 
 wrote:
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu
wrote:
[...]
Clearly one might argue that their app has no business 
dealing
with diacriticals or Asian characters. But that's the typical
provincial view that marred many languages' approach to UTF 
and
internationalization.
So is yours, if you think that making everything magically a dchar is going to solve all problems. The TDPL example only showcases the problem. Yes, it works with Swedish. Now try it again with Sanskrit.
+1 In Indian languages, a character consists of one or more UNICODE code points. For example, in Sanskrit "ddhrya" http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg consists of 7 UNICODE code points. So to search for this char I have to use string search.
[...] That's what I've been arguing for. The most general form of character searching in Unicode requires substring searching, and similarly many character-based operations on Unicode strings are effectively substring-based operations, because said "character" may be a multibyte code point, or, in your case, multiple code points. Since that's the case, we might as well just forget about the distinction between "character" and "string", and treat all such operations as substring operations (even if the operand is supposedly "just 1 character long"). This would allow us to get rid of the hackish auto-decoding of narrow strings, and thus eliminate the needless overhead of always decoding.
That won't work, because your needle might be in a different normalization form than your haystack, thus a byte-by-byte comparison will not be able to find it.
Mar 09 2014
parent Michel Fortin <michel.fortin michelf.ca> writes:
prev sibling parent reply "Sarath Kodali" <sarath dummy.com> writes:
On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:
 +1
 In Indian languages, a character consists of one or more 
 UNICODE code points. For example, in Sanskrit "ddhrya" 
 http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg 
 consists of 7 UNICODE code points. So to search for this char I 
 have to use string search.

 - Sarath
Oops, incomplete reply ...

Since a single "alphabet" in Indian languages can contain multiple code points, iterating over single code points is like iterating over char[] for non-English European languages. So decode is of no use other than decreasing performance. A raw char[] comparison is much faster. And then there is this "Unicode normalization" that makes string searches and comparisons very difficult.

- Sarath
Mar 07 2014
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Mar 07, 2014 at 11:13:50PM +0000, Sarath Kodali wrote:
 On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:
+1
In Indian languages, a character consists of one or more UNICODE
code points. For example, in Sanskrit "ddhrya"
http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
consists of 7 UNICODE code points. So to search for this char I
have to use string search.

- Sarath
Oops, incomplete reply ... Since a single "alphabet" in Indian languages can contain multiple code-points, iterating over single code-points is like iterating over char[] for non English European languages. So decode is of no use other than decreasing the performance. A raw char[] comparison is much faster.
Yes. The more I think about it, the more auto-decoding sounds like a wrong decision. The question, though, is whether it's worth the massive code breakage needed to undo it. :-(
 And then there is this "unicode normalization" that makes it very
 difficult for string searches or comparisons.
[...]

I believe the convention is to always normalize strings before performing operations on them, in order to prevent these sorts of problems. I think many of the Unicode-prescribed algorithms have normalization as a prerequisite, since otherwise there's no guarantee that the algorithm will produce the correct results.


T

-- 
"I'm not childish; I'm just in touch with the child within!" - RL
Mar 07 2014
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/7/2014 6:33 PM, H. S. Teoh wrote:
 On Fri, Mar 07, 2014 at 11:13:50PM +0000, Sarath Kodali wrote:
 On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:
 +1
 In Indian languages, a character consists of one or more UNICODE
 code points. For example, in Sanskrit "ddhrya"
 http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
 consists of 7 UNICODE code points. So to search for this char I
 have to use string search.

 - Sarath
Oops, incomplete reply ... Since a single "alphabet" in Indian languages can contain multiple code-points, iterating over single code-points is like iterating over char[] for non English European languages. So decode is of no use other than decreasing the performance. A raw char[] comparison is much faster.
Yes. The more I think about it, the more auto-decoding sounds like a wrong decision. The question, though, is whether it's worth the massive code breakage needed to undo it. :-(
I'm leaning the same way too. But I also think Andrei is right that, at this point in time, it'd be a terrible move to change things so that "by code unit" is default. For better or worse, that ship has sailed.

Perhaps we *can* deal with the auto-decoding problem not by killing auto-decoding, but by marginalizing it in an additive way:

Convincing arguments have been made that any string-processing code which *isn't* done entirely with the official Unicode algorithms is likely wrong *regardless* of whether std.algorithm defaults to per-code-unit or per-code-point.

So...How's this?: We add any of these Unicode algorithms we may be missing, encourage their use for strings, discourage use of std.algorithm for string processing, and in the meantime, just do our best to reduce unnecessary decoding wherever possible. Then we call it a day and all be happy :)
Mar 09 2014
parent reply "w0rp" <devw0rp gmail.com> writes:
On Sunday, 9 March 2014 at 09:24:02 UTC, Nick Sabalausky wrote:
 I'm leaning the same way too. But I also think Andrei is right 
 that, at this point in time, it'd be a terrible move to change 
 things so that "by code unit" is default. For better or worse, 
 that ship has sailed.

 Perhaps we *can* deal with the auto-decoding problem not by 
 killing auto-decoding, but by marginalizing it in an additive 
 way:

 Convincing arguments have been made that any string-processing 
 code which *isn't* done entirely with the official Unicode 
 algorithms is likely wrong *regardless* of whether 
 std.algorithm defaults to per-code-unit or per-code-point.

 So...How's this?: We add any of these Unicode algorithms we may 
 be missing, encourage their use for strings, discourage use of 
 std.algorithm for string processing, and in the meantime, just 
 do our best to reduce unnecessary decoding wherever possible. 
 Then we call it a day and all be happy :)
I've been watching this discussion for the last few days, and I'm kind of a nobody jumping in pretty late, but I think after thinking about the problem for a while I would agree on a solution along the lines of what you have suggested.

I think Vladimir is definitely right when he's saying that when you have algorithms that deal with natural languages, simply working on the basis of a code unit isn't enough. I think it is also true that you need to select a particular algorithm for dealing with strings of characters, as there are many different algorithms you can use for different languages which behave differently, perhaps several in a single language. I also think Andrei is right when he is saying we need to minimise code breakage, and that the string decoding and encoding by default isn't the biggest of performance problems.

I think our best option is to offer a function which creates a range in std.array for getting a range over raw character data, without decoding to code points.

myArray.someAlgorithm; // std.array .front used today with decode calls
myArray.rawData.someAlgorithm; // New range which doesn't decode.

Then we could look at creating algorithms for string processing which don't use the existing dchar abstraction.

myArray.rawData.byNaturalSymbol!SomeIndianEncodingHere; // Range of strings, maybe range of range of characters, not dchars

Or even specialise the new algorithm so it looks for arrays and turns them into the ranges for you via the transformation myArray -> myArray.rawData.

myArray.byNaturalSymbol!SomeIndianEncodingHere;

Honestly, I'd leave the details of such an algorithm to Vladimir and not myself, because he's spent far more time looking into Unicode processing than myself. My knowledge of Unicode pretty much just comes from having to deal with foreign language customers and discovering the problems with the code unit abstraction most languages seem to use. (Java and Python suffer from similar issues, but they don't really have algorithms in the way that we do.)

This new set of algorithms taking settings for different encodings could be first implemented in a third party library, tested there, and eventually submitted to Phobos, probably in std.string.

There's my input, I'll duck before I'm beheaded.
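For what it's worth, a sketch of what that rawData hook might look like - the name and shape are hypothetical, as proposed above, and std.string.representation already provides essentially this view today:

import std.string : representation;
import std.traits : isSomeString;

// Hypothetical adapter from the proposal above: expose code units, no decoding.
auto rawData(S)(S str) if (isSomeString!S)
{
    return str.representation; // ubyte[]/ushort[]/uint[] view matching the encoding
}

void main()
{
    import std.algorithm : count;
    // Count ASCII code units without ever decoding:
    assert("cass\u00E9".rawData.count!(u => u < 0x80) == 4);
}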
Mar 09 2014
parent reply Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/9/2014 7:47 AM, w0rp wrote:
 My knowledge of Unicode pretty much just comes from having
 to deal with foreign language customers and discovering the problems
 with the code unit abstraction most languages seem to use. (Java and
 Python suffer from similar issues, but they don't really have algorithms
 in the way that we do.)
Python 2 or 3 (out of curiosity)? If you're including Python3, then that somewhat surprises me as I thought greatly improved Unicode was one of the biggest reasons for the jump from 2 to 3. (Although it isn't *completely* surprising since, as we all know far too well here, fully correct Unicode is *not* easy.)
Mar 09 2014
parent "w0rp" <devw0rp gmail.com> writes:
On Sunday, 9 March 2014 at 21:38:06 UTC, Nick Sabalausky wrote:
 On 3/9/2014 7:47 AM, w0rp wrote:
 My knowledge of Unicode pretty much just comes from having
 to deal with foreign language customers and discovering the 
 problems
 with the code unit abstraction most languages seem to use. 
 (Java and
 Python suffer from similar issues, but they don't really have 
 algorithms
 in the way that we do.)
Python 2 or 3 (out of curiosity)? If you're including Python3, then that somewhat surprises me as I thought greatly improved Unicode was one of the biggest reasons for the jump from 2 to 3. (Although it isn't *completely* surprising since, as we all know far too well here, fully correct Unicode is *not* easy.)
Late reply here. Python 3 is a lot better in terms of Unicode support than 2. The situation in Python 2 was this:

1. The default string type is 'str', an immutable array of bytes.
2. 'str' could be one of many encodings, including UTF-16, etc.
3. There is an extra 'unicode' type for when you want a Unicode string.
4. Python implicitly converts between the two, often in wrong ways, often causing exceptions to appear where you didn't expect them to.

In 3, this changed to...

1. The default string type is still named 'str', only now it's like the 'unicode' of olde.
2. 'bytes' is a new immutable array of bytes type like the Python 2 'str'.
3. Conversion between 'str' and 'bytes' is always explicit.

However, Python 3 works on a code point level, probably some code unit level in fact, and you don't see very many algorithms which take, say, combining characters into account. So Python suffers from similar issues.
Mar 11 2014
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 12:43 PM, Vladimir Panteleev wrote:
 On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu wrote:
 Allow me to enumerate the functions of std.algorithm and how they work
 today and how they'd work with the proposed change. Let s be a
 variable of some string type.
 s.canFind('é') currently works as expected.
 No, it doesn't.

 import std.algorithm;

 void main()
 {
     auto s = "cassé";
     assert(s.canFind('é'));
 }
worksforme
Mar 07 2014
parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 00:44:53 UTC, Andrei Alexandrescu 
wrote:
 worksforme
http://forum.dlang.org/post/fhqradggtvwnpqpuehgg forum.dlang.org
Mar 07 2014
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
07-Mar-2014 23:57, Andrei Alexandrescu writes:
 On 3/6/14, 6:37 PM, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up about the
 automatic encoding and decoding of char ranges.
[snip]
 Is there any hope of fixing this?
There's nothing to fix.
There is, all right. ElementEncodingType for starters.
 Allow me to enumerate the functions of std.algorithm and how they work
 today and how they'd work with the proposed change. Let s be a variable
 of some string type.
Special case was wrong though - special casing arrays of char[] and throwing all other ranges of char out the window. The amount of code to support this schizophrenia is enormous.
 Making strings bidirectional ranges has been a very good choice within
 the constraints. There was already a string type, and that was
 immutable(char)[], and a bunch of code depended on that definition.
Trying to make it work by blowing a hole in the generic range concept now seems like it wasn't worth it. -- Dmitry Olshansky
Mar 07 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 12:48 PM, Dmitry Olshansky wrote:
 07-Mar-2014 23:57, Andrei Alexandrescu writes:
 On 3/6/14, 6:37 PM, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up about the
 automatic encoding and decoding of char ranges.
[snip]
 Is there any hope of fixing this?
There's nothing to fix.
There is, all right. ElementEncodingType for starters.
 Allow me to enumerate the functions of std.algorithm and how they work
 today and how they'd work with the proposed change. Let s be a variable
 of some string type.
Special case was wrong though - special casing arrays of char[] and throwing all other ranges of char out the window. The amount of code to support this schizophrenia is enormous.
I think this is a confusion. The code in e.g. std.algorithm is specialized for efficiency of stuff that already works.
 Making strings bidirectional ranges has been a very good choice within
 the constraints. There was already a string type, and that was
 immutable(char)[], and a bunch of code depended on that definition.
Trying to make it work by blowing a hole in the generic range concept now seems like it wasn't worth it.
I disagree. Also what hole? Andrei
Mar 07 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
08-Mar-2014 05:18, Andrei Alexandrescu writes:
 On 3/7/14, 12:48 PM, Dmitry Olshansky wrote:
 07-Mar-2014 23:57, Andrei Alexandrescu writes:
 On 3/6/14, 6:37 PM, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up about the
 automatic encoding and decoding of char ranges.
[snip] Allow me to enumerate the functions of std.algorithm and how they work today and how they'd work with the proposed change. Let s be a variable of some string type.
Special case was wrong though - special casing arrays of char[] and throwing all other ranges of char out the window. The amount of code to support this schizophrenia is enormous.
I think this is a confusion. The code in e.g. std.algorithm is specialized for efficiency of stuff that already works.
Well, I've said it elsewhere - the specialization was too fine-grained. Either it's generic or it doesn't work.
 Making strings bidirectional ranges has been a very good choice within
 the constraints. There was already a string type, and that was
 immutable(char)[], and a bunch of code depended on that definition.
Trying to make it work by blowing a hole in the generic range concept now seems like it wasn't worth it.
I disagree. Also what hole?
Let's say we keep it. Yesterday I had to write constraints like this:

if ((isNarrowString!Range && is(Unqual!(ElementEncodingType!Range) == wchar))
    || (isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar)))

Just to accept anything that works like an array of wchar, buffers and whatnot included. I expect that this should have been enough:

isRandomAccessRange!Range && is(Unqual!(ElementType!Range) == wchar)

Or maybe introduce something to indicate any "DualRange" of narrow chars.

-- 
Dmitry Olshansky
Mar 08 2014
prev sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu 
wrote:
 s.all!(x => x == 'é')
 s.any!(x => x == 'é')
 s.canFind!(x => x == 'é')
These are a variation of the following:

ubyte b = ...;
if (b == 1000) { ... }

The compiler could emit a warning here, and indeed some languages/compilers do. It might not be in the vein of D metaprogramming, though, as the compiler will not emit a warning for "if (false) { ... }".
 s.canFind('é')
 s.endsWith('é')
 s.find('é')
 s.count('é')
 s.countUntil('é')
These should not compile post-change, because the sought element (dchar) is not of the same type as the string. So they will not fail silently.
 s.count()
 s.count!((a, b) => std.uni.toLower(a) == 
 std.uni.toLower(b))("é")
 s.countUntil('é')
As has already been mentioned, counting code points is borderline useless.
 s.count!((a, b) => std.uni.toLower(a) == 
 std.uni.toLower(b))("é")
And this is just wrong on many levels. I hope you know better than to actually use this for case-insensitive comparisons in production software.
Mar 07 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/7/14, 4:39 PM, Vladimir Panteleev wrote:
 s.canFind('é')
 s.endsWith('é')
 s.find('é')
 s.count('é')
 s.countUntil('é')
These should not compile post-change, because the sought element (dchar) is not of the same type as the string. So they will not fail silently.
The compared element need not have the same type (otherwise we'd break some other code). Andrei
Mar 07 2014
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 01:38:39 UTC, Andrei Alexandrescu 
wrote:
 On 3/7/14, 4:39 PM, Vladimir Panteleev wrote:
 s.canFind('é')
 s.endsWith('é')
 s.find('é')
 s.count('é')
 s.countUntil('é')
These should not compile post-change, because the sought element (dchar) is not of the same type as the string. So they will not fail silently.
The compared element need not have the same type (otherwise we'd break some other code).
Do you think such code will appear often in practice? Even if the type is a dchar, in some cases the programmer may not have intended to do decoding (e.g. the "dchar" type was a result of type deduction from .front OSLT).
Mar 07 2014
parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 01:41:01 UTC, Vladimir Panteleev 
wrote:
 On Saturday, 8 March 2014 at 01:38:39 UTC, Andrei Alexandrescu 
 wrote:
 On 3/7/14, 4:39 PM, Vladimir Panteleev wrote:
 These should not compile post-change, because the sought 
 element (dchar)
 is not of the same type as the string. So they will not fail 
 silently.
The compared element need not have the same type (otherwise we'd break some other code).
 Do you think such code will appear often in practice? Even if the type is a dchar, in some cases the programmer may not have intended to do decoding (e.g. the "dchar" type was a result of type deduction from .front OSLT).
Sorry, I see now that you were referring to algorithms in general. I think adding a temporary warning for character types only, as with .front, would be appropriate...
Mar 07 2014
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 03/07/2014 03:37 AM, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up about the
 automatic encoding and decoding of char ranges.
 ...
I think this is among the most annoying aspects of Phobos.
Mar 07 2014
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Andrei suggests that this change would destroy D by breaking too much existing 
code. He might be right. Can we afford the risk that he is right?

We should think about a way to have our cake and eat it, too.

Keep in mind that this issue is a Phobos one, not a core language issue.
Mar 07 2014
next sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
 Andrei suggests that this change would destroy D by breaking 
 too much existing code. He might be right. Can we afford the 
 risk that he is right?

 We should think about a way to have our cake and eat it, too.

 Keep in mind that this issue is a Phobos one, not a core 
 language issue.
Before we discuss risk in the change, we need to agree that it is even a desirable change. I don't think we have reached that point. It's worth pointing out that all the performance issues can be resolved in Phobos through specialisation with no disruption to the users.
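For instance, a sketch of the kind of specialisation meant (illustrative only, not Phobos source): an ASCII needle can be counted directly on code units, since UTF-8 and UTF-16 guarantee that an ASCII value never occurs inside a multi-unit sequence:

import std.string : representation;
import std.traits : isNarrowString;

size_t countAscii(S)(S haystack, char needle) if (isNarrowString!S)
{
    assert(needle < 0x80);
    size_t n = 0;
    foreach (u; haystack.representation) // raw code units, no decoding
        if (u == needle)
            ++n;
    return n;
}

void main()
{
    assert(countAscii("cass\u00E9", 's') == 2);
}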
Mar 07 2014
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Mar 08, 2014 at 12:46:21AM +0000, Peter Alexander wrote:
 On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
Andrei suggests that this change would destroy D by breaking too
much existing code. He might be right. Can we afford the risk that
he is right?

We should think about a way to have our cake and eat it, too.

Keep in mind that this issue is a Phobos one, not a core language
issue.
Before we discuss risk in the change, we need to agree that it is even a desirable change. I don't think we have reached that point. It's worth pointing out that all the performance issues can be resolved in Phobos through specialisation with no disruption to the users.
Regardless of which way we decide in the end, I hope the one good thing that will come out of this thread is improved performance of string algorithms in Phobos. Things like efficient substring searching, for implementing multibyte character (or multi-code-point "character") operations, are quite needed, IMO.


T

-- 
If a person can't communicate, the very least he could do is to shut up. -- Tom Lehrer, on people who bemoan their communication woes with their loved ones.
Mar 07 2014
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
 We should think about a way to have our cake and eat it, too.
I think a good place to start would be to have a draft implementation of the proposal. This will allow people to try it with their projects and see how much code it will really affect.

As I mentioned here[1], I suspect that certain valid code that used the range primitives will continue to work unaffected even after a sudden switch, so perhaps the "deprecation" and "error" stages can be replaced with a longer "warning" stage instead. This is similar to how git changed the meaning of the "push" command: it just nagged users for a long time, and included instructions for switching to the new behavior early (thus squelching the warning) or permanently accepting the old behavior. (For our case, that would be adding .representation or .byCodepoint, depending on the intent.)

[1]: http://forum.dlang.org/post/dlpmchtaqzrxxylpmiwh forum.dlang.org
Mar 07 2014
prev sibling parent reply "Sean Kelly" <sean invisibleduck.org> writes:
On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
 Andrei suggests that this change would destroy D by breaking 
 too much existing code. He might be right. Can we afford the 
 risk that he is right?
Perhaps not. But I think the current approach is totally broken, it just also happens to be what people have coded to.

Andrei used algorithms operating on a code point level as an example of what would break if this change were made, and in that he's absolutely correct. But what bothers me is whether it's appropriate to perform this sort of character-based operation on Unicode strings in the first place.

The current approach is a cut above treating strings as arrays of bytes for some languages, and still utterly broken for others. If I'm operating on a right to left language like Hebrew, what would I expect the result to be from something like countUntil? And how useful would such a result be? I'm inclined to say that the correct approach is to state that algorithms operate explicitly on a T.sizeof basis and that if the data contained in a particular range has some multi-element encoding then separate, specialized routines should be used when the T.sizeof behavior will not produce the desired result.

So the problem to me is that we're stuck not fixing something that's horribly broken just because it's broken in a way that people presumably now expect.

I'd personally like to see this fixed and I think the new behavior is preferable overall, but I do share Andrei's concern that such a big change might hurt the language anyway.
Mar 08 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 9:33 AM, Sean Kelly wrote:
 On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
 Andrei suggests that this change would destroy D by breaking too much
 existing code. He might be right. Can we afford the risk that he is
 right?
 Perhaps not. But I think the current approach is totally broken, it just also happens to be what people have coded to.
I think that's an exaggeration poorly supported by evidence. My definition of "totally broken" would be "essentially unusable" and I think we're well past the point we need to prove that. Virtually all applications need to deal with strings to some extent, and I myself wrote a couple of relatively string-heavy ones. D strings work well. Even the most ardent detractors of D on e.g. reddit.com admit by omission that string processing is one of its strengths. Though they'll probably pick up on this thread soon :o).
 Andrei used
 algorithms operating on a code point level as an example of what would
 break if this change were made, and in that he's absolutely correct.
 But what bothers me is whether it's appropriate to perform this sort of
 character-based operation on Unicode strings in the first place.
Searching for characters in strings would be difficult to deem inappropriate. When I designed std.algorithm I recall I put the following options on the table:

1. All algorithms would by default operate on strings at char/wchar level (i.e. code unit). That would cause the usual issues and confusions I was aware of from C++. Certain algorithms would require specialization and/or the user using byDchar for correctness. At some point I swear I've had a byDchar definition somewhere; I've searched the recent git history for it, to no avail (see the sketch below).

2. All algorithms would by default operate at code point level. That way correctness would be achieved by default, and certain algorithms would require specialization for efficiency. (Back then I didn't know about graphemes and normalization. I'm not sure how that would have affected the final decision.)

3. Change the alias string, wstring etc. to be some type that requires explicit access for code units/code points etc. instead of implicitly mixing the two.

My fave was (3). And not mine only - several people suggested alternative definitions of the "default" string type. Back then however we were in the middle of the D1/D2 transition and one more aftershock didn't seem like a good idea at all. Walter opposed such a change, and didn't really have to convince me. From experience with C++ I knew (1) had a bad track record, and (2) "generically conservative, specialize for speed" was a successful pattern.

What would you have chosen given that context?
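(For reference, a minimal sketch of what such a byDchar adapter might have looked like; hypothetical code over a char slice, not the lost definition:)

    import std.utf : decode, stride;

    struct ByDchar
    {
        string s;
        @property bool empty() const { return s.length == 0; }
        @property dchar front()
        {
            size_t i = 0;
            return decode(s, i);           // decode one code point
        }
        void popFront()
        {
            s = s[stride(s, 0) .. $];      // drop that code point's units
        }
    }

With something like this, decoding becomes an explicit opt-in at the call site.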
 The current approach is a cut above treating strings as arrays of bytes
 for some languages, and still utterly broken for others. If I'm
 operating on a right to left language like Hebrew, what would I expect
 the result to be from something like countUntil?
The entire string processing paraphernalia is left to right. I figure RTL languages are under-supported, but s.retro.countUntil comes to mind.
 And how useful would
 such a result be?
I don't know.
 I'm inclined to say that the correct approach is to
 state that algorithms operate explicitly on a T.sizeof basis and that if
 the data contained in a particular range has some multi-element encoding
 then separate, specialized routines should be used when the T.sizeof
 behavior will not produce the desired result.
That sounds quite like C++ plus ICU. It doesn't strike me as the golden standard for Unicode integration.
 So the problem to me is that we're stuck not fixing something that's
 horribly broken just because it's broken in a way that people presumably
 now expect.
Clearly I'm being subjective here but again I'd find it difficult to get convinced we have something horribly broken from the evidence I gathered inside and outside Facebook.
 I'd personally like to see this fixed and I think the new behavior is
 preferable overall, but I do share Andrei's concern that such a big
 change might hurt the language anyway.
I've said this once and I'm saying it again: the best way to convert this discussion into something useful is to devise ideas for useful non-breaking additions. Andrei
Mar 08 2014
next sibling parent reply "Sean Kelly" <sean invisibleduck.org> writes:
I'll admit that I'm probably not the best person to make 
suggestions here. As a back-end programmer, a large portion of my 
work is dealing with text streams of various types. And the data 
I work with is in any number of encodings and none can be assumed 
to be in English. But literally all of my work is either parsing 
protocols where the symbols are single byte and so the C way is 
appropriate, or they are with blocks of text where I basically 
never work at the per character level. In fact I can think of 
only one case--trimming a block of text for display in a small 
frame. And there I use an explicit routine for trimming to a 
specific number of Unicode characters.

So regarding std.algorithm, I couldn't use it because I need to 
be able to slice based on the result. Knowing the number of 
multibyte code points between the beginning of the string and the 
thing I was searching for is utterly useless. Also, the 
performance is way too bad to make it a consideration.

But you're right. I was being dramatic when I called it utterly 
broken. It's simply not useful to me as-is. The solution for me 
is fairly simple though if inelegant--cast the string to an array 
of ubyte. Having both options is nice I suppose. I just can't 
comment on the utility of the default behavior because I can't 
imagine a use for it.
Mar 08 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 12:26 PM, Sean Kelly wrote:
 But you're right. I was being dramatic when I called it utterly broken.
 It's simply not useful to me as-is. The solution for me is fairly simple
 though if inelegant--cast the string to an array of ubyte.
Ain't nobody know nothing about

Andrei
Mar 08 2014
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu 
wrote:
 Searching for characters in strings would be difficult to deem 
 inappropriate.
The notion of "character" exists only in certain writing systems. Relying on it is thus a flawed practice, and I think it should not be encouraged, as it will only make writing truly-international software more difficult. A more correct approach is searching for a certain substring. If non-exact matching is needed (normalization, case insensitivity etc.), then the appropriate solution is to use the Unicode algorithms. If you look at the situation from this point of view, single code points become merely an implementation detail.
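(A small sketch of the substring-first approach:)

    import std.algorithm : find;

    void main()
    {
        auto s = "naïve";
        auto rest = s.find("ïve");   // substring match: no notion of
                                     // "character" needed or implied
        assert(rest == "ïve");
    }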
 1. All algorithms would by default operate on strings at 
 char/wchar level (i.e. code unit). That would cause the usual 
 issues and confusions I was aware of from C++. Certain 
 algorithms would require specialization and/or the user using 
 byDchar for correctness.
As previously discussed, "correctness" here is conditional. I would not use that word, it is another extreme.
 From experience with C++ I knew (1) had a bad track record, and 
 (2) "generically conservative, specialize for speed" was a 
 successful pattern.

 What would you have chosen given that context?
Ideally, we would have the Unicode algorithms in the standard library from day 1, and advocated their use throughout the documentation.
 I'm inclined to say that the correct approach is to
 state that algorithms operate explicitly on a T.sizeof basis 
 and that if
 the data contained in a particular range has some 
 multi-element encoding
 then separate, specialized routines should be used when the 
 T.sizeof
 behavior will not produce the desired result.
That sounds quite like C++ plus ICU. It doesn't strike me as the golden standard for Unicode integration.
Why not? Because it sounds like D needs exactly that. Plus its amazing slicing and range capabilities, of course.
 So the problem to me is that we're stuck not fixing something 
 that's
 horribly broken just because it's broken in a way that people 
 presumably
 now expect.
Clearly I'm being subjective here but again I'd find it difficult to get convinced we have something horribly broken from the evidence I gathered inside and outside Facebook.
Have you or anyone you personally know tried to process text in D containing a writing system such as Sanskrit's?
 I'd personally like to see this fixed and I think the new 
 behavior is
 preferable overall, but I do share Andrei's concern that such 
 a big
 change might hurt the language anyway.
I've said this once and I'm saying it again: the best way to convert this discussion into something useful is to devise ideas for useful non-breaking additions.
I disagree. As I've argued, I believe that currently most uses of dchars in an application are incorrect, and ultimately a time bomb for proper internationalization support. We need to apply the same procedure that we do with any language construct that was deemed to have been a poor decision: put it through a deprecation cycle and fix it.
Mar 08 2014
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 12:38 PM, Vladimir Panteleev wrote:
 On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote:
 1. All algorithms would by default operate on strings at char/wchar
 level (i.e. code unit). That would cause the usual issues and
 confusions I was aware of from C++. Certain algorithms would require
 specialization and/or the user using byDchar for correctness.
As previously discussed, "correctness" here is conditional. I would not use that word, it is another extreme.
Agreed.
 From experience with C++ I knew (1) had a bad track record, and (2)
 "generically conservative, specialize for speed" was a successful
 pattern.

 What would you have chosen given that context?
Ideally, we would have the Unicode algorithms in the standard library from day 1, and advocated their use throughout the documentation.
It's not too late to do a lot of that.
 I'm inclined to say that the correct approach is to
 state that algorithms operate explicitly on a T.sizeof basis and that if
 the data contained in a particular range has some multi-element encoding
 then separate, specialized routines should be used with the T.sizeof
 behavior will not produce the desired result.
That sounds quite like C++ plus ICU. It doesn't strike me as the golden standard for Unicode integration.
Why not? Because it sounds like D needs exactly that. Plus its amazing slicing and range capabilities, of course.
Pretty much everyone using ICU hates it.
 So the problem to me is that we're stuck not fixing something that's
 horribly broken just because it's broken in a way that people presumably
 now expect.
Clearly I'm being subjective here but again I'd find it difficult to get convinced we have something horribly broken from the evidence I gathered inside and outside Facebook.
Have you or anyone you personally know tried to process text in D containing a writing system such as Sanskrit's?
No. Point being?
 I'd personally like to see this fixed and I think the new behavior is
 preferable overall, but I do share Andrei's concern that such a big
 change might hurt the language anyway.
I've said this once and I'm saying it again: the best way to convert this discussion into something useful is to devise ideas for useful non-breaking additions.
I disagree. As I've argued, I believe that currently most uses of dchars in an application are incorrect, and ultimately a time bomb for proper internationalization support. We need to apply the same procedure that we do with any language construct that was deemed to have been a poor decision: put it through a deprecation cycle and fix it.
I think there are too large risks for that, and it's quite unclear this is solving a problem. "Slightly better Unicode support" is hardly a good justification. Andrei
Mar 08 2014
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 20:50:49 UTC, Andrei Alexandrescu 
wrote:
 On 3/8/14, 12:38 PM, Vladimir Panteleev wrote:
 On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu 
 wrote:
 That sounds quite like C++ plus ICU. It doesn't strike me as 
 the
 golden standard for Unicode integration.
Why not? Because it sounds like D needs exactly that. Plus its amazing slicing and range capabilities, of course.
Pretty much everyone using ICU hates it.
I admit I never used it personally. I just thought you meant that implied "D implementations of relevant Unicode algorithms, adapted to D style (range interface)". Is there more to this than the limitations of C++ or the implementers' design choices?
 Have you or anyone you personally know tried to process text 
 in D
 containing a writing system such as Sanskrit's?
No. Point being?
Point being, we don't have solid data to conclude whether D's current approach is actually good enough for such cases as you claim. We do have one post in this thread: http://forum.dlang.org/post/jlgfkxlrhlzdpwkpsrot forum.dlang.org
 I think there are too large risks for that,
For what? We have not discussed a possible plan yet. Are you referring to Walter Bright's proposal?
 and it's quite unclear this is solving a problem. "Slightly 
 better Unicode support" is hardly a good justification.
What this will solve:

1. Eliminating dangerous constructs, such as s.countUntil and s.indexOf both returning integers, yet possibly having different values in circumstances that the developer may not foresee.

2. Very high complexity of implementations (the ElementEncodingType problem previously mentioned).

3. Hidden, difficult-to-detect performance problems. The reason why this thread was started. I've had to deal with them in several places myself.

4. Encourage D programmers to write Unicode-capable code that is correct in the full sense of the word.

I think the above list has enough weight to merit at least considering *some* breaking changes.
Mar 08 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 1:13 PM, Vladimir Panteleev wrote:
 On Saturday, 8 March 2014 at 20:50:49 UTC, Andrei Alexandrescu wrote:
 On 3/8/14, 12:38 PM, Vladimir Panteleev wrote:
 On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu wrote:
 That sounds quite like C++ plus ICU. It doesn't strike me as the
 golden standard for Unicode integration.
Why not? Because it sounds like D needs exactly that. Plus its amazing slicing and range capabilities, of course.
Pretty much everyone using ICU hates it.
I admit I never used it personally.
Time to do due diligence :o).
 I just thought you meant that
 implied "D implementations of relevant Unicode algorithms, adapted to D
 style (range interface)". Is there more to this than the limitations of
 C++ or the implementers' design choices?

 Have you or anyone you personally know tried to process text in D
 containing a writing system such as Sanskrit's?
No. Point being?
Point being, we don't have solid data to conclude whether D's current approach is actually good enough for such cases as you claim.
My only claim is that recognizing and iterating strings by code point is better than doing things by the octet.
 We do have one post in this thread:
 http://forum.dlang.org/post/jlgfkxlrhlzdpwkpsrot forum.dlang.org

 I think there are too large risks for that,
For what? We have not discussed a possible plan yet. Are you referring to Walter Bright's proposal?
Any plan to inflict a large breaking change for strings incurs a risk. To add insult to injury, the improvement brought about by the change is debatable.
 and it's quite unclear this is solving a problem. "Slightly better
 Unicode support" is hardly a good justification.
What this will solve:

1. Eliminating dangerous constructs, such as s.countUntil and s.indexOf both returning integers, yet possibly having different values in circumstances that the developer may not foresee.
I disagree there's any danger. They deal in code points, end of story.
 2. Very high complexity of implementations (the ElementEncodingType
 problem previously mentioned).
I disagree with "very high". Besides if you want to do Unicode you gotta crack some eggs.
 3. Hidden, difficult-to-detect performance problems. The reason why this
 thread was started. I've had to deal with them in several places myself.
I disagree with "hidden, difficult to detect". Also I'd add that I'd rather not have hidden, difficult to detect correctness problems.
 4. Encourage D programmers to write Unicode-capable code that is correct
 in the full sense of the word.
I disagree we are presently discouraging them. I do agree a change would make certain things clearer. But not enough to nearly make up for the breakage.
 I think the above list has enough weight to merit at least considering
 *some* breaking changes.
I think a better approach is to figure what to add. Andrei
Mar 08 2014
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu 
wrote:
 My only claim is that recognizing and iterating strings by code 
 point is better than doing things by the octet.
Considering or disregarding the disadvantages of this choice?
 1. Eliminating dangerous constructs, such as s.countUntil and 
 s.indexOf
 both returning integers, yet possibly having different values 
 in
 circumstances that the developer may not foresee.
I disagree there's any danger. They deal in code points, end of story.
Perhaps I did not explain clearly enough.

    auto pos = s.countUntil(sub);
    writeln(s[pos..$]);

This will compile, and work for English text. For someone without complete knowledge of Phobos functions and how D handles Unicode, it is not obvious that this code is actually wrong. In certain situations, this can have devastating effects: consider, for example, if this code is extracting a slice from a string that elsewhere contains sensitive data (e.g. a configuration file containing, among other data, a password). An attacker could supply a Unicode string where the developer did not expect it, thus causing "pos" to have a smaller value than the corresponding indexOf result, thus revealing a slice of "s" which was not intended to be visible. Thus, a developer currently needs to tread very carefully wherever he is slicing strings, so as to not accidentally use indices obtained from functions that count code points.
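(For contrast, a sketch of the safe variant: std.string.indexOf returns a code-unit index, which is valid for slicing:)

    import std.stdio : writeln;
    import std.string : indexOf;

    void main()
    {
        string s = "пароль=secret";   // six two-byte letters, then '='
        auto pos = s.indexOf('=');    // 12: a code-unit index
        // countUntil('=') would return 6: a code-point count,
        // unsafe to slice with
        if (pos >= 0)
            writeln(s[pos .. $]);     // prints "=secret"
    }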
 2. Very high complexity of implementations (the 
 ElementEncodingType
 problem previously mentioned).
I disagree with "very high".
I'm quite sure that std.range and std.algorithm will lose a LOT of weight if they were fixed to not treat strings specially.
 Besides if you want to do Unicode you gotta crack some eggs.
No, I can't see how this justifies the choice. An explicit decoding range would have simplified things greatly while offering many of the same advantages. Whether the fact that it is there "by default" is an advantage of the current approach at all is debatable.
 3. Hidden, difficult-to-detect performance problems. The 
 reason why this
 thread was started. I've had to deal with them in several 
 places myself.
I disagree with "hidden, difficult to detect".
Why? You can only find out that an algorithm is slower than it needs to be via either profiling (at which point you're wondering why the thing is so slow), or feeding it invalid UTF. If you had made a different choice for Unicode in D, this problem would not exist altogether.
 Also I'd add that I'd rather not have hidden, difficult to 
 detect correctness problems.
Except we already do. Arguments have already been presented in this thread that demonstrate correctness problems with the current approach. I don't think that these can stand up to the problems that the simpler by-char iteration approach would have.
 4. Encourage D programmers to write Unicode-capable code that 
 is correct
 in the full sense of the word.
I disagree we are presently discouraging them.
I did not say we are. The problem is that we aren't encouraging them either - we are instead setting an example of how to do it in a wrong (incomplete) way.
 I do agree a change would make certain things clearer.
I have an issue with all the counter-arguments presented in this thread being shoved behind the one word "clearer".
 But not enough to nearly make up for the breakage.
I would still like to go ahead with my suggestion to attempt some possible changes without releasing them. I'm going to try them with my own programs first to see how much it will break. I believe that you are too eagerly dismissing all proposals without even evaluating them.
 I think the above list has enough weight to merit at least 
 considering
 *some* breaking changes.
I think a better approach is to figure what to add.
This is obvious:

- more Unicode algorithms (normalization, segmentation, etc.)
- better documentation
Mar 08 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 4:42 PM, Vladimir Panteleev wrote:
 On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu wrote:
 My only claim is that recognizing and iterating strings by code point
 is better than doing things by the octet.
Considering or disregarding the disadvantages of this choice?
Doing my best to weigh everything with the right measures.
 1. Eliminating dangerous constructs, such as s.countUntil and s.indexOf
 both returning integers, yet possibly having different values in
 circumstances that the developer may not foresee.
I disagree there's any danger. They deal in code points, end of story.
Perhaps I did not explain clearly enough.

    auto pos = s.countUntil(sub);
    writeln(s[pos..$]);

This will compile, and work for English text. For someone without complete knowledge of Phobos functions and how D handles Unicode, it is not obvious that this code is actually wrong.
I agree. At a point or another, the dual nature of strings (and dual means to iterate them) will cause trouble for the unwary.
 In certain situations,
 this can have devastating effects: consider, for example, if this code
 is extracting a slice from a string that elsewhere contains sensitive
 data (e.g. a configuration file containing, among other data, a
 password).
Whaaa, passwords in clear?
 An attacker could supply an Unicode string where the
 developer did not expect it, thus causing "pos" to have a smaller value
 than the corresponding indexOf result, thus revealing a slice of "s"
 which was not intended to be visible. Thus, a developer currently needs
 to tread very carefully wherever he is slicing strings, so as to not
 accidentally use indices obtained from functions that count code points.
Okay, though when you opened with "devastating" I was hoping for nothing short of death and dismemberment. Anyhow the fix is obvious per this brief tutorial: http://www.youtube.com/watch?v=hkDD03yeLnU
 2. Very high complexity of implementations (the ElementEncodingType
 problem previously mentioned).
I disagree with "very high".
I'm quite sure that std.range and std.algorithm will lose a LOT of weight if they were fixed to not treat strings specially.
I'm not so sure. Most of the string-specific optimizations simply detect certain string cases and forward them to array algorithms that need be written anyway. You would, indeed, save a fair amount of isSomeString conditionals and stuff (thus simplifying on scaffolding), but probably not a lot of code. That's not useless work - it'd go somewhere in any design.
 Besides if you want to do Unicode you gotta crack some eggs.
No, I can't see how this justifies the choice. An explicit decoding range would have simplified things greatly while offering many of the same advantages.
My point there is that there's no useless or duplicated code that would be thrown away. A better design would indeed make for better modular separation - would be great if the string-related optimizations in std.algorithm went elsewhere. They wouldn't disappear.
 Whether the fact that it is there "by default" is an advantage of the
 current approach at all is debatable.
Clearly. If I'd do things over again, I'd definitely change a thing or two. (I wouldn't go with Walter's proposal, which I think is worse than what we have now.) But the current approach has something very difficult to talk away: it's there. And that makes a whole lotta difference. Do I believe it's perfect? Hell no. Does it blunt much of the point of this debate? I'm afraid so.
 3. Hidden, difficult-to-detect performance problems. The reason why this
 thread was started. I've had to deal with them in several places myself.
I disagree with "hidden, difficult to detect".
Why? You can only find out that an algorithm is slower than it needs to be via either profiling (at which point you're wondering why the thing is so slow), or feeding it invalid UTF. If you had made a different choice for Unicode in D, this problem would not exist altogether.
Disagree.
 Also I'd add that I'd rather not have hidden, difficult to detect
 correctness problems.
Except we already do. Arguments have already been presented in this thread that demonstrate correctness problems with the current approach. I don't think that these can stand up to the problems that the simpler by-char iteration approach would have.
Sure there are, and you yourself illustrated a misuse of the APIs. My point is: code point is better than code unit and not all that much slower. Grapheme is better than code point but a lot slower. It seems we're quite in a sweet spot here wrt performance/correctness.
 4. Encourage D programmers to write Unicode-capable code that is correct
 in the full sense of the word.
I disagree we are presently discouraging them.
I did not say we are. The problem is that we aren't encouraging them either - we are instead setting an example of how to do it in a wrong (incomplete) way.
Code unit is what it is. Those programming for natural languages for which code units are not sufficient would need to exercise due diligence. We ought to help them without crippling efficiency.
 I do agree a change would make certain things clearer.
I have an issue with all the counter-arguments presented in this thread being shoved behind the one word "clearer".
What is the issue you are having? I don't see a much better API being proposed. I see a marginally improved API at the very best, and possibly quite a bit more prone to error.
 But not enough to nearly make up for the breakage.
I would still like to go ahead with my suggestion to attempt some possible changes without releasing them. I'm going to try them with my own programs first to see how much it will break.
I think that's great.
 I believe that you are
 too eagerly dismissing all proposals without even evaluating them.
Perspective is everything, isn't it :o). I thought I was being reasonable and accepting in discussing a number of proposed points, although in my heart of hearts many arguments seem rather frivolous. With what has been put forward so far, that's not even close to justifying a breaking change. If that great better design is just getting back to code unit iteration, the change will not happen while I work on D. It is possible, however, that a much better idea comes forward, and I'd be looking forward to such.
 I think the above list has enough weight to merit at least considering
 *some* breaking changes.
I think a better approach is to figure what to add.
 This is obvious:

 - more Unicode algorithms (normalization, segmentation, etc.)
 - better documentation
I was thinking of these too:

1. Revisit std.encoding and perhaps confer legitimacy to the character types defined there. The implementation in std.encoding is wanting, but I think the idea is sound. Essentially give more love to various encodings, including Ascii and "bypass encoding, I'll deal with stuff myself".

2. Add byChar that returns a random-access range iterating a string by character. Add byWchar that does on-the-fly transcoding to UTF16. Add byDchar that accepts any range of char and does decoding. And such stuff. Then whenever one wants to go through a string by code point can just use str.byChar.

Andrei
Mar 08 2014
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 01:23:27 UTC, Andrei Alexandrescu 
wrote:
 On 3/8/14, 4:42 PM, Vladimir Panteleev wrote:
 On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu 
 wrote:
 My only claim is that recognizing and iterating strings by 
 code point
 is better than doing things by the octet.
Considering or disregarding the disadvantages of this choice?
Doing my best to weigh everything with the right measures.
I think it would be good to get a comparison of the two approaches, and list the arguments presented so far. I'll look into starting a Wiki page.
 Okay, though when you opened with "devastating" I was hoping 
 for nothing short of death and dismemberment.
In proportion. To the best of my knowledge, no one here writes software for military or industrial robots in D. Security issues rank as the worst kind of bugs in software on my scale.
 Anyhow the fix is obvious per this brief tutorial: 
 http://www.youtube.com/watch?v=hkDD03yeLnU
I don't get it.
 I'm quite sure that std.range and std.algorithm will lose a 
 LOT of
 weight if they were fixed to not treat strings specially.
I'm not so sure. Most of the string-specific optimizations simply detect certain string cases and forward them to array algorithms that need be written anyway. You would, indeed, save a fair amount of isSomeString conditionals and stuff (thus simplifying on scaffolding), but probably not a lot of code. That's not useless work - it'd go somewhere in any design.
One way to find out.
 Besides if you want to do Unicode you gotta crack some eggs.
No, I can't see how this justifies the choice. An explicit decoding range would have simplified things greatly while offering much of the same advantages.
My point there is that there's no useless or duplicated code that would be thrown away. A better design would indeed make for better modular separation - would be great if the string-related optimizations in std.algorithm went elsewhere. They wouldn't disappear.
Why? Isn't the whole issue that std.range presents strings as dchar ranges, and std.algorithm needs to detect dchar ranges and then treat them as char arrays? As opposed to std.algorithm just detecting arrays and treating them all as arrays (which it should be doing now anyway)?
 3. Hidden, difficult-to-detect performance problems. The 
 reason why this
 thread was started. I've had to deal with them in several 
 places myself.
I disagree with "hidden, difficult to detect".
Why? You can only find out that an algorithm is slower than it needs to be via either profiling (at which point you're wondering why the thing is so slow), or feeding it invalid UTF. If you had made a different choice for Unicode in D, this problem would not exist altogether.
Disagree.
Could you please elaborate? This is the second uninformative reply to this argument.
 Except we already do. Arguments have already been presented in 
 this
 thread that demonstrate correctness problems with the current 
 approach.
 I don't think that these can stand up to the problems that the 
 simpler
 by-char iteration approach would have.
Sure there are, and you yourself illustrated a misuse of the APIs.
If UTF decoding was explicit, the problem would stand out. I don't think this is a valid argument.
 My point is: code point is better than code unit
This was debated... people should not be looking at individual code points, unless they really know what they're doing.
 Grapheme is better than code point but a lot slower.
We are going in circles. People should have very good reasons for looking at individual graphemes as well.
 It seems we're quite in a sweet spot here wrt 
 performance/correctness.
This does not seem like an objective summary of this thread's arguments so far. I guess I'll get working on that wiki page to organize the arguments. This discussion is starting to feel like a quicksand roundabout.
 With what has been put forward so far, that's not even close to 
 justifying a breaking change. If that great better design is 
 just get back to code unit iteration, the change will not 
 happen while I work on D. It is possible, however, that a much 
 better idea comes forward, and I'd be looking forward to such.
Actually, could you post some examples of real-world code that would be broken by a hypothetical sudden switch? I think I would be hard-pressed to find some in my own code, but I'd need to check for sure to find out.
 2. Add byChar that returns a random-access range iterating a 
 string by character. Add byWchar that does on-the-fly 
 transcoding to UTF16. Add byDchar that accepts any range of 
 char and does decoding. And such stuff. Then whenever one wants 
 to go through a string by code point can just use str.byChar.
This is confusing. Did you mean to say that byChar iterates a string by code unit (not character / code point)?
Mar 08 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 6:14 PM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 01:23:27 UTC, Andrei Alexandrescu wrote:
 On 3/8/14, 4:42 PM, Vladimir Panteleev wrote:
 My point there is that there's no useless or duplicated code that
 would be thrown away. A better design would indeed make for better
 modular separation - would be great if the string-related
 optimizations in std.algorithm went elsewhere. They wouldn't disappear.
Why? Isn't the whole issue that std.range presents strings as dchar ranges, and std.algorithm needs to detect dchar ranges and then treat them as char arrays? As opposed to std.algorithm just detecting arrays and treating them all as arrays (which it should be doing now anyway)?
That's scaffolding, not actual executable code.
 Why? You can only find out that an algorithm is slower than it needs to
 be via either profiling (at which point you're wondering why
 the thing is so slow), or feeding it invalid UTF. If you had made a
 different choice for Unicode in D, this problem would not exist
 altogether.
Disagree.
Could you please elaborate? This is the second uninformative reply to this argument.
What can I say? The answer is obvious. It's not hard to figure for me. Performance of D's UTF strings has never been a mystery to me. From where I stand all this "hidden, difficult-to-detect performance problems" drama is just posturing. We'd do well to wean such out of the discussion. No myriad of bug reports "D strings are awfully slow" on bugzilla. No long threads "Why are D strings so slow" on stack overflow. No trolling on reddit or hackernews "D? Just look at their strings. How could anyone think that's a good idea lol." And it's not like people aren't talking.

In contrast, D has been (and often rightly) criticized in the past for things like floating point performance and garbage collection. No evidence we are having an acute performance problem with UTF strings.
 Sure there are, and you yourself illustrated a misuse of the APIs.
If UTF decoding was explicit, the problem would stand out. I don't think this is a valid argument.
Yours? It indeed isn't, if what you want is to iterate by code unit (= meaningless for all but ASCII strings) by default.
 My point is: code point is better than code unit
This was debated... people should not be looking at individual code points, unless they really know what they're doing.
Should they be looking at code units instead?
 Grapheme is better than code point but a lot slower.
We are going in circles. People should have very good reasons for looking at individual graphemes as well.
And it's good we have increasing support for graphemes. I don't think they should be the default.
 It seems we're quite in a sweet spot here wrt performance/correctness.
This does not seem like an objective summary of this thread's arguments so far.
What is an objective summary? Those who want to inflict massive breakage are not even done arguing we have a better design.
 I guess I'll get working on that wiki page to organize the arguments.
 This discussion is starting to feel like a quicksand roundabout.
That's great. Yes, we're exchanging jabs right now which is not our best use of time. Also in the interest of time, please understand you'd need to show the second coming if you want to break backward compatibility. Additions are a much better path.
 With what has been put forward so far, that's not even close to
 justifying a breaking change. If that great better design is just get
 back to code unit iteration, the change will not happen while I work
 on D. It is possible, however, that a much better idea comes forward,
 and I'd be looking forward to such.
Actually, could you post some examples of real-world code that would be broken by a hypothetical sudden switch? I think I would be hard-pressed to find some in my own code, but I'd need to check for sure to find out.
I'm afraid burden of proof is on you. Far as I'm concerned every breakage of string processing is unacceptable or at least very undesirable.
 2. Add byChar that returns a random-access range iterating a string by
 character. Add byWchar that does on-the-fly transcoding to UTF16. Add
 byDchar that accepts any range of char and does decoding. And such
 stuff. Then whenever one wants to go through a string by code point
 can just use str.byChar.
This is confusing. Did you mean to say that byChar iterates a string by code unit (not character / code point)?
Unit. s.byChar.front is a (possibly ref, possibly qualified) char. Andrei
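P.S. A quick sketch under those semantics (byChar yielding code units, byDchar yielding decoded code points, which is the behavior std.utf's adapters provide):

    import std.range : walkLength;
    import std.utf : byChar, byDchar;

    void main()
    {
        auto s = "señor";
        assert(s.byChar.walkLength == 6);    // code units: 'ñ' takes two
        assert(s.byDchar.walkLength == 5);   // decoded code points
    }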
Mar 08 2014
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu 
wrote:
 And it's not like people aren't talking. In contrast, D has 
 been (and often rightly) criticized in the past for things like 
 floating point performance and garbage collection. No evidence 
 we are having an acute performance problem with UTF strings.
The size of this thread is one factor. But I see your point - I agree that is evidently not one of D's more glaring current problems. I hope I never alluded to that not being the case. That doesn't mean the problem doesn't exist at all, though.
 If UTF decoding was explicit, the problem would stand out. I 
 don't think
 this is a valid argument.
Yours? Indeed isn't, if what you want is iterate by code unit (= meaningless for all but ASCII strings) by default.
I don't understand this argument. Iterating by code unit is not meaningless if you don't want to extract meaning from each unit iteration. For example, if you're parsing JSON or XML, you only care about the syntax characters, which are all ASCII. And there is no confusion of "what exactly are we counting here".
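(For example, a sketch; multi-byte UTF-8 sequences never contain bytes below 0x80, so matching an ASCII byte by code unit cannot land inside another character:)

    // counts separator commas byte-by-byte; commas inside string
    // literals are deliberately ignored here
    size_t countCommas(string json)
    {
        size_t n;
        foreach (char c; json)   // explicit char: code units, no decoding
            if (c == ',')
                ++n;
        return n;
    }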
 This was debated... people should not be looking at individual 
 code
 points, unless they really know what they're doing.
Should they be looking at code units instead?
No. They should only be looking at substrings. Unless they're e.g. parsing a computer language (regardless if it has international text data), as above.
 We are going in circles. People should have very good reasons 
 for
 looking at individual graphemes as well.
And it's good we have increasing support for graphemes. I don't think they should be the default.
I don't think so either. Did I somehow imply that?
 What is an objective summary? Those who want to inflict massive 
 breakage are not even done arguing we have a better design.
From my POV, I could say I see consensus, with just you defending a decision you made a while ago :) But I'd prefer a constructive discussion. Anyway, I don't want to "inflict massive breakage" either. I want the amount of breakage to be a justified cost of fixing a mistake and permanently improving the language's design going forward. Here's what I have so far, BTW: http://wiki.dlang.org/Element_type_of_string_ranges I'll have to review it in the morning. Or rather, afternoon, given that it's 6 AM here.
 I'm afraid burden of proof is on you.
Why? I'm not saying that if you can't produce an example of breakage then your arguments are invalid. Rather, concrete examples give us a concrete problem to work with. I'm not trying to put any "burden of proof" on anyone.
 That's great. Yes, we're exchanging jabs right now which is not 
 our best use of time. Also in the interest of time, please 
 understand you'd need to show the second coming if you want to 
 break backward compatibility. Additions are a much better path.
Even a teensy-weensy breakage? :)
 Far as I'm concerned every breakage of string processing is 
 unacceptable or at least very undesirable.
In all seriousness, at this point I'm worried that you will defend the status quo even if the breakage turns out minimal. Instead of dealing with absolutes, advantages and disadvantages should be weighed against another (even with the breaking-backwards-compatibility penalty being very high).
 Unit. s.byChar.front is a (possibly ref, possibly qualified) 
 char.
So... does byChar for wstrings do the same thing as byWchar? And what if you want to iterate a wstring by char? Wouldn't it be better to have byChar/byWchar/byDchar be a range of char/wchar/dchar regardless of the string type, and have byCodeUnit which iterates by the code unit type?
Mar 08 2014
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 7:53 PM, Vladimir Panteleev wrote:
  From my POV, I could say I see consensus, with just you defending a
 decision you made a while ago :) But I'd prefer a constructive discussion.
What exactly is the consensus? From your wiki page I see "One of the proposals in the thread is to switch the iteration type of string ranges from dchar to the string's character type." I can tell you straight out: That will not happen for as long as I'm working on D. I'm ready to fight on this not only Walter Bright, but him and Walter White together. (Fortunately the former agrees the breakage is too large; haven't asked the latter yet.)
 Anyway, I don't want to "inflict massive breakage" either. I want the
 amount of breakage to be a justified cost of fixing a mistake and
 permanently improving the language's design going forward.
It seems you and I have a different view of the tradeoffs involved.
 In all seriousness, at this point I'm worried that you will defend the
 status quo even if the breakage turns out minimal. Instead of dealing
 with absolutes, advantages and disadvantages should be weighed against
 another (even with the breaking-backwards-compatibility penalty being
 very high).
Of course. If you come with something better, I'd be glad to take a look.
 Unit. s.byChar.front is a (possibly ref, possibly qualified) char.
So... does byChar for wstrings do the same thing as byWchar?
No, it transcodes from UTF16 to UTF8.
 And what if
 you want to iterate a wstring by char?
byChar.
 Wouldn't it be better to have
 byChar/byWchar/byDchar be a range of char/wchar/dchar regardless of the
 string type
that's right
, and have byCodeUnit which iterates by the code unit type?
We must add that too. I agree the resulting design is roundabout (you have char[] which is by default iterated by code point, and you need to wrap it to get to its units that were there in the first place). I also wanted to add some ASCII string love (by ascribing it a separate type) but Walter has good arguments opposing that. Andrei
Mar 08 2014
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu 
wrote:
 What exactly is the consensus? From your wiki page I see "One 
 of the proposals in the thread is to switch the iteration type 
 of string ranges from dchar to the string's character type."

 I can tell you straight out: That will not happen for as long 
 as I'm working on D.
Why?
Mar 08 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/14, 8:24 PM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu wrote:
 What exactly is the consensus? From your wiki page I see "One of the
 proposals in the thread is to switch the iteration type of string
 ranges from dchar to the string's character type."

 I can tell you straight out: That will not happen for as long as I'm
 working on D.
Why?
From the cycle "going in circles": because I think the breakage is way too large compared to the alleged improvement. In fact I believe that that design is inferior to the current one regardless. Andrei
Mar 08 2014
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 05:10:26 UTC, Andrei Alexandrescu 
wrote:
 On 3/8/14, 8:24 PM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu 
 wrote:
 What exactly is the consensus? From your wiki page I see "One 
 of the
 proposals in the thread is to switch the iteration type of 
 string
 ranges from dchar to the string's character type."

 I can tell you straight out: That will not happen for as long 
 as I'm
 working on D.
Why?
From the cycle "going in circles": because I think the breakage is way too large compared to the alleged improvement.
All right. I was wondering if there was something more fundamental behind such an ultimatum.
 In fact I believe that that design is inferior to the current 
 one regardless.
I was hoping we could come to an agreement at least on this point.

---

BTW, a thought struck me while thinking about the problem yesterday.

char and dchar should not be implicitly convertible between one another, or comparable to the other.

    void main()
    {
        string s = "Привет";
        foreach (c; s)
            assert(c != 'я');
    }

Instead, std.conv.to should allow converting between character types, iff they represent one whole code point and fit into the destination type, and throw an exception otherwise (similar to how it deals with integer overflow). Char literals should be special-cased by the compiler to implicitly convert to any sufficiently large type.

This would break more[1] code, but it would avoid the silent failures of the earlier proposal.

[1] I went through my own larger programs. I actually couldn't find any uses of dchar which would be impacted by such a hypothetical change.
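(A rough sketch of such a checked conversion; charTo is an invented name, and this is not how std.conv.to currently behaves:)

    import std.conv : ConvException;

    C charTo(C)(dchar d)
        if (is(C == char) || is(C == wchar) || is(C == dchar))
    {
        // "fits" = representable as one whole code unit of C
        static if (is(C == char))
            enum dchar limit = 0x80;        // one UTF-8 code unit
        else static if (is(C == wchar))
            enum dchar limit = 0x10000;     // BMP: no surrogate pair needed
        else
            enum dchar limit = 0x110000;
        if (d >= limit)
            throw new ConvException("code point does not fit destination type");
        return cast(C) d;
    }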
Mar 09 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 8:18 AM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 05:10:26 UTC, Andrei Alexandrescu wrote:
 On 3/8/14, 8:24 PM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 04:18:15 UTC, Andrei Alexandrescu wrote:
 What exactly is the consensus? From your wiki page I see "One of the
 proposals in the thread is to switch the iteration type of string
 ranges from dchar to the string's character type."

 I can tell you straight out: That will not happen for as long as I'm
 working on D.
Why?
From the cycle "going in circles": because I think the breakage is way too large compared to the alleged improvement.
All right. I was wondering if there was something more fundamental behind such an ultimatum.
It's just factual information with no drama attached (i.e. I'm not threatening to leave the language, just plainly explaining I'll never approve that particular change). That said, a larger explanation is in order.

There have been cases in the past when our community has worked itself into a froth over a non-issue and ultimately caused a language change imposed by "the faction that shouted the loudest". The "lazy" keyword and recently the "virtual" keyword come to mind as cases in which the language leadership has been essentially annoyed into making a change it didn't believe in.

I am all about listening to the community's needs and desires. But at some point there is a need to stick to one's guns in matters of judgment call. See e.g. https://d.puremagic.com/issues/show_bug.cgi?id=11837 for a very recent example in which reasonable people may disagree but at some point you can't choose both options.

What we now have works as intended. As I mentioned, there is quite a bit more evidence the design is useful to people than detrimental. Unicode is all about code points. Code units are incidental to each encoding. The fact that we recognize code points at language and library level is, in my opinion, a Good Thing(tm). I understand that doesn't reach the ninth level of Nirvana and there are still issues to work on, and issues where good-looking code is actually incorrect. But I think we're overall in good shape.

A regression from that to code unit level would be very destructive. Even a clear slight improvement that breaks backward compatibility would be destructive. So I wanted to limit the potential damage of this discussion. It is made only a lot more dangerous by the fact that Walter himself started it, something that others didn't fail to tune into.

The sheer fact that we got to contemplate an unbelievably massive breakage on no other evidence than one misuse case and for the sake of a possibly illusory improvement - that's a sign we need to grow up. We can't go on like this about changing the language and aim to play in the big leagues.
 In fact I believe that that design is inferior to the current one
 regardless.
I was hoping we could come to an agreement at least on this point.
Sorry to disappoint.
 ---

 BTW, a thought struck me while thinking about the problem yesterday.

 char and dchar should not be implicitly convertible between one another,
 or comparable to the other.
I think only the char -> dchar conversion works, and I can see arguments against it. Also comparison of char with dchar is dicey. But there are also cases in which it's legitimate to do that (e.g. assign ASCII chars etc) and this would be a breaking change. One good way to think about breaking changes is "if this change were executed to perfection, how much would that improve the overall quality of D?" Because breakages _are_ "overall" - users don't care whether they come from this or the other part of the type system. Really puts things into perspective.
 void main()
 {
      string s = "Привет";
      foreach (c; s)
          assert(c != 'я');
 }

 Instead, std.conv.to should allow converting between character types,
 iff they represent one whole code point and fit into the destination
 type, and throw an exception otherwise (similar to how it deals with
 integer overflow). Char literals should be special-cased by the compiler
 to implicitly convert to any sufficiently large type.

 This would break more[1] code, but it would avoid the silent failures of
 the earlier proposal.

 [1] I went through my own larger programs. I actually couldn't find any
 uses of dchar which would be impacted by such a hypothetical change.
Generally I think we should steer away from slight improvements of the language at the cost of breaking existing code. Instead, we must think of ways to improve the language without the breakage. You may want to pursue (bugzilla + pull request) adding the std.conv routines with the semantics you mentioned. Andrei
Mar 09 2014
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Mar-2014 07:53, Vladimir Panteleev writes:
 On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu wrote:
 I don't understand this argument. Iterating by code unit is not
 meaningless if you don't want to extract meaning from each unit
 iteration. For example, if you're parsing JSON or XML, you only care
 about the syntax characters, which are all ASCII. And there is no
 confusion of "what exactly are we counting here".

 This was debated... people should not be looking at individual code
 points, unless they really know what they're doing.
Should they be looking at code units instead?
No. They should only be looking at substrings.
This. Anyhow searching dchar makes sense for _some_ languages, the problem is that it shouldn't decode the whole string but rather encode the needle properly and search that.

Basically the whole thread is about: how do I work efficiently (no-decoding) with UTF-8/UTF-16 in cases where it obviously can be done?

The current situation is bad in that it undermines writing decode-less generic code. One easily falls into auto-decode trap on first .front, especially when called from some standard algorithm. The algo sees char[]/wchar[] and gets into decode mode via some special case. If it would do that with _all_ char/wchar random access ranges it'd be at least consistent.

That and wrapping your head around 2 sets of constraints. The amount of code around 2 types - wchar[]/char[] is way too much, that much is clear.

-- 
Dmitry Olshansky
Mar 09 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 11:34 AM, Dmitry Olshansky wrote:
 09-Mar-2014 07:53, Vladimir Panteleev writes:
 On Sunday, 9 March 2014 at 03:26:40 UTC, Andrei Alexandrescu wrote:
 I don't understand this argument. Iterating by code unit is not
 meaningless if you don't want to extract meaning from each unit
 iteration. For example, if you're parsing JSON or XML, you only care
 about the syntax characters, which are all ASCII. And there is no
 confusion of "what exactly are we counting here".

 This was debated... people should not be looking at individual code
 points, unless they really know what they're doing.
Should they be looking at code units instead?
No. They should only be looking at substrings.
This. Anyhow searching dchar makes sense for _some_ languages, the problem is that it shouldn't decode the whole string but rather encode the needle properly and search that.
That's just an optimization. Conceptually what happens is we're looking for a code point in a sequence of code points.
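(Sketched with a hypothetical findCodePoint helper: encode the needle once, then do a plain substring search over the haystack:)

    import std.algorithm : find;
    import std.utf : encode;

    // find a code point in a UTF-8 string without decoding the haystack
    string findCodePoint(string haystack, dchar needle)
    {
        char[4] buf;
        immutable len = encode(buf, needle);   // encode the needle once
        return haystack.find(buf[0 .. len]);   // plain substring search
    }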
 Basically the whole thread is about:
 how do I work efficiently (no-decoding) with UTF-8/UTF-16 in cases where
 it obviously can be done?

 The current situation is bad in that it undermines writing decode-less
 generic code.
s/undermines writing/makes writing explicit/
 One easily falls into auto-decode trap on first .front,
 especially when called from some standard algorithm. The algo sees
 char[]/wchar[] and gets into decode mode via some special case. If it
 would do that with _all_ char/wchar random access ranges it'd be at
 least consistent.

 That and wrapping your head around 2 sets of constraints. The amount of
 code around 2 types - wchar[]/char[] is way too much, that much is clear.
We're engineers so we should quantify. Ideally that would be as simple as "git grep isNarrowString|wc -l" which currently prints 42 of all numbers :o). Overall I suspect there are a few good simplifications we can make by using isNarrowString and .representation. Andrei
Mar 09 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Mar-2014 22:41, Andrei Alexandrescu writes:
 On 3/9/14, 11:34 AM, Dmitry Olshansky wrote:
 This. Anyhow searching dchar makes sense for _some_ languages, the
 problem is that it shouldn't decode the whole string but rather encode
 the needle properly and search that.
That's just an optimization. Conceptually what happens is we're looking for a code point in a sequence of code points.
Yup. It's still not a good idea to introduce this in std.algorithm in a non-generic way.
 That and wrapping your head around 2 sets of constraints. The amount of
 code around 2 types - wchar[]/char[] is way too much, that much is clear.
We're engineers so we should quantify. Ideally that would be as simple as "git grep isNarrowString|wc -l" which currently prints 42 of all numbers :o).
Add to that some uses of isSomeString and ElementEncodingType. 138 and 80 respectively. And in most cases it means that nice generic code was hacked to care about 2 types in particular. That is what bothers me.
 Overall I suspect there are a few good simplifications we can make by
 using isNarrowString and .representation.
Okay putting potential breakage aside. Let me sketch up an additive way of improving current situation.

1. Say we recognize any indexable entity of char/wchar/dchar, that however has .front returning a dchar as a "narrow string". Nothing fancy - it's just a generalization of isNarrowString. At least a range over Array!char will work as string now.

2. Likewise representation must be made something more explicit say byCodeUnit and work on any isNarrowString per above. The opposite of that is byCodePoint.

3. ElementEncodingType is too verbose and misleading. Something more explicit would be useful. ItemType/UnitType maybe?

4. We lack lots of good stuff from Unicode standard. Some recently landed in std.uni. We need many more, and deprecate crappy ones in std.string. (e.g. wrapping text is one)

5. Most algorithms conceptually decode, but may be enhanced to work directly on UTF-8/UTF-16. That together with 1, should IMHO solve most of our problems.

6. Take into account ASCII and maybe other alphabets? Should be as trivial as .assumeASCII and then on you march with all of std.algo/etc.

-- Dmitry Olshansky
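A rough sketch of what points 1 and 2 could look like; the names follow the proposal above, but the bodies are my assumptions rather than an actual patch:

    import std.utf : decode;

    // byCodeUnit: a thin wrapper whose front/popFront yield raw code
    // units, side-stepping the auto-decoding array overloads.
    struct ByCodeUnit(C)
    {
        C[] str;
        @property bool empty() const { return str.length == 0; }
        @property C front() const { return str[0]; }
        void popFront() { str = str[1 .. $]; }
        C opIndex(size_t i) const { return str[i]; }
        @property size_t length() const { return str.length; }
    }
    auto byCodeUnit(C)(C[] str) { return ByCodeUnit!C(str); }

    // byCodePoint: decode lazily, one code point per front.
    struct ByCodePoint(C)
    {
        const(C)[] str;
        @property bool empty() const { return str.length == 0; }
        @property dchar front() const
        {
            size_t i = 0;
            return decode(str, i);
        }
        void popFront()
        {
            size_t i = 0;
            decode(str, i); // advances i past one code point
            str = str[i .. $];
        }
    }
    auto byCodePoint(C)(const(C)[] str) { return ByCodePoint!C(str); }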
Mar 09 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 12:25 PM, Dmitry Olshansky wrote:
 Okay putting potential breakage aside.
 Let me sketch up an additive way of improving current situation.
Now you're talking.
 1. Say we recognize any indexable entity of char/wchar/dchar, that
 however has .front returning a dchar as a "narrow string". Nothing fancy
 - it's just a generalization of isNarrowString. At least a range over
 Array!char will work as string now.
Wait, why is dchar[] a narrow string?
 2. Likewise representation must be made something more explicit say
 byCodeUnit and work on any isNarrowString per above. The opposite of
 that is byCodePoint.
Fine.
 3. ElementEncodingType is too verbose and misleading. Something more
 explicit would be useful. ItemType/UnitType maybe?
We're stuck with that name.
 4. We lack lots of good stuff from Unicode standard. Some recently
 landed in std.uni. We need many more, and deprecate crappy ones in
 std.string. (e.g. wrapping text is one)
Add away.
 5. Most algorithms conceptually decode, but may be enhanced to work
 directly on UTF-8/UTF-16. That together with 1, should IMHO solve most
 of our problems.
Great!
 6. Take into account ASCII and maybe other alphabets? Should be as
 trivial as .assumeASCII and then on you march with all of std.algo/etc.
Walter is against that. His main argument is that UTF already covers ASCII with only a marginal cost (that can be avoided) and that we should go farther into the future instead of catering to an obsolete representation. Andrei
Mar 09 2014
next sibling parent "w0rp" <devw0rp gmail.com> writes:
On Sunday, 9 March 2014 at 19:40:32 UTC, Andrei Alexandrescu 
wrote:
 6. Take into account ASCII and maybe other alphabets? Should 
 be as
 trivial as .assumeASCII and then on you march with all of 
 std.algo/etc.
Walter is against that. His main argument is that UTF already covers ASCII with only a marginal cost (that can be avoided) and that we should go farther into the future instead of catering to an obsolete representation. Andrei
When I've wanted to write code especially for ASCII, I think it hasn't been for use in generic algorithms anyway. Mostly it's stuff for manipulating segments of memory in a particular way, like as seen here in my library which does some work to generate D code:

https://github.com/w0rp/dsmoke/blob/master/source/smoke/string_util.d#L45

Anything else would be something like running through an algorithm and then copying data into a new array or similar, and that would miss the point. When it comes to generic algorithms and ASCII I think UTF-x is sufficient.
Mar 09 2014
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Mar-2014 23:40, Andrei Alexandrescu writes:
 On 3/9/14, 12:25 PM, Dmitry Olshansky wrote:
 Okay putting potential breakage aside.
 Let me sketch up an additive way of improving current situation.
Now you're talking.
 1. Say we recognize any indexable entity of char/wchar/dchar, that
 however has .front returning a dchar as a "narrow string". Nothing fancy
 - it's just a generalization of isNarrowString. At least a range over
 Array!char will work as string now.
Wait, why is dchar[] a narrow string?
Indeed `...entity of char/wchar/dchar` --> `...entity of char/wchar`.
 3. ElementEncodingType is too verbose and misleading. Something more
 explicit would be useful. ItemType/UnitType maybe?
We're stuck with that name.
Too bad, but we have renamed imports... if only they worked correctly. But let's not derail.

[snip]

Great, so this may be turned into a smallish DIP or bugzilla enhancements.
 6. Take into account ASCII and maybe other alphabets? Should be as
 trivial as .assumeASCII and then on you march with all of std.algo/etc.
Walter is against that. His main argument is that UTF already covers ASCII with only a marginal cost
He certainly doesn't have things like case-insensitive matching or collation on his list. Some cute tables are what "directly to the UTF" algorithms require for almost anything beyond simple-minded "find me a substring". Walter would certainly have a different stance the moment he observed the extra bulk of object code for these.
 (that can be avoided)
How? I'm not talking about `x < 0x80` branches, these wouldn't cost a dime.

I really don't feel strongly about the 6th point. I see it as a good idea to allow custom alphabets and reap performance benefits where it makes sense; the need for that is less urgent though.
 and that we should
 go farther into the future instead of catering to an obsolete
 representation.
That is something I agree with. -- Dmitry Olshansky
Mar 09 2014
prev sibling parent reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 09/03/14 04:26, Andrei Alexandrescu wrote:
 2. Add byChar that returns a random-access range iterating a string by
 character. Add byWchar that does on-the-fly transcoding to UTF16. Add
 byDchar that accepts any range of char and does decoding. And such
 stuff. Then whenever one wants to go through a string by code point
 can just use str.byChar.
This is confusing. Did you mean to say that byChar iterates a string by code unit (not character / code point)?
Unit. s.byChar.front is a (possibly ref, possibly qualified) char.
So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about? In which case it seems to me a better solution -- "safe" strings by default, unsafe speed-focused solution available if you want it. ("Safe" here in the more general sense of "Doesn't generate unexpected errors" rather than memory safety.)
Mar 09 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
 On 09/03/14 04:26, Andrei Alexandrescu wrote:
 2. Add byChar that returns a random-access range iterating a string by
 character. Add byWchar that does on-the-fly transcoding to UTF16. Add
 byDchar that accepts any range of char and does decoding. And such
 stuff. Then whenever one wants to go through a string by code point
 can just use str.byChar.
This is confusing. Did you mean to say that byChar iterates a string by code unit (not character / code point)?
Unit. s.byChar.front is a (possibly ref, possibly qualified) char.
So IIUC iterating over s.byChar would not encounter the decoding-related speed hits that Walter is concerned about?
That is correct. Andrei
Mar 09 2014
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu 
wrote:
 On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
 So IIUC iterating over s.byChar would not encounter the 
 decoding-related
 speed hits that Walter is concerned about?
That is correct.
Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar.
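A sketch of the shape that special-casing would take; ByCodeUnit stands in for whatever hypothetical wrapper byChar would return (as sketched earlier in the thread), and fastProcess is a made-up placeholder:

    // Peel off the known wrapper, take the fast array path, re-wrap.
    // Only works because the wrapper type is named in the check.
    auto fastProcess(R)(R r)
    {
        static if (is(R == ByCodeUnit!C, C))
            return byCodeUnit(fastProcess(r.str)); // unwrap and re-wrap
        else
            return r; // generic (slow) path elided
    }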
Mar 09 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 10:21 AM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote:
 On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
 So IIUC iterating over s.byChar would not encounter the decoding-related
 speed hits that Walter is concerned about?
That is correct.
Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar.
Good point. Off the top of my head I can't remember any algorithm that relies on array representation to do better on arrays than on random-access ranges offering all of arrays' primitives. But I'm sure there are a few. Andrei
Mar 09 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Mar-2014 21:45, Andrei Alexandrescu writes:
 On 3/9/14, 10:21 AM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote:
 On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
 So IIUC iterating over s.byChar would not encounter the
 decoding-related
 speed hits that Walter is concerned about?
That is correct.
Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar.
Good point. Off the top of my head I can't remember any algorithm that relies on array representation to do better on arrays than on random-access ranges offering all of arrays' primitives. But I'm sure there are a few.
copy to begin with. And it's about 80x faster with plain arrays. -- Dmitry Olshansky
Mar 09 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 11:14 AM, Dmitry Olshansky wrote:
 09-Mar-2014 21:45, Andrei Alexandrescu writes:
 On 3/9/14, 10:21 AM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 17:18:47 UTC, Andrei Alexandrescu wrote:
 On 3/9/14, 5:28 AM, Joseph Rushton Wakeling wrote:
 So IIUC iterating over s.byChar would not encounter the
 decoding-related
 speed hits that Walter is concerned about?
That is correct.
Unless I'm missing something, all algorithms that can work faster on arrays will need to be adapted to also recognize byChar-wrapped arrays, unwrap them, perform the fast array operation, and wrap them back in a byChar.
Good point. Off the top of my head I can't remember any algorithm that relies on array representation to do better on arrays than on random-access ranges offering all of arrays' primitives. But I'm sure there are a few.
copy to begin with. And it's about 80x faster with plain arrays.
Question is if there are a bunch of them. Andrei
Mar 09 2014
prev sibling parent "Joseph Cassman" <jc7919 outlook.com> writes:
On Sunday, 9 March 2014 at 01:23:27 UTC, Andrei Alexandrescu 
wrote:
 I was thinking of these too:

 1. Revisit std.encoding and perhaps confer legitimacy to the 
 character types defined there. The implementation in 
 std.encoding is wanting, but I think the idea is sound. 
 Essentially give more love to various encodings, including 
 Ascii and "bypass encoding, I'll deal with stuff myself".

 2. Add byChar that returns a random-access range iterating a 
 string by character. Add byWchar that does on-the-fly 
 transcoding to UTF16. Add byDchar that accepts any range of 
 char and does decoding. And such stuff. Then whenever one wants 
 to go through a string by code point can just use str.byChar.


 Andrei
I like these two points you make here. In particular, I like the recent addition of byGrapheme, and other ideas along this line which provide a custom range interface to a string. Such additions do not break code but add opt-in functionality for those who need it, while leaving the default case intact.

Overall, I think the current string design in D2 strikes a nice balance between performance and functionality. It does not reach Unicode perfection but gets rather close to good usability while still maintaining good C compatibility and performance in the default case.

As for Walter's original post regarding the use of decode by default in std.array.front, if I had it my way, I would prefer all performance hits to be explicit so that I know what I am paying for by simply reading the code. Nonetheless, this change will break code in the wild relying on its current behavior. As a result, I feel that making such a fundamental change would be better postponed until the next major version of D is considered.

D currently seems to carry much hope due to its potential, but is struggling to gain a reputation as a reliable, quality, production-ready language. If such fundamental changes are made at this point it will do a lot of harm to D's reputation, which it may never recover from. Rather than making such a change now, I feel that fixing all open issues in bugzilla and 'completing' D2 would do much good. Then, near the close of implementing D2, a new library implementation of text capabilities could be prototyped for D3 and flagged as beta-please-test-but-avoid-use-in-production-code. Such an approach would benefit from the insights gained from implementing this version in D2 and also get much-needed input from actual usage.

Joseph
Mar 08 2014
prev sibling parent reply "Sean Kelly" <sean invisibleduck.org> writes:
On Saturday, 8 March 2014 at 20:50:49 UTC, Andrei Alexandrescu 
wrote:
 Pretty much everyone using ICU hates it.
I think the biggest problem with ICU is documentation. It can take a long time to figure out how to do something if you've never done it before. Also, the C interface in ICU seems better than the C++ interface. And I'll grant that a few things are just far harder than they need to be. I wanted a transcoding iterator and ICU almost has this but not quite, so I've got to write my own. In fact, iterating across an arbitrary encoding in general is at least not intuitive and perhaps not possible. I kinda gave up on that. Um, and using UTF-16 as the standard encoding, which means many transcoding operations require two conversions.

Okay, I guess there are a lot of problems with ICU, but it handles nearly every requirement I have, which is in itself quite a lot.
Mar 08 2014
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Sat, 08 Mar 2014 22:07:09 +0000
schrieb "Sean Kelly" <sean invisibleduck.org>:

 On Saturday, 8 March 2014 at 20:50:49 UTC, Andrei Alexandrescu 
 wrote:
 Pretty much everyone using ICU hates it.
 I think the biggest problem with ICU is documentation. It can take a long time to figure out how to do something if you've never done it before. Also, the C interface in ICU seems better than the C++ interface. And I'll grant that a few things are just far harder than they need to be. I wanted a transcoding iterator and ICU almost has this but not quite, so I've got to write my own. In fact, iterating across an arbitrary encoding in general is at least not intuitive and perhaps not possible. I kinda gave up on that. Um, and using UTF-16 as the standard encoding, which means many transcoding operations require two conversions. Okay, I guess there are a lot of problems with ICU, but it handles nearly every requirement I have, which is in itself quite a lot.
You find the answer here:
http://userguide.icu-project.org/icufaq#TOC-What-is-the-performance-difference-between-UTF-8-and-UTF-16-

In addition it is infeasible to maintain code for direct conversions with all the encodings they support. The project doesn't aim at providing a specific transcoding but all of them equally. What can you do. For Java it is easier to accept since they use UTF-16 internally.

-- Marco
Mar 19 2014
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Mar 08, 2014 at 08:38:40PM +0000, Vladimir Panteleev wrote:
 On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu
 wrote:
Searching for characters in strings would be difficult to deem
inappropriate.
The notion of "character" exists only in certain writing systems. Searching by character is thus a flawed practice, and I think it should not be encouraged, as it will only make writing truly-international software more difficult. A more correct approach is searching for a certain substring. If non-exact matching is needed (normalization, case insensitivity etc.), then the appropriate solution is to use the Unicode algorithms.
+1. Most "character"-based Unicode string operations are actually *substring* operations, because the notion of "character" is not universal to every writing system, and doesn't map 1-to-1 to Unicode code points anyway. I would argue that most instances of code that perform character-based operations on strings are incorrect, in the sense that they will fail to correctly process strings in certain languages. [...]
From experience with C++ I knew (1) had a bad track record, and
(2) "generically conservative, specialize for speed" was a
successful pattern.

What would you have chosen given that context?
Ideally, we would have the Unicode algorithms in the standard library from day 1, and advocated their use throughout the documentation.
+1. I came to D expecting this to be the case... and was a little let down when I discovered the actual state of affairs in std.uni at the time. Thankfully, things have improved since, and all those who worked on that have my gratitude. But it's still not quite there yet. [...]
So the problem to me is that we're stuck not fixing something that's
horribly broken just because it's broken in a way that people
presumably now expect.
Clearly I'm being subjective here but again I'd find it difficult to get convinced we have something horribly broken from the evidence I gathered inside and outside Facebook.
Have you or anyone you personally know tried to process text in D containing a writing system such as Sanskrit's?
[...] Or more to the point, do you know of any experience that you can share about code that attempts to process these sorts of strings on a per character basis? My suspicion is that any code that operates on such strings, if they have any claim to correctness at all, must be substring-based, rather than character-based. T -- I think Debian's doing something wrong, `apt-get install pesticide', doesn't seem to remove the bugs on my system! -- Mike Dresser
Mar 08 2014
parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Saturday, 8 March 2014 at 20:52:40 UTC, H. S. Teoh wrote:
 Or more to the point, do you know of any experience that you 
 can share
 about code that attempts to process these sorts of strings on a 
 per
 character basis? My suspicion is that any code that operates on 
 such
 strings, if they have any claim to correctness at all, must be
 substring-based, rather than character-based.
That's pretty much it. Unless you are working in the confines of certain languages (alphabets, scripts, etc.), many notions that are valid for English or European languages lose meaning in general. This includes the notion of "characters" - at full abstraction, you can only treat a string as a stream of code units (or code points, if you wish, but as has been discussed to death this is rarely useful).

An application which has to handle user text (said text being possibly in any language) has to pretty much treat string variables as "holy":
- no indexing
- no slicing
- no counting anything
- no toUpper/toLower (std.ascii or std.uni)
etc.

All processing and transformations (line breaking, normalization, etc.) need to be done using the relevant Unicode algorithms.
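As a minimal illustration with today's std.uni (canonical equality only; case folding and collation deliberately left out, and the helper name is mine):

    import std.uni;

    // Normalize both sides to NFC, then compare raw code units.
    bool canonicallyEqual(string a, string b)
    {
        return normalize!NFC(a) == normalize!NFC(b);
    }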
 [a-z] makes sense in English, and [а-я] makes sense in Russian
[а-я] makes sense for Russian, but it doesn't for Ukrainian, just as [a-z] is useless for Portuguese. There are probably only a few such ranges in Unicode which encompass exactly one alphabet, due to how much letters overlap across alphabets of similar languages.
Mar 08 2014
prev sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu 
wrote:
 The current approach is a cut above treating strings as arrays 
 of bytes
 for some languages, and still utterly broken for others. If I'm
 operating on a right to left language like Hebrew, what would 
 I expect
 the result to be from something like countUntil?
The entire string processing paraphernalia is left to right. I figure RTL languages are under-supported, but s.retro.countUntil comes to mind. Andrei
I'm pretty sure that all string operations are actually "front to back". If I recall correctly, even languages that "read" right to left are stored in a front-to-back manner: e.g. string[0] would be the right-most character. It is only a question of "display", and changes nothing in the code. As for "countUntil", it would still work perfectly fine, as an RTL reader would expect the counting to start at the "beginning", e.g. the "right" side.

I'm pretty confident RTL is 100% supported. The only issue is the "front"/"left" ambiguity, and the only one I know of is the oddly named "stripLeft" function, which actually does a "stripFront" anyways. So I wouldn't worry about RTL.

But as mentioned, it is languages such as the Indic ones, that have complex graphemes, or languages with accented characters, e.g. most European ones, that can have problems, such as canFind("cassé", 'e').

On topic, I think D's implicit default decode to dchar is *infinity* times better than C++'s char-based strings. While imperfect in terms of grapheme, it was still a design decision made of win.

I'd be tempted to not ask "how do we back out", but rather, "how can we take this further"? I'd love to ditch the whole "char"/"dchar" thing altogether, and work with graphemes. But that would be massive involvement.
Mar 09 2014
next sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
 On topic, I think D's implicit default decode to dchar is 
 *infinity* times better than C++'s char-based strings. While 
 imperfect in terms of grapheme, it was still a design decision 
 made of win.

 I'd be tempted to not ask "how do we back out", but rather, 
 "how can we take this further"? I'd love to ditch the whole 
 "char"/"dchar" thing altogether, and work with graphemes. But 
 that would be massive involvement.
Why do you think it is better?

Let's be clear here: if you are searching/iterating/comparing by code point then your program is either not correct, or no better than doing so by code unit. Graphemes don't really fix this either.

I think this is the main confusion: the belief that iterating by code point has utility.

If you care about normalization then neither by code unit, by code point, nor by grapheme are correct (except in certain language subsets). If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos.

AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'), but as Vladimir correctly points out: (a) by code point, this is still broken in the face of normalization, and (b) are there any real applications that search a string for a specific non-ASCII character?

To those that think the status quo is better, can you give an example of a real-life use case that demonstrates this?

I do think it's probably too late to change this, but I think there is value in at least getting everyone on the same page.
Mar 09 2014
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 9 March 2014 at 11:34:31 UTC, Peter Alexander wrote:
 On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
 On topic, I think D's implicit default decode to dchar is 
 *infinity* times better than C++'s char-based strings. While 
 imperfect in terms of grapheme, it was still a design decision 
 made of win.

 I'd be tempted to not ask "how do we back out", but rather, 
 "how can we take this further"? I'd love to ditch the whole 
 "char"/"dchar" thing altogether, and work with graphemes. But 
 that would be massive involvement.
Why do you think it is better? Let's be clear here: if you are searching/iterating/comparing by code point then your program is either not correct, or no better than doing so by code unit. Graphemes don't really fix this either. I think this is the main confusion: the belief that iterating by code point has utility. If you care about normalization then neither by code unit, by code point, nor by grapheme are correct (except in certain language subsets). If you don't care about normalization then by code unit is just as good as by code point, but you don't need to specialise everywhere in Phobos.
IMO, the "normalization" argument is overrated. I've yet to encounter a real-world case of normalization: only hand written counter-examples. Not saying it doesn't exist, just that: 1. It occurs only in special cases that the program should be aware of before hand. 2. Arguably, be taken care of eagerly, or in a special pass. As for "the belief that iterating by code point has utility." I have to strongly disagree. Unicode is composed of codepoints, and that is what we handle. The fact that it can be be encoded and stored as UTF is implementation detail. As for the grapheme thing, I'm not actually so sure about it myself, so don't take it too seriously.
 AFAIK, there is only one exception, stuff like s.all!(c => c == 
 'é'), but as Vladimir correctly points out: (a) by code point, 
 this is still broken in the face of normalization, and (b) are 
 there any real applications that search a string for a specific 
 non-ASCII character?
But *what* other kinds of algorithms are there? AFAIK, the *only* type of algorithm that doesn't need decoding is searching, and you know what? std.algorithm.find does it perfectly well. This trickles into most other algorithms too: split, splitter or findAmong don't decode if they don't have to.

AFAIK, the most common algorithm, "case insensitive search", *must* decode. There may still be cases where it is still not working as intended in the face of normalization, but it is still leaps and bounds better than what we get iterating with codeunits.

To turn it the other way around, *what* are you guys doing that doesn't require decoding, and where performance is such a killer?
 To those that think the status quo is better, can you give an 
 example of a real-life use case that demonstrates this?
I do not know of a single bug report in regards to buggy phobos code that used front/popFront. Not_a_single_one (AFAIK).

On the other hand, there are plenty of cases of bugs for attempting to not decode strings, or incorrectly decoding strings. They are being corrected on a continuous basis.

Seriously, Bearophile suggested "ABCD".sort(), and it took about 6 pages (!) for someone to point out this would be wrong. Even Walter pointed out that such code should work. *Maybe* it is still wrong in regards to graphemes and normalization, but at *least*, the result is not a corrupted UTF-8 stream.

Walter keeps grinding on about "myCharArray.put('é')" not working, but I'm not sure he realizes how dangerous it would actually be to allow such a thing to work.

In particular, in all these cases, a simple call to "representation" will deactivate the feature, giving you the tools you want.
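For reference, that opt-out is a real Phobos helper:

    import std.string : representation;

    // representation reinterprets the string as its code units
    // (immutable(ubyte)[] here), so nothing downstream decodes.
    auto raw = "cassé".representation;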
 I do think it's probably too late to change this, but I think 
 there is value in at least getting everyone on the same page.
Me too. I do see the value in being able to do decode-less iteration. I just think the *default* behavior has the advantage of being correct *most* of the time, and definitely much more correct than without decoding. I think opt-out of decoding is just a much much much saner approach to string handling.
Mar 09 2014
next sibling parent Michel Fortin <michel.fortin michelf.ca> writes:
On 2014-03-09 13:00:45 +0000, "monarch_dodra" <monarchdodra gmail.com> said:

 AFAIK, the most common algorithm "case insensitive search" *must* decode.
Not necessarily. While the Unicode collation algorithms (which should be used to compare text) are defined in terms of code points, you could build a collation element table using code units as keys and bypass the decoding step for searching the table. I'm not sure if there would be a significant performance gain, though.

In any case that remains an optimization. The natural way to implement a Unicode algorithm is to base it on code points.

-- 
Michel Fortin
michel.fortin michelf.ca
http://michelf.ca
Mar 09 2014
prev sibling next sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote:
 IMO, the "normalization" argument is overrated. I've yet to 
 encounter a real-world case of normalization: only hand written 
 counter-examples. Not saying it doesn't exist, just that:
 1. It occurs only in special cases that the program should be 
 aware of beforehand.
 2. Arguably, be taken care of eagerly, or in a special pass.

 As for "the belief that iterating by code point has utility." I 
 have to strongly disagree. Unicode is composed of codepoints, 
 and that is what we handle. The fact that it can be encoded 
 and stored as UTF is an implementation detail.
We don't "handle" code points (when have you ever wanted to handle a combining character separate to the character it combines with?) You are just thinking of a subset of languages and locales. Normalization is an issue any time you have a user enter text into your program and you then want to search for that text. I hope we can agree this isn't a rare occurrence.
 AFAIK, there is only one exception, stuff like s.all!(c => c 
 == 'é'), but as Vladimir correctly points out: (a) by code 
 point, this is still broken in the face of normalization, and 
 (b) are there any real applications that search a string for a 
 specific non-ASCII character?
But *what* other kinds of algorithms are there? AFAIK, the *only* type of algorithm that doesn't need decoding is searching, and you know what? std.algorithm.find does it perfectly well. This trickles into most other algorithms too: split, splitter or findAmong don't decode if they don't have too.
Searching, equality testing, copying, sorting, hashing, splitting, joining...

I can't think of a single use-case for searching for a non-ASCII code point. You can search for strings, but searching by code unit is just as good (and fast by default).
 AFAIK, the most common algorithm "case insensitive search" 
 *must* decode.
But it must also normalize and take locales into account, so by code point is insufficient (unless you are willing to ignore languages like Turkish). See Turkish I:
http://en.wikipedia.org/wiki/Turkish_I

Sure, if you just want to ignore normalization and several languages then by code point is just fine... but that's the point: by code point is incorrect in general.
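A small demonstration of the locale problem with a per-code-point mapping - std.uni's toLower is locale-agnostic by design:

    import std.uni : toLower;

    void main()
    {
        // Right for English, wrong for Turkish, where 'I' should map
        // to dotless 'ı' (U+0131). No code-point-level API can know.
        assert(toLower('I') == 'i');
    }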
 There may still be cases where it is still not working as 
 intended in the face of normalization, but it is still leaps 
 and bounds better than what we get iterating with codeunits.

 To turn it the other way around, *what* are you guys doing, 
 that doesn't require decoding, and where performance is such a 
 killer?
Searching, equality testing, copying, sorting, hashing, splitting, joining...

The performance thing can be fixed in the library, but my concern is that (a) it takes a significant amount of code to do so, and (b) it complicates implementations. There are many, many algorithms in Phobos that are special-cased for strings, and I don't think it needs to be that way.
 To those that think the status quo is better, can you give an 
 example of a real-life use case that demonstrates this?
I do not know of a single bug report in regards to buggy phobos code that used front/popFront. Not_a_single_one (AFAIK). On the other hand, there are plenty of cases of bugs for attempting to not decode strings, or incorrectly decoding strings. They are being corrected on a continuous basis.
Can you provide a link to a bug?

Also, you haven't answered the question :-) Can you give a real-life example of a case where code point decoding was necessary where code units wouldn't have sufficed? You have mentioned case-insensitive searching, but I think I've adequately demonstrated that this doesn't work in general by code point: you need to normalize and take locales into account.
Mar 09 2014
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 9 March 2014 at 14:57:32 UTC, Peter Alexander wrote:
 You have mentioned case-insensitive searching, but I think I've 
 adequately demonstrated that this doesn't work in general by 
 code point: you need to normalize and take locales into account.
I don't understand your argument. Is it "by code point is not 100% correct, so let's just drop it and go for raw code units instead"? We *are* arguing about whether or not "front/popFront" should decode by dchar, right...?

You mention the algorithms "Searching, equality testing, copying, sorting, hashing, splitting, joining..." I said "by codepoint is not correct", but I still think it's a hell of a lot more accurate than by codeunit. Unless you want to ignore any and all algorithms that take a predicate?

You say "unless you are willing to ignore languages like Turkish", but... if you don't decode front, then aren't you just ignoring *all* languages that basically aren't English...?

As I said, maybe by codepoint is not correct, but if it isn't, I think we should be moving further *into* the correct behavior by default, not away from it.
Mar 09 2014
prev sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote:
 As for "the belief that iterating by code point has utility." I 
 have to strongly disagree. Unicode is composed of codepoints, 
 and that is what we handle. The fact that it can be be encoded 
 and stored as UTF is implementation detail.
But you don't deal with Unicode. You deal with *text*. Unless you are implementing Unicode algorithms, code points solve nothing in the general case.
 Seriously, Bearophile suggested "ABCD".sort(), and it took 
 about 6 pages (!) for someone to point out this would be wrong.
Sorting a string has quite limited use in the general case, so I think this is another artificial example.
 Even Walter pointed out that such code should work. *Maybe* it 
 is still wrong in regards to graphemes and normalization, but 
 at *least*, the result is not a corrupted UTF-8 stream.
I think this is no worse than putting all combining marks clustered at the end of the string, thus attached to the last non-combining letter.
Mar 09 2014
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Vladimir Panteleev:

 Seriously, Bearophile suggested "ABCD".sort(), and it took 
 about 6 pages (!) for someone to point out this would be wrong.
Sorting a string has quite limited use in the general case,
It seems I am sorting arrays of mutable ASCII chars often enough :-)

Some time ago I even asked for a helper function:
https://d.puremagic.com/issues/show_bug.cgi?id=10162

Bye,
bearophile
Mar 09 2014
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 16:02:55 UTC, bearophile wrote:
 Vladimir Panteleev:

 Seriously, Bearophile suggested "ABCD".sort(), and it took 
 about 6 pages (!) for someone to point out this would be 
 wrong.
Sorting a string has quite limited use in the general case,
It seems I am sorting arrays of mutable ASCII chars often enough :-)
What do you use this for? I can think of sort being useful e.g. to see which characters appear in a string (and with which frequency), but as the concept does not apply to all languages, one would need to draw a line somewhere for which languages they want to support. I think this should be done explicitly in user code.
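For instance, a frequency count needs no sorting at all - a quick sketch:

    void main()
    {
        size_t[dchar] freq;
        foreach (dchar c; "hello") // explicit decoding, by request
            ++freq[c];
        assert(freq['l'] == 2);
    }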
Mar 09 2014
parent "bearophile" <bearophileHUGS lycos.com> writes:
Vladimir Panteleev:

 What do you use this for?
For lots of different reasons (counting, testing, histograms, to unique-ify, to allow binary searches, etc.) - though you can find alternative solutions for every one of those use cases.
 I can think of sort being useful e.g. to see which characters 
 appear in a string (and with which frequency), but as the 
 concept does not apply to all languages, one would need to draw 
 a line somewhere for which languages they want to support. I 
 think this should be done explicitly in user code.
So far I have needed to sort 7-bit ASCII chars. Bye, bearophile
Mar 09 2014
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 9:02 AM, bearophile wrote:
 Some time ago I even asked for a helper function:
 https://d.puremagic.com/issues/show_bug.cgi?id=10162
I commented on that and preapproved it. Andrei
Mar 09 2014
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 4:34 AM, Peter Alexander wrote:
 I think this is the main confusion: the belief that iterating by code
 point has utility.

 If you care about normalization then neither by code unit, by code
 point, nor by grapheme are correct (except in certain language subsets).
I suspect that code point iteration is the worst as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code unit iteration that works with a larger spectrum of languages. One question would be how large that spectrum is. If it's larger than English, then that would be nice because we would've made progress.

I don't know about normalization beyond discussions in this group, but as far as I understand from http://www.unicode.org/faq/normalization.html, normalization would be a one-step process, after which code point iteration would cover still more human languages. No? I'm pretty sure it's more complicated than that, so please illuminate me :o).
 If you don't care about normalization then by code unit is just as good
 as by code point, but you don't need to specialise everywhere in Phobos.

 AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
 but as Vladimir correctly points out: (a) by code point, this is still
 broken in the face of normalization, and (b) are there any real
 applications that search a string for a specific non-ASCII character?
What happened to counting characters and such?
 To those that think the status quo is better, can you give an example of
 a real-life use case that demonstrates this?
split(ter) comes to mind.
 I do think it's probably too late to change this, but I think there is
 value in at least getting everyone on the same page.
Awesome. Andrei
Mar 09 2014
next sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Sunday, 9 March 2014 at 17:15:59 UTC, Andrei Alexandrescu 
wrote:
 On 3/9/14, 4:34 AM, Peter Alexander wrote:
 I think this is the main confusion: the belief that iterating 
 by code
 point has utility.

 If you care about normalization then neither by code unit, by 
 code
 point, nor by grapheme are correct (except in certain language 
 subsets).
I suspect that code point iteration is the worst as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code unit iteration that works with a larger spectrum of languages. One question would be how large that spectrum it is. If it's larger than English, then that would be nice because we would've made progress. I don't know about normalization beyond discussions in this group, but as far as I understand from http://www.unicode.org/faq/normalization.html, normalization would be a one-step process, after which code point iteration would cover still more human languages. No? I'm pretty sure it's more complicated than that, so please illuminate me :o).
It depends what you mean by "cover" :-) If we assume strings are normalized then substring search, equality testing, sorting all work the same with either code units or code points.
 If you don't care about normalization then by code unit is 
 just as good
 as by code point, but you don't need to specialise everywhere 
 in Phobos.

 AFAIK, there is only one exception, stuff like s.all!(c => c 
 == 'é'),
 but as Vladimir correctly points out: (a) by code point, this 
 is still
 broken in the face of normalization, and (b) are there any real
 applications that search a string for a specific non-ASCII 
 character?
What happened to counting characters and such?
I can't think of any case where you would want to count characters.

* If you want an index to slice from, then you need code units.
* If you want a buffer size, then you need code units.
* If you are doing something like word wrapping then you need to count glyphs, which is not the same as counting code points (and that only works with mono-spaced fonts anyway -- with variable-width fonts you need to add up the widths of those glyphs)
 To those that think the status quo is better, can you give an 
 example of
 a real-life use case that demonstrates this?
split(ter) comes to mind.
splitter is just an application of substring search, no? Substring search works the same with both code units and code points (e.g. strstr in C works with UTF-encoded strings without any need to decode). All you need to do is ensure that mismatched encodings in the delimiter are re-encoded (you want to do this for performance anyway):

    import std.utf : encode;

    // Encode the dchar delimiter once, then forward to the
    // string-delimiter overload of splitter.
    auto splitter(string str, dchar delim)
    {
        char[4] enc;
        return splitter(str, enc[0 .. encode(enc, delim)]);
    }
Mar 09 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 10:34 AM, Peter Alexander wrote:
 If we assume strings are normalized then substring search, equality
 testing, sorting all work the same with either code units or code points.
But others such as edit distance or equal(some_string, some_wstring) will not.
 If you don't care about normalization then by code unit is just as good
 as by code point, but you don't need to specialise everywhere in Phobos.

 AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
 but as Vladimir correctly points out: (a) by code point, this is still
 broken in the face of normalization, and (b) are there any real
 applications that search a string for a specific non-ASCII character?
What happened to counting characters and such?
I can't think of any case where you would want to count characters.
wc

(Generally: I've always been very very very doubtful about arguments that start with "I can't think of..." because I've historically tried them so many times, and with terrible results.)

Andrei
Mar 09 2014
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu 
wrote:
 wc
What should wc produce on a Sanskrit text? The problem is that such questions quickly become philosophical.
 (Generally: I've always been very very very doubtful about 
 arguments that start with "I can't think of..." because I've 
 historically tried them so many times, and with terrible 
 results.)
I agree, which is why I think that although such arguments are not unwelcome, it's much better to find out by experiment. Break something in Phobos and see how much of your code is affected :)
Mar 09 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Mar-2014 21:54, Vladimir Panteleev writes:
 On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote:
 wc
What should wc produce on a Sanskrit text? The problem is that such questions quickly become philosophical.
Technically it could use the word-breaking algorithm for words. Or count grapheme clusters, or count code points; all of these may have value, depending on the user and writing system.

-- Dmitry Olshansky
Mar 09 2014
prev sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu 
wrote:
 On 3/9/14, 10:34 AM, Peter Alexander wrote:
 If we assume strings are normalized then substring search, 
 equality
 testing, sorting all work the same with either code units or 
 code points.
But others such as edit distance or equal(some_string, some_wstring) will not.
equal(string, wstring) should either not compile, or be overloaded to do the right thing. In an ideal world, char, wchar, and dchar should not be comparable.

Edit distance on code points is of questionable utility. Like Vladimir says, its meaning is pretty philosophical, even in ASCII (is "\r\n" really two "edits"? What is an "edit"?)
 I can't think of any case where you would want to count 
 characters.
wc
% echo € | wc -c
4

:-)
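The same distinction in D, for comparison: length counts code units, while walkLength decodes.

    import std.range : walkLength;

    void main()
    {
        string s = "€";
        assert(s.length == 3);     // UTF-8 code units
        assert(s.walkLength == 1); // code points, via the decoding front
    }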
 (Generally: I've always been very very very doubtful about 
 arguments that start with "I can't think of..." because I've 
 historically tried them so many times, and with terrible 
 results.)
Fair point... but it's not as if we would be removing the ability (you could always do s.byCodePoint.count); we are talking about defaults. The argument that we shouldn't iterate by code unit by default because people might want to count code points is without substance. Also, with the proposal, string.count(dchar) would encode the dchar to a string first for performance, so it would still work.

Anyway, I think this discussion isn't really going anywhere so I think I'll agree to disagree and retire.
Mar 09 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/9/14, 11:19 AM, Peter Alexander wrote:
 On Sunday, 9 March 2014 at 17:48:47 UTC, Andrei Alexandrescu wrote:
 On 3/9/14, 10:34 AM, Peter Alexander wrote:
 If we assume strings are normalized then substring search, equality
 testing, sorting all work the same with either code units or code
 points.
But others such as edit distance or equal(some_string, some_wstring) will not.
equal(string, wstring) should either not compile, or would be overloaded to do the right thing.
These would be possible designs each with its pros and cons. The current design works out of the box across all encodings. It has its own pros and cons. Puts in perspective what should and shouldn't be.
 In an ideal world, char, wchar, and dchar should
 not be comparable.
Probably. But that has nothing to do with equal() working.
 Edit distance on code points is of questionable utility. Like Vladimir
 says, its meaning is pretty philosophical, even in ASCII (is "\r\n"
 really two "edits"? What is an "edit"?)
Nothing philosophical - it's as cut and dried as it gets. An edit is as defined by the Levenshtein algorithm using code points as the unit of comparison.
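And Phobos' implementation does decode, so the unit of an edit is a code point:

    import std.algorithm : levenshteinDistance;

    void main()
    {
        // One substitution at the code point level, even though 'é'
        // occupies two UTF-8 code units.
        assert(levenshteinDistance("casse", "cassé") == 1);
    }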
 I can't think of any case where you would want to count characters.
wc
% echo € | wc -c 4 :-)
Noice.
 (Generally: I've always been very very very doubtful about arguments
 that start with "I can't think of..." because I've historically tried
 them so many times, and with terrible results.)
Fair point... but it's not as if we would be removing the ability (you could always do s.byCodePoint.count); we are talking about defaults. The argument that we shouldn't iterate by code unit by default because people might want to count code points is without substance. Also, with the proposal, string.count(dchar) would encode the dchar to a string first for performance, so it would still work.
That's a good enhancement for the current design as well - care to submit a request for it?
 Anyway, I think this discussion isn't really going anywhere so I think
 I'll agree to disagree and retire.
The part that advocates a breaking change will indeed not lead anywhere. The parts where we improve Unicode support for D are very fertile.

Andrei
Mar 09 2014
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
09-Mar-2014 21:16, Andrei Alexandrescu writes:
 On 3/9/14, 4:34 AM, Peter Alexander wrote:
 I think this is the main confusion: the belief that iterating by code
 point has utility.

 If you care about normalization then neither by code unit, by code
 point, nor by grapheme are correct (except in certain language subsets).
I suspect that code point iteration is the worst as it works only with ASCII and perchance with ASCII single-byte extensions. Then we have code unit iteration that works with a larger spectrum of languages.
Was clearly meant to be: code point <--> code unit
 One
 question would be how large that spectrum is. If it's larger than
 English, then that would be nice because we would've made progress.
Code points help only insofar as many (~all) high-level algorithms in Unicode are described in terms of code points. Code points have properties; code units do not have anything. Code points with assigned semantic value are "abstract characters".

It's up to the programmer implementing a particular algorithm to make it "as if" decoding really happened, either working directly on code units or doing the decoding and working with code points, which is simpler.

The current std.uni offering mostly works on code points and decodes; a crucial building block for working directly on code units is in review:
https://github.com/D-Programming-Language/phobos/pull/1685
 I don't know about normalization beyond discussions in this group, but
 as far as I understand from
 http://www.unicode.org/faq/normalization.html, normalization would be a
 one-step process, after which code point iteration would cover still
 more human languages. No? I'm pretty sure it's more complicated than
 that, so please illuminate me :o).
Technically most apps just assume, say, "input comes in UTF-8 that is in normalization form C". Others, such as browsers, strive to get a uniform representation for any input and normalize all input (oftentimes normalization turns out to be just a no-op).
 If you don't care about normalization then by code unit is just as good
 as by code point, but you don't need to specialise everywhere in Phobos.

 AFAIK, there is only one exception, stuff like s.all!(c => c == 'é'),
 but as Vladimir correctly points out: (a) by code point, this is still
 broken in the face of normalization, and (b) are there any real
 applications that search a string for a specific non-ASCII character?
What happened to counting characters and such?
Counting chars is dubious. But, for instance, collation is defined in terms of code points. Regex pattern matching is _defined_ in terms of codepoints (even the mystical level 3 Unicode support of it). So there is certain merit to working at that level.

But hacking it to be this way isn't the way to go. The least intrusive change would be to generalize the current choice w.r.t. RA ranges of char/wchar.

-- Dmitry Olshansky
Mar 09 2014
prev sibling next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
 On topic, I think D's implicit default decode to dchar is 
 *infinity* times better than C++'s char-based strings. While 
 imperfect in terms of grapheme, it was still a design decision 
 made of win.
Care to argument?
 I'd be tempted to not ask "how do we back out", but rather, 
 "how can we take this further"? I'd love to ditch the whole 
 "char"/"dchar" thing altogether, and work with graphemes. But 
 that would be massive involvement.
As has been discussed, this does not make sense. Graphemes are also a concept which applies only to certain writing systems; all it would do is exchange one set of tradeoffs for another, without solving anything. Text isn't that simple.
Mar 09 2014
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/9/2014 11:27 AM, Vladimir Panteleev wrote:
 On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
 On topic, I think D's implicit default decode to dchar is *infinity*
 times better than C++'s char-based strings. While imperfect in terms
 of grapheme, it was still a design decision made of win.
Care to argument?
It's simple: Breaking things on all non-English languages is worse than breaking things on non-western[1] languages. It's still breakage, and that *is* bad, but there's no question which breakage is significantly larger.

[1] (And yes, I realize "western" is a gross over-simplification here. Point is "one working language" vs "several working languages".)
Mar 10 2014
prev sibling parent "Sean Kelly" <sean invisibleduck.org> writes:
On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
 On Saturday, 8 March 2014 at 20:05:36 UTC, Andrei Alexandrescu 
 wrote:
 The current approach is a cut above treating strings as 
 arrays of bytes
 for some languages, and still utterly broken for others. If 
 I'm
 operating on a right to left language like Hebrew, what would 
 I expect
 the result to be from something like countUntil?
The entire string processing paraphernalia is left to right. I figure RTL languages are under-supported, but s.retro.countUntil comes to mind. Andrei
 I'm pretty sure that all string operations are actually "front to back". If I recall correctly, even languages that "read" right to left are stored in a front-to-back manner: e.g. string[0] would be the right-most character. It is only a question of "display", and changes nothing in the code. As for "countUntil", it would still work perfectly fine, as an RTL reader would expect the counting to start at the "beginning", e.g. the "right" side.

 I'm pretty confident RTL is 100% supported. The only issue is the "front"/"left" ambiguity, and the only one I know of is the oddly named "stripLeft" function, which actually does a "stripFront" anyways. So I wouldn't worry about RTL.
Yeah, I think RTL strings are preceded by a code point that indicates RTL display. It was just something I mentioned because some operations might be confusing to the programmer.
 But as mentioned, it is languages such as the Indic ones, that have complex graphemes, or languages with accented characters, e.g. most European ones, that can have problems, such as canFind("cassé", 'e').
True. I still question why anyone would want to do character-based operations on Unicode strings. I guess substring searches could even end up with the same problem in some cases if not implemented specifically for Unicode for the same reason, but those should be far less common.
Mar 09 2014
prev sibling next sibling parent reply "Andrea Fontana" <nospam example.com> writes:
I'm not sure I understood the point of this (long) thread.
The main problem is that decode() is called even when not needed?

Well, in this case that's not a problem only for strings. I found
this problem also when I was writing other ranges. For example,
when I read binary data from a db stream, front represents a single
row, and I decode it every time even if it's not needed.

On Friday, 7 March 2014 at 02:37:11 UTC, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up 
 about the automatic encoding and decoding of char ranges.

 Throughout D's history, there are regular and repeated 
 proposals to redesign D's view of char[] to pretend it is not 
 UTF-8, but UTF-32. I.e. so D will automatically generate code 
 to decode and encode on every attempt to index char[].

 I have strongly objected to these proposals on the grounds that:

 1. It is a MAJOR performance problem to do this.

 2. Very, very few manipulations of strings ever actually need 
 decoded values.

 3. D is a systems/native programming language, and 
 systems/native programming languages must not hide the 
 underlying representation (I make similar arguments about 
 proposals to make ints issue errors on overflow, etc.).

 4. Users should choose when decode/encode happens, not the 
 language.

 and I have been successful at heading these off. But one 
 slipped by me. See this in std.array:

    property dchar front(T)(T[] a)  safe pure if 
 (isNarrowString!(T[]))
   {
     assert(a.length, "Attempting to fetch the front of an empty 
 array of " ~
            T.stringof);
     size_t i = 0;
     return decode(a, i);
   }

 What that means is that if I implement an algorithm that 
 accepts, as input, an InputRange of char's, it will ALWAYS try 
 to decode it. This means that even:

    from.copy(to)

 will decode 'from', and then re-encode it for 'to'. And it will 
 do it SILENTLY. The user won't notice, and he'll just assume 
 that D performance sux. Even if he does notice, his options to 
 make his code run faster are poor.

 If the user wants decoding, it should be explicit, as in:

     from.decode.copy(encode!to)

 The USER should decide where and when the decoding goes. 
 'decode' should be just another algorithm.

 (Yes, I know that std.algorithm.copy() has some specializations 
 to take care of this. But these specializations would have to 
 be written for EVERY algorithm, which is thoroughly 
 unreasonable. Furthermore, copy()'s specializations only apply 
 if BOTH source and destination are arrays. If just one is, the 
 decode/encode penalty applies.)

 Is there any hope of fixing this?
Mar 10 2014
parent reply "Abdulhaq" <alynch4047 gmail.com> writes:
On Monday, 10 March 2014 at 10:52:02 UTC, Andrea Fontana wrote:
 I'm not sure I understood the point of this (long) thread.
 The main problem is that decode() is called even when it's not needed?
I'd like to offer up one D 'user' perspective, it's just a single data point but perhaps useful.

I write applications that process Arabic, and I'm thinking about converting one of those apps from Python to D, for performance reasons. My app deals with Unicode Arabic text that is 'out there', and the Unicode(TM) support for Arabic is not that well thought out, so the data is often (always) inconsistent in terms of sequencing diacritics etc. Even the code page can vary. Therefore my code has to cater to the various ways that other developers have sequenced the code points.

So, my needs as a 'user' are:
* I want to encode all incoming data immediately into Unicode, usually UTF-8, if it isn't already.
* I want to iterate over code points. I don't care about the raw data.
* When I get the length of my string it should be the number of code points.
* When I index my string it should return the nth code point.
* When I manipulate my strings I want to work with code points.
... you get the drift.

If I want to access the raw data, which I don't, then I'm very happy to cast to ubyte etc. If encode/decode is a performance issue then perhaps there could be a cache for recently used strings where the code point representation is held.

BTW to answer a question in the thread, yes the data is stored left-to-right and visualised right-to-left.
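For what it's worth, most of that wish list is already expressible in Phobos, just not through the built-in length and indexing. A sketch (the Arabic sample is arbitrary; walkLength, toUTFindex and to!dstring are existing Phobos names):

    import std.conv : to;
    import std.range : walkLength;
    import std.utf : toUTFindex;

    void main()
    {
        string s = "سلام";                 // stored as UTF-8
        assert(s.length == 8);             // code units, not letters
        assert(s.walkLength == 4);         // number of code points
        assert(s.to!dstring[2] == 'ا');    // the 3rd code point
        auto i = s.toUTFindex(2);          // code point index -> code unit index
        assert(s[i .. $].walkLength == 2); // slice from the 3rd code point
    }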
Mar 10 2014
next sibling parent reply "Andrea Fontana" <nospam example.com> writes:
In Italian we need Unicode too. We have several accented letters, 
and often programming languages don't handle UTF-8 and other 
encodings so well...

In D I never had any problem with this, and I work a lot on text 
processing.

So my question: is there any problem I'm missing in D with 
Unicode support, or is it just a performance problem in the algorithms?

If the problem is the performance of algorithms that use .front() 
but don't need to understand its data, why don't we add a 
.rawFront() property, implemented only where it makes sense, and 
then a "fallback" like:

    import std.range : isInputRange;

    auto rawFront(R)(R range)
        if (isInputRange!R && !__traits(compiles, range.rawFront))
    {
        return range.front;
    }

In this way, in copy() and other algorithms we can use rawFront(), 
and it stays backward compatible with other ranges too.
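A minimal sketch of how that could look; rawFront is hypothetical (not in Phobos), and the fallback tests hasMember rather than __traits(compiles, ...) so that the constraint doesn't refer to the very overload being resolved:

    import std.range : isInputRange;
    import std.traits : isNarrowString;

    // Raw access for narrow strings: the first code unit, undecoded.
    auto rawFront(C)(C[] a) if (isNarrowString!(C[]))
    {
        assert(a.length, "empty string");
        return a[0];
    }

    // Fallback for everything else: forward to the ordinary front.
    auto rawFront(R)(R r)
        if (isInputRange!R && !isNarrowString!R &&
            !__traits(hasMember, R, "rawFront"))
    {
        return r.front;
    }

    unittest
    {
        string s = "é";                // 0xC3 0xA9 in UTF-8
        assert(s.rawFront == 0xC3);    // a char, no decoding
        import std.range : iota;
        assert(iota(3).rawFront == 0); // non-string ranges fall back
    }

Algorithms that only shuffle data (copy and friends) could then call rawFront and keep working, unchanged, on every other range.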

But I guess I'm missing the point :)


On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote:
 [...]
Mar 10 2014
parent Johannes Pfau <nospam example.com> writes:
On Mon, 10 Mar 2014 14:05:03 +0000,
"Andrea Fontana" <nospam example.com> wrote:

 [...]

 So my question: is there any problem I'm missing in D with 
 Unicode support, or is it just a performance problem in the algorithms?
The only real problem apart from potential performance issues I've seen mentioned in this thread is that indexing/slicing is done with code units. I think this:

    auto index = countUntil(...);
    auto slice = str[0 .. index];

is really the only problem with the current implementation.

If we could start from scratch I'd say we keep operating on code points by default but don't make strings arrays of char/wchar/dchar. Instead they should be special types which do all operations (especially indexing, slicing) on code points. This would be as safe as the current implementation, always consistent but probably even slower in some cases. Then offer some nice way to get the raw data for algorithms which can deal with it.

However, I think it's too late to make these changes.
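His example bites in exactly this way; a small sketch of the mismatch, and of the explicit conversion that avoids it:

    import std.algorithm : countUntil;
    import std.utf : toUTFindex;

    void main()
    {
        string s = "héllo";
        auto i = s.countUntil('l'); // counts code points: i == 2
        auto slice = s[0 .. i];     // slices code units: "h" + half of "é"
        assert(slice != "hé");      // the silent mismatch
        assert(s[0 .. s.toUTFindex(i)] == "hé"); // explicit fix
    }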
Mar 10 2014
prev sibling parent reply "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Monday, 10 March 2014 at 13:48:44 UTC, Abdulhaq wrote:
 My app deals with unicode arabic text that is 'out there', and 
 the UnicodeTM support for Arabic is not that well thought out, 
 so the data is often (always) inconsistent in terms of 
 sequencing diacritics etc. Even the code page can vary. 
 Therefore my code has to cater to various ways that other 
 developers have sequenced the code points.

 So, my needs as a 'user' are:
 * I want to encode all incoming data immediately into unicode, 
 usually UTF8, if isn't already.
 * I want to iterate over code points. I don't care about the 
 raw data.
 * When I get the length of my string it should be the number of 
 code points.
 * When I index my string it should return the nth code point.
 * When I manipulate my strings I want to work with code points
 ... you get the drift.
Are you sure that code points is what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters...
Mar 10 2014
next sibling parent "Abdulhaq" <alynch4047 gmail.com> writes:
On Monday, 10 March 2014 at 18:54:26 UTC, Marc Schütz wrote:
 [...]
Are you sure that code points is what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters...
I checked the terminology before posting, so I'm pretty sure. Arabic has a code page for the logical characters: one code point for each letter of the alphabet, and others for the various diacritics. Because of the 'shaping', each logical character has various glyphs, found on other code pages. Text editing programs tend to store typed Arabic as the user entered it, and because there can be more than one diacritic per letter, the stored sequence varies with how the user typed them.
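For the precomposed-versus-combining half of that inconsistency, the normalization support in std.uni already helps; a small sketch (it won't repair every ad-hoc sequencing, but it canonicalises the standard cases):

    import std.uni : normalize, NFC;

    void main()
    {
        string precomposed = "\u00E9";  // 'é' as a single code point
        string combining   = "e\u0301"; // 'e' + combining acute accent
        assert(precomposed != combining);  // raw comparison differs
        assert(normalize!NFC(precomposed) ==
               normalize!NFC(combining));  // equal after NFC
    }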
Mar 10 2014
prev sibling parent "Abdulhaq" <alynch4047 gmail.com> writes:
On Monday, 10 March 2014 at 18:54:26 UTC, Marc Schütz wrote:
 [...]
Are you sure that code points is what you want? AFAIK there are lots of diacritics in Arabic, and I believe they are not precomposed with their carrying letters...
Adding to my other comment: I don't expect a string type to understand Arabic and merge the diacritics for me. In fact there are other symbols (code points) that can also be present, for instance instructions on how Quranic text is to be read. These issues have not been standardised and I would say are not generally well understood.
Mar 10 2014
prev sibling next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
On 07.03.2014 03:37, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up about the automatic
 encoding and decoding of char ranges.
After reading many of the attached posts, the question is: what could be D's future policy for introducing breaking changes? It's not a solution to say it's not possible because there would be too many breaking changes - that will become more and more of a problem for D's evolution, much like C++.
Mar 10 2014
next sibling parent reply "Dicebot" <public dicebot.lv> writes:
On Monday, 10 March 2014 at 14:05:39 UTC, dennis luehring wrote:
 On 07.03.2014 03:37, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up 
 about the automatic
 encoding and decoding of char ranges.
After reading many of the attached posts, the question is: what could be D's future policy for introducing breaking changes? It's not a solution to say it's not possible because there would be too many breaking changes - that will become more and more of a problem for D's evolution, much like C++.
Historically, 2 approaches have been practiced:

1) argue a lot and then do nothing
2) suddenly change something and tell users it was necessary

I also think that this is a much more important issue than this whole thread, but it does not seem to attract any real attention when mentioned.
Mar 10 2014
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Monday, 10 March 2014 at 14:11:13 UTC, Dicebot wrote:
 Historically, 2 approaches have been practiced:

 1) argue a lot and then do nothing
 2) suddenly change something and tell users it was necessary
These are one and the same, just from the two opposing points of view.
 I also think that this is a much more important issue than this 
 whole thread, but it does not seem to attract any real attention 
 when mentioned.
You mean the whole policy on breaking changes?
Mar 10 2014
parent reply "Dicebot" <public dicebot.lv> writes:
On Monday, 10 March 2014 at 14:27:02 UTC, Vladimir Panteleev 
wrote:
 [...]
These are one and the same, just from the two opposing points of view.
</sarcasm> :)
 I also think that this is a much more important issue than this 
 whole thread, but it does not seem to attract any real 
 attention when mentioned.
You mean the whole policy on breaking changes?
Yes. I had given up on this idea at some point, as there seemed to be a consensus that no breaking changes would even be considered for D2, and that those that come from fixing bugs are not worth the fuss. This is exactly why I was so shocked that Walter even started this thread.

If breaking changes are actually being considered (rare or not), then it is absolutely critical to define the process for it and put a link to its description on the dlang.org front page.
Mar 10 2014
parent reply "Yota" <yotaxp thatGoogleMailThing.com> writes:
On Monday, 10 March 2014 at 14:42:18 UTC, Dicebot wrote:
 Yes. I had given up on this idea at some point, as there 
 seemed to be a consensus that no breaking changes would even be 
 considered for D2, and that those that come from fixing bugs are 
 not worth the fuss.
So at what point are we going to discuss these things in the context of D-next? These topics have us banding together around compromises instead of ideals. As was said, D2 is at the 90% point; it only has room left for bug fixes. I think we would make much more productive use of our time and minds coming up with ideas that actually have a chance of coming to fruition, even if D3 ends up being half a decade away.
Mar 10 2014
parent Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
On 3/10/2014 7:35 PM, Yota wrote:
 [...]
So at what point are we going to discuss these things in the context of D-next?
Not until (at least) the D2/Phobos implementations mature, the current issues get worked out, and the library/tool ecosystem grows and matures.
Mar 10 2014
prev sibling parent "Abdulhaq" <alynch4047 gmail.com> writes:
 Historically, 2 approaches have been practiced:

 1) argue a lot and then do nothing
This happens (I think) because Andrei and Walter really value yours and other experts' opinions, but nevertheless have to preserve the general way things work in order to protect the long-term future of D. They have to be open to persuasion, but it would have to be very compelling to get them to change basics now - it seems to me. D is at that difficult 90% stage that we all know about, where the boring, difficult stuff is left to do. People like to discuss interesting new stuff, which at the time seems oh-so-important.
Mar 10 2014
prev sibling parent "Abdulhaq" <alynch4047 gmail.com> writes:
On Monday, 10 March 2014 at 14:05:39 UTC, dennis luehring wrote:
 On 07.03.2014 03:37, Walter Bright wrote:
 In "Lots of low hanging fruit in Phobos" the issue came up 
 about the automatic
 encoding and decoding of char ranges.
After reading many of the attached posts, the question is: what could be D's future policy for introducing breaking changes? It's not a solution to say it's not possible because there would be too many breaking changes - that will become more and more of a problem for D's evolution, much like C++.
I'm a newbie here but I've been waiting for D to mature for a long time. D IMHO has to stabilise now because:

* D needs a bigger community, so that the big fish who have learnt the ins and outs don't get bored and move on due to lack of kudos etc.
* To get the bigger community, D needs more _working_ libraries for major toolkits (GUI etc. etc.).
* Libraries will cease to work if there is significant change in D, and can then stay broken because there isn't the inertial mass of other developers to maintain them after the initial developer has moved on. You can see that this has happened a LOT.
* Anyway, the D that I read about in TDPL is already very exciting for programmers like myself; we just want that, thanks.

Breaking changes can go into D3, if and whenever that is. Keep breaking D2 now and it risks forevermore being just a playpen for computer-scientist types. Anyway, who cares what I think, but I think it reflects a lot of people's opinions too.
Mar 10 2014
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, March 06, 2014 18:37:13 Walter Bright wrote:
 Is there any hope of fixing this?
I agree with Andrei. I don't think that there's really anything to fix. The problem is that there are roughly 3 levels at which string operations can be done:

1. By code unit
2. By code point
3. By grapheme

and which is correct depends on what you're trying to do. Phobos attempts to go for correctness by default without seriously impacting performance. If we operated on code units by default, pretty much any algorithm which operated on individual characters would be broken, as unless your strings are ASCII-only, code units are very much the wrong level to be operating on if you're trying to deal with characters. If we operated on graphemes by default, we'd give up any hope of being reasonably performant. And those who want full correctness can use the grapheme support in std.uni.

We've gone to great lengths in Phobos to specialize on narrow strings in order to make it more efficient while still maintaining correctness, and anyone who really wants performance can do the same. But by operating on the code point level, we at least get a reasonable level of Unicode-correctness by default. With your suggestion, I'd fully expect most D programs to be wrong with regards to Unicode, because most programmers don't know or care about how Unicode works.

And changing what we're doing now would be code breakage of astronomical proportions. It would essentially break all uses of range-based string code. Certainly, it would be the largest code breakage that D has seen in years, if not ever. So, it's almost certainly a bad idea, but if it isn't, we need to be darn sure that what we change to is significantly better and worth the huge amount of code breakage that it will cause.

I really don't think that there's any way to get this entirely right. Regardless of which level you operate at by default - be it code unit, code point, or grapheme - it will be wrong a good chunk of the time. So, it becomes a question of which of the three has the best tradeoffs, and I think that our current solution of operating on code points by default does that. If there are things that we can do to better support operating on code units or graphemes for those who want it, then great. And it's great if we can find ways to make operating at the code point level more efficient or less prone to bugs due to not operating at the grapheme level. But I think that operating on the code point level like we currently do is by far the best approach.

If anything, it's the fact that the language itself doesn't do that which is the bigger concern IMHO - the main place where that's an issue being the fact that foreach iterates by code unit by default. But I don't know of a good way to solve that other than treating all arrays of char, wchar, and dchar specially and disabling their array operations like ranges do, so that you have to convert them via the representation function in order to operate on them as code units - which Andrei has suggested a number of times before, but you've shot him down each time. If that were fixed, then at least we'd be consistent, which is usually the biggest complaint with regards to how D treats strings. But I really don't think that there's a magical fix for range-based string operations, and I think that our current approach is a good one.

- Jonathan M Davis
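The foreach inconsistency he mentions is easy to demonstrate; a minimal sketch:

    void main()
    {
        string s = "é";                // one code point, two UTF-8 code units
        size_t units, points;
        foreach (c; s)       ++units;  // c is char: iterates code units
        foreach (dchar c; s) ++points; // opt-in decoding to code points
        assert(units == 2 && points == 1);
    }

So the language iterates by code unit unless asked otherwise, while range-based Phobos code decodes to code points by default, which is exactly the inconsistency complained about above.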
Mar 12 2014