digitalmars.D.learn - Narrow string is not a random access range

mist (5/5) Oct 23 2012 Was thinking on this topic after seeing this:

Andrei Alexandrescu (3/7) Oct 23 2012 Historical mistake.

mist (4/14) Oct 23 2012 Is string random access gonna be deprecated some day then or this

Andrei Alexandrescu (6/20) Oct 23 2012 Walter is unconvinced it's a mistake, which doesn't make it any easier.

Andrei Alexandrescu (2/25) Oct 23 2012 s/byte/code unit/

mist (4/4) Oct 23 2012 Hm, and all phobos functions should operate on narrow strings as

Simen Kjaeraas (6/10) Oct 23 2012 Preferably, yes. If there are performance (or other) benefits from

mist (19/29) Oct 24 2012 Probably I don't understand it fully, but D approach has always

Jonathan M Davis (72/106) Oct 24 2012 ===

mist (4/4) Oct 24 2012 Ok, just one question to make an official position clear: is

Jonathan M Davis (16/20) Oct 24 2012 Strings are always ranges of dchar, but if a function can operate on the...

mist (2/2) Oct 24 2012 Thanks, that is exactly what I wanted to clarify.

Jonathan M Davis (13/18) Oct 24 2012 Wait. No. I think that it's (mostly) okay. I was thinking that you could...

mist (9/9) Oct 24 2012 Wait. So you consider commonPrefix returning malformed string to

Jonathan M Davis (24/35) Oct 24 2012 Hmmm. Let me think this through for a moment. Every code point starts w...
H. S. Teoh (26/53) Oct 24 2012 [...]
Jonathan M Davis (3/14) Oct 24 2012 http://d.puremagic.com/issues/show_bug.cgi?id=8890

Timon Gehr (11/68) Oct 24 2012 There are plenty cases where it makes no difference, or iterating by

Jonathan M Davis (15/17) Oct 24 2012 Yes and no. They'd be arrays of code units, but any operations on them w...

mist (14/43) Oct 24 2012 What about a compromise - turning this proposal upside down and

Jonathan M Davis (16/29) Oct 24 2012 Well, to use ranges in general, you need to understand hasLength, hasSli...

Timon Gehr (5/19) Oct 23 2012 The other valid opinion is that the 'mistake' is in Phobos because it

Jonathan M Davis (8/12) Oct 23 2012 If it didn't, then range-based functions would be useless for strings in...

Timon Gehr (2/14) Oct 23 2012 That idea does not even deserve discussion.

Jonathan M Davis (22/41) Oct 23 2012 Actually, it solves the problem quite well, because you then have to wor...
mist (2/2) Oct 24 2012 Actually it is awesome.

Timon Gehr (3/5) Oct 24 2012 Obviously T[] should support indexing for any T.

Adam D. Ruppe (9/11) Oct 23 2012 As I said last time this came up, we could actually do this today

Simen Kjaeraas (5/15) Oct 23 2012 As long as typeof("") != String, this is not going to work:

Adam D. Ruppe (3/5) Oct 24 2012 Gah, I hate literals.

Jonathan M Davis (18/28) Oct 24 2012 It does take advantage of it in a number of cases but not necessarily

"mist" <none none.none> writes:

Was thinking on this topic after seeing this: 
http://stackoverflow.com/questions/13014999/cannot-slice-taker-from-std-range-in-d
Still can't understand rationale here. Why native slicing / 
random access is allowed for narrow strings, but trait explicitly 
negates this?

Oct 23 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/23/12 11:36 AM, mist wrote:
 Was thinking on this topic after seeing this:
 http://stackoverflow.com/questions/13014999/cannot-slice-taker-from-std-range-in-d

 Still can't understand rationale here. Why native slicing / random
 access is allowed for narrow strings, but trait explicitly negates this?

Historical mistake.

Andrei

Oct 23 2012

"mist" <none none.none> writes:

On Tuesday, 23 October 2012 at 15:55:23 UTC, Andrei Alexandrescu 
wrote:
 On 10/23/12 11:36 AM, mist wrote:
 Was thinking on this topic after seeing this:
 http://stackoverflow.com/questions/13014999/cannot-slice-taker-from-std-range-in-d

 Still can't understand rationale here. Why native slicing / 
 random
 access is allowed for narrow strings, but trait explicitly 
 negates this?

 Historical mistake.

 Andrei

Is string random access gonna be deprecated some day then or this 
is considered a mistake we need to get used to? :)

Oct 23 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/23/12 11:58 AM, mist wrote:
 On Tuesday, 23 October 2012 at 15:55:23 UTC, Andrei Alexandrescu wrote:
 On 10/23/12 11:36 AM, mist wrote:
 Was thinking on this topic after seeing this:
 http://stackoverflow.com/questions/13014999/cannot-slice-taker-from-std-range-in-d


 Still can't understand rationale here. Why native slicing / random
 access is allowed for narrow strings, but trait explicitly negates this?

 Historical mistake.

 Andrei

 Is string random access gonna be deprecated some day then or this is
 considered a mistake we need to get used to? :)

Walter is unconvinced it's a mistake, which doesn't make it any easier. 
If I had my way, I'd require people to write str.rep[6] to access the 
sixth byte in the representation of a UTF-8 or UTF-16 string. It would 
take D's strings from great to indistinguishable from perfect.

Andrei

Oct 23 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 10/23/12 12:35 PM, Andrei Alexandrescu wrote:
 On 10/23/12 11:58 AM, mist wrote:
 On Tuesday, 23 October 2012 at 15:55:23 UTC, Andrei Alexandrescu wrote:
 On 10/23/12 11:36 AM, mist wrote:
 Was thinking on this topic after seeing this:
 http://stackoverflow.com/questions/13014999/cannot-slice-taker-from-std-range-in-d



 Still can't understand rationale here. Why native slicing / random
 access is allowed for narrow strings, but trait explicitly negates
 this?

 Historical mistake.

 Andrei

 Is string random access gonna be deprecated some day then or this is
 considered a mistake we need to get used to? :)

 Walter is unconvinced it's a mistake, which doesn't make it any easier.
 If I had my way, I'd require people to write str.rep[6] to access the
 sixth byte in the representation of a UTF-8 or UTF-16 string. It would
 take D's strings from great to indistinguishable from perfect.

 Andrei

s/byte/code unit/

Oct 23 2012
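
For reference, the `rep` property in the post above is proposed syntax, not something the language provides. Phobos does offer `std.string.representation`, which exposes the code units of a string as an array of unsigned integers. A minimal sketch of what explicit code-unit access looks like with it (the variable names are only illustrative):

```d
import std.string : representation;

void main()
{
    string s = "Пиво";                       // 4 code points, 8 UTF-8 code units

    // representation reinterprets the string as immutable(ubyte)[];
    // no copy is made and no decoding happens.
    immutable(ubyte)[] units = s.representation;

    assert(units.length == 8);
    assert(units[0] == 0xD0 && units[1] == 0x9F); // the two code units of 'П'

    // Plain indexing on the string yields the same code units today.
    assert(s[1] == units[1]);
}
```

The difference from the proposal is that here the explicit spelling is optional: `s[1]` still compiles on a narrow string.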

"mist" <none none.none> writes:

Hm, and all phobos functions should operate on narrow strings as 
if they were not random-accessible? I am thinking about something 
like commonPrefix from std.algorithm, which operates on code 
points for strings.

Oct 23 2012

"Simen Kjaeraas" <simen.kjaras gmail.com> writes:

On 2012-10-23, 19:21, mist wrote:

 Hm, and all phobos functions should operate on narrow strings as if they
 were not random-accessible? I am thinking about something like
 commonPrefix from std.algorithm, which operates on code points for
 strings.

Preferably, yes. If there are performance (or other) benefits from
operating on code units, and it's just as safe, then operating on code
units is ok.

-- 
Simen

Oct 23 2012

"mist" <none none.none> writes:

On Tuesday, 23 October 2012 at 17:36:53 UTC, Simen Kjaeraas wrote:
 On 2012-10-23, 19:21, mist wrote:

 Hm, and all phobos functions should operate on narrow strings 
 as if they were not random-accessible? I am thinking about 
 something like commonPrefix from std.algorithm, which operates 
 on code points for strings.

 Preferably, yes. If there are performance (or other) benefits 
 from
 operating on code units, and it's just as safe, then operating 
 on code
 units is ok.

Probably I don't understand it fully, but D approach has always 
been "safe first, fast with some additional syntax". Back to 
commonPrefix and take:

==========================
import std.stdio, std.traits, std.algorithm, std.range;

void main()
{
	auto beer = "Пиво";
	auto r1 = beer.take(2);
	auto pony = "Пони";
	auto r2 = commonPrefix(beer, pony);
	writeln(r1);
	writeln(r2);
}
==========================

First one returns 2 symbols. Second one - 3 code units and a 
broken string. There is no way such inconsistency by default in 
the standard library is understandable by a newbie.

Oct 24 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, October 24, 2012 12:42:59 mist wrote:
 On Tuesday, 23 October 2012 at 17:36:53 UTC, Simen Kjaeraas wrote:
 On 2012-10-23, 19:21, mist wrote:
 Hm, and all phobos functions should operate on narrow strings
 as if they were not random-accessible? I am thinking about
 something like commonPrefix from std.algorithm, which operates
 on code points for strings.

 Preferably, yes. If there are performance (or other) benefits
 from operating on code units, and it's just as safe, then
 operating on code units is ok.

 Probably I don't understand it fully, but D approach has always
 been "safe first, fast with some additional syntax". Back to
 commonPrefix and take:

 ==========================
 import std.stdio, std.traits, std.algorithm, std.range;

 void main()
 {
 	auto beer = "Пиво";
 	auto r1 = beer.take(2);
 	auto pony = "Пони";
 	auto r2 = commonPrefix(beer, pony);
 	writeln(r1);
 	writeln(r2);
 }
 ==========================

 First one returns 2 symbols. Second one - 3 code units and a
 broken string. There is no way such inconsistency by default in
 the standard library is understandable by a newbie.

We don't really have much choice here. As long as strings are arrays of code 
units, it wouldn't work to treat them as ranges of their elements, because 
that would be a complete disaster for unicode. You'd be operating on code 
units rather than code points, which is almost always wrong. Pretty much the 
only way to really solve the problem as long as strings are arrays with all of 
the normal array operations is for the std.range traits (hasLength, 
hasSlicing, etc.) and the range functions for arrays in std.array (e.g. front, 
popFront, etc.) to treat strings as ranges of code points (dchar), which is 
what they do. The result _is_ confusing, but as long as strings are arrays of 
code units like they are now, to do anything else would result in incorrect 
behavior. There just isn't a good solution given what strings currently are in 
the language itself.

Andrei's suggestion would work if Walter could be talked into it, but that 
doesn't look like it's going to happen. And making it so that strings are 
structs which hold arrays of code units could work, but without language 
support, it's likely to have major issues. String literals would have to 
become the struct type, which could cause issues with calling C functions, and 
the code breakage would be _way_ larger than with Andrei's suggestion, since 
arrays of code units would no longer be strings at all. It would be feasible, 
but it gets really messy. What we have is probably about the best that we can 
do without actually changing the language (and Andrei's suggestion is likely 
the best way to do that IMHO), but that's unlikely to happen at this point, 
especially since Walter seems to view unicode quite differently from your 
average programmer and expects your average programmer to actually understand 
it and handle it correctly (which just isn't going to happen).

The confusion could be reduced if we not only had an article on dlang.org 
explaining exactly what ranges were and how to use them with Phobos but also 
an article (maybe the same one, maybe another), which explained what this 
means for strings and why. That way, it would become easier to become 
educated. But no one has written (or at least finished writing) such an article 
for dlang.org (I keep meaning to, but I never get around to it). Some stuff has 
been written outside of dlang.org (e.g. 
http://www.drdobbs.com/architecture-and-design/component-programming-in-d/240008321 
and http://ddili.org/ders/d.en/ranges.html ), but there's nothing on dlang.org, 
and I don't believe that there's really anything online aside from stray 
newsgroup posts or stackoverflow answers which discusses why strings are the 
way they are with regards to ranges. And there should be.

- Jonathan M Davis

Oct 24 2012

"mist" <none none.none> writes:

Ok, just one question to make the official position clear: is the 
commonPrefix implementation buggy, or is it a conscious decision 
to trade correct operation on narrow strings for some speed?

Oct 24 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, October 24, 2012 13:19:54 mist wrote:
 Ok, just one question to make the official position clear: is the
 commonPrefix implementation buggy, or is it a conscious decision
 to trade correct operation on narrow strings for some speed?

Strings are always ranges of dchar, but if a function can operate on them more 
efficiently by special casing them and then using array operations taking the 
correct unicode handling into account, then it generally will.

commonPrefix can't be made much more efficient by special casing strings, but it 
_can_ change its return type to be a string via slicing, since it can keep 
track of where it is in the string as it iterates over it. However, the 
documentation incorrectly states that the result of commonPrefix is always 
the result of takeExactly. That's generally true but is _not_ true for strings. 
The documentation needs to be fixed.

That being said, there _is_ a bug in commonPrefix that I just noticed when 
looking it over. It currently operates on code units rather than code points. 
It can operate on strings just fine like it's doing now (even returning a 
slice), but it needs to decode the code points as it iterates over them, and 
it's not doing that.

- Jonathan M Davis

Oct 24 2012

"mist" <none none.none> writes:

Thanks, that is exactly what I wanted to clarify.
Can I do a pull request for this, or do you plan to?

Oct 24 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, October 24, 2012 04:54:36 Jonathan M Davis wrote:
 That being said, there _is_ a bug in commonPrefix that I just noticed when
 looking it over. It currently operates on code units rather than code
 points. It can operate on strings just fine like it's doing now (even
 returning a slice), but it needs to decode the code points as it iterates
 over them, and it's not doing that.

Wait. No. I think that it's (mostly) okay. I was thinking that you could have 
different sequences of code units which resolved to the same code point, and 
upon reflection, I don't think that you can. It's graphemes which can be 
represented by multiple sequences of code points, not code points which can be 
represented by multiple sequences of code units (unicode is overly confusing 
to say the least).

There's still an issue with the predicate though (hence the "mostly" above). 
If anything _other_ than == or != is used, then the code units would have to 
be decoded in order to pass dchars to the predicate. So, commonPrefix should be 
fine as-is in all cases except for when a custom predicate is given, and it's 
operating on narrow strings.

- Jonathan M Davis

Oct 24 2012

"mist" <none none.none> writes:

Wait. So you consider commonPrefix returning malformed string to 
be fine? I have lost you here. For example, for code sample given 
above, output is:

==========
Пи
П[\D0]
==========

Problem is if you use == on code unit you can match only part of 
valid symbol.

Oct 24 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, October 24, 2012 14:37:33 mist wrote:
 Wait. So you consider commonPrefix returning malformed string to
 be fine? I have lost you here. For example, for code sample given
 above, output is:
 
 ==========
 Пи
 П[\D0]
 ==========
 
 Problem is if you use == on code unit you can match only part of
 valid symbol.

Hmmm. Let me think this through for a moment. Every code point starts with a 
code unit that tells you how many code units are in the code point, and each 
code point should have only one sequence of code units which represents it, so 
something like find or startsWith should be able to just use code units. 
commonPrefix is effectively doing a startsWith/find, but it's shortcutted once 
there's a difference, and that _could_ be in the middle of a code point, since 
you could have a code point with 3 code units where the first 2 match but not 
the third one. So, yes. There is a bug here.

Now, a full decode still isn't necessary. It just has to keep track of how 
long the code point is and return a slice ending at the end of the 
previous code point if not all of a code point matches, but you've definitely 
found a bug.

- Jonathan M Davis

Oct 24 2012
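
The fix described in the post above (track how long each code point is, and never end the slice inside one) can be sketched as follows. This is an illustration only, not the Phobos implementation: `commonPrefixWhole` is a made-up name, and the sketch assumes both inputs are well-formed UTF-8. It relies on `std.utf.stride`, which reports the length of a code point from its first code unit without decoding it.

```d
import std.utf : stride;

// Hypothetical helper, not the Phobos implementation: returns the longest
// prefix of `a` shared with `b` that consists of whole code points.
// Assumes both inputs are well-formed UTF-8.
string commonPrefixWhole(string a, string b)
{
    size_t i = 0;
    while (i < a.length && i < b.length)
    {
        // Number of code units in the code point starting at a[i],
        // read from the lead code unit without decoding it.
        immutable n = stride(a, i);

        // Stop unless the entire code point is present and equal in b.
        if (i + n > a.length || i + n > b.length || a[i .. i + n] != b[i .. i + n])
            break;
        i += n;
    }
    return a[0 .. i];
}

void main()
{
    // The shared lead byte of the second letters is no longer included.
    assert(commonPrefixWhole("Пиво", "Пони") == "П");
    assert(commonPrefixWhole("abc", "abd") == "ab");
    assert(commonPrefixWhole("abc", "xyz") == "");
}
```

Comparing whole code points at a time keeps the result a valid slice of the first string, so no allocation or decoding to dchar is needed for the default == predicate.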

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Oct 24, 2012 at 12:38:41PM -0700, Jonathan M Davis wrote:
 On Wednesday, October 24, 2012 14:37:33 mist wrote:
 Wait. So you consider commonPrefix returning malformed string to
 be fine? I have lost you here. For example, for code sample given
 above, output is:
 
 ==========
 Пи
 П[\D0]
 ==========
 
 Problem is if you use == on code unit you can match only part of
 valid symbol.

 
 Hmmm. Let me think this through for a moment. Every code point starts
 with a code unit that tells you how many code units are in the code
 point, and each code point should have only one sequence of code units
 which represents it, so something like find or startsWith should be
 able to just use code units.  commonPrefix is effectively doing a
 startsWith/find, but it's shortcutted once there's a difference, and
 that _could_ be in the middle of a code point, since you could have a
 code point with 3 code units where the first 2 match but not the third
 one. So, yes. There is a bug here.
 
 Now, a full decode still isn't necessary. It just has to keep track of
 how long the code point is and return a slice ending at the end of
 the previous code point if not all of a code point matches, but
 you've definitely found a bug.

[...]

For many algorithms, full decode is not necessary. This is something
that Phobos should take advantage of (at least in theory; I'm not sure
how practical this is with the current codebase).

Actually, in the above case, *no* decode is necessary at all. UTF-8 was
designed specifically for this: if you see a byte with its two highest bits
set to 0b10, that means you're in the middle of a code point. You can
scan forwards or backwards until the first byte whose highest bits
aren't 0b10; that's guaranteed to be the start of a code point (provided
the original string is actually well-formed UTF-8). There is no need to
keep track of length at all.

Many algorithms can be optimized to take advantage of this. Counting the
number of code points is simply counting the number of bytes whose
highest bits are not 0b10. Given some arbitrary offset into a char[],
you can use std.range.radial to find the nearest code point boundary
(i.e., byte whose upper bits are not 0b10).

Given a badly-truncated UTF-8 string (i.e., it got cut in the middle of
a code point), you can recover the still-valid substring by deleting the
bytes with high bits 0b10 at the beginning/end of the string. You'll
lose the truncated code point, but the rest of the string is still
usable.

Etc..


T

-- 
Marketing: the art of convincing people to pay for what they didn't need before
which you can't deliver after.

Oct 24 2012
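
The continuation-byte trick from the post above can be sketched in a few lines. The helper names (`isContinuation`, `countCodePoints`, `alignToCodePoint`) are invented for this illustration, and all of them assume well-formed UTF-8 input:

```d
/// True if `c` is a UTF-8 continuation byte, i.e. its top two bits are 10.
bool isContinuation(char c) pure nothrow @safe @nogc
{
    return (c & 0xC0) == 0x80;
}

/// Counts code points by counting the code units that start one.
size_t countCodePoints(const(char)[] s) pure nothrow @safe @nogc
{
    size_t n = 0;
    foreach (char c; s)          // iterate code units, no decoding
        if (!isContinuation(c))
            ++n;
    return n;
}

/// Moves `i` back to the nearest code point boundary at or before it.
size_t alignToCodePoint(const(char)[] s, size_t i) pure nothrow @safe @nogc
{
    while (i > 0 && i < s.length && isContinuation(s[i]))
        --i;
    return i;
}

void main()
{
    string beer = "Пиво", pony = "Пони";
    assert(countCodePoints(beer) == 4);

    // Compare code units, then trim a partially matched code point.
    size_t i = 0;
    while (i < beer.length && i < pony.length && beer[i] == pony[i])
        ++i;
    assert(i == 3);                              // stopped inside 'и'
    assert(beer[0 .. alignToCodePoint(beer, i)] == "П");
}
```

Unlike a stride-based loop, this version compares raw code units first and only repairs the boundary at the end, which is why no length bookkeeping is needed.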

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, October 24, 2012 14:37:33 mist wrote:
 Wait. So you consider commonPrefix returning malformed string to
 be fine? I have lost you here. For example, for code sample given
 above, output is:
 
 ==========
 Пи
 П[\D0]
 ==========
 
 Problem is if you use == on code unit you can match only part of
 valid symbol.

http://d.puremagic.com/issues/show_bug.cgi?id=8890

- Jonathan M Davis

Oct 24 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 10/24/2012 01:07 PM, Jonathan M Davis wrote:
 On Wednesday, October 24, 2012 12:42:59 mist wrote:
 On Tuesday, 23 October 2012 at 17:36:53 UTC, Simen Kjaeraas wrote:
 On 2012-10-23, 19:21, mist wrote:
 Hm, and all phobos functions should operate on narrow strings
 as if they were not random-accessible? I am thinking about
 something like commonPrefix from std.algorithm, which operates
 on code points for strings.

 Preferably, yes. If there are performance (or other) benefits
 from
 operating on code units, and it's just as safe, then operating
 on code
 units is ok.

 Probably I don't understand it fully, but D approach has always
 been "safe first, fast with some additional syntax". Back to
 commonPrefix and take:

 ==========================
 import std.stdio, std.traits, std.algorithm, std.range;

 void main()
 {
 	auto beer = "Пиво";
 	auto r1 = beer.take(2);
 	auto pony = "Пони";
 	auto r2 = commonPrefix(beer, pony);
 	writeln(r1);
 	writeln(r2);
 }
 ==========================

 First one returns 2 symbols. Second one - 3 code units and a
 broken string. There is no way such inconsistency by default in
 the standard library is understandable by a newbie.

 We don't really have much choice here. As long as strings are arrays of code
 units, it wouldn't work to treat them as ranges of their elements, because
 that would be a complete disaster for unicode. You'd be operating on code
 units rather than code points, which is almost always wrong.

There are plenty of cases where it makes no difference, or where iterating
by code point is harmful, or just as incorrect.

str.filter!(a=>a!='x'); // works for all str iterated by
                         // code point or by code unit

string x = str.filter!(a=>a!='x').array;// only works in the latter case

dstring s = "ÅA";
dstring g = s.filter!(a=>a!='A').array;


 Pretty much the
 only way to really solve the problem as long as strings are arrays with all of
 the normal array operations is for the std.range traits (hasLength,
 hasSlicing, etc.) and the range functions for arrays in std.array (e.g. front,
 popFront, etc.) to treat strings as ranges of code points (dchar), which is
 what they do. The result _is_ confusing, but as long as strings are arrays of
 code units like they are now, to do anything else would result in incorrect
 behavior.

It would result in by-code-unit behavior.

 There just isn't a good solution given what strings currently are in
 the language itself.

 Andrei's suggestion would work if Walter could be talked into it, but that
 doesn't look like it's going to happen. And making it so that strings are
 structs which hold arrays of code units could work, but without language
 support, it's likely to have major issues. String literals would have to
 become the struct type, which could cause issue with calling C functions, and
 the code breakage would be _way_ larger than with Andrei's suggestion, since
 arrays of code units would no longer be strings at all.
 ...

You realize that the proposed solution is that arrays of code units
would no longer be arrays of code units?

Oct 24 2012
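
The two filter examples in the post above can be made concrete. The sketch below rests on an assumption the post does not state: that the 'Å' in the dstring example is spelled as 'A' followed by U+030A COMBINING RING ABOVE. Under that assumption, iterating by code point is still not iterating by character:

```d
import std.algorithm : filter;
import std.array : array;

void main()
{
    // Assumption for this sketch: 'Å' is written in decomposed form,
    // so the text is 3 code points but 2 user-perceived characters.
    dstring s = "A\u030AA"d;
    assert(s.length == 3);

    // Filtering by code point removes the base letter of 'Å' as well,
    // leaving an orphaned combining mark behind.
    auto g = s.filter!(a => a != 'A').array;
    assert(g == "\u030A"d);

    // For narrow strings the same pipeline yields dchar elements, so
    // .array produces a dchar[] and cannot be assigned to a string.
    string str = "axb";
    auto x = str.filter!(a => a != 'x').array;
    static assert(is(typeof(x) == dchar[]));
    assert(x == "ab"d);
}
```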

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, October 24, 2012 13:39:50 Timon Gehr wrote:
 You realize that the proposed solution is that arrays of code units
 would no longer be arrays of code units?

Yes and no. They'd be arrays of code units, but any operations on them which 
weren't unicode safe would require using the rep property. So, for instance, 
using ptr on them to pass to C functions would be fine, but slicing wouldn't. 
It definitely would be a case of violating the turtles all the way down 
principle, because arrays of code units wouldn't really be proper arrays 
anymore, but as long as they're treated as actual arrays, they _will_ be 
misused. The trick is doing something that's both correct and reasonably 
efficient by default but allows fully efficient code if you code with an 
understanding of unicode, and to do that, you can't have arrays of code units 
like we do now. But for better or worse, that doesn't look like it's going to 
change.

What we have right now actually works quite well if you understand the issues 
involved, but it's not newbie friendly at all.

- Jonathan M Davis

Oct 24 2012

"mist" <none none.none> writes:

On Wednesday, 24 October 2012 at 12:03:10 UTC, Jonathan M Davis 
wrote:
 On Wednesday, October 24, 2012 13:39:50 Timon Gehr wrote:
 You realize that the proposed solution is that arrays of code 
 units
 would no longer be arrays of code units?

 Yes and no. They'd be arrays of code units, but any operations 
 on them which
 weren't unicode safe would require using the rep property. So, 
 for instance,
 using ptr on them to pass to C functions would be fine, but 
 slicing wouldn't.
 It definitely would be a case of violating the turtles all the 
 way down
 principle, because arrays of code units wouldn't really be 
 proper arrays
 anymore, but as long as they're treated as actual arrays, they 
 _will_ be
 misused. The trick is doing something that's both correct and 
 reasonably
 efficient by default but allows fully efficient code if you 
 code with an
 understanding of unicode, and to do that, you can't have arrays 
 of code units
 like we do now. But for better or worse, that doesn't look like 
 it's going to
 change.

 What we have right now actually works quite well if you 
 understand the issues
 involved, but it's not newbie friendly at all.

 - Jonathan M Davis

What about a compromise - turning this proposal upside down and 
requiring something like "utfstring".decode to operate on 
symbols? (There is front & Co in std.array, but I am thinking of 
something more tightly coupled to string.) It would remove the 
necessity of copy-pasting the very same checks into all algorithms 
and move the decision about using code points vs code units to the 
user side. Yes, it does not prohibit a lot of senseless operations, 
but at least it is a consistent approach. I personally believe that 
not being able to understand what to expect from a basic 
algorithm/operation applied to a string (without looking at the 
library source code) is a much more difficult situation than the 
necessity to properly understand unicode.

Oct 24 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, October 24, 2012 14:18:04 mist wrote:
 What about a compromise - turning this proposal upside down and
 requiring something like "utfstring".decode to operate on
 symbols? (There is front & Co in std.array, but I am thinking of
 something more tightly coupled to string.) It would remove the
 necessity of copy-pasting the very same checks into all algorithms
 and move the decision about using code points vs code units to the
 user side. Yes, it does not prohibit a lot of senseless operations,
 but at least it is a consistent approach.

I'm afraid that I don't understand what you're proposing.

 I personally believe that not
 being able to understand what to expect from a basic
 algorithm/operation applied to a string (without looking at the
 library source code) is a much more difficult situation than the
 necessity to properly understand unicode.

Well, to use ranges in general, you need to understand hasLength, hasSlicing, 
isRandomAccessRange, etc., and you need to understand what it means when 
template constraints fail based on those templates. That being the case, 
strings are no different from any other range in that if they fail to 
instantiate with a particular function, then you need to look at the template 
constraints and see what the function requires, and sometimes you just can't 
know without looking at the template constraints, because it's not always 
obvious which range operations a particular function will require just based 
on what it's supposed to do. The main issue is understanding which range-based 
operations arrays have but which narrow strings don't. Then when a template 
fails to instantiate because a string isn't random access or sliceable or 
whatnot, you understand why rather than getting totally confused about int[] 
working and char[] not working.

- Jonathan M Davis

Oct 24 2012
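
The difference between int[] and char[] described in the post above can be checked directly at compile time. A small sketch using the std.range traits named in the thread:

```d
import std.range : ElementType, hasLength, hasSlicing, isRandomAccessRange, walkLength;

// Ordinary arrays, and UTF-32 strings, are random-access ranges.
static assert(isRandomAccessRange!(int[]));
static assert(isRandomAccessRange!dstring);

// Narrow strings are deliberately reported as ranges of dchar without
// length, slicing or random access, even though the array operations exist.
static assert(is(ElementType!string == dchar));
static assert(!hasLength!string);
static assert(!hasSlicing!string);
static assert(!isRandomAccessRange!string);
static assert(!isRandomAccessRange!wstring);

void main()
{
    string s = "Пиво";
    assert(s.length == 8);        // built-in length counts code units
    assert(s.walkLength == 4);    // the range view counts code points
}
```

A template constrained on isRandomAccessRange or hasSlicing therefore accepts int[] and dstring but rejects string and wstring, which is the failure the original question ran into.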

Timon Gehr <timon.gehr gmx.ch> writes:

On 10/23/2012 05:58 PM, mist wrote:
 On Tuesday, 23 October 2012 at 15:55:23 UTC, Andrei Alexandrescu wrote:
 On 10/23/12 11:36 AM, mist wrote:
 Was thinking on this topic after seeing this:
 http://stackoverflow.com/questions/13014999/cannot-slice-taker-from-std-range-in-d


 Still can't understand rationale here. Why native slicing / random
 access is allowed for narrow strings, but trait explicitly negates this?

 Historical mistake.

 Andrei

 Is string  random access gonna be deprecated some day then or this is
 considered a mistake we need to get used to? :)

The other valid opinion is that the 'mistake' is in Phobos because it
treats narrow character arrays specially.
Note that string is just a name for immutable(char)[]. It would have to 
become a struct if random access was to be deprecated.

Oct 23 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, October 24, 2012 00:28:28 Timon Gehr wrote:
 The other valid opinion is that the 'mistake' is in Phobos because it
 treats narrow character arrays specially.

If it didn't, then range-based functions would be useless for strings in most 
cases, because it rarely makes sense to operate on code units.

 Note that string is just a name for immutable(char)[]. It would have to
 become a struct if random access was to be deprecated.

I think that Andrei was arguing for changing how the compiler itself handles 
arrays of char and wchar so that they wouldn't have direct random access or 
length anymore, forcing you to do something like str.rep[6] for random access 
regardless of what happens with range-based functions.

- Jonathan M Davis

Oct 23 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 10/24/2012 01:07 AM, Jonathan M Davis wrote:
 On Wednesday, October 24, 2012 00:28:28 Timon Gehr wrote:
 The other valid opinion is that the 'mistake' is in Phobos because it
 treats narrow character arrays specially.

 If it didn't, then range-based functions would be useless for strings in most
 cases, because it rarely makes sense to operate on code units.

 Note that string is just a name for immutable(char)[]. It would have to
 become a struct if random access was to be deprecated.

 I think that Andrei was arguing for changing how the compiler itself handles
 arrays of char and wchar so that they wouldn't have direct random access or
 length anymore, forcing you to do something like str.rep[6] for random access
 regardless of what happens with range-based functions.

 - Jonathan M Davis

That idea does not even deserve discussion.

Oct 23 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, October 24, 2012 01:33:28 Timon Gehr wrote:
 On 10/24/2012 01:07 AM, Jonathan M Davis wrote:
 On Wednesday, October 24, 2012 00:28:28 Timon Gehr wrote:
 The other valid opinion is that the 'mistake' is in Phobos because it
 treats narrow character arrays specially.

 
 If it didn't, then range-based functions would be useless for strings in
 most cases, because it rarely makes sense to operate on code units.
 
 Note that string is just a name for immutable(char)[]. It would have to
 become a struct if random access was to be deprecated.

 
 I think that Andrei was arguing for changing how the compiler itself
 handles arrays of char and wchar so that they wouldn't have direct random
 access or length anymore, forcing you to do something like str.rep[6] for
 random access regardless of what happens with range-based functions.
 
 - Jonathan M Davis

 
 That idea does not even deserve discussion.

Actually, it solves the problem quite well, because you then have to work at 
misusing strings (of any constness or char type), but it's still extremely 
easy to operate on code units if you want to. However, Walter seems to think 
that everyone should understand unicode and code for it, in which case it 
would be normal for the programmer to understand all of the quirks of code 
units vs code points and code accordingly, but I think that it's pretty clear 
that the average programmer doesn't have a clue about unicode, so if the 
normal string operations do anything which isn't unicode aware (e.g. length), 
then lots of programmers are going to screw it up. But since such a change 
would break tons of code, I think that there's pretty much no way that it's 
going to happen at this point even if it were generally agreed that it was the 
way to go.
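
To make the code unit vs code point gap concrete, here is a minimal sketch. It assumes a UTF-8 encoded source file and the current Phobos behaviour where range functions decode narrow strings to dchar; the helper names codeUnits and codePoints are made up for this example and are not part of Phobos.

```d
import std.range : walkLength;
import std.string : representation;

// Hypothetical helper: number of UTF-8 code units, i.e. what .length reports.
size_t codeUnits(string s) { return s.length; }

// Hypothetical helper: number of code points, counted by decoding the string.
size_t codePoints(string s) { return s.walkLength; }

void main()
{
    string s = "héllo";            // 'é' takes two UTF-8 code units
    assert(codeUnits(s) == 6);     // length counts code units
    assert(codePoints(s) == 5);    // walkLength counts decoded code points
    // Explicit code unit access, in the spirit of the str.rep[6] idea.
    assert(s.representation[1] == 0xC3);
}
```

Anyone who uses length as "number of characters" gets the wrong answer as soon as the string leaves ASCII, which is the kind of mistake the proposal would make harder to write by accident.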

The alternative, of course, is to create a string type which wraps arrays of 
the various character types, but no one has been able to come up with a design 
for it which was generally accepted. It also risks not working very well with 
string literals and the like, since a string literal would no longer be a 
string (similar to the nonsense that you have to put up with in C++ with 
regards to std::string vs string literals). But even if someone can come up 
with a solid solution, the amount of code which it would break could easily 
disqualify it anyway.

- Jonathan M Davis

Oct 23 2012

"mist" <none none.none> writes:

Actually it is awesome.
But all the code breakage.. eh.

Oct 24 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 10/24/2012 12:45 PM, mist wrote:
 Actually it is awesome.
 But all the code breakage.. eh.

Obviously T[] should support indexing for any T.
This is the definition of an array.

Oct 24 2012

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Tuesday, 23 October 2012 at 23:07:28 UTC, Jonathan M Davis 
wrote:
 I think that Andrei was arguing for changing how the compiler 
 itself handles arrays of char and wchar so that they wouldn't

As I said last time this came up, we could actually do this today 
without changing the compiler. Since string is a user defined 
type anyway, we could just define it differently.

http://arsdnet.net/dcode/test99.d

I'm pretty sure that no changes to Phobos are even required. (The 
reason I called it "String" there instead of "string" is simply 
so it doesn't conflict with the string in object.d)

Oct 23 2012

"Simen Kjaeraas" <simen.kjaras gmail.com> writes:

On 2012-10-24 01:10, Adam D. Ruppe <destructionator gmail.com> wrote:

 On Tuesday, 23 October 2012 at 23:07:28 UTC, Jonathan M Davis wrote:
 I think that Andrei was arguing for changing how the compiler itself  
 handles arrays of char and wchar so that they wouldn't

 As I said last time this came up, we could actually do this today  
 without changing the compiler. Since string is a user defined type  
 anyway, we could just define it differently.

 http://arsdnet.net/dcode/test99.d

 I'm pretty sure that no changes to Phobos are even required. (The reason I  
 called it "String" there instead of "string" is simply so it doesn't  
 conflict with the string in object.d)


As long as typeof("") != String, this is not going to work:

auto s = "";
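
A minimal sketch of the problem, assuming a hypothetical wrapper struct named String (this is not the code from the linked test99.d, just a stand-in for any library-defined string type):

```d
// Hypothetical wrapper type, assumed for illustration only.
struct String
{
    immutable(char)[] data;
}

void main()
{
    auto s = "";
    // The compiler still types the literal as the built-in string,
    // i.e. immutable(char)[], so type inference never yields String.
    static assert(is(typeof(s) == string));
    static assert(!is(typeof(s) == String));
}
```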

-- 
Simen

Oct 23 2012

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Wednesday, 24 October 2012 at 06:43:13 UTC, Simen Kjaeraas 
wrote:
 As long as typeof("") != String, this is not going to work:

 auto s = "";

Gah, I hate literals.

Oct 24 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, October 24, 2012 12:53:23 H. S. Teoh wrote:
 For many algorithms, full decode is not necessary. This is something
 that Phobos should take advantage of (at least in theory; I'm not sure
 how practical this is with the current codebase).

It does take advantage of it in a number of cases but not necessarily 
everywhere that it could. In fact, one major issue with ranges is that 
if you've wrapped a string in a range at all (via map, filter, take, or 
whatever), then the resultant range is forced to decode on every call to front 
or popFront (well, partial decode on popFront anyway), whereas functions can 
special case strings to avoid extraneous decoding with them. So, you can take 
a performance hit if you're operating on wrapped strings rather than on 
strings directly.
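
A small sketch of what "wrapped" means here, assuming the Phobos behaviour under discussion (narrow strings are presented to range functions as ranges of dchar). The filter predicate is arbitrary and only there to produce a wrapper range:

```d
import std.algorithm : filter;
import std.range : ElementType;

void main()
{
    string s = "héllo";
    // Range primitives see a string as a range of dchar...
    static assert(is(ElementType!string == dchar));
    // ...and so does any range wrapped around it, which therefore has to
    // decode on front and cannot be special-cased as a plain array.
    auto r = s.filter!(c => c != 'l');
    static assert(is(ElementType!(typeof(r)) == dchar));
    // Indexing the string itself still yields a code unit.
    static assert(is(typeof(s[0]) == immutable(char)));
}
```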

 Actually, in the above case, *no* decode is necessary at all. UTF-8 was
 designed specifically for this: if you see a byte with its highest bits
 set to 0b10, that means you're in the middle of a code point. You can
 scan forwards or backwards until the first byte whose highest bits
 aren't 0b10; that's guaranteed to be the start of a code point (provided
 the original string is actually well-formed UTF-8). There is no need to
 keep track of length at all.

I wouldn't say that "no" decoding is necessary. Rather, I'd say that partial 
decoding is necessary. If you have to examine the code units to determine 
where code points are or how long they are or whatnot, then you're still doing 
part of what decode has to do, whereas a function like find can forgo checking 
any of that entirely and merely compare the values of the code units. _That_'s 
what I'd consider to be no decoding required, and commonPrefix is buggy 
precisely because it's doing no decoding rather than partial decoding. But I 
suppose that it's arguing semantics.
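
As a rough illustration of the "partial decoding" case, here is a sketch of backing up to the start of a code point by looking only at the top two bits of each code unit. The function name codePointStart is made up for this example, and the code assumes well-formed UTF-8 and a UTF-8 encoded source file:

```d
// Hypothetical helper: index of the first code unit of the code point
// that contains s[i]. Assumes s is well-formed UTF-8 and i < s.length.
size_t codePointStart(const(char)[] s, size_t i)
{
    // Continuation code units have the bit pattern 10xxxxxx.
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

void main()
{
    string s = "aé€";                   // 'a': 1 unit, 'é': 2 units, '€': 3 units
    assert(codePointStart(s, 0) == 0);  // already at the start of 'a'
    assert(codePointStart(s, 2) == 1);  // second unit of 'é' backs up to index 1
    assert(codePointStart(s, 5) == 3);  // last unit of '€' backs up to index 3
}
```

No code point value is ever computed, but the code units are still being inspected for structure, which is more than a plain unit-by-unit comparison does.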

- Jonathan M Davis

Oct 24 2012
