digitalmars.D - Making all strings UTF ranges has some risk of WTF
- Andrei Alexandrescu (42/42) Feb 03 2010 It's no secret that string et al. are not a magic recipe for writing
- Robert Jacques (7/19) Feb 03 2010 I like b) and d), with a slight preference for d. I think the benefits o...
- ZY Zhou (2/22) Feb 03 2010 I choose (b)
- Andrei Alexandrescu (8/31) Feb 03 2010 Perfect. Then you'll be glad to know that I changed all of Phobos to
- Chad J (23/46) Feb 03 2010 I'm leaning towards (c) here.
- Andrei Alexandrescu (4/60) Feb 03 2010 I hear you. Actually, to either quench or add to the confusion, .length
- Chad J (4/10) Feb 03 2010 0.o
- grauzone (17/35) Feb 03 2010 Change the type of string literals from char[] (or whatever the string
- Trass3r (3/22) Feb 04 2010 That sounds like a really reasonable way to me.
- Simen kjaeraas (9/22) Feb 04 2010 This seems to me a job for @disable:
- Walter Bright (5/7) Feb 03 2010 I'm concerned it would be slow. Most operations on strings do not need
- Andrei Alexandrescu (8/16) Feb 03 2010 I thought you're going to say that, but fortunately it's easy to
- Andrei Alexandrescu (3/11) Feb 03 2010 Oh, one more thing: doing mixed-width searches would require decoding.
- Walter Bright (2/3) Feb 03 2010 Or a conversion before the loop starts of the search term.
- Andrei Alexandrescu (4/8) Feb 03 2010 That triggers memory allocation plus the same cost. It's not likely to
- Ben Hanson (4/14) Feb 04 2010 Exactly. Please don't go down the route that Microsoft did for regex cas...
- dsimcha (10/52) Feb 03 2010 I personally would find this extremely annoying because most of the code...
- Andrei Alexandrescu (8/62) Feb 03 2010 It's definitely going to be easy to use all sensible algorithms with
- Steven Schveighoffer (25/47) Feb 08 2010 I'm in the same camp as dsimcha, I generally write all my apps assuming ...
- Michel Fortin (33/39) Feb 03 2010 UTF-8 and UTF-16 encodings are interesting beasts. If you have a UTF-8
- Chad J (6/16) Feb 04 2010 This would be under the condition that there is another property that
- Rainer Deyke (8/17) Feb 03 2010 These are all fine for a dedicated string type. They're horrible for
- Andrei Alexandrescu (4/20) Feb 03 2010 Arrays of char and wchar are not quite generic - they are definitely UTF...
- Rainer Deyke (11/13) Feb 03 2010 A 'char' is a single utf-8 code unit. A 'char[]' is (or should be) a
- Andrei Alexandrescu (7/20) Feb 03 2010 I agree up to the assessment of the size of the problem and a couple of
- Ali Çehreli (25/38) Feb 03 2010 They would yield dchar, right? Wouldn't that cause trouble in templated
- Andrei Alexandrescu (15/55) Feb 03 2010 Yes, dchar. There was some figuring out in parts of Phobos, but the
- Daniel Keep (8/16) Feb 04 2010 I believe what you're after is this:
- bearophile (7/7) Feb 04 2010 I'd like D to give the right results and to be not bug-prone by default ...
- bearophile (5/7) Feb 04 2010 So it's not really immutable. It contains an immutable dynamic array tha...
- Ali Çehreli (14/15) Feb 04 2010 len() to have a different computational complexity, like O(n) or O(1),
- Jason House (3/32) Feb 04 2010 The underlying array of byte-sized data fragments is an implementation d...
- Simen kjaeraas (5/18) Feb 04 2010 Of the above, I feel (b) is the correct solution, and I understand it
- bearophile (4/6) Feb 04 2010 Yes, I presume he was mostly looking for a justification of his ideas he...
- Andrei Alexandrescu (22/28) Feb 04 2010 I am ready to throw away the implementation as soon as a better idea
- Michel Fortin (21/32) Feb 04 2010 Has any thought been given to foreach? Currently all these work for stri...
- Andrei Alexandrescu (4/38) Feb 04 2010 This is a good point. I'm in favor of changing the language to make the
- Don (13/53) Feb 04 2010 We seem to be approaching the point where char[], wchar[] and dchar[]
- Jerry Quinn (3/10) Feb 04 2010 Well, if you're working with a LOT of text, you may be mmapping GB's of ...
- dsimcha (6/15) Feb 04 2010 text. Yes, this does happen. You better be able to handle it in a sane...
- Rainer Deyke (7/12) Feb 04 2010 If it's not too late to completely change the semantics of char[], then
- Andrei Alexandrescu (25/36) Feb 04 2010 One idea I've had for a while was to have a universal string type:
- grauzone (3/29) Feb 04 2010 You mean like this?
- Rainer Deyke (7/26) Feb 04 2010 Although I see some potential in a universal string type, I don't think
- Andrei Alexandrescu (4/29) Feb 04 2010 The definition I outlined does not specify or constrain the strategy of
- Michael Rynn (75/89) Feb 05 2010 Firstly, for such "augmented types" in D, such as strings, bignums or an...
- Michel Fortin (26/52) Feb 04 2010 That's a nice concept, but it seems to me that it adds much overhead to
- Andrei Alexandrescu (10/64) Feb 04 2010 Well as it's been mentioned, sometimes you may assemble a string out of
- Justin Johansson (14/38) Feb 04 2010 I concur. It's great to see consensus moving in this direction. For
- bearophile (4/7) Feb 04 2010 And just to be sure: you are better than me because you actually impleme...
- Steve Teale (4/7) Feb 05 2010 Andrei, congratulations on starting the most interesting thread I have s...
- Andrei Alexandrescu (3/11) Feb 05 2010 A bit of both, with an emphasis on library.
It's no secret that string et al. are not a magic recipe for writing correct Unicode code. However, things are pretty good and could be further improved by operating the following changes in std.array and std.range:

- make front() and back() for UTF-8 and UTF-16 automatically decode the first and last Unicode character
- make popFront() and popBack() skip one entire Unicode character (instead of just one code unit)
- alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
- change hasLength to return false for UTF-8 and UTF-16 strings

These changes effectively make UTF-8 and UTF-16 bidirectional ranges, with the quirk that you still have a sort of a random-access operator. I'm very strongly in favor of this change. Bidirectional strings allow beautiful correct algorithms to be written that handle encoded strings without any additional effort; with these changes, everything applicable of std.algorithm works out of the box (with the appropriate fixes here and there), which is really remarkable.

The remaining WTF is the length property. Traditionally, a range offering length also implies the expectation that a range of length n allows you to call popFront n times and then assert that the range is empty. However, if you check e.g. hasLength!string it will yield false, although the string does have an accessible member by that name and of the appropriate type.

Although Phobos always checks its assumptions, people might occasionally write code that just uses .length without checking hasLength. Then, they'll be annoyed when the code fails with UTF-8 and UTF-16 strings. (The "real" length of the range is not stored, but can be computed by using str.walkLength() in std.range.)

What can be done about that? I see a number of solutions:

(a) Do not operate the change at all.

(b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count".

(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property.

(d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count.

What would you do? Any ideas are welcome.

Andrei
Feb 03 2010
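The mismatch Andrei describes, where .length counts code units while popFront advances one whole code point, is a property of the encoding rather than of D, so it can be illustrated in any language. A small Python sketch of the same discrepancy (assuming well-formed UTF-8; walk_length here mimics what calling popFront until empty would count):

```python
# A UTF-8 string whose code-unit length and code-point count differ.
s = "déjà vu"             # 7 code points
utf8 = s.encode("utf-8")  # 9 UTF-8 code units: é and à take two bytes each

# ".length" on a D char[] corresponds to the code-unit count:
assert len(utf8) == 9

# walkLength (popFront until empty) corresponds to the code-point count:
def walk_length(data: bytes) -> int:
    count = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            i += 1  # 1-byte sequence (ASCII)
        elif b < 0xE0:
            i += 2  # 2-byte sequence
        elif b < 0xF0:
            i += 3  # 3-byte sequence
        else:
            i += 4  # 4-byte sequence
        count += 1
    return count

assert walk_length(utf8) == 7
```

Any code that took .length as "number of popFront calls until empty" silently gets 9 where 7 is meant, which is exactly the trap options (b) through (d) try to close.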
On Wed, 03 Feb 2010 21:00:21 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:(a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome.I like b) and d), with a slight preference for d. I think the benefits of strings being encoding correct and able to use std.algorithm outweighs the disadvantages. And making char[] different from T[] is going to play havoc with templated algorithms. Another alternative is to remove the char types from the language and implement them as library ranges.
Feb 03 2010
Andrei Alexandrescu Wrote:What can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome. AndreiI choose (b)
Feb 03 2010
ZY Zhou wrote:Andrei Alexandrescu Wrote:Perfect. Then you'll be glad to know that I changed all of Phobos to support it. All unittests pass now, but I suspect there are a couple of bugs left. The change will be part of the next release. I'll commit during the weekend. This is a very exciting development, I was very unhappy about the interaction between std.algorithm and strings. AndreiWhat can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome. AndreiI choose (b)
Feb 03 2010
Andrei Alexandrescu wrote:... What can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome. AndreiI'm leaning towards (c) here. To me the .length on char[] and wchar[] are kinda like doing this: struct SomePOD { int a, b; double y; } SomePOD pod; auto len = pod.length; assert(len == 16); // true. I'll admit it's not a perfect analogy. What I'm playing on here is that the .length on char[] and wchar[] returns the /size of/ the string in bytes rather than the /length/ of the string in number of (well-formed) characters. Unfortunately .sizeof is supposed to return the size of the string's reference (8 bytes on x86 systems) and not the size of the string, IIRC. So that's taken. So perhaps a .bytes or .nbytes property. Maybe make it work for arrays of structs and things like that too. A tuple (or any container) of non-homogeneous elements could probably benefit from this property as well. Given such a property being available, I wouldn't miss .length at all. It's quite misleading.
Feb 03 2010
Chad J wrote:Andrei Alexandrescu wrote:I hear you. Actually, to either quench or add to the confusion, .length for wstring returns the length in 16-bit units, not bytes. Andrei... What can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome. AndreiI'm leaning towards (c) here. To me the .length on char[] and wchar[] are kinda like doing this: struct SomePOD { int a, b; double y; } SomePOD pod; auto len = pod.length; assert(len == 16); // true. I'll admit it's not a perfect analogy. What I'm playing on here is that the .length on char[] and wchar[] returns the /size of/ the string in bytes rather than the /length/ of the string in number of (well-formed) characters. Unfortunately .sizeof is supposed to return the size of the string's reference (8 bytes on x86 systems) and not the size of the string, IIRC. So that's taken. So perhaps a .bytes or .nbytes property. Maybe make it work for arrays of structs and things like that too. A tuple (or any container) of non-homogeneous elements could probably benefit from this property as well. Given such a property being available, I wouldn't miss .length at all. It's quite misleading.
Feb 03 2010
Andrei Alexandrescu wrote:... I hear you. Actually, to either quench or add to the confusion, .length for wstring returns the length in 16-bit units, not bytes. Andrei0.o ... kay.
Feb 03 2010
Andrei Alexandrescu wrote:What can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome.Change the type of string literals from char[] (or whatever the string type is in D2) to a wrapper struct defined in object.d: struct string { char[] raw; } Now string.length is invalid, and you don't have to do weird stuff as in (b) or (c). From here on, you could do 2 things: 1. add accessor methods to string like string classes in other languages do 2. leave the wrapper struct as it is (just add the required range foo), and require the user to use either a) the range API (with utf-8 decoding etc.) or b) access the raw "byte" string with string.raw. I really liked how strings were simply char[]'s, but now with immutable, there's a lot of noise around it anyway, and there's no real value to strings being array slices anymore. Making the user deal directly with utf-8 was probably a bad idea to begin with.
Feb 03 2010
Am 04.02.2010, 04:05 Uhr, schrieb grauzone <none example.net>:Andrei Alexandrescu wrote:Definitely against (c)+(d).What can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome.Change the type of string literals from char[] (or whatever the string type is in D2) to a wrapper struct defined in object.d: struct string { char[] raw; }That sounds like a really reasonable way to me.
Feb 04 2010
grauzone <none example.net> wrote:Change the type of string literals from char[] (or whatever the string type is in D2) to a wrapper struct defined in object.d: struct string { char[] raw; } Now string.length is invalid, and you don't have to do weird stuff as in (b) or (c). From here on, you could do 2 things: 1. add accessor methods to string like string classes in other languages do 2. leave the wrapper struct as it is (just add the required range foo), and require the user to use either a) the range API (with utf-8 decoding etc.) or b) access the raw "byte" string with string.raw.This seems to me a job for @disable:

struct string
{
    immutable(char)[] payload;
    alias payload this;

    @disable int length() { return 0; }
}

-- Simen
Feb 04 2010
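The wrapper idea discussed above, keeping the raw code units reachable but refusing the ambiguous length query, can be mimicked dynamically. A rough Python analogue of the concept (the class and method names here are hypothetical, and a runtime error stands in for D's compile-time @disable):

```python
class String:
    """Wraps raw UTF-8 code units; iteration yields code points,
    and the ambiguous len() is deliberately refused."""

    def __init__(self, raw: bytes):
        self.raw = raw  # the underlying code units stay accessible

    def __iter__(self):
        # Iterate by decoded code point, like the proposed range API.
        return iter(self.raw.decode("utf-8"))

    def __len__(self):
        raise TypeError("ambiguous: use len(s.raw) or s.walk_length()")

    def walk_length(self) -> int:
        return len(self.raw.decode("utf-8"))

s = String("héllo".encode("utf-8"))
assert len(s.raw) == 6       # code units, explicitly requested
assert s.walk_length() == 5  # code points
```

The design point is the same as grauzone's and Simen's: the caller must say which of the two lengths is meant.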
Andrei Alexandrescu wrote:It's no secret that string et al. are not a magic recipe for writing correct Unicode code.I'm concerned it would be slow. Most operations on strings do not need to decode the unicode characters, for example, find, startsWith, etc., do not. Decoding then doing find, startsWith, etc., will be considerably slower.
Feb 03 2010
Walter Bright wrote:Andrei Alexandrescu wrote:I thought you were going to say that, but fortunately it's easy to special-case certain algorithms for strings during compilation. In fact I already did - for example, Boyer-Moore searching would be very difficult to rewrite for variable-length characters, but there's no need for it. I special-cased that algorithm. I believe this is a good strategy. AndreiIt's no secret that string et al. are not a magic recipe for writing correct Unicode code.I'm concerned it would be slow. Most operations on strings do not need to decode the unicode characters, for example, find, startsWith, etc., do not. Decoding then doing find, startsWith, etc., will be considerably slower.
Feb 03 2010
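Walter's observation that find and startsWith need no decoding holds because UTF-8 is self-synchronizing: a valid encoded sequence never begins in the middle of another character, so a raw byte-level search finds exactly the code-point-level matches. A Python illustration of that property:

```python
haystack = "naïve café".encode("utf-8")
needle = "café".encode("utf-8")

# bytes.find is a raw byte search, no decoding involved...
pos = haystack.find(needle)
assert pos != -1

# ...yet the match always lands on a character boundary: continuation
# bytes (0x80-0xBF) can never be mistaken for the lead byte that
# starts the needle, so the prefix before the match is valid UTF-8.
assert haystack[:pos].decode("utf-8") == "naïve "
```

This is why same-width searching can be special-cased to plain code-unit comparison with no correctness loss.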
Walter Bright wrote:Andrei Alexandrescu wrote:Oh, one more thing: doing mixed-width searches would require decoding. AndreiIt's no secret that string et al. are not a magic recipe for writing correct Unicode code.I'm concerned it would be slow. Most operations on strings do not need to decode the unicode characters, for example, find, startsWith, etc., do not. Decoding then doing find, startsWith, etc., will be considerably slower.
Feb 03 2010
Andrei Alexandrescu wrote:Oh, one more thing: doing mixed-width searches would require decoding.Or a conversion before the loop starts of the search term.
Feb 03 2010
Walter Bright wrote:Andrei Alexandrescu wrote:That triggers memory allocation plus the same cost. It's not likely to work any better. AndreiOh, one more thing: doing mixed-width searches would require decoding.Or a conversion before the loop starts of the search term.
Feb 03 2010
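Walter's suggestion, transcoding the search term once before the loop so the inner search runs at a single width, might look like the following Python sketch (the function name is hypothetical; Andrei's objection is the temporary allocation the conversion implies):

```python
def find_mixed(haystack_utf8: bytes, needle_utf16: bytes) -> int:
    # One-time conversion of the needle; this allocates a temporary copy,
    # which is the cost Andrei objects to...
    needle_utf8 = needle_utf16.decode("utf-16-le").encode("utf-8")
    # ...after which the inner loop is a plain same-width byte search.
    return haystack_utf8.find(needle_utf8)

haystack = "smörgåsbord".encode("utf-8")
needle = "gås".encode("utf-16-le")
assert find_mixed(haystack, needle) == 5
```

The trade-off is one allocation plus a linear transcode of the (usually short) needle versus decoding the (usually long) haystack on every comparison.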
Andrei Alexandrescu Wrote:Walter Bright wrote:Exactly. Please don't go down the route that Microsoft did for regex case insensitive searching (at least when I last looked at it) - they made the *input string* all lower case... D'oh! Regards, BenAndrei Alexandrescu wrote:That triggers memory allocation plus the same cost. It's not likely to work any better. AndreiOh, one more thing: doing mixed-width searches would require decoding.Or a conversion before the loop starts of the search term.
Feb 04 2010
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s articleIt's no secret that string et al. are not a magic recipe for writing correct Unicode code. However, things are pretty good and could be further improved by operating the following changes in std.array and std.range: - make front() and back() for UTF-8 and UTF-16 automatically decode the first and last Unicode character - make popFront() and popBack() skip one entire Unicode character (instead of just one code unit) - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings - change hasLength to return false for UTF-8 and UTF-16 strings These changes effectively make UTF-8 and UTF-16 bidirectional ranges, with the quirk that you still have a sort of a random-access operator. I'm very strongly in favor of this change. Bidirectional strings allow beautiful correct algorithms to be written that handle encoded strings without any additional effort; with these changes, everything applicable of std.algorithm works out of the box (with the appropriate fixes here and there), which is really remarkable. The remaining WTF is the length property. Traditionally, a range offering length also implies the expectation that a range of length n allows you to call popFront n times and then assert that the range is empty. However, if you check e.g. hasLength!string it will yield false, although the string does have an accessible member by that name and of the appropriate type. Although Phobos always checks its assumptions, people might occasionally write code that just uses .length without checking hasLength. Then, they'll be annoyed when the code fails with UTF-8 and UTF-16 strings. (The "real" length of the range is not stored, but can be computed by using str.walkLength() in std.range.) What can be done about that? I see a number of solutions: (a) Do not operate the change at all. 
(b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome. AndreiI personally would find this extremely annoying because most of the code I write that involves strings is scientific computing code that will never be internationalized, let alone released to the general public. I basically just use ASCII because it's all I need and if your UTF-8 string contains only ASCII characters, it can be treated as random-access. I don't know how many people out there are in similar situations, but I doubt they'll be too happy. On the other hand, I guess it wouldn't be hard to write a simple wrapper struct on top of immutable(ubyte)[] and call it AsciiString. Once alias this gets fully debugged, I could even make it implicitly convert to immutable(char)[].
Feb 03 2010
dsimcha wrote:== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s articleIt's definitely going to be easy to use all sensible algorithms with immutable(ubyte)[]. But even if you go with string, there should be no problem at all. Remember, telling ASCII from UTF is one mask and one test away, and the way Walter and I wrote virtually all related routines was to special-case ASCII. In most cases I don't think you'll notice a decrease in performance. AndreiIt's no secret that string et al. are not a magic recipe for writing correct Unicode code. However, things are pretty good and could be further improved by operating the following changes in std.array and std.range: - make front() and back() for UTF-8 and UTF-16 automatically decode the first and last Unicode character - make popFront() and popBack() skip one entire Unicode character (instead of just one code unit) - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings - change hasLength to return false for UTF-8 and UTF-16 strings These changes effectively make UTF-8 and UTF-16 bidirectional ranges, with the quirk that you still have a sort of a random-access operator. I'm very strongly in favor of this change. Bidirectional strings allow beautiful correct algorithms to be written that handle encoded strings without any additional effort; with these changes, everything applicable of std.algorithm works out of the box (with the appropriate fixes here and there), which is really remarkable. The remaining WTF is the length property. Traditionally, a range offering length also implies the expectation that a range of length n allows you to call popFront n times and then assert that the range is empty. However, if you check e.g. hasLength!string it will yield false, although the string does have an accessible member by that name and of the appropriate type. 
Although Phobos always checks its assumptions, people might occasionally write code that just uses .length without checking hasLength. Then, they'll be annoyed when the code fails with UTF-8 and UTF-16 strings. (The "real" length of the range is not stored, but can be computed by using str.walkLength() in std.range.) What can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome. AndreiI personally would find this extremely annoying because most of the code I write that involves strings is scientific computing code that will never be internationalized, let alone released to the general public. I basically just use ASCII because it's all I need and if your UTF-8 string contains only ASCII characters, it can be treated as random-access. I don't know how many people out there are in similar situations, but I doubt they'll be too happy. On the other hand, I guess it wouldn't be hard to write a simple wrapper struct on top of immutable(ubyte)[] and call it AsciiString. Once alias this gets fully debugged, I could even make it implicitly convert to immutable(char)[].
Feb 03 2010
On Wed, 03 Feb 2010 23:41:02 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:dsimcha wrote:I'm in the same camp as dsimcha, I generally write all my apps assuming ASCII strings (most are internal tools anyways). Can the compiler help making ASCII strings easier to use? i.e., this already works: wstring s = "hello"; // converts to immutable(wchar)[] what about this? asciistring a = "hello"; // converts to immutable(ubyte)[] (or immutable(ASCIIChar)[]) asciistring a = "\uFBCD"; // error, requires cast. The only issue that remains to be resolved then is the upgradability that ascii characters currently enjoy for utf8. I.e. I can call any utf-8 accepting function with an ASCII string, but not an ASCII string accepting function with utf-8 data. Ideally, there should be a 7-bit ASCII character type that implicitly upconverts to char, and can be initialized with a string literal. In addition, you are putting D's utf8 char even further away from C's ASCII char. It would be nice to separate compatible C strings from d strings. At some point, I should be able to designate a function (even a C function) takes only ASCII data, and the compiler should disallow passing general utf8 data into it. This involves either renaming D's char to keep source closer to C, or rewriting C function signatures to reflect the difference. -SteveI personally would find this extremely annoying because most of the code I write that involves strings is scientific computing code that will never be internationalized, let alone released to the general public. I basically just use ASCII because it's all I need and if your UTF-8 string contains only ASCII characters, it can be treated as random-access. I don't know how many people out there are in similar situations, but I doubt they'll be too happy. On the other hand, I guess it wouldn't be hard to write a simple wrapper struct on top of immutable(ubyte)[] and call it AsciiString. 
Once alias this gets fully debugged, I could even make it implicitly convert to immutable(char)[].It's definitely going to be easy to use all sensible algorithms with immutable(ubyte)[]. But even if you go with string, there should be no problem at all. Remember, telling ASCII from UTF is one mask and one test away, and the way Walter and I wrote virtually all related routines was to special-case ASCII. In most cases I don't think you'll notice a decrease in performance.
Feb 08 2010
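The ASCII fast path Andrei mentions ("one mask and one test away") is indeed cheap. A sketch of the idea in Python, with hypothetical helper names; in D the same check would be a bitwise test on the char code unit, assuming well-formed UTF-8 input:

```python
def is_ascii_unit(b: int) -> bool:
    # ASCII code units have the high bit clear: one mask, one test.
    return (b & 0x80) == 0

def next_char_fast(data: bytes, i: int):
    """Return (code_point, next_index); the ASCII path skips decoding."""
    b = data[i]
    if is_ascii_unit(b):
        return chr(b), i + 1  # fast path: the byte is the character
    # Slow path: find the end of the multi-byte sequence, then decode.
    # Continuation bytes match the pattern 10xxxxxx.
    j = i + 1
    while j < len(data) and (data[j] & 0xC0) == 0x80:
        j += 1
    return data[i:j].decode("utf-8"), j

assert next_char_fast(b"abc", 0) == ("a", 1)
assert next_char_fast("é!".encode("utf-8"), 0) == ("é", 2)
```

For mostly-ASCII scientific data of the kind dsimcha describes, nearly every iteration takes the first branch, which is why the measured slowdown can stay small.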
On 2010-02-03 21:00:21 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:It's no secret that string et al. are not a magic recipe for writing correct Unicode code. [...] What would you do? Any ideas are welcome.UTF-8 and UTF-16 encodings are interesting beasts. If you have a UTF-8 string and want to search for an occurrence of that string in another UTF-8 string, you don't have to decode each multi-byte code point: a binary comparison is enough. If you're counting the number of code points, all you need is to count the code units that are not continuation bytes (i.e., not of the form 10xxxxxx). If on the other hand you're applying a character-by-character transformation, then you need to fully decode each character, unless you're only interested in transforming characters from the lower non-multibyte subrange of the encoding (which happens quite often).

Clearly, I don't think there's a one-size-fits-all way to iterate over string arrays. Fully decoding each code unit is clearly the most costly method; it shouldn't be required when it's not necessary. I think we need to be able to represent char[] and wchar[] as a range of dchar to deal with cases where you want to iterate over Unicode code points, but I'd let the programmer ultimately decide what to do.

As for .length, I'll say that removing this property would make it hard to write low-level code. For instance, if I copy a string into a buffer, I need to know the length in bytes (array.length * sizeof(array[0])), not the number of characters. So it doesn't make much sense to disable .length.

So my answer would be mostly to leave things as they are. Perhaps the char[] and wchar[] as dchar ranges could be aliased to string and wstring, but that'd definitely be a blow to the philosophy of strings as simple arrays. You'd also still need to be able to access the actual array underneath. And will all the implicit conversions still work? I'm really not sure it's worth it, but perhaps.
-- Michel Fortin michel.fortin michelf.com http://michelf.com/
Feb 03 2010
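Michel's point about counting code points without decoding rests on a structural fact of UTF-8: every code point contributes exactly one byte that is not a continuation byte. A minimal Python sketch of that count, assuming well-formed input:

```python
def count_code_points(data: bytes) -> int:
    # Continuation bytes match the pattern 10xxxxxx (0x80-0xBF);
    # each code point has exactly one byte that is NOT of that form
    # (either an ASCII byte or a multi-byte lead byte).
    return sum(1 for b in data if (b & 0xC0) != 0x80)

assert count_code_points("hello".encode("utf-8")) == 5
assert count_code_points("héllo".encode("utf-8")) == 5  # é is 2 bytes
assert count_code_points("漢字".encode("utf-8")) == 2   # 3 bytes each
```

This is a single linear pass with one mask and one compare per byte, with no branching on sequence length, so it is far cheaper than full decoding.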
Michel Fortin wrote:... As for .length, I'll say that removing this property would make it hard to write low-level code. For instance, if I copy a string into a buffer, I need to know the length in bytes (array.length * sizeof(array[0])), not the number of characters. So it doesn't make much sense to disable .length. ...This would be under the condition that there is another property that does the same (or similar) thing. It's the part where he said, "and define a different name for that". You could also just reinterpret them as byte[] or ubyte[]. Then they will behave in a low level way. No surprises.
Feb 04 2010
Andrei Alexandrescu wrote:- make front() and back() for UTF-8 and UTF-16 automatically decode the first and last Unicode character - make popFront() and popBack() skip one entire Unicode character (instead of just one code unit) - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings - change hasLength to return false for UTF-8 and UTF-16 stringsThese are all fine for a dedicated string type. They're horrible for generic arrays, for the following reasons: - They break generic code. - They make it impossible to manipulate an array of code units as an array of code units. -- Rainer Deyke - rainerd eldwood.com
Feb 03 2010
Rainer Deyke wrote:Andrei Alexandrescu wrote:Arrays of char and wchar are not quite generic - they are definitely UTF strings. Andrei- make front() and back() for UTF-8 and UTF-16 automatically decode the first and last Unicode character - make popFront() and popBack() skip one entire Unicode character (instead of just one code unit) - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings - change hasLength to return false for UTF-8 and UTF-16 stringsThese are all fine for a dedicated string type. They're horrible for generic arrays, for the following reasons: - They break generic code. - They make it impossible to manipulate an array of code units as an array of code units.
Feb 03 2010
Andrei Alexandrescu wrote:Arrays of char and wchar are not quite generic - they are definitely UTF strings.A 'char' is a single utf-8 code unit. A 'char[]' is (or should be) a generic array of utf-8 code units. Sometimes these code units line up to form valid unicode code points, sometimes they don't. If you want a data type that always contains a valid utf-8 string, don't call it 'char[]'. It's misleading, it breaks generic code, and it renders built-in arrays useless for when you actually want an array of utf-8 code units. It's the same mistake as std::vector<bool> in C++, but much worse. -- Rainer Deyke - rainerd eldwood.com
Feb 03 2010
Rainer Deyke wrote:Andrei Alexandrescu wrote:I agree up to the assessment of the size of the problem and a couple of other points. I've had a great time writing utf code in D with string. Getting back to C++'s std::string really put things in perspective. If your purpose is to store some disparate utf-8 code units (a need that I've never had), I see no problem with storing them as ubyte[]. AndreiArrays of char and wchar are not quite generic - they are definitely UTF strings.A 'char' is a single utf-8 code unit. A 'char[]' is (or should be) a generic array of utf-8 code units. Sometimes these code units line up to form valid unicode code points, sometimes they don't. If you want a data type that always contains a valid utf-8 string, don't call it 'char[]'. It's misleading, it breaks generic code, and it renders built-in arrays useless for when you actually want an array of utf-8 code units. It's the same mistake as std::vector<bool> in C++, but much worse.
Feb 03 2010
Andrei Alexandrescu wrote:It's no secret that string et al. are not a magic recipe for writing correct Unicode code. However, things are pretty good and could be further improved by operating the following changes in std.array and std.range: - make front() and back() for UTF-8 and UTF-16 automatically decode the first and last Unicode characterThey would yield dchar, right? Wouldn't that cause trouble in templated code?- make popFront() and popBack() skip one entire Unicode character (instead of just one code unit)That's perfectly fine, because the opposite operations do "encode": string s = "aÄŸ"; assert(s.length == 3); s ~= 'ÅŸ'; assert(s.length == 5);- alter isRandomAccessRange to return false for UTF-8 and UTF-16 stringsOk.- change hasLength to return false for UTF-8 and UTF-16 stringsI don't understand that one. strings have lengths. Adding and removing does not alter length by 1 for those types. I don't think it's a big deal. It is already so in the language for those types. dstring does not have that problem and could be used when by-1 change is desired.(b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count".The change sounds ok and hasLength should yield true. Or... can it return an enum { no, kind_of, yes } ;) Current utf.decode takes the index by reference and modifies it by the amount. Could popFront() do something similar? I think that's it: front() and popFront() are separated for cohesion. What is causing trouble here is the separation of "by-N" from popFront(). You are concerned that the user makes the assumption and popFront() will reduce by 1. I think that is the problem here. How about something like: // returns the amount that the next popFront() will reduce length int nextStep(); Ali
Feb 03 2010
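Ali's numbers can be checked directly; the sketch below assumes std.range's walkLength, which counts a range's elements by walking it:

```d
import std.range : walkLength;

void main()
{
    string s = "aÄŸ";
    assert(s.length == 3);      // .length counts code units (bytes)
    assert(s.walkLength == 2);  // element count: an O(n) walk over code points
    s ~= 'ÅŸ';                   // appending a dchar encodes it (2 bytes here)
    assert(s.length == 5);
    assert(s.walkLength == 3);
}
```

This is exactly the mismatch driving the hasLength debate: .length exists and is O(1), but it does not equal the number of elements popFront will yield.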
Ali Çehreli wrote:Andrei Alexandrescu wrote: > It's no secret that string et al. are not a magic recipe for writing > correct Unicode code. However, things are pretty good and could be > further improved by operating the following changes in std.array and > std.range: > > - make front() and back() for UTF-8 and UTF-16 automatically decode the > first and last Unicode character They would yield dchar, right? Wouldn't that cause trouble in templated code?Yes, dchar. There was some figuring out in parts of Phobos, but the gains are well worth it. The simplifications are enormous. Until now, Phobos didn't hit the nail on the head with simple encoding/decoding/transcoding primitives. There were many attempts in std.utf, std.encoding, and std.string - all very clunky to use. Now I can just write s.front to get the first dchar of any string, and s.popFront to drop it. Very simple!> - make popFront() and popBack() skip one entire Unicode character > (instead of just one code unit) That's perfectly fine, because the opposite operations do "encode": string s = "ağ"; assert(s.length == 3); s ~= 'ş'; assert(s.length == 5); > - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings Ok. > - change hasLength to return false for UTF-8 and UTF-16 strings I don't understand that one. strings have lengths. Adding and removing does not alter length by 1 for those types. I don't think it's a big deal. It is already so in the language for those types. dstring does not have that problem and could be used when by-1 change is desired.hasLength is a property used by range algorithms to tell them that a range stores the length with a particular meaning (the number of elements). It is perfectly fine that strings don't obey hasLength but do expose .length - it's just that it has different semantics.> (b) Operate the change and mention that in range algorithms you should > check hasLength and only then use "length" under the assumption that it > really means "elements count". 
The change sounds ok and hasLength should yield true. Or... can it return an enum { no, kind_of, yes } ;) Current utf.decode takes the index by reference and modifies it by the amount. Could popFront() do something similar?I think we could dedicate a special function for that. In fact it does exist I think - it's called stride(). Andrei
Feb 03 2010
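The primitives under discussion compose as follows; a sketch using std.utf.stride to query the per-step code-unit count that popFront will consume:

```d
import std.range : empty, front, popFront;
import std.utf : stride;

void main()
{
    string s = "aÄŸc";
    assert(s.front == 'a');    // front decodes a whole dchar
    assert(stride(s, 0) == 1); // the next popFront will drop 1 code unit
    s.popFront();
    assert(s.front == 'ÄŸ');
    assert(stride(s, 0) == 2); // this character occupies 2 code units
    s.popFront();              // drops both bytes at once
    assert(s.front == 'c');
    s.popFront();
    assert(s.empty);
}
```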
Ali Çehreli wrote:... The change sounds ok and hasLength should yield true. Or... can it return an enum { no, kind_of, yes } ;) ... AliI believe what you're after is this: enum Bool { True, False, FileNotFound }
Feb 04 2010
I'd like D to give the right results and not to be bug-prone by default (even if it's a bit slower), and to do something more optimized and faster when I want it and I know what I am doing. So I like the idea of making strings more correct. I liked the idea of D strings acting like normal arrays, but they are different data structures, so it's better to accept them as something different, even if similar (especially UTF32 ones).

So a specific struct, named something like String, can be used to represent a string. String literals too return such a struct. In the C language strlen() is O(n), so people can learn that finding the length of most strings is O(n) in D too (UTF32 strings today are not common. If D succeeds, in the future UTF32 strings may become the only strings used. Such things have happened many times in computer history).

The String structs can cache inside themselves the length and hash value the first time such values are computed (so each String is a struct 4 words long). Some time ago you (Andrei) told me that you don't like a function like len() to have a different computational complexity, like O(n) or O(1), according to the input. But a simple possibility is for the "foo".length property to return the length of the walk over all chars, which is O(n) the first time you call it (it's O(1) on UTF32 strings).

Some algorithms or some parts of code (and string literals, etc.) can know the length of the string they return or use, so they can put this length value inside the String's cached value and there's never a need to compute it.

The struct String that represents the string supports the [] operator too (there can be a MutableString too, that is not used often), but it usually walks the string to find the right character to return. Then a method like "foo".ubyte(i) can be used to return the i-th ubyte of the string. 
So initially this [] is done with just a O(n) walk, later some performance optimization can be added, like a simple skip tree based on indexes. Bye, bearophile
Feb 04 2010
The String structs can cache inside themselves the length and hash value the first time such values are computed (so each String is a struct 4 words long).So it's not really immutable. It contains an immutable dynamic array that can be called "str", plus two mutable words. To access the underlying 8/16/32 bit units it's composed of, you can use the [] of the str attribute: "foo".str[i] Bye, bearophile
Feb 04 2010
bearophile wrote:Time ago you (Andrei) told me that you don't like a function like len() to have a different computational complexity, like O(n) or O(1), according to the input. This reminds me of an excellent talk by Matt Austern on STL's singly-linked lists. One of the interesting points of the design was around the length of the singly-linked list. In the end, he decides not to provide one. I think his point was that getting the length of the data structure was not one of the main operations of a singly-linked list. I agree that the users who really need length can wrap it in a struct that stores the length. Although there doesn't seem to be a slide dedicated to that point, the presentation is here: http://www.accu-usa.org/Slides/SinglyLinkedLists.ppt Ali
Feb 04 2010
Andrei Alexandrescu Wrote:It's no secret that string et al. are not a magic recipe for writing correct Unicode code. However, things are pretty good and could be further improved by operating the following changes in std.array and std.range: These changes effectively make UTF-8 and UTF-16 bidirectional ranges, with the quirk that you still have a sort of a random-access operator. I'm very strongly in favor of this change. Bidirectional strings allow beautiful correct algorithms to be written that handle encoded strings without any additional effort; with these changes, everything applicable in std.algorithm works out of the box (with the appropriate fixes here and there), which is really remarkable. The remaining WTF is the length property. Traditionally, a range offering length also implies the expectation that a range of length n allows you to call popFront n times and then assert that the range is empty. However, if you check e.g. hasLength!string it will yield false, although the string does have an accessible member by that name and of the appropriate type. Although Phobos always checks its assumptions, people might occasionally write code that just uses .length without checking hasLength. Then, they'll be annoyed when the code fails with UTF-8 and UTF-16 strings. (The "real" length of the range is not stored, but can be computed by using str.walkLength() in std.range.) What can be done about that? I see a number of solutions:The underlying array of byte-sized data fragments is an implementation detail. hasLength is a kludge. Follow good OO design and hide the implementation details from the standard interface! I would use a struct for UTF8 and UTF16 strings, and add a method to get the raw array. That allows simple, compiler-enforced usage while still allowing special casing to use raw data. As an added bonus, this method can generalize for other variable-width range elements.
Feb 04 2010
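A minimal sketch of the struct the post describes; the names UtfString, raw and count are hypothetical, not an actual Phobos type:

```d
import std.range : walkLength;

// Hypothetical wrapper: the code-unit array is an implementation
// detail, reachable only through an explicit accessor.
struct UtfString
{
    private string data;

    // Escape hatch for code that really wants raw code units.
    string raw() const { return data; }

    // The public notion of length is the element count, even
    // though computing it costs an O(n) walk.
    size_t count() const { return data.walkLength; }
}

void main()
{
    auto s = UtfString("aÄŸ");
    assert(s.raw.length == 3); // code units, via the explicit accessor
    assert(s.count == 2);      // code points, the interface-level length
}
```

With .length hidden, generic algorithms can no longer accidentally treat the code-unit count as the element count, which is the "compiler-enforced usage" the poster is after.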
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:What can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome.Of the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn. -- Simen
Feb 04 2010
Simen kjaeraas:Of the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn.Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-) Bye, bearophile
Feb 04 2010
bearophile wrote:Simen kjaeraas:I am ready to throw away the implementation as soon as a better idea comes around. As other times, I operated the change to see how things feel with the new approach. Generally it feels like the new state of affairs is a solid improvement. One recurring problem has been that some code has assumed that ElementType!SomeString has the width of one encoding unit. That assumption is no longer true so I had to change such code with typeof(SomeString.init[0]). Probably I'll abstract that as CodeUnit!SomeString in std.traits. I also found some bugs; for example Levenshtein distance was erroneous because it didn't operate at character level. The fix using front and popFront was very simple. Regarding defining an entire new struct for strings, I think that's a sensible approach. With the new operators in tow, UString (universal string) that traffics in dchar and makes representation a detail would be nicely implementable. It could even have mutable elements at dchar granularity. My feeling is, however, that at this point too much toothpaste is out of the tube for that to happen in D2. That would be offset if current strings were unbearable, but I think they're working very well. AndreiOf the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn.Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-)
Feb 04 2010
On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:bearophile wrote:Simen kjaeraas:I am ready to throw away the implementation as soon as a better idea comes around. As other times, I operated the change to see how things feel with the new approach.Of the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn.Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-)Has any thought been given to foreach? Currently all these work for strings: foreach (c; "abc") { } // typeof(c) is 'char' foreach (char c; "abc") { } foreach (wchar c; "abc") { } foreach (dchar c; "abc") { } I'm concerned about the first case where the element type is implicit. The implicit element type is (currently) the code units. If the range uses code points 'dchar' as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (in the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should be allowed to iterate using 'char' and 'wchar' too. I think this would fit nicely. I was surprised at first when learning D and I noticed that foreach didn't do this, that I had to explicitly ask for it. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Feb 04 2010
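The inconsistency Michel points out is easy to exhibit: the implicit element type iterates code units, while an explicit dchar decodes code points.

```d
void main()
{
    string s = "aÄŸ";

    size_t units;
    foreach (c; s)       // implicit element type: char, i.e. code units
        ++units;
    assert(units == 3);

    size_t points;
    foreach (dchar c; s) // explicit dchar: decodes code points
        ++points;
    assert(points == 2);
}
```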
Michel Fortin wrote:On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:This is a good point. I'm in favor of changing the language to make the implicit type dchar. Andreibearophile wrote:Has any thought been given to foreach? Currently all these work for strings: foreach (c; "abc") { } // typeof(c) is 'char' foreach (char c; "abc") { } foreach (wchar c; "abc") { } foreach (dchar c; "abc") { } I'm concerned about the first case where the element type is implicit. The implicit element type is (currently) the code units. If the range use code points 'dchar' as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (in the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should be allowed to iterate using 'char' and 'wchar' too. I think this would fit nicely. I was surprised at first when learning D and I noticed that foreach didn't do this, that I had to explicitly has for it.Simen kjaeraas:I am ready to throw away the implementation as soon as a better idea comes around. As other times, I operated the change to see how things feel with the new approach.Of the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn.Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-)
Feb 04 2010
Andrei Alexandrescu wrote:Michel Fortin wrote:We seem to be approaching the point where char[], wchar[] and dchar[] are all arrays of dchar, but with different levels of compression. It makes me wonder if the char, wchar types actually make any sense. If char[] is actually a UTF string, then char[] ~ char should be permitted ONLY if char can be implicitly converted to dchar. Otherwise, you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will not necessarily result in a valid unicode string. I suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:This is a good point. I'm in favor of changing the language to make the implicit type dchar. Andreibearophile wrote:Has any thought been given to foreach? Currently all these work for strings: foreach (c; "abc") { } // typeof(c) is 'char' foreach (char c; "abc") { } foreach (wchar c; "abc") { } foreach (dchar c; "abc") { } I'm concerned about the first case where the element type is implicit. The implicit element type is (currently) the code units. If the range use code points 'dchar' as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (in the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should be allowed to iterate using 'char' and 'wchar' too. I think this would fit nicely. I was surprised at first when learning D and I noticed that foreach didn't do this, that I had to explicitly has for it.Simen kjaeraas:I am ready to throw away the implementation as soon as a better idea comes around. 
As other times, I operated the change to see how things feel with the new approach.Of the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn.Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-)
Feb 04 2010
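Don's hazard can be demonstrated with std.utf.validate: appending a lone lead byte type-checks under array-of-code-units semantics, but leaves the array holding invalid UTF-8.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    char[] s = "abc".dup;
    char c = 0xE0; // a UTF-8 lead byte with no trailing bytes
    s ~= c;        // compiles fine: just appending a code unit
    assertThrown!UTFException(validate(s)); // but it is no longer valid UTF-8
}
```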
Don Wrote:We seem to be approaching the point where char[], wchar[] and dchar[] are all arrays of dchar, but with different levels of compression. It makes me wonder if the char, wchar types actually make any sense. If char[] is actually a UTF string, then char[] ~ char should be permitted ONLY if char can be implicitly converted to dchar. Otherwise, you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will not necessarily result in a valid unicode string.Well, if you're working with a LOT of text, you may be mmapping GB's of UTF-8 text. Yes, this does happen. You better be able to handle it in a sane manner, i.e. not reallocating the memory to read the data in. So, there is a definite need for casting to array of char, and dealing with the inevitable stray non-unicode char in that mess. Real-world text processing can be a messy affair. It probably requires walking such an array and returning slices cast to char after they've been validated.
Feb 04 2010
== Quote from Jerry Quinn (jlquinn optonline.net)'s articleDon Wrote:text. Yes, this does happen. You better be able to handle it in a sane manner, i.e. not reallocating the memory to read the data in. So, there is a definite need for casting to array of char, and dealing with the inevitable stray non-unicode char in that mess. Welcome to the world of DNA sequence manipulation.We seem to be approaching the point where char[], wchar[] and dchar[] are all arrays of dchar, but with different levels of compression. It makes me wonder if the char, wchar types actually make any sense. If char[] is actually a UTF string, then char[] ~ char should be permitted ONLY if char can be implicitly converted to dchar. Otherwise, you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will not necessarily result in a valid unicode string.Well, if you're working with a LOT of text, you may be mmapping GB's of UTF-8
Feb 04 2010
Don wrote:I suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.If it's not too late to completely change the semantics of char[], then it's also not too late to dump 'char' completely. If it /is/ too late to remove 'char', then 'char[]' should retain the current semantics and a new string type should be added for the new semantics. -- Rainer Deyke - rainerd eldwood.com
Feb 04 2010
Rainer Deyke wrote:Don wrote:One idea I've had for a while was to have a universal string type: struct UString { union { char[] utf8; wchar[] utf16; dchar[] utf32; } enum Discriminator { utf8, utf16, utf32 }; Discriminator kind; IntervalTree!(size_t) skip; ... } The IntervalTree stores the skip amounts that must be added for a given index in the string. For ASCII strings that would be null. Then its size grows with the number of multibyte characters. Beyond a threshold, representation is transparently switched to utf16 or utf32 as needed and the tree becomes smaller or null again. In an advanced implementation the discriminator and the tree could be stored at negative offset, and the tree could be compressed taking advantage of its limited size. That would make UString quite low-overhead while offering a staunchly dchar-based interface. I don't mind at all using string, but I also think UString would be a good extra abstraction. AndreiI suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.If it's not too late to completely change the semantics of char[], then it's also not too late to dump 'char' completely. If it /is/ too late to remove 'char', then 'char[]' should retain the current semantics and a new string type should be added for the new semantics.
Feb 04 2010
Andrei Alexandrescu wrote:Rainer Deyke wrote:You mean like this? http://www.dprogramming.com/mtext.phpDon wrote:One idea I've had for a while was to have a universal string type: struct UString { union { char[] utf8; wchar[] utf16; dchar[] utf32; } enum Discriminator { utf8, utf16, utf32 }; Discriminator kind; IntervalTree!(size_t) skip; ... }I suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.If it's not too late to completely change the semantics of char[], then it's also not too late to dump 'char' completely. If it /is/ too late to remove 'char', then 'char[]' should retain the current semantics and a new string type should be added for the new semantics.
Feb 04 2010
Andrei Alexandrescu wrote:One idea I've had for a while was to have a universal string type: struct UString { union { char[] utf8; wchar[] utf16; dchar[] utf32; } enum Discriminator { utf8, utf16, utf32 }; Discriminator kind; IntervalTree!(size_t) skip; ... } The IntervalTree stores the skip amounts that must be added for a given index in the string. For ASCII strings that would be null. Then its size grows with the number of multibyte characters. Beyond a threshold, representation is transparently switched to utf16 or utf32 as needed and the tree becomes smaller or null again.Although I see some potential in a universal string type, I don't think this is the right implementation strategy. I'd rather have my short strings in utf-32 (optimized for speed) and my long strings in utf-8/utf-16 (optimized for memory usage). -- Rainer Deyke - rainerd eldwood.com
Feb 04 2010
Rainer Deyke wrote:Andrei Alexandrescu wrote:The definition I outlined does not specify or constrain the strategy of changing the discriminator. AndreiOne idea I've had for a while was to have a universal string type: struct UString { union { char[] utf8; wchar[] utf16; dchar[] utf32; } enum Discriminator { utf8, utf16, utf32 }; Discriminator kind; IntervalTree!(size_t) skip; ... } The IntervalTree stores the skip amounts that must be added for a given index in the string. For ASCII strings that would be null. Then its size grows with the number of multibyte characters. Beyond a threshold, representation is transparently switched to utf16 or utf32 as needed and the tree becomes smaller or null again.Although I see some potential in a universal string type, I don't think this is the right implementation strategy. I'd rather have my short strings in utf-32 (optimized for speed) and my long strings in utf-8/utf-16 (optimized for memory usage).
Feb 04 2010
On Thu, 04 Feb 2010 18:41:48 -0700, Rainer Deyke wrote:Andrei Alexandrescu wrote:Firstly, for such "augmented types" in D, such as strings, bignums or any future ideas, it is great to have facilities for creating them as structs, so that they can be used elsewhere without regard to whether they are built in as compiler specials or provided in the library. What is there for struct now is good and getting better in D2, but I still feel a little insecure about understanding how to make a really optimal implementation that is as good as a built-in type that the compiler understands. TDPL has been a help for this.

Programmers will want to use raw char[], wchar[] and dchar[] for whatever reasons, with their "simple" behaviours, so those should not be made unavailable purely because more sophisticated types for Unicode strings can be created.

I have made a UString implementation, similar to the above, but I played a different trick. I wanted this to also maintain a terminating null char for passing to Windows API functions, in particular the 16-bit W interfaces.

struct UString_char {
    char[] str_;
    /// ... lots of good D type stuff, constructor and assign conversions, access
    size_t length() {
        return str_.length - 1; // hide terminating null
    }
}

struct UString_wchar {
    wchar[] str_; /// ditto D type stuff
}

struct UString_dchar {
    dchar[] str_;
}

// throw in void[] for charity (although no one will need it)
struct UString_void {
    void[] ptr_;
}

enum UStringType { UC_CHAR, UC_WCHAR, UC_DCHAR }

struct UString {
    union {
        UString_void vstr;
        UString_char cstr;
        UString_wchar wstr;
        UString_dchar dstr;
    }
    UStringType ztype; // tracks what we are
}

I could then choose individual components by themselves, where appropriate, and even get them working. In D2 the str_ array cannot be immutable while appending or fiddling with the null terminator. I did not get UString working as an associative array key; have not tried since. 
Also made a class version called VString containing the union. There are a lot of issues. I also must acknowledge the prior art of the mtext code, and its MString structure type. I was partly inspired by seeing this, and by how complex it was to do nearly everything. When I last checked mtext it was kind of broken for recent D1 and D2 compilers, and I did not want to fix it. I admit I did not like the complexity of the direct union { char[], wchar[], dchar[] }. Splitting it up into separately usable structs seems to me to give three times the potential for the same price. The advantage of using struct for such types is that it may help bring about perfection of such a POD-based "type creation" facility. I note from looking at some of the Phobos D2 code, e.g. std.array, that this seems to be attempted in places. Nearly all the more interesting D types, arrays, maps, are equivalent to smallish POD types, at least 2-3 times the machine word size (32/64 bit). Making it all work, keeping it understandable, and avoiding WTF bugs is a big challenge.One idea I've had for a while was to have a universal string type: struct UString { union { char[] utf8; wchar[] utf16; dchar[] utf32; } enum Discriminator { utf8, utf16, utf32 }; Discriminator kind; IntervalTree!(size_t) skip; ... }
Feb 05 2010
On 2010-02-04 18:16:55 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:Rainer Deyke wrote:That's a nice concept, but it seems to me that it adds much overhead to improve a rather niche area. It's not often that you need to access characters by index. Generally when you need to, it's because you've already parsed the string and want to return to a previous location, in which case you'd do better, when you first parse, to just save the range or the index in code units rather than the index in code points. But I have to say I'm quite satisfied with the way D handles strings in general. Easy access to code points and direct access to the data is quite handy. I think it fits very well with a low-level language. I'd say in general when manipulating strings I rarely need to bother about code points. Most of the time I'm just searching for ASCII-range markers when parsing, so I can search for them directly as code units, not bothering at all about multi-byte characters. If I'm looking for a substring then I can search by code units too. It's just for the more fancy stuff (case-insensitive searching, character transformation) that it becomes necessary to work with code points. That's why I'm a little wary about your changes in that area: I fear they'll make the common case of working with code units more difficult to deal with. But I won't judge before I see. -- Michel Fortin michel.fortin michelf.com http://michelf.com/Don wrote:One idea I've had for a while was to have a universal string type: struct UString { union { char[] utf8; wchar[] utf16; dchar[] utf32; } enum Discriminator { utf8, utf16, utf32 }; Discriminator kind; IntervalTree!(size_t) skip; ... } The IntervalTree stores the skip amounts that must be added for a given index in the string. For ASCII strings that would be null. Then its size grows with the number of multibyte characters. Beyond a threshold, representation is transparently switched to utf16 or utf32 as needed and the tree becomes smaller or null again. In an advanced implementation the discriminator and the tree could be stored at negative offset, and the tree could be compressed taking advantage of its limited size. That would make UString quite low-overhead while offering a staunchly dchar-based interface. I don't mind at all using string, but I also think UString would be a good extra abstraction. AndreiI suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. 
The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.If it's not too late to completely change the semantics of char[], then it's also not too late to dump 'char' completely. If it /is/ too late to remove 'char', then 'char[]' should retain the current semantics and a new string type should be added for the new semantics.
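The "search ASCII-range markers directly as code units" point above rests on a UTF-8 design guarantee: bytes 0x00-0x7F occur only as complete ASCII characters, never inside a multi-byte sequence, so a raw byte scan cannot produce a false match. A small sketch (findAscii is an invented helper; std.string.representation gives the byte view without decoding):

```d
import std.stdio : writeln;
import std.string : representation;

// Find the first occurrence of an ASCII marker in a UTF-8 string by
// scanning raw code units -- safe because lead and continuation bytes
// of multi-byte sequences are always >= 0x80 and so can never equal an
// ASCII byte.
size_t findAscii(string s, char marker)
{
    assert(marker < 0x80, "only valid for ASCII-range markers");
    foreach (i, b; s.representation)   // immutable(ubyte)[] view, no decoding
        if (b == marker)
            return i;                  // index in code units, not code points
    return size_t.max;                 // not found
}

void main()
{
    // 'é' occupies two code units (0xC3 0xA9), so ';' sits at byte index 3.
    writeln(findAscii("aé;b", ';'));   // prints "3"
}
```

This is the common case Michel describes: no decoding anywhere on the hot path, and the returned code-unit index can be used directly for slicing.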
Feb 04 2010
Don wrote:Andrei Alexandrescu wrote:That is a good way to look at things.Michel Fortin wrote:We seem to be approaching the point where char[], wchar[] and dchar[] are all arrays of dchar, but with different levels of compression.On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:This is a good point. I'm in favor of changing the language to make the implicit type dchar. Andreibearophile wrote:Has any thought been given to foreach? Currently all these work for strings:

foreach (c; "abc") { } // typeof(c) is 'char'
foreach (char c; "abc") { }
foreach (wchar c; "abc") { }
foreach (dchar c; "abc") { }

I'm concerned about the first case, where the element type is implicit. The implicit element type is (currently) the code unit. If ranges use code points ('dchar') as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (in the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should be allowed to iterate using 'char' and 'wchar' too. I think this would fit nicely. I was surprised at first, when learning D, to notice that foreach didn't do this, and that I had to explicitly ask for it.Simen kjaeraas:I am ready to throw away the implementation as soon as a better idea comes around. As at other times, I made the change to see how things feel with the new approach.Of the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn.Yes, I presume he was mostly looking for a justification of ideas he has already accepted and even partially implemented :-)It makes me wonder whether the char and wchar types actually make any sense. If char[] is actually a UTF string, then char[] ~ char should be permitted ONLY if char can be implicitly converted to dchar. 
Otherwise, you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c), which will not necessarily result in a valid Unicode string.Well, as has been mentioned, sometimes you may assemble a string out of individual characters. Probably that case is rare enough to warrant a cast. Note that today char is already convertible to dchar (there's no checking).I suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.What we can do is have to!(const ubyte[]) work for all UTF8 strings and to!(const ushort[]) work for all UTF16 strings. That view is correct and safe. Also, it's not difficult to add a .codepoints pseudo-property. Andrei
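To make the foreach question above concrete, here is the count each element type produces over the same string (the implicit form currently behaves like the explicit char form):

```d
import std.stdio : writefln;

void main()
{
    size_t units, points;
    foreach (char c; "héllo")  ++units;   // code units: 'é' is two UTF-8 bytes
    foreach (dchar c; "héllo") ++points;  // code points: decoded on the fly
    writefln("%s code units, %s code points", units, points);
    // prints "6 code units, 5 code points"
}
```

The inconsistency bearophile worries about is exactly this: if ranges over char[] yield dchar but the implicit foreach yields char, the same loop counts 6 elements one way and 5 the other.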
Feb 04 2010
Andrei Alexandrescu wrote:I concur. It's great to see consensus moving in this direction. For too long Java has suffered under the error that a short (i.e. a UTF-16 code unit) is just about as good as a full Unicode code point (i.e. a UTF-32 "code unit"). As a result, the near-enough-is-good-enough, 16-bit Java APIs mean that programmers either forget (at best) or become slack (at worst) in dealing with valid Unicode characters. Part of this also stems from the culture that if it ain't ASCII or in a Western character set (BMP), who cares. As a matter of taste, I'd prefer to see a dchar Unicode code point officially acknowledged/ordained as "unichar", though I guess there is always the alias resort for pedants like myself. Cheers Justin JohanssonHas any thought been given to foreach? Currently all these work for strings: foreach (c; "abc") { } // typeof(c) is 'char' foreach (char c; "abc") { } foreach (wchar c; "abc") { } foreach (dchar c; "abc") { } I'm concerned about the first case, where the element type is implicit. The implicit element type is (currently) the code unit. If ranges use code points ('dchar') as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (in the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should be allowed to iterate using 'char' and 'wchar' too. I think this would fit nicely. I was surprised at first, when learning D, to notice that foreach didn't do this, and that I had to explicitly ask for it.This is a good point. I'm in favor of changing the language to make the implicit type dchar. Andrei
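Justin's point about 16-bit APIs can be shown directly in D: a code point outside the BMP occupies two UTF-16 code units (a surrogate pair), so counting wchars -- which is effectively what a code-unit-level UTF-16 API does -- overcounts characters. (walkLength here assumes the code-point iteration behavior being discussed in this thread.)

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // 'a', U+1F600 (outside the BMP), 'b': three characters.
    wstring s = "a\U0001F600b"w;
    writeln(s.length);      // prints "4": UTF-16 code units, U+1F600 needs a surrogate pair
    writeln(s.walkLength);  // prints "3": decoded code points
}
```

An API that hands out the 4 code units as if they were characters lets the two halves of the surrogate pair be split or compared individually, which is precisely the near-enough-is-good-enough failure mode described above.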
Feb 04 2010
Andrei Alexandrescu:I am ready to throw away the implementation as soon as a better idea comes around. As other times, I operated the change to see how things feel with the new approach.And just to be sure: you are better than me because you actually implement things, while I am here just complaining all the time :o) So thank you for your work. Later, bearophile
Feb 04 2010
Andrei Alexandrescu Wrote:What would you do? Any ideas are welcome.Andrei, congratulations on starting the most interesting thread I have seen on dm.D for as long as I can remember. It has managed to stay really focused, and has produced a lot of interesting suggestions. I don't envy you the task of sorting through them - but you did ask. From what you've seen so far, are we talking Phobos or language issues, and if both, in what proportion? Steve.
Feb 05 2010
Steve Teale wrote:Andrei Alexandrescu Wrote:A bit of both, with an emphasis on library. AndreiWhat would you do? Any ideas are welcome.Andrei, congratulations on starting the most interesting thread I have seen on dm.D for as long as I can remember. It has managed to stay really focused, and has produced a lot of interesting suggestions. I don't envy you the task of sorting through them - but you did ask. From what you've seen so far, are we talking Phobos or language issues, and if both, in what proportion?
Feb 05 2010