digitalmars.D - standard ranges

Gor Gyolchanyan (8/8) Jun 27 2012 Are there functions, which wrap arbitrary range types into standard rang...

Timon Gehr (4/12) Jun 27 2012 A narrow string is not a RandomAccessRange.

Gor Gyolchanyan (6/21) Jun 27 2012 I tested it out and the string literal without qualifiers counts as a
Jonathan M Davis (5/7) Jun 27 2012 That depends entirely on what you assign it to.
Gor Gyolchanyan (9/17) Jun 27 2012 this is weird. I wrote a function, which transforms anything, which

Jonathan M Davis (5/23) Jun 27 2012 _All_ strings are considered to be ranges of dchar. That's why string an...
Gor Gyolchanyan (6/30) Jun 27 2012 So why is the type of a string literal _string_ by default? Isn't it

Timon Gehr (7/17) Jun 27 2012 Because it is a _string_ literal. If you are asking why utf-8 is the

Jonathan M Davis (51/85) Jun 27 2012 I don't see why having the literal be a string would make anything confu...

Steven Schveighoffer (14/24) Jun 27 2012 No, the reason is:

Gor Gyolchanyan (12/36) Jun 27 2012 a
Jonathan M Davis (22/24) Jun 27 2012 Except that they couldn't have slicing, because it would be very ineffic...
Gor Gyolchanyan (12/36) Jun 27 2012 Yes you can get away. The struct string would have ubyte[] ushort[]

Timon Gehr (7/44) Jun 27 2012 Encoding cannot be taken care of 100% transparently. It has performance

Timon Gehr (9/33) Jun 27 2012 There is no reason for anyone to be confused about this endlessly. It

Steven Schveighoffer (11/19) Jun 27 2012 Default type of the literal should be the library type. If you want

Timon Gehr (14/33) Jun 27 2012 Then it is not a library type, but a built-in type. Are you planning to

Steven Schveighoffer (10/52) Jun 27 2012 No, druntime, and include minimal utf support. We do the same thing wit...

Jonathan M Davis (35/96) Jun 27 2012 Cast it to ubyte[]. Problem solved. I honestly don't think that operatin...

travert phare.normalesup.org (Christophe Travert) (43/48) Jun 28 2012 char[] is not treated as an array by the library, and is not treated as

Jonathan M Davis (74/83) Jun 28 2012 nly_

travert phare.normalesup.org (Christophe Travert) (33/59) Jun 28 2012 All arrays are treated as RandomAccessRanges, except for char[] and

Jonathan M Davis (6/16) Jun 28 2012 So, it looked to me like you were saying that making string a struct wou...

David Nadlinger (8/16) Jun 28 2012 I think he meant that the problem would be solved because people

travert phare.normalesup.org (Christophe Travert) (2/16) Jun 28 2012 Yes.

Andrei Alexandrescu (13/17) Jun 28 2012 In a way it's too late for any language in actual use. The "fog of
Timon Gehr (47/108) Jun 28 2012 "Not treated like other arrays", is the strongest claim that can be

travert phare.normalesup.org (Christophe Travert) (19/52) Jun 28 2012 I consider this bad design.

Timon Gehr (18/75) Jun 27 2012 Because the proposed 'string' interface is inconvenient to use and

Jonathan M Davis (17/27) Jun 27 2012 I think that a lot of programmers want to be able to use strings without...
Steven Schveighoffer (36/79) Aug 23 2012 Sorry to resurrect this thread, I've been very absent from D, and am jus...

Gor Gyolchanyan (10/29) Jun 27 2012 t
Jonathan M Davis (13/45) Jun 27 2012 Except that they shouldn't be, because you can't do random access on a n...

Roman D. Boiko (9/35) Jun 28 2012 Pedantically speaking, it is possible to index a string with

Roman D. Boiko (3/11) Jun 28 2012 Just a remark, indexing would take O(N) operations and N/B memory

Roman D. Boiko (17/29) Jun 28 2012 That being said, I would be against switching from string

Roman D. Boiko (10/18) Jun 28 2012 This would support both random access to characters by their code
Andrei Alexandrescu (4/11) Jun 28 2012 Pedantically speaking, sheer timings say nothing without the appropriate...

Roman D. Boiko (17/21) Jun 28 2012 I used results of benchmarks for two such algorithms, which I

Roman D. Boiko (16/16) Jun 28 2012 Timings should not be very different from random access in any
Andrei Alexandrescu (6/26) Jun 28 2012 Well of course I've exaggerated a bit. My point is that mentioning "200

Roman D. Boiko (20/26) Jun 28 2012 I have no benchmarks for plain array access on the same machine

Roman D. Boiko (13/13) Jun 28 2012 My point (and the reason I somehow hijacked this thread) is that

Jonathan M Davis (13/52) Jun 27 2012 If you want efficient strings, you _must_ worry about the encoding. It's...

Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:

Are there functions, which wrap arbitrary range types into standard range
interfaces?
I looked at the docs, but couldn't find anything.
Use case:

RandomAccessRange!dchar s = ???("Hello, world!");

-- 
Bye,
Gor Gyolchanyan.

Jun 27 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 06/27/2012 03:25 PM, Gor Gyolchanyan wrote:
 Are there functions, which wrap arbitrary range types into standard
 range interfaces?
 I looked at the docs, but couldn't find anything.
 Use case:

 RandomAccessRange!dchar s = ???("Hello, world!");

 --
 Bye,
 Gor Gyolchanyan.

A narrow string is not a RandomAccessRange.

RandomAccessFinite!(immutable(dchar)) s = inputRangeObject("Hello, world!"d);
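
A minimal sketch of that wrapping, assuming a Phobos where inputRangeObject and the range interfaces are reachable through std.range. The helper name wrapHello is made up for illustration and is not part of Phobos:

```d
import std.range; // inputRangeObject, RandomAccessFinite

// Hypothetical helper: wraps a dstring literal in the class-based
// random-access range interface.
RandomAccessFinite!(immutable(dchar)) wrapHello()
{
    // The "d" suffix makes the literal a dstring, i.e. an array of code
    // points, which is a genuine random-access range.
    return inputRangeObject("Hello, world!"d);
}

void main()
{
    auto s = wrapHello();
    assert(s.length == 13);
    assert(s[0] == 'H');
    assert(s[12] == '!');
}
```

The same call on a plain "Hello, world!" (a narrow string) would not convert to a random-access interface, which is the point made above.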

Jun 27 2012

Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:

On Wed, Jun 27, 2012 at 5:38 PM, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 06/27/2012 03:25 PM, Gor Gyolchanyan wrote:

 Are there functions, which wrap arbitrary range types into standard
 range interfaces?
 I looked at the docs, but couldn't find anything.
 Use case:

 RandomAccessRange!dchar s = ???("Hello, world!");

 --
 Bye,
 Gor Gyolchanyan.

 A narrow string is not a RandomAccessRange.

 RandomAccessFinite!(immutable(dchar)) s = inputRangeObject("Hello, world!"d);

I tested it out and the string literal without qualifiers counts as a
dstring.

-- 
Bye,
Gor Gyolchanyan.

Jun 27 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, June 27, 2012 17:58:46 Gor Gyolchanyan wrote:
 I tested it out and the string literal without qualifiers counts as a
 dstring.

That depends entirely on what you assign it to. 
writeln(typeof("hello").stringof) prints string, not dstring. So, the literal 
by itself is a string by default.
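
A small sketch of that distinction, using only built-in language features (nothing here is specific to the original poster's code):

```d
void main()
{
    // An unsuffixed literal, taken on its own, is typed as string.
    static assert(is(typeof("hello") == string));

    // The same literal converts implicitly when the target asks for it.
    wstring w = "hello";
    dstring d = "hello";

    // Suffixes pin the type regardless of context.
    static assert(is(typeof("hello"w) == wstring));
    static assert(is(typeof("hello"d) == dstring));

    assert(w.length == 5 && d.length == 5);
}
```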

- Jonathan M Davis

Jun 27 2012

Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:

On Wed, Jun 27, 2012 at 7:41 PM, Jonathan M Davis <jmdavisProg gmx.com>wrote:

 On Wednesday, June 27, 2012 17:58:46 Gor Gyolchanyan wrote:
 I tested it out and the string literal without qualifiers counts as a
 dstring.

 That depends entirely on what you assign it to.
 writeln(typeof("hello").stringof) prints string, not dstring. So, the
 literal
 by itself is a string by default.

 - Jonathan M Davis

this is weird. I wrote a function, which transforms anything, which
qualifies as isForwardRange into an implementation of ForwardRange. And the
type inference of that function produced a ForwardRangeImpl!dchar when I
passed it a string literal.

Although string and wstring also qualify as a forward range.

-- 
Bye,
Gor Gyolchanyan.

Jun 27 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, June 27, 2012 19:47:41 Gor Gyolchanyan wrote:
 On Wed, Jun 27, 2012 at 7:41 PM, Jonathan M Davis 

<jmdavisProg gmx.com>wrote:
 On Wednesday, June 27, 2012 17:58:46 Gor Gyolchanyan wrote:
 I tested it out and the string literal without qualifiers counts as a
 dstring.

 
 That depends entirely on what you assign it to.
 writeln(typeof("hello").stringof) prints string, not dstring. So, the
 literal
 by itself is a string by default.
 
 - Jonathan M Davis

 
 this is weird. I wrote a function, which transforms anything, which
 qualifies as isForwardRange into an implementation of ForwardRange. And the
 type inference of that function produced a ForwardRangeImpl!dchar when I
 passed it a string literal.
 
 Although string and wstring also qualify as a forward range.

_All_ strings are considered to be ranges of dchar. That's why string and 
wstring are not random access ranges and hasLength is false for them.
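
The traits can be checked directly. This is a sketch that assumes a Phobos version which auto-decodes narrow strings, as described above:

```d
import std.range; // ElementType, hasLength, isForwardRange, isRandomAccessRange

// Narrow strings: the range element type is dchar, not char or wchar.
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!wstring == dchar));

// They still qualify as forward ranges ...
static assert(isForwardRange!string);
static assert(isForwardRange!wstring);

// ... but not as random-access ranges, and hasLength is false.
static assert(!isRandomAccessRange!string);
static assert(!hasLength!string);

// dstring is an array of code points, so it keeps both.
static assert(isRandomAccessRange!dstring);
static assert(hasLength!dstring);

void main() {}
```

This would also explain a generic wrapper inferring dchar as the element type when handed a string literal.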

- Jonathan M Davis

Jun 27 2012

Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:

On Wed, Jun 27, 2012 at 7:49 PM, Jonathan M Davis <jmdavisProg gmx.com>wrote:

 On Wednesday, June 27, 2012 19:47:41 Gor Gyolchanyan wrote:
 On Wed, Jun 27, 2012 at 7:41 PM, Jonathan M Davis

 <jmdavisProg gmx.com>wrote:
 On Wednesday, June 27, 2012 17:58:46 Gor Gyolchanyan wrote:
 I tested it out and the string literal without qualifiers counts as a
 dstring.

 That depends entirely on what you assign it to.
 writeln(typeof("hello").stringof) prints string, not dstring. So, the
 literal
 by itself is a string by default.

 - Jonathan M Davis

 this is weird. I wrote a function, which transforms anything, which
 qualifies as isForwardRange into an implementation of ForwardRange. And

 the
 type inference of that function produced a ForwardRangeImpl!dchar when I
 passed it a string literal.

 Although string and wstring also qualify as a forward range.

 _All_ strings are considered to be ranges of dchar. That's why string and
 wstring are not random access ranges and hasLength is false for them.

 - Jonathan M Davis

So why is the type of a string literal _string_ by default? Isn't it
confusing when dealing with ranges?

-- 
Bye,
Gor Gyolchanyan.

Jun 27 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 06/27/2012 05:54 PM, Gor Gyolchanyan wrote:
 On Wed, Jun 27, 2012 at 7:49 PM, Jonathan M Davis <jmdavisProg gmx.com
 <mailto:jmdavisProg gmx.com>> wrote:

     _All_ strings are considered to be ranges of dchar. That's why
     string and
     wstring are not random access ranges and hasLength is false for them.

     - Jonathan M Davis


 So why is the type of a string literal _string_ by default?

Because it is a _string_ literal. If you are asking why utf-8 is the
default, that is because it is the most space efficient, backwards-
compatible to ASCII, and because random access to a string is rarely
required.


 Isn't it confusing when dealing with ranges?
 --
 Bye,
 Gor Gyolchanyan.

Why would it be?

Jun 27 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, June 27, 2012 19:54:12 Gor Gyolchanyan wrote:
 On Wed, Jun 27, 2012 at 7:49 PM, Jonathan M Davis 

<jmdavisProg gmx.com>wrote:
 On Wednesday, June 27, 2012 19:47:41 Gor Gyolchanyan wrote:
 On Wed, Jun 27, 2012 at 7:41 PM, Jonathan M Davis

 
 <jmdavisProg gmx.com>wrote:
 On Wednesday, June 27, 2012 17:58:46 Gor Gyolchanyan wrote:
 I tested it out and the string literal without qualifiers counts as
 a
 dstring.

 
 That depends entirely on what you assign it to.
 writeln(typeof("hello").stringof) prints string, not dstring. So, the
 literal
 by itself is a string by default.
 
 - Jonathan M Davis

 
 this is weird. I wrote a function, which transforms anything, which
 qualifies as isForwardRange into an implementation of ForwardRange. And

 
 the
 
 type inference of that function produced a ForwardRangeImpl!dchar when I
 passed it a string literal.
 
 Although string and wstring also qualify as a forward range.

 
 _All_ strings are considered to be ranges of dchar. That's why string and
 wstring are not random access ranges and hasLength is false for them.
 
 - Jonathan M Davis

 
 So why is the type of a string literal _string_ by default? Isn't it
 confusing when dealing with ranges?

I don't see why having the literal be a string would make anything confusing. 
The fact that a string is considered a range of dchar rather than char could 
be, but I don't see why having a string literal be a dstring instead of a 
string would help with that. Besides, it's generally expected that you'll use 
string for strings unless you specifically need wstring or dstring for some 
reason.

Regardless, ranges aren't really part of the language. They're a library 
artifact. The _only_ place that the language has anything to do with them is 
foreach, in which case

foreach(e; range)
{
 // code
}

becomes

for(auto _range = range; !_range.empty; _range.popFront())
{
 auto e = _range.front;
 // code
}

That's it. So, the fact that Phobos treats strings as ranges of dchar is 
completely separate from what the language is doing with string literals. 
foreach on strings doesn't iterate over dchars unless you specifically give 
dchar as the element type. You can get a string's length. You can use random 
access on it. You can slice it. But this falls apart _very_ quickly with 
general algorithms, because a string is an array of code _units_ rather than 
code points. So, if you iterate over char, you're iterating over pieces of 
characters rather than whole characters. So, Phobos' solution is to treat 
arrays of char and wchar as ranges of dchar rather than ranges of char and 
wchar, and they lose length, random access, and slicing as far as ranges are 
concerned (though algorithms can special case for them and use those abilities 
where appropriate, since they're still there - they just can't be used 
generically or you'd be operating on code units).
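
To make the code unit vs. code point split concrete, here is a minimal sketch. The function names countUnits and countPoints are invented for this example; walkLength is the Phobos range function:

```d
import std.range : walkLength;

// Hypothetical helper: plain foreach on a string yields char (code units).
size_t countUnits(string s)
{
    size_t n;
    foreach (c; s) ++n;
    return n;
}

// Hypothetical helper: asking for dchar makes foreach decode code points.
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar c; s) ++n;
    return n;
}

void main()
{
    // U+00E9 takes two UTF-8 code units, so the array is 6 units long
    // but holds only 5 code points.
    string s = "h\u00E9llo";
    assert(s.length == 6);       // language view: code units
    assert(s.walkLength == 5);   // range view: code points
    assert(countUnits(s) == 6);
    assert(countPoints(s) == 5);
}
```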

In some cases, you need to be able to treat strings as arrays of code units, 
while in others you need to treat them as arrays of code points. In order to 
use strings properly, you need to understand that. There's no way around it. 
It's life with unicode. The library went the route of using code points for 
everything because it's more correct and less error-prone, whereas the 
language itself generally deals with code units. This does create a bit of 
schizophrenia when dealing with built-in stuff (such as foreach) and library 
stuff, but that's the way that it goes at this point.

If strings were a struct of some kind that defaulted to using code points but 
allowed you to use code units when necessary, then the situation could be 
improved, but no one has been able to come up with a satisfactory proposal to 
do that, and it would break so much code at this point to change what string 
was aliased to that it's unlikely to ever happen. Not to mention, it doesn't 
really fix the underlying problem of having to know and worry about code units 
vs code points. They're intrinsic to unicode, and you can't really fix that. 
There's no way around it if you want to be able to efficiently operate on strings.

- Jonathan M Davis

Jun 27 2012

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Wed, 27 Jun 2012 13:30:48 -0400, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:


 I don't see why having the literal be a string would make anything  
 confusing.
 The fact that a string is considered a range of dchar rather than char  
 could
 be, but I don't see why having a string literal be a dstring instead of a
 string would help with that. Besides, it's generally expected that  
 you'll use
 string for strings unless you specifically need wstring or dstring for  
 some
 reason.

No, the reason is:

1. T[] is a range of T, unless T == char or T == wchar, and then it's a  
range of dchar (huh?)
2. char[] is not a random access range, even though str[i] and str.length  
work.

The fundamental flaw in the way this works is that phobos is pretending  
immutable(char)[] is not an array.  immutable(char)[] should be an array  
of immutable char, string should be a *separate type* of a range of dchar,  
perhaps with immutable(char)[] as its underlying storage.

D needs a full, library-defined string type.  Until it has that, it's  
going to cause endless confusion and WATs.

-Steve

Jun 27 2012

Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:

On Wed, Jun 27, 2012 at 10:09 PM, Steven Schveighoffer
<schveiguy yahoo.com> wrote:
 On Wed, 27 Jun 2012 13:30:48 -0400, Jonathan M Davis <jmdavisProg gmx.com>
 wrote:

 I don't see why having the literal be a string would make anything
 confusing.
 The fact that a string is considered a range of dchar rather than char
 could
 be, but I don't see why having a string literal be a dstring instead of a
 string would help with that. Besides, it's generally expected that you'll
 use
 string for strings unless you specifically need wstring or dstring for
 some
 reason.

 No, the reason is:

 1. T[] is a range of T, unless T == char or T == wchar, and then it's a
 range of dchar (huh?)
 2. char[] is not a random access range, even though str[i] and str.length
 work.

 The fundamental flaw in the way this works is that phobos is pretending
 immutable(char)[] is not an array. immutable(char)[] should be an array of
 immutable char, string should be a *separate type* of a range of dchar,
 perhaps with immutable(char)[] as its underlying storage.

 D needs a full, library-defined string type. Until it has that, it's going
 to cause endless confusion and WATs.

 -Steve

Agreed. Having struct strings (with slices and everything) will set
the record straight.

-- 
Bye,
Gor Gyolchanyan.

Jun 27 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, June 27, 2012 22:29:25 Gor Gyolchanyan wrote:
 Agreed. Having struct strings (with slices and everything) will set
 the record straight.

Except that they couldn't have slicing, because it would be very inefficient. 
You'd have to get at the actual array of code units to slice anything. A 
struct string type would have to be restricted to exactly the same set of 
operations that range-based functions consider strings to have and then give 
you a way to get at the underlying code unit representation to be able to use 
it when special-casing for strings for efficiency, just like you do now.

You _can't_ get away from the fact that you're dealing with an array (or list 
or whatever) of code units even if you do want to operate on it as a range of 
code points most of the time. Having a struct would fix the issues like foreach 
iterating over char by default whereas range-based functions iterate over 
dchar - it would make it consistent by making it dchar for everything - but 
the issue of code unit vs code point still remains and you can't get rid of 
it. Anyone wanting to write efficient string-processing code _needs_ to 
understand unicode. There's no way around it (which is part of the reason that 
Walter isn't keen on the idea of changing how strings work in the language 
itself).

So, while having a string type which is a struct does help eliminate the 
schizophrenia, the core problem of code unit vs code point is still there, and 
you still need to understand it. There is no fix for it, because it's intrinsic 
to how unicode works.

- Jonathan M Davis

Jun 27 2012

Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:

On Wed, Jun 27, 2012 at 10:42 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 On Wednesday, June 27, 2012 22:29:25 Gor Gyolchanyan wrote:
 Agreed. Having struct strings (with slices and everything) will set
 the record straight.

 Except that they couldn't have slicing, because it would be very inefficient.
 You'd have to get at the actual array of code units to slice anything. A
 struct string type would have to be restricted to exactly the same set of
 operations that range-based functions consider strings to have and then give
 you a way to get at the underlying code unit representation to be able to use
 it when special-casing for strings for efficiency, just like you do now.

 You _can't_ get away from the fact that you're dealing with an array (or list
 or whatever) of code units even if you do want to operate on it as a range of
 code points most of the time. Having a struct would fix the issues like foreach
 iterating over char by default whereas range-based functions iterate over
 dchar - it would make it consistent by making it dchar for everything - but
 the issue of code unit vs code point still remains and you can't get rid of
 it. Anyone wanting to write efficient string-processing code _needs_ to
 understand unicode. There's no way around it (which is part of the reason that
 Walter isn't keen on the idea of changing how strings work in the language
 itself).

 So, while having a string type which is a struct does help eliminate the
 schizophrenia, the core problem of code unit vs code point is still there, and
 you still need to understand it. There is no fix for it, because it's intrinsic
 to how unicode works.

 - Jonathan M Davis

Yes you can get away. The struct string would have ubyte[] ushort[]
and uint[] as the representation. Maybe even the char[], wchar[] and
dchar[], but those won't be strings as we know them now. The string
struct will take care of encoding 100% transparently and will provide
access to the representation, which is good for bit blitting and other
encoding-agnostic operations, but the representation is then known NOT
to be a valid string and will need to be placed into the string struct
in order to use string operations.

-- 
Bye,
Gor Gyolchanyan.

Jun 27 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 06/27/2012 08:54 PM, Gor Gyolchanyan wrote:
 On Wed, Jun 27, 2012 at 10:42 PM, Jonathan M Davis<jmdavisProg gmx.com>  wrote:
 On Wednesday, June 27, 2012 22:29:25 Gor Gyolchanyan wrote:
 Agreed. Having struct strings (with slices and everything) will set
 the record straight.

 Except that they couldn't have slicing, because it would be very inefficient.
 You'd have to get at the actual array of code units to slice anything. A
 struct string type would have to be restricted to exactly the same set of
 operations that range-based functions consider strings to have and then give
 you a way to get at the underlying code unit representation to be able to use
 it when special-casing for strings for efficiency, just like you do now.

 You _can't_ get away from the fact that you're dealing with an array (or list
 or whatever) of code units even if you do want to operate on it as a range of
 code points most of the time. Having a struct would fix the issues like foreach
 iterating over char by default whereas range-based functions iterate over
 dchar - it would make it consistent by making it dchar for everything - but
 the issue of code unit vs code point still remains and you can't get rid of
 it. Anyone wanting to write efficient string-processing code _needs_ to
 understand unicode. There's no way around it (which is part of the reason that
 Walter isn't keen on the idea of changing how strings work in the language
 itself).

 So, while having a string type which is a struct does help eliminate the
 schizophrenia, the core problem of code unit vs code point is still there, and
 you still need to understand it. There is no fix for it, because it's intrinsic
 to how unicode works.

 - Jonathan M Davis

 Yes you can get away. The struct string would have ubyte[] ushort[]
 and uint[] as the representation. Maybe even the char[], wchar[] and
 dchar[], but those won't be strings as we know them now. The string
 struct will take care of encoding 100% transparently

Encoding cannot be taken care of 100% transparently. It has performance 
implications.

 and will provide access to the representation, which is good for bit
 blitting and other
 encoding-agnostic operations, but the representation is then known NOT
 to be a valid string

It is NOT known not to be a valid string. Furthermore, this directly 
contradicts what you claimed above. If the representation is exposed,
it is certainly not transparent.

 and will need to be placed into the string struct in order to use string
 operations.

aliasing..?

Jun 27 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 06/27/2012 08:09 PM, Steven Schveighoffer wrote:
 On Wed, 27 Jun 2012 13:30:48 -0400, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:


 I don't see why having the literal be a string would make anything
 confusing.
 The fact that a string is considered a range of dchar rather than char
 could
 be, but I don't see why having a string literal be a dstring instead of a
 string would help with that. Besides, it's generally expected that
 you'll use
 string for strings unless you specifically need wstring or dstring for
 some
 reason.

 No, the reason is:

 1. T[] is a range of T, unless T == char or T == wchar, and then it's a
 range of dchar (huh?)
 2. char[] is not a random access range, even though str[i] and
 str.length work.

 The fundamental flaw in the way this works is that phobos is pretending
 immutable(char)[] is not an array. immutable(char)[] should be an array
 of immutable char, string should be a *separate type* of a range of
 dchar, perhaps with immutable(char)[] as its underlying storage.

 D needs a full, library-defined string type. Until it has that, it's
 going to cause endless confusion and WATs.

 -Steve

There is no reason for anyone to be confused about this endlessly. It
is simple to understand. Furthermore, think about the implications of a
library-defined string type: it just introduces the problem of what the
type of built-in string literals should be. This would cause endless
pain with type deduction, ifti, string mixins, ... A library-defined
string type cannot be a full string type. Pretending that it can has no
value.

alias immutable(char)[] string is just fine.

Jun 27 2012

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 There is no reason for anyone to be confused about this endlessly. It
 is simple to understand. Furthermore, think about the implications of a
 library-defined string type: it just introduces the problem of what the
 type of built-in string literals should be. This would cause endless
 pain with type deduction, ifti, string mixins, ... A library-defined
 string type cannot be a full string type. Pretending that it can has no
 value.

Default type of the literal should be the library type.  If you want  
immutable(char)[], use "abc".codeunits or equivalent.

Of course, it should by default work as a zero-terminated char * for C  
compatibility.

The current situation is not simple to understand.  Generic code that  
accepts arrays has to special-case narrow-width strings if you plan to use  
phobos with them in some cases.  That is a horrible situation.

 alias immutable(char)[] string is just fine.

That is technically fine, but if phobos wants to treat immutable(char)[]  
as something other than an array, it is not fine.

-Steve

Jun 27 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 06/27/2012 10:22 PM, Steven Schveighoffer wrote:
 On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 There is no reason for anyone to be confused about this endlessly. It
 is simple to understand. Furthermore, think about the implications of a
 library-defined string type: it just introduces the problem of what the
 type of built-in string literals should be. This would cause endless
 pain with type deduction, ifti, string mixins, ... A library-defined
 string type cannot be a full string type. Pretending that it can has no
 value.

 Default type of the literal should be the library type.

Then it is not a library type, but a built-in type. Are you planning to
inject a dependency on Phobos into the compiler?

 If you want immutable(char)[], use "abc".codeunits or equivalent.

I really don't want to type .codeunits, but I want to use
immutable(char)[] everywhere. This 'library type' is just an interface
change that makes writing nice and efficient code a kludge.

 Of course, it should by default work as a zero-terminated char * for C
 compatibility.

 The current situation is not simple to understand.

It is simple, even if not immediately obvious. It does not have to be
immediately obvious without explanation. It needs to be convenient.

 Generic code that accepts arrays has to special-case narrow-width strings
 if you plan to
 use phobos with them in some cases. That is a horrible situation.

Generic code accepts ranges, not arrays. All necessary (or maybe
unnecessary, I don't know) special casing is already done for you in
Phobos. The _only_ thing that is problematic is the inconsistent
'foreach' behaviour.

 alias immutable(char)[] string is just fine.

 That is technically fine, but if phobos wants to treat immutable(char)[]
 as something other than an array, it is not fine.

 -Steve

Phobos does not treat immutable(char)[] as something other than an
array. It does not treat all arrays uniformly though.

Jun 27 2012

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Wed, 27 Jun 2012 16:55:49 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 06/27/2012 10:22 PM, Steven Schveighoffer wrote:
 On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr gmx.ch>  
 wrote:

 There is no reason for anyone to be confused about this endlessly. It
 is simple to understand. Furthermore, think about the implications of a
 library-defined string type: it just introduces the problem of what the
 type of built-in string literals should be. This would cause endless
 pain with type deduction, ifti, string mixins, ... A library-defined
 string type cannot be a full string type. Pretending that it can has no
 value.

 Default type of the literal should be the library type.

 Then it is not a library type, but a built-in type. Are you planning to
 inject a dependency on Phobos into the compiler?

No, druntime, and include minimal utf support.  We do the same thing with  
AssociativeArray.

 If you want immutable(char)[], use "abc".codeunits or equivalent.

 I really don't want to type .codeunits, but I want to use
 immutable(char)[] everywhere. This 'library type' is just an interface
 change that makes writing nice and efficient code a kludge.

When most string functions take strings, why would you want to use  
immutable(char)[] everywhere?

 Of course, it should by default work as a zero-terminated char * for C
 compatibility.

 The current situation is not simple to understand.

 It is simple, even if not immediately obvious. It does not have to be
 immediately obvious without explanation. It needs to be convenient.

Try sorting an array of ascii characters.

 Generic code that accepts arrays  has to special-case narrow-width  
 strings if you plan to
 use phobos with them in some cases. That is a horrible situation.

 Generic code accepts ranges, not arrays. All necessary (or maybe
 unnecessary, I don't know) special casing is already done for you in
 Phobos. The _only_ thing that is problematic is the inconsistent
 'foreach' behaviour.

Plenty of generic code specializes on arrays.

 alias immutable(char)[] string is just fine.

 That is technically fine, but if phobos wants to treat immutable(char)[]
 as something other than an array, it is not fine.

 -Steve

 Phobos does not treat immutable(char)[] as something other than an
 array. It does not treat all arrays uniformly though.

It certainly does.  An array by definition is a random-access range.  It  
does not treat strings as random access ranges.

-Steve

Jun 27 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, June 27, 2012 17:11:56 Steven Schveighoffer wrote:
 On Wed, 27 Jun 2012 16:55:49 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:
 On 06/27/2012 10:22 PM, Steven Schveighoffer wrote:
 On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr gmx.ch>
 
 wrote:
 There is no reason for anyone to be confused about this endlessly. It
 is simple to understand. Furthermore, think about the implications of a
 library-defined string type: it just introduces the problem of what the
 type of built-in string literals should be. This would cause endless
 pain with type deduction, ifti, string mixins, ... A library-defined
 string type cannot be a full string type. Pretending that it can has no
 value.

 
 Default type of the literal should be the library type.

 
 Then it is not a library type, but a built-in type. Are you planning to
 inject a dependency on Phobos into the compiler?

 
 No, druntime, and include minimal utf support. We do the same thing with
 AssociativeArray.
 
 If you want immutable(char)[], use "abc".codeunits or equivalent.

 
 I really don't want to type .codeunits, but I want to use
 immutable(char)[] everywhere. This 'library type' is just an interface
 change that makes writing nice and efficient code a kludge.

 
 When most string functions take strings, why would you want to use
 immutable(char)[] everywhere?
 
 Of course, it should by default work as a zero-terminated char * for C
 compatibility.
 
 The current situation is not simple to understand.

 
 It is simple, even if not immediately obvious. It does not have to be
 immediately obvious without explanation. It needs to be convenient.

 
 Try sorting an array of ascii characters.

Cast it to ubyte[]. Problem solved. I honestly don't think that operating on 
code units like that should be encouraged at all, so if it's a bit hard to do, 
then that's a _good_ thing (but since all that's required is casting to 
ubyte[], it's still quite easy - you just have to tell the compiler that 
that's what you really want to do rather than it being the default behavior). 
The problem that we have is the inconsistencies between how the language 
treats strings and how the library does, not the fact that operating on char[] 
as if it were ASCII rather than UTF-8 requires some casting.
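
A minimal sketch of the cast-to-ubyte[] approach described above, assuming a current Phobos. The helper name sortAscii is hypothetical, and std.string.representation is used here as one way to perform the reinterpretation without writing the cast by hand.

```d
import std.algorithm : sort;
import std.string : representation;

/// Sorts the code units of `text` in place, treating them as plain bytes.
/// Only meaningful when `text` is known to contain nothing but ASCII.
void sortAscii(char[] text)
{
    // representation() views the char[] as a ubyte[] without copying,
    // so sorting the view reorders the original array.
    sort(text.representation);
}

void main()
{
    char[] text = "dcba".dup;

    // char[] is not a random-access range as far as Phobos is concerned,
    // so sort refuses to accept it directly.
    static assert(!__traits(compiles, sort(text)));

    sortAscii(text);
    assert(text == "abcd");
}
```

The static assert is the part that tends to surprise people: the array supports indexing, but the range traits say otherwise, so the byte view has to be requested explicitly.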

 Generic code that accepts arrays has to special-case narrow-width
 strings if you plan to
 use phobos with them in some cases. That is a horrible situation.

 
 Generic code accepts ranges, not arrays. All necessary (or maybe
 unnecessary, I don't know) special casing is already done for you in
 Phobos. The _only_ thing that is problematic is the inconsistent
 'foreach' behaviour.

 
 Plenty of generic code specializes on arrays.

You're stuck doing that regardless of how strings are represented. You have to 
operate on them as ranges of code points (or even graphemes) if you want 
correct string processing, but that's inefficient, so anything that cares 
about efficiency, and can gain it by being coded with knowledge of how Unicode 
works and operating on the code units, will need to special-case. Whether 
string is an array or a struct has zero effect on that. All that it affects is 
what operates on it as an array of code units vs a range of code points.

 alias immutable(char)[] string is just fine.

 
 That is technically fine, but if phobos wants to treat immutable(char)[]
 as something other than an array, it is not fine.
 
 -Steve

 
 Phobos does not treat immutable(char)[] as something other than an
 array. It does not treat all arrays uniformly though.

 
 It certainly does. An array by definition is a random-access range. It
 does not treat strings as random access ranges.

Well, now you're getting into a semantics argument. isRandomAccessRange defines 
what a random access range is. All arrays which aren't narrow strings qualify. 
Narrow strings do not. Yes, they do have random-access operations, but they 
aren't random-access ranges, because they're ranges of code points, not code 
units.
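
The distinction can be checked at compile time. The following is a small sketch, assuming Phobos with its string auto-decoding behaviour as described in this thread; it only uses traits from std.range and std.traits.

```d
import std.range : ElementType, hasLength, isBidirectionalRange,
    isRandomAccessRange;
import std.traits : isDynamicArray;

// Ordinary arrays, including dchar[], are random-access ranges.
static assert(isRandomAccessRange!(int[]));
static assert(isRandomAccessRange!(dchar[]));
static assert(hasLength!(int[]));

// Narrow strings are still arrays as far as the language is concerned ...
static assert(isDynamicArray!string);
static assert(isDynamicArray!(wchar[]));

// ... but the range traits see a bidirectional range of dchar.
static assert(isBidirectionalRange!string);
static assert(is(ElementType!string == dchar));
static assert(!isRandomAccessRange!string);
static assert(!isRandomAccessRange!(wchar[]));
static assert(!hasLength!string);

void main() {}
```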

Yes, this makes it so that character arrays are treated inconsistently from 
other arrays, but the library is very consistent in how it handles them, 
because it _never_ deals with strings as being made of code units. If it's 
operating on them as arrays, then it takes unicode into account, and if it's 
operating on them as ranges, it treats them as ranges of code points. It 
_always_ makes sure that it's operating on code points. Plenty of code 
specializes on strings so that it can deal with the code units in an efficient 
manner rather than having to decode them all the time, but Phobos is 
completely consistent with regards to how it treats strings. The _only_ 
inconsistencies are between the language and the library - namely how foreach 
iterates on code units by default and the fact that while the language defines 
length, slicing, and random-access operations for strings, the library 
effectively does not consider strings to have them.
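
To make the foreach inconsistency concrete, here is a short sketch. The helper names countDefault and countDecoded are hypothetical, and the source file is assumed to be saved as UTF-8.

```d
import std.range : walkLength;

/// Counts iterations of a plain foreach, which visits code units.
size_t countDefault(string s)
{
    size_t n;
    foreach (c; s)          // c is inferred as a char (one code unit)
        ++n;
    return n;
}

/// Counts iterations when the loop variable is declared as dchar,
/// which makes the language decode code points.
size_t countDecoded(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "héllo";     // 'é' takes two UTF-8 code units

    assert(s.length == 6);          // language: length in code units
    assert(countDefault(s) == 6);   // language: foreach over code units
    assert(countDecoded(s) == 5);   // explicit dchar: code points
    assert(s.walkLength == 5);      // library: range of code points
}
```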

- Jonathan M Davis

Jun 27 2012

travert phare.normalesup.org (Christophe Travert) writes:

"Jonathan M Davis", in message (digitalmars.D:170852), wrote:
 completely consistent with regards to how it treats strings. The _only_ 
 inconsistencies are between the language and the library - namely how foreach 
 iterates on code units by default and the fact that while the language defines 
 length, slicing, and random-access operations for strings, the library 
 effectively does not consider strings to have them.

char[] is not treated as an array by the library, and is not treated as 
a RandomAccessRange. That is a second inconsistency, and it would be 
avoided if string were a struct.

I won't repeat arguments that were already said, but if it matters, to 
me, things should be such that:

 - string is a druntime defined struct, with an underlying 
immutable(char)[]. It is a BidirectionalRange of dchar. Slicing is 
provided for convenience, but not as opSlice, since it is not O(1), but 
as a method with a separate name. Direct access to the underlying 
char[]/ubyte[] is provided.

 - similar structs are provided to hold underlying const(char)[] and 
char[]

 - similar structs are provided for wstring

 - dstring is a druntime defined alias to dchar[] or a struct with the 
same functionalities for consistency with narrow string being struct.

 - All those structs may be provided as a template.
struct string(T = immutable(char)) {...}
alias string!(immutable(wchar)) wstring;
alias string!(immutable(dchar)) dstring;

string!(const(char)) and string!(char) ... are the other types of 
strings.

 - this string template could also be defined as a wrapper to convert 
any range of char/wchar into a range of dchar. That does not need to be 
in druntime. Only types necessary for string literals should be in 
druntime.

 - string should not be convertible to char*. Use toStringz to interface 
with C code, or the underlying char[] if you know it is 
zero-terminated, at your own risk. Only string literals need to be 
convertible to char*, and I would say that they should be 
zero-terminated only when they are directly used as char*, to allow the 
compiler to optimize memory.

 - char /may/ disappear in favor of ubyte (or the contrary, or one could 
alias the other), if there is no other need to keep separate types than 
having strings that are different from ubyte[]. Only dchar is necessary, 
and it could just be called char.
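
The first point of the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: the name Utf8String is hypothetical, the member name codeunits is borrowed from earlier in the thread, and the separately named slicing method is left out. Decoding is delegated to std.utf.

```d
import std.range : hasLength, isBidirectionalRange, isRandomAccessRange,
    walkLength;
import std.utf : decode, stride, strideBack;

/// Hypothetical wrapper: a bidirectional range of dchar over UTF-8 data.
struct Utf8String
{
    /// Direct access to the underlying code units.
    immutable(char)[] codeunits;

    @property bool empty() { return codeunits.length == 0; }

    /// Decodes the first code point.
    @property dchar front()
    {
        size_t i = 0;
        return decode(codeunits, i);
    }

    /// Drops all code units belonging to the first code point.
    void popFront()
    {
        codeunits = codeunits[stride(codeunits, 0) .. $];
    }

    /// Decodes the last code point.
    @property dchar back()
    {
        size_t i = codeunits.length - strideBack(codeunits, codeunits.length);
        return decode(codeunits, i);
    }

    /// Drops all code units belonging to the last code point.
    void popBack()
    {
        codeunits = codeunits[0 .. $ - strideBack(codeunits, codeunits.length)];
    }

    @property Utf8String save() { return this; }

    // Deliberately no length, opIndex or opSlice: they cannot be O(1)
    // in terms of code points.
}

// The type presents exactly one interface to generic code.
static assert(isBidirectionalRange!Utf8String);
static assert(!isRandomAccessRange!Utf8String);
static assert(!hasLength!Utf8String);

void main()
{
    auto s = Utf8String("héllo");
    assert(s.front == 'h');
    assert(s.back == 'o');
    assert(s.walkLength == 5);        // code points
    assert(s.codeunits.length == 6);  // code units, on explicit request
}
```

With a type like this, code units are only reachable through the explicit codeunits member, which is the trade-off the rest of the thread argues about.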

That is ideal to me. Of course, I understand code compatibility is 
important, and compromises have to be made. The current situation is a 
compromise, but I don't like it because it is a WAT for every newcomer. 
But the last point, for example, would bring nothing more than code breakage. 
Such code breakage may make us find bugs however...

-- 
Christophe

Jun 28 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
 "Jonathan M Davis", in message (digitalmars.D:170852), wrote:
 completely consistent with regards to how it treats strings. The _only_
 inconsistencies are between the language and the library - namely how
 foreach iterates on code units by default and the fact that while the
 language defines length, slicing, and random-access operations for
 strings, the library effectively does not consider strings to have them.

 char[] is not treated as an array by the library

Phobos _does_ treat char[] as an array. isDynamicArray!(char[]) is true, and
char[] works with the functions in std.array. It's just that they're all
special-cased appropriately to handle narrow strings properly. What it doesn't
do is treat char[] as a range of char.

 and is not treated as a RandomAccessRange.

Which is what I already said.

 That is a second inconsistency, and it would be avoided if string were a
 struct.

No, it wouldn't. It is _impossible_ to implement length, slicing, and indexing
for UTF-8 and UTF-16 strings in O(1). Whether you're using an array or a
struct to represent them is irrelevant. And if you can't do those operations
in O(1), then they can't be random access ranges.

The _only_ thing that using a struct for narrow strings fixes is the
inconsistencies with foreach (it would then use dchar just like all of the
range stuff does), and slicing, indexing, and length wouldn't be on it,
eliminating the oddity of them existing but not considered to exist by
range-based functions. It _would_ make things somewhat nicer for newbies, but
it would not give you one iota more of functionality. Narrow strings would
still be bidirectional ranges but not random access ranges, and you would
still have to operate on the underlying array to operate on strings
efficiently.

If we were to start from scratch, it probably would be better to go with a
struct type for strings, but it would break far too much code for far too
little benefit at this point. You need to understand the unicode stuff
regardless - like the difference between code units and code points. So, if
anything, the fact that strings are treated inconsistently and are treated as
ranges of dchar - which confuses so many newbies - is arguably a _good_ thing
in that it forces newbies to realize and understand the unicode issues
involved rather than blindly using strings in a horribly inefficient manner as
would inevitably occur with a struct string type.

So, no, the situation is not exactly ideal, and yes, a struct string type
might have been a better solution, but I think that many of the folks who are
pushing for a struct string type are seriously overestimating the problems
that it would solve. Yes, it would make the language and library more
consistent, but that's it. You'd still have to use strings in essentially the
same way that you do now. It's just that you wouldn't have to explicitly use
dchar with foreach, and you'd have to get at the property which returned the
underlying array in order to operate on the code units as you need to do in
many functions to make your code appropriately efficient rather than simply
using the string that way directly by not using its range-based functions.
There is a difference, but it's a lot smaller than many people seem to think.

- Jonathan M Davis

Jun 28 2012

travert phare.normalesup.org (Christophe Travert) writes:

Jonathan M Davis, in message (digitalmars.D:170872), wrote:
 On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
 "Jonathan M Davis", in message (digitalmars.D:170852), wrote:
 completely consistent with regards to how it treats strings. The _only_
 inconsistencies are between the language and the library - namely how
 foreach iterates on code units by default and the fact that while the
 language defines length, slicing, and random-access operations for
 strings, the library effectively does not consider strings to have them.


 
 char[] is not treated as an array by the library

 
 Phobos _does_ treat char[] as an array. isDynamicArray!(char[]) is true, and 
 char[] works with the functions in std.array. It's just that they're all 
 special-cased appropriately to handle narrow strings properly. What it doesn't 
 do is treat char[] as a range of char.
 
 and is not treated as a RandomAccessRange.


All arrays are treated as RandomAccessRanges, except for char[] and 
wchar[]. So I think I am entitled to say that strings are not treated as 
arrays. And I would say I am also entitled to say strings are not normal 
ranges, since they define length, but have isLength as true, and define 
opIndex and opSlice, but are not RandomAccessRanges.

The fact that isDynamicArray!(char[]) is true, but 
isRandomAccessRange is not is just another aspect of the schizophrenia. 
The behavior of a templated function on a string will depend on which 
was used as a guard.

 
 Which is what I already said.
 
 That is a second inconsistency, and it would be avoided if string were a
 struct.
 
 No, it wouldn't. It is _impossible_ to implement length, slicing, and indexing 
 for UTF-8 and UTF-16 strings in O(1). Whether you're using an array or a 
 struct to represent them is irrelevant. And if you can't do those operations 
 in O(1), then they can't be random access ranges.

I never said strings should support length and slicing. I even said 
they should not. foreach is inconsistent with the way strings are 
treated in phobos, but opIndex, opSlice and length are inconsistent too. 
string[0] and string.front do not even return the same....
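
A minimal check of that last claim, assuming a UTF-8 source file and Phobos's current decoding behaviour for front:

```d
import std.range : front;

void main()
{
    string s = "é";   // U+00E9, encoded in UTF-8 as 0xC3 0xA9

    // The language indexes code units ...
    static assert(is(typeof(s[0]) == immutable(char)));
    assert(s[0] == 0xC3);

    // ... while the range primitive decodes a whole code point.
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == 0xE9);
}
```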

Please read my post a little bit more carefully before 
answering them.

About the rest of your post, I basically say the same as you in shorter 
terms, except that I am in favor of changing things (but I didn't even 
say they should be changed in my conclusion).

Newcomers are troubled by this problem, and I think it is important. 
They will make mistakes when using both array and range functions on 
strings in the same algorithm, or when using array functions without 
knowing about utf8 encoding issues (the fact that array functions are 
also valid range functions if not for strings does not help). But I also 
think experienced programmers can be affected, because of inattention, 
reusing code written by inexperienced programmers, or inappropriate 
template guards usage.

As a more general comment, I think having a consistent language is a very 
important goal to achieve when designing a language. It makes everything 
simpler, from language design to user through compiler and library 
development. It may not be too late for D.

-- 
Christophe

Jun 28 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, June 28, 2012 09:28:52 Christophe Travert wrote:
 I never said strings should support length and slicing. I even said
 they should not. foreach is inconsistent with the way strings are
 treated in phobos, but opIndex, opSlice and length are inconsistent too.
 string[0] and string.front do not even return the same....
 
 Please read my post a little bit more carefully before
 answering them.

You said this:

 char[] is not treated as an array by the library, and is not treated as 
 a RandomAccessRange. That is a second inconsistency, and it would be 
 avoided if string were a struct.

So, it looked to me like you were saying that making string a struct would 
make it so that it was a random access range, which would mean implementing 
length, opSlice, and opIndex.

- Jonathan M Davis

Jun 28 2012

"David Nadlinger" <see klickverbot.at> writes:

On Thursday, 28 June 2012 at 09:49:19 UTC, Jonathan M Davis wrote:
 char[] is not treated as an array by the library, and is not 
 treated as a RandomAccessRange. That is a second 
 inconsistency, and it would be avoided if string were a struct.

 So, it looked to me like you were saying that making string a 
 struct would
 make it so that it was a random access range, which would mean 
 implementing
 length, opSlice, and opIndex.

I think he meant that the problem would be solved because people 
would be less likely to expect it to be a random access range in 
the first place.

What troubles me most with having is(string == immutable(char)[]) 
is that it more or less precludes us from adding small string 
optimizations, etc. in the future…

David

Jun 28 2012

travert phare.normalesup.org (Christophe Travert) writes:

"David Nadlinger", in message (digitalmars.D:170875), wrote:
 On Thursday, 28 June 2012 at 09:49:19 UTC, Jonathan M Davis wrote:
 char[] is not treated as an array by the library, and is not 
 treated as a RandomAccessRange. That is a second 
 inconsistency, and it would be avoided if string were a struct.

 So, it looked to me like you were saying that making string a 
 struct would
 make it so that it was a random access range, which would mean 
 implementing
 length, opSlice, and opIndex.

 
 I think he meant that the problem would be solved because people 
 would be less likely to expect it to be a random access range in 
 the first place.

Yes.

Jun 28 2012

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/28/12 5:28 AM, Christophe Travert wrote:
 As a more general comment, I think having a consistent language is a very
 important goal to achieve when designing a language. It makes everything
 simpler, from language design to user through compiler and library
 development. It may not be too late for D.

In a way it's too late for any language in actual use. The "fog of 
language design" makes it nigh impossible to design a language/library 
combo that is perfectly consistent, not to mention the fact that 
consistency itself has many dimensions, some of which may be in competition.

We'd probably do things a bit differently if we started from scratch. As 
things are, D's strings have a couple of quirks but are very apt for 
good and efficient string manipulation where index computation in the 
code unit realm is combined with the range of code points realm. I 
suppose people who have an understanding of UTF don't have difficulty 
using D's strings. Above all, alea jacta est and there's little we can 
do about that save for inventing a time machine.


Andrei

Jun 28 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 06/28/2012 11:28 AM, Christophe Travert wrote:
 Jonathan M Davis, in message (digitalmars.D:170872), wrote:
 On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
 "Jonathan M Davis", in message (digitalmars.D:170852), wrote:
 completely consistent with regards to how it treats strings. The _only_
 inconsistencies are between the language and the library - namely how
 foreach iterates on code units by default and the fact that while the
 language defines length, slicing, and random-access operations for
 strings, the library effectively does not consider strings to have them.


 char[] is not treated as an array by the library

 Phobos _does_ treat char[] as an array. isDynamicArray!(char[]) is true, and
 char[] works with the functions in std.array. It's just that they're all
 special-cased appropriately to handle narrow strings properly. What it doesn't
 do is treat char[] as a range of char.

 and is not treated as a RandomAccessRange.


 All arrays are treated as RandomAccessRanges, except for char[] and
 wchar[]. So I think I am entitled to say that strings are not treated as
 arrays.

"Not treated like other arrays", is the strongest claim that can be
made there.

 And I would say I am also entitled to say strings are not normal
 ranges, since they define length, but have isLength as true,

hasLength as false. They define length, but it is not part of the range
interface.

It is analogous to the following:

class charArray : ForwardRange!dchar{
     /* interface ForwardRange!dchar */
     dchar front();
     bool empty();
     void popFront();
     charArray save();

     /* other methods */
     size_t length();
     char opIndex(size_t i);
     charArray opSlice(size_t a, size_t b);
}

 and define opIndex and opSlice,

[] and [..] operate on code units, but for a random access range as
defined by Phobos, they would not.

 but are not RandomAccessRanges.

 The fact that isDynamicArray!(char[]) is true, but
 isRandomAccessRange is not is just another aspect of the schizophrenia.
 The behavior of a templated function on a string will depend on which
 was used as a guard.

No, it won't.

 Which is what I already said.

 That is a second inconsistency, and it would be avoided if string were a
 struct.

 No, it wouldn't. It is _impossible_ to implement length, slicing, and indexing
 for UTF-8 and UTF-16 strings in O(1). Whether you're using an array or a
 struct to represent them is irrelevant. And if you can't do those operations
 in O(1), then they can't be random access ranges.

 I never said strings should support length and slicing. I even said
 they should not. foreach is inconsistent with the way strings are
 treated in phobos, but opIndex, opSlice and length are inconsistent too.
 string[0] and string.front do not even return the same....

 Please read my post a little bit more carefully before
 answering them.

This is very impolite.

On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
 Slicing is provided for convenience, but not as opSlice, since it is not O(1),
 but as a method with a separate name.


 About the rest of your post, I basically say the same as you in shorter
 terms, except that I am in favor of changing things (but I didn't even
 say they should be changed in my conclusion).

When read carefully, the conclusion says that code compatibility is
important only a couple sentences before it says that breaking code for
the fun of it may be a good thing.

 newcomers are troubled by this problem,  and I think it is important.

Newcomers sometimes become seasoned D programmers. Sometimes they know
what Unicode is about even before that.

 They will make mistakes when using both array and range functions on
 strings in the same algorithm, or when using array functions without
 knowing about utf8 encoding issues (the fact that array functions are
 also valid range functions if not for strings does not help). But I also
 think experienced programmers can be affected, because of inattention,
 reusing code written by inexperienced programmers, or inappropriate
 template guards usage.

In the ASCII-7 subset, UTF-8 strings are actually random access, and
iterating a UTF-8 string by code point is safe if you are, e.g., just
going to treat some ASCII characters specially.

I don't care much whether or not (bad?) code handles Unicode correctly,
but it is important that code correctly documents whether or not it
does so, and to what extent it does. The new std.regex has good Unicode
support, and to enable that, it had to add some pretty large tables to
Phobos, the functionality of which is not exposed to the library user
as of now. It is therefore safe to say that many/most existing D
programs do not handle the whole Unicode standard correctly.

Unicode has to be _actively_ supported. There are distinct issues that
are hard to abstract away efficiently. Treating a Unicode string as a
range of code points is not solving them. (dchar[] indexing is still
not guaranteed to give back the 'i'th character!) Why build this
interpretation into the language?
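
The parenthetical about dchar[] can be illustrated with a combining sequence. This sketch relies on std.uni.byGrapheme from current Phobos, which as far as I know was not available when this thread was written, so treat it as an after-the-fact illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as one
    // character, stored as two code points.
    dstring s = "e\u0301"d;

    assert(s.length == 2);                  // code points
    assert(s[0] == 'e');                    // index 0 is only the base letter
    assert(s.byGrapheme.walkLength == 1);   // user-perceived characters
}
```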

 As a more general comment, I think having a consistent language is a very
 important goal to achieve when designing a language. It makes everything
 simpler, from language design to user through compiler and library
 development. It may not be too late for D.

The language is consistent here. The library treats some language
features specially. It is not the language that is "confusing". The
whole reason to introduce the library behaviour is probably based on
similar reasoning as given in your post. The special casing has not
caused me any trouble, and sometimes it was useful.

Jun 28 2012

travert phare.normalesup.org (Christophe Travert) writes:

Timon Gehr, in message (digitalmars.D:170884), wrote:
 And I would say I am also entitled to say strings are not normal
 ranges, since they define length, but have isLength as true,

 
 hasLength as false.

Of course, my mistake.

 They define length, but it is not part of the range interface.
 
 It is analogous to the following:
 [...]

I consider this bad design.

 and define opIndex and opSlice,

 
 [] and [..] operate on code units, but for a random access range as
 defined by Phobos, they would not.

A bidirectional range of dchar with additional methods of a random 
access range of char. That is what I call schizophrenic.

 but are not RandomAccessRanges.

 The fact that isDynamicArray!(char[]) is true, but
 isRandomAccessRange is not is just another aspect of the schizophrenia.
 The behavior of a templated function on a string will depend on which
 was used as a guard.

 No, it won't.

Take the isRandomAccessRange specialization of an algorithm in Phobos, 
replace the guard by isDynamicArray, and you are very likely to change 
the behavior, if you do not simply break the function.
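
A small sketch of what swapping the guard does. The two helper templates, lastViaRange and lastViaArray, are hypothetical and are not taken from Phobos; they differ only in their constraint.

```d
import std.range : isRandomAccessRange;
import std.traits : isDynamicArray;

/// Last element, guarded by the range trait.
auto lastViaRange(R)(R r) if (isRandomAccessRange!R)
{
    return r[r.length - 1];
}

/// Same body, guarded by the array trait instead.
auto lastViaArray(A)(A a) if (isDynamicArray!A)
{
    return a[a.length - 1];
}

void main()
{
    // For ordinary arrays the two guards are interchangeable.
    int[] xs = [1, 2, 3];
    assert(lastViaRange(xs) == 3);
    assert(lastViaArray(xs) == 3);

    // For a narrow string they are not.
    string s = "é";   // 0xC3 0xA9 in UTF-8
    static assert(!__traits(compiles, lastViaRange(s)));  // rejected
    assert(lastViaArray(s) == 0xA9);  // accepted, yields a trailing code unit
}
```

With the range guard the string is rejected outright; with the array guard it is accepted and the function silently works on code units.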

 When read carefully, the conclusion says that code compatibility is
 important only a couple sentences before it says that breaking code for
 the fun of it may be a good thing.

It was intended as a side-note, not a conclusion. Sorry for not being 
clear.

 newcomers are troubled by this problem,  and I think it is important.

 
 Newcomers sometimes become seasoned D programmers. Sometimes they know
 what Unicode is about even before that.

I knew what unicode was before coming to D. But, strings being arrays, I 
expected myString.front to return the same as myString[0], i.e., a 
char, and that it was my job to make sure my algorithms were valid for 
UTF-8 encoding if I wanted to support it. Most of the time, in languages 
without such UTF-8 support, they are, without much trouble. Code units 
matter more than code points most of the time.

 The language is consistent here. The library treats some language
 features specially. It is not the language that is "confusing". The
 whole reason to introduce the library behaviour is probably based on
 similar reasoning as given in your post.

OK, I should have said the standard library is inconsistent (with the 
language).

 The special casing has not caused me any trouble, and sometimes it was 
 useful.

Of course, I can live with that.

Jun 28 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 06/27/2012 11:11 PM, Steven Schveighoffer wrote:
 On Wed, 27 Jun 2012 16:55:49 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 06/27/2012 10:22 PM, Steven Schveighoffer wrote:
 On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr gmx.ch>
 wrote:

 There is no reason for anyone to be confused about this endlessly. It
 is simple to understand. Furthermore, think about the implications of a
 library-defined string type: it just introduces the problem of what the
 type of built-in string literals should be. This would cause endless
 pain with type deduction, ifti, string mixins, ... A library-defined
 string type cannot be a full string type. Pretending that it can has no
 value.

 Default type of the literal should be the library type.

 Then it is not a library type, but a built-in type. Are you planning to
 inject a dependency on Phobos into the compiler?

 No, druntime, and include minimal utf support. We do the same thing with
 AssociativeArray.

In this case it is misleading to call it a library type.

 If you want immutable(char)[], use "abc".codeunits or equivalent.

 I really don't want to type .codeunits, but I want to use
 immutable(char)[] everywhere. This 'library type' is just an interface
 change that makes writing nice and efficient code a kludge.

 When most string functions take strings, why would you want to use
 immutable(char)[] everywhere?

Because the proposed 'string' interface is inconvenient to use and 
useless. It is a struct with one data member and no additionally
maintained invariant, and it strictly narrows the essential parts of
the interface to the data that is reachable without a large typing
overhead. immutable(char)[] supports exactly the operations I usually
need. Maybe I'm not representative.

 Of course, it should by default work as a zero-terminated char * for C
 compatibility.

 The current situation is not simple to understand.

 It is simple, even if not immediately obvious. It does not have to be
 immediately obvious without explanation. It needs to be convenient.

 Try sorting an array of ascii characters.

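import std.algorithm : sort;
// .dup first: the literal is immutable and must not be sorted in place
auto asciitext = cast(ubyte[])"I am ascii text".dup;
sort(asciitext);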


 Generic code that accepts arrays has to special-case narrow-width
 strings if you plan to
 use phobos with them in some cases. That is a horrible situation.

 Generic code accepts ranges, not arrays. All necessary (or maybe
 unnecessary, I don't know) special casing is already done for you in
 Phobos. The _only_ thing that is problematic is the inconsistent
 'foreach' behaviour.

 Plenty of generic code specializes on arrays.

Ok, point taken. But plenty of generic code then specializes on
strings as well. Would the net gain be so huge? There is also always
the option of just not passing strings to some helper template function
you defined.

There are multiple valid contradictory considerations on the topic, but
I have found the current way of dealing with strings very pleasant.

 alias immutable(char)[] string is just fine.

 That is technically fine, but if phobos wants to treat immutable(char)[]
 as something other than an array, it is not fine.

 -Steve

 Phobos does not treat immutable(char)[] as something other than an
 array. It does not treat all arrays uniformly though.

 It certainly does. An array by definition is a random-access range. It
 does not treat strings as random access ranges.

 -Steve

You are right about the random-access part, but the definition of an
array does not depend on the 'range' concept.

Jun 27 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, June 27, 2012 23:41:14 Timon Gehr wrote:
 On 06/27/2012 11:11 PM, Steven Schveighoffer wrote:
 When most string functions take strings, why would you want to use
 immutable(char)[] everywhere?

 
 Because the proposed 'string' interface is inconvenient to use and
 useless. It is a struct with one data member and no additionally
 maintained invariant, and it strictly narrows the essential parts of
 the interface to the data that is reachable without a large typing
 overhead. immutable(char)[] supports exactly the operations I usually
 need. Maybe I'm not representative.

I think that a lot of programmers want to be able to use strings without 
worrying about any of the details (like unicode). The fact that foreach and 
the library don't treat strings the same is confusing, and the fact that 
narrow strings are ranges of dchar (with all that that implies with regards to 
the operations that they support) seems to confuse a lot of people. If we had 
a struct for a string type, then the usage would be consistent (always a range 
of dchar), allowing the average programmer to more or less ignore unicode 
considerations as long as they don't care about efficiency, but it would still 
allow those who _do_ care to get at the underlying representation. So, a 
struct would be an improvement in that regard.

But for those who know what they're doing with regards to unicode and 
understand the fact that foreach treats strings one way and the library treats 
them another way, it really isn't a problem. It works quite well (which is one 
of the reasons that Walter isn't too keen on changing strings). It just isn't 
terribly newbie-friendly.

- Jonathan M Davis

Jun 27 2012

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

Sorry to resurrect this thread, I've been very absent from D, and am just  
now going through all these old posts.

On Wed, 27 Jun 2012 17:41:14 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 06/27/2012 11:11 PM, Steven Schveighoffer wrote:

 No, druntime, and include minimal utf support. We do the same thing with
 AssociativeArray.

 In this case it is misleading to call it a library type.

What I mean is, the compiler does not define the structure of it.  It  
simply knows it exists, and expects a certain API for it.

The type itself is purely defined in the library, and could possibly be  
used directly as a library type.

 If you want immutable(char)[], use "abc".codeunits or equivalent.

 I really don't want to type .codeunits, but I want to use
 immutable(char)[] everywhere. This 'library type' is just an interface
 change that makes writing nice and efficient code a kludge.

 When most string functions take strings, why would you want to use
 immutable(char)[] everywhere?

 Because the proposed 'string' interface is inconvenient to use and  
 useless. It is a struct with one data member and no additionally
 maintained invariant, and it strictly narrows the essential parts of
 the interface to the data that is reachable without a large typing
 overhead. immutable(char)[] supports exactly the operations I usually
 need. Maybe I'm not representative.

Most usages of strings are to concatenate them, print them, use them as  
keys, read them from a stream, etc.  None of this requires direct access  
to the data.  They can be treated as a nebulous type.

So maybe you are in the minority.  I don't really know.

 The current situation is not simple to understand.

 It is simple, even if not immediately obvious. It does not have to be
 immediately obvious without explanation. It needs to be convenient.



I will respond to this in a different way.  This is the confusing part:

import std.range; // hasLength, isRandomAccessRange

string x = "abc";
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);
auto len = x.length;
auto c = x[0]; // wtf?

It is *always* going to cause people to question isRandomAccessRange and  
hasLength because clearly a string has the purported properties needed to  
satisfy both.

 Generic code that accepts arrays has to special-case narrow-width
 strings if you plan to
 use phobos with them in some cases. That is a horrible situation.

 Generic code accepts ranges, not arrays. All necessary (or maybe
 unnecessary, I don't know) special casing is already done for you in
 Phobos. The _only_ thing that is problematic is the inconsistent
 'foreach' behaviour.

 Plenty of generic code specializes on arrays.

 Ok, point taken. But plenty of generic code then specializes on
 strings as well. Would the net gain be so huge? There is also always
 the option of just not passing strings to some helper template function
 you defined.

The net gain is not in the reduction of specializations -- of course we  
will need specializations for strings because to do any less would make D  
extremely inefficient compared to other languages.
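
As a rough illustration of the kind of special-casing under discussion (a sketch only: firstElement is a hypothetical helper, not anything in Phobos), generic code written against T[] has to branch on narrow strings if it wants to agree with what the range primitives report:

```d
import std.range;                    // front, ElementType
import std.traits : isNarrowString;

// Hypothetical helper: the first element of an array, as Phobos ranges see it.
auto firstElement(T)(T[] arr)
{
    static if (isNarrowString!(T[]))
        return arr.front;   // decodes a code point: yields dchar
    else
        return arr[0];      // plain indexing: yields T
}

void main()
{
    // The range view of a string yields dchar...
    static assert(is(typeof(firstElement("abc")) == dchar));
    static assert(is(ElementType!string == dchar));
    // ...while the array view of the same string yields a code unit.
    static assert(is(typeof("abc"[0]) == immutable(char)));

    assert(firstElement([1, 2, 3]) == 1);
}
```

Without the static if, the same template would return a code unit for string and an element for every other array, which is exactly the mismatch between the array view and the range view.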

The gain is in the reduction of confusion.  We are asking our users "I  
know, I know, it's an array, but *please* pretend it's not! Don't listen  
to the compiler!"

And this is for a *BASIC* type of the language!

range.save() is the same thing.  It prevents nothing, and you have to take  
special care to avoid using basic operations (i.e. assign) and pretend  
they don't exist, even though they compile.  To me, that is worthless.  If  
a string does not support random access, then str[0] should not compile.  
period.

 You are right about the random-access part, but the definition of an
 array does not depend on the 'range' concept.

The range concept is notably more confusing with strings that are random  
access types but not random access ranges, even though they support all  
the properties needed for a random access range.

-Steve

Aug 23 2012

Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:

On Thu, Jun 28, 2012 at 12:22 AM, Steven Schveighoffer
<schveiguy yahoo.com> wrote:
 On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 There is no reason for anyone to be confused about this endlessly. It
 is simple to understand. Furthermore, think about the implications of a
 library-defined string type: it just introduces the problem of what the
 type of built-in string literals should be. This would cause endless
 pain with type deduction, ifti, string mixins, ... A library-defined
 string type cannot be a full string type. Pretending that it can has no
 value.


 Default type of the literal should be the library type.  If you want
 immutable(char)[], use "abc".codeunits or equivalent.

 Of course, it should by default work as a zero-terminated char * for C
 compatibility.

 The current situation is not simple to understand.  Generic code that
 accepts arrays has to special-case narrow-width strings if you plan to use
 phobos with them in some cases.  That is a horrible situation.


 alias immutable(char)[] string is just fine.


 That is technically fine, but if phobos wants to treat immutable(char)[] as
 something other than an array, it is not fine.

 -Steve

Currently strings below dstring are only applicable in ForwardRange
and below, but not RandomAccessRange as they should be.

--
Bye,
Gor Gyolchanyan.

Jun 27 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Thursday, June 28, 2012 08:59:32 Gor Gyolchanyan wrote:
 On Thu, Jun 28, 2012 at 12:22 AM, Steven Schveighoffer
 
 <schveiguy yahoo.com> wrote:
 On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:
 There is no reason for anyone to be confused about this endlessly. It
 is simple to understand. Furthermore, think about the implications of a
 library-defined string type: it just introduces the problem of what the
 type of built-in string literals should be. This would cause endless
 pain with type deduction, ifti, string mixins, ... A library-defined
 string type cannot be a full string type. Pretending that it can has no
 value.

 
 Default type of the literal should be the library type.  If you want
 immutable(char)[], use "abc".codeunits or equivalent.
 
 Of course, it should by default work as a zero-terminated char * for C
 compatibility.
 
 The current situation is not simple to understand.  Generic code that
 accepts arrays has to special-case narrow-width strings if you plan to use
 phobos with them in some cases.  That is a horrible situation.
 
 alias immutable(char)[] string is just fine.

 
 That is technically fine, but if phobos wants to treat immutable(char)[]
 as
 something other than an array, it is not fine.
 
 -Steve

 
 Currently strings below dstring are only applicable in ForwardRange
 and below, but not RandomAccessRange as they should be.

Except that they shouldn't be, because you can't do random access on a narrow 
string in O(1). If you can't index or slice a range in O(1), it has no 
business having those operations. The same goes for length. That's why narrow 
strings do not have any of those operations as far as ranges are concerned. 
Having those operations in anything worse than O(1) violates the algorithmic 
complexity guarantees that ranges are supposed to provide, which would 
seriously harm the efficiency of algorithms which rely on them. It's the same 
reason why std.container defines the algorithmic complexity of all the 
operations in std.container. If you want a random-access range which is a 
string type, you need dchar[], const(dchar)[], or dstring. That is very much 
on purpose and would not change even if strings were structs.
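
A minimal sketch of how this shows up in the range traits, assuming the std.range traits behave as described in this thread (narrow strings are decoded as ranges of dchar); the sample string is only an illustration:

```d
import std.range; // isBidirectionalRange, isRandomAccessRange, hasLength, hasSlicing, walkLength

void main()
{
    // Narrow strings are ranges of dchar, so they get no O(1) length,
    // indexing, or slicing as far as ranges are concerned.
    static assert(isBidirectionalRange!string);
    static assert(!isRandomAccessRange!string);
    static assert(!hasLength!string);
    static assert(!hasSlicing!string);

    // In a dstring one code unit is one code point, so random access is O(1).
    static assert(isRandomAccessRange!dstring);
    static assert(hasLength!dstring);

    string s = "h\u00E9llo";     // U+00E9 takes two code units in UTF-8
    assert(s.length == 6);       // array view: number of code units, O(1)
    assert(s.walkLength == 5);   // range view: number of code points, O(n)
}
```

The array's own .length and indexing still compile, but they count and address code units, not the dchar elements that the range interface exposes.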

- Jonathan M Davis

Jun 27 2012