digitalmars.D - standard ranges

Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
Are there functions that wrap arbitrary range types into the standard range
interfaces?
I looked at the docs but couldn't find anything.
Use case:

RandomAccessRange!dchar s = ???("Hello, world!");

-- 
Bye,
Gor Gyolchanyan.
Jun 27 2012
Timon Gehr <timon.gehr gmx.ch> writes:
On 06/27/2012 03:25 PM, Gor Gyolchanyan wrote:
 Are there functions, which wrap arbitrary range types into standard
 range interfaces?
 I looked at the docs, but couldn't find anything.
 Use case:

 RandomAccessRange!dchar s = ???("Hello, world!");

 --
 Bye,
 Gor Gyolchanyan.
A narrow string is not a RandomAccessRange.

RandomAccessFinite!(immutable(dchar)) s = inputRangeObject("Hello, world!"d);
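The one-liner above can be tried out with a minimal sketch like the following. It assumes a reasonably recent Phobos, where the class-based range interfaces are re-exported by std.range; the helper name wrapHello is made up for this illustration.

```d
import std.range : inputRangeObject, RandomAccessFinite;

// Hypothetical helper: wraps a dstring literal in the class-based interface.
RandomAccessFinite!(immutable(dchar)) wrapHello()
{
    // The "d" suffix makes the literal a dstring, which (unlike a narrow
    // string) is a random-access range with a known length.
    return inputRangeObject("Hello, world!"d);
}

void main()
{
    auto s = wrapHello();
    assert(s.length == 13); // 13 code points, one dchar each
    assert(s[0] == 'H');    // indexing goes through the interface
    assert(s.front == 'H');
}
```

Without the "d" suffix the literal is a narrow string, and the wrapped object would presumably only satisfy a weaker interface than RandomAccessFinite.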
Jun 27 2012
Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
On Wed, Jun 27, 2012 at 5:38 PM, Timon Gehr <timon.gehr gmx.ch> wrote:

 A narrow string is not a RandomAccessRange.

 RandomAccessFinite!(immutable(dchar)) s = inputRangeObject("Hello, world!"d);

I tested it out and the string literal without qualifiers counts as a dstring.

--
Bye,
Gor Gyolchanyan.
Jun 27 2012
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, June 27, 2012 17:58:46 Gor Gyolchanyan wrote:
 I tested it out and the string literal without qualifiers counts as a
 dstring.

That depends entirely on what you assign it to. writeln(typeof("hello").stringof) prints string, not dstring. So, the literal by itself is a string by default.

- Jonathan M Davis
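A small sketch of that point, for anyone who wants to check it: the literal is a string on its own, converts when the target type asks for something else, and is deduced as string by a template. The helper inferredAsString is hypothetical and exists only for this example.

```d
// The type of a bare literal is string, i.e. immutable(char)[].
static assert(is(typeof("hello") == string));

// Hypothetical helper: reports whether template type deduction picked string.
bool inferredAsString(T)(T value)
{
    return is(T == string);
}

void main()
{
    // The same literal converts implicitly when the target type demands it.
    wstring w = "hello";
    dstring d = "hello";
    assert(w.length == 5 && d.length == 5);

    // With nothing to guide it, inference settles on string.
    auto a = "hello";
    static assert(is(typeof(a) == string));
    assert(inferredAsString("hello"));
    assert(!inferredAsString("hello"d)); // explicit suffix gives dstring
}
```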
Jun 27 2012
Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
On Wed, Jun 27, 2012 at 7:41 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 That depends entirely on what you assign it to.
 writeln(typeof("hello").stringof) prints string, not dstring. So, the
 literal by itself is a string by default.

This is weird. I wrote a function that transforms anything that qualifies as isForwardRange into an implementation of ForwardRange, and the type inference of that function produced a ForwardRangeImpl!dchar when I passed it a string literal, even though string and wstring also qualify as forward ranges.

--
Bye,
Gor Gyolchanyan.
Jun 27 2012
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, June 27, 2012 19:47:41 Gor Gyolchanyan wrote:
 This is weird. I wrote a function that transforms anything that qualifies
 as isForwardRange into an implementation of ForwardRange, and the type
 inference of that function produced a ForwardRangeImpl!dchar when I
 passed it a string literal, even though string and wstring also qualify
 as forward ranges.

_All_ strings are considered to be ranges of dchar. That's why string and wstring are not random access ranges and hasLength is false for them.

- Jonathan M Davis
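These claims can be checked at compile time with the range traits. The sketch below assumes a Phobos in which narrow strings are still auto-decoded (as in the Phobos discussed in this thread); the helper isRandomAccess is hypothetical.

```d
import std.range.primitives : ElementType, hasLength, isBidirectionalRange,
    isForwardRange, isRandomAccessRange;

// Narrow strings: ranges of dchar, bidirectional, but neither random access
// nor length-bearing as far as the range traits are concerned.
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!wstring == dchar));
static assert(isForwardRange!string && isBidirectionalRange!string);
static assert(!isRandomAccessRange!string);
static assert(!hasLength!string);

// dstring has one code unit per code point, so it keeps both properties.
static assert(isRandomAccessRange!dstring);
static assert(hasLength!dstring);

// Hypothetical helper exposing the trait as a runtime-callable check.
bool isRandomAccess(R)(R range)
{
    return isRandomAccessRange!R;
}

void main()
{
    assert(!isRandomAccess("abc"));
    assert(isRandomAccess("abc"d));
}
```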
Jun 27 2012
Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
On Wed, Jun 27, 2012 at 7:49 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 _All_ strings are considered to be ranges of dchar. That's why string and
 wstring are not random access ranges and hasLength is false for them.

So why is the type of a string literal _string_ by default? Isn't it confusing when dealing with ranges?

--
Bye,
Gor Gyolchanyan.
Jun 27 2012
Timon Gehr <timon.gehr gmx.ch> writes:
On 06/27/2012 05:54 PM, Gor Gyolchanyan wrote:
 On Wed, Jun 27, 2012 at 7:49 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:

     _All_ strings are considered to be ranges of dchar. That's why string
     and wstring are not random access ranges and hasLength is false for
     them.

     - Jonathan M Davis

 So why is the type of a string literal _string_ by default?

Because it is a _string_ literal. If you are asking why UTF-8 is the default, that is because it is the most space-efficient encoding, it is backwards-compatible with ASCII, and random access to a string is rarely required.

 Isn't it confusing when dealing with ranges?

Why would it be?
Jun 27 2012
"Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, June 27, 2012 19:54:12 Gor Gyolchanyan wrote:
 So why is the type of a string literal _string_ by default? Isn't it
 confusing when dealing with ranges?

I don't see why having the literal be a string would make anything confusing. The fact that a string is considered a range of dchar rather than char could be, but I don't see why having a string literal be a dstring instead of a string would help with that. Besides, it's generally expected that you'll use string for strings unless you specifically need wstring or dstring for some reason.

Regardless, ranges aren't really part of the language. They're a library artifact. The _only_ place that the language has anything to do with them is foreach, in which case

    foreach(e; range)
    {
        // code
    }

becomes

    for(auto _range = range; !_range.empty; _range.popFront())
    {
        auto e = _range.front;
        // code
    }

That's it. So, the fact that Phobos treats strings as ranges of dchar is completely separate from what the language is doing with string literals. foreach on strings doesn't iterate over dchars unless you specifically give dchar as the element type. You can get a string's length. You can use random access on it. You can slice it.

But this falls apart _very_ quickly with general algorithms, because a string is an array of code _units_ rather than code points. So, if you iterate over char, you're iterating over pieces of characters rather than whole characters. So, Phobos' solution is to treat arrays of char and wchar as ranges of dchar rather than ranges of char and wchar, and they lose length, random access, and slicing as far as ranges are concerned (though algorithms can special case for them and use those abilities where appropriate, since they're still there - they just can't be used generically or you'd be operating on code units).

In some cases, you need to be able to treat strings as arrays of code units, while in others you need to treat them as arrays of code points. In order to use strings properly, you need to understand that. There's no way around it. It's life with Unicode.

The library went the route of using code points for everything because it's more correct and less error-prone, whereas the language itself generally deals with code units. This does create a bit of schizophrenia when dealing with built-in stuff (such as foreach) and library stuff, but that's the way that it goes at this point. If strings were a struct of some kind that defaulted to using code points but allowed you to use code units when necessary, then the situation could be improved, but no one has been able to come up with a satisfactory proposal to do that, and it would break so much code at this point to change what string was aliased to that it's unlikely to ever happen. Not to mention, it doesn't really fix the underlying problem of having to know and worry about code units vs code points. They're intrinsic to Unicode, and you can't really fix that. There's no way around it if you want to be able to efficiently operate on strings.

- Jonathan M Davis
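The difference between what the language and the library see can be made concrete with a short sketch. The two counting helpers are hypothetical names introduced only for this example; walkLength is the Phobos function that counts elements by iterating a range.

```d
import std.range : walkLength;

// Hypothetical helper: plain foreach infers char, i.e. UTF-8 code units.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (c; s)
        ++n;
    return n;
}

// Hypothetical helper: asking for dchar makes foreach decode code points.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo"; // U+00E9 occupies two code units in UTF-8
    assert(s.length == 6);           // the language counts code units
    assert(countCodeUnits(s) == 6);
    assert(countCodePoints(s) == 5);
    assert(s.walkLength == 5);       // the range view counts code points
}
```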
Jun 27 2012
"Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 27 Jun 2012 13:30:48 -0400, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 I don't see why having the literal be a string would make anything
 confusing. The fact that a string is considered a range of dchar rather
 than char could be, but I don't see why having a string literal be a
 dstring instead of a string would help with that. Besides, it's generally
 expected that you'll use string for strings unless you specifically need
 wstring or dstring for some reason.

No, the reason is:

1. T[] is a range of T, unless T == char or T == wchar, and then it's a range of dchar (huh?)
2. char[] is not a random access range, even though str[i] and str.length work.

The fundamental flaw in the way this works is that Phobos is pretending immutable(char)[] is not an array. immutable(char)[] should be an array of immutable char; string should be a *separate type*, a range of dchar, perhaps with immutable(char)[] as its underlying storage.

D needs a full, library-defined string type. Until it has that, it's going to cause endless confusion and WATs.

-Steve
Jun 27 2012
Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
On Wed, Jun 27, 2012 at 10:09 PM, Steven Schveighoffer <schveiguy yahoo.com> wrote:

 No, the reason is:

 1. T[] is a range of T, unless T == char or T == wchar, and then it's a
    range of dchar (huh?)
 2. char[] is not a random access range, even though str[i] and str.length
    work.

 The fundamental flaw in the way this works is that phobos is pretending
 immutable(char)[] is not an array. immutable(char)[] should be an array of
 immutable char, string should be a *separate type* of a range of dchar,
 perhaps with immutable(char)[] as its underlying storage.

 D needs a full, library-defined string type. Until it has that, it's going
 to cause endless confusion and WATs.

 -Steve

Agreed. Having struct strings (with slices and everything) will set the record straight.

--
Bye,
Gor Gyolchanyan.
Jun 27 2012
"Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, June 27, 2012 22:29:25 Gor Gyolchanyan wrote:
 Agreed. Having struct strings (with slices and everything) will set
 the record straight.

Except that they couldn't have slicing, because it would be very inefficient. You'd have to get at the actual array of code units to slice anything. A struct string type would have to be restricted to exactly the same set of operations that range-based functions consider strings to have and then give you a way to get at the underlying code unit representation to be able to use it when special-casing for strings for efficiency, just like you do now.

You _can't_ get away from the fact that you're dealing with an array (or list or whatever) of code units even if you do want to operate on it as a range of code points most of the time. Having a struct would fix the issues like foreach iterating over char by default whereas range-based functions iterate over dchar - it would make it consistent by making it dchar for everything - but the issue of code unit vs code point still remains and you can't get rid of it. Anyone wanting to write efficient string-processing code _needs_ to understand Unicode. There's no way around it (which is part of the reason that Walter isn't keen on the idea of changing how strings work in the language itself).

So, while having a string type which is a struct does help eliminate the schizophrenia, the core problem of code unit vs code point is still there, and you still need to understand it. There is no fix for it, because it's intrinsic to how Unicode works.

- Jonathan M Davis
Jun 27 2012
Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
On Wed, Jun 27, 2012 at 10:42 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 You _can't_ get away from the fact that you're dealing with an array (or
 list or whatever) of code units even if you do want to operate on it as a
 range of code points most of the time.

Yes, you can get away from it. The struct string would have ubyte[], ushort[], and uint[] as the representation. Maybe even char[], wchar[], and dchar[], but those won't be strings as we know them now. The string struct will take care of encoding 100% transparently and will provide access to the representation, which is good for bit blitting and other encoding-agnostic operations, but the representation is then known NOT to be a valid string and will need to be placed into the string struct in order to use string operations.

--
Bye,
Gor Gyolchanyan.
Jun 27 2012
Timon Gehr <timon.gehr gmx.ch> writes:
On 06/27/2012 08:54 PM, Gor Gyolchanyan wrote:
 Yes you can get away. The struct string would have ubyte[] ushort[] and
 uint[] as the representation. Maybe even the char[], wchar[] and dchar[],
 but those won't be strings as we know them now. The string struct will
 take care of encoding 100% transparently

Encoding cannot be taken care of 100% transparently. It has performance implications.

 and will provide access to the representation, which is good for bit
 blitting and other encoding-agnostic operations, but the representation
 is then known NOT to be a valid string

It is NOT known not to be a valid string. Furthermore, this directly contradicts what you claimed above. If the representation is exposed, it is certainly not transparent.

 and will need to be placed into the string struct in order to use string
 operations.

aliasing..?
Jun 27 2012
Timon Gehr <timon.gehr gmx.ch> writes:
On 06/27/2012 08:09 PM, Steven Schveighoffer wrote:
 The fundamental flaw in the way this works is that phobos is pretending
 immutable(char)[] is not an array. immutable(char)[] should be an array of
 immutable char, string should be a *separate type* of a range of dchar,
 perhaps with immutable(char)[] as its underlying storage.

 D needs a full, library-defined string type. Until it has that, it's going
 to cause endless confusion and WATs.

There is no reason for anyone to be confused about this endlessly. It is simple to understand. Furthermore, think about the implications of a library-defined string type: it just introduces the problem of what the type of built-in string literals should be. This would cause endless pain with type deduction, IFTI, string mixins, ... A library-defined string type cannot be a full string type. Pretending that it can has no value.

alias immutable(char)[] string is just fine.
Jun 27 2012
"Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 There is no reason for anyone to be confused about this endlessly. It
 is simple to understand. Furthermore, think about the implications of a
 library-defined string type: it just introduces the problem of what the
 type of built-in string literals should be. This would cause endless
 pain with type deduction, ifti, string mixins, ... A library-defined
 string type cannot be a full string type. Pretending that it can has no
 value.

Default type of the literal should be the library type. If you want immutable(char)[], use "abc".codeunits or equivalent. Of course, it should by default work as a zero-terminated char * for C compatibility.

The current situation is not simple to understand. Generic code that accepts arrays has to special-case narrow-width strings if you plan to use Phobos with them in some cases. That is a horrible situation.

 alias immutable(char)[] string is just fine.

That is technically fine, but if Phobos wants to treat immutable(char)[] as something other than an array, it is not fine.

-Steve
Jun 27 2012
Timon Gehr <timon.gehr gmx.ch> writes:
On 06/27/2012 10:22 PM, Steven Schveighoffer wrote:
 Default type of the literal should be the library type.

Then it is not a library type, but a built-in type. Are you planning to inject a dependency on Phobos into the compiler?

 If you want immutable(char)[], use "abc".codeunits or equivalent.

I really don't want to type .codeunits, but I want to use immutable(char)[] everywhere. This 'library type' is just an interface change that makes writing nice and efficient code a kludge.

 Of course, it should by default work as a zero-terminated char * for C
 compatibility.

 The current situation is not simple to understand.

It is simple, even if not immediately obvious. It does not have to be immediately obvious without explanation. It needs to be convenient.

 Generic code that accepts arrays has to special-case narrow-width strings
 if you plan to use phobos with them in some cases. That is a horrible
 situation.

Generic code accepts ranges, not arrays. All necessary (or maybe unnecessary, I don't know) special casing is already done for you in Phobos. The _only_ thing that is problematic is the inconsistent 'foreach' behaviour.

 That is technically fine, but if phobos wants to treat immutable(char)[]
 as something other than an array, it is not fine.

Phobos does not treat immutable(char)[] as something other than an array. It does not treat all arrays uniformly though.
Jun 27 2012
"Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 27 Jun 2012 16:55:49 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 Then it is not a library type, but a built-in type. Are you planning to
 inject a dependency on Phobos into the compiler?

No, druntime, and include minimal UTF support. We do the same thing with AssociativeArray.

 I really don't want to type .codeunits, but I want to use
 immutable(char)[] everywhere. This 'library type' is just an interface
 change that makes writing nice and efficient code a kludge.

When most string functions take strings, why would you want to use immutable(char)[] everywhere?

 It is simple, even if not immediately obvious. It does not have to be
 immediately obvious without explanation. It needs to be convenient.

Try sorting an array of ASCII characters.

 Generic code accepts ranges, not arrays. All necessary (or maybe
 unnecessary, I don't know) special casing is already done for you in
 Phobos. The _only_ thing that is problematic is the inconsistent
 'foreach' behaviour.

Plenty of generic code specializes on arrays.

 Phobos does not treat immutable(char)[] as something other than an array.
 It does not treat all arrays uniformly though.

It certainly does. An array by definition is a random-access range. It does not treat strings as random access ranges.

-Steve
Jun 27 2012
"Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, June 27, 2012 17:11:56 Steven Schveighoffer wrote:
 Try sorting an array of ascii characters.

Cast it to ubyte[]. Problem solved. I honestly don't think that operating on code units like that should be encouraged at all, so if it's a bit hard to do, then that's a _good_ thing (but since all that's required is casting to ubyte[], it's still quite easy - you just have to tell the compiler that that's what you really want to do rather than it being the default behavior).

The problem that we have is the inconsistencies between how the language treats strings and how the library does, not the fact that operating on char[] as if it were ASCII rather than UTF-8 requires some casting.
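A minimal sketch of the cast being described, assuming the data really is ASCII. The helper name sortAscii is hypothetical; sort is the standard Phobos sorting function.

```d
import std.algorithm.sorting : sort;

// Hypothetical helper: sorts ASCII characters in place.
void sortAscii(char[] chars)
{
    // Reinterpreting the same memory as ubyte[] yields an ordinary
    // random-access range, which sort accepts.
    sort(cast(ubyte[]) chars);
}

void main()
{
    char[] a = "hello".dup;
    // sort(a); // rejected: char[] is not a random-access range in Phobos
    sortAscii(a);
    assert(a == "ehllo");
}
```

The cast does not copy anything; it only changes how the existing code units are viewed, so it would scramble any multi-byte UTF-8 sequences if the input were not pure ASCII.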
 Plenty of generic code specializes on arrays.

You're stuck doing that regardless of how strings are represented. You have to operate on them as ranges of code points (or even graphemes) if you want correct string processing, but that's inefficient, so anything that cares about efficiency, and that can gain extra efficiency by coding with knowledge of how Unicode works and operating on the code units, will need to special case. Whether string is an array or a struct has zero effect on that. All that it affects is what operates on it as an array of code units vs a range of code points.
 It certainly does. An array by definition is a random-access range. It
 does not treat strings as random access ranges.

Well, now you're getting into a semantics argument. isRandomAccessRange defines what a random access range is. All arrays which aren't narrow strings qualify. Narrow strings do not. Yes, they do have random-access operations, but they aren't random-access ranges, because they're ranges of code points, not code units.

Yes, this makes it so that character arrays are treated inconsistently from other arrays, but the library is very consistent in how it handles them, because it _never_ deals with strings as being made of code units. If it's operating on them as arrays, then it takes Unicode into account, and if it's operating on them as ranges, it treats them as ranges of code points. It _always_ makes sure that it's operating on code points. Plenty of code specializes on strings so that it can deal with the code units in an efficient manner rather than having to decode them all the time, but Phobos is completely consistent with regards to how it treats strings.

The _only_ inconsistencies are between the language and the library - namely how foreach iterates on code units by default and the fact that while the language defines length, slicing, and random-access operations for strings, the library effectively does not consider strings to have them.

- Jonathan M Davis
Jun 27 2012
travert phare.normalesup.org (Christophe Travert) writes:
"Jonathan M Davis", in message (digitalmars.D:170852), wrote:
 completely consistent with regards to how it treats strings. The _only_
 inconsistencies are between the language and the library - namely how
 foreach iterates on code units by default and the fact that while the
 language defines length, slicing, and random-access operations for
 strings, the library effectively does not consider strings to have them.

char[] is not treated as an array by the library, and is not treated as a RandomAccessRange. That is a second inconsistency, and it would be avoided if string were a struct.

I won't repeat arguments that were already made, but if it matters, to me, things should be such that:

- string is a druntime-defined struct, with an underlying immutable(char)[]. It is a BidirectionalRange of dchar. Slicing is provided for convenience, but not as opSlice, since it is not O(1), but as a method with a separate name. Direct access to the underlying char[]/ubyte[] is provided.
- similar structs are provided to hold underlying const(char)[] and char[]
- similar structs are provided for wstring
- dstring is a druntime-defined alias to dchar[] or a struct with the same functionalities, for consistency with narrow strings being structs.
- All those structs may be provided as a template:

      struct string(T = immutable(char)) {...}
      alias string!(immutable(wchar)) wstring;
      alias string!(immutable(dchar)) dstring;

  string!(const(char)), string!(char), ... are the other types of strings.
- this string template could also be defined as a wrapper to convert any range of char/wchar into a range of dchar. That does not need to be in druntime. Only types necessary for string literals should be in druntime.
- string should not be convertible to char*. Use toStringz to interface with C code, or the underlying char[] if you know it is zero-terminated, at your own risk. Only string literals need to be convertible to char*, and I would say that they should be zero-terminated only when they are directly used as char*, to allow the compiler to optimize memory.
- char /may/ disappear in favor of ubyte (or the contrary, or one could alias the other), if there is no other need to keep separate types than having strings that are different from ubyte[]. Only dchar is necessary, and it could just be called char.

That is ideal to me. Of course, I understand code compatibility is important, and compromises have to be made. The current situation is a compromise, but I don't like it because it is a WAT for every newcomer. But the last point, for example, would bring no more than code breakage. Such code breakage may make us find bugs, however...

--
Christophe
Jun 28 2012
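For readers trying to picture the struct proposal in the post above, here is a minimal sketch of such a wrapper. It is only an illustration under stated assumptions: the names `String` and `codeunits` are hypothetical, nothing like this exists in druntime, and it relies on the decoding `front`/`back`/`popFront`/`popBack` that current Phobos provides for narrow arrays.

```d
import std.range.primitives;

// Hypothetical wrapper; `String` and `codeunits` are illustrative names only.
struct String(C = immutable(char))
{
    C[] codeunits; // direct access to the underlying code units

    @property bool empty() { return codeunits.length == 0; }
    // Phobos' front/back for narrow arrays decode one code point.
    @property dchar front() { return codeunits.front; }
    @property dchar back() { return codeunits.back; }
    void popFront() { codeunits.popFront(); }
    void popBack() { codeunits.popBack(); }
    @property String save() { return this; }
}

// The wrapper is a bidirectional range of dchar and nothing more.
static assert(isBidirectionalRange!(String!()));
static assert(!isRandomAccessRange!(String!()));
static assert(!hasLength!(String!()));

void main()
{
    auto s = String!()("héllo");
    assert(s.front == 'h' && s.back == 'o');
    assert(s.walkLength == 5);       // code points
    assert(s.codeunits.length == 6); // code units ('é' takes two)
}
```

Because the struct exposes no `length`, `opIndex` or `opSlice`, the range traits and the visible interface agree; code-unit work goes through `codeunits` explicitly.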
parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
 "Jonathan M Davis", in message (digitalmars.D:170852), wrote:
 completely consistent with regards to how it treats strings. The _only_
 inconsintencies are between the language and the library - namely how
 foreach iterates on code units by default and the fact that while the
 language defines length, slicing, and random-access operations for
 strings, the library effectively does not consider strings to have them.
 char[] is not treated as an array by the library
Phobos _does_ treat char[] as an array. isDynamicArray!(char[]) is true, and char[] works with the functions in std.array. It's just that they're all special-cased appropriately to handle narrow strings properly. What it doesn't do is treat char[] as a range of char.
 and is not treated as a RandomAccessRange.
Which is what I already said.
 That is a second inconsistency, and it would be avoided is string were a struct.
No, it wouldn't. It is _impossible_ to implement length, slicing, and indexing for UTF-8 and UTF-16 strings in O(1). Whether you're using an array or a struct to represent them is irrelevant. And if you can't do those operations in O(1), then they can't be random access ranges.
The _only_ thing that using a struct for narrow strings fixes is the inconsistencies with foreach (it would then use dchar just like all of the range stuff does), and slicing, indexing, and length wouldn't be on it, eliminating the oddity of them existing but not considered to exist by range-based functions. It _would_ make things somewhat nicer for newbies, but it would not give you one iota more of functionality. Narrow strings would still be bidirectional ranges but not random-access ranges, and you would still have to operate on the underlying array to operate on strings efficiently.
If we were to start from scratch, it probably would be better to go with a struct type for strings, but it would break far too much code for far too little benefit at this point. You need to understand the unicode stuff regardless - like the difference between code units and code points. So, if anything, the fact that strings are treated inconsistently and are treated as ranges of dchar - which confuses so many newbies - is arguably a _good_ thing in that it forces newbies to realize and understand the unicode issues involved rather than blindly using strings in a horribly inefficient manner as would inevitably occur with a struct string type.
So, no, the situation is not exactly ideal, and yes, a struct string type might have been a better solution, but I think that many of the folks who are pushing for a struct string type are seriously overestimating the problems that it would solve. Yes, it would make the language and library more consistent, but that's it. You'd still have to use strings in essentially the same way that you do now. It's just that you wouldn't have to explicitly use dchar with foreach, and you'd have to get at the property which returned the underlying array in order to operate on the code units as you need to do in many functions to make your code appropriately efficient rather than simply using the string that way directly by not using its range-based functions. There is a difference, but it's a lot smaller than many people seem to think.
- Jonathan M Davis
Jun 28 2012
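The foreach behaviour discussed in the post above can be checked with a small sketch. This assumes current Phobos; the helper names `countCodeUnits`, `countCodePoints` and `containsByte` are made up for the illustration, and `std.string.representation` stands in for "getting at the underlying array".

```d
import std.string : representation;

// foreach with an inferred element type walks code units.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (c; s) // c is inferred as immutable(char)
        ++n;
    return n;
}

// foreach with an explicit dchar decodes code points.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

// Byte-level search on the underlying array, with no decoding involved.
bool containsByte(string s, ubyte b)
{
    foreach (u; s.representation) // immutable(ubyte)[]
        if (u == b)
            return true;
    return false;
}

void main()
{
    assert(countCodeUnits("héllo") == 6);  // 'é' is two UTF-8 code units
    assert(countCodePoints("héllo") == 5);
    assert(containsByte("héllo", cast(ubyte) 'l'));
}
```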
parent reply travert phare.normalesup.org (Christophe Travert) writes:
Jonathan M Davis, in message (digitalmars.D:170872), wrote:
 On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
 "Jonathan M Davis", in message (digitalmars.D:170852), wrote:
 completely consistent with regards to how it treats strings. The _only_
 inconsintencies are between the language and the library - namely how
 foreach iterates on code units by default and the fact that while the
 language defines length, slicing, and random-access operations for
 strings, the library effectively does not consider strings to have them.
 char[] is not treated as an array by the library
Phobos _does_ treat char[] as an array. isDynamicArray!(char[]) is true, and char[] works with the functions in std.array. It's just that they're all special-cased appropriately to handle narrow strings properly. What it doesn't do is treat char[] as a range of char.
 and is not treated as a RandomAccessRange.
All arrays are treated as RandomAccessRanges, except for char[] and wchar[]. So I think I am entitled to say that strings are not treated as arrays. And I would say I am also entitled to say strings are not normal ranges, since they define length, but have isLength as true, and define opIndex and opSlice, but are not RandomAccessRanges. The fact that isDynamicArray!(char[]) is true, but isRandomAccessRange is not, is just another aspect of the schizophrenia. The behavior of a templated function on a string will depend on which was used as a guard.
 
 Which is what I already said.
 
 That is a second inconsistency, and it would be avoided is string were a struct.
No, it wouldn't. It is _impossible_ to implement length, slicing, and indexing for UTF-8 and UTF-16 strings in O(1). Whether you're using an array or a struct to represent them is irrelevant. And if you can't do those operations in O(1), then they can't be random access ranges.
I never said strings should support length and slicing. I even said they should not. foreach is inconsistent with the way strings are treated in Phobos, but opIndex, opSlice and length are inconsistent too. string[0] and string.front do not even return the same.... Please read my posts a little bit more carefully before answering them.
About the rest of your post, I basically say the same as you in shorter terms, except that I am in favor of changing things (but I didn't even say they should be changed in my conclusion).
Newcomers are troubled by this problem, and I think it is important. They will make mistakes when using both array and range functions on strings in the same algorithm, or when using array functions without knowing about UTF-8 encoding issues (the fact that array functions are also valid range functions if not for strings does not help). But I also think experienced programmers can be affected, because of inattention, reusing code written by inexperienced programmers, or inappropriate template guard usage.
As a more general comment, I think having a consistent language is a very important goal to achieve when designing a language. It makes everything simpler, from language design to user through compiler and library development. It may not be too late for D.
-- Christophe
Jun 28 2012
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, June 28, 2012 09:28:52 Christophe Travert wrote:
 I never said strings should support length and slicing. I even said
 they should not. foreach is inconsistent with the way strings are
 treated in phobos, but opIndex, opSlice and length, are inconsistent to.
 string[0] and string.front do not even return the same....
 
 Please read my post a little bit more carefully before
 answering them.
You said this:
 char[] is not treated as an array by the library, and is not treated as 
 a RandomAccessRange. That is a second inconsistency, and it would be 
 avoided is string were a struct.
So, it looked to me like you were saying that making string a struct would make it so that it was a random access range, which would mean implementing length, opSlice, and opIndex. - Jonathan M Davis
Jun 28 2012
parent reply "David Nadlinger" <see klickverbot.at> writes:
On Thursday, 28 June 2012 at 09:49:19 UTC, Jonathan M Davis wrote:
 char[] is not treated as an array by the library, and is not 
 treated as a RandomAccessRange. That is a second 
 inconsistency, and it would be avoided is string were a struct.
So, it looked to me like you were saying that making string a struct would make it so that it was a random access range, which would mean implementing length, opSlice, and opIndex.
I think he meant that the problem would be solved because people would be less likely to expect it to be a random access range in the first place. What troubles me most with having is(string == immutable(char)[]) is that it more or less precludes us from adding small string optimizations, etc. in the future… David
Jun 28 2012
parent travert phare.normalesup.org (Christophe Travert) writes:
"David Nadlinger", in message (digitalmars.D:170875), wrote:
 On Thursday, 28 June 2012 at 09:49:19 UTC, Jonathan M Davis wrote:
 char[] is not treated as an array by the library, and is not 
 treated as a RandomAccessRange. That is a second 
 inconsistency, and it would be avoided is string were a struct.
So, it looked to me like you were saying that making string a struct would make it so that it was a random access range, which would mean implementing length, opSlice, and opIndex.
I think he meant that the problem would be solved because people would be less likely to expect it to be a random access range in the first place.
Yes.
Jun 28 2012
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/28/12 5:28 AM, Christophe Travert wrote:
 As a more general comment, I think having a consistent langage is a very
 important goal to achieve when designing a langage. It makes everything
 simpler, from langage design to user through compiler and library
 development. It may not be too late for D.
In a way it's too late for any language in actual use. The "fog of language design" makes it nigh impossible to design a language/library combo that is perfectly consistent, not to mention the fact that consistency itself has many dimensions, some of which may be in competition. We'd probably do things a bit differently if we started from scratch. As things are, D's strings have a couple of quirks but are very apt for good and efficient string manipulation where index computation in the code unit realm is combined with the range of code points realm. I suppose people who have an understanding of UTF don't have difficulty using D's strings. Above all, alea jacta est and there's little we can do about that save for inventing a time machine. Andrei
Jun 28 2012
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 06/28/2012 11:28 AM, Christophe Travert wrote:
 Jonathan M Davis, in message (digitalmars.D:170872), wrote:
 On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
 "Jonathan M Davis", in message (digitalmars.D:170852), wrote:
 completely consistent with regards to how it treats strings. The _only_
 inconsintencies are between the language and the library - namely how
 foreach iterates on code units by default and the fact that while the
 language defines length, slicing, and random-access operations for
 strings, the library effectively does not consider strings to have them.
 char[] is not treated as an array by the library
Phobos _does_ treat char[] as an array. isDynamicArray!(char[]) is true, and char[] works with the functions in std.array. It's just that they're all special-cased appropriately to handle narrow strings properly. What it doesn't do is treat char[] as a range of char.
 and is not treated as a RandomAccessRange.
All arrays are treated as RandomAccessRanges, except for char[] and wchar[]. So I think I am entitled to say that strings are not treated as arrays.
"Not treated like other arrays", is the strongest claim that can be made there.
 An I would say I am also entitle to say strings are not normal
 ranges, since they define length, but have isLength as true,
hasLength as false. They define length, but it is not part of the range interface. It is analogous to the following:
    class NarrowString : ForwardRange!dchar {
        /* interface ForwardRange!dchar */
        dchar front();
        bool empty();
        void popFront();
        NarrowString save();
        /* other methods */
        size_t length();
        char opIndex(size_t i);
        NarrowString opSlice(size_t a, size_t b);
    }
 and define opIndex and opSlice,
[] and [..] operate on code units, but for a random access range as defined by Phobos, they would not.
 but are not RandomAccessRanges.

 The fact that isDynamicArray!(char[]) is true, but
 isRandomAccessRange is not is just another aspect of the schizophrenia.
 The behavior of a templated function on a string will depend on which
 was used as a guard.
No, it won't.
 Which is what I already said.

 That is a second inconsistency, and it would be avoided is string were a struct.
No, it wouldn't. It is _impossible_ to implement length, slicing, and indexing for UTF-8 and UTF-16 strings in O(1). Whether you're using an array or a struct to represent them is irrelevant. And if you can't do those operations in O(1), then they can't be random access ranges.
I never said strings should support length and slicing. I even said they should not. foreach is inconsistent with the way strings are treated in phobos, but opIndex, opSlice and length, are inconsistent to. string[0] and string.front do not even return the same.... Please read my post a little bit more carefully before answering them.
This is very impolite.
On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
 Slicing is provided for convenience, but not as opSlice, since it is not O(1),
 but as a method with a separate name.
 About the rest of your post, I basically say the same as you in shorter
 terms, except that I am in favor of changing things (but I didn't even
 said they should be changed in my conclusion).
When read carefully, the conclusion says that code compatibility is important only a couple sentences before it says that breaking code for the fun of it may be a good thing.
 newcomers are troubled by this problem,  and I think it is important.
Newcomers sometimes become seasoned D programmers. Sometimes they know what Unicode is about even before that.
 They will make mistakes when using both array and range functions on
 strings in the same algorithm, or when using array functions without
 knowing about utf8 encoding issues (the fact that array functions are
 also valid range functions if not for strings does not help). But I also
 think experienced programmers can be affected, because of inattention,
 reusing codes written by inexperienced programmers, or inappropriate
 template guards usage.
In the 7-bit ASCII subset, UTF-8 strings are actually random access, and iterating a UTF-8 string by code point is safe if you are, e.g., just going to treat some ASCII characters specially. I don't care much whether or not (bad?) code handles Unicode correctly, but it is important that code correctly documents whether or not it does so, and to what extent it does.
The new std.regex has good Unicode support, and to enable that, it had to add some pretty large tables to Phobos, the functionality of which is not exposed to the library user as of now. It is therefore safe to say that many/most existing D programs do not handle the whole Unicode standard correctly. Unicode has to be _actively_ supported. There are distinct issues that are hard to abstract away efficiently. Treating a Unicode string as a range of code points is not solving them. (dchar[] indexing is still not guaranteed to give back the 'i'th character!) Why build this interpretation into the language?
 As a more general comment, I think having a consistent langage is a very
 important goal to achieve when designing a langage. It makes everything
 simpler, from langage design to user through compiler and library
 development. It may not be too late for D.
The language is consistent here. The library treats some language features specially. It is not the language that is "confusing". The whole reason to introduce the library behaviour is probably based on similar reasoning as given in your post. The special casing has not caused me any trouble, and sometimes it was useful.
Jun 28 2012
parent travert phare.normalesup.org (Christophe Travert) writes:
Timon Gehr, in message (digitalmars.D:170884), wrote:
 An I would say I am also entitle to say strings are not normal
 ranges, since they define length, but have isLength as true,
hasLength as false.
Of course, my mistake.
 They define length, but it is not part of the range interface.
 
 It is analogous to the following:
 [...]
I consider this bad design.
 and define opIndex and opSlice,
[] and [..] operate on code units, but for a random access range as defined by Phobos, they would not.
A bidirectional range of dchar with additional methods of a random access range of char. That is what I call schizophrenic.
 but are not RandomAccessRanges.

 The fact that isDynamicArray!(char[]) is true, but
 isRandomAccessRange is not is just another aspect of the schizophrenia.
 The behavior of a templated function on a string will depend on which
 was used as a guard.
No, it won't.
Take the isRandomAccessRange specialization of an algorithm in Phobos, replace the guard by isDynamicArray, and you are very likely to change the behavior, if you do not simply break the function.
 When read carefully, the conclusion says that code compatibility is
 important only a couple sentences before it says that breaking code for
 the fun of it may be a good thing.
It was intended as a side-note, not a conclusion. Sorry for not being clear.
 newcomers are troubled by this problem,  and I think it is important.
Newcomers sometimes become seasoned D programmers. Sometimes they know what Unicode is about even before that.
I knew what Unicode was before coming to D. But, strings being arrays, I suspected myString.front would return the same as myString[0], i.e., a char, and that it was my job to make sure my algorithms were valid for UTF-8 encoding if I wanted to support it. Most of the time, in languages without such UTF-8 support, they are valid without much trouble. Code units matter more than code points most of the time.
 The language is consistent here. The library treats some language
 features specially. It is not the language that is "confusing". The
 whole reason to introduce the library behaviour is probably based on
 similar reasoning as given in your post.
OK, I should have said the standard library is inconsistent (with the language).
 The special casing has not caused me any trouble, and sometimes it was 
 useful.
Of course, I can live with that.
Jun 28 2012
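The disagreement above about template guards can be made concrete with a small sketch. It assumes current Phobos traits; `lastViaRange` and `lastViaArray` are hypothetical helpers written for this illustration, not Phobos functions.

```d
import std.range.primitives : back, hasLength, isRandomAccessRange;
import std.traits : isDynamicArray;

// Guarded by range traits: narrow strings do not match.
auto lastViaRange(R)(R r)
if (isRandomAccessRange!R && hasLength!R)
{
    return r[r.length - 1];
}

// Guarded by the array trait: narrow strings match, and [] sees code units.
auto lastViaArray(A)(A a)
if (isDynamicArray!A)
{
    return a[a.length - 1];
}

// A narrow string is rejected by the range-based guard.
static assert(!__traits(compiles, lastViaRange("abc")));

void main()
{
    string s = "h\u00E9";                         // 'é' encodes as 0xC3 0xA9
    assert(lastViaArray(s) == 0xA9);              // trailing code unit, not 'é'
    assert(s.back == '\u00E9');                   // the range primitive decodes
    assert(lastViaRange("h\u00E9"d) == '\u00E9'); // dstring is random access
}
```

With the array guard the same body compiles for a narrow string but returns a code unit, which is the kind of behaviour change the post describes.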
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 06/27/2012 11:11 PM, Steven Schveighoffer wrote:
 On Wed, 27 Jun 2012 16:55:49 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 06/27/2012 10:22 PM, Steven Schveighoffer wrote:
 On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr gmx.ch>
 wrote:

 There is no reason for anyone to be confused about this endlessly. It
 is simple to understand. Furthermore, think about the implications of a
 library-defined string type: it just introduces the problem of what the
 type of built-in string literals should be. This would cause endless
 pain with type deduction, ifti, string mixins, ... A library-defined
 string type cannot be a full string type. Pretending that it can has no
 value.
Default type of the literal should be the library type.
Then it is not a library type, but a built-in type. Are you planning to inject a dependency on Phobos into the compiler?
No, druntime, and include minimal utf support. We do the same thing with AssociativeArray.
In this case it is misleading to call it a library type.
 If you want immutable(char)[], use "abc".codeunits or equivalent.
I really don't want to type .codeunits, but I want to use immutable(char)[] everywhere. This 'library type' is just an interface change that makes writing nice and efficient code a kludge.
When most string functions take strings, why would you want to use immutable(char)[] everywhere?
Because the proposed 'string' interface is inconvenient to use and useless. It is a struct with one data member and no additionally maintained invariant, and it strictly narrows the essential parts of the interface to the data that is reachable without a large typing overhead. immutable(char)[] supports exactly the operations I usually need. Maybe I'm not representative.
 Of course, it should by default work as a zero-terminated char * for C
 compatibility.

 The current situation is not simple to understand.
It is simple, even if not immediately obvious. It does not have to be immediately obvious without explanation. It needs to be convenient.
Try sorting an array of ascii characters.
auto asciitext = cast(ubyte[])"I am ascii text".dup; sort(asciitext);
 Generic code that accepts arrays has to special-case narrow-width
 strings if you plan to
 use phobos with them in some cases. That is a horrible situation.
Generic code accepts ranges, not arrays. All necessary (or maybe unnecessary, I don't know) special casing is already done for you in Phobos. The _only_ thing that is problematic is the inconsistent 'foreach' behaviour.
Plenty of generic code specializes on arrays.
Ok, point taken. But plenty of generic code then specializes on strings as well. Would the net gain be so huge? There is also always the option of just not passing strings to some helper template function you defined. There are multiple valid contradictory considerations on the topic, but I have found the current way of dealing with strings very pleasant.
 alias immutable(char)[] string is just fine.
That is technically fine, but if phobos wants to treat immutable(char)[] as something other than an array, it is not fine. -Steve
Phobos does not treat immutable(char)[] as something other than an array. It does not treat all arrays uniformly though.
It certainly does. An array by definition is a random-access range. It does not treat strings as random access ranges. -Steve
You are right about the random-access part, but the definition of an array does not depend on the 'range' concept.
Jun 27 2012
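On the "try sorting an array of ascii characters" exchange in the post above, a runnable variant might look like the following. It is a sketch assuming current Phobos; `sortedAscii` is a hypothetical helper, and it copies the data first because string literals are immutable and must not be sorted in place.

```d
import std.algorithm.sorting : sort;
import std.string : representation;

// Sorts the bytes of an ASCII-only string and returns a fresh mutable array.
char[] sortedAscii(string s)
{
    ubyte[] bytes = s.representation.dup; // mutable copy of the code units
    sort(bytes);                          // ubyte[] is a random-access range
    return cast(char[]) bytes;
}

// sort does not accept char[] directly, since it is not a random-access range.
static assert(!__traits(compiles, sort((char[]).init)));

void main()
{
    assert(sortedAscii("cab") == "abc");
}
```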
next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, June 27, 2012 23:41:14 Timon Gehr wrote:
 On 06/27/2012 11:11 PM, Steven Schveighoffer wrote:
 When most string functions take strings, why would you want to use
 immutable(char)[] everywhere?
Because the proposed 'string' interface is inconvenient to use and useless. It is a struct with one data member and no additionally maintained invariant, and it strictly narrows the essential parts of the interface to the data that is reachable without a large typing overhead. immutable(char)[] supports exactly the operations I usually need. Maybe I'm not representative.
I think that a lot of programmers want to be able to use strings without worrying about any of the details (like unicode). The fact that foreach and the library don't treat strings the same is confusing, and the fact that narrow strings are ranges of dchar (with all that that implies with regards to the operations that they support) seems to confuse a lot of people. If we had a struct for a string type, then the usage would be consistent (always a range of dchar), allowing the average programmer to more or less ignore unicode considerations as long as they don't care about efficiency, but it would still allow those who _do_ care to get at the underlying representation. So, a struct would be an improvement in that regard. But for those who know what they're doing with regards to unicode and understand the fact that foreach treats strings one way and the library treats them another way, it really isn't a problem. It works quite well (which is one of the reasons that Walter isn't too keen on changing strings). It just isn't terribly newbie-friendly. - Jonathan M Davis
Jun 27 2012
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
Sorry to resurrect this thread, I've been very absent from D, and am just  
now going through all these old posts.

On Wed, 27 Jun 2012 17:41:14 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 On 06/27/2012 11:11 PM, Steven Schveighoffer wrote:
 No, druntime, and include minimal utf support. We do the same thing with
 AssociativeArray.
In this case it is misleading to call it a library type.
What I mean is, the compiler does not define the structure of it. It simply knows it exists, and expects a certain API for it. The type itself is purely defined in the library, and could possibly be used directly as a library type.
 If you want immutable(char)[], use "abc".codeunits or equivalent.
I really don't want to type .codeunits, but I want to use immutable(char)[] everywhere. This 'library type' is just an interface change that makes writing nice and efficient code a kludge.
When most string functions take strings, why would you want to use immutable(char)[] everywhere?
Because the proposed 'string' interface is inconvenient to use and useless. It is a struct with one data member and no additionally maintained invariant, and it strictly narrows the essential parts of the interface to the data that is reachable without a large typing overhead. immutable(char)[] supports exactly the operations I usually need. Maybe I'm not representative.
Most usages of strings are to concatenate them, print them, use them as keys, read them from a stream, etc. None of this requires direct access to the data. They can be treated as a nebulous type. So maybe you are in the minority. I don't really know.
 The current situation is not simple to understand.
It is simple, even if not immediately obvious. It does not have to be immediately obvious without explanation. It needs to be convenient.
I will respond to this in a different way. This is the confusing part:
    string x;
    assert(!hasLength!string);
    assert(!isRandomAccessRange!string);
    auto len = x.length;
    auto c = x[0]; // wtf?
It is *always* going to cause people to question isRandomAccessRange and hasLength because clearly a string has the purported properties needed to satisfy both.
 Generic code that accepts arrays has to special-case narrow-width
 strings if you plan to
 use phobos with them in some cases. That is a horrible situation.
Generic code accepts ranges, not arrays. All necessary (or maybe unnecessary, I don't know) special casing is already done for you in Phobos. The _only_ thing that is problematic is the inconsistent 'foreach' behaviour.
Plenty of generic code specializes on arrays.
Ok, point taken. But plenty of generic code then specializes on strings as well. Would the net gain be so huge? There is also always the option of just not passing strings to some helper template function you defined.
The net gain is not in the reduction of specializations -- of course we will need specializations for strings because to do any less would make D extremely inefficient compared to other languages. The gain is in the reduction of confusion. We are asking our users "I know, I know, it's an array, but *please* pretend it's not! Don't listen to the compiler!" And this is for a *BASIC* type of the language! range.save() is the same thing. It prevents nothing, and you have to take special care to avoid using basic operations (i.e. assign) and pretend they don't exist, even though they compile. To me, that is worthless. If a string does not support random access, then str[0] should not compile. period.
 You are right about the random-access part, but the definition of an
 array does not depend on the 'range' concept.
The range concept is notably more confusing with strings that are random access types but not random access ranges, even though they support all the properties needed for a random access range. -Steve
Aug 23 2012
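A runnable take on the "confusing part" from the post above, using a non-empty, non-ASCII string so that indexing is valid and the difference shows up. This is a sketch assuming current Phobos; the sample text is arbitrary.

```d
import std.range.primitives : front, hasLength, isRandomAccessRange;

// The range traits say "no" for narrow strings ...
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);
// ... while dstring passes both.
static assert(hasLength!dstring);
static assert(isRandomAccessRange!dstring);

void main()
{
    string x = "\u00E9t\u00E9";   // "été"
    assert(x.length == 5);        // compiles: array length in code units
    assert(x[0] == 0xC3);         // compiles: first code unit of 'é'
    assert(x.front == '\u00E9');  // range view: first code point
}
```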
prev sibling next sibling parent Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
On Thu, Jun 28, 2012 at 12:22 AM, Steven Schveighoffer
<schveiguy yahoo.com> wrote:
 On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:

 There is no reason for anyone to be confused about this endlessly. It
 is simple to understand. Furthermore, think about the implications of a
 library-defined string type: it just introduces the problem of what the
 type of built-in string literals should be. This would cause endless
 pain with type deduction, ifti, string mixins, ... A library-defined
 string type cannot be a full string type. Pretending that it can has no
 value.
Default type of the literal should be the library type. If you want immutable(char)[], use "abc".codeunits or equivalent. Of course, it should by default work as a zero-terminated char * for C compatibility. The current situation is not simple to understand. Generic code that accepts arrays has to special-case narrow-width strings if you plan to use phobos with them in some cases. That is a horrible situation.
 alias immutable(char)[] string is just fine.
That is technically fine, but if phobos wants to treat immutable(char)[] as something other than an array, it is not fine. -Steve
Currently strings below dstring are only applicable in ForwardRange and below, but not RandomAccessRange as they should be.
-- 
Bye, Gor Gyolchanyan.
Jun 27 2012
prev sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Thursday, June 28, 2012 08:59:32 Gor Gyolchanyan wrote:
 On Thu, Jun 28, 2012 at 12:22 AM, Steven Schveighoffer
 
 <schveiguy yahoo.com> wrote:
 On Wed, 27 Jun 2012 15:20:26 -0400, Timon Gehr <timon.gehr gmx.ch> wrote:
 There is no reason for anyone to be confused about this endlessly. It
 is simple to understand. Furthermore, think about the implications of a
 library-defined string type: it just introduces the problem of what the
 type of built-in string literals should be. This would cause endless
 pain with type deduction, ifti, string mixins, ... A library-defined
 string type cannot be a full string type. Pretending that it can has no
 value.
Default type of the literal should be the library type. If you want immutable(char)[], use "abc".codeunits or equivalent. Of course, it should by default work as a zero-terminated char * for C compatibility. The current situation is not simple to understand. Generic code that accepts arrays has to special-case narrow-width strings if you plan to use phobos with them in some cases. That is a horrible situation.
 alias immutable(char)[] string is just fine.
That is technically fine, but if phobos wants to treat immutable(char)[] as something other than an array, it is not fine. -Steve
Currently strings below dstring are only applicable in ForwardRange and below, but not RandomAccessRange as they should be.
Except that they shouldn't be, because you can't do random access on a narrow string in O(1). If you can't index or slice a range in O(1), it has no business having those operations. The same goes for length. That's why narrow strings do not have any of those operations as far as ranges are concerned. Having those operations in anything worse than O(1) violates the algorithmic complexity guarantees that ranges are supposed to provide, which would seriously harm the efficiency of algorithms which rely on them. It's the same reason why std.container defines the algorithmic complexity of all the operations in std.container. If you want a random-access range which is a string type, you need dchar[], const(dchar)[], or dstring. That is very much on purpose and would not change even if strings were structs. - Jonathan M Davis
Jun 27 2012
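To illustrate the complexity argument in the post above: reaching the n-th code point of a narrow string needs a scan, while a dstring can be indexed directly. A minimal sketch assuming current `std.utf`; `nthCodePoint` is a hypothetical helper name.

```d
import std.utf : decode, toUTFindex;

// Returns the n-th code point of a UTF-8 string; the scan makes this O(n).
dchar nthCodePoint(string s, size_t n)
{
    size_t i = toUTFindex(s, n); // code-unit offset of code point n
    return decode(s, i);         // decodes at i and advances i
}

void main()
{
    assert(nthCodePoint("héllo", 2) == 'l');

    dstring d = "héllo"d;
    assert(d[2] == 'l');         // O(1) indexing on UTF-32
}
```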
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 28 June 2012 at 05:10:43 UTC, Jonathan M Davis wrote:
 On Thursday, June 28, 2012 08:59:32 Gor Gyolchanyan wrote:
 Currently strings below dstring are only applicable in 
 ForwardRange
 and below, but not RandomAccessRange as they should be.
Except that they shouldn't be, because you can't do random access on a narrow string in O(1). If you can't index or slice a range in O(1), it has no business having those operations. The same goes for length. That's why narrow strings do not have any of those operations as far as ranges are concerned. Having those operations in anything worse than O(1) violates the algorithmic complexity guarantees that ranges are supposed to provide, which would seriously harm the efficiency of algorithms which rely on them. It's the same reason why std.container defines the algorithmic complexity of all the operations in std.container. If you want a random-access range which is a string type, you need dchar[], const(dchar)[], or dstring. That is very much on purpose and would not change even if strings were structs. - Jonathan M Davis
Pedantically speaking, it is possible to index a string with about 50-51% memory overhead to get random access in O(1) time. Best-performing algorithms can do random access in about 35-50 nanoseconds per operation for strings up to tens of megabytes. For bigger strings (tested up to 1GB) or when some other memory-intensive calculations are performed simultaneously, random access takes up to 200 nanoseconds due to the memory-access resolution process.
Jun 28 2012
next sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 28 June 2012 at 09:58:02 UTC, Roman D. Boiko wrote:
 Pedantically speaking, it is possible to index a string with 
 about 50-51% memory overhead to get random access in O(1) time. 
 Best-performing algorithms can do random access in about 35-50 
 nanoseconds per operation for strings up to tens of megabytes. 
 For bigger strings (tested up to 1GB) or when some other 
 memory-intensive calculations are performed simultaneously, 
 random access takes up to 200 nanoseconds due to memory-access 
 resolution process.
Just a remark: building the index would take O(N) operations and N/B memory transfers, where N = str.length and B is the size of the cache buffer.
Jun 28 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 28 June 2012 at 10:02:59 UTC, Roman D. Boiko wrote:
 On Thursday, 28 June 2012 at 09:58:02 UTC, Roman D. Boiko wrote:
 Pedantically speaking, it is possible to index a string with 
 about 50-51% memory overhead to get random access in O(1) 
 time. Best-performing algorithms can do random access in about 
 35-50 nanoseconds per operation for strings up to tens of 
 megabytes. For bigger strings (tested up to 1GB) or when some 
 other memory-intensive calculations are performed 
 simultaneously, random access takes up to 200 nanoseconds due 
 to memory-access resolution process.
Just a remark: building the index would take O(N) operations and N/B memory transfers, where N = str.length and B is the size of the cache buffer.
That being said, I would be against switching away from representing strings as arrays. Such a switch would hardly help us solve any problems of practical importance significantly better than they can be solved with the current design. However, a struct could be created for the indexing I mentioned in my two previous posts, to give efficient random access to narrow strings (and to arbitrary variable-length data stored consecutively in arrays) without any significant overhead. The respective algorithms are called Rank and Select, and there exist many variations of them (with different trade-offs, though some are arguably better than others). I have investigated this question quite deeply in the last two weeks, because similar algorithms would be useful in my DCT project. If nobody else implements them before me, I will eventually do that myself. It is just a matter of finding some free time, likely a week or two.
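To make the idea concrete, here is a minimal sketch of such a struct for UTF-8. It is not the broadword rank/select implementation from the cited benchmarks: it keeps one bit per code unit plus one prefix count per 64-bit word (more space than the figures quoted in this thread), and `select` uses a binary search, so lookup is O(log N) rather than O(1). The name `CodePointIndex` and its members are hypothetical, and the input is assumed to be valid UTF-8.

```d
import core.bitop : bsf, popcnt;
import std.utf : decode;

/// Hypothetical helper: bitmap of code point starts over a UTF-8 string.
struct CodePointIndex
{
    string text;
    ulong[] bits;    // bit i set => text[i] starts a code point
    size_t[] prefix; // prefix[w] = number of set bits in words [0, w)

    this(string s)
    {
        text = s;
        bits = new ulong[(s.length + 63) / 64];
        foreach (i, char c; s)
            if ((c & 0xC0) != 0x80)          // not a continuation byte
                bits[i / 64] |= 1UL << (i % 64);
        prefix = new size_t[bits.length + 1];
        foreach (w, word; bits)
            prefix[w + 1] = prefix[w] + popcnt(word);
    }

    /// Number of code points.
    size_t length() { return prefix[$ - 1]; }

    /// Rank: index of the code point that contains code unit `unit`.
    size_t rank(size_t unit)
    {
        assert(unit < text.length);
        immutable w = unit / 64, b = unit % 64;
        immutable mask = b == 63 ? ulong.max : (1UL << (b + 1)) - 1;
        return prefix[w] + popcnt(bits[w] & mask) - 1;
    }

    /// Select: code unit offset at which code point `n` starts.
    size_t select(size_t n)
    {
        assert(n < length);
        size_t lo = 0, hi = bits.length;     // prefix[lo] <= n < prefix[hi]
        while (hi - lo > 1)
        {
            immutable mid = lo + (hi - lo) / 2;
            if (prefix[mid] <= n) lo = mid; else hi = mid;
        }
        ulong word = bits[lo];
        foreach (_; 0 .. n - prefix[lo])
            word &= word - 1;                // clear the lowest set bit
        return lo * 64 + bsf(word);
    }

    /// Code point at index `n`.
    dchar opIndex(size_t n)
    {
        size_t pos = select(n);
        return decode(text, pos);
    }
}

void main()
{
    auto idx = CodePointIndex("héllo, wörld");
    assert(idx.length == 12);        // 14 code units, 12 code points
    assert(idx[1] == 'é');
    assert(idx.select(8) == 9);      // 'ö' starts at code unit 9
    assert(idx.rank(10) == 8);       // second byte of 'ö' is in code point 8
}
```

The original array is left untouched, and slicing by code point reduces to two `select` calls followed by an ordinary array slice.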
Jun 28 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 28 June 2012 at 09:58:02 UTC, Roman D. Boiko wrote:
 Pedantically speaking, it is possible to index a string with 
 about 50-51% memory overhead to get random access in O(1) time. 
 Best-performing algorithms can do random access in about 35-50 
 nanoseconds per operation for strings up to tens of megabytes. 
 For bigger strings (tested up to 1GB) or when some other 
 memory-intensive calculations are performed simultaneously, 
 random access takes up to 200 nanoseconds due to memory-access 
 resolution process.
This would support both random access to characters by their code point index in a string and determining the code point index from a code unit index. If only the former is needed, space overhead decreases to 25% for 1K and <15% for 16K-1G string sizes (measured in number of code units, which is half the number of bytes for wstring). Strings of up to 2^64 code units would be supported. This would also improve access speed significantly (by 10% for small strings and by about a factor of two for large ones).
Jun 28 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/28/12 5:58 AM, Roman D. Boiko wrote:
 Pedantically speaking, it is possible to index a string with about
 50-51% memory overhead to get random access in O(1) time.
 Best-performing algorithms can do random access in about 35-50
 nanoseconds per operation for strings up to tens of megabytes. For
 bigger strings (tested up to 1GB) or when some other memory-intensive
 calculations are performed simultaneously, random access takes up to 200
 nanoseconds due to memory-access resolution process.
Pedantically speaking, sheer timings say nothing without the appropriate baselines. Andrei
Jun 28 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 28 June 2012 at 12:29:14 UTC, Andrei Alexandrescu 
wrote:
 On 6/28/12 5:58 AM, Roman D. Boiko wrote:
 Pedantically speaking, sheer timings say nothing without the 
 appropriate baselines.

 Andrei
I used results of benchmarks for two such algorithms, which I like most, taken from here: Vigna, S. (2008). "Broadword implementation of rank/select queries". Experimental Algorithms: 154–168. http://en.wikipedia.org/wiki/Succinct_data_structure#cite_ref-vigna2008broadword_6-0 Numbers should be valid for some C/C++ code executed on a machine that already existed back in 2008. I'm not sure there is a good baseline to compare. One option would be to benchmark random access to code points in a UTF-32 string. I also don't know about any D implementations of these algorithms, thus cannot predict how they would behave against dstring random access. But your statement that these timings say nothing is not fair, because they can be used to conclude that this speed should be enough for most practical use cases, especially if those use cases are known.
Jun 28 2012
next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
Timings should not be very different from random access in any 
UTF-32 string implementation, because of design of these 
algorithms:

* only operations on 64-bit aligned words are performed 
(addition, multiplication, bitwise and shift operations)

* there is no branching except at the very top level for very 
large array sizes

* data is stored in a way that makes algorithms cache-oblivious 
IIRC. Authors claim that very few cache misses are necessary 
(1-2 per random access).

* after determining the code unit index for some code point index, 
further access is performed as usual inside an array, so in 
order to perform slicing it is only necessary to calculate code unit 
indices for its start and end.

* original data arrays are not modified (unlike for compact 
representations of dstring, for example).
Jun 28 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/28/12 8:57 AM, Roman D. Boiko wrote:
 On Thursday, 28 June 2012 at 12:29:14 UTC, Andrei Alexandrescu wrote:
 On 6/28/12 5:58 AM, Roman D. Boiko wrote:
 Pedantically speaking, sheer timings say nothing without the
 appropriate baselines.

 Andrei
I used results of benchmarks for two such algorithms, which I like most, taken from here: Vigna, S. (2008). "Broadword implementation of rank/select queries". Experimental Algorithms: 154–168. http://en.wikipedia.org/wiki/Succinct_data_structure#cite_ref-vigna2008broadword_6-0 Numbers should be valid for some C/C++ code executed on a machine that already existed back in 2008. I'm not sure there is a good baseline to compare. One option would be to benchmark random access to code points in a UTF-32 string. I also don't know about any D implementations of these algorithms, thus cannot predict how they would behave against dstring random access. But your statement that these timings say nothing is not fair, because they can be used to conclude that this speed should be enough for most practical use cases, especially if those use cases are known.
Well of course I've exaggerated a bit. My point is that mentioning "200 ns!!!" sounds to the uninformed ear as good as "2000 ns" or "20 ns", i.e. "an amount of time so short by human scale, it must mean fast". You need to compare e.g. against random access in an array etc. Andrei
Jun 28 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Thursday, 28 June 2012 at 14:34:03 UTC, Andrei Alexandrescu 
wrote:
 Well of course I've exaggerated a bit. My point is that 
 mentioning "200 ns!!!" sounds to the uninformed ear as good as 
 "2000 ns" or "20 ns", i.e. "an amount of time so short by human 
 scale, it must mean fast". You need to compare e.g. against 
 random access in an array etc.

 Andrei
I have no benchmarks for plain array access on the same machine and compiler that the authors used. However, it looks like at most two cache misses happen. If that is true, we may charge 100 ns for each memory access plus computation. I would claim that most of that time is spent on memory access, since the same algorithms take 35-50 ns for smaller arrays (up to 4 Mbit, which is about 512 KB), but I'm not sure that my conclusions are definitely true. Also, I made a mistake in another post. I should have said that it is possible to address arrays of up to 2^64 code units, but benchmarks are provided for data sizes in bits (i.e., up to 1 Gbit). Asymptotically, the algorithms should require slightly smaller space overhead for bigger arrays: space complexity is O(N/logN). But memory address resolution may become slower. This is true for both Rank/Select algorithms and raw array access. Again, please note that the price is paid only once per code unit resolution (for Select) or code point calculation (for Rank). Subsequent nearby accesses should be very cheap.
Jun 28 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
My point (and the reason I somehow hijacked this thread) is that 
such functionality would be useful for random access over narrow 
strings. Currently random access is missing.

Also this approach fits nicely if random access is needed to 
Unicode characters, not just code points.

I don't see much practical value in proposals to reconsider 
current string implementation. Arguments have been presented that 
it improves consistency, but as Andrei replied, consistency 
itself has many dimensions. Proposed changes have not been 
described in detail, but from what I understood, they would make 
common use cases more verbose.

In contrast, I do see value in Select / Rank indexing. Does 
anybody agree?
Jun 28 2012
prev sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, June 27, 2012 22:54:28 Gor Gyolchanyan wrote:
 On Wed, Jun 27, 2012 at 10:42 PM, Jonathan M Davis <jmdavisProg gmx.com> 
wrote:
 On Wednesday, June 27, 2012 22:29:25 Gor Gyolchanyan wrote:
 Agreed. Having struct strings (with slices and everything) will set
 the record straight.
Except that they couldn't have slicing, because it would be very inefficient. You'd have to get at the actual array of code units to slice anything. A struct string type would have to be restricted to exactly the same set of operations that range-based functions consider strings to have and then give you a way to get at the underlying code unit representation to be able to use it when special-casing for strings for efficiency, just like you do now. You _can't_ get away from the fact that you're dealing with an array (or list or whatever) of code units even if you do want to operate on it as a range of code points most of the time. Having a struct would fix the issues like foreach iterating over char by default whereas range-based functions iterate over dchar - it would make it consistent by making it dchar for everything - but the issue of code unit vs code point still remains and you can't get rid of it. Anyone wanting to write efficient string-processing code _needs_ to understand unicode. There's no way around it (which is part of the reason that Walter isn't keen on the idea of changing how strings work in the language itself). So, while having a string type which is a struct does help eliminate the schizophrenia, the core problem of code unit vs code point is still there, and you still need to understand it. There is no fix for it, because it's intrinsic to how unicode works. - Jonathan M Davis
Yes, you can get away from it. The struct string would have ubyte[], ushort[] and uint[] as the representation. Maybe even char[], wchar[] and dchar[], but those won't be strings as we know them now. The string struct will take care of encoding 100% transparently and will provide access to the representation, which is good for bit blitting and other encoding-agnostic operations, but the representation is then known NOT to be a valid string and will need to be placed into the string struct in order to use string operations.
If you want efficient strings, you _must_ worry about the encoding. It's _impossible_ for it to be otherwise. It helps quite a bit if you're using functions that someone else already wrote which take this into account rather than having to write the functions yourself, but if you're doing much in the way of string processing, you _must_ understand Unicode in order to handle them properly. I fully understand that it's something that most people don't want to have to worry about, but the reality of the matter is that you can't avoid it unless you don't care about efficiency. The fact that strings use a variable-length encoding has a huge impact on how they need to be used if you care about both correctness and efficiency. You can't escape it. - Jonathan M Davis
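A short sketch of the code unit vs. code point split as it shows up in everyday D code. The function names `countUnits` and `countPoints` are hypothetical and only there to make the two `foreach` behaviours comparable.

```d
import std.range.primitives : ElementType, walkLength;
import std.utf : byCodeUnit;

// Indexing a string yields code units...
static assert(is(typeof("é"[0]) == immutable(char)));
// ...while the range primitives decode to code points.
static assert(is(ElementType!string == dchar));

/// Counts foreach iterations with the element type inferred (code units).
size_t countUnits(string s)
{
    size_t n;
    foreach (c; s)          // c is inferred as immutable(char)
        ++n;
    return n;
}

/// Counts foreach iterations with an explicit dchar (decoded code points).
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "héllo";               // 'é' takes two UTF-8 code units
    assert(s.length == 6);            // array view: code units
    assert(countUnits(s) == 6);
    assert(countPoints(s) == 5);
    assert(s.walkLength == 5);        // range view: code points, O(n)
    assert(s.byCodeUnit.length == 6); // explicit opt-out of decoding
}
```

Code that only needs code units (searching for an ASCII delimiter, copying, hashing) can stay on the array or `byCodeUnit` view and skip decoding entirely.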
Jun 27 2012