
digitalmars.D.learn - Ranges

reply Jonas Drewsen <jdrewsen nospam.com> writes:
Hi,

    I'm working a bit with ranges atm. but there are definitely some 
things that are not clear to me yet. Can anyone tell me why the char 
arrays cannot be copied but the int arrays can?

import std.stdio;
import std.algorithm;

void main(string[] args) {

   // This works
   int[] a1 = [1,2,3,4];
   int[] a2 = [5,6,7,8];
   copy(a1, a2);

   // This does not!
   char[] a3 = ['1','2','3','4'];
   char[] a4 = ['5','6','7','8'];
   copy(a3, a4);

}

Error message:

test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if 
(isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1))) 
does not match any function template declaration

test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if 
(isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1))) 
cannot deduce template function from argument types !()(char[],char[])

Thanks,
Jonas
Mar 12 2011
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
Character arrays / strings are not exactly normal. And there's a very good reason for it: Unicode. In Unicode, a character is generally a single code point (there are also graphemes, which combine code points to add accents, superscripts, and whatnot to create a single character, but we'll ignore that in this discussion - it's complicated enough as it is). Depending on the encoding, that code point may be made up of one or more code units. UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units. char is a UTF-8 code unit, wchar is a UTF-16 code unit, and dchar is a UTF-32 code unit. UTF-32 is the _only_ one of those three which _always_ has one code unit per code point.

With an array of integers you can index it and slice it and be sure that everything that you're doing is valid. If you look at a single element, you know that it's a valid int. If you slice it, you know that every int in there is valid.

If you're dealing with a dstring or dchar[], then the same still holds. A dstring or dchar[] is an array of UTF-32 code units. Every code point is a single code unit, so every element in the array is a valid code point. You can take an arbitrary element in that array and know that it's a valid code point. You can slice it wherever you want and you still have a valid dstring or dchar[].

The same does _not_ hold for char[] and wchar[]. char[] and wchar[] are arrays of UTF-8 and UTF-16 code units respectively. In both of those encodings, a single code point may require multiple code units. So, for instance, a code point could have 4 code units. That means that _4_ elements of that char[] make up a _single_ code point. You'd need _all_ 4 of those elements to create a single, valid character. So, you _can't_ just take an arbitrary element in a char[] or wchar[] and expect it to be valid. You _can't_ just slice it anywhere. The resulting array stands a good chance of being invalid.
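A minimal sketch of the code unit / code point distinction described above. It assumes only std.utf and std.exception from Phobos; the helper name firstCodePointUnits is made up for illustration. The non-ASCII characters are written as escapes so that the example does not depend on how the source file is encoded.

```d
import std.exception : assertThrown;
import std.utf : stride, validate, UTFException;

// Hypothetical helper: how many code units the first code point occupies.
size_t firstCodePointUnits(string s)
{
    return stride(s, 0);
}

void main()
{
    // U+00E9 (e with acute accent) in the three encodings.
    string  s8  = "\u00E9";
    wstring s16 = "\u00E9"w;
    dstring s32 = "\u00E9"d;

    // .length counts code units, not characters.
    assert(s8.length == 2);   // two UTF-8 code units
    assert(s16.length == 1);  // one UTF-16 code unit
    assert(s32.length == 1);  // one UTF-32 code unit

    // U+1F600 lies outside the Basic Multilingual Plane.
    assert("\U0001F600".length == 4);   // four UTF-8 code units
    assert("\U0001F600"w.length == 2);  // a UTF-16 surrogate pair
    assert("\U0001F600"d.length == 1);  // still one UTF-32 code unit

    assert(firstCodePointUnits(s8) == 2);

    // Slicing through the middle of a code point leaves invalid UTF-8.
    assertThrown!UTFException(validate(s8[0 .. 1]));
}
```

The last assertion is the "you can't just slice it anywhere" case: the slice holds only the lead byte of a two-byte sequence.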
You have to slice on code point boundaries - otherwise you could slice characters in half and end up with an invalid string. So, unlike other arrays, it just doesn't work to treat char[] and wchar[] as random access ranges of their element type. What the programmer cares about is characters - dchars - not chars or wchars. So, the way this is handled is that char[], wchar[], and dchar[] are all treated as ranges of dchar.

In the case of dchar[], this is nothing special. You can index it and slice it as normal. So, it is a random access range. However, in the case of char[] and wchar[], that means that when you're iterating over them, you're not dealing with a single element of the array at a time. front returns a dchar, and popFront() pops off however many elements made up front.

It's like with foreach. If you iterate a char[] with auto or char, then each individual element is given:

    foreach(c; myStr) {}

But if you iterate over it with dchar, then each code point is given as a dchar:

    foreach(dchar c; myStr) {}

If you were to try and iterate over a char[] by char, then you would be looking at code units rather than code points, which is _rarely_ what you want. If you're dealing with anything other than pure ASCII, you _will_ have bugs if you do that. You're supposed to use dchar with foreach and character arrays. That way, each value you process is a valid character. Ranges do the same, only you don't give them an iteration type, so they're _always_ iterating over dchar.

So, when you're using a range of char[] or wchar[], you're really using a range of dchar. These ranges are bi-directional. They can't be sliced, and they can't be indexed (since doing so would likely be invalid). This generally works very well. It's exactly what you want in most cases. The problem is that the range that you're iterating over is effectively of a different type than the original char[] or wchar[].
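The following sketch puts the foreach behaviour and the range view of a narrow string side by side. It assumes a Phobos release in which the range primitives still decode char[] and wchar[] to dchar, as described in this post; countCodeUnits and countCodePoints are illustrative names, not library functions.

```d
import std.range;  // front, popFront, empty, walkLength, ElementType, range traits

// Counts code units: foreach with char visits each array element.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

// Counts code points: foreach with dchar decodes as it goes.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

// The range primitives treat narrow strings as ranges of dchar.
static assert(is(ElementType!(char[]) == dchar));
static assert(isBidirectionalRange!(char[]));
static assert(!isRandomAccessRange!(char[]));
static assert(!hasSlicing!(char[]));
static assert(isRandomAccessRange!(dchar[]));

void main()
{
    string s = "h\u00E9";  // 'h' then U+00E9: 3 code units, 2 code points
    assert(countCodeUnits(s) == 3);
    assert(countCodePoints(s) == 2);
    assert(s.walkLength == 2);

    auto r = s;
    assert(r.front == 'h');
    r.popFront();                 // drops one code unit
    assert(r.front == '\u00E9');
    r.popFront();                 // drops both code units of U+00E9
    assert(r.empty);
}
```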
You can't just take two ranges of dchar of the same length and necessarily have them fit in the same char[] or wchar[]. They have the same length, because they have the same number of code points. However, they could have a different number of code _units_, so the lengths of the actual arrays could differ. So, you can't just take an arbitrary dchar range and copy it to another arbitrary dchar range.

The way that this is dealt with in the case of a function like copy is that what you're copying _to_ must be an output range. char[] and wchar[] are _not_ output ranges, because of their differing number of code units per code point. So, they don't work with copy. You need to use a dchar[] as the output range if you want to use strings with copy.

Now, in some cases, it might be possible to special case some of the range functions to treat char[] and wchar[] as arrays instead of ranges (in the case of copy, that's probably possible if both arguments are of the same type), but that can't be done in the general case. You could open an enhancement request for copy to treat char[] and wchar[] as arrays if _both_ of the arguments are of the same type.

- Jonathan M Davis
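As a sketch of the suggestion to use a dchar[] as the output range: the target below is sized by code points (walkLength) rather than by the source array's length. The function name copyToDchars is an assumption made for the example; copy itself is the std.algorithm function from the original question.

```d
import std.algorithm : copy;
import std.range.primitives : walkLength;

// Hypothetical helper: decode `source` into a freshly allocated dchar[].
dchar[] copyToDchars(const(char)[] source)
{
    // One dchar per code point, which may be fewer than source.length.
    auto target = new dchar[](source.walkLength);

    // copy returns whatever part of the target it did not fill.
    auto rest = copy(source, target);
    assert(rest.length == 0);

    return target;
}

void main()
{
    char[] a3 = ['1', '2', '3', '4'];
    assert(copyToDchars(a3) == "1234"d);

    // Three code units in, two code points out.
    assert(copyToDchars("h\u00E9") == "h\u00E9"d);
}
```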
Mar 12 2011
next sibling parent Jonas Drewsen <jdrewsen nospam.com> writes:
Hi Jonathan,

    Thank you very much for your in-depth answer!

    It should indeed go into a FAQ somewhere, I think. I did know about the
code point/code unit stuff but had no idea that ranges of char are handled
using dchar internally. This makes sense but is an easy pitfall for
newcomers trying to use std.{algorithm,array,range} for char[].

Thanks
Jonas

Mar 12 2011
prev sibling next sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 13/03/11 12:05 AM, Jonathan M Davis wrote:
 So, when you're using a range of char[] or wchar[], you're really using a range
 of dchar. These ranges are bi-directional. They can't be sliced, and they can't
 be indexed (since doing so would likely be invalid). This generally works very
 well. It's exactly what you want in most cases. The problem is that that means
 that the range that you're iterating over is effectively of a different type
 than the original char[] or wchar[].

This has to be the worst language design decision /ever/. You can't just mess around with fundamental principles like "the first element in an array of T has type T" for the sake of a minor convenience. How are we supposed to do generic programming if common sense reasoning about types doesn't hold? This is just std::vector<bool> from C++ all over again. Can we not learn from mistakes of the past?
Mar 18 2011
parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, March 18, 2011 03:32:35 spir wrote:
 I partially agree, but. Compare with a simple ascii text: you could iterate
 over it chars (=codes=bytes), words, lines... Or according to specific
 schemes for your app (eg reverse order, every number in it, every word at
 start of line...). A piece of is not only a stream of codes.

 The problem is there is no good decision, in the case of char[] or wchar[].
 We should have to choose a kind of "natural" sense of what it means to
 iterate over a text, but there no such thing. What does it *mean*? What is
 the natural unit of a text?
 Bytes or words are code units which mean nothing. Code units (<-> dchars)
 are not guaranteed to mean anything neither (as shown by past discussion:
 a code unit may be the base 'a', the following one be the composite '^',
 both in "â"). Code unit do not represent "characters" in the common sense.
 So, it is very clear that implicitely iterating over dchars is a wrong
 choice. But what else? I would rather get rid of wchar and dchar and deal
 with plain stream of bytes supposed to represent utf8. Until we get a good
 solution to operate at the level of "human" characters.

Iterating over dchars works in _most_ cases. Iterating over chars only works for pure ASCII. The additional overhead for dealing with graphemes instead of code points is almost certainly prohibitive, it _usually_ isn't necessary, and we don't have an actual grapheme solution yet. So, treating char[] and wchar[] as if their elements were valid on their own is _not_ going to work. Treating them along with dchar[] as ranges of dchar _mostly_ works. We definitely should have a way to handle them as ranges of graphemes for those who need to, but the code point vs grapheme issue is nowhere near as critical as the code unit vs code point issue.

I don't really want to get into the whole Unicode discussion again. It has been discussed quite a bit on the D list already. There is no perfect solution. The current solution _mostly_ works, and, for the most part IMHO, is the correct solution. We _do_ need a full-on grapheme handling solution, but a lot of stuff doesn't need that and the overhead for dealing with it would be prohibitive. The main problem with using code points rather than graphemes is the lack of normalization, and a _lot_ of string code can get by just fine without that.

So, we have a really good 90% solution and we still need a 100% solution, but using the 100% all of the time would almost certainly not be acceptable due to performance issues, and doing stuff by code unit instead of code point would be _really_ bad. So, what we have is good and will likely stay as is. We just need a proper grapheme solution for those who need it.

- Jonathan M Davis

P.S. Unicode is just plain ugly.... :(
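To make the code point vs grapheme gap concrete, here is a small sketch using byGrapheme and normalize from std.uni. Both are part of current Phobos; the post above says no grapheme solution existed when it was written, so treat this as an illustration against a later library version. graphemeCount is a made-up helper name.

```d
import std.range : walkLength;
import std.uni;  // byGrapheme, normalize, NFC

// Hypothetical helper: number of user-perceived characters in `s`.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'a' followed by U+0302 (combining circumflex): one visible
    // character built from two code points.
    string decomposed = "a\u0302";
    // U+00E2: the same character as a single precomposed code point.
    string composed = "\u00E2";

    assert(decomposed.walkLength == 2);      // code points
    assert(graphemeCount(decomposed) == 1);  // graphemes
    assert(composed.walkLength == 1);
    assert(graphemeCount(composed) == 1);

    // Without normalization the two spellings compare unequal.
    assert(decomposed != composed);
    assert(normalize!NFC(decomposed) == composed);
}
```

Code that only searches, splits, or concatenates text rarely notices the difference; code that compares or counts user-visible characters does.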
Mar 18 2011
parent Peter Alexander <peter.alexander.au gmail.com> writes:
I must be missing something, because the solution seems obvious to me: char[], wchar[], and dchar[] should be simple arrays like int[] with no unicode semantics. string, wstring, and dstring should not be aliases to arrays, but instead should be separate types that behave the way char[], wchar[], and dchar[] do currently. Is there any problem with this approach?
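A rough sketch of what such a separate string type could look like, purely for illustration: Utf8Text and codePoints are hypothetical names, and nothing like this is claimed to exist in Phobos. The struct stores UTF-8 code units and exposes itself as an input range of dchar, which is the behaviour the thread attributes to char[] today.

```d
import std.range.primitives : ElementType, isInputRange;
import std.utf : decode, stride;

// Hypothetical string type: holds UTF-8 code units, iterates by code point.
struct Utf8Text
{
    immutable(char)[] units;

    @property bool empty()
    {
        return units.length == 0;
    }

    @property dchar front()
    {
        size_t index = 0;
        return decode(units, index);  // decodes the first code point
    }

    void popFront()
    {
        // Skip every code unit that belongs to the first code point.
        units = units[stride(units, 0) .. $];
    }
}

static assert(isInputRange!Utf8Text);
static assert(is(ElementType!Utf8Text == dchar));

// Hypothetical helper: collect the code points of `s` through the wrapper.
dchar[] codePoints(string s)
{
    dchar[] result;
    foreach (dchar c; Utf8Text(s))
        result ~= c;
    return result;
}

void main()
{
    assert(codePoints("h\u00E9") == "h\u00E9"d);
    assert(Utf8Text("h\u00E9").units.length == 3);  // raw storage stays code units
}
```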
Mar 18 2011
prev sibling parent spir <denis.spir gmail.com> writes:
On 03/18/2011 10:29 AM, Peter Alexander wrote:
This has to be the worst language design decision /ever/. You can't just mess around with fundamental principles like "the first element in an array of T has type T" for the sake of a minor convenience. How are we supposed to do generic programming if common sense reasoning about types doesn't hold? This is just std::vector<bool> from C++ all over again. Can we not learn from mistakes of the past?

I partially agree, but. Compare with a simple ASCII text: you could iterate over its chars (=codes=bytes), words, lines... Or according to specific schemes for your app (e.g. reverse order, every number in it, every word at start of line...). A piece of text is not only a stream of codes.

The problem is that there is no good decision in the case of char[] or wchar[]. We would have to choose a kind of "natural" sense of what it means to iterate over a text, but there is no such thing. What does it *mean*? What is the natural unit of a text?

Bytes or words are code units, which mean nothing. Code points (<-> dchars) are not guaranteed to mean anything either (as shown by past discussion: one code point may be the base 'a' and the following one the combining '^', both in "â"). Code points do not represent "characters" in the common sense.

So, it is very clear that implicitly iterating over dchars is a wrong choice. But what else? I would rather get rid of wchar and dchar and deal with plain streams of bytes supposed to represent UTF-8, until we get a good solution to operate at the level of "human" characters.

Denis
--
_________________
vita es estrany
spir.wikidot.com
Mar 18 2011
prev sibling next sibling parent reply Bekenn <leaveme alone.com> writes:
On 3/12/2011 2:02 PM, Jonas Drewsen wrote:
 Error message:

 test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
 (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
 does not match any function template declaration

 test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
 (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
 cannot deduce template function from argument types !()(char[],char[])

I haven't checked (could be completely off here), but I don't think that char[] counts as an input range; you would normally want to use dchar instead.
Mar 12 2011
next sibling parent Bekenn <leaveme alone.com> writes:
Or, better yet, just read Jonathan's post.
Mar 12 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday 12 March 2011 16:11:20 Bekenn wrote:
I haven't checked (could be completely off here), but I don't think that char[] counts as an input range; you would normally want to use dchar instead.

char[] _does_ count as an input range (of dchar). It just doesn't count as an _output_ range (since it doesn't really hold dchar).

- Jonathan M Davis
Mar 12 2011
prev sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
What Jonathan said really needs to be put up on the D website, maybe
under the articles section. Heck, I'd just put a link to that recent
UTF thread on the website, it's really informative (the one on UTF and
meaning of glyphs, etc). And UTF will only get more important, just
like multicore.

Speaking of which, a description on ranges should be put up there as
well. There's that article Andrei once wrote, but we should put it on
the D site and discuss D's implementation of ranges in more detail.
And by 'we' I mean someone who's well versed in ranges. :p
Mar 12 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday 12 March 2011 16:05:37 Jonathan M Davis wrote:
 You could open an
 enhancement request for copy to treat char[] and wchar[] as arrays if
 _both_ of the arguments are of the same type.

Actually, on reflection, I'd have to say that there's not much point to that. If you really want to copy one array to another (rather than a range), just use the array copy syntax:

void main()
{
    auto i = [1, 2, 3, 4];
    auto j = [3, 4, 5, 6];
    assert(i == [1, 2, 3, 4]);
    assert(j == [3, 4, 5, 6]);

    i[] = j[];

    assert(i == [3, 4, 5, 6]);
    assert(j == [3, 4, 5, 6]);
}

copy is of benefit because it works on generic ranges, not because it copies arrays (arrays already allow you to do that quite nicely), so if all you're looking at copying is arrays, then just use the array copy syntax.

- Jonathan M Davis
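Applied to the char arrays from the original question, the same idea might look like the sketch below. copyElements and copyViaBytes are illustrative names. The second variant assumes std.string.representation, which views the code units as plain ubyte values so that copy sees an ordinary array with no Unicode semantics.

```d
import std.algorithm : copy;
import std.string : representation;

// Element-wise array copy; the two lengths must match.
void copyElements(const(char)[] source, char[] target)
{
    target[] = source[];
}

// Same result through copy, working on the raw bytes instead of on
// a range of dchar. The target must be at least as long as the source.
void copyViaBytes(const(char)[] source, char[] target)
{
    copy(source.representation, target.representation);
}

void main()
{
    char[] a3 = ['1', '2', '3', '4'];
    char[] a4 = ['5', '6', '7', '8'];

    copyElements(a3, a4);
    assert(a4 == "1234");

    char[] a5 = ['5', '6', '7', '8'];
    copyViaBytes(a3, a5);
    assert(a5 == "1234");
}
```

Either way the copy is done code unit by code unit, so the target only holds valid text if the source did.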
Mar 12 2011
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On 03/13/2011 01:05 AM, Jonathan M Davis wrote:
 If you were to try and iterate over a char[] by char, then you would be looking
 at code units rather than code points which is _rarely_ what you want. If you're
 dealing with anything other than pure ASCII, you _will_ have bugs if you do
 that. You're supposed to use dchar with foreach and character arrays. That way,
 each value you process is a valid character. Ranges do the same, only you don't
 give them an iteration type, so they're _always_ iterating over dchar.

Side-note: you can be sure the source is pure ASCII if, and only if, it is mechanically produced. (As soon as an end-user touches it, it may hold anything, since OSes and apps offer users means to introduce characters which are not on their keyboards.) This can also easily be checked in UTF-8 (which has been designed for that): all ASCII chars are coded using the same code as in ASCII, so in pure ASCII text every code unit is < 128.

Denis
--
_________________
vita es estrany
spir.wikidot.com
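The check described here can be sketched in a few lines. isPureAscii is a name chosen for the example; the loop iterates by char on purpose, because the test is about code units.

```d
// Hypothetical helper: true if every UTF-8 code unit in `text` is below 128.
bool isPureAscii(const(char)[] text)
{
    foreach (char unit; text)
    {
        // Every byte of a multi-byte UTF-8 sequence has its high bit set.
        if (unit >= 0x80)
            return false;
    }
    return true;
}

void main()
{
    assert(isPureAscii("hello"));
    assert(isPureAscii(""));
    assert(!isPureAscii("h\u00E9llo"));
}
```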
Mar 13 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday 18 March 2011 02:29:51 Peter Alexander wrote:
This has to be the worst language design decision /ever/. You can't just mess around with fundamental principles like "the first element in an array of T has type T" for the sake of a minor convenience. How are we supposed to do generic programming if common sense reasoning about types doesn't hold? This is just std::vector<bool> from C++ all over again. Can we not learn from mistakes of the past?

It really isn't a problem for the most part. You just need to understand that when using range-based functions, char[] and wchar[] are effectively _not_ arrays. They are ranges of dchar. And given the fact that it really wouldn't make sense to treat them as arrays in this case anyway (due to the fact that a single element is a code unit but _not_ a code point), the current solution makes a lot of sense.

Generally, you just can't treat char[] and wchar[] as arrays when you're dealing with characters/code points rather than code units. So, yes it's a bit weird, but it makes a lot of sense given how Unicode is designed. And it works. If you really don't want to deal with it, then just use dchar[] and dstring everywhere.

- Jonathan M Davis
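A short sketch of the "just use dstring" route, assuming std.conv.to for the conversion; codePointAt is an illustrative helper. Once the text is a dstring, indexing and slicing work per code point.

```d
import std.conv : to;
import std.range.primitives : isRandomAccessRange;

// dchar[] keeps full array semantics under the range traits.
static assert(isRandomAccessRange!(dchar[]));

// Hypothetical helper: the n-th code point of a UTF-8 string.
// Converts the whole string first, so it is O(length) per call.
dchar codePointAt(string text, size_t n)
{
    dstring decoded = text.to!dstring;
    return decoded[n];
}

void main()
{
    dstring d = "h\u00E9llo".to!dstring;

    assert(d.length == 5);            // five code points, five elements
    assert(d[1] == '\u00E9');         // indexing lands on a whole code point
    assert(d[1 .. 3] == "\u00E9l"d);  // any slice is still valid

    assert(codePointAt("h\u00E9llo", 4) == 'o');
}
```

The trade-off is memory: every code point takes four bytes, including plain ASCII.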
Mar 18 2011
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, March 18, 2011 14:08:48 Peter Alexander wrote:
 On 18/03/11 5:53 PM, Jonathan M Davis wrote:
 On Friday, March 18, 2011 03:32:35 spir wrote:
I partially agree, but. Compare with a simple ASCII text: you could iterate over its chars (=codes=bytes), words, lines... Or according to
 specific schemes for your app (eg reverse order, every number in it,
 every word at start of line...). A piece of text is not only a stream of
 codes.

 The problem is there is no good decision, in the case of char[] or
 wchar[]. We should have to choose a kind of "natural" sense of what it
 means to iterate over a text, but there is no such thing. What does it
 *mean*? What is the natural unit of a text?
 Bytes or words are code units which mean nothing. Code units (<->
 dchars) are not guaranteed to mean anything either (as shown by past
 discussion: a code unit may be the base 'a', the following one be the
 composite '^', both in "â"). Code units do not represent "characters" in
 the common sense. So, it is very clear that implicitly iterating over
 dchars is a wrong choice. But what else? I would rather get rid of
 wchar and dchar and deal with plain stream of bytes supposed to
 represent utf8. Until we get a good solution to operate at the level of
 "human" characters.

Iterating over dchars works in _most_ cases. Iterating over chars only works for pure ASCII. The additional overhead for dealing with graphemes instead of code points is almost certainly prohibitive, it _usually_ isn't necessary, and we don't have an actual grapheme solution yet. So, treating char[] and wchar[] as if their elements were valid on their own is _not_ going to work. Treating them along with dchar[] as ranges of dchar _mostly_ works. We definitely should have a way to handle them as ranges of graphemes for those who need to, but the code point vs grapheme issue is nowhere near as critical as the code unit vs code point issue.

I don't really want to get into the whole unicode discussion again. It has been discussed quite a bit on the D list already. There is no perfect solution. The current solution _mostly_ works, and, for the most part IMHO, is the correct solution. We _do_ need a full-on grapheme handling solution, but a lot of stuff doesn't need that and the overhead for dealing with it would be prohibitive. The main problem with using code points rather than graphemes is the lack of normalization, and a _lot_ of string code can get by just fine without that.

So, we have a really good 90% solution and we still need a 100% solution,
 but using the 100% all of the time would almost certainly not be
 acceptable due to performance issues, and doing stuff by code unit
 instead of code point would be _really_ bad. So, what we have is good
 and will likely stay as is. We just need a proper grapheme solution for
 those who need it.

 - Jonathan M Davis

 P.S. Unicode is just plain ugly.... :(

I must be missing something, because the solution seems obvious to me:

char[], wchar[], and dchar[] should be simple arrays like int[] with no unicode semantics.

string, wstring, and dstring should not be aliases to arrays, but instead should be separate types that behave the way char[], wchar[], and dchar[] do currently.

Is there any problem with this approach?

There has been a fair bit of debate about it in the past. No one has been able to come up with an alternate solution which is generally considered better than what we have.

char is defined to be a UTF-8 code unit. wchar is defined to be a UTF-16 code unit. dchar is defined to be a UTF-32 code unit (which is also guaranteed to be a code point). So, manipulating char[] and wchar[] as arrays of characters doesn't generally make any sense. They _aren't_ characters. They're code units. Having a range of char or wchar generally makes no sense.

When you don't care about the contents of a string, treating it as an array is very useful. When you _do_ care, you need to treat it as a range of dchar - no matter which unicode encoding it uses. So, having arrays of code units which are treated as ranges of dchar makes a lot of sense.

We could have a wrapper type which wrapped arrays of char, wchar, or dchar and had the appropriate operations on them and was a range of dchar, but then you'd have to get at the underlying array for various stuff. So, whether it's a gain or loss is debatable. You have to special-case strings _regardless_. For some stuff, they need to be treated as arrays of code units, and for other stuff they need to be treated as ranges of code points.

As it stands, range-based functions treat char[], wchar[], etc. properly. Allowing char[] to be treated as a range of char would just cause bugs in most cases. Generally speaking, when someone is trying to do stuff like use char[] as an output range, it _shouldn't_ work. In almost all cases, it would just be buggy to treat char[] as a range of char and allow that to work (which would happen if we treated char[] as a range of char).

So, in almost all cases where treating char[] as a range of dchar causes problems, it's _preventing bugs_. The one glaring problem with the current scheme is foreach. It defaults to the element type of the array, so if you don't give dchar as the element type when iterating over a char[] or wchar[], then you're going to have bugs. There have been suggestions on how to fix that (such as giving a warning when not giving the iteration type for char[] and wchar[] when using foreach), but nothing has been implemented yet.

Overall, using arrays for strings (as D has done since pretty much forever) works really well. It's just that char[] and wchar[] cannot be treated as ranges of char or wchar or you're just asking for problems. And on the whole, the current solution works quite well, and operations that are disallowed by it are _supposed_ to be disallowed.

- Jonathan M Davis
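The foreach pitfall mentioned above can be seen with a single non-ASCII character. The following is a small sketch; countCodeUnits and countCodePoints are hypothetical helper names used only for this illustration.

```d
import std.range : walkLength;

// Hypothetical helper: a plain foreach over a char array infers the
// element type char, so it runs once per UTF-8 code unit.
size_t countCodeUnits(const(char)[] s)
{
    size_t n;
    foreach (c; s)
        ++n;
    return n;
}

// Hypothetical helper: asking for dchar makes foreach decode the
// string, so it runs once per code point.
size_t countCodePoints(const(char)[] s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "\u00E2"; // 'a' with circumflex: one code point, two UTF-8 code units

    assert(s.length == 2);           // array length counts code units
    assert(countCodeUnits(s) == 2);
    assert(countCodePoints(s) == 1);
    assert(s.walkLength == 1);       // range-based functions see code points
}
```

For pure ASCII input both counts agree, which is why code that iterates with a plain foreach can appear to work until it meets non-ASCII text.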
Mar 18 2011