digitalmars.D.learn

digitalmars.D.learn - Ranges

Jonas Drewsen <jdrewsen nospam.com> writes:

Hi,

    I'm working a bit with ranges atm. but there are definitely some 
things that are not clear to me yet. Can anyone tell me why the char 
arrays cannot be copied but the int arrays can?

import std.stdio;
import std.algorithm;

void main(string[] args) {

   // This works
   int[] a1 = [1,2,3,4];
   int[] a2 = [5,6,7,8];
   copy(a1, a2);

   // This does not!
   char[] a3 = ['1','2','3','4'];
   char[] a4 = ['5','6','7','8'];
   copy(a3, a4);

}

Error message:

test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if 
(isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1))) 
does not match any function template declaration

test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if 
(isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1))) 
cannot deduce template function from argument types !()(char[],char[])

Thanks,
Jonas

Mar 12 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

Character arrays / strings are not exactly normal. And there's a very good
reason for it: Unicode.

In Unicode, a character is generally a single code point (there are also
graphemes, which combine code points to add accents and superscripts
and whatnot to create a single character, but we'll ignore that in this
discussion - it's complicated enough as it is). Depending on the encoding, that
code point may be made up of one - or more - code units. UTF-8 uses 8-bit code
units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units. char
is a UTF-8 code unit, wchar is a UTF-16 code unit, and dchar is a UTF-32 code
unit. UTF-32 is the _only_ one of those three which _always_ has one code unit
per code point.
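
To make the code unit counts concrete, here is a minimal sketch. The helper
name units is made up for illustration; it only forwards to
std.utf.codeLength from Phobos.

```d
// Run with: dmd -main -unittest -run example.d
import std.utf : codeLength;

// Hypothetical helper: how many code units of type C encode code point c.
size_t units(C)(dchar c)
{
    return codeLength!C(c);
}

unittest
{
    // ASCII: one code unit in every encoding.
    assert(units!char('a') == 1 && units!wchar('a') == 1 && units!dchar('a') == 1);

    // U+00E9 (e with acute accent): two UTF-8 code units, one UTF-16, one UTF-32.
    assert(units!char('\u00E9') == 2);
    assert(units!wchar('\u00E9') == 1);
    assert(units!dchar('\u00E9') == 1);

    // U+1D11E (musical G clef) lies outside the BMP: 4, 2 and 1 code units.
    assert(units!char('\U0001D11E') == 4);
    assert(units!wchar('\U0001D11E') == 2);
    assert(units!dchar('\U0001D11E') == 1);
}
```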

With an array of integers you can index it and slice it and be sure that
everything that you're doing is valid. If you look at a single element, you
know that it's a valid int. If you slice it, you know that every int in there
is valid. If you're dealing with a dstring or dchar[], then the same still
holds.

A dstring or dchar[] is an array of UTF-32 code units. Every code point is a
single code unit, so every element in the array is a valid code point. You can
take an arbitrary element in that array and know that it's a valid code point.
You can slice it wherever you want and you still have a valid dstring or
dchar[]. The same does _not_ hold for char[] and wchar[].

char[] and wchar[] are arrays of UTF-8 and UTF-16 code units respectively. In
both of those encodings, a single code point can require multiple code units.
So, for instance, a code point could take 4 code units in UTF-8. That means
that _4_ elements of that char[] make up a _single_ code point. You'd need
_all_ 4 of those elements to create a single, valid character. So, you _can't_
just take an arbitrary element in a char[] or wchar[] and expect it to be valid.
You _can't_ just slice it anywhere. The resulting array stands a good chance of
being invalid. You have to slice on code point boundaries - otherwise you could
slice characters in half and end up with an invalid string. So, unlike other
arrays, it just doesn't work to treat char[] and wchar[] as random access
ranges of their element type. What the programmer cares about is characters -
dchars - not chars or wchars.
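
A small sketch of what slicing in the middle of a code point does. The helper
isValidUtf8 is a made-up name; it assumes std.utf.validate, which throws a
UTFException on malformed input.

```d
// Run with: dmd -main -unittest -run example.d
import std.utf : UTFException, validate;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    string s = "h\u00E9llo";          // U+00E9 takes two UTF-8 code units
    assert(s.length == 6);            // six code units for five code points

    assert(isValidUtf8(s[0 .. 3]));   // cut falls on a code point boundary
    assert(!isValidUtf8(s[0 .. 2]));  // cut falls inside U+00E9
}
```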

So, the way this is handled is that char[], wchar[], and dchar[] are all
treated as ranges of dchar. In the case of dchar[], this is nothing special. You
can index it and slice it as normal. So, it is a random access range. However,
in the case of char[] and wchar[], it means that when you're iterating over
them, you're not dealing with a single element of the array at a time. front
returns a dchar, and popFront() pops off however many elements made up front.
It's like with foreach. If you iterate a char[] with auto or char, then each
individual element is given as a char:

foreach(c; myStr) {}

But if you iterate over with dchar, then each code point is given as a dchar:

foreach(dchar c; myStr) {}

If you were to try and iterate over a char[] by char, then you would be looking
at code units rather than code points, which is _rarely_ what you want. If
you're dealing with anything other than pure ASCII, you _will_ have bugs if you
do that. You're supposed to use dchar with foreach and character arrays. That
way, each value you process is a valid character. Ranges do the same, only you
don't give them an iteration type, so they're _always_ iterating over dchar.
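
The two foreach forms can be compared side by side. This is only a sketch; the
function names countUnits and countPoints are made up for illustration.

```d
// Run with: dmd -main -unittest -run example.d

// Element type is inferred, so the loop visits each char (code unit).
size_t countUnits(string s)
{
    size_t n;
    foreach (c; s)
    {
        static assert(c.sizeof == 1); // c is a single UTF-8 code unit
        ++n;
    }
    return n;
}

// Asking for dchar makes foreach decode, so the loop visits code points.
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

unittest
{
    string s = "na\u00EFve";      // U+00EF takes two UTF-8 code units
    assert(countUnits(s) == 6);
    assert(countPoints(s) == 5);
}
```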

So, when you're using a range of char[] or wchar[], you're really using a range
of dchar. These ranges are bi-directional. They can't be sliced, and they can't
be indexed (since doing so would likely be invalid). This generally works very
well. It's exactly what you want in most cases. The problem is that it means
that the range that you're iterating over is effectively of a different type
than the original char[] or wchar[].
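
The range traits in std.range can be used to check this. The following is a
sketch that assumes a Phobos in which narrow strings are decoded by the range
primitives, which is the behaviour described above; the helper name
unitsConsumedByPopFront is made up.

```d
// Run with: dmd -main -unittest -run example.d
import std.range; // ElementType, the is*Range traits, front/popFront for arrays

// As ranges, char[] and wchar[] have dchar elements ...
static assert(is(ElementType!(char[]) == dchar));
static assert(is(ElementType!(wchar[]) == dchar));

// ... and are bidirectional, but offer no random access, slicing or length.
static assert(isBidirectionalRange!(char[]));
static assert(!isRandomAccessRange!(char[]));
static assert(!hasSlicing!(char[]));
static assert(!hasLength!(char[]));

// dchar[] is an ordinary random access range.
static assert(isRandomAccessRange!(dchar[]));

// Hypothetical helper: how many array elements one popFront() removes.
size_t unitsConsumedByPopFront(const(char)[] s)
{
    immutable before = s.length;
    s.popFront();
    return before - s.length;
}

unittest
{
    assert("\u00E9t".front == '\u00E9');               // front decodes a code point
    assert(unitsConsumedByPopFront("\u00E9t") == 2);   // both code units of U+00E9
    assert(unitsConsumedByPopFront("t\u00E9") == 1);   // ASCII: one code unit
}
```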

You can't just take two ranges of dchar of the same length and necessarily have
them fit in the same char[] or wchar[]. They have the same length, because they
have the same number of code points. However, they could have a different
number of code _units_, so the lengths of the actual arrays could differ. So,
you can't just take an arbitrary dchar range and copy it to another arbitrary
dchar range.
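
For instance, two strings with the same number of code points can occupy
different amounts of storage. A minimal sketch, using std.range.walkLength to
count range elements; codePoints is a made-up helper name.

```d
// Run with: dmd -main -unittest -run example.d
import std.range : walkLength;

// Hypothetical helper: number of code points, found by walking the range.
size_t codePoints(const(char)[] s)
{
    return s.walkLength;
}

unittest
{
    string a = "abc";
    string b = "\u00E4\u00F6\u00FC";          // three two-unit code points

    assert(codePoints(a) == codePoints(b));   // same length as ranges of dchar
    assert(a.length == 3 && b.length == 6);   // different array lengths
}
```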

The way that this is dealt with in the case of a function like copy is that
what you're copying _to_ must be an output range. char[] and wchar[] are _not_
output ranges, because of their differing number of code units per code point.
So, they don't work with copy. You need to use a dchar[] as the output range if
you want to use strings with copy.
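
As a sketch of both options, using the ASCII data from the question: copy into
a dchar[] buffer, or skip the range machinery and use plain array slice
assignment when source and target are both char[]. The function names
copyToDchars and copyUnits are made up for illustration.

```d
// Run with: dmd -main -unittest -run example.d
import std.algorithm : copy;

// Copies src into a dchar[] target via std.algorithm.copy.
// dst must be large enough; returns the part of dst that was written.
dchar[] copyToDchars(const(char)[] src, dchar[] dst)
{
    auto rest = copy(src, dst);   // copy returns the unfilled tail of dst
    return dst[0 .. dst.length - rest.length];
}

// Copies raw code units with array semantics, no range functions involved.
// dst must be at least as long as src.
void copyUnits(const(char)[] src, char[] dst)
{
    dst[0 .. src.length] = src[];
}

unittest
{
    char[] a3 = ['1', '2', '3', '4'];

    auto wide = new dchar[4];
    assert(copyToDchars(a3, wide) == "1234"d);

    char[] a4 = ['5', '6', '7', '8'];
    copyUnits(a3, a4);
    assert(a4 == "1234");
}
```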

Now, in some cases, it might be possible to special case some of the range
functions to treat char[] and wchar[] as arrays instead of ranges (in the case
of copy, that's probably possible if both arguments are of the same type), but
that can't be done in the general case. You could open an enhancement request
for copy to treat char[] and wchar[] as arrays if _both_ of the arguments are
of the same type.

- Jonathan M Davis

Mar 12 2011

Jonas Drewsen <jdrewsen nospam.com> writes:

Hi Jonathan,

    Thank you very much for your in-depth answer!

    It should indeed go into a FAQ somewhere, I think. I did know about the
code point/unit stuff but had no idea that ranges of char are handled
using dchar internally. This makes sense but is an easy pitfall for
newcomers trying to use std.{algorithm,array,range} for char[].

Thanks
Jonas

Mar 12 2011

Peter Alexander <peter.alexander.au gmail.com> writes:

On 13/03/11 12:05 AM, Jonathan M Davis wrote:
 So, when you're using a range of char[] or wchar[], you're really using a range
 of dchar. These ranges are bi-directional. They can't be sliced, and they can't
 be indexed (since doing so would likely be invalid). This generally works very
 well. It's exactly what you want in most cases. The problem is that that means
 that the range that you're iterating over is effectively of a different type
 than the original char[] or wchar[].

This has to be the worst language design decision /ever/.

You can't just mess around with fundamental principles like "the first 
element in an array of T has type T" for the sake of a minor 
convenience. How are we supposed to do generic programming if common 
sense reasoning about types doesn't hold?

This is just std::vector<bool> from C++ all over again. Can we not learn 
from mistakes of the past?

Mar 18 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

It really isn't a problem for the most part. You just need to understand that 
when using range-based functions, char[] and wchar[] are effectively _not_ 
arrays. They are ranges of dchar. And given the fact that it really wouldn't 
make sense to treat them as arrays in this case anyway (due to the fact that a 
single element is a code unit but _not_ a code point), the current solution 
makes a lot of sense. Generally, you just can't treat char[] and wchar[] as 
arrays when you're dealing with characters/code points rather than code units. 
So, yes it's a bit weird, but it makes a lot of sense given how unicode is 
designed. And it works.

If you really don't want to deal with it, then just use dchar[] and dstring 
everywhere.
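
A minimal sketch of that route, assuming std.conv.to for the transcoding; the
helper name toCodePoints is made up. Once the text is a dstring, indexing and
slicing work per code point.

```d
// Run with: dmd -main -unittest -run example.d
import std.conv : to;

// Hypothetical helper: transcode UTF-8 text to UTF-32.
dstring toCodePoints(const(char)[] s)
{
    return s.to!dstring;
}

unittest
{
    dstring d = toCodePoints("h\u00E9llo");

    assert(d.length == 5);              // one element per code point
    assert(d[1] == '\u00E9');           // indexing yields a whole code point
    assert(d[1 .. 3] == "\u00E9l"d);    // any slice is still a valid dstring
}
```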

- Jonathan M Davis

Mar 18 2011

spir <denis.spir gmail.com> writes:

I partially agree, but. Compare with a simple ASCII text: you could iterate
over its chars (=codes=bytes), words, lines... Or according to specific schemes
for your app (e.g. reverse order, every number in it, every word at start of
line...). A piece of text is not only a stream of codes.

The problem is there is no good decision in the case of char[] or wchar[]. We
would have to choose a kind of "natural" sense of what it means to iterate
over a text, but there is no such thing. What does it *mean*? What is the
natural unit of a text?

Bytes or words are code units, which mean nothing on their own. Code points
(<-> dchars) are not guaranteed to mean anything either (as shown by past
discussion: one code point may be the base 'a' and the following one the
combining '^', both making up "â"). Code points do not represent "characters"
in the common sense. So, it is very clear that implicitly iterating over dchars
is a wrong choice. But what else?

I would rather get rid of wchar and dchar and deal with a plain stream of bytes
supposed to represent UTF-8, until we get a good solution to operate at the
level of "human" characters.

Denis
-- 
_________________
vita es estrany
spir.wikidot.com

Mar 18 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

Iterating over dchars works in _most_ cases. Iterating over chars only works for
pure ASCII. The additional overhead for dealing with graphemes instead of code
points is almost certainly prohibitive, it _usually_ isn't necessary, and we
don't have an actual grapheme solution yet. So, treating char[] and wchar[] as
if their elements were valid on their own is _not_ going to work. Treating them
along with dchar[] as ranges of dchar _mostly_ works. We definitely should have
a way to handle them as ranges of graphemes for those who need to, but the code
point vs grapheme issue is nowhere near as critical as the code unit vs code
point issue.

I don't really want to get into the whole Unicode discussion again. It has been
discussed quite a bit on the D list already. There is no perfect solution. The
current solution _mostly_ works, and, for the most part IMHO, is the correct
solution. We _do_ need a full-on grapheme handling solution, but a lot of stuff
doesn't need that and the overhead for dealing with it would be prohibitive. The
main problem with using code points rather than graphemes is the lack of
normalization, and a _lot_ of string code can get by just fine without that.
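
To illustrate the normalization issue, here is a sketch that assumes a Phobos
recent enough to provide std.uni.byGrapheme and std.uni.normalize, neither of
which was available when this was written. The helper name graphemeCount is
made up.

```d
// Run with: dmd -main -unittest -run example.d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helper: number of user-perceived characters in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

unittest
{
    string composed   = "\u00E2";    // a with circumflex as one code point
    string decomposed = "a\u0302";   // 'a' followed by COMBINING CIRCUMFLEX ACCENT

    // Comparing by code point (or code unit) sees two different strings.
    assert(composed != decomposed);
    assert(composed.walkLength == 1);
    assert(decomposed.walkLength == 2);

    // At the grapheme level both are a single character.
    assert(graphemeCount(composed) == 1);
    assert(graphemeCount(decomposed) == 1);

    // Normalizing to NFC makes the two spellings compare equal.
    assert(normalize!NFC(decomposed) == composed);
}
```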

So, we have a really good 90% solution and we still need a 100% solution, but
using the 100% all of the time would almost certainly not be acceptable due to
performance issues, and doing stuff by code unit instead of code point would be
_really_ bad. So, what we have is good and will likely stay as is. We just need
a proper grapheme solution for those who need it.

- Jonathan M Davis


P.S. Unicode is just plain ugly.... :(

Mar 18 2011

Peter Alexander <peter.alexander.au gmail.com> writes:

I must be missing something, because the solution seems obvious to me:

char[], wchar[], and dchar[] should be simple arrays like int[] with no 
unicode semantics.

string, wstring, and dstring should not be aliases to arrays, but 
instead should be separate types that behave the way char[], wchar[], 
and dchar[] do currently.

Is there any problem with this approach?

Mar 18 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, March 18, 2011 14:08:48 Peter Alexander wrote:
 On 18/03/11 5:53 PM, Jonathan M Davis wrote:
 On Friday, March 18, 2011 03:32:35 spir wrote:
 On 03/18/2011 10:29 AM, Peter Alexander wrote:
 On 13/03/11 12:05 AM, Jonathan M Davis wrote:
 So, when you're using a range of char[] or wchar[], you're really
 using a range of dchar. These ranges are bi-directional. They can't
 be sliced, and they can't be indexed (since doing so would likely be
 invalid). This generally works very well. It's exactly what you want
 in most cases. The problem is that that means that the range that
 you're iterating over is effectively of a different type than
 the original char[] or wchar[].

=20
 This has to be the worst language design decision /ever/.
=20
 You can't just mess around with fundamental principles like "the first
 element in an array of T has type T" for the sake of a minor
 convenience. How are we supposed to do generic programming if common
 sense reasoning about types doesn't hold?
=20
 This is just std::vector<bool>  from C++ all over again. Can we not
 learn from mistakes of the past?

=20
 I partially agree, but. Compare with a simple ascii text: you could
 iterate over it chars (=3Dcodes=3Dbytes), words, lines... Or according=



 to
 specific schemes for your app (eg reverse order, every number in it,
 every word at start of line...). A piece of is not only a stream of
 codes.
=20
 The problem is there is no good decision, in the case of char[] or
 wchar[]. We should have to choose a kind of "natural" sense of what it
 means to iterate over a text, but there no such thing. What does it
 *mean*? What is the natural unit of a text?
 Bytes or words are code units which mean nothing. Code units (<->=20
 dchars) are not guaranteed to mean anything neither (as shown by past
 discussion: a code unit may be the base 'a', the following one be the
 composite '^', both in "=C3=A2"). Code unit do not represent "characte=



rs" in
 the common sense. So, it is very clear that implicitely iterating over
 dchars is a wrong choice. But what else? I would rather get rid of
 wchar and dchar and deal with plain stream of bytes supposed to
 represent utf8. Until we get a good solution to operate at the level of
 "human" characters.

=20
 Iterating over dchars works in _most_ cases. Iterating over chars only
 works for pure ASCII. The additional overhead for dealing with graphemes
 instead of code points is almost certainly prohibitive, it _usually_
 isn't necessary, and we don't have an actual grapheme solution yet. So,
 treating char[] and wchar[] as if their elements were valid on their own
 is _not_ going to work. Treating them along with dchar[] as ranges of
 dchar _mostly_ works. We definitely should have a way to handle them as
 ranges of graphemes for those who need to, but the code point vs
 grapheme issue is nowhere near as critical as the code unit vs code
 point issue.

 I don't really want to get into the whole unicode discussion again. It
 has been discussed quite a bit on the D list already. There is no
 perfect solution. The current solution _mostly_ works, and, for the most
 part IMHO, is the correct solution. We _do_ need a full-on grapheme
 handling solution, but a lot of stuff doesn't need that and the overhead
 for dealing with it would be prohibitive. The main problem with using
 code points rather than graphemes is the lack of normalization, and a
 _lot_ of string code can get by just fine without that.

 So, we have a really good 90% solution and we still need a 100% solution,
 but using the 100% all of the time would almost certainly not be
 acceptable due to performance issues, and doing stuff by code unit
 instead of code point would be _really_ bad. So, what we have is good
 and will likely stay as is. We just need a proper grapheme solution for
 those who need it.

 - Jonathan M Davis

 P.S. Unicode is just plain ugly.... :(

 I must be missing something, because the solution seems obvious to me:

 char[], wchar[], and dchar[] should be simple arrays like int[] with no
 unicode semantics.

 string, wstring, and dstring should not be aliases to arrays, but
 instead should be separate types that behave the way char[], wchar[],
 and dchar[] do currently.

 Is there any problem with this approach?

There has been a fair bit of debate about it in the past. No one has been able
to come up with an alternate solution which is generally considered better than
what we have.

char is defined to be a UTF-8 code unit. wchar is defined to be a UTF-16 code
unit. dchar is defined to be a UTF-32 code unit (which is also guaranteed to be
a code point). So, manipulating char[] and wchar[] as arrays of characters
doesn't generally make any sense. They _aren't_ characters. They're code units.
Having a range of char or wchar generally makes no sense.
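To make the code unit vs. code point distinction concrete, here is a minimal sketch. It assumes a Phobos version in which narrow strings are decoded when used as ranges, and `codePointCount` is a hypothetical helper name, not something from Phobos:

```d
import std.range : walkLength;

// Hypothetical helper: number of code points, counted by walking the
// string as a range (narrow strings are decoded to dchar on the way).
size_t codePointCount(S)(S s)
{
    return walkLength(s);
}

void main()
{
    // U+00E2 (a with circumflex) in the three encodings
    string  s8  = "\u00E2";
    wstring s16 = "\u00E2"w;
    dstring s32 = "\u00E2"d;

    // .length counts code units, so it depends on the encoding
    assert(s8.length  == 2); // two UTF-8 code units
    assert(s16.length == 1); // one UTF-16 code unit
    assert(s32.length == 1); // one UTF-32 code unit

    // the number of code points is the same in all three
    assert(codePointCount(s8)  == 1);
    assert(codePointCount(s16) == 1);
    assert(codePointCount(s32) == 1);

    // a code point outside the BMP needs two UTF-16 code units
    assert("\U0001F600"w.length == 2);
    assert(codePointCount("\U0001F600"w) == 1);
}
```

Only for dchar[] do the array length and the code point count always agree.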

When you don't care about the contents of a string, treating it as an array is
very useful. When you _do_ care, you need to treat it as a range of dchar - no
matter which unicode encoding it uses. So, having arrays of code units which are
treated as ranges of dchar makes a lot of sense.
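A rough illustration of the two views of the same data, assuming a Phobos version where std.range.primitives decodes narrow strings (the behaviour described in this thread). `firstCodeUnit` and `firstCodePoint` are hypothetical helper names:

```d
import std.range.primitives : ElementType, ElementEncodingType,
    isInputRange, isRandomAccessRange, hasLength, front;

// As a range, char[] has elements of type dchar ...
static assert(is(ElementType!(char[]) == dchar));
// ... while the type actually stored in the array is still char.
static assert(is(ElementEncodingType!(char[]) == char));

// char[] is an input range, but it does not advertise a length or
// random access, since neither is meaningful in terms of code points.
static assert(isInputRange!(char[]));
static assert(!hasLength!(char[]));
static assert(!isRandomAccessRange!(char[]));

// dchar[] and int[] behave like ordinary arrays in both views.
static assert(isRandomAccessRange!(dchar[]));
static assert(is(ElementType!(int[]) == int));

// Hypothetical helper: array view, returns the first code unit.
char firstCodeUnit(const(char)[] s) { return s[0]; }

// Hypothetical helper: range view, returns the first decoded code point.
dchar firstCodePoint(const(char)[] s) { return s.front; }

void main()
{
    char[] s = "\u00E2b".dup;              // U+00E2 followed by 'b'
    assert(firstCodeUnit(s) == 0xC3);      // lead byte of the UTF-8 sequence
    assert(firstCodePoint(s) == '\u00E2'); // the whole code point
}
```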

We could have a wrapper type which wrapped arrays of char, wchar, or dchar and
had the appropriate operations on them and was a range of dchar, but then you'd
have to get at the underlying array for various stuff. So, whether it's a gain
or loss is debatable. You have to special-case strings _regardless_. For some
stuff, they need to be treated as arrays of code units, and for other stuff they
need to be treated as ranges of code points.
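For what it's worth, such a wrapper could look roughly like the sketch below. `Utf8Text` and `countCodePoints` are hypothetical names invented for illustration (nothing like them is claimed to exist in Phobos); the sketch only uses std.utf.decode and std.utf.stride, and it shows the trade-off mentioned above: anything that needs the code units still has to reach into the underlying array.

```d
import std.range.primitives : isInputRange, ElementType;
import std.utf : decode, stride;

// Hypothetical wrapper type, for illustration only.
struct Utf8Text
{
    char[] units; // underlying UTF-8 code units, still directly accessible

    @property bool empty() const { return units.length == 0; }

    // Decode the first code point without consuming it.
    @property dchar front()
    {
        size_t i = 0;
        return decode(units, i);
    }

    // Drop all code units that belong to the first code point.
    void popFront()
    {
        units = units[stride(units, 0) .. $];
    }
}

static assert(isInputRange!Utf8Text);
static assert(is(ElementType!Utf8Text == dchar));

// Hypothetical helper: count code points by walking the range.
size_t countCodePoints(Utf8Text t)
{
    size_t n;
    for (; !t.empty; t.popFront())
        ++n;
    return n;
}

void main()
{
    auto t = Utf8Text("h\u00E9llo".dup);
    assert(t.units.length == 6);     // array view: 6 code units
    assert(countCodePoints(t) == 5); // range view: 5 code points
}
```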

As it stands, range-based functions treat char[], wchar[], etc. properly.
Allowing char[] to be treated as a range of char would just cause bugs in most
cases. Generally speaking, when someone is trying to do stuff like use char[] as
an output range, it _shouldn't_ work. In almost all cases, it would just be
buggy to treat char[] as a range of char and allow that to work (which would
happen if we treated char[] as a range of char).

So, in almost all cases where treating char[] as a range of dchar causes
problems, it's _preventing bugs_. The one glaring problem with the current
scheme is foreach. It defaults to the element type of the array, so if you don't
give dchar as the element type when iterating over a char[] or wchar[], then
you're going to have bugs. There have been suggestions on how to fix that (such
as giving a warning when not giving the iteration type for char[] and wchar[]
when using foreach), but nothing has been implemented yet.
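The foreach pitfall can be seen with a short sketch. The two counting functions are hypothetical helpers written for this example:

```d
// Hypothetical helper: foreach without an explicit element type iterates
// over the array's own element type, i.e. UTF-8 code units for a string.
size_t countByDefault(string s)
{
    size_t n;
    foreach (c; s)
        ++n;
    return n;
}

// Hypothetical helper: asking for dchar makes foreach decode the string,
// so every iteration sees one whole code point.
size_t countByDchar(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // pure ASCII: both loops agree
    assert(countByDefault("hello") == 5);
    assert(countByDchar("hello") == 5);

    // one non-ASCII character (U+00E9) and the counts diverge
    assert(countByDefault("h\u00E9llo") == 6);
    assert(countByDchar("h\u00E9llo") == 5);
}
```

With the default element type, the loop body sees the two bytes of U+00E9 as two separate "characters", which is the source of the bugs described above.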

Overall, using arrays for strings (as D has done since pretty much forever)
works really well. It's just that char[] and wchar[] cannot be treated as ranges
of char or wchar or you're just asking for problems. And on the whole, the
current solution works quite well, and operations that are disallowed by it are
_supposed_ to be disallowed.

- Jonathan M Davis

Mar 18 2011

Bekenn <leaveme alone.com> writes:

On 3/12/2011 2:02 PM, Jonas Drewsen wrote:
 Error message:

 test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
 (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
 does not match any function template declaration

 test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
 (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
 cannot deduce template function from argument types !()(char[],char[])

I haven't checked (could be completely off here), but I don't think that 
char[] counts as an input range; you would normally want to use dchar 
instead.

Mar 12 2011

Bekenn <leaveme alone.com> writes:

Or, better yet, just read Jonathan's post.

Mar 12 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday 12 March 2011 16:11:20 Bekenn wrote:
 On 3/12/2011 2:02 PM, Jonas Drewsen wrote:
 Error message:
 
 test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
 (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
 does not match any function template declaration
 
 test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
 (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
 cannot deduce template function from argument types !()(char[],char[])

 
 I haven't checked (could be completely off here), but I don't think that
 char[] counts as an input range; you would normally want to use dchar
 instead.

char[] _does_ count as an input range (of dchar). It just doesn't count as an
_output_ range (since it doesn't really hold dchar).

- Jonathan M Davis

Mar 12 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

What Jonathan said really needs to be put up on the D website, maybe
under the articles section. Heck, I'd just put a link to that recent
UTF thread on the website, it's really informative (the one on UTF and
meaning of glyphs, etc). And UTF will only get more important, just
like multicore.

Speaking of which, a description on ranges should be put up there as
well. There's that article Andrei once wrote, but we should put it on
the D site and discuss D's implementation of ranges in more detail.
And by 'we' I mean someone who's well versed in ranges. :p

Mar 12 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday 12 March 2011 16:05:37 Jonathan M Davis wrote:
 You could open an
 enhancement request for copy to treat char[] and wchar[] as arrays if
 _both_ of the arguments are of the same type.

Actually, on reflection, I'd have to say that there's not much point to that.
If you really want to copy one array to another (rather than a range), just use
the array copy syntax:

void main()
{
    auto i = [1, 2, 3, 4];
    auto j = [3, 4, 5, 6];
    assert(i == [1, 2, 3, 4]);
    assert(j == [3, 4, 5, 6]);

    i[] = j[]; // element-wise copy; both arrays must have the same length

    assert(i == [3, 4, 5, 6]);
    assert(j == [3, 4, 5, 6]);
}

copy is useful because it works on generic ranges, not because it copies arrays
(arrays already allow you to do that quite nicely), so if all you're looking at
copying is arrays, then just use the array copy syntax.
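Applied to the char arrays from the original question, the same syntax might be used like this. `copyUnits` is a hypothetical helper that mimics copy's habit of returning the unfilled rest of the destination; it works purely on code units and does no decoding:

```d
// Hypothetical helper: copy the code units of src into the front of dst
// and return the part of dst that was not written to.
char[] copyUnits(const(char)[] src, char[] dst)
{
    assert(dst.length >= src.length, "destination too small");
    dst[0 .. src.length] = src[];
    return dst[src.length .. $];
}

void main()
{
    char[] a3 = ['1', '2', '3', '4'];
    char[] a4 = ['5', '6', '7', '8'];

    // plain array copy: the lengths must match exactly
    a4[] = a3[];
    assert(a4 == "1234");

    // a larger destination: copy into a slice of it
    char[] buf = new char[](6);
    auto rest = copyUnits(a3, buf);
    assert(buf[0 .. 4] == "1234");
    assert(rest.length == 2);
}
```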

- Jonathan M Davis

Mar 12 2011

spir <denis.spir gmail.com> writes:

On 03/13/2011 01:05 AM, Jonathan M Davis wrote:
 If you were to try and iterate over a char[] by char, then you would be looking
 at code units rather than code points which is _rarely_ what you want. If you're
 dealing with anything other than pure ASCII, you _will_ have bugs if you do
 that. You're supposed to use dchar with foreach and character arrays. That way,
 each value you process is a valid character. Ranges do the same, only you don't
 give them an iteration type, so they're _always_ iterating over dchar.

Side-note: you can be sure the source is pure ASCII if, and only if, it is
mechanically produced. (As soon as an end-user touches it, it may hold
anything, since OSes and apps offer users means to introduce characters which
are not on their keyboards.)
This can also easily be checked in utf-8 (which has been designed for that):
all ASCII chars are coded using the same code as in ASCII, thus all code units
should be < 128.
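A minimal sketch of that check, with `isPureAscii` as a hypothetical helper name. It deliberately iterates by code unit, since the question is about the bytes themselves:

```d
// Hypothetical helper: true if every UTF-8 code unit is below 128,
// i.e. the text is pure ASCII.
bool isPureAscii(const(char)[] s)
{
    foreach (char unit; s) // explicit char: look at code units, no decoding
    {
        if (unit >= 0x80)
            return false;
    }
    return true;
}

void main()
{
    assert(isPureAscii("hello, world"));
    assert(!isPureAscii("h\u00E9llo")); // contains U+00E9
    assert(isPureAscii(""));            // empty text is trivially ASCII
}
```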

Denis
-- 
_________________
vita es estrany
spir.wikidot.com

Mar 13 2011
