digitalmars.D.learn - typeof(string.front) should be char

Piotr Szturmaj <bncrbme jadamspam.pl> writes:

Hello,

For this code:

     auto c = "test"c;
     auto w = "test"w;
     auto d = "test"d;
     pragma(msg, typeof(c.front));
     pragma(msg, typeof(w.front));
     pragma(msg, typeof(d.front));

compiler prints:

dchar
dchar
immutable(dchar)

IMO it should print this:

immutable(char)
immutable(wchar)
immutable(dchar)

Is it a bug?

Mar 02 2012

Ali Çehreli <acehreli yahoo.com> writes:

On 03/02/2012 06:30 PM, Piotr Szturmaj wrote:
 Hello,

 For this code:

 auto c = "test"c;
 auto w = "test"w;
 auto d = "test"d;
 pragma(msg, typeof(c.front));
 pragma(msg, typeof(w.front));
 pragma(msg, typeof(d.front));

 compiler prints:

 dchar
 dchar
 immutable(dchar)

 IMO it should print this:

 immutable(char)
 immutable(wchar)
 immutable(dchar)

 Is it a bug?

No, that's by design. When used as InputRange ranges, slices of any 
character type are exposed as ranges of dchar.
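
A minimal sketch of that design, assuming a Phobos where `std.range` provides both `ElementType` and `ElementEncodingType` (the variable names are only illustrative). It checks at compile time that the range view of a narrow string is dchar while the stored code unit type is still available:

```d
import std.range;

void main()
{
    string  c = "test";
    wstring w = "test"w;
    dstring d = "test"d;

    // Narrow strings are decoded on the fly, so front yields a dchar value.
    static assert(is(typeof(c.front) == dchar));
    static assert(is(typeof(w.front) == dchar));
    // A dstring needs no decoding; front hands back the stored element.
    static assert(is(typeof(d.front) == immutable(dchar)));

    // ElementType describes the range view, ElementEncodingType the storage.
    static assert(is(ElementType!string == dchar));
    static assert(is(ElementEncodingType!string == immutable(char)));
    static assert(is(ElementEncodingType!wstring == immutable(wchar)));
}
```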

Ali

Mar 02 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, March 02, 2012 20:41:35 Ali Çehreli wrote:
 On 03/02/2012 06:30 PM, Piotr Szturmaj wrote:
  > Hello,
  >
  > For this code:
  >
  > auto c = "test"c;
  > auto w = "test"w;
  > auto d = "test"d;
  > pragma(msg, typeof(c.front));
  > pragma(msg, typeof(w.front));
  > pragma(msg, typeof(d.front));
  >
  > compiler prints:
  >
  > dchar
  > dchar
  > immutable(dchar)
  >
  > IMO it should print this:
  >
  > immutable(char)
  > immutable(wchar)
  > immutable(dchar)
  >
  > Is it a bug?

 No, that's by design. When used as InputRange ranges, slices of any
 character type are exposed as ranges of dchar.

Indeed.

Strings are always treated as ranges of dchar, because it generally makes no
sense to operate on individual chars or wchars. A char is a UTF-8 code unit. A
wchar is a UTF-16 code unit. And a dchar is a UTF-32 code unit. The _only_ one
of those which is guaranteed to be a code point is dchar, since in UTF-32, all
code points are a single code unit. If you were to operate on individual chars
or wchars, you'd be operating on pieces of characters rather than whole
characters, which wreaks havoc with unicode.

Now, technically speaking, a code point isn't necessarily a full character,
since you can also combine code points (e.g. adding a subscript to a letter),
and a full character is what's called a grapheme, and unfortunately, at the
moment, Phobos doesn't have a way to operate on graphemes, but operating on
code points is _far_ more correct than operating on code units. It's also more
efficient.

Unfortunately, in order to code completely efficiently with unicode, you have
to understand quite a bit about it, which most programmers don't, but by
operating on ranges of code points, Phobos manages to be correct in the
majority of cases.

So, yes. It's very much on purpose that all strings are treated as ranges of
dchar.

- Jonathan M Davis

Mar 03 2012

Piotr Szturmaj <bncrbme jadamspam.pl> writes:

Jonathan M Davis wrote:
 On Friday, March 02, 2012 20:41:35 Ali Çehreli wrote:
 On 03/02/2012 06:30 PM, Piotr Szturmaj wrote:
   >  Hello,
   >
   >  For this code:
   >
   >  auto c = "test"c;
   >  auto w = "test"w;
   >  auto d = "test"d;
   >  pragma(msg, typeof(c.front));
   >  pragma(msg, typeof(w.front));
   >  pragma(msg, typeof(d.front));
   >
   >  compiler prints:
   >
   >  dchar
   >  dchar
   >  immutable(dchar)
   >
   >  IMO it should print this:
   >
   >  immutable(char)
   >  immutable(wchar)
   >  immutable(dchar)
   >
   >  Is it a bug?

 No, that's by design. When used as InputRange ranges, slices of any
 character type are exposed as ranges of dchar.

 Indeed.

 Strings are always treated as ranges of dchar, because it generally makes no
 sense to operate on individual chars or wchars. A char is a UTF-8 code unit. A
 wchar is a UTF-16 code unit. And a dchar is a UTF-32 code unit. The _only_ one
 of those which is guaranteed to be a code point is dchar, since in UTF-32, all
 code points are a single code unit. If you were to operate on individual chars
 or wchars, you'd be operating on pieces of characters rather than whole
 characters, which wreaks havoc with unicode.

 Now, technically speaking, a code point isn't necessarily a full character,
 since you can also combine code points (e.g. adding a subscript to a letter),
 and a full character is what's called a grapheme, and unfortunately, at the
 moment, Phobos doesn't have a way to operate on graphemes, but operating on
 code points is _far_ more correct than operating on code units. It's also more
 efficient.

 Unfortunately, in order to code completely efficiently with unicode, you have to
 understand quite a bit about it, which most programmers don't, but by
 operating on ranges of code points, Phobos manages to be correct in the
 majority of cases.

I know about Unicode, code units/points and their encoding.

 So, yes. It's very much on purpose that all strings are treated as ranges of
 dchar.

foreach gives the opportunity to handle any string by char, wchar or dchar.
The default of dchar is appropriate there, but why for ranges?

I was afraid it was on purpose, because it has some bad consequences. It
breaks genericity when dealing with ranges. Consider a custom range of char:

struct CharRange
{
     @property bool empty();
     @property char front();
     void popFront();
}

typeof(CharRange.front) and ElementType!CharRange both yield _char_,
while for string they yield _dchar_. This discrepancy pushes the range
writer to handle special string cases. I'm currently trying to write a
ByDchar range:

template ByDchar(R)
      if (isInputRange!R && isSomeChar!(ElementType!R))
{
     alias ElementType!R E;
     static if (is(E == dchar))
         alias R ByDchar;
     else static if (is(E == char))
     {
         struct ByDchar
         {
             ...
         }
     }
     else static if (is(E == wchar))
     {
         ...
     }
}

The problem with that range is that when it takes a string type, it aliases
this type to itself, because ElementType!R yields dchar. This is why
I'm talking about "bad consequences": I just want to iterate a string by
_char_, not _dchar_.
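
One possible way around the self-aliasing, sketched under the assumption that `std.range.ElementEncodingType` is available: dispatch on the stored code unit type instead of `ElementType`. The helper name `codeUnitOf` is hypothetical, and `CharRange` is given trivial bodies only so the example compiles on its own:

```d
import std.range : isInputRange, ElementType, ElementEncodingType;
import std.traits : isSomeChar, Unqual;

// Hypothetical helper: the unqualified code unit type a range stores.
template codeUnitOf(R)
    if (isInputRange!R && isSomeChar!(ElementEncodingType!R))
{
    alias codeUnitOf = Unqual!(ElementEncodingType!R);
}

// Stand-in for the custom range from the post, with dummy bodies.
struct CharRange
{
    @property bool empty() { return true; }
    @property char front() { return 'a'; }
    void popFront() {}
}

// Strings and custom ranges now agree on what is actually stored.
static assert(is(codeUnitOf!string == char));
static assert(is(codeUnitOf!wstring == wchar));
static assert(is(codeUnitOf!dstring == dchar));
static assert(is(codeUnitOf!CharRange == char));

// ElementType still reports the decoded view for narrow strings.
static assert(is(ElementType!string == dchar));

void main() {}
```

A `static if` chain on `codeUnitOf!R` would then take the char branch for string rather than the dchar one.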

Mar 03 2012

Ali Çehreli <acehreli yahoo.com> writes:

On 03/03/2012 05:57 AM, Piotr Szturmaj wrote:
 Consider a custom range of
 char:

 struct CharRange
 {
 @property bool empty();
 @property char front();
 void popFront();
 }

 typeof(CharRange.front) and ElementType!CharRange both return _char_

Yes, and I would expect both to be the same type.

 while for string they return _dchar_. This discrepancy pushes the range
 writer to handle special string cases.

Yes, Phobos faces the same issues.

 I'm currently trying to write
 ByDchar range:

 template ByDchar(R)
 if (isInputRange!R && isSomeChar!(ElementType!R))
 {
 alias ElementType!R E;
 static if (is(E == dchar))
 alias R ByDchar;
 else static if (is(E == char))
 {
 struct ByDchar
 {
 ...
 }
 }
 else static if (is(E == wchar))
 {
 ...
 }
 }

 The problem with that range is when it takes a string type, it aliases
 this type with itself, because ElementType!R yields dchar. This is why
 I'm talking about "bad consequences", I just want to iterate string by
 _char_, not _dchar_.

In case you don't know already, there are std.traits.isNarrowString,
std.traits.ForeachType, etc. which may be useful.

Ali

Mar 03 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday, March 03, 2012 14:57:59 Piotr Szturmaj wrote:
 This discrepancy pushes the range writer to handle special string cases.

Yes it does. And there's _no_ way around that if you want to handle unicode
both correctly and efficiently. To handle it correctly, you must operate on
code points (or even better, graphemes), but to handle them efficiently, you
must take the encoding into account. Phobos has gone with the default of
correctness while giving you the tools to special case stuff for efficiency.
Phobos itself uses static if all over the place to special case pieces of
functions on string type. Stuff like isNarrowString and ElementEncodingType
exist specifically for that.

 The problem with that range is when it takes a string type, it aliases
 this type with itself, because ElementType!R yields dchar. This is why
 I'm talking about "bad consequences", I just want to iterate string by
 _char_, not _dchar_.

If you want to iterate by char, then use foreach or use a wrapper range (or
cast to ubyte[] and operate on that). Phobos specifically does not do
that, because it breaks unicode. It doesn't stop you from iterating by char or
wchar if you really want to, but it operates on ranges of dchar by default,
because it's more correct.
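
A minimal sketch of those options, with some assumptions: `countCodeUnits` is a made-up helper name, `std.string.representation` stands in for the cast to ubyte[], and `std.utf.byCodeUnit` is a wrapper range that only exists in newer Phobos releases:

```d
import std.range : ElementType, walkLength;
import std.string : representation;
import std.utf : byCodeUnit;

// Counts UTF-8 code units by asking foreach for char explicitly.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char ch; s) // no decoding: one iteration per code unit
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo"; // U+00E9 occupies two UTF-8 code units

    assert(s.walkLength == 5);            // range interface: code points
    assert(countCodeUnits(s) == 6);       // foreach over char: code units
    assert(s.representation.length == 6); // immutable(ubyte)[] view

    // Wrapper range whose element type is the code unit itself.
    static assert(is(ElementType!(typeof(s.byCodeUnit())) == immutable(char)));
}
```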

- Jonathan M Davis

Mar 03 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Mar 03, 2012 at 11:53:41AM -0800, Jonathan M Davis wrote:
[...]
 If you want to iterate by char, then use foreach or use a wrapper
 range (or cast to ubyte[] and operate on that).

Or use:

	string str = ...;
	for (size_t i=0; i < str.length; i++) {
		/* do something with str[i] */
	}


 Phobos specifically does not do that, because it breaks unicode.
 It doesn't stop you from iterating by char or wchar if you really want
 to, but it operates on ranges of dchar by default, because it's more
 correct.

[...]

I think this is the correct approach. Always err on the side of correct
and/or safe, but give the programmer the option of getting under the
hood if he wants otherwise.


T

-- 
A linguistics professor was lecturing to his class one day.
"In English," he said, "A double negative forms a positive. In some
languages, though, such as Russian, a double negative is still a
negative. However, there is no language wherein a double positive can
form a negative."
A voice from the back of the room piped up, "Yeah, yeah."

Mar 03 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 03/03/2012 09:40 AM, Jonathan M Davis wrote:
 ...  but operating on
 code points is _far_ more correct than operating on code units. It's also more
 efficient.
 [snip.]

No, it is less efficient.

Mar 03 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday, March 03, 2012 18:38:44 Timon Gehr wrote:
 On 03/03/2012 09:40 AM, Jonathan M Davis wrote:
 ...  but operating on
 code points is _far_ more correct than operating on code units. It's also
 more efficient.
 [snip.]

 
 No, it is less efficient.

Operating on code points is more efficient than operating on graphemes is what
I meant. I can see that I wasn't clear enough on that.

It's more correct than operating on code units and less correct than operating
on graphemes, while it's less efficient than operating on code units and more
efficient than operating on graphemes.

- Jonathan M Davis

Mar 03 2012

Timon Gehr <timon.gehr gmx.ch> writes:

On 03/03/2012 08:46 PM, Jonathan M Davis wrote:
 On Saturday, March 03, 2012 18:38:44 Timon Gehr wrote:
 On 03/03/2012 09:40 AM, Jonathan M Davis wrote:
 ...  but operating on
 code points is _far_ more correct than operating on code units. It's also
 more efficient.
 [snip.]

 No, it is less efficient.

 Operating on code points is more efficient than operating on graphemes is
 what I meant. I can see that I wasn't clear enough on that.

Makes sense.

 It's more correct than operating on code units and less correct than operating
 on graphemes, while it's less efficient than operating on code units and more
 efficient than operating on graphemes.

 - Jonathan M Davis

When the code actually only cares about some characters that have 7-bit 
ASCII values, most of the time there are no correctness issues when 
operating on code units directly.
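
A small sketch of that case, assuming the needle is always ASCII (`countAscii` is a hypothetical helper, not a Phobos function). In UTF-8 every code unit of a multi-byte sequence is 0x80 or above, so an ASCII value can never match inside an encoded non-ASCII character:

```d
// Counts occurrences of an ASCII character by scanning raw code units.
size_t countAscii(string s, char needle)
{
    assert(needle < 0x80, "needle must be ASCII");
    size_t n;
    foreach (char ch; s) // iterate code units, no decoding
        if (ch == needle)
            ++n;
    return n;
}

void main()
{
    // The non-ASCII letters do not disturb the count of commas.
    assert(countAscii("na\u00EFve,caf\u00E9,ok", ',') == 2);
}
```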

Mar 03 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday, March 03, 2012 21:05:40 Timon Gehr wrote:
 On 03/03/2012 08:46 PM, Jonathan M Davis wrote:
 On Saturday, March 03, 2012 18:38:44 Timon Gehr wrote:
 On 03/03/2012 09:40 AM, Jonathan M Davis wrote:
 ...  but operating on
 code points is _far_ more correct than operating on code units. It's
 also
 more efficient.
 [snip.]

 
 No, it is less efficient.

 
 Operating on code points is more efficient than operating on graphemes is
 what I meant. I can see that I wasn't clear enough on that.

 
 Makes sense.
 
 It's more correct than operating on code units and less correct than
 operating on graphemes, while it's less efficient than operating on code
 units and more efficient than operating on graphemes.
 
 - Jonathan M Davis

 
 When the code actually only cares about some characters that have 7-bit
 ASCII values, most of the time there are no correctness issues when
 operating on code units directly.

True, but writing code without caring about unicode frequently leads to bugs 
when you actually _do_ have to deal with unicode (the fact that an American 
programmer runs into unicode less just makes it worse, because they're less 
likely to catch their bugs), and char is UTF-8 by definition.

So, operating specifically on ASCII is an optimization and should be coded for 
specifically rather than being generally encouraged. And having ranges over 
strings be code units rather than code points would encourage incorrect usage. 
The current solution encourages correct usage (or at least usage which is 
closer to correct, since it still isn't at the grapheme level) without 
disallowing more optimized code.

- Jonathan M Davis

Mar 03 2012

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Mar 03, 2012 at 12:42:53PM -0800, Jonathan M Davis wrote:
[...]
 The current solution encourages correct usage (or at least usage which
 is closer to correct, since it still isn't at the grapheme level)
 without disallowing more optimized code.

[...]

Speaking of graphemes, is anyone interested in implementing Unicode
normalization for D? I looked at the specs briefly, and it seems to be
something that is straightforward to implement, albeit somewhat tedious.

It would be nice if D string types are normalized (needs slight change
to string concatenation). Or at least, if there's a guaranteed
normalized string type for those who care about it.


T

-- 
You have to expect the unexpected. -- RL

Mar 03 2012

Ali Çehreli <acehreli yahoo.com> writes:

On 03/03/2012 01:42 PM, H. S. Teoh wrote:
 On Sat, Mar 03, 2012 at 12:42:53PM -0800, Jonathan M Davis wrote:
 [...]
 The current solution encourages correct usage (or at least usage which
 is closer to correct, since it still isn't at the grapheme level)
 without disallowing more optimized code.

 [...]

 Speaking of graphemes, is anyone interested in implementing Unicode
 normalization for D? I looked at the specs briefly, and it seems to be
 something that is straightforward to implement, albeit somewhat tedious.

 It would be nice if D string types are normalized (needs slight change
 to string concatenation). Or at least, if there's a guaranteed
 normalized string type for those who care about it.


 T

Denis Spir was working on solving that problem but unfortunately we 
haven't heard from him for almost a year now. I think this is his site:

   http://spir.wikidot.com

Ali

Mar 03 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Saturday, March 03, 2012 13:46:16 Ali Çehreli wrote:
 On 03/03/2012 01:42 PM, H. S. Teoh wrote:
 On Sat, Mar 03, 2012 at 12:42:53PM -0800, Jonathan M Davis wrote:
 [...]

 The current solution encourages correct usage (or at least usage which
 is closer to correct, since it still isn't at the grapheme level)
 without disallowing more optimized code.

 [...]

 Speaking of graphemes, is anyone interested in implementing Unicode
 normalization for D? I looked at the specs briefly, and it seems to be
 something that is straightforward to implement, albeit somewhat tedious.

 It would be nice if D string types are normalized (needs slight change
 to string concatenation). Or at least, if there's a guaranteed
 normalized string type for those who care about it.


 T

 Denis Spir was working on solving that problem but unfortunately we
 haven't heard from him for almost a year now. I think this is his site:

    http://spir.wikidot.com

There's some stuff in the new std.regex which was done to enhance unicode
support which is currently completely internal to it which may end up being
the basis for more, but Dmitry hasn't yet worked on creating a version of that
for more general consumption AFAIK. I'm not quite sure what he did though,
since I'm not familiar with std.regex.

- Jonathan M Davis

Mar 03 2012

Jacob Carlborg <doob me.com> writes:

On 2012-03-03 03:30, Piotr Szturmaj wrote:
 Hello,

 For this code:

 auto c = "test"c;
 auto w = "test"w;
 auto d = "test"d;
 pragma(msg, typeof(c.front));
 pragma(msg, typeof(w.front));
 pragma(msg, typeof(d.front));

 compiler prints:

 dchar
 dchar
 immutable(dchar)

I thought all these would be either "dchar" or "immutable(dchar)". Why 
are they of different types?

 IMO it should print this:

 immutable(char)
 immutable(wchar)
 immutable(dchar)

 Is it a bug?


-- 
/Jacob Carlborg

Mar 03 2012

Ali Çehreli <acehreli yahoo.com> writes:

On 03/03/2012 04:36 AM, Jacob Carlborg wrote:
 On 2012-03-03 03:30, Piotr Szturmaj wrote:
 Hello,

 For this code:

 auto c = "test"c;
 auto w = "test"w;
 auto d = "test"d;
 pragma(msg, typeof(c.front));
 pragma(msg, typeof(w.front));
 pragma(msg, typeof(d.front));

 compiler prints:

 dchar
 dchar
 immutable(dchar)

 I thought all these would be either "dchar" or "immutable(dchar)". Why
 are they of different types?

In the case of char and wchar slices, the "elements" are decoded as the 
iteration happens. In other words, the returned values are not actual 
elements of the ranges.
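
A rough illustration of that decoding step, assuming `std.utf.decode` and the range primitives from `std.range` (`frontMatchesDecode` is a hypothetical helper). The string stores three code units, yet front produces one freshly decoded dchar and popFront skips the whole encoded sequence:

```d
import std.range : front, popFront;
import std.utf : decode;

// True when front and an explicit decode agree on the first code point.
bool frontMatchesDecode(string s)
{
    size_t i = 0;
    return s.front == decode(s, i);
}

void main()
{
    string s = "\u00E9x"; // stored as three code units: 0xC3 0xA9 'x'
    assert(s.length == 3);

    size_t i = 0;
    dchar viaDecode = decode(s, i); // decodes at index 0 and advances i
    assert(viaDecode == '\u00E9');
    assert(i == 2);                 // two code units were consumed

    assert(frontMatchesDecode(s));  // front yields the same decoded value
    s.popFront();                   // drops both code units of U+00E9
    assert(s.length == 1 && s.front == 'x');
}
```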

 IMO it should print this:

 immutable(char)
 immutable(wchar)
 immutable(dchar)

 Is it a bug?


Ali

Mar 03 2012

Jacob Carlborg <doob me.com> writes:

On 2012-03-03 15:10, Ali Çehreli wrote:
 On 03/03/2012 04:36 AM, Jacob Carlborg wrote:
 On 2012-03-03 03:30, Piotr Szturmaj wrote:
 Hello,

 For this code:

 auto c = "test"c;
 auto w = "test"w;
 auto d = "test"d;
 pragma(msg, typeof(c.front));
 pragma(msg, typeof(w.front));
 pragma(msg, typeof(d.front));

 compiler prints:

 dchar
 dchar
 immutable(dchar)

 I thought all these would be either "dchar" or "immutable(dchar)". Why
 are they of different types?

 In the case of char and wchar slices, the "elements" are decoded as the
 iteration happens. In other words, the returned values are not actual
 elements of the ranges.

Ah, I see, thanks.

-- 
/Jacob Carlborg

Mar 03 2012
