digitalmars.D - D on hackernews

Andrei Alexandrescu (4/4) Sep 20 2011 http://hackerne.ws/item?id=3014861

Nick Sabalausky (5/8) Sep 20 2011 Looks like hackernews must be be having some sort of technical problem. ...

Nick Sabalausky (7/17) Sep 20 2011 Hmm, but the other threads on that site seem to be working fine. Weird.
Timon Gehr (2/12) Sep 21 2011 It seems like the thread has been closed. ("[dead]")

bearophile (6/9) Sep 21 2011 I think the Wikipedia D page needs to be rewritten, leaving 80-90% of it...

Timon Gehr (15/20) Sep 21 2011 Yes, that is important. Wikipedia is usually the first place people go

Andrei Alexandrescu (3/16) Sep 21 2011 Agreed. Does anyone volunteer for fixing D's Wikipedia page?

Marco Leise (50/67) Sep 21 2011 f

Timon Gehr (4/34) Sep 21 2011 That was just a bad place to accidentally leave out a word. ;)

Timon Gehr (2/15) Sep 21 2011 ... concerns [D1].
travert phare.normalesup.org (Christophe) (18/20) Sep 21 2011 Well, I think they are. The ptr+length stuff is amasing, but the

Jonathan M Davis (25/45) Sep 21 2011 What do you mean? It does exactly what it says that it does.

travert phare.normalesup.org (Christophe Travert) (24/53) Sep 21 2011 It does say it uses the slice operator if the range is sliceable, and

Simen Kjaeraas (16/30) Sep 21 2011 at =

travert phare.normalesup.org (Christophe) (10/17) Sep 21 2011 Yes, sorry. Then I disagree with "as long as char is a unicode code

Jonathan M Davis (42/93) Sep 21 2011 1. drop says nothing about slicing.

travert phare.normalesup.org (Christophe) (19/33) Sep 21 2011 I figured that out.

Jonathan M Davis (75/89) Sep 21 2011 t

travert phare.normalesup.org (Christophe) (22/31) Sep 21 2011 not documentation of drop.

Jonathan M Davis (15/51) Sep 21 2011 You weren't specific enough to make it clear what you meant. It looked l...

travert phare.normalesup.org (Christophe) (6/14) Sep 22 2011 Well, char could also disapear in favor of ubyte, but that will confuse

Andrei Alexandrescu (4/8) Sep 21 2011 I'd love to hear more about that. The standard library does optimize

travert phare.normalesup.org (Christophe Travert) (17/25) Sep 21 2011 Well, in that other thread called "Re: toUTFz and WinAPI

Andrei Alexandrescu (4/27) Sep 21 2011 That sounds great. Looking forward to your pull requests!

Andrei Alexandrescu (12/27) Sep 21 2011 String handling in D is good modulo the oddities you noticed. What would...

Graham Fawcett (11/47) Sep 21 2011 1. Let "string" remain an alias for "immutable(char)[]", and introduce a...
Peter Alexander (14/44) Sep 21 2011 From what I can see, the problem with D string is that they are a

Jacob Carlborg (5/29) Sep 21 2011 Ruby pre 1.9 behaves similar. Well, actually Ruby pre 1.9 is not

Andrej Mitrovic (4/6) Sep 21 2011 I really doubt this by now. We've mentioned a million times that

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

http://hackerne.ws/item?id=3014861

Apparently we're still having a PR issue. I tried to chime in but I 
can't add a comment.


Andrei

Sep 20 2011

"Nick Sabalausky" <a a.a> writes:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:j5bpbf$1g2u$1 digitalmars.com...
 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue. I tried to chime in but I can't 
 add a comment.

Looks like hackernews must be having some sort of technical problem. I've 
posted there before with no problem, and my "karma" is 12. But I'm logged in 
now, and it's not giving me any way to reply either.

Sep 20 2011

"Nick Sabalausky" <a a.a> writes:

"Nick Sabalausky" <a a.a> wrote in message 
news:j5bpnf$1gpj$1 digitalmars.com...
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
 news:j5bpbf$1g2u$1 digitalmars.com...
 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue. I tried to chime in but I can't 
 add a comment.

 Looks like hackernews must be having some sort of technical problem. 
 I've posted there before with no problem, and my "karma" is 12. But I'm 
 logged in now, and it's not giving me any way to reply either.

Hmm, but the other threads on that site seem to be working fine. Weird.

Oh, I see. Looks like that's part of a story that's been killed:

http://hackerne.ws/item?id=3014824

The link from "thirsteh" seems to indicate the story was probably just some 
"Yay Go!" flamebait.

Sep 20 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/21/2011 06:38 AM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:j5bpbf$1g2u$1 digitalmars.com...
 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue. I tried to chime in but I can't
 add a comment.

 Looks like hackernews must be having some sort of technical problem. I've
 posted there before with no problem, and my "karma" is 12. But I'm logged in
 now, and it's not giving me any way to reply either.

It seems like the thread has been closed. ("[dead]")

Sep 21 2011

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:

 http://hackerne.ws/item?id=3014861
 
 Apparently we're still having a PR issue.

I think the Wikipedia D page needs to be rewritten, leaving 80-90% of its space
to D (meaning D2).

Regarding "D is doomed" more than 99% of all languages ever invented fail to
become widespread. It's the most common fate.

Regarding D "killer features", I like D for being multi-level allowing both
almost-high-level coding and low-level coding with user specified memory
layouts, for functional-style functions in a procedural program or
procedural-style functions in a functional-style program, and for not using a
totally new syntax :-)

Bye,
bearophile

Sep 21 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/21/2011 09:37 AM, bearophile wrote:
 Andrei Alexandrescu:

 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue.

 I think the Wikipedia D page needs to be rewritten, leaving 80-90% of
 its space to D (meaning D2).

Yes, that is important. Wikipedia is usually the first place people go 
looking for information, and much of the information given there is 
horribly outdated/wrong and mostly only concerns. Many people think 
what's on Wikipedia is true. [citation needed]

"For performance reasons, string slicing and the length property operate 
on code units rather than code points (characters), which frequently 
confuses developers.[27]"

The link at [27] only says that many programmers who haven't had to 
handle unicode have trouble understanding how unicode works initially. 
It is not a D thing in any other way than that D actually supports 
unicode natively. Yet the 'D strings are strange and confusing' argument 
comes up quite often on the web, probably because many feel they are 
competent enough to discuss the language after having read the Wikipedia 
article.
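
To make the quoted Wikipedia sentence concrete, here is a minimal sketch, assuming a UTF-8 source file and current Phobos behaviour; the sample string is made up for illustration. It shows that length and slicing count UTF-8 code units, while range iteration (here via walkLength) counts code points:

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    string s = "häh";           // 'ä' (U+00E4) takes two UTF-8 code units
    assert(s.length == 4);      // length counts code units (bytes)
    assert(s.walkLength == 3);  // range iteration decodes to code points
    assert(s[0 .. 1] == "h");   // slicing indexes code units
    assert(s[1 .. 3] == "ä");   // both code units of 'ä'
    writeln(s.length, " code units, ", s.walkLength, " code points");
}
```

None of this is specific to D; any UTF-8 string type has to pick one of the two views for its length.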

Sep 21 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/21/11 8:52 AM, Timon Gehr wrote:
 On 09/21/2011 09:37 AM, bearophile wrote:
 Andrei Alexandrescu:

 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue.

 I think the Wikipedia D page needs to be rewritten, leaving 80-90% of
 its space to D (meaning D2).

 Yes, that is important. Wikipedia is usually the first place people go
 looking for information, and much of the information given there is
 horribly outdated/wrong and mostly only concerns.

Agreed. Does anyone volunteer for fixing D's Wikipedia page?

Andrei

Sep 21 2011

"Marco Leise" <Marco.Leise gmx.de> writes:

On 21.09.2011 at 16:24, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 On 9/21/11 8:52 AM, Timon Gehr wrote:
 On 09/21/2011 09:37 AM, bearophile wrote:
 Andrei Alexandrescu:

 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue.

 I think the Wikipedia D page needs to be rewritten, leaving 80-90% of
 its space to D (meaning D2).

 Yes, that is important. Wikipedia is usually the first place people go
 looking for information, and much of the information given there is
 horribly outdated/wrong and mostly only concerns.

 Agreed. Does anyone volunteer for fixing D's Wikipedia page?

 Andrei

Is anyone uninvolved enough to be objective and involved enough to know
what they write?
Timon, I think you are exaggerating a bit. It is not mostly only concerns,
but I agree they have bold headers, and other language pages like Java or
C++ lack this section entirely. Now it would certainly make people
suspicious if the section disappeared overnight. I would reduce the font
weight of the headers in that section, remove the talk about UTF-8 string
handling, and at some point in time move the 'library split' issue to a
historical section about D1. I think the focus on x86 is also a valid
concern.

Here are some statistics I collected, for fun:

The Català version also says the following:
- It is unstable and unsuitable for production environments (version 0.140)

The Català and Galego versions say:
- The only documentation is the official specification

The following languages are direct translations of (or the source for)
the English version. Maybe their authors can be contacted and are willing
to update their language version after the change:
- Arabic
- Español
- Polski (not the exact same sections, probably translated from an older
  version)

The Italian version has the most impressive features list with 14 sections
about characteristics!

The Latin version is blowing my mind, just because people who use a long
dead language would write code in a new one that has to do with computers:

import tango.io.Console;

int main(char[][] args) {
     Cout("salve munde!");
     return 0;
}

Many language pages are hopelessly outdated, but speakers of 'minority'
languages will look for an English article anyway.

Sep 21 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/22/2011 01:17 AM, Marco Leise wrote:
 On 21.09.2011 at 16:24, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 9/21/11 8:52 AM, Timon Gehr wrote:
 On 09/21/2011 09:37 AM, bearophile wrote:
 Andrei Alexandrescu:

 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue.

 I think the Wikipedia D page needs to be rewritten, leaving 80-90% of
 its space to D (meaning D2).

 Yes, that is important. Wikipedia is usually the first place people go
 looking for information, and much of the information given there is
 horribly outdated/wrong and mostly only concerns.

 Agreed. Does anyone volunteer for fixing D's Wikipedia page?

 Andrei

 Is anyone uninvolved enough to be objective and involved enough to know
 what they write?
 Timon, I think you are exaggerating a bit. It is not mostly only
 concerns, but I agree they have bold headers, and other language pages
 like Java or C++ lack this section entirely.

On 09/21/2011 04:33 PM, Timon Gehr wrote:
 Yes, that is important. Wikipedia is usually the first place people go
 looking for information, and much of the information given there is
 horribly outdated/wrong and mostly only concerns.

 ... concerns [D1].

That was just a bad place to accidentally leave out a word. ;)
I agree that the article also contains some useful information.

Sep 21 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/21/2011 03:52 PM, Timon Gehr wrote:
 On 09/21/2011 09:37 AM, bearophile wrote:
 Andrei Alexandrescu:

 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue.

 I think the Wikipedia D page needs to be rewritten, leaving 80-90% of
 its space to D (meaning D2).

 Yes, that is important. Wikipedia is usually the first place people go
 looking for information, and much of the information given there is
 horribly outdated/wrong and mostly only concerns.

... concerns [D1].

Sep 21 2011

travert phare.normalesup.org (Christophe) writes:

Timon Gehr, in message (digitalmars.D:144889), wrote:
 unicode natively. Yet the 'D strings are strange and confusing' argument 
 comes up quite often on the web.

Well, I think they are. The ptr+length stuff is amazing, but the 
behavior of strings in phobos is weird.

mini-quiz: what should std.range.drop(some_string, 1) do ?
hint: what it actually does is not what the documentation of phobos 
suggests*...

Strings are arrays of char, but they appear like a lazy range of dchar to
phobos. I could cope with the fact that this is a little unexpected for
beginners. But well, that creates a lot of exceptions in phobos, like
the fact that you can't even copy a char[] to a char[] with
std.algorithm.copy. And I'm not even mentioning all the optimizations that
are not/cannot be performed for those strings. I'll just remember to use
ubyte[] wherever I can...

* Please, could someone just add to the documentation of IsSliceable that
narrow strings are an exception, as was recently done for
hasLength.
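
For readers who want to try the mini-quiz, a small sketch (assuming current Phobos, where narrow strings are decoded on iteration; the sample string is hypothetical):

```d
import std.range : drop;

void main()
{
    string s = "äb";                // 'ä' is two code units, 'b' is one

    // drop pops whole code points, so both code units of 'ä' go at once.
    assert(s.drop(1) == "b");

    // A raw slice removes a single code unit and cuts 'ä' in half.
    assert(s[1 .. $].length == 2);
}
```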

-- 
Christophe

Sep 21 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
 Timon Gehr, in message (digitalmars.D:144889), wrote:
 unicode natively. Yet the 'D strings are strange and confusing' argument
 comes up quite often on the web.

 Well, I think they are. The ptr+length stuff is amazing, but the
 behavior of strings in phobos is weird.

 mini-quiz: what should std.range.drop(some_string, 1) do ?
 hint: what it actually does is not what the documentation of phobos
 suggests*...

What do you mean? It does exactly what it says that it does.

 Strings are arrays of char, but they appear like a lazy range of dchar to
 phobos. I could cope with the fact that this is a little unexpected for
 beginners. But well, that creates a lot of exceptions in phobos, like
 the fact that you can't even copy a char[] to a char[] with
 std.algorithm.copy. And I don't mention all the optimizations that are
 not/cannot be performed for those strings. I'll just remember to use
 ubyte[] wherever I can...

Yeah, well, as long as char is a unicode code unit, that's the way that it
goes. In general, Phobos does a good job of using slicing and other
optimizations on strings when it can in spite of the fact that they're not
sliceable ranges, but there are cases where the fact that you _have_ to
process them to be able to find the nth code point means that you just can't
process them as efficiently as your typical array. That's life with a
variable-length encoding - and that includes std.algorithm.copy. But that's
an easy one to get around, since if you wanted to ignore unicode safety and
just copy some chunk of the string, you can always just slice it directly
with no need for copy.
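
A minimal sketch of that workaround, using only built-in array operations (the sample data is made up). Note that the slice bounds must fall on code point boundaries for the result to be valid UTF-8:

```d
void main()
{
    // Copying code units wholesale needs no decoding and no std.algorithm.copy.
    char[] src = "häh".dup;
    auto dst = new char[](src.length);
    dst[] = src[];                  // built-in array copy of the code units
    assert(dst == "häh");

    // Slicing a chunk directly. Index 3 is a code point boundary here
    // because 'h' takes one code unit and 'ä' takes two.
    string s = "häh";
    string prefix = s[0 .. 3];
    assert(prefix == "hä");
}
```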

 * Please, could someone just add to the documentation of IsSliceable that
 narrow strings are an exception, as was recently done for
 hasLength.

A good point.

- Jonathan M Davis

Sep 21 2011

travert phare.normalesup.org (Christophe Travert) writes:

Jonathan M Davis, in message (digitalmars.D:144896), wrote:
 On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
 Timon Gehr, in message (digitalmars.D:144889), wrote:
 unicode natively. Yet the 'D strings are strange and confusing' argument
 comes up quite often on the web.

 
 Well, I think they are. The ptr+length stuff is amazing, but the
 behavior of strings in phobos is weird.
 
 mini-quiz: what should std.range.drop(some_string, 1) do ?
 hint: what it actually does is not what the documentation of phobos
 suggests*...

 
 What do you mean? It does exactly what it says that it does.

It does say it uses the slice operator if the range is sliceable, and
the documentation of isSliceable fails to specify that a narrow string
is not sliceable.

 Yeah, well, as long as char is a unicode code unit, that's the way that it 
 goes.

They are not unicode units.

import std.stdio;

void main() {
  char a = 'ä';
  writeln(a); // outputs: \344
  writeln('ä'); // outputs: ä
}

Obviously, a code unit doesn't fit in a char.
Thus 'char[]' is not what the name claims it is.

Unicode operations should be supported by a different class that
is really a lazy range of dchar implemented over an underlying char[], with
no length, index, or stride operator, and appropriate optimizations.

 In general, Phobos does a good job of using slicing and other 
 optimizations on strings when it can in spite of the fact that they're not 
 sliceable ranges, but there are cases where the fact that you _have_ to 
 process them to be able to find the nth code point means that you just can't 
 process them as efficiently as your typical array. That's life with a
 variable-length encoding - and that includes std.algorithm.copy. But that's
 an easy one
 to get around, since if you wanted to ignore unicode safety and just copy some 
 chunk of the string, you can always just slice it directly with no need for 
 copy.

Dealing with UTF-encoded strings is less efficient, but there are a number
of algorithms that can be optimized for UTF-encoded strings, like copying
or finding an ASCII char in a string. Unfortunately, there is no
practical way to do this with the current range API.

About copy, it's not that easy to overcome the problem if you are using
a template, and that template happens to be instantiated for strings.

 * Please, could someone just add to the documentation of IsSliceable that
 narrow strings are an exception, as was recently done for
 hasLength.

 
 A good point.

The main point of my post actually.

-- 
Christophe

Sep 21 2011

"Simen Kjaeraas" <simen.kjaras gmail.com> writes:

On Wed, 21 Sep 2011 20:20:55 +0200, Christophe Travert
<travert phare.normalesup.org> wrote:

 Yeah, well, as long as char is a unicode code unit, that's the way that
 it goes.

 They are not unicode units.

 void main() {
   char a = 'ä';
   writeln(a); // outputs: \344
   writeln('ä'); // outputs: ä
 }

 Obviously, a code unit doesn't fit in a char.
 Thus 'char[]' is not what the name claims it is.

Oh, it absolutely is. According to the Unicode Consortium, a code unit is
"The minimal bit combination that can represent a unit of encoded text
for processing or interchange. The Unicode Standard uses 8-bit code units
in the UTF-8 encoding form [...]".

What you are thinking about is a code point.
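
A short sketch of the distinction, assuming a UTF-8 source file and current Phobos: indexing a string yields code units of type char, while the range primitive front decodes a full code point as a dchar.

```d
import std.range : front;

void main()
{
    string s = "ä";                 // one code point, U+00E4

    // Stored as two UTF-8 code units, each of which is a char.
    assert(s.length == 2);
    assert(s[0] == 0xC3 && s[1] == 0xA4);
    static assert(is(typeof(s[0]) == immutable(char)));

    // The range interface decodes and hands back the code point.
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == 0xE4);
}
```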


 Unicode operations should be supported by a different class that
 is really a lazy range of dchar implemented as an underlying char[], with
 no length, index, or stride operator, and appropriate optimizations.

I can agree with this, but the benefits over what we already have are nigh
zilch.


-- 
   Simen

Sep 21 2011

travert phare.normalesup.org (Christophe) writes:

"Simen Kjaeraas", in message (digitalmars.D:144921), wrote:
 What you are thinking about is a code point.

Yes, sorry. Then I disagree with "as long as char is a unicode code 
unit, that's the way that it goes", since myString.front should then 
return a code unit, whereas it actually returns a code point.

 Unicode operations should be supported by a different class that
 is really a lazy range of dchar implemented as an underlying char[], with
 no length, index, or stride operator, and appropriate optimizations.

 
 I can agree with this, but the benefits over what we already have are nigh
 zilch.

I think no one here has any illusions about changing this in D2.

However, this class could be introduced in phobos right now, without 
changing anything about string. It would simply be a bit safer than 
strings.

-- 
Christophe

Sep 21 2011

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, September 21, 2011 11:20 Christophe Travert wrote:
 Jonathan M Davis, in message (digitalmars.D:144896), wrote:
 On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
 Timon Gehr, in message (digitalmars.D:144889), wrote:
 unicode natively. Yet the 'D strings are strange and confusing'
 argument comes up quite often on the web.

 
 Well, I think they are. The ptr+length stuff is amazing, but the
 behavior of strings in phobos is weird.
 
 mini-quiz: what should std.range.drop(some_string, 1) do ?
 hint: what it actually does is not what the documentation of phobos
 suggests*...

 
 What do you mean? It does exactly what it says that it does.

 
 It does say it uses the slice operator if the range is sliceable, and
 the documentation of isSliceable fails to specify that a narrow string
 is not sliceable.

1. drop says nothing about slicing.
2. popFrontN (which drop calls) says that it slices for ranges that support 
slicing. Strings do not unless they're arrays of dchar.

Yes, hasSlicing should probably be clearer about narrow strings, but that has 
nothing to do with drop.
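
The two points can be checked at compile time. This is a sketch against current Phobos traits (the sample string is made up):

```d
import std.range : hasLength, hasSlicing, isRandomAccessRange, popFrontN;

// Narrow strings are arrays, yet the range traits deliberately reject them.
static assert(!hasSlicing!string);
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);

// Arrays of dchar, and ordinary arrays, are sliceable ranges.
static assert(hasSlicing!dstring);
static assert(hasSlicing!(int[]));

void main()
{
    string s = "äbc";
    // Without slicing, popFrontN falls back to popping one code point at a time.
    auto popped = s.popFrontN(2);
    assert(popped == 2);
    assert(s == "c");
}
```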

 Yeah, well, as long as char is a unicode code unit, that's the way that
 it goes.

 
 They are not unicode units.
 
 void main() {
 char a = 'ä';
 writeln(a); // outputs: \344
 writeln('ä'); // outputs: ä
 }
 
 Obviously, a code unit doesn't fit in a char.
 Thus 'char[]' is not what the name claims it is.
 
 Unicode operations should be supported by a different class that
 is really a lazy range of dchar implemented as an underlying char[], with
 no length, index, or stride operator, and appropriate optimizations.

The problem with

char a = 'ä';

is completely separate from char[]. That has to do with the fact that the 
compiler isn't properly dealing with narrowing conversions for an individual 
character. A char is _by definition_ a UTF-8 code unit. 

char a = 'ä';

shouldn't even be legal. It's a compiler bug. Most code which operates on 
individual chars or wchars is buggy. dchar is what should be used for 
individual characters. The code you give is buggy because it's trying to use 
char as individual character, and the compiler has a bug, so it doesn't catch 
it.

 In general, Phobos does a good job of using slicing and other
 optimizations on strings when it can in spite of the fact that they're
 not sliceable ranges, but there are cases where the fact that you _have_
 to process them to be able to find the nth code point means that you
 just can't process them as efficiently as your typical array. That's
 life with a variable-length encoding - and that includes
 std.algorithm.copy. But that's an easy one to get around, since if you
 wanted to ignore unicode safety and just copy some chunk of the string,
 you can always just slice it directly with no need for copy.

 
 Dealing with UTF-encoded strings is less efficient, but there are a number
 of algorithms that can be optimized for UTF-encoded strings, like copying
 or finding an ASCII char in a string. Unfortunately, there is no
 practical way to do this with the current range API.

 About copy, it's not that easy to overcome the problem if you are using
 a template, and that template happens to be instantiated for strings.

In general, you _must_ treat strings as ranges of dchar if you want your code 
to be correct with regards to code points. So, that's the default. If you know 
that your particular algorithm can be optimized for strings without treating 
them strictly as ranges of dchar and still deal with unicode correctly (e.g. 
by slicing them, because you know the correct point to slice to), then you 
special-case your template for narrow strings. Phobos does that in a number of 
places. But you _have_ to special case it because being able to do so depends 
entirely on your algorithm. You can't treat strings that way in the general 
case, or your code is not going to handle unicode correctly.
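
As an illustration of that kind of special-casing, here is a sketch of a generic function with a fast path for narrow strings. The function countAscii is a hypothetical helper written for this example, not something taken from Phobos. It relies on the fact that an ASCII code unit never occurs inside a multi-unit UTF-8 or UTF-16 sequence, so scanning code units is still correct:

```d
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;
import std.traits : isNarrowString, isSomeChar;

/// Counts occurrences of an ASCII character (hypothetical helper).
size_t countAscii(R)(R r, char needle)
if (isInputRange!R && isSomeChar!(ElementType!R))
{
    assert(needle < 0x80, "needle must be ASCII");
    size_t n;
    static if (isNarrowString!R)
    {
        // Fast path: foreach over a string iterates code units, no decoding.
        foreach (c; r)
            if (c == needle)
                ++n;
    }
    else
    {
        // General path: use the range primitives (decoded elements).
        for (; !r.empty; r.popFront())
            if (r.front == needle)
                ++n;
    }
    return n;
}

void main()
{
    assert(countAscii("häh, häh", 'h') == 4);  // narrow string, fast path
    assert(countAscii("häh"d, 'h') == 2);      // dstring, general path
}
```

Whether such a fast path is valid depends on the algorithm, which is why it has to be written per function.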

No, the way that strings are handled in D is not perfect. But we're trying to 
balance correctness and efficiency, and it does a good job of that. Using 
strings as ranges of dchar is correct, and in a large number of cases, it is 
as efficient as you're going to get it (perhaps particular functions could be 
better optimized, but the API can't be made more efficient). And in the cases 
where you know you can safely get better efficiency out of strings by special 
casing them, that's what you do. It works. It works fairly well. And no one 
has been able to come up with a solution that's definitely better.

If there's an issue, it's that the message on how to correctly handle strings 
is not necessarily being communicated as well as it needs to be. In a few 
places, the documentation could probably be improved, but what we probably 
really need in order to help get the message across is more articles on ranges 
and strings as ranges. I've actually partially written an article on that, but 
due to some compiler bugs regarding std.container, I ended up temporarily
shelving it.

- Jonathan M Davis

Sep 21 2011

travert phare.normalesup.org (Christophe) writes:

"Jonathan M Davis", in message (digitalmars.D:144922), wrote:
 1. drop says nothing about slicing.
 2. popFrontN (which drop calls) says that it slices for ranges that support 
 slicing. Strings do not unless they're arrays of dchar.
 
 Yes, hasSlicing should probably be clearer about narrow strings, but that has 
 nothing to do with drop.

I never said there was a problem with drop.

 char a = 'ä';
 
 shouldn't even be legal. It's a compiler bug.

I figured that out.
I wanted to show that a char couldn't hold a code point, but I was too 
fast and confused code points with code units.

 Dealing with UTF-encoded strings is less efficient, but there are a number
 of algorithms that can be optimized for UTF-encoded strings, like copying
 or finding an ASCII char in a string. Unfortunately, there is no
 practical way to do this with the current range API.


Maybe there should be a way for the designer of a class to provide an
overload for some algorithms, like forwarding to myClass.algorithm for
instance. The problem is that this is an open door to unintentional
hacking.

Oh, I just noticed I'm actually replying to myself. Thinking out loud,
am I?

 [...]

After having read all of you, I have no problem with string being a
lazy range of dchar. But I have a problem with immutable(char)[] being
a lazy range of dchar (i.e. not being an array), and I have a problem with
string being immutable(char)[] (i.e. providing length, opIndex and
opSlice).

Thanks
-- 
Christophe

Sep 21 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, September 21, 2011 19:56:47 Christophe wrote:
 "Jonathan M Davis", in message (digitalmars.D:144922), wrote:
 1. drop says nothing about slicing.
 2. popFrontN (which drop calls) says that it slices for ranges that
 support slicing. Strings do not unless they're arrays of dchar.

 Yes, hasSlicing should probably be clearer about narrow strings, but
 that has nothing to do with drop.

 I never said there was a problem with drop.

Yes you did. You said:

"mini-quiz: what should std.range.drop(some_string, 1) do ?
hint: what it actually does is not what the documentation of phobos
suggests*..."

 After having read all of you, I have no problems with string being a
 lazy range of dchar. But I have a problem with immutable(char)[] being
 lazy range of dchar (ie not being a array), and I have a problem with
 string being immutable(char)[] (ie providing length opIndex and
 opSlice).

For efficiency, you need to be able to treat strings as arrays of code units
for some algorithms. For correctness, you need to be able to treat them as
ranges of code points (dchar) in the general case. You need both. The
question is how to provide that. strings as arrays came first (D1), whereas
ranges came later. We _need_ to treat strings as ranges of dchar or they're
essentially unusable in the general case. Operating on code units is almost
always _wrong_. So, when we added the range functions, we special-cased them
for strings so that strings are treated as ranges of dchar as they need to
be. And in cases where you actually need to treat a string as an array of
code units for efficiency, you special case the function for them, and you
still get that. What other way would you do it?
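
Both views side by side, as a sketch. It assumes std.utf.byCodeUnit, which is available in current Phobos but was added after this thread was written:

```d
import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.utf : byCodeUnit;

void main()
{
    string s = "häh";

    // Default range view: decoded code points (dchar).
    assert(s.walkLength == 3);

    // Explicit opt-in to the code unit view, which has a length.
    assert(s.byCodeUnit.length == 4);

    // Searching for an ASCII character is safe at the code unit level.
    assert(s.byCodeUnit.canFind('h'));
}
```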

There _are_ some edges here - such as foreach defaulting to char for string
when dchar is really what you should be iterating with - and there are times
when you want to use a string with a range-based function and can't, because
it needs a random-access range or one which is sliceable to do what it does,
which can be annoying. But what else can you do there? You can't treat the
string as a range of code units in that case. The result would be completely
wrong. Imagine if sort worked on a char[]. You'd get an array of sorted code
units, which would _not_ be code points, and which would be completely
useless. So, treating a string as a range of code units makes no sense.
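
The foreach edge in a few lines (a sketch; the sample string is made up):

```d
void main()
{
    string s = "äb";
    size_t units, points;

    // With an inferred type, foreach walks code units (char).
    foreach (c; s)
        ++units;

    // Asking for dchar makes foreach decode whole code points.
    foreach (dchar c; s)
        ++points;

    assert(units == 3);   // 'ä' contributes two code units
    assert(points == 2);
}
```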


We could switch to having a struct of some kind which was a string, make it a
range of dchar, and have it contain an array of char, wchar, or dchar
internally. It would have to restrict its operations in exactly the same
manner that the range functions for strings currently do, so the exact same
algorithms would or wouldn't work with it. And then you'd need to provide
access to the underlying array of code units so that algorithms special
casing strings could operate on the array instead. Ultimately, it's pretty
much the same thing, except now you have a wrapper struct. How does that buy
you anything? The _only_ thing that it would buy you AFAIK is that foreach
would then default to dchar instead of the code unit type. The basic problem
still exists. You still need to special case strings for efficiency, and you
still need to treat them as a range of dchar in the general case. It's an
inherent issue with variable length encodings. You can't just magically make
it go away.

If you have a better solution, please share it, but the fact that we want
both efficiency and correctness binds us pretty thoroughly here.

- Jonathan M Davis

Sep 21 2011

travert phare.normalesup.org (Christophe) writes:

Jonathan M Davis, in message (digitalmars.D:144944), wrote:
 I never said there was a problem with drop.

 
 Yes you did. You said:
 
 "mini-quiz: what should std.range.drop(some_string, 1) do ?
 hint: what it actually does is not what the documentation of phobos 

                                              ^^^^^^^^^^^^^^^^^^^^^^^
 suggests*..."

not documentation of drop.

 If you have a better solution, please share it, but the fact that we want both 
 efficiency and correctness binds us pretty thoroughly here.

- char[], etc. being real arrays.
- strings being lazy ranges of dchar, providing access to underlying 
char[].

Correctness of the language is better, since we don't have a T[] having a
front method that returns something other than T, or a type that accepts
opSlice but is not sliceable, etc.

Runtime correctness and efficiency are the same as the current ones,
since the whole of phobos already considers strings as lazy ranges of dchar.
It is even better, since the user cannot change an arbitrary code point
in a string without explicitly asking for the underlying char[].
Optimizations can come the same way as they currently can, since the
underlying char[] is accessible.
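
A rough sketch of what such a type could look like. The names DecodedString and codeUnits are invented for this illustration and do not exist in Phobos; only std.utf.decode, std.utf.stride and the range traits are real library calls:

```d
import std.range.primitives : ElementType, hasLength, hasSlicing, isForwardRange;
import std.utf : decode, stride;

/// Hypothetical wrapper: a forward range of dchar over UTF-8 code units.
struct DecodedString
{
    private immutable(char)[] data;

    this(string s) { data = s; }

    @property bool empty() { return data.length == 0; }

    /// Decodes the first code point without consuming it.
    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }

    /// Skips all code units of the first code point.
    void popFront() { data = data[stride(data, 0) .. $]; }

    @property DecodedString save() { return this; }

    /// Explicit escape hatch to the underlying code units.
    @property string codeUnits() { return data; }
}

// The type advertises exactly what it can do: no length, no slicing.
static assert(isForwardRange!DecodedString);
static assert(is(ElementType!DecodedString == dchar));
static assert(!hasLength!DecodedString);
static assert(!hasSlicing!DecodedString);

void main()
{
    import std.algorithm.comparison : equal;

    auto s = DecodedString("äb");
    assert(s.front == 'ä');
    assert(s.codeUnits.length == 3);   // array view only on request
    assert(equal(s, "äb"d));
}
```

Since it is a range and not an array, foreach over such a type yields dchar elements.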

I can deal with strings the way they are, since they are a legacy.
They are not perfect, and will never be unless computers become fat
enough to treat dchar[] just as efficiently as char[]. I am also aware
that phobos cannot be optimized for every case in the first place, and
I can change my mind.

-- 
Christophe

Sep 21 2011

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, September 21, 2011 14:08 Christophe wrote:
 Jonathan M Davis, in message (digitalmars.D:144944), wrote:
 I never said there was a problem with drop.

 
 Yes you did. You said:
 
 "mini-quiz: what should std.range.drop(some_string, 1) do ?
 hint: what it actually does is not what the documentation of phobos

 
 ^^^^^^^^^^^^^^^^^^^^^^^
 
 suggests*..."

 
 not documentation of drop.

You weren't specific enough to make it clear what you meant. It looked like
you were complaining about drop's documentation.

 If you have a better solution, please share it, but the fact that we want
 both efficiency and correctness binds us pretty thoroughly here.

 
 - char[], etc. being real arrays.

Which is actually arguably a _bad_ thing, since it doesn't generally make 
sense to operate on individual chars. What you really want 99.99999999999% of 
the time is code points not code units.

 - strings being lazy ranges of dchar, providing access to underlying
 char[].
 
 Correctness of the language is better, since we don't have a T[] whose
 front method returns something other than T, or a type that accepts
 opSlice but is not sliceable, etc.
 
 Runtime correctness and efficiency are the same as the current ones,
 since the whole of phobos already treats strings as lazy ranges of dchar.
 It is even better, since the user cannot change an arbitrary code point
 in a string without explicitly asking for the underlying char[].
 Optimizations can come the same way as they currently can, since the
 underlying char[] is accessible.
 
 I can deal with strings the way they are, since they are a legacy.
 They are not perfect, and never will be unless computers become fat
 enough to treat dchar[] just as efficiently as char[]. I am also aware
 that phobos cannot be optimized for every case in the first place, and
 I can change my mind.

So, essentially you're arguing for a wrapper around arrays of code units. That 
does add some benefits (such as making foreach default to dchar), but 
ultimately doesn't add that much additional benefit (it also makes dealing 
with array literals much more interesting). If we were to start over again, 
that may very well be the way that we'd go, but the added benefits just don't 
outweigh the immense amount of code breakage which would result. Maybe the 
situation will change with D3, but at this point, I think that we've done a 
fairly good job of making it possible to treat strings as ranges.
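
To make the foreach remark concrete, here is a minimal sketch, assuming a current Phobos in which strings are still auto-decoded by the range primitives. The helper names countCodeUnits and countCodePoints are made up for illustration. A plain foreach over a string visits code units, whereas asking for dchar (or going through front) decodes code points:

```d
import std.range; // front for strings

// One iteration per UTF-8 code unit (what foreach does when asked for char).
size_t countCodeUnits(string s)
{
    size_t n = 0;
    foreach (char c; s)
        ++n;
    return n;
}

// One iteration per code point: foreach decodes when asked for dchar.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "héllo"; // 'é' takes two UTF-8 code units
    assert(countCodeUnits(s) == 6);
    assert(countCodePoints(s) == 5);
    assert(s.length == 6);      // .length counts code units
    assert("été".front == 'é'); // the range primitive decodes
}
```

A wrapper type would mostly change which of the two loops is the default.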

- Jonathan M Davis

Sep 21 2011

travert phare.normalesup.org (Christophe) writes:

"Jonathan M Davis" wrote in message (digitalmars.D:144962):
 - char[], etc. being real arrays.

 
 Which is actually arguably a _bad_ thing, since it doesn't generally make 
 sense to operate on individual chars. What you really want 99.99999999999% of 
 the time is code points not code units.

Well, char could also disappear in favor of ubyte, but that would confuse 
even more people.

 If we were to start over again, that may very well be the way that 
 we'd go, but the added benefits just don't outweigh the immense amount 
 of code breakage which would result.

I 100% agree with that.

-- 
Christophe

Sep 22 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/21/11 1:20 PM, Christophe Travert wrote:
 Dealing with UTF-encoded strings is less efficient, but there are a number
 of algorithms that can be optimized for UTF-encoded strings, like copying
 or finding an ASCII char in a string. Unfortunately, there is no
 practical way to do this with the current range API.

I'd love to hear more about that. The standard library does optimize 
certain algorithms for UTF strings.

Andrei

Sep 21 2011

travert phare.normalesup.org (Christophe Travert) writes:

Andrei Alexandrescu wrote in message (digitalmars.D:144936):
 On 9/21/11 1:20 PM, Christophe Travert wrote:
 Dealing with UTF-encoded strings is less efficient, but there are a number
 of algorithms that can be optimized for UTF-encoded strings, like copying
 or finding an ASCII char in a string. Unfortunately, there is no
 practical way to do this with the current range API.

 
 I'd love to hear more about that. The standard library does optimize 
 certain algorithms for UTF strings.


Well, in that other thread called "Re: toUTFz and WinAPI 
GetTextExtentPoint32W/" in D.learn (what is the proper way to refer to 
a message here ?), I showed how to improve walkLength for strings and 
utf.stride.

About finding a character in a string: rather than relying 
on string.popFront, which makes the loop un-unrollable, 
we could search code unit by code unit directly. This is obviously 
better for ASCII chars, and I'll be looking for a nice idea for other 
code points (besides using find(Range, Range)).

I didn't review phobos with that idea in mind, and didn't run any 
benchmark except the one for walkLength, but using string.popFront is a 
bad idea in terms of performance, so workarounds are often better, and 
they are not that hard to find. I may do that when I have more time to 
give to D.
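
The code-unit search could look roughly like the sketch below. It is only an illustration, not what phobos does: indexOfAscii is a hypothetical helper, and it relies on std.string.representation from current Phobos to view the string as raw bytes. It is correct because, in UTF-8, bytes below 0x80 never appear inside a multi-byte sequence. No benchmark is claimed here.

```d
import std.string : representation;

// Hypothetical helper: code-unit index of the first occurrence of an
// ASCII character, or -1 if it is absent. No decoding is performed.
ptrdiff_t indexOfAscii(string haystack, char needle)
{
    assert(needle < 0x80, "needle must be ASCII");
    foreach (i, b; haystack.representation) // immutable(ubyte)[]
    {
        if (b == needle)
            return cast(ptrdiff_t) i;
    }
    return -1;
}

void main()
{
    // 'h' is unit 0, 'é' occupies units 1 and 2, the first 'l' is unit 3.
    assert(indexOfAscii("héllo", 'l') == 3);
    assert(indexOfAscii("héllo", 'z') == -1);
}
```

The returned index is a code-unit offset, so it can be used directly to slice the original string.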

-- 
Christophe

Sep 21 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/21/11 3:26 PM, Christophe Travert wrote:
 Andrei Alexandrescu wrote in message (digitalmars.D:144936):
 On 9/21/11 1:20 PM, Christophe Travert wrote:
 Dealing with UTF-encoded strings is less efficient, but there are a number
 of algorithms that can be optimized for UTF-encoded strings, like copying
 or finding an ASCII char in a string. Unfortunately, there is no
 practical way to do this with the current range API.

 I'd love to hear more about that. The standard library does optimize
 certain algorithms for UTF strings.


 Well, in that other thread called "Re: toUTFz and WinAPI
 GetTextExtentPoint32W/" in D.learn (what is the proper way to refer to
 a message here ?), I showed how to improve walkLength for strings and
 utf.stride.

Interesting, thanks.

 About finding a character in a string: rather than relying
 on string.popFront, which makes the loop un-unrollable,
 we could search code unit by code unit directly. This is obviously
 better for ASCII chars, and I'll be looking for a nice idea for other
 code points (besides using find(Range, Range)).

 I didn't review phobos with that idea in mind, and didn't run any
 benchmark except the one for walkLength, but using string.popFront is a
 bad idea in terms of performance, so workarounds are often better, and
 they are not that hard to find. I may do that when I have more time to
 give to D.

That sounds great. Looking forward to your pull requests!

Andrei

Sep 21 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/21/11 10:16 AM, Christophe wrote:
 Timon Gehr wrote in message (digitalmars.D:144889):
 unicode natively. Yet the 'D strings are strange and confusing' argument
 comes up quite often on the web.

 Well, I think they are. The ptr+length stuff is amazing, but the
 behavior of strings in phobos is weird.

 mini-quiz: what should std.range.drop(some_string, 1) do?
 hint: what it actually does is not what the documentation of phobos
 suggests*...

 Strings are arrays of char, but they appear as a lazy range of dchar to
 phobos. I could cope with the fact that this is a little unexpected for
 beginners. But that creates a lot of exceptions in phobos, like
 the fact that you can't even copy a char[] to a char[] with
 std.algorithm.copy. And I haven't mentioned all the optimizations that
 are not/cannot be performed for those strings. I'll just remember to use
 ubyte[] wherever I can...

String handling in D is good modulo the oddities you noticed. What would 
make it perfect would be:

* Add property .rep that returns byte[], ushort[], or uint[] for char[], 
wchar[], dchar[] respectively (with the appropriate qualifier).

* Replace .length with .codeUnits.

* Disallow [n] and [m .. n]

This would upgrade D's strings from good to awesome. Really it would be 
a dream come true. Unfortunately it would also break most D code there 
is out there. I don't see how we can improve the current situation while 
staying backward compatible.
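
For readers wondering why [n] and [m .. n] are on that list, here is a small sketch of what they do today on a non-ASCII string, next to std.range.drop from the quiz above. It assumes current Phobos (std.utf.validate, std.range.drop); isWellFormed is a helper made up for this example.

```d
import std.range : drop;
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isWellFormed(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string s = "été";                 // 3 code points, 5 code units
    assert(s.length == 5);            // .length counts code units
    assert(s[0] == 0xC3);             // [n] yields a code unit, not 'é'
    assert(!isWellFormed(s[0 .. 1])); // this slice cuts 'é' in half
    assert(!isWellFormed(s[1 .. $])); // starts on a continuation byte
    assert(isWellFormed(s[2 .. $]));  // slicing on a boundary is fine
    assert(s.drop(1) == "té");        // drop removes one code point
}
```

Indexing and slicing work in code units while the range functions work in code points, which is the mismatch the three bullet points would remove.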


Andrei

Sep 21 2011

Graham Fawcett <fawcett uwindsor.ca> writes:

On Wed, 21 Sep 2011 11:39:03 -0500, Andrei Alexandrescu wrote:

 On 9/21/11 10:16 AM, Christophe wrote:
 Timon Gehr wrote in message (digitalmars.D:144889):
 unicode natively. Yet the 'D strings are strange and confusing'
 argument comes up quite often on the web.

 Well, I think they are. The ptr+length stuff is amazing, but the
 behavior of strings in phobos is weird.

 mini-quiz: what should std.range.drop(some_string, 1) do? hint: what
 it actually does is not what the documentation of phobos suggests*...

 Strings are arrays of char, but they appear as a lazy range of dchar
 to phobos. I could cope with the fact that this is a little unexpected
 for beginners. But that creates a lot of exceptions in phobos,
 like the fact that you can't even copy a char[] to a char[] with
 std.algorithm.copy. And I haven't mentioned all the optimizations that
 are not/cannot be performed for those strings. I'll just remember to use
 ubyte[] wherever I can...

 
 String handling in D is good modulo the oddities you noticed. What would
 make it perfect would be:
 
 * Add property .rep that returns byte[], ushort[], or uint[] for char[],
 wchar[], dchar[] respectively (with the appropriate qualifier).
 
 * Replace .length with .codeUnits.
 
 * Disallow [n] and [m .. n]
 
 This would upgrade D's strings from good to awesome. Really it would be
 a dream come true. Unfortunately it would also break most D code there
 is out there. I don't see how we can improve the current situation while
 staying backward compatible.
 
 
 Andrei

1. Let "string" remain an alias for "immutable(char)[]", and introduce a 
new struct, "text!charType", that does the awesome stuff; provide good 
conversion routines and casts between text instances and old-school 
strings.

2. Provide an awesome "std.text" library for the text type and related 
operations, absorbing std.uni and other similar modules. Adapt Phobos to 
take "text" in virtually every function that currently expects a "string".

3. ...

4. Profit.
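
A very rough sketch of what step 1 might look like follows. Everything in it is hypothetical (the name Text, its members, the choice of std.utf functions); it only shows that such a type can be a range of code points, expose its code units explicitly, and simply not offer indexing or slicing.

```d
import std.range : isInputRange;
import std.utf : count, decode, stride;

// Hypothetical text type; the name and members are illustrative only.
struct Text(Char)
{
    private immutable(Char)[] data;

    this(immutable(Char)[] s) { data = s; }

    // Input range of code points.
    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i); // decode the code point starting at index 0
    }

    void popFront() { data = data[stride(data, 0) .. $]; }

    // O(n): has to walk the encoded data.
    @property size_t codePoints() { return count(data); }

    // Explicit, O(1) access to the raw storage.
    @property immutable(Char)[] codeUnits() { return data; }

    // No opIndex or opSlice: t[n] and t[m .. n] do not compile.
}

static assert(isInputRange!(Text!char));
static assert(!__traits(compiles, Text!char("a")[0]));

void main()
{
    auto t = Text!char("héllo");
    assert(t.codeUnits.length == 6);
    assert(t.codePoints == 5);
    t.popFront();
    assert(t.front == 'é');
}
```

The hard parts of the proposal (literals, conversions, adapting Phobos) are not addressed by this sketch.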

Graham

Sep 21 2011

Peter Alexander <peter.alexander.au gmail.com> writes:

On 21/09/11 5:39 PM, Andrei Alexandrescu wrote:
 On 9/21/11 10:16 AM, Christophe wrote:
 Timon Gehr wrote in message (digitalmars.D:144889):
 unicode natively. Yet the 'D strings are strange and confusing' argument
 comes up quite often on the web.

 Well, I think they are. The ptr+length stuff is amazing, but the
 behavior of strings in phobos is weird.

 mini-quiz: what should std.range.drop(some_string, 1) do?
 hint: what it actually does is not what the documentation of phobos
 suggests*...

 Strings are arrays of char, but they appear as a lazy range of dchar to
 phobos. I could cope with the fact that this is a little unexpected for
 beginners. But that creates a lot of exceptions in phobos, like
 the fact that you can't even copy a char[] to a char[] with
 std.algorithm.copy. And I haven't mentioned all the optimizations that
 are not/cannot be performed for those strings. I'll just remember to use
 ubyte[] wherever I can...

 String handling in D is good modulo the oddities you noticed. What would
 make it perfect would be:

 * Add property .rep that returns byte[], ushort[], or uint[] for char[],
 wchar[], dchar[] respectively (with the appropriate qualifier).

 * Replace .length with .codeUnits.

 * Disallow [n] and [m .. n]

 This would upgrade D's strings from good to awesome. Really it would be
 a dream come true. Unfortunately it would also break most D code there
 is out there. I don't see how we can improve the current situation while
 staying backward compatible.


 Andrei

From what I can see, the problem with D strings is that they are a 
'magic' special case for arrays.

char[] should be an array of char, just like int[] is an array of int. 
If you have a T[] arr, then typeof(arr.front) should be T. This is what 
everyone would expect. char[] should essentially be the same as byte[], 
although char[] would be more natural for ASCII strings.
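
The special case can be checked at compile time. This is a small sketch against the range traits in current Phobos, which still auto-decodes narrow strings; frontDiffersFromIndex is a name invented for the example.

```d
import std.range; // ElementType, ElementEncodingType, hasLength, hasSlicing, front

// For ordinary arrays the range element type is the array element type.
static assert(is(ElementType!(int[]) == int));

// For char[] the range element type is dchar...
static assert(is(ElementType!(char[]) == dchar));
// ...even though the stored (encoding) type is char.
static assert(is(ElementEncodingType!(char[]) == char));

// The language allows .length, indexing and slicing on char[],
// but the range traits report that it has none of them.
static assert(!hasLength!(char[]));
static assert(!hasSlicing!(char[]));
static assert(!isRandomAccessRange!(char[]));

// Hypothetical helper: true when front and [0] disagree.
bool frontDiffersFromIndex(string s)
{
    return s.front != s[0];
}

void main()
{
    assert(frontDiffersFromIndex("été"));  // U+00E9 versus the byte 0xC3
    assert(!frontDiffersFromIndex("abc")); // ASCII: both are 'a'
}
```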

string should be something different, a separate type. As you say, 
disallowing [n] and [m..n] would be good, as they make no sense with a 
variable-length encoding. You could have .length and .codeUnits, but 
length would have to be O(n). That's not ideal, but since string wouldn't 
be an array, it doesn't need to have the same complexity guarantees.

Same for wchar[], dchar[], wstring and dstring.

Of course, making that change would break existing code. Maybe D3? :-)

Sep 21 2011

Jacob Carlborg <doob me.com> writes:

On 2011-09-21 15:52, Timon Gehr wrote:
 On 09/21/2011 09:37 AM, bearophile wrote:
 Andrei Alexandrescu:

 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue.

 I think the Wikipedia D page needs to be rewritten, leaving 80-90% of
 its space to D (meaning D2).

 Yes, that is important. Wikipedia is usually the first place people go
 looking for information, and much of the information given there is
 horribly outdated/wrong and mostly only concerns. Many people think
 what's on Wikipedia is true. [citation needed]

 "For performance reasons, string slicing and the length property operate
 on code units rather than code points (characters), which frequently
 confuses developers.[27]"

 The link at [27] only says that many programmers who have not had to
 handle unicode have trouble understanding how unicode works initially.
 It is not a D thing in any other way than that D actually supports
 unicode natively. Yet the 'D strings are strange and confusing' argument
 comes up quite often on the web, probably because many feel they are
 competent enough to discuss the language after having read the Wikipedia
 article.

Ruby pre 1.9 behaves similarly. Well, actually Ruby pre 1.9 is not 
encoding aware at all, if I recall correctly.

-- 
/Jacob Carlborg

Sep 21 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 9/21/11, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue.

I really doubt this by now. We've mentioned a million times that
there are no dual standard libraries. As far as I can tell, this
argument keeps being re-introduced by trolls.

Sep 21 2011
