digitalmars.D - D on hackernews

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
http://hackerne.ws/item?id=3014861

Apparently we're still having a PR issue. I tried to chime in but I 
can't add a comment.


Andrei
Sep 20 2011
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:j5bpbf$1g2u$1 digitalmars.com...
 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue. I tried to chime in but I can't 
 add a comment.

Looks like hackernews must be having some sort of technical problem. I've posted there before with no problem, and my "karma" is 12. But I'm logged in now, and it's not giving me any way to reply either.
Sep 20 2011
next sibling parent "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:j5bpnf$1gpj$1 digitalmars.com...
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
 news:j5bpbf$1g2u$1 digitalmars.com...
 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue. I tried to chime in but I can't 
 add a comment.

Looks like hackernews must be having some sort of technical problem. I've posted there before with no problem, and my "karma" is 12. But I'm logged in now, and it's not giving me any way to reply either.

Hmm, but the other threads on that site seem to be working fine. Weird. Oh, I see. Looks like that's part of a story that's been killed: http://hackerne.ws/item?id=3014824 The link from "thirsteh" seems to indicate the story was probably just some "Yay Go!" flamebait.
Sep 20 2011
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 09/21/2011 06:38 AM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:j5bpbf$1g2u$1 digitalmars.com...
 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue. I tried to chime in but I can't
 add a comment.

Looks like hackernews must be having some sort of technical problem. I've posted there before with no problem, and my "karma" is 12. But I'm logged in now, and it's not giving me any way to reply either.

It seems like the thread has been closed. ("[dead]")
Sep 21 2011
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 http://hackerne.ws/item?id=3014861
 
 Apparently we're still having a PR issue.

I think the Wikipedia D page needs to be rewritten, leaving 80-90% of its space to D (meaning D2).

Regarding "D is doomed": more than 99% of all languages ever invented fail to become widespread. It's the most common fate.

Regarding D "killer features": I like D for being multi-level, allowing both almost-high-level coding and low-level coding with user-specified memory layouts, for functional-style functions in a procedural program or procedural-style functions in a functional-style program, and for not using a totally new syntax :-)

Bye,
bearophile
Sep 21 2011
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 09/21/2011 09:37 AM, bearophile wrote:
 Andrei Alexandrescu:

 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue.

I think the Wikipedia D page needs to be rewritten, leaving 80-90% of its space to D (meaning D2).

Yes, that is important. Wikipedia is usually the first place people go looking for information, and much of the information given there is horribly outdated/wrong and mostly only concerns. Many people think what's on Wikipedia is true. [citation needed]

"For performance reasons, string slicing and the length property operate on code units rather than code points (characters), which frequently confuses developers.[27]"

The link at [27] only says that many programmers who haven't had to handle unicode have trouble understanding how unicode works initially. It is not a D thing in any other way than that D actually supports unicode natively. Yet the 'D strings are strange and confusing' argument comes up quite often on the web, probably because many feel they are competent enough to discuss the language after having read the Wikipedia article.
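For readers who have not run into the code unit / code point distinction that the quoted Wikipedia sentence refers to, here is a minimal sketch. The helper names codeUnitCount and codePointCount are made up for this illustration; .length and std.range.walkLength are the real building blocks.

```d
import std.range : walkLength;
import std.stdio : writeln;

// Number of UTF-8 code units (bytes) in s; this is what .length reports.
size_t codeUnitCount(string s)
{
    return s.length;
}

// Number of code points in s; Phobos decodes the string while walking it.
size_t codePointCount(string s)
{
    return s.walkLength;
}

void main()
{
    string s = "häh"; // 'ä' (U+00E4) takes two code units in UTF-8
    writeln(codeUnitCount(s));  // 4
    writeln(codePointCount(s)); // 3
    writeln(s[0 .. 1]);         // h (slicing also counts code units)
}
```

The source file is assumed to be saved as UTF-8, which is what the D compiler expects by default.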
Sep 21 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/21/11 8:52 AM, Timon Gehr wrote:
 On 09/21/2011 09:37 AM, bearophile wrote:
 Andrei Alexandrescu:

 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue.

I think the Wikipedia D page needs to be rewritten, leaving 80-90% of its space to D (meaning D2).

Yes, that is important. Wikipedia is usually the first place people go looking for information, and much of the information given there is horribly outdated/wrong and mostly only concerns.

Agreed. Does anyone volunteer for fixing D's Wikipedia page? Andrei
Sep 21 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 09/22/2011 01:17 AM, Marco Leise wrote:
 On 21.09.2011 at 16:24, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 9/21/11 8:52 AM, Timon Gehr wrote:
 On 09/21/2011 09:37 AM, bearophile wrote:
 Andrei Alexandrescu:

 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue.

I think the Wikipedia D page needs to be rewritten, leaving 80-90% of its space to D (meaning D2).

Yes, that is important. Wikipedia is usually the first place people go looking for information, and much of the information given there is horribly outdated/wrong and mostly only concerns.

Agreed. Does anyone volunteer for fixing D's Wikipedia page? Andrei

Is anyone uninvolved enough to be objective and involved enough to know what they write? Timon, I think you are exaggerating a bit. It is not mostly only concerns, but I agree they have bold headers, and other language pages like Java or C++ lack this section entirely.

On 09/21/2011 04:33 PM, Timon Gehr wrote:
 Yes, that is important. Wikipedia is usually the first place people go
 looking for information, and much of the information given there is
 horribly outdated/wrong and mostly only concerns.

... concerns [D1].

That was just a bad place to accidentally leave out a word. ;) I agree that the article also contains some useful information.
Sep 21 2011
prev sibling next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 09/21/2011 03:52 PM, Timon Gehr wrote:
 On 09/21/2011 09:37 AM, bearophile wrote:
 Andrei Alexandrescu:

 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue.

I think the Wikipedia D page needs to be rewritten, leaving 80-90% of its space to D (meaning D2).

Yes, that is important. Wikipedia is usually the first place people go looking for information, and much of the information given there is horribly outdated/wrong and mostly only concerns.

... concerns [D1].
Sep 21 2011
prev sibling next sibling parent reply travert phare.normalesup.org (Christophe) writes:
Timon Gehr, in message (digitalmars.D:144889), wrote:
 unicode natively. Yet the 'D strings are strange and confusing' argument 
 comes up quite often on the web.

Well, I think they are. The ptr+length stuff is amazing, but the behavior of strings in phobos is weird.

mini-quiz: what should std.range.drop(some_string, 1) do?
hint: what it actually does is not what the documentation of phobos suggests*...

Strings are arrays of char, but they appear like a lazy range of dchar to phobos. I could cope with the fact that this is a little unexpected for beginners. But well, that creates a lot of exceptions in phobos, like the fact that you can't even copy a char[] to a char[] with std.algorithm.copy. And I don't mention all the optimizations that are not/cannot be performed for those strings. I'll just remember to use ubyte[] wherever I can...

* Please, could someone just add to the documentation of IsSliceable that narrow strings are an exception, like it was recently added to hasLength.

-- Christophe
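The "array of char for the language, lazy range of dchar for phobos" split described above can be checked at compile time. This is only a sketch against the range traits as they exist in std.range (ElementType, hasLength, hasSlicing); dropOne is a made-up wrapper around std.range.drop for the mini-quiz.

```d
import std.range : drop, ElementType, hasLength, hasSlicing;
import std.stdio : writeln;

// The language sees string as an array of immutable char ...
static assert(is(typeof("a"[0]) == immutable(char)));
// ... but the range traits see a sequence of dchar without length or slicing.
static assert(is(ElementType!string == dchar));
static assert(!hasLength!string);
static assert(!hasSlicing!string);
// A UTF-32 string keeps its array nature in the eyes of the traits.
static assert(hasSlicing!dstring);

// Answer to the mini-quiz: drop removes one code point, not one code unit.
string dropOne(string s)
{
    return s.drop(1);
}

void main()
{
    writeln(dropOne("äb")); // b
    writeln("äb".length);   // 3 (code units)
}
```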
Sep 21 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/21/11 10:16 AM, Christophe wrote:
 Timon Gehr, in message (digitalmars.D:144889), wrote:
 unicode natively. Yet the 'D strings are strange and confusing' argument
 comes up quite often on the web.

Well, I think they are. The ptr+length stuff is amazing, but the behavior of strings in phobos is weird. mini-quiz: what should std.range.drop(some_string, 1) do? hint: what it actually does is not what the documentation of phobos suggests*... Strings are arrays of char, but they appear like a lazy range of dchar to phobos. I could cope with the fact that this is a little unexpected for beginners. But well, that creates a lot of exceptions in phobos, like the fact that you can't even copy a char[] to a char[] with std.algorithm.copy. And I don't mention all the optimizations that are not/cannot be performed for those strings. I'll just remember to use ubyte[] wherever I can...

String handling in D is good modulo the oddities you noticed. What would make it perfect would be:

* Add property .rep that returns byte[], ushort[], or uint[] for char[], wchar[], dchar[] respectively (with the appropriate qualifier).
* Replace .length with .codeUnits.
* Disallow [n] and [m .. n]

This would upgrade D's strings from good to awesome. Really it would be a dream come true. Unfortunately it would also break most D code there is out there. I don't see how we can improve the current situation while staying backward compatible.

Andrei
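As a rough illustration of what explicit access to the code units looks like without changing string itself: Phobos today ships std.string.representation and std.utf.byCodeUnit, which are close in spirit to the .rep and .codeUnits ideas, though they are not the exact properties proposed in this post. The wrapper names rawUnits and unitCount below are made up for the example.

```d
import std.stdio : writeln;
import std.string : representation;
import std.utf : byCodeUnit;

// Raw code units of a string, typed as plain integers.
immutable(ubyte)[] rawUnits(string s)
{
    return s.representation;
}

// Explicit code-unit count, independent of how ranges decode the string.
size_t unitCount(string s)
{
    return s.byCodeUnit.length;
}

void main()
{
    writeln(rawUnits("ä"));   // [195, 164]
    writeln(unitCount("äb")); // 3
}
```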
Sep 21 2011
parent Peter Alexander <peter.alexander.au gmail.com> writes:
On 21/09/11 5:39 PM, Andrei Alexandrescu wrote:
 On 9/21/11 10:16 AM, Christophe wrote:
 Timon Gehr, in message (digitalmars.D:144889), wrote:
 unicode natively. Yet the 'D strings are strange and confusing' argument
 comes up quite often on the web.

Well, I think they are. The ptr+length stuff is amazing, but the behavior of strings in phobos is weird. mini-quiz: what should std.range.drop(some_string, 1) do? hint: what it actually does is not what the documentation of phobos suggests*... Strings are arrays of char, but they appear like a lazy range of dchar to phobos. I could cope with the fact that this is a little unexpected for beginners. But well, that creates a lot of exceptions in phobos, like the fact that you can't even copy a char[] to a char[] with std.algorithm.copy. And I don't mention all the optimizations that are not/cannot be performed for those strings. I'll just remember to use ubyte[] wherever I can...

String handling in D is good modulo the oddities you noticed. What would make it perfect would be:

* Add property .rep that returns byte[], ushort[], or uint[] for char[], wchar[], dchar[] respectively (with the appropriate qualifier).
* Replace .length with .codeUnits.
* Disallow [n] and [m .. n]

This would upgrade D's strings from good to awesome. Really it would be a dream come true. Unfortunately it would also break most D code there is out there. I don't see how we can improve the current situation while staying backward compatible.

Andrei

From what I can see, the problem with D strings is that they are a 'magic' special case for arrays.

char[] should be an array of char, just like int[] is an array of int. If you have a T[] arr, then typeof(arr.front) should be T. This is what everyone would expect. char[] should essentially be the same as byte[], although char[] would be more natural for ASCII strings.

string should be something different, a separate type. As you say, disallowing [n] and [m..n] would be good, as they make no sense with a variable-length encoding (VLE). You could have .length and .codeUnits, but length would have to be O(n). That's not ideal, but since string wouldn't be an array, it doesn't need to have the same complexity guarantees.

Same for wchar[], dchar[], wstring and dstring.

Of course, making that change would break existing code. Maybe D3? :-)
Sep 21 2011
prev sibling parent reply travert phare.normalesup.org (Christophe Travert) writes:
Jonathan M Davis, in message (digitalmars.D:144896), wrote:
 On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
 Timon Gehr, in message (digitalmars.D:144889), wrote:
 unicode natively. Yet the 'D strings are strange and confusing' argument
 comes up quite often on the web.

Well, I think they are. The ptr+length stuff is amazing, but the behavior of strings in phobos is weird. mini-quiz: what should std.range.drop(some_string, 1) do? hint: what it actually does is not what the documentation of phobos suggests*...

What do you mean? It does exactly what it says that it does.

It does say it uses the slice operator if the range is sliceable, and the documentation of isSliceable fails to specify that a narrow string is not sliceable.
 Yeah, well, as long as char is a unicode code unit, that's the way that it 
 goes.

They are not unicode units.

import std.stdio;

void main()
{
    char a = 'ä';
    writeln(a);   // outputs: \344
    writeln('ä'); // outputs: ä
}

Obviously, a code unit doesn't fit in a char. Thus 'char[]' is not what the name claims it is. Unicode operations should be supported by a different class that is really a lazy range of dchar implemented as an underlying char[], with no length, index, or stride operator, and appropriate optimizations.
 In general, Phobos does a good job of using slicing and other
 optimizations on strings when it can in spite of the fact that they're not
 sliceable ranges, but there are cases where the fact that you _have_ to
 process them to be able to find the nth code point means that you just can't
 process them as efficiently as your typical array. That's life with a
 variable-length encoding - and that includes std.algorithm.copy. But that's
 an easy one to get around, since if you wanted to ignore unicode safety and
 just copy some chunk of the string, you can always just slice it directly
 with no need for copy.

Dealing with UTF-encoded strings is less efficient, but there are a number of algorithms that can be optimized for UTF-encoded strings, like copying or finding an ASCII char in a string. Unfortunately, there is no practical way to do this with the current range API. About copy, it's not that easy to overcome the problem if you are using a template, and that template happens to be instantiated for strings.
 * Please, could someone just add to the documentation of IsSliceable that
 narrow strings are an exception, like it was recently added to
 hasLength.

A good point.

The main point of my post actually. -- Christophe
Sep 21 2011
next sibling parent travert phare.normalesup.org (Christophe) writes:
"Simen Kjaeraas", in message (digitalmars.D:144921), wrote:
 What you are thinking about is a code point.

Yes, sorry. Then I disagree with "as long as char is a unicode code unit, that's the way that it goes", since myString.front should then return a code unit, whereas it actually returns a code point.
 Unicode operations should be supported by a different class that
 is really a lazy range of dchar implemented as an underlying char[], with
 no length, index, or stride operator, and appropriate optimizations.

I can agree with this, but the benefits over what we already have are nigh zilch.

I think no one here has any illusions about changing this in D2. However, this class could be introduced in phobos right now, without changing anything about string. It would simply be a bit safer than strings. -- Christophe
Sep 21 2011
prev sibling next sibling parent reply travert phare.normalesup.org (Christophe) writes:
"Jonathan M Davis", in message (digitalmars.D:144922), wrote:
 1. drop says nothing about slicing.
 2. popFrontN (which drop calls) says that it slices for ranges that support 
 slicing. Strings do not unless they're arrays of dchar.
 
 Yes, hasSlicing should probably be clearer about narrow strings, but that has 
 nothing to do with drop.

I never said there was a problem with drop.
 char a = 'ä';
 
 shouldn't even be legal. It's a compiler bug.

I figured that out. I wanted to show that a char couldn't hold a code point, but I was too fast and confused code points with code units.
 Dealing with UTF-encoded strings is less efficient, but there are a number
 of algorithms that can be optimized for UTF-encoded strings, like copying
 or finding an ASCII char in a string. Unfortunately, there is no
 practical way to do this with the current range API.

Maybe there should be a way for the designer of a class to provide an overload for some algorithms, like forwarding to myClass.algorithm for instance. The problem is that this opens the door to involuntary hacking. Oh, I just noticed I'm actually replying to myself. Thinking out loud, am I?
 [...]

After having read all of you, I have no problem with string being a lazy range of dchar. But I have a problem with immutable(char)[] being a lazy range of dchar (i.e. not being an array), and I have a problem with string being immutable(char)[] (i.e. providing length, opIndex and opSlice). Thanks -- Christophe
Sep 21 2011
parent reply travert phare.normalesup.org (Christophe) writes:
Jonathan M Davis, in message (digitalmars.D:144944), wrote:
 I never said there was a problem with drop.

Yes you did. You said: "mini-quiz: what should std.range.drop(some_string, 1) do? hint: what it actually does is not what the documentation of phobos suggests*..."

 If you have a better solution, please share it, but the fact that we want both 
 efficiency and correctness binds us pretty thoroughly here.

- char[], etc. being real arrays.
- strings being lazy ranges of dchar, providing access to the underlying char[].

Correctness of the language is better, since we don't have a T[] having a front method that returns something other than T, or a type that accepts opSlice but is not sliceable, etc.

Runtime correctness and efficiency are the same as the current ones, since the whole of phobos already considers strings as lazy ranges of dchar. It is even better, since the user cannot change an arbitrary code point in a string without explicitly asking for the underlying char[]. Optimizations can come the same way as they currently can, since the underlying char[] is accessible.

I can deal with strings the way they are, since they are a legacy. They are not perfect, and never will be unless computers become fat enough to treat dchar[] just as efficiently as char[]. I am also aware that phobos cannot be optimized for every case in the first place, and I can change my mind.

-- Christophe
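To make the two bullet points above concrete, here is a minimal sketch of such a wrapper. The type name Text, its constructor, the codeUnits property and the countCodePoints helper are all hypothetical and exist nowhere in phobos; only std.utf.decode, std.utf.stride and std.range.isInputRange are real library calls.

```d
import std.range : isInputRange;
import std.stdio : writeln;
import std.utf : decode, stride;

// Hypothetical type; the name Text and its members are illustrative only.
struct Text
{
    private string data;

    this(string s) { data = s; }

    @property bool empty() { return data.length == 0; }

    // Decode the first code point without consuming it.
    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }

    // Skip all code units of the first code point.
    void popFront() { data = data[stride(data, 0) .. $]; }

    // Explicit escape hatch to the underlying code units.
    @property string codeUnits() { return data; }
}

static assert(isInputRange!Text);
// No indexing and no length on the wrapper itself.
static assert(!__traits(compiles, Text.init[0]));
static assert(!__traits(compiles, Text.init.length));

// Walk the range manually and count the decoded code points.
size_t countCodePoints(Text t)
{
    size_t n = 0;
    for (; !t.empty; t.popFront())
        ++n;
    return n;
}

void main()
{
    auto t = Text("äb");
    writeln(countCodePoints(t));  // 2
    writeln(t.front);             // ä
    writeln(t.codeUnits.length);  // 3
}
```

Anything that needs O(1) length or slicing has to go through codeUnits explicitly, which is the safety argument made above.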
Sep 21 2011
parent travert phare.normalesup.org (Christophe) writes:
"Jonathan M Davis", in message (digitalmars.D:144962), wrote:
 - char[], etc. being real arrays.

Which is actually arguably a _bad_ thing, since it doesn't generally make sense to operate on individual chars. What you really want 99.99999999999% of the time is code points not code units.

Well, char could also disappear in favor of ubyte, but that will confuse even more people.
 If we were to start over again, that may very well be the way that 
 we'd go, but the added benefits just don't outweigh the immense amount 
 of code breakage which would result.

I 100% agree with that. -- Christophe
Sep 22 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/21/11 1:20 PM, Christophe Travert wrote:
 Dealing with UTF-encoded strings is less efficient, but there are a number
 of algorithms that can be optimized for UTF-encoded strings, like copying
 or finding an ASCII char in a string. Unfortunately, there is no
 practical way to do this with the current range API.

I'd love to hear more about that. The standard library does optimize certain algorithms for UTF strings. Andrei
Sep 21 2011
parent reply travert phare.normalesup.org (Christophe Travert) writes:
Andrei Alexandrescu, in message (digitalmars.D:144936), wrote:
 On 9/21/11 1:20 PM, Christophe Travert wrote:
 Dealing with UTF-encoded strings is less efficient, but there are a number
 of algorithms that can be optimized for UTF-encoded strings, like copying
 or finding an ASCII char in a string. Unfortunately, there is no
 practical way to do this with the current range API.

I'd love to hear more about that. The standard library does optimize certain algorithms for UTF strings.

Well, in that other thread called "Re: toUTFz and WinAPI GetTextExtentPoint32W/" in D.learn (what is the proper way to refer to a message here?), I showed how to improve walkLength for strings and utf.stride.

About finding a character in a string, rather than relying on string.popFront, which makes the loop un-unrollable, we could search code unit per code unit directly. This is obviously better for ASCII chars, and I'll be looking for a nice idea for other code points (besides using find(Range, Range)).

I didn't review phobos with that idea in mind, and didn't do any benchmark except the one for walkLength, but using string.popFront is a bad idea in terms of performance, so workarounds are often better, and they are not that hard to find. I may do that when I have more time to give to D.

-- Christophe
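A sketch of the code-unit-level approach described above. This is not the code from the D.learn thread (which is not reproduced here); the function names indexOfAscii and countCodePoints are made up, and the counting trick is one common technique that assumes the input is valid UTF-8.

```d
import std.stdio : writeln;

// Index of the first occurrence of the ASCII character c in s, or -1.
// Bytes below 0x80 never appear inside a multi-byte UTF-8 sequence,
// so comparing raw code units is safe here.
ptrdiff_t indexOfAscii(string s, char c)
{
    assert(c < 0x80, "only valid for ASCII needles");
    foreach (i, char u; s) // iterates code units, no decoding
    {
        if (u == c)
            return i;
    }
    return -1;
}

// Count code points by counting every code unit that is not a
// continuation byte (bit pattern 10xxxxxx). Assumes s is valid UTF-8.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (char u; s)
    {
        if ((u & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

void main()
{
    writeln(indexOfAscii("häh!", '!')); // 4 (a code unit index)
    writeln(countCodePoints("häh!"));   // 4 (code points)
}
```

Neither loop calls popFront or decodes anything, so both are simple byte loops over the underlying array.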
Sep 21 2011
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/21/11 3:26 PM, Christophe Travert wrote:
 Andrei Alexandrescu, in message (digitalmars.D:144936), wrote:
 On 9/21/11 1:20 PM, Christophe Travert wrote:
 Dealing with UTF-encoded strings is less efficient, but there are a number
 of algorithms that can be optimized for UTF-encoded strings, like copying
 or finding an ASCII char in a string. Unfortunately, there is no
 practical way to do this with the current range API.

I'd love to hear more about that. The standard library does optimize certain algorithms for UTF strings.

Well, in that other thread called "Re: toUTFz and WinAPI GetTextExtentPoint32W/" in D.learn (what is the proper way to refer to a message here ?), I showed how to improve walkLength for strings and utf.stride.

Interesting, thanks.
 About finding a character in a string, rather than relying
 on string.popFront, which makes the loop un-unrollable,
 we could search code unit per code unit directly. This is obviously
 better for ASCII chars, and I'll be looking for a nice idea for other
 code points (besides using find(Range, Range)).

 I didn't review phobos with that idea in mind, and didn't do any
 benchmark except the one for walkLength, but using string.popFront is a
 bad idea in terms of performance, so workarounds are often better, and
 they are not that hard to find. I may do that when I have more time to
 give to D.

That sounds great. Looking forward to your pull requests! Andrei
Sep 21 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
 Timon Gehr, in message (digitalmars.D:144889), wrote:
 unicode natively. Yet the 'D strings are strange and confusing' argument
 comes up quite often on the web.

Well, I think they are. The ptr+length stuff is amazing, but the behavior of strings in phobos is weird. mini-quiz: what should std.range.drop(some_string, 1) do? hint: what it actually does is not what the documentation of phobos suggests*...

What do you mean? It does exactly what it says that it does.
 Strings are arrays of char, but they appear like a lazy range of dchar to
 phobos. I could cope with the fact that this is a little unexpected for
 beginners. But well, that creates a lot of exceptions in phobos, like
 the fact that you can't even copy a char[] to a char[] with
 std.algorithm.copy. And I don't mention all the optimizations that are
 not/cannot be performed for those strings. I'll just remember to use
 ubyte[] wherever I can...

Yeah, well, as long as char is a unicode code unit, that's the way that it goes. In general, Phobos does a good job of using slicing and other optimizations on strings when it can in spite of the fact that they're not sliceable ranges, but there are cases where the fact that you _have_ to process them to be able to find the nth code point means that you just can't process them as efficiently as your typical array. That's life with a variable-length encoding - and that includes std.algorithm.copy. But that's an easy one to get around, since if you wanted to ignore unicode safety and just copy some chunk of the string, you can always just slice it directly with no need for copy.
 * Please, could someone just add to the documentation of IsSliceable that
 narrow strings are an exception, like it was recently added to
 hasLength.

A good point. - Jonathan M Davis
Sep 21 2011
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2011-09-21 15:52, Timon Gehr wrote:
 On 09/21/2011 09:37 AM, bearophile wrote:
 Andrei Alexandrescu:

 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue.

I think the Wikipedia D page needs to be rewritten, leaving 80-90% of its space to D (meaning D2).

Yes, that is important. Wikipedia is usually the first place people go looking for information, and much of the information given there is horribly outdated/wrong and mostly only concerns. Many people think what's on Wikipedia is true. [citation needed] "For performance reasons, string slicing and the length property operate on code units rather than code points (characters), which frequently confuses developers.[27]" The link at [27] only says that many programmers who haven't had to handle unicode have trouble understanding how unicode works initially. It is not a D thing in any other way than that D actually supports unicode natively. Yet the 'D strings are strange and confusing' argument comes up quite often on the web, probably because many feel they are competent enough to discuss the language after having read the Wikipedia article.

Ruby pre 1.9 behaves similarly. Well, actually Ruby pre 1.9 is not encoding aware at all, if I recall correctly. -- /Jacob Carlborg
Sep 21 2011
prev sibling parent "Simen Kjaeraas" <simen.kjaras gmail.com> writes:
On Wed, 21 Sep 2011 20:20:55 +0200, Christophe Travert
<travert phare.normalesup.org> wrote:

 Yeah, well, as long as char is a unicode code unit, that's the way that
 it goes.

They are not unicode units.

import std.stdio;

void main()
{
    char a = 'ä';
    writeln(a);   // outputs: \344
    writeln('ä'); // outputs: ä
}

Obviously, a code unit doesn't fit in a char. Thus 'char[]' is not what the name claims it is.

Oh, it absolutely is. According to the Unicode Consortium, a code unit is "The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form [...]". What you are thinking about is a code point.
 Unicode operations should be supported by a different class that
 is really a lazy range of dchar implemented as an underlying char[], with
 no length, index, or stride operator, and appropriate optimizations.

I can agree with this, but the benefits over what we already have are nigh zilch.

-- Simen
Sep 21 2011
prev sibling next sibling parent Graham Fawcett <fawcett uwindsor.ca> writes:
On Wed, 21 Sep 2011 11:39:03 -0500, Andrei Alexandrescu wrote:

 On 9/21/11 10:16 AM, Christophe wrote:
 Timon Gehr, in message (digitalmars.D:144889), wrote:
 unicode natively. Yet the 'D strings are strange and confusing'
 argument comes up quite often on the web.

Well, I think they are. The ptr+length stuff is amazing, but the behavior of strings in phobos is weird. mini-quiz: what should std.range.drop(some_string, 1) do? hint: what it actually does is not what the documentation of phobos suggests*... Strings are arrays of char, but they appear like a lazy range of dchar to phobos. I could cope with the fact that this is a little unexpected for beginners. But well, that creates a lot of exceptions in phobos, like the fact that you can't even copy a char[] to a char[] with std.algorithm.copy. And I don't mention all the optimizations that are not/cannot be performed for those strings. I'll just remember to use ubyte[] wherever I can...

String handling in D is good modulo the oddities you noticed. What would make it perfect would be:

* Add property .rep that returns byte[], ushort[], or uint[] for char[], wchar[], dchar[] respectively (with the appropriate qualifier).
* Replace .length with .codeUnits.
* Disallow [n] and [m .. n]

This would upgrade D's strings from good to awesome. Really it would be a dream come true. Unfortunately it would also break most D code there is out there. I don't see how we can improve the current situation while staying backward compatible.

Andrei

1. Let "string" remain an alias for "immutable(char)[]", and introduce a new struct, "text!charType", that does the awesome stuff; provide good conversion routines and casts between text instances and old-school strings.
2. Provide an awesome "std.text" library for the text type and related operations, absorbing std.uni and other similar modules. Adapt Phobos to take "text" in virtually every function that currently expects a "string".
3. ...
4. Profit.

Graham
Sep 21 2011
prev sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
On 21.09.2011 at 16:24, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:

 On 9/21/11 8:52 AM, Timon Gehr wrote:
 On 09/21/2011 09:37 AM, bearophile wrote:
 Andrei Alexandrescu:

 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue.

I think the Wikipedia D page needs to be rewritten, leaving 80-90% of
 its space to D (meaning D2).

Yes, that is important. Wikipedia is usually the first place people go
 looking for information, and much of the information given there is
 horribly outdated/wrong and mostly only concerns.

Agreed. Does anyone volunteer for fixing D's Wikipedia page? Andrei

Is anyone uninvolved enough to be objective and involved enough to know what they write?

Timon, I think you are exaggerating a bit. It is not mostly only concerns, but I agree they have bold headers, and other language pages like Java or C++ lack this section entirely. Now it would certainly make people suspicious if the section disappeared overnight. I would reduce the font weight of the headers in that section, remove the talk about UTF-8 string handling, and at some point in time move the 'library split' issue to a historical section about D1. I think the focus on x86 is also a valid concern.

Here are some statistics I collected, for fun:

The Català version also says the following:
- It is unstable and unsuitable for production environments (version 0.140)

The Català and Galego versions say:
- The only documentation is the official specification

The following languages are direct translations of (or the source for) the English version. Maybe their authors can be contacted and are willing to update their language version after the change:
- Arabic
- Español
- Polski (not the exact same sections, probably translated from an older version)

The Italian version has the most impressive features list with 14 sections about characteristics!

The Latin version is blowing my mind, just because people who use a long-dead language would write code in a new one that has to do with computers:

import tango.io.Console;

int main(char[][] args)
{
    Cout("salve munde!");
    return 0;
}

Many language pages are hopelessly outdated, but speakers of 'minority' languages will look for an English article anyway.
Sep 21 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 9/21/11, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 http://hackerne.ws/item?id=3014861

 Apparently we're still having a PR issue.

I really doubt this by now. We've mentioned a million times that there are no dual standard libraries. As far as I can tell, this argument keeps being re-introduced by trolls.
Sep 21 2011
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, September 21, 2011 11:20 Christophe Travert wrote:
 Jonathan M Davis , dans le message (digitalmars.D:144896), a écrit :
 On Wednesday, September 21, 2011 15:16:33 Christophe wrote:
 Timon Gehr , dans le message (digitalmars.D:144889), a écrit :
 unicode natively. Yet the 'D strings are strange and confusing'
 argument comes up quite often on the web.

Well, I think they are. The ptr+length stuff is amazing, but the behavior of strings in phobos is weird. mini-quiz: what should std.range.drop(some_string, 1) do ? hint: what it actually does is not what the documentation of phobos suggests*...

What do you mean? It does exactly what it says that it does.

It does say it uses the slice operator if the range is sliceable, and the documentation of isSliceable fails to specify that a narrow string is not sliceable.

1. drop says nothing about slicing. 2. popFrontN (which drop calls) says that it slices for ranges that support slicing. Strings do not unless they're arrays of dchar. Yes, hasSlicing should probably be clearer about narrow strings, but that has nothing to do with drop.
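
A minimal sketch of the behaviour described here, assuming a reasonably current Phobos where `import std.range;` provides both `drop` and `hasSlicing`. The helper `dropFirstCodePoint` is a hypothetical name used only for illustration, and the literal uses a `\u` escape so the example does not depend on the source file's encoding.

```d
import std.range;   // drop, hasSlicing

// Phobos does not treat narrow strings as sliceable ranges;
// only UTF-32 strings qualify.
static assert(!hasSlicing!string);
static assert(!hasSlicing!wstring);
static assert(hasSlicing!dstring);

/// Hypothetical helper: skips the first code point of `s`.
string dropFirstCodePoint(string s)
{
    // For a narrow string this cannot slice, so the front is decoded and popped.
    return s.drop(1);
}

void main()
{
    string s = "\u00E4bc";                  // U+00E4 takes two UTF-8 code units
    assert(s.length == 4);                  // length counts code units
    assert(dropFirstCodePoint(s) == "bc");  // one code point (two units) removed
    assert(s[1 .. $] != "bc");              // a raw slice cuts the character in half
}
```

The raw slice `s[1 .. $]` still compiles, because `string` is an ordinary array; it is only the range view that refuses to slice.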
 Yeah, well, as long as char is a unicode code unit, that's the way that
 it goes.

They are not unicode units.

import std.stdio;

void main()
{
    char a = 'ä';
    writeln(a);   // outputs: \344
    writeln('ä'); // outputs: ä
}

Obviously, a code unit doesn't fit in a char. Thus 'char[]' is not what the name claims it is. Unicode operations should be supported by a different class that is really a lazy range of dchar implemented on top of an underlying char[], with no length, index, or stride operator, and appropriate optimizations.

The problem with char a = 'ä'; is completely separate from char[]. That has to do with the fact that the compiler isn't properly dealing with narrowing conversions for an individual character. A char is _by definition_ a UTF-8 code unit. char a = 'ä'; shouldn't even be legal. It's a compiler bug. Most code which operates on individual chars or wchars is buggy. dchar is what should be used for individual characters. The code you give is buggy because it's trying to use char as an individual character, and the compiler has a bug, so it doesn't catch it.
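
To make the code unit versus code point distinction concrete, here is a small sketch. The helper names `codeUnits` and `codePoints` are hypothetical, and `walkLength` is assumed to come from `std.range` as in current Phobos.

```d
import std.range;   // walkLength

static assert(char.sizeof == 1 && wchar.sizeof == 2 && dchar.sizeof == 4);

/// Hypothetical helper: number of UTF-8 code units in `s`.
size_t codeUnits(string s) { return s.length; }

/// Hypothetical helper: number of code points in `s` (walks and decodes the string).
size_t codePoints(string s) { return s.walkLength; }

void main()
{
    string  narrow = "\u00E4";      // U+00E4 encoded as UTF-8
    dstring wide   = "\u00E4"d;     // the same character as UTF-32

    assert(codeUnits(narrow) == 2);     // two chars are needed to hold it
    assert(codePoints(narrow) == 1);    // but it is a single code point
    assert(narrow[0] == 0xC3 && narrow[1] == 0xA4); // neither unit is a character on its own

    assert(wide.length == 1);
    dchar c = wide[0];                  // a dchar can hold the whole character
    assert(c == 0xE4);
}
```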
 In general, Phobos does a good job of using slicing and other
 optimizations on strings when it can in spite of the fact that they're
 not sliceable ranges, but there are cases where the fact that you _have_
 to process them to be able to find the nth code point means that you
 just can't process them as efficiently as your typical array. That's
 life with a variable- length encoding. - and that includes
 std.algorithm.copy. But that's an easy one to get around, since if you
 wanted to ignore unicode safety and just copy some chunk of the string,
 you can always just slice it directly with no need for copy.

Dealing with UTF-encoded strings is less efficient, but there are a number of algorithms that can be optimized for UTF-encoded strings, like copying or finding an ASCII char in a string. Unfortunately, there is no practical way to do this with the current range API. About copy, it's not that easy to overcome the problem if you are using a template, and that template happens to be instantiated for strings.

In general, you _must_ treat strings as ranges of dchar if you want your code to be correct with regards to code points. So, that's the default. If you know that your particular algorithm can be optimized for strings without treating them strictly as ranges of dchar and still deal with unicode correctly (e.g. by slicing them, because you know the correct point to slice to), then you special-case your template for narrow strings. Phobos does that in a number of places. But you _have_ to special case it because being able to do so depends entirely on your algorithm. You can't treat strings that way in the general case, or your code is not going to handle unicode correctly.

No, the way that strings are handled in D is not perfect. But we're trying to balance correctness and efficiency, and it does a good job of that. Using strings as ranges of dchar is correct, and in a large number of cases, it is as efficient as you're going to get (perhaps particular functions could be better optimized, but the API can't be made more efficient). And in the cases where you know you can safely get better efficiency out of strings by special casing them, that's what you do. It works. It works fairly well. And no one has been able to come up with a solution that's definitely better.

If there's an issue, it's that the message on how to correctly handle strings is not necessarily being communicated as well as it needs to be. In a few places, the documentation could probably be improved, but what we probably really need in order to help get the message across is more articles on ranges and strings as ranges. I've actually partially written an article on that, but due to some compiler bugs regarding std.container, I ended up temporarily shelving it.

- Jonathan M Davis
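
As an illustration of special-casing a template for narrow strings, here is a rough sketch rather than anything taken from Phobos. The function `countAscii` is a hypothetical example; `isNarrowString` is assumed to come from `std.traits` and the range primitives from `std.range`.

```d
import std.range;                   // isInputRange, empty, front, popFront
import std.traits : isNarrowString;

/// Hypothetical example: counts how often the ASCII character `needle` occurs in `r`.
size_t countAscii(R)(R r, char needle)
if (isInputRange!R)
{
    assert(needle < 0x80, "needle must be ASCII");
    size_t n;
    static if (isNarrowString!R)
    {
        // Fast path: in UTF-8 and UTF-16 every unit of a multi-unit sequence
        // is >= 0x80, so an ASCII needle can only match a whole character.
        // That makes it safe to scan code units without decoding.
        foreach (i; 0 .. r.length)
            if (r[i] == needle)
                ++n;
    }
    else
    {
        // General path: walk the range element by element.
        for (; !r.empty; r.popFront())
            if (r.front == needle)
                ++n;
    }
    return n;
}

void main()
{
    assert(countAscii("\u00E4b\u00E4b", 'b') == 2);   // UTF-8, fast path
    assert(countAscii("\u00E4b\u00E4b"w, 'b') == 2);  // UTF-16, fast path
    assert(countAscii("\u00E4b\u00E4b"d, 'b') == 2);  // UTF-32, general path
}
```

The fast path is only valid because of a property of this particular algorithm (searching for an ASCII value), which is the point being made: whether code units can be used directly depends on the algorithm, not on the string type.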
Sep 21 2011
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, September 21, 2011 19:56:47 Christophe wrote:
 "Jonathan M Davis" , dans le message (digitalmars.D:144922), a écrit :
 1. drop says nothing about slicing.
 2. popFrontN (which drop calls) says that it slices for ranges that
 support slicing. Strings do not unless they're arrays of dchar.
 
 Yes, hasSlicing should probably be clearer about narrow strings, but
 that has nothing to do with drop.

I never said there was a problem with drop.

Yes you did. You said: "mini-quiz: what should std.range.drop(some_string, 1) do ? hint: what it actually does is not what the documentation of phobos suggests*..."
 After having read all of you, I have no problems with string being a
 lazy range of dchar. But I have a problem with immutable(char)[] being a
 lazy range of dchar (ie not being an array), and I have a problem with
 string being immutable(char)[] (ie providing length opIndex and
 opSlice).

For efficiency, you need to be able to treat strings as arrays of code units for some algorithms. For correctness, you need to be able to treat them as ranges of code points (dchar) in the general case. You need both. The question is how to provide that. strings as arrays came first (D1), whereas ranges came later. We _need_ to treat strings as ranges of dchar or they're essentially unusable in the general case. Operating on code units is almost always _wrong_. So, when we added the range functions, we special-cased them for strings so that strings are treated as ranges of dchar as they need to be. And in cases where you actually need to treat a string as an array of code units for efficiency, you special case the function for them, and you still get that. What other way would you do it?

There _are_ some edges here - such as foreach defaulting to char for string when dchar is really what you should be iterating with - and there are times when you want to use a string with a range-based function and can't, because it needs a random-access range or one which is sliceable to do what it does, which can be annoying. But what else can you do there? You can't treat the string as a range of code units in that case. The result would be completely wrong. Imagine if sort worked on a char[]. You'd get an array of sorted code units, which would _not_ be code points, and which would be completely useless. So, treating a string as a range of code units makes no sense.

We could switch to having a struct of some kind which was a string, make it a range of dchar, and have it contain an array of char, wchar, or dchar internally. It would have to restrict its operations in exactly the same manner that the range functions for strings currently do, so the exact same algorithms would or wouldn't work with it. And then you'd need to provide access to the underlying array of code units so that algorithms special casing strings could operate on the array instead. Ultimately, it's pretty much the same thing, except now you have a wrapper struct. How does that buy you anything? The _only_ thing that it would buy you AFAIK is that foreach would then default to dchar instead of the code unit type. The basic problem still exists. You still need to special case strings for efficiency, and you still need to treat them as a range of dchar in the general case. It's an inherent issue with variable length encodings. You can't just magically make it go away.

If you have a better solution, please share it, but the fact that we want both efficiency and correctness binds us pretty thoroughly here.

- Jonathan M Davis
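
The foreach and sort points can be checked with a short sketch. The helper names below are hypothetical, and `std.array.array` is assumed to turn a narrow string into a `dchar[]`, as it does in current Phobos.

```d
import std.algorithm : sort;
import std.array : array;

/// Hypothetical helper: iterations of a foreach with the inferred element type.
size_t iterationsDefault(string s)
{
    size_t n;
    foreach (c; s)                  // inferred as a char: one step per code unit
    {
        static assert(c.sizeof == char.sizeof);
        ++n;
    }
    return n;
}

/// Hypothetical helper: iterations of a foreach that asks for dchar.
size_t iterationsDchar(string s)
{
    size_t n;
    foreach (dchar c; s)            // decodes: one step per code point
        ++n;
    return n;
}

/// Hypothetical helper: sorts by code point after converting to UTF-32.
dstring sortedCodePoints(string s)
{
    auto d = s.array;               // dchar[] for a narrow string
    sort(d);
    return d.idup;
}

// sort rejects char[] outright, since it is not a random-access range to Phobos.
static assert(!__traits(compiles, sort((char[]).init)));

void main()
{
    assert(iterationsDefault("\u00E4bc") == 4);
    assert(iterationsDchar("\u00E4bc") == 3);
    assert(sortedCodePoints("c\u00E4a") == "ac\u00E4"d);
}
```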
Sep 21 2011
prev sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, September 21, 2011 14:08 Christophe wrote:
 Jonathan M Davis , dans le message (digitalmars.D:144944), a écrit :
 I never said there was a problem with drop.

Yes you did. You said: "mini-quiz: what should std.range.drop(some_string, 1) do ? hint: what it actually does is not what the documentation of phobos

^^^^^^^^^^^^^^^^^^^^^^^
 suggests*..."

not documentation of drop.

You weren't specific enough to make it clear what you meant. It looked like you were complaining about drop's documentation.
 If you have a better solution, please share it, but the fact that we want
 both efficiency and correctness binds us pretty thoroughly here.

- char[], etc. being real arrays.

Which is actually arguably a _bad_ thing, since it doesn't generally make sense to operate on individual chars. What you really want 99.99999999999% of the time is code points not code units.
 - strings being lazy ranges of dchar, providing access to underlying
 char[].
 
 Correctness of the language is better, since we don't have a T[] having a
 front method that returns something else than T, or a type that accepts
 opSlice but is not sliceable, etc.
 
 Runtime correctness and efficiency are the same as the current ones,
 since the whole phobos already considers strings as lazy range of dchar.
 It is even better, since the user cannot change an arbitrary code point
 in a string without explicitly asking for the underlying char[].
 Optimizations can come the same way as they currently can, since the
 underlying char is accessible.
 
 I can deal with strings the way they are, since they are an heritage.
 They are not perfect, and will never be unless computers become fat
 enough to treat dchar[] just as efficiently as char[]. I am also aware
 that phobos cannot be optimized for every cases in the first place, and
 I can change my mind.
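
For readers trying to picture the proposal quoted above, a wrapper of that kind might look roughly like the following. This is only a sketch: `Utf8String`, `codeUnits` and `countCodePoints` are hypothetical names, nothing like this exists in Phobos, and `decode` and `stride` are assumed to be the `std.utf` functions of current Phobos.

```d
import std.range;               // isInputRange
import std.utf : decode, stride;

/// Hypothetical wrapper: an input range of dchar over UTF-8 code units,
/// with no length, indexing or slicing of its own.
struct Utf8String
{
    private string units;

    this(string s) { units = s; }

    @property bool empty() { return units.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(units, i);    // decode the first code point
    }

    void popFront()
    {
        units = units[stride(units, 0) .. $];   // skip one whole code point
    }

    /// Explicit access to the underlying array for special-cased algorithms.
    @property string codeUnits() { return units; }
}

static assert(isInputRange!Utf8String);
static assert(is(typeof(Utf8String.init.front) == dchar));
static assert(!__traits(compiles, Utf8String.init[0]));      // no opIndex
static assert(!__traits(compiles, Utf8String.init.length));  // no length

/// Hypothetical helper: foreach over the wrapper yields dchar by default.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (c; Utf8String(s))
    {
        static assert(is(typeof(c) == dchar));
        ++n;
    }
    return n;
}

void main()
{
    assert(countCodePoints("\u00E4bc") == 3);
    assert(Utf8String("\u00E4bc").codeUnits.length == 4);
}
```

Any algorithm that wants the fast path still has to go through `codeUnits`, so the same special-casing is needed as with plain strings.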

So, essentially you're arguing for a wrapper around arrays of code units. That does add some benefits (such as making foreach default to dchar), but ultimately doesn't add that much additional benefit (it also makes dealing with array literals much more interesting). If we were to start over again, that may very well be the way that we'd go, but the added benefits just don't outweigh the immense amount of code breakage which would result. Maybe the situation will change with D3, but at this point, I think that we've done a fairly good job of making it possible to treat strings as ranges. - Jonathan M Davis
Sep 21 2011