digitalmars.D - Higher level built-in strings

reply bearophile <bearophileHUGS lycos.com> writes:
This odd post comes from reading the nice part about strings of chapter 4 of
TDPL. In the last few years I have seen changes in how D strings are meant and
managed, changes that make them less and less like arrays (random-access
sequences of mutable code units) and more and more what they are at high level
(immutable bidirectional sequences of code points).

So a final jump is to make string types something different from normal arrays.
This frees them to behave better as high level strings. Probably other people
have had similar ideas, so if you think this post is boring or useless, please
ignore it.

So strings can have some differences compared to arrays:

1) length returns the number of code points, but it's immutable and stored in
the string, so it's an O(1) operation (it returns what std.utf.count() returns).

2) Assigning a string to another one is allowed:
string s1 = s2;
But changing the length manually is not allowed:
s1.length += 2; // error
This makes them closer to immutable, more similar to Python strings.

3) Another immutable string-specific attribute can be added that returns the
number of code units in the string, for example .codeunits or .nunits or
something similar.

4) s1[i] is not O(1), it's generally slower, and returns the i-th code point
(there are ways to speed this operation up to O(log n) with some internal
indexes). (Code points in dstrings can be accessed in O(1)).

5) foreach(c; s1) yields code points (dchars) regardless of whether s1 is a
string, dstring or wstring.
But you can use foreach(char c; s1) if s1 is a string and you are sure s1 uses
only 7 bit chars. But in such cases you can also use an immutable(ubyte)[].
Python3 does something similar: it has a str type that's always UTF16 (or
UTF32) and a bytes type that is similar to a ubyte[]. I think D can do
something similar. std.string functions can be made to work on ubyte[] and
immutable(ubyte)[] too.
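
For reference, points 1 and 5 can be compared against what the language does today with a minimal sketch. The helper names unitsSeen and pointsSeen are made up for this illustration; std.utf.count is the Phobos function mentioned in point 1.

```d
import std.utf : count;

/// Number of iterations of a foreach whose element type is inferred.
size_t unitsSeen(string s)
{
    size_t n = 0;
    foreach (c; s)          // c is a UTF-8 code unit here
        ++n;
    return n;
}

/// Number of iterations when foreach is explicitly asked for dchar.
size_t pointsSeen(string s)
{
    size_t n = 0;
    foreach (dchar c; s)    // foreach decodes one code point per step
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo";   // U+00E9 takes two UTF-8 code units
    assert(s.length == 6);     // .length counts code units
    assert(count(s) == 5);     // std.utf.count counts code points
    assert(unitsSeen(s) == 6 && pointsSeen(s) == 5);
}
```

Under the proposal, s.length and the untyped foreach would report 5 instead of 6 for this string.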


Some more things, I am not sure about them:

6) Strings can contain their hash value. This field is initialized with an empty
value (like -1) and computed lazily on demand (as done in Python). This can
make them faster when they are often put inside associative arrays and sets,
because it avoids computing their hash value again and again. So strings are
not fully immutable, because this value gets initialized. But it's a pure
value, determined only by the immutable contents of the string, so I don't
think this can cause big problems in multi-thread programs. If two threads
update it, they find the same result.

7) validation (std.utf.validate(), or a cast) can be required to create a
string. This means that the type string/dstring/wstring implies it's validated
:-)

8) If strings are immutable, then the append can always create a new string. So
the extra information at the end of the memory block (used for appending to
dynamic arrays) is not necessary.

9)
- Today memory, even RAM, is cheap, but moving RAM to the CPU caches is not so
fast, so a program that works on just the cache is faster.
- UTF8 and UTF16 make strings bidirectional ranges.
- UTF encodings are just data compression, but it's not so efficient.
So a smarter compression scheme can compress strings more in memory, and the
decrease in cache misses can compensate for the increased CPU work to
decompress them. (But keeping strings compressed can turn them from
bidirectional ranges to forward ranges, this is not so bad).
There is a compressor that gives a decompression speed of about 3 times slower
than memcpy():
http://www.oberhumer.com/opensource/lzo/
LZO can be used transparently to compress strings in RAM when strings become
long enough.
Hash computation and equality tests done among compressed strings are faster.


Turning strings into something different from arrays looks like a loss, but in
practice they already are not arrays. Thinking about them as normal arrays is
not so useful and it's UTF-unsafe, and using [] to read the code units doesn't
seem so useful. Code that manages strings/wstrings as normal arrays is not
correct in general.

Bye,
bearophile
Jul 18 2010
next sibling parent reply %u <ae au.com> writes:
 5) foreach(c; s1) yields a code point dchars regardless if s1 is string,
 dstring, wstring.
 But you can use foreach(char c; s1) if s1 is a string and you are sure s1 uses
 only 7 bit chars. But in such cases you can also use an immutable(ubyte)[].
 Python3 does something similar, it has a str type that's always UTF16 (or
 UTF32) and a bytes type that is similar to a ubyte[]. I think D can do
 something similar. std.string functions can be made to work on ubyte[] and
 immutable(ubyte)[] too.
Actually I think D should deprecate char and wchar in user code except inside externals for compatibility with C and Windows functions. When you are working with individual characters you almost always want either a dchar or a byte.
Jul 19 2010
parent bearophile <bearophileHUGS lycos.com> writes:
%u:
 When you are working with individual characters you almost always want either a
 dchar or a byte.
A dchar and a ubyte are better. This is almost what Python3 does and I think it can be good.

Maybe other people will give their opinions about my original post.

Bye,
bearophile
Jul 19 2010
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
bearophile wrote:
 This odd post comes from reading the nice part about strings of chapter 4 of
 TDPL. In the last few years I have seen changes in how D strings are meant
 and managed, changes that make them less and less like arrays (random-access
 sequences of mutable code units) and more and more what they are at high
 level (immutable bidirectional sequences of code points).
Strings in D are deliberately meant to be arrays, not special things. Other languages make them special because they have insufficiently powerful arrays.

As for indexing by code point, I also believe this is a mistake. It is proposed often, but overlooks:

1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices.

2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code.

3. the user can always layer a code point interface over the strings, but going the other way is not so practical.
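
Point 1 can be illustrated with a small sketch. The function name after is invented for this example; std.string.indexOf is the real Phobos search function and returns a code unit index.

```d
import std.string : indexOf;

/// Returns the part of `text` that follows the first occurrence of `key`,
/// or null if `key` does not occur.
string after(string text, string key)
{
    auto i = text.indexOf(key);   // code unit index, -1 if not found
    return i < 0 ? null : text[i + key.length .. $];
}

void main()
{
    // The non-ASCII characters before '=' do not matter: the index and the
    // slice bounds are both in code units, so nothing has to be decoded and
    // the slice still starts on a valid UTF-8 boundary.
    assert(after("na\u00EFve=caf\u00E9", "=") == "caf\u00E9");
    assert(after("no separator", "=") is null);
}
```

The search and the slice never look at code points, which is the efficiency argument made in points 1 and 2.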
Jul 19 2010
next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Walter Bright (newshound2 digitalmars.com)'s article
 bearophile wrote:
 This odd post comes from reading the nice part about strings of chapter 4 of
 TDPL. In the last few years I have seen changes in how D strings are meant
 and managed, changes that make them less and less like arrays (random-access
 sequences of mutable code units) and more and more what they are at high
 level (immutable bidirectional sequences of code points).
 Strings in D are deliberately meant to be arrays, not special things. Other
 languages make them special because they have insufficiently powerful arrays.
 As for indexing by code point, I also believe this is a mistake. It is
 proposed often, but overlooks:
 1. most string operations, such as copying and searching, even regular
 expressions, work just fine using regular indices.
 2. doing the operations in (1) using code points and having to continually
 decode the strings would result in disastrously slow code.
 3. the user can always layer a code point interface over the strings, but
 going the other way is not so practical.
4. Sometimes one can make valid assumptions about the contents of a string. For example, in an internal utility app that will never be internationalized you may get away with assuming a character is an ASCII byte. If you know your input will be in the Basic Multilingual Plane (for example if working with pre-sanitized input), you can use wstrings and always assume a character is 2 bytes.

5. For dchar strings, a code unit equals a code point. Should the interface for dchar strings be completely different than that for char and wchar strings?
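
Points 4 and 5 can be checked with a short sketch. It assumes current Phobos behaviour, where walkLength from std.range decodes narrow strings; the sample text is arbitrary.

```d
import std.range : isRandomAccessRange, walkLength;

// For dstring a code unit is a code point, so Phobos keeps random access.
static assert(isRandomAccessRange!dstring);
// wstring is still a narrow string as far as Phobos is concerned.
static assert(!isRandomAccessRange!wstring);

void main()
{
    string  s = "h\u00E9llo";
    wstring w = "h\u00E9llo"w;
    dstring d = "h\u00E9llo"d;

    assert(s.length == 6);   // UTF-8: U+00E9 needs two code units
    assert(w.length == 5);   // UTF-16: every BMP character is one code unit
    assert(d.length == 5);   // UTF-32: always one code unit per code point

    // Counting by decoding gives the same answer for all three.
    assert(s.walkLength == 5 && w.walkLength == 5 && d.walkLength == 5);
}
```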
Jul 19 2010
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 19 Jul 2010 16:04:21 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 bearophile wrote:
 This odd post comes from reading the nice part about strings of chapter  
 4 of
 TDPL. In the last few years I have seen changes in how D strings are  
 meant
 and managed, changes that make them less and less like arrays  
 (random-access
 sequences of mutable code units) and more and more what they are at high
 level (immutable bidirectional sequences of code points).
Strings in D are deliberately meant to be arrays, not special things. Other languages make them special because they have insufficiently powerful arrays.
Andrei is changing that. Already, isRandomAccessRange!(string) == false. I kind of don't like this direction, even though it's clever. What you end up with is phobos refusing to believe that a string or char[] is an array, but the compiler saying it is.

What I'd prefer is something where the compiler types string literals as string, a type defined by phobos which contains as its first member an immutable(char)[] (where the compiler puts the literal). Then we can properly limit the other operations.
 As for indexing by code point, I also believe this is a mistake. It is  
 proposed often, but overlooks:

 1. most string operations, such as copying and searching, even regular  
 expressions, work just fine using regular indices.

 2. doing the operations in (1) using code points and having to  
 continually decode the strings would result in disastrously slow code.

 3. the user can always layer a code point interface over the strings,  
 but going the other way is not so practical.
I agree here. Anything that uses indexing to perform a linear operation is bound for the scrap heap. But what about this:

foreach(c; str)

which types c as char (or immutable char), not dchar.

These are the subtle problems that we have with the dichotomy of phobos refusing to believe a string is an array, but the compiler believing it is. I think the default inference for this should be dchar, and phobos can make that true as long as it controls the string type.

There are other points to consider:

1) a string *could be* indexed by character and return the code point being pointed to.

2) even slicing could be valid as long as the slice operator jumps back to the start of the dchar being encoded. This might make for very tricky code, but then again, such is the cost of trying to slice something like a utf-8 string :)

But having the compiler force the string type to be an array, when it clearly isn't, doesn't help. Give the runtime the choice, like it's done for AA's, and I think we may have something that is workable, and doesn't suck performance-wise.

-Steve
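
The dichotomy described above can be shown with a few compile-time checks. This is a sketch against the std.range traits as they exist in current Phobos; results could differ in other library versions.

```d
import std.range : ElementType, isBidirectionalRange, isRandomAccessRange;

// Phobos: string is a bidirectional range of dchar, not a random-access range.
static assert(!isRandomAccessRange!string);
static assert(isBidirectionalRange!string);
static assert(is(ElementType!string == dchar));

// Compiler: string is an array, and indexing yields a single code unit.
static assert(is(typeof(string.init[0]) : char));

void main()
{
    foreach (c; "abc")
    {
        // The inferred foreach type is a char variant, never dchar.
        static assert(is(typeof(c) : char));
        static assert(!is(typeof(c) == dchar));
    }
}
```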
Jul 19 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
Steven Schveighoffer wrote:
 On Mon, 19 Jul 2010 16:04:21 -0400, Walter Bright 
 <newshound2 digitalmars.com> wrote:
 Strings in D are deliberately meant to be arrays, not special things. 
 Other languages make them special because they have insufficiently 
 powerful arrays.
Andrei is changing that. Already, isRandomAccessRange!(string) == false. I kind of don't like this direction, even though its clever.
That decision may be a mistake.
 I agree here.  Anything that uses indexing to perform a linear operation 
 is bound for the scrap heap.  But what about this:
 
 foreach(c; str)
 
 which types c as char (or immutable char), not dchar.
Probably too late to change that one.
Jul 19 2010
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 07/19/2010 11:31 PM, Walter Bright wrote:
 Steven Schveighoffer wrote:
 On Mon, 19 Jul 2010 16:04:21 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:
 Strings in D are deliberately meant to be arrays, not special things.
 Other languages make them special because they have insufficiently
 powerful arrays.
Andrei is changing that. Already, isRandomAccessRange!(string) == false. I kind of don't like this direction, even though its clever.
That decision may be a mistake.
I think otherwise. In fact I confess I am extremely excited. The current state of affairs describes built-in strings very accurately: they are formally bidirectional ranges, yet they offer random access for code units that you can freely use if you so wish. It's modeling reality very accurately.

As far as I know, all algorithms in std.algorithm that work well without decoding are special-cased for strings to work fast and yield correct results.

Andrei
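
A minimal sketch of the two views described above: the range primitives decode code points, while indexing gives direct access to code units. The helper decodeAll is invented for this illustration, and the behaviour shown is that of current Phobos.

```d
import std.range : empty, front, popFront;

/// Collects the code points of `s` using only the range primitives.
dchar[] decodeAll(string s)
{
    dchar[] result;
    while (!s.empty)
    {
        result ~= s.front;   // decodes the next code point
        s.popFront();        // skips all of its code units
    }
    return result;
}

void main()
{
    string s = "\u00E9t\u00E9";
    assert(s[0] == 0xC3);           // indexing yields the first code unit
    assert(s.front == '\u00E9');    // front yields the first code point
    assert(decodeAll(s) == "\u00E9t\u00E9"d);
}
```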
Jul 19 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 I think otherwise. In fact I confess I am extremely excited. The current 
 state of affairs described built-in strings very accurately: they are 
 formally bidirectional ranges, yet they offer random access for code 
 units that you can freely use if you so wish. It's modeling reality very 
 accurately.
 
 As far as I know, all algorithms in std.algorithm that work well without 
 decoding are special-cased for strings to work fast and yield correct 
 results.
That's good to hear.
Jul 20 2010
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 As it turns out, indexing by byte is *far* more common than indexing by code
 unit, in fact, I've never ever needed to index by code unit.
OK.
 Probably too late to change that one.
There is very little D2 code around, so small changes like this one are still possible.

As an alternative, see also this enhancement request, partially coming from a post of mine in D.learn:
http://d.puremagic.com/issues/show_bug.cgi?id=4483
(If you refuse this fallback idea too, then it's better to close this bug report. Keeping too much dead wood in Bugzilla will eventually cause small troubles.)

In my original post my points 2, 6, 8 and 9 are still valid. To be nice they need strings to be a little different from standard arrays.

Thanks for the comments,
bearophile
Jul 20 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
bearophile wrote:
 Probably too late to change that one.
There is very little D2 code around, so little changes as this one are possible still.
It's a D1 feature, and has been there since nearly the beginning.
Jul 20 2010
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 20 Jul 2010 14:29:43 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 bearophile wrote:
 Probably too late to change that one.
There is very little D2 code around, so little changes as this one are possible still.
It's a D1 feature, and has been there since nearly the beginning.
Since when did we care about D1 compatibility? const, inout, array appending, final, typeof(string), TLS globals just to name a few...

If you expect D1 code to compile fine and run on D2, you are deluding yourself.

The worst that happens is that code starts using dchar instead of char, and either a compiler error occurs and it's fixed simply by doing:

foreach(char c; str)

or it compiles fine because the type is never explicitly stated, and what's the big deal there? The code just becomes more utf compatible for free :)

-Steve
Jul 20 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
Steven Schveighoffer wrote:
 On Tue, 20 Jul 2010 14:29:43 -0400, Walter Bright 
 <newshound2 digitalmars.com> wrote:
 It's a D1 feature, and has been there since nearly the beginning.
Since when did we care about D1 compatibility?
We care about incompatibilities that silently break code.
 const, inout, array appending, final, typeof(string), TLS globals just 
 to name a few...
 
 If you expect D1 code to compile fine and run on D2, you are deluding 
 yourself.
No argument there, but we do try to avoid silent breakage.
 The worst that happens is that code starts using dchar instead of char, 
 and either a compiler error occurs and it's fixed simply by doing:
 
 foreach(char c; str)
 
 or it compiles fine because the type is never explicitly stated, and 
 what's the big deal there?  The code just becomes more utf compatible 
 for free :)
I don't think it's necessarily true that the user really wanted the decoded character rather than the byte, or even that wanting the decoded character is more likely to be desired.
Jul 20 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Walter Bright wrote:
 Steven Schveighoffer wrote:
 On Tue, 20 Jul 2010 14:29:43 -0400, Walter Bright 
 <newshound2 digitalmars.com> wrote:
 It's a D1 feature, and has been there since nearly the beginning.
Since when did we care about D1 compatibility?
We care about incompatibilities that silently break code.
 const, inout, array appending, final, typeof(string), TLS globals just 
 to name a few...

 If you expect D1 code to compile fine and run on D2, you are deluding 
 yourself.
No argument there, but we do try to avoid silent breakage.
 The worst that happens is that code starts using dchar instead of 
 char, and either a compiler error occurs and it's fixed simply by doing:

 foreach(char c; str)

 or it compiles fine because the type is never explicitly stated, and 
 what's the big deal there?  The code just becomes more utf compatible 
 for free :)
I don't think it's necessarily true that the user really wanted the decoded character rather than the byte, or even that wanting the decoded character is more likely to be desired.
Unfortunately it's inconsistent. foreach for ranges operates in terms of front, empty, popFront - just not for strings. I avoid foreach in std.algorithm and in generic code. For my money I'd be okay if foreach (c; str) wouldn't even compile - the user would be asked to specify the type. But if the implicit type is allowed, I'm afraid I believe that dchar would be the best choice. Andrei
Jul 20 2010
prev sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2010-07-20 00:31:34 -0400, Walter Bright <newshound2 digitalmars.com> said:

 Steven Schveighoffer wrote:
 I agree here.  Anything that uses indexing to perform a linear 
 operation is bound for the scrap heap.  But what about this:
 
 foreach(c; str)
 
 which types c as char (or immutable char), not dchar.
Probably too late to change that one.
Sad. That's one of the first things I tried when I first learned D and the result did surprise me. I expected foreach to iterate on characters (code points), not code units. Then I saw I could add 'dchar' to get that behaviour and found that to be not too bad.

The big problem here is that ranges and foreach behave differently. A range that doesn't work with foreach isn't a good range. That's even worse when that range is at the core of the language because it'll look bad on both ranges and the language.

As it stands now, when doing generic programming we'd have to write foreach like this so it works the same with foreach as it does with the range APIs, just in case the range is a string:

	foreach (ElementType!(typeof(range)) c; range) {}

Something needs to change so the above always works the same as not specifying the type! Either foreach should be adapted or ranges should let go of the idea of iterating on code points for the default string type.

As for the "too late to change" stance, I'm not sure. It'll certainly be too late to change in a year, but right now D2 is still pretty new. What makes you say it's too late?

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
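
The workaround described above can be sketched as a small generic function. The name countElements is made up for this example; ElementType and isInputRange come from std.range.

```d
import std.range : ElementType, isInputRange;

/// Counts elements the way the range primitives see them, for any input range.
size_t countElements(R)(R range)
if (isInputRange!R)
{
    size_t n = 0;
    // Spelling out the element type makes foreach agree with front/popFront,
    // even when R is a narrow string.
    foreach (ElementType!R e; range)
        ++n;
    return n;
}

void main()
{
    assert(countElements([1, 2, 3]) == 3);
    assert(countElements("h\u00E9llo") == 5);   // code points, not 6 code units
}
```

Without the explicit ElementType!R, the same loop would run six times for the string argument.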
Jul 20 2010
next sibling parent reply DCoder <anon ym.ous> writes:
== Quote from Michel Fortin (michel.fortin michelf.com)'s article
 On 2010-07-20 00:31:34 -0400, Walter Bright
<newshound2 digitalmars.com> said:
 Steven Schveighoffer wrote:
 I agree here.  Anything that uses indexing to perform a linear
 operation is bound for the scrap heap.  But what about this:

 foreach(c; str)

 which types c as char (or immutable char), not dchar.
Probably too late to change that one.
 Sad. That's one of the first things I tried when I first learned D and
 the result did surprise me. I expected foreach to iterate on characters
 (code points), not code units. Then I saw I could add 'dchar' to get
 that behaviour and found that to be not too bad.
 The big problem here is that ranges and foreach behave differently. A
 range that doesn't work with foreach isn't a good range. That's even
 worse when that range is at the core of the language because it'll look
 bad on both ranges and the language.
 As it stands now, when doing generic programming we'd have to write
 foreach like this so it works the same with foreach as it does with the
 range APIs, just in case the range is a string:
 	foreach (ElementType!(typeof(range)) c; range) {}
 Something needs to change so the above always work the same as not
 specifying the type! Either foreach should be adapted or ranges should
 let go the idea of iterating on code points for the default string
 type.
I'm wondering how bad it would be to introduce a schar (short char, 1 byte) type and then let char simply map to a "default" char type: dchar, wchar, or whatever we tell the compiler. By default, char would map to dchar.

alias char dchar;

Coupling this with implicit cast of schar/char (from imported C code) to dchar, I think this might even help fix many such situations.
Jul 20 2010
parent Jonathan M Davis <jmdavisprog gmail.com> writes:
On Tuesday, July 20, 2010 05:30:51 DCoder wrote:
 I'm wondering how bad would it be introduce a schar (short char, 1
 byte) type and then let char simply map to a "default" char type:
 dchar, wchar, or whatever we tell the compiler. By default, char
 would map to dchar.
 
 alias char dchar;
 
 Coupling this with implicit cast of schar/char (from imported C
 code) to dchar, I think this might even help fix many such
 situations.
That doesn't really gain us anything. It would likely just make it so that char would mean dchar and be used by default. And honestly, using dchar is not really the solution. If you want to do that, you can. However, using dchar in the general case is not necessarily a good idea since it wastes so much space.

On top of all that, TDPL was _very_ clear on the differences between char, wchar, and dchar and TDPL is supposed to stay accurate. So, any changes which would contradict TDPL need a _very_ good reason for being made, or they won't be.

Overall, strings in D work great. The only issue really is making it so that you properly deal with the cases where you need to treat them as code points vs when you can treat them as code units. Any programmer who wants to entirely avoid the problem can just use dchar and dstring. For the rest, you need to understand how string and wstring work and just handle them appropriately. There may be a few places where things should be smoothed out, but overall, I really do think that they work well as they are.

- Jonathan M Davis
Jul 20 2010
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
Michel Fortin wrote:
 As for the "too late to change" stance, I'm not sure. It'll certainly be 
 to late to change in a year, but right now D2 is still pretty new. What 
 makes you say it's too late?
As I said to bearophile, it's a D1 feature.
Jul 20 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 1. most string operations, such as copying and searching, even regular 
 expressions, work just fine using regular indices.
 
 2. doing the operations in (1) using code points and having to continually 
 decode the strings would result in disastrously slow code.
In my original post I forgot another difference from arrays:

5b) a method like ".unit()" that allows indexing code units. So "foo".unit(1) is always O(1). Lower level code can use this method as [] is used for arrays.

Copying is done on the bytes themselves, with a memcpy, no decoding necessary. If point (9) (automatic LZO encoding) is used, then copying can be 2-3 times faster for long strings (because there is less data and you don't need to uncompress it to copy). (If such compression is added, then strings may need a third accessor method, to the true bytes.)
 3. the user can always layer a code point interface over the strings, but going
 the other way is not so practical.
This is true. But it makes the string usage unnecessarily low-level and hard...

A better design in a smart system language like D is to give strings a default high level "interface" that sees strings as what they are at high level, and add a second lower level interface when you need faster lower-level fiddling (so they have [] that returns code points and unit() that returns code units).

Bye,
bearophile
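
The two-level interface proposed above can be sketched as a library type layered over today's strings. HighString is a hypothetical name and is not part of Phobos; std.utf.stride and std.utf.decode are the real functions used to walk and decode UTF-8. The linear walk in opIndex is the simplest possible implementation, not the indexed O(log n) variant mentioned in the original post.

```d
import std.utf : decode, stride;

/// Hypothetical wrapper: [] returns code points, unit() returns code units.
struct HighString
{
    string data;

    /// i-th code point, found by a linear walk over the code units.
    dchar opIndex(size_t i)
    {
        size_t pos = 0;
        foreach (_; 0 .. i)
            pos += stride(data, pos);   // skip one whole code point
        return decode(data, pos);       // decode the code point at pos
    }

    /// i-th code unit, O(1) like plain array indexing.
    char unit(size_t i)
    {
        return data[i];
    }
}

void main()
{
    auto s = HighString("h\u00E9llo");
    assert(s[1] == '\u00E9');     // second code point
    assert(s[2] == 'l');          // third code point, at byte offset 3
    assert(s.unit(1) == 0xC3);    // second code unit: first byte of U+00E9
}
```

That such a wrapper can be written in a few lines on top of the existing arrays is also the argument made against building it into the language.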
Jul 19 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 1. most string operations, such as copying and searching, even regular 
 expressions, work just fine using regular indices.
 
 2. doing the operations in (1) using code points and having to continually
  decode the strings would result in disastrously slow code.
In my original post I have forgotten another difference over arrays: 5b) a method like ".unit()" that allows to index code units. So "foo".unit(1) is always O(1). Lower level code can use this method as [] is used for arrays.
This is backwards. The [i] should behave as expected for arrays. As it turns out, indexing by byte is *far* more common than indexing by code unit, in fact, I've never ever needed to index by code unit. (Though it is sometimes necessary to step through by code unit, that's different from indexing by code unit.)
 3. the user can always layer a code point interface over the strings, but
 going the other way is not so practical.
 This is true. But it makes the string usage unnecessarily low-level and hard...
I don't believe that manipulating strings in D is hard, even if you do have to work with multibyte characters. You do have to be aware they are multibyte, but I think that just comes with being a programmer.
 A better design in a smart system language as D is to give strings a
 default high level "interface" that sees strings as what they are at high
 level, and add a second lower level interface when you need faster
 lower-level fiddling (so they have [] that returns code points and unit()
 that returns code units).
I have some moderate experience with using utf. First there's the D javascript engine, which is fully utf'd. The D string design fits in with it perfectly. Then there are chunks of C++ ascii-only code I've translated to D, and it then worked with utf-8 without further modification. Based on that, I believe the D string design hits the sweet spot between efficiency and utility.
Jul 19 2010
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 07/19/2010 11:29 PM, Walter Bright wrote:
 bearophile wrote:
 Walter Bright:
 1. most string operations, such as copying and searching, even
 regular expressions, work just fine using regular indices.

 2. doing the operations in (1) using code points and having to
 continually
 decode the strings would result in disastrously slow code.
In my original post I have forgotten another difference over arrays: 5b) a method like ".unit()" that allows to index code units. So "foo".unit(1) is always O(1). Lower level code can use this method as [] is used for arrays.
This is backwards. The [i] should behave as expected for arrays. As it turns out, indexing by byte is *far* more common than indexing by code unit, in fact, I've never ever needed to index by code unit. (Though it is sometimes necessary to step through by code unit, that's different from indexing by code unit.)
Exactly. And that's what the bidirectional range interface is doing for strings.
 3. the user can always layer a code point interface over the strings,
 but
 going the other way is not so practical.
 This is true. But it makes the string usage unnecessarily low-level and hard...
 I don't believe that manipulating strings in D is hard, even if you do have
 to work with multibyte characters. You do have to be aware they are
 multibyte, but I think that just comes with being a programmer.
 A better design in a smart system language as D is to give strings a
 default high level "interface" that sees strings as what they are at high
 level, and add a second lower level interface when you need faster
 lower-level fiddling (so they have [] that returns code points and unit()
 that returns code units).
I have some moderate experience with using utf. First there's the D javascript engine, which is fully utf'd. The D string design fits in with it perfectly. Then there are chunks of C++ ascii-only code I've translated to D, and it then worked with utf-8 without further modification. Based on that, I believe the D string design hits the sweet spot between efficiency and utility.
I agree. In fact there is no language I know that can compete with D at UTF string handling. Andrei
Jul 19 2010
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Walter Bright Wrote:

 bearophile wrote:
 Walter Bright:
 1. most string operations, such as copying and searching, even regular 
 expressions, work just fine using regular indices.
 
 2. doing the operations in (1) using code points and having to continually
  decode the strings would result in disastrously slow code.
In my original post I have forgotten another difference over arrays: 5b) a method like ".unit()" that allows to index code units. So "foo".unit(1) is always O(1). Lower level code can use this method as [] is used for arrays.
This is backwards. The [i] should behave as expected for arrays. As it turns out, indexing by byte is *far* more common than indexing by code unit, in fact, I've never ever needed to index by code unit. (Though it is sometimes necessary to step through by code unit, that's different from indexing by code unit.)
I've had the same experience. The proposed changes would make string useless to me, even for Unicode work. I'd end up using ubyte[] instead.
Jul 20 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
Sean Kelly wrote:
 I've had the same experience.  The proposed changes would make string useless
 to me, even for Unicode work.  I'd end up using ubyte[] instead.
At this point, the experience with D strings is that D gets it right. To change it would require someone who has spent a *lot* of hours programming in utf making a very compelling argument.
Jul 20 2010
prev sibling next sibling parent reply Marianne Gagnon <auria.mg gmail.com> writes:
Pragmatically, I have noticed that in languages with low-level strings,
people invariably come up with libraries that provide higher-level strings.
C/C++ provided low-level strings only initially, then a not-so-powerful
std::string; and we saw QString, wxString, irr::string, BetterString, countless
others...

Java, on the other hand, provided a powerful high-level String object from the
start; and to my knowledge it is used consistently in all Java programs with no
other string classes being made.

I do acknowledge that D arrays are much better than C/C++ arrays. Still, my
prediction is that if D chooses to stick to C-style function calls, and does
not provide a standard high-level String object, then a myriad of string
objects will start popping around. Because lots of people like OOP and don't
like C-style calls.

Just my 2c :) I may be wrong

-- Auria

Walter Bright Wrote:

 Strings in D are deliberately meant to be arrays, not special things. Other 
 languages make them special because they have insufficiently powerful arrays.
 
 As for indexing by code point, I also believe this is a mistake. It is
 proposed often, but overlooks:
 
 1. most string operations, such as copying and searching, even regular 
 expressions, work just fine using regular indices.
 
 2. doing the operations in (1) using code points and having to continually 
 decode the strings would result in disastrously slow code.
 
 3. the user can always layer a code point interface over the strings, but
 going the other way is not so practical.
Jul 19 2010
parent Jonathan M Davis <jmdavisprog gmail.com> writes:
On Monday, July 19, 2010 16:49:58 Marianne Gagnon wrote:
 Pragmatically, I seem to have noted that in languages with low level
 strings, people invariably come up with libraries that provide
 higher-level strings. C/C++ provided low-level strings only initially,
 then a not-so-powerful std::string; and we saw QString, wxString,
 irr::string, BetterString, countless others...
 
 Java, on the other hand, provided a powerful high-level String object from
 the start; and to my knowledge it is used consistently in all Java
 programs with no other string classes being made.
 
 I do acknowledge that D arrays are much better than C/C++ arrays. Still, my
 prediction is that if D chooses to stick to C-style function calls, and
 does not provide a standard high-level String object, then a myriad of
 string objects will start popping around. Because lots of people like OOP
 and don't like C-style calls.
 
 Just my 2c :) I may be wrong
 
 -- Auria
For the most part, D's strings are plenty high level. Between how fantastic D's arrays are and the fact that you can call functions on them as if they were objects, D's strings are quite high level in terms of how you use them. Their solution for how to deal with Unicode is also quite powerful. The problem is that this solution makes it so that if you try to deal with individual chars or wchars, you're going to very quickly shoot yourself in the foot. If you want to avoid the problem entirely, you simply use dstring, and they're at least as powerful as Java's strings, arguably more so.

The only issue with strings in D that I'm aware of is the danger of trying to deal with individual characters. But there are lots of great functions for dealing with strings, and they allow you to easily deal with individual characters by just using them as single-character strings rather than as chars or wchars. On the whole, D's strings are the best strings that I've used.

- Jonathan M Davis
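The hazard with individual chars can be illustrated with a short sketch. The helper names below (codeUnitCount, codePointCount, asCodePoints) are made up for illustration; std.utf.count and std.conv.to are the Phobos functions assumed here.

```d
import std.conv : to;
import std.utf : count;

/// Number of UTF-8 code units, which is what .length reports for a string.
size_t codeUnitCount(string s)
{
    return s.length;
}

/// Number of code points, found by walking the string (O(n)).
size_t codePointCount(string s)
{
    return count(s);
}

/// Converting to dstring makes each array element one code point,
/// so indexing and length then work per code point.
dstring asCodePoints(string s)
{
    return s.to!dstring;
}
```

For a string such as "h\u00E9llo" the two counts differ (6 code units, 5 code points), and indexing the string at position 1 yields only the first byte of the two-byte sequence, whereas indexing the dstring yields the whole character.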
Jul 19 2010
prev sibling next sibling parent reply Jesse Phillips <jessekphillips+d gmail.com> writes:
What about:

struct String {
	string items;
	alias items this;
}

And add the needed functions you wish to have in string and it will still work
in existing functions that operate on immutable(char)[]
Jul 19 2010
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 07/19/2010 06:51 PM, Jesse Phillips wrote:
 What about:

 struct String {
 	string items;
 	alias items this;
 }

 And add the needed functions you wish to have in string and it will still work
in existing functions that operate on immutable(char)[]
Fortunately you can essentially achieve the above by simply writing free functions that take a string or a ref string as their first argument. Then you can use str.foo(args) as an alternative for foo(str, args).

Andrei
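A minimal sketch of that calling convention, assuming two made-up functions (startsWithVowel and appendBang are illustrative names, not library functions):

```d
/// A free function whose first parameter is a string.
bool startsWithVowel(string s)
{
    if (s.length == 0)
        return false;
    switch (s[0])
    {
        case 'a', 'e', 'i', 'o', 'u':
            return true;
        default:
            return false;
    }
}

/// Taking the string by ref lets the "method" rebind the caller's variable.
void appendBang(ref string s)
{
    s ~= "!";
}
```

With these at module scope, `str.startsWithVowel()` is accepted in place of `startsWithVowel(str)`, and `str.appendBang()` modifies `str` itself because the parameter is ref.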
Jul 19 2010
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 19 Jul 2010 20:26:47 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 07/19/2010 06:51 PM, Jesse Phillips wrote:
 What about:

 struct String {
 	string items;
 	alias items this;
 }

 And add the needed functions you wish to have in string and it will  
 still work in existing functions that operate on immutable(char)[]
Fortunately you can essentially achieve the above by simply writing free functions that take a string or a ref string as their first argument. Then you can use str.foo(args) as an alternative for foo(str, args).
How do we make this work?

auto str = "hello world";
foreach(c; str)
    assert(is(typeof(c) == dchar));

-Steve
Jul 20 2010
parent reply Sean Kelly <sean invisibleduck.org> writes:
Steven Schveighoffer Wrote:
 
 How do we make this work?
 
 auto str = "hello world";
 foreach(c; str)
     assert(is(typeof(c) == dchar));
foreach (dchar c; str)
    assert(...);

This feature has been in D for years.
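A small sketch of the difference between the two loop forms; the function names are invented for illustration:

```d
/// Counts iterations when foreach infers the element type. For a string
/// that type is a char, so there is one iteration per UTF-8 code unit.
size_t countInferred(string s)
{
    size_t n;
    foreach (c; s)
    {
        static assert(typeof(c).sizeof == 1); // c is a (qualified) char
        ++n;
    }
    return n;
}

/// Counts iterations when the loop variable is declared as dchar.
/// foreach then decodes the string, one iteration per code point.
size_t countDecoded(string s)
{
    size_t n;
    foreach (dchar c; s)
    {
        static assert(is(typeof(c) == dchar));
        ++n;
    }
    return n;
}
```

For pure ASCII input the two counts agree; they diverge as soon as the string contains a multi-byte sequence.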
Jul 20 2010
next sibling parent reply =?UTF-8?B?IkrDqXLDtG1lIE0uIEJlcmdlciI=?= <jeberger free.fr> writes:
Sean Kelly wrote:
 Steven Schveighoffer Wrote:
 How do we make this work?

 auto str = "hello world";
 foreach(c; str)
     assert(is(typeof(c) == dchar));
 foreach (dchar c; str)
     assert(...);

 This feature has been in D for years.
And what about this one:

void func(T) (T range) {
    foreach (elem; range)
        assert (is (typeof (elem) == ElementType!(T)));
}

func ("azerty");
auto a = [ 1, 2, 3, 4, 5];
func (a);

Jerome
-- 
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr
Jul 20 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
Jérôme M. Berger wrote:
 	And what about this one:
 
 void func(T) (T range) {
     foreach (elem; range)
         assert (is (typeof (elem) == ElementType!(T)));
 }
 
 func ("azerty");
 auto a = [ 1, 2, 3, 4, 5];
 func (a);
You can specialize the template for strings:

void func(T:string)(T range) { ... }
Jul 20 2010
next sibling parent reply =?UTF-8?B?IkrDqXLDtG1lIE0uIEJlcmdlciI=?= <jeberger free.fr> writes:
Walter Bright wrote:
 Jérôme M. Berger wrote:
     And what about this one:

 void func(T) (T range) {
     foreach (elem; range)
         assert (is (typeof (elem) == ElementType!(T)));
 }

 func ("azerty");
 auto a = [ 1, 2, 3, 4, 5];
 func (a);
 You can specialize the template for strings:

 void func(T:string)(T range) { ... }
Sure, I can also not use a template and write however many overloaded functions I need. So what are templates for?

Jerome
-- 
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr
Jul 20 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
Jérôme M. Berger wrote:
 Walter Bright wrote:
 You can specialize the template for strings:

 void func(T:string)(T range) { ... }
Sure, i can also not use a template and write however many overloaded functions I need. So what are templates for?
The overloaded template specialization capability exists exactly because it is often advantageous to write custom versions for certain types. The user of the template doesn't see that; it looks generic to him.
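A minimal sketch of a generic template with a string specialization, using an invented function name (describe) so the chosen overload is observable:

```d
/// Generic version: used for any argument type that has no specialization.
string describe(T)(T range)
{
    return "generic";
}

/// Specialization: preferred when the argument type is string.
string describe(T : string)(T range)
{
    return "string";
}
```

The caller writes `describe(x)` in both cases; which body runs is decided at compile time from the argument type.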
Jul 20 2010
prev sibling parent "Aelxx" <aelxx yandex.ru> writes:
"Walter Bright" <newshound2 digitalmars.com> wrote in message
news:i24st1$12uh$1 digitalmars.com...
 Jerome M. Berger wrote:
 And what about this one:

 void func(T) (T range) {
     foreach (elem; range)
         assert (is (typeof (elem) == ElementType!(T)));
 }

 func ("azerty");
 auto a = [ 1, 2, 3, 4, 5];
 func (a);
You can specialize the template for strings:

void func(T:string)(T range) { ... }
Hmm. Theoretically, a bit more general:

void func(T, U, V )(T rangeT, U rangeU, V rangeV) { ... }
void func(T:string, U, V )(T rangeT, U rangeU, V rangeV) { ... }
void func(T, U:string, V )(T rangeT, U rangeU, V rangeV) { ... }
void func(T, U, V:string )(T rangeT, U rangeU, V rangeV) { ... }
void func(T:string, U:string, V )(T rangeT, U rangeU, V rangeV) { ... }
void func(T:string, U, V:string )(T rangeT, U rangeU, V rangeV) { ... }
void func(T, U:string, V:string )(T rangeT, U rangeU, V rangeV) { ... }
void func(T:string, U:string, V:string )(T rangeT, U rangeU, V rangeV) { ... }
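One way the eight overloads might be avoided, offered only as a sketch and not something proposed in the thread: put the string-specific branch in a per-argument helper with static if, and keep a single three-parameter template. The names kind and kinds are invented for illustration.

```d
/// Per-argument helper: the string case is handled in one place.
string kind(T)(T value)
{
    static if (is(T : string))
        return "string";
    else
        return "other";
}

/// A single three-parameter template instead of eight specializations.
string kinds(T, U, V)(T a, U b, V c)
{
    return kind(a) ~ "," ~ kind(b) ~ "," ~ kind(c);
}
```

This only helps when the per-type difference can be isolated per argument; if the combinations genuinely need different algorithms, separate overloads may still be the clearer choice.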
Jul 21 2010
prev sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 20 Jul 2010 11:02:57 -0400, Sean Kelly <sean invisibleduck.org>  
wrote:

 Steven Schveighoffer Wrote:
 How do we make this work?

 auto str = "hello world";
 foreach(c; str)
     assert(is(typeof(c) == dchar));
foreach (dchar c; str)
    assert(...);

This feature has been in D for years.
The omission of dchar is on purpose. Phobos has characterized string as a bidirectional range of dchars. For every range where I do:

foreach(e; range)

e has the range's element type. Except for strings of char and wchar. This schizophrenia of type inference is very bad for D, and it's a good argument for why strings should not simply be arrays.

-Steve
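The mismatch can be checked directly. The sketch below assumes Phobos's ElementType from std.range.primitives; the function name is invented for illustration.

```d
import std.range.primitives : ElementType;

/// Reports whether the type foreach infers for its loop variable is the
/// same as the element type the range primitives report for T.
bool foreachMatchesElementType(T)(T range)
{
    foreach (elem; range)
        return is(typeof(elem) == ElementType!T);
    return true; // empty input: nothing to compare
}
```

For an int[] the two types agree. For a string, foreach infers a char type while ElementType!string is dchar, so the function returns false.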
Jul 20 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
Steven Schveighoffer wrote:
 The omission of dchar is on purpose.  Phobos has characterized string as 
 a bidirectional range of dchars.  For every range where I do:
 
 foreach(e; range)
 
 e is of the type of the range.  Except for char and wchar.  This 
 schizophrenia of type induction is very bad for D, and it's a good 
 argument of why strings should not simply be arrays.
For many algorithms on strings, iterating by char is preferred over dchar, even for multibyte strings.
Jul 20 2010
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 20 Jul 2010 15:21:34 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 Steven Schveighoffer wrote:
 The omission of dchar is on purpose.  Phobos has characterized string  
 as a bidirectional range of dchars.  For every range where I do:
  foreach(e; range)
  e is of the type of the range.  Except for char and wchar.  This  
 schizophrenia of type induction is very bad for D, and it's a good  
 argument of why strings should not simply be arrays.
For many algorithms on strings, iterating by char is preferred over dchar, even for multibyte strings.
Huh? Which ones? AFAIK, all of std.algorithm treats strings as ranges of dchar.

I am 100% in agreement with you that indexing and length should be done by char. All I'm talking about is foreach.

-Steve
Jul 20 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
Steven Schveighoffer wrote:
 On Tue, 20 Jul 2010 15:21:34 -0400, Walter Bright 
 <newshound2 digitalmars.com> wrote:
 
 Steven Schveighoffer wrote:
 The omission of dchar is on purpose.  Phobos has characterized string 
 as a bidirectional range of dchars.  For every range where I do:
  foreach(e; range)
  e is of the type of the range.  Except for char and wchar.  This 
 schizophrenia of type induction is very bad for D, and it's a good 
 argument of why strings should not simply be arrays.
For many algorithms on strings, iterating by char is preferred over dchar, even for multibyte strings.
Huh? Which ones?
Searching, for one.
 AFAIK, all of std.algorithm treats strings as ranges 
 of dchar.
Andrei posted elsewhere that there were specializations for strings to do it one way or the other based on which was more efficient.
Jul 20 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Walter Bright wrote:
 Steven Schveighoffer wrote:
 On Tue, 20 Jul 2010 15:21:34 -0400, Walter Bright 
 <newshound2 digitalmars.com> wrote:

 Steven Schveighoffer wrote:
 The omission of dchar is on purpose.  Phobos has characterized 
 string as a bidirectional range of dchars.  For every range where I do:
  foreach(e; range)
  e is of the type of the range.  Except for char and wchar.  This 
 schizophrenia of type induction is very bad for D, and it's a good 
 argument of why strings should not simply be arrays.
For many algorithms on strings, iterating by char is preferred over dchar, even for multibyte strings.
Huh? Which ones?
Searching, for one.
 AFAIK, all of std.algorithm treats strings as ranges of dchar.
Andrei posted elsewhere that there were specializations for strings to do it one way or the other based on which was more efficient.
Boyer-Moore comes to mind.

Andrei
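As a rough illustration of why searching does not need decoding, here is a naive substring search over code units. It is deliberately not Boyer-Moore, only the simplest code-unit search; the function name is invented. It relies on the fact that in valid UTF-8 the encoding of one code point never appears inside the encoding of another, so a byte-for-byte match of a valid needle starts on a code point boundary.

```d
/// Returns the code-unit index of the first occurrence of needle in
/// haystack, or -1 if it is absent. Nothing is decoded.
ptrdiff_t findCodeUnits(string haystack, string needle)
{
    if (needle.length > haystack.length)
        return -1;
    foreach (i; 0 .. haystack.length - needle.length + 1)
    {
        // Slice comparison compares the raw code units.
        if (haystack[i .. i + needle.length] == needle)
            return cast(ptrdiff_t) i;
    }
    return -1;
}
```

The returned index is in code units, so it can be used directly for slicing the original string.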
Jul 20 2010
prev sibling next sibling parent "Rory McGuire" <rmcguire neonova.co.za> writes:
On Tue, 20 Jul 2010 01:51:51 +0200, Jesse Phillips  
<jessekphillips+d gmail.com> wrote:

 What about:

 struct String {
 	string items;
 	alias items this;
 }

 And add the needed functions you wish to have in string and it will  
 still work in existing functions that operate on immutable(char)[]
You shouldn't need to do that:

string strstr(string haystack, string needle);

can be used as:

string s;
s.strstr("needle");

so you can add "methods" to a string or whatever just by defining functions.

-Rory
Jul 20 2010
prev sibling next sibling parent "Rory McGuire" <rmcguire neonova.co.za> writes:
On Tue, 20 Jul 2010 16:08:06 +0200, Jesse Phillips  
<jesse.k.phillips gmail.com> wrote:

 But then you can't overload operators.

 On Tue, Jul 20, 2010 at 12:54 AM, Rory McGuire <rmcguire neonova.co.za>  
 wrote:
 On Tue, 20 Jul 2010 01:51:51 +0200, Jesse Phillips
 <jessekphillips+d gmail.com> wrote:

 What about:

 struct String {
        string items;
        alias items this;
 }

 And add the needed functions you wish to have in string and it will  
 still
 work in existing functions that operate on immutable(char)[]
You shouldn't need to do that:

string strstr(string haystack, string needle);

can be used as:

string s;
s.strstr("needle");

so you can add "methods" to a string or whatever just by defining functions.

-Rory
such as?
Jul 20 2010
prev sibling parent reply "Rory McGuire" <rmcguire neonova.co.za> writes:
On Tue, 20 Jul 2010 16:51:57 +0200, Rory McGuire <rmcguire neonova.co.za>  
wrote:

 On Tue, 20 Jul 2010 16:08:06 +0200, Jesse Phillips  
 <jesse.k.phillips gmail.com> wrote:

 But then you can't overload operators.

 On Tue, Jul 20, 2010 at 12:54 AM, Rory McGuire <rmcguire neonova.co.za>  
 wrote:
 On Tue, 20 Jul 2010 01:51:51 +0200, Jesse Phillips
 <jessekphillips+d gmail.com> wrote:

 What about:

 struct String {
        string items;
        alias items this;
 }

 And add the needed functions you wish to have in string and it will  
 still
 work in existing functions that operate on immutable(char)[]
You shouldn't need to do that:

string strstr(string haystack, string needle);

can be used as:

string s;
s.strstr("needle");

so you can add "methods" to a string or whatever just by defining functions.

-Rory
such as?
I mean is there not another way to do the same thing?
Jul 20 2010
parent reply "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Rory McGuire <rmcguire neonova.co.za> wrote:

[snip]

Rory, is there something wrong with your newsreader? I keep seeing your
posts as replies only to the top post.

-- 
Simen
Jul 20 2010
next sibling parent reply "Rory McGuire" <rmcguire neonova.co.za> writes:
On Tue, 20 Jul 2010 18:35:12 +0200, Simen kjaeraas  
<simen.kjaras gmail.com> wrote:

 Rory McGuire <rmcguire neonova.co.za> wrote:

 [snip]

 Rory, is there something wrong with your newsreader? I keep seeing your
 posts as replies only to the top post.
I'm using opera mail. Any suggestions for Linux+Windows, excluding thunderbird(slow)?
Jul 20 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
Rory McGuire wrote:
 I'm using opera mail. Any suggestions for Linux+Windows, excluding 
 thunderbird(slow)?
I use thunderbird on both windows & linux, haven't noticed speed problems other than my slow internet connection. I have noticed that thunderbird uses multithreading fairly effectively to speed itself up.
Jul 20 2010
prev sibling next sibling parent Jonathan M Davis <jmdavisprog gmail.com> writes:
On Tuesday, July 20, 2010 11:45:41 Rory McGuire wrote:
 On Tue, 20 Jul 2010 18:35:12 +0200, Simen kjaeraas
 
 <simen.kjaras gmail.com> wrote:
 Rory McGuire <rmcguire neonova.co.za> wrote:
 
 [snip]
 
 Rory, is there something wrong with your newsreader? I keep seeing your
 posts as replies only to the top post.
I'm using opera mail. Any suggestions for Linux+Windows, excluding thunderbird(slow)?
Well, since I'm a kde user, I use knode if I want a newsreader and kmail if I want a mail client. I prefer knode for dealing with newsgroups rather than using a mail list with kmail, but I do sometimes end up using kmail with mail lists rather than knode because I can take advantage of imap and have stuff properly synced between my machines.

- Jonathan M Davis
Jul 20 2010
prev sibling parent Fawzi Mohamed <fawzi gmx.ch> writes:
I did not read all the discussion in detail, but in my opinion  
something that would be very useful in a library is

struct String{
	void *ptr;
	size_t _l;
	enum :size_t {
		MaskLen=((~cast(size_t)0)>>2)
	}
	enum :int {
		BitsLen=8*size_t.sizeof-2
	}
	size_t len(){
		return (_l & MaskLen);
	}
	int encodingId(){
		return cast(int)(_l>>BitsLen);
	}
}

plus stuff to simplify its creation from T[] arrays and getting T[]  
arrays from it.

this type would then be used where one wants a string without caring  
about its encoding, and without having to make all string accepting  
functions templates.
As it was explained by others many string operations are rather generic.
*this* is what I would have expected from string, not an alias to  
char[].

Fawzi
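The "stuff to simplify its creation from T[] arrays and getting T[] arrays from it" might look roughly like the sketch below. The struct is renamed AnyString here to avoid confusion with the other String proposals, and the encoding ids (0 = char, 1 = wchar, 2 = dchar) as well as the names encodingIdOf and get are assumptions made for illustration.

```d
/// Hypothetical mapping from code unit type to the 2-bit encoding id.
template encodingIdOf(T)
{
    static if (is(T == char)) enum int encodingIdOf = 0;
    else static if (is(T == wchar)) enum int encodingIdOf = 1;
    else static if (is(T == dchar)) enum int encodingIdOf = 2;
    else static assert(0, "unsupported code unit type");
}

struct AnyString
{
    const(void)* ptr;
    size_t _l; // top two bits: encoding id, remaining bits: length

    enum size_t MaskLen = (~cast(size_t) 0) >> 2;
    enum size_t BitsLen = 8 * size_t.sizeof - 2;

    /// Wraps an existing array without copying it.
    this(T)(const(T)[] data)
    {
        assert(data.length <= MaskLen);
        ptr = data.ptr;
        _l = data.length | (cast(size_t) encodingIdOf!T << BitsLen);
    }

    /// Length in code units of the stored encoding.
    size_t len() const
    {
        return _l & MaskLen;
    }

    int encodingId() const
    {
        return cast(int)(_l >> BitsLen);
    }

    /// Recovers the typed array; the caller must ask for the stored encoding.
    const(T)[] get(T)() const
    {
        assert(encodingId() == encodingIdOf!T);
        return (cast(const(T)*) ptr)[0 .. len()];
    }
}
```

A function could then accept an AnyString parameter and branch on encodingId() at run time instead of being a template over the three string types.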
	
Jul 20 2010
prev sibling parent reply Jesse Phillips <jessekphillips+D gmail.com> writes:
Simen kjaeraas Wrote:

 Rory McGuire <rmcguire neonova.co.za> wrote:
 
 [snip]
 
 Rory, is there something wrong with your newsreader? I keep seeing your
 posts as replies only to the top post.
 
 -- 
 Simen
Actually I'm getting his messages as emails, and just thought he was only sending to me.
Jul 20 2010
next sibling parent awishformore <awishformore gmail.com> writes:
Am 20.07.2010 19:15, schrieb Jesse Phillips:
 Simen kjaeraas Wrote:

 Rory McGuire<rmcguire neonova.co.za>  wrote:

 [snip]

 Rory, is there something wrong with your newsreader? I keep seeing your
 posts as replies only to the top post.

 --
 Simen
Actually I'm getting his messages as emails, and just thought he was only sending to me.
Same here. It's kind of annoying tbh.

/Max
Jul 20 2010
prev sibling parent reply =?UTF-8?B?IkrDqXLDtG1lIE0uIEJlcmdlciI=?= <jeberger free.fr> writes:
Jesse Phillips wrote:
 Simen kjaeraas Wrote:
 
 Rory McGuire <rmcguire neonova.co.za> wrote:

 [snip]

 Rory, is there something wrong with your newsreader? I keep seeing your
 posts as replies only to the top post.

 -- 
 Simen
 Actually I'm getting his messages as emails, and just thought he was only sending to me.
He's hitting "reply to all" instead of just plain "reply".

Funny thing is I don't have any problem with his messages but yours does appear as a reply to the top post...

Jerome
-- 
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr
Jul 20 2010
parent Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 20/07/2010 20:41, "Jérôme M. Berger" wrote:
 Jesse Phillips wrote:
 Simen kjaeraas Wrote:
Funny thing is I don't have any problem with his messages but yours does appear as a reply to the top post... Jerome
In my client (Thunderbird 3), it appears as a top level post. It's been happening several times (with different people) since I started using TB 3, I think.

-- 
Bruno Medeiros - Software Engineer
Jul 26 2010