digitalmars.D - Higher level built-in strings
- bearophile <bearophileHUGS lycos.com> Jul 18 2010
- %u <ae au.com> Jul 19 2010
- bearophile <bearophileHUGS lycos.com> Jul 19 2010
- Walter Bright <newshound2 digitalmars.com> Jul 19 2010
- dsimcha <dsimcha yahoo.com> Jul 19 2010
- "Steven Schveighoffer" <schveiguy yahoo.com> Jul 19 2010
- Walter Bright <newshound2 digitalmars.com> Jul 19 2010
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Jul 19 2010
- Walter Bright <newshound2 digitalmars.com> Jul 20 2010
- bearophile <bearophileHUGS lycos.com> Jul 20 2010
- Walter Bright <newshound2 digitalmars.com> Jul 20 2010
- Walter Bright <newshound2 digitalmars.com> Jul 20 2010
- Michel Fortin <michel.fortin michelf.com> Jul 20 2010
- DCoder <anon ym.ous> Jul 20 2010
- Jonathan M Davis <jmdavisprog gmail.com> Jul 20 2010
- Walter Bright <newshound2 digitalmars.com> Jul 20 2010
- bearophile <bearophileHUGS lycos.com> Jul 19 2010
- Walter Bright <newshound2 digitalmars.com> Jul 19 2010
- Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Jul 19 2010
- Sean Kelly <sean invisibleduck.org> Jul 20 2010
- Walter Bright <newshound2 digitalmars.com> Jul 20 2010
- "Steven Schveighoffer" <schveiguy yahoo.com> Jul 20 2010
This odd post comes from reading the nice part about strings of chapter 4 of TDPL. In the last few years I have seen changes in how D strings are meant and managed, changes that make them less and less like arrays (random-access sequences of mutable code units) and more and more what they are at high level (immutable bidirectional sequences of code points). So a final jump is to make string types something different from normal arrays. This frees them to behave better as high level strings. Probably other people have had similar ideas, so if you think this post is boring or useless, please ignore it. So strings can have some differences compared to arrays: 1) length returns code points, but it's immutable and stored in the string, so it's an O(1) operation (returns what std.utf.count() returns). 2) Assigning a string to another one is allowed: string s1 = s2; But changing the length manually is not allowed: s1.length += 2; // error This makes them more close to immutable, more similar to Python strings. 3) Another immutable string-specific attribute can be added that returns the number of code units in the string, for example .codeunits or .nunits or something similar. 4) s1[i] is not O(1), it's generally slower, and returns the i-th code point (there are ways to speed this operation up to O(log n) with some internal indexes). (Code points in dstrings can be accessed in O(1)). 5) foreach(c; s1) yields a code point dchars regardless if s1 is string, dstring, wstring. But you can use foreach(char c; s1) if s1 is a string and you are sure s1 uses only 7 bit chars. But in such cases you can also use a immutable(ubyte)[]. Python3 does something similar, its has a str type that's always UTF16 (or UTF32) and a bytes type that is similar to a ubyte[]. I think D can do something similar. std.string functions can made to work on ubyte[] and immutable(ubyte)[] too. Some more things, I am not sure about them: 6) Strings can contain their hash value. This field is initialized with a empty value (like -1) and computed lazily on-demand. (as done in Python). This can make them faster when they often are put inside associative arrays and sets, avoiding to compute their hash value again and again. So strings are not fully immutable, because this value gets initialized. But it's a pure value, it's determined only by the immutable contents of the string, so I don't think this can cause big problems in multi-thread programs. If two threads update it, they find the same result. 7) the validation (std.utf.validat(), or a cast) can be necessary to create a string. This means that the type string/dstring/wstring implies it's validated :-) 8) If strings are immutable, then the append can always create a new string. So the extra information at the end of the memory block (used for appending to dynamic arrays) is not necessary. 9) - Today memory, even RAM, is cheap, but moving RAM to the CPU caches is not so fast, so a program that works on just the cache is faster. - UFT8 and UTF16 make strings bidirectional ranges. - UTF encodings are just data compression, but it's not so efficient. So a smarter compression scheme can compress strings more in memory, and the decrease in cache misses can compensate for the increased CPU work to decompress them. (But keeping strings compressed can turn them from bidirectional ranges to forward ranges, this is not so bad). There is a compressor that gives a decompression speed of about 3 times slower than memcpy(): http://www.oberhumer.com/opensource/lzo/ LZO can be used transparently to compress strings in RAM when strings become long enough. Hash computation and equality tests done among compressed strings are faster. Turning strings into something different from arrays looks like a loss, but in practice they already are not arrays, thinking about them as normal arrays is no so useful and it's UTF-unsafe, using [] to read the code units doesn't seem so useful. Code that manages strings/wstrings as normal arrays is not correct in general. Bye, bearophile
Jul 18 2010
5) foreach(c; s1) yields a code point dchars regardless if s1 is string,
But you can use foreach(char c; s1) if s1 is a string and you are sure s1 uses
does something similar, its has a str type that's always UTF16 (or UTF32) and a bytes type that is similar to a ubyte[]. I think D can do something similar. std.string functions can made to work on ubyte[] and immutable(ubyte)[] too. Actually I think D should depreciate char and wchar in user code except inside externals for compatibility with C and Windows functions. When you are working with individual characters you almost always want either a dchar or a byte.
Jul 19 2010
%u:When you are working with individual characters you almost always want either a dchar or a byte.
A dchar and an ubyte are better. This is almost what Python3 does and I think it can be good. Maybe other people will give their opinions about my original post. Bye, bearophile
Jul 19 2010
bearophile wrote:This odd post comes from reading the nice part about strings of chapter 4 of TDPL. In the last few years I have seen changes in how D strings are meant and managed, changes that make them less and less like arrays (random-access sequences of mutable code units) and more and more what they are at high level (immutable bidirectional sequences of code points).
Strings in D are deliberately meant to be arrays, not special things. Other languages make them special because they have insufficiently powerful arrays. As for indexing by code point, I also believe this is a mistake. It is proposed often, but overlooks: 1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code. 3. the user can always layer a code point interface over the strings, but going the other way is not so practical.
Jul 19 2010
== Quote from Walter Bright (newshound2 digitalmars.com)'s articlebearophile wrote:This odd post comes from reading the nice part about strings of chapter 4 of TDPL. In the last few years I have seen changes in how D strings are meant and managed, changes that make them less and less like arrays (random-access sequences of mutable code units) and more and more what they are at high level (immutable bidirectional sequences of code points).
languages make them special because they have insufficiently powerful arrays. As for indexing by code point, I also believe this is a mistake. It is proposed often, but overlooks: 1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code. 3. the user can always layer a code point interface over the strings, but going the other way is not so practical.
4. Sometimes one can make valid assumptions about the contents of a string. For example, in an internal utility app that will never be internationalized you may get away with assuming a character is an ASCII byte. If you know your input will be in the Basic Multilingual Plane (for example if working with pre-sanitized input), you can use wstrings and always assume a character is 2 bytes. 5. For dchar strings, a code unit equals a code point. Should the interface for dchar strings be completely different than that for char and wchar strings?
Jul 19 2010
On Mon, 19 Jul 2010 16:04:21 -0400, Walter Bright <newshound2 digitalmars.com> wrote:bearophile wrote:This odd post comes from reading the nice part about strings of chapter 4 of TDPL. In the last few years I have seen changes in how D strings are meant and managed, changes that make them less and less like arrays (random-access sequences of mutable code units) and more and more what they are at high level (immutable bidirectional sequences of code points).
Strings in D are deliberately meant to be arrays, not special things. Other languages make them special because they have insufficiently powerful arrays.
Andrei is changing that. Already, isRandomAccessRange!(string) == false. I kind of don't like this direction, even though its clever. What you end up with is phobos refusing to believe that a string or char[] is an array, but the compiler saying it is. What I'd prefer is something where the compiler types string literals as string, a type defined by phobos which contains as its first member an immutable(char)[] (where the compiler puts the literal). Then we can properly limit the other operations.As for indexing by code point, I also believe this is a mistake. It is proposed often, but overlooks: 1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code. 3. the user can always layer a code point interface over the strings, but going the other way is not so practical.
I agree here. Anything that uses indexing to perform a linear operation is bound for the scrap heap. But what about this: foreach(c; str) which types c as char (or immutable char), not dchar. These are the subtle problems that we have with the dichotomy of phobos refusing to believe a string is an array, but the compiler believing it is. I think the default inference for this should be dchar, and phobos can make that true as long as it controls the string type. There are other points to consider: 1) a string *could be* indexed by character and return the code point being pointed to. 2) even slicing could be valid as long as the slice operator jumps back to the start of the dchar being encoded. This might make for very tricky code, but then again, such is the cost of trying to slice something like a utf-8 string :) But having the compiler force the string type to be an array, when it clearly isn't, doesn't help. Give the runtime the choice, like it's done for AA's, and I think we may have something that is workable, and doesn't suck performance-wise. -Steve
Jul 19 2010
Steven Schveighoffer wrote:On Mon, 19 Jul 2010 16:04:21 -0400, Walter Bright <newshound2 digitalmars.com> wrote:Strings in D are deliberately meant to be arrays, not special things. Other languages make them special because they have insufficiently powerful arrays.
Andrei is changing that. Already, isRandomAccessRange!(string) == false. I kind of don't like this direction, even though its clever.
That decision may be a mistake.I agree here. Anything that uses indexing to perform a linear operation is bound for the scrap heap. But what about this: foreach(c; str) which types c as char (or immutable char), not dchar.
Probably too late to change that one.
Jul 19 2010
On 07/19/2010 11:31 PM, Walter Bright wrote:Steven Schveighoffer wrote:On Mon, 19 Jul 2010 16:04:21 -0400, Walter Bright <newshound2 digitalmars.com> wrote:Strings in D are deliberately meant to be arrays, not special things. Other languages make them special because they have insufficiently powerful arrays.
Andrei is changing that. Already, isRandomAccessRange!(string) == false. I kind of don't like this direction, even though its clever.
That decision may be a mistake.
I think otherwise. In fact I confess I am extremely excited. The current state of affairs described built-in strings very accurately: they are formally bidirectional ranges, yet they offer random access for code units that you can freely use if you so wish. It's modeling reality very accurately. As far as I know, all algorithms in std.algorithm that work well without decoding are special-cased for strings to work fast and yield correct results. Andrei
Jul 19 2010
Andrei Alexandrescu wrote:I think otherwise. In fact I confess I am extremely excited. The current state of affairs described built-in strings very accurately: they are formally bidirectional ranges, yet they offer random access for code units that you can freely use if you so wish. It's modeling reality very accurately. As far as I know, all algorithms in std.algorithm that work well without decoding are special-cased for strings to work fast and yield correct results.
That's good to hear.
Jul 20 2010
Walter:As it turns out, indexing by byte is *far* more common than indexing by code unit, in fact, I've never ever needed to index by code unit.<
OK.Probably too late to change that one.
There is very little D2 code around, so little changes as this one are possible still. As alternative see also this enhancement request, partially coming from a post of mine in D.learn: http://d.puremagic.com/issues/show_bug.cgi?id=4483 (If you refuse this fallback idea too, then it's better to close this bug report. Keeping too much dead wood in Bugzilla will eventually cause small troubles.) In my original post my points 2, 6, 8 and 9 are still valid. To be nice they need strings to be a little different from standard arrays. Thanks for the comments, bearophile
Jul 20 2010
bearophile wrote:Probably too late to change that one.
There is very little D2 code around, so little changes as this one are possible still.
It's a D1 feature, and has been there since nearly the beginning.
Jul 20 2010
Steven Schveighoffer wrote:On Tue, 20 Jul 2010 14:29:43 -0400, Walter Bright <newshound2 digitalmars.com> wrote:It's a D1 feature, and has been there since nearly the beginning.
Since when did we care about D1 compatibility?
We care about incompatibilities that silently break code.const, inout, array appending, final, typeof(string), TLS globals just to name a few... If you expect D1 code to compile fine and run on D2, you are deluding yourself.
No argument there, but we do try to avoid silent breakage.The worst that happens is that code starts using dchar instead of char, and either a compiler error occurs and it's fixed simply by doing: foreach(char c; str) or it compiles fine because the type is never explicitly stated, and what's the big deal there? The code just becomes more utf compatible for free :)
I don't think it's necessarily true that the user really wanted the decoded character rather than the byte, or even that wanting the decoded character is more likely to be desired.
Jul 20 2010
Walter Bright wrote:Steven Schveighoffer wrote:On Tue, 20 Jul 2010 14:29:43 -0400, Walter Bright <newshound2 digitalmars.com> wrote:It's a D1 feature, and has been there since nearly the beginning.
Since when did we care about D1 compatibility?
We care about incompatibilities that silently break code.const, inout, array appending, final, typeof(string), TLS globals just to name a few... If you expect D1 code to compile fine and run on D2, you are deluding yourself.
No argument there, but we do try to avoid silent breakage.The worst that happens is that code starts using dchar instead of char, and either a compiler error occurs and it's fixed simply by doing: foreach(char c; str) or it compiles fine because the type is never explicitly stated, and what's the big deal there? The code just becomes more utf compatible for free :)
I don't think it's necessarily true that the user really wanted the decoded character rather than the byte, or even that wanting the decoded character is more likely to be desired.
Unfortunately it's inconsistent. foreach for ranges operates in terms of front, empty, popFront - just not for strings. I avoid foreach in std.algorithm and in generic code. For my money I'd be okay if foreach (c; str) wouldn't even compile - the user would be asked to specify the type. But if the implicit type is allowed, I'm afraid I believe that dchar would be the best choice. Andrei
Jul 20 2010
On 2010-07-20 00:31:34 -0400, Walter Bright <newshound2 digitalmars.com> said:Steven Schveighoffer wrote:I agree here. Anything that uses indexing to perform a linear operation is bound for the scrap heap. But what about this: foreach(c; str) which types c as char (or immutable char), not dchar.
Probably too late to change that one.
Sad. That's one of the first things I tried when I first learned D and the result did surprise me. I expected foreach to iterate on characters (code points), not code units. Then I saw I could add 'dchar' to get that behaviour and found that to be not too bad. The big problem here is that ranges and foreach behave differently. A range that doesn't work with foreach isn't a good range. That's even worse when that range is at the core of the language because it'll look bad on both ranges and the language. As it stands now, when doing generic programming we'd have to write foreach like this so it works the same with foreach as it does with the range APIs, just in case the range is a string: foreach (ElementType!(typeof(range)) c; range) {} Something needs to change so the above always work the same as not specifying the type! Either foreach should be adapted or ranges should let go the idea of iterating on code points for the default string type. As for the "too late to change" stance, I'm not sure. It'll certainly be to late to change in a year, but right now D2 is still pretty new. What makes you say it's too late? -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jul 20 2010
== Quote from Michel Fortin (michel.fortin michelf.com)'s articleOn 2010-07-20 00:31:34 -0400, Walter Bright
Steven Schveighoffer wrote:I agree here. Anything that uses indexing to perform a linear operation is bound for the scrap heap. But what about this: foreach(c; str) which types c as char (or immutable char), not dchar.
Probably too late to change that one.
the result did surprise me. I expected foreach to iterate on
(code points), not code units. Then I saw I could add 'dchar' to
that behaviour and found that to be not too bad. The big problem here is that ranges and foreach behave
range that doesn't work with foreach isn't a good range. That's
worse when that range is at the core of the language because it'll
bad on both ranges and the language. As it stands now, when doing generic programming we'd have to
foreach like this so it works the same with foreach as it does
range APIs, just in case the range is a string: foreach (ElementType!(typeof(range)) c; range) {} Something needs to change so the above always work the same as not specifying the type! Either foreach should be adapted or ranges
let go the idea of iterating on code points for the default string
I'm wondering how bad would it be introduce a schar (short char, 1 byte) type and then let char simply map to a "default" char type: dchar, wchar, or whatever we tell the compiler. By default, char would map to dchar. alias char dchar; Coupling this with implicit cast of schar/char (from imported C code) to dchar, I think this might even help fix many such situations.
Jul 20 2010
On Tuesday, July 20, 2010 05:30:51 DCoder wrote:I'm wondering how bad would it be introduce a schar (short char, 1 byte) type and then let char simply map to a "default" char type: dchar, wchar, or whatever we tell the compiler. By default, char would map to dchar. alias char dchar; Coupling this with implicit cast of schar/char (from imported C code) to dchar, I think this might even help fix many such situations.
That doesn't really gain us anything. It would likely just make it so that char would mean dchar and be used by default. And honestly, using dchar is not really the solution. If you wan to do that, you can. However, using dchar in the general case is not necessarily a good idea since it wastes so much space. On top of all that, TDPL was _very_ clear on the differences between char, wchar, and dchar and TDPL is supposed to stay accurate. So, any changes which would contradict TDPL need a _very_ good reason for being made, or they won't be. Overall, strings in D work great. The only issue really is making it so that you properly deal with the cases where you need to treat them as code points vs when you can treat them as code units. Any programmer who wants to entirely avoid the problem can just using dchar and dstring. For the rest, you need to understand how string and wstring work and just handle them appropriately. There may be a few places where things should be smoothed out, but overall, I really do think that they work well as they are. - Jonathan M Davis
Jul 20 2010
Michel Fortin wrote:As for the "too late to change" stance, I'm not sure. It'll certainly be to late to change in a year, but right now D2 is still pretty new. What makes you say it's too late?
As I said to bearophile, it's a D1 feature.
Jul 20 2010
Walter Bright:1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code.
In my original post I have forgotten another difference over arrays: 5b) a method like ".unit()" that allows to index code units. So "foo".unit(1) is always O(1). Lower level code can use this method as [] is used for arrays. Copying is done on the bytes themselves, with a memcpy, no decoding necessary. If the point (9) (automatic LZO encoding) is used, then copying can be 2-3 times faster for long strings (because there is less data and you don't need to uncompress it to copy). (if such compression is added, then strings can need a third accessor method, to the true bytes).3. the user can always layer a code point interface over the strings, but going the other way is not so practical.
This is true. But it makes the string usage unnecessarily low-level and hard... A better design in a smart system language as D is to give strings a default high level "interface" that sees strings as what they are at high level, and add a second lower level interface when you need faster lower-level fiddling (so they have [] that returns code points and unit() that returns code units). Bye, bearophile
Jul 19 2010
bearophile wrote:Walter Bright:1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code.
In my original post I have forgotten another difference over arrays: 5b) a method like ".unit()" that allows to index code units. So "foo".unit(1) is always O(1). Lower level code can use this method as [] is used for arrays.
This is backwards. The [i] should behave as expected for arrays. As it turns out, indexing by byte is *far* more common than indexing by code unit, in fact, I've never ever needed to index by code unit. (Though it is sometimes necessary to step through by code unit, that's different from indexing by code unit.)3. the user can always layer a code point interface over the strings, but going the other way is not so practical.
This is true. But it makes the string usage unnecessarily low-level and hard...
I don't believe that manipulating strings in D is hard, even if you do have to work with multibyte characters. You do have to be aware they are multibyte, but I think that just comes with being a programmer. A better design in a smart system language as D is to give strings adefault high level "interface" that sees strings as what they are at high level, and add a second lower level interface when you need faster lower-level fiddling (so they have [] that returns code points and unit() that returns code units).
I have some moderate experience with using utf. First there's the D javascript engine, which is fully utf'd. The D string design fits in with it perfectly. Then there are chunks of C++ ascii-only code I've translated to D, and it then worked with utf-8 without further modification. Based on that, I believe the D string design hits the sweet spot between efficiency and utility.
Jul 19 2010
On 07/19/2010 11:29 PM, Walter Bright wrote:bearophile wrote:Walter Bright:1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code.
In my original post I have forgotten another difference over arrays: 5b) a method like ".unit()" that allows to index code units. So "foo".unit(1) is always O(1). Lower level code can use this method as [] is used for arrays.
This is backwards. The [i] should behave as expected for arrays. As it turns out, indexing by byte is *far* more common than indexing by code unit, in fact, I've never ever needed to index by code unit. (Though it is sometimes necessary to step through by code unit, that's different from indexing by code unit.)
Exactly. And that's what the bidirectional range interface is doing for strings.3. the user can always layer a code point interface over the strings, but going the other way is not so practical.
This is true. But it makes the string usage unnecessarily low-level and hard...
I don't believe that manipulating strings in D is hard, even if you do have to work with multibyte characters. You do have to be aware they are multibyte, but I think that just comes with being a programmer. A better design in a smart system language as D is to give strings adefault high level "interface" that sees strings as what they are at high level, and add a second lower level interface when you need faster lower-level fiddling (so they have [] that returns code points and unit() that returns code units).
I have some moderate experience with using utf. First there's the D javascript engine, which is fully utf'd. The D string design fits in with it perfectly. Then there are chunks of C++ ascii-only code I've translated to D, and it then worked with utf-8 without further modification. Based on that, I believe the D string design hits the sweet spot between efficiency and utility.
I agree. In fact there is no language I know that can compete with D at UTF string handling. Andrei
Jul 19 2010
Walter Bright Wrote:bearophile wrote:Walter Bright:1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code.
In my original post I have forgotten another difference over arrays: 5b) a method like ".unit()" that allows to index code units. So "foo".unit(1) is always O(1). Lower level code can use this method as [] is used for arrays.
This is backwards. The [i] should behave as expected for arrays. As it turns out, indexing by byte is *far* more common than indexing by code unit, in fact, I've never ever needed to index by code unit. (Though it is sometimes necessary to step through by code unit, that's different from indexing by code unit.)
I've had the same experience. The proposed changes would make string useless to me, even for Unicode work. I'd end up using ubyte[] instead.
Jul 20 2010
Sean Kelly wrote:I've had the same experience. The proposed changes would make string useless to me, even for Unicode work. I'd end up using ubyte[] instead.
At this point, the experience with D strings is that D gets it right. To change it would require someone who has spent a *lot* of hours programming in utf making a very compelling argument.
Jul 20 2010
On Tue, 20 Jul 2010 14:29:43 -0400, Walter Bright <newshound2 digitalmars.com> wrote:bearophile wrote:Probably too late to change that one.
possible still.
It's a D1 feature, and has been there since nearly the beginning.
Since when did we care about D1 compatibility? const, inout, array appending, final, typeof(string), TLS globals just to name a few... If you expect D1 code to compile fine and run on D2, you are deluding yourself. The worst that happens is that code starts using dchar instead of char, and either a compiler error occurs and it's fixed simply by doing: foreach(char c; str) or it compiles fine because the type is never explicitly stated, and what's the big deal there? The code just becomes more utf compatible for free :) -Steve
Jul 20 2010









bearophile <bearophileHUGS lycos.com> 