digitalmars.D - Making all strings UTF ranges has some risk of WTF
- Andrei Alexandrescu (42/42) Feb 03 2010 It's no secret that string et al. are not a magic recipe for writing
- Robert Jacques (7/19) Feb 03 2010 I like b) and d), with a slight preference for d. I think the benefits o...
- ZY Zhou (2/22) Feb 03 2010 I choose (b)
- Andrei Alexandrescu (8/31) Feb 03 2010 Perfect. Then you'll be glad to know that I changed all of Phobos to
- Chad J (23/46) Feb 03 2010 I'm leaning towards (c) here.
- Andrei Alexandrescu (4/60) Feb 03 2010 I hear you. Actually, to either quench or add to the confusion, .length
- Chad J (4/10) Feb 03 2010 0.o
- grauzone (17/35) Feb 03 2010 Change the type of string literals from char[] (or whatever the string
- Trass3r (3/22) Feb 04 2010 That sounds like a really reasonable way to me.
- Simen kjaeraas (9/22) Feb 04 2010 This seems to me a job for @disable:
- Walter Bright (5/7) Feb 03 2010 I'm concerned it would be slow. Most operations on strings do not need
- Andrei Alexandrescu (8/16) Feb 03 2010 I thought you're going to say that, but fortunately it's easy to
- Andrei Alexandrescu (3/11) Feb 03 2010 Oh, one more thing: doing mixed-width searches would require decoding.
- Walter Bright (2/3) Feb 03 2010 Or a conversion before the loop starts of the search term.
- Andrei Alexandrescu (4/8) Feb 03 2010 That triggers memory allocation plus the same cost. It's not likely to
- Ben Hanson (4/14) Feb 04 2010 Exactly. Please don't go down the route that Microsoft did for regex cas...
- dsimcha (10/52) Feb 03 2010 I personally would find this extremely annoying because most of the code...
- Andrei Alexandrescu (8/62) Feb 03 2010 It's definitely going to be easy to use all sensible algorithms with
- Steven Schveighoffer (25/47) Feb 08 2010 I'm in the same camp as dsimcha, I generally write all my apps assuming ...
- Michel Fortin (33/39) Feb 03 2010 UTF-8 and UTF-16 encodings are interesting beasts. If you have a UTF-8
- Chad J (6/16) Feb 04 2010 This would be under the condition that there is another property that
- Rainer Deyke (8/17) Feb 03 2010 These are all fine for a dedicated string type. They're horrible for
- Andrei Alexandrescu (4/20) Feb 03 2010 Arrays of char and wchar are not quite generic - they are definitely UTF...
- Rainer Deyke (11/13) Feb 03 2010 A 'char' is a single utf-8 code unit. A 'char[]' is (or should be) a
- Andrei Alexandrescu (7/20) Feb 03 2010 I agree up to the assessment of the size of the problem and a couple of
- Ali Çehreli (25/38) Feb 03 2010 They would yield dchar, right? Wouldn't that cause trouble in templated
- Andrei Alexandrescu (15/55) Feb 03 2010 Yes, dchar. There was some figuring out in parts of Phobos, but the
- Daniel Keep (8/16) Feb 04 2010 I believe what you're after is this:
- bearophile (7/7) Feb 04 2010 I'd like D to give the right results and to be not bug-prone by default ...
- bearophile (5/7) Feb 04 2010 So it's not really immutable. It contains an immutable dynamic array tha...
- Ali Çehreli (14/15) Feb 04 2010 len() to have a different computational complexity, like O(n) or O(1),
- Jason House (3/32) Feb 04 2010 The underlying array of byte-sized data fragments is an implementation d...
- Simen kjaeraas (5/18) Feb 04 2010 Of the above, I feel (b) is the correct solution, and I understand it
- bearophile (4/6) Feb 04 2010 Yes, I presume he was mostly looking for a justification of his ideas he...
- Andrei Alexandrescu (22/28) Feb 04 2010 I am ready to throw away the implementation as soon as a better idea
- Michel Fortin (21/32) Feb 04 2010 Has any thought been given to foreach? Currently all these work for stri...
- Andrei Alexandrescu (4/38) Feb 04 2010 This is a good point. I'm in favor of changing the language to make the
- Don (13/53) Feb 04 2010 We seem to be approaching the point where char[], wchar[] and dchar[]
- Jerry Quinn (3/10) Feb 04 2010 Well, if you're working with a LOT of text, you may be mmapping GB's of ...
- dsimcha (6/15) Feb 04 2010 text. Yes, this does happen. You better be able to handle it in a sane...
- Rainer Deyke (7/12) Feb 04 2010 If it's not too late to completely change the semantics of char[], then
- Andrei Alexandrescu (25/36) Feb 04 2010 One idea I've had for a while was to have a universal string type:
- grauzone (3/29) Feb 04 2010 You mean like this?
- Rainer Deyke (7/26) Feb 04 2010 Although I see some potential in a universal string type, I don't think
- Andrei Alexandrescu (4/29) Feb 04 2010 The definition I outlined does not specify or constrain the strategy of
- Michael Rynn (75/89) Feb 05 2010 Firstly, for such "augmented types" in D, such as strings, bignums or an...
- Michel Fortin (26/52) Feb 04 2010 That's a nice concept, but it seems to me that it adds much overhead to
- Andrei Alexandrescu (10/64) Feb 04 2010 Well as it's been mentioned, sometimes you may assemble a string out of
- Justin Johansson (14/38) Feb 04 2010 I concur. It's great to see consensus moving in this direction. For
- bearophile (4/7) Feb 04 2010 And just to be sure: you are better than me because you actually impleme...
- Steve Teale (4/7) Feb 05 2010 Andrei, congratulations on starting the most interesting thread I have s...
- Andrei Alexandrescu (3/11) Feb 05 2010 A bit of both, with an emphasis on library.
It's no secret that string et al. are not a magic recipe for writing correct Unicode code. However, things are pretty good and could be further improved by operating the following changes in std.array and std.range:

- make front() and back() for UTF-8 and UTF-16 automatically decode the first and last Unicode character
- make popFront() and popBack() skip one entire Unicode character (instead of just one code unit)
- alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
- change hasLength to return false for UTF-8 and UTF-16 strings

These changes effectively make UTF-8 and UTF-16 bidirectional ranges, with the quirk that you still have a sort of a random-access operator. I'm very strongly in favor of this change. Bidirectional strings allow beautiful correct algorithms to be written that handle encoded strings without any additional effort; with these changes, everything applicable of std.algorithm works out of the box (with the appropriate fixes here and there), which is really remarkable.

The remaining WTF is the length property. Traditionally, a range offering length also implies the expectation that a range of length n allows you to call popFront n times and then assert that the range is empty. However, if you check e.g. hasLength!string it will yield false, although the string does have an accessible member by that name and of the appropriate type.

Although Phobos always checks its assumptions, people might occasionally write code that just uses .length without checking hasLength. Then, they'll be annoyed when the code fails with UTF-8 and UTF-16 strings. (The "real" length of the range is not stored, but can be computed by using str.walkLength() in std.range.)

What can be done about that? I see a number of solutions:

(a) Do not operate the change at all.

(b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count".

(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property.

(d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count.

What would you do? Any ideas are welcome.

Andrei
Feb 03 2010
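The mismatch Andrei describes, where .length counts code units while popFront advances one whole code point, is a property of the encoding rather than of D, so it can be illustrated in any language. A small Python sketch of the same discrepancy (assuming well-formed UTF-8; walk_length here mimics what calling popFront until empty would count):

```python
# A UTF-8 string whose code-unit length and code-point count differ.
s = "déjà vu"             # 7 code points
utf8 = s.encode("utf-8")  # 9 UTF-8 code units: é and à take two bytes each

# ".length" on a D char[] corresponds to the code-unit count:
assert len(utf8) == 9

# walkLength (popFront until empty) corresponds to the code-point count:
def walk_length(data: bytes) -> int:
    count = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            i += 1  # 1-byte sequence (ASCII)
        elif b < 0xE0:
            i += 2  # 2-byte sequence
        elif b < 0xF0:
            i += 3  # 3-byte sequence
        else:
            i += 4  # 4-byte sequence
        count += 1
    return count

assert walk_length(utf8) == 7
```

Any code that took .length as "number of popFront calls until empty" silently gets 9 where 7 is meant, which is exactly the trap options (b) through (d) try to close.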
On Wed, 03 Feb 2010 21:00:21 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:(a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome.I like b) and d), with a slight preference for d. I think the benefits of strings being encoding correct and able to use std.algorithm outweighs the disadvantages. And making char[] different from T[] is going to play havoc with templated algorithms. Another alternative is to remove the char types from the language and implement them as library ranges.
Feb 03 2010
Andrei Alexandrescu Wrote:What can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome. AndreiI choose (b)
Feb 03 2010
ZY Zhou wrote:Andrei Alexandrescu Wrote:Perfect. Then you'll be glad to know that I changed all of Phobos to support it. All unittests pass now, but I suspect there are a couple of bugs left. The change will be part of the next release. I'll commit during the weekend. This is a very exciting development, I was very unhappy about the interaction between std.algorithm and strings. AndreiWhat can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome. AndreiI choose (b)
Feb 03 2010
Andrei Alexandrescu wrote:... What can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome. AndreiI'm leaning towards (c) here. To me the .length on char[] and wchar[] are kinda like doing this: struct SomePOD { int a, b; double y; } SomePOD pod; auto len = pod.length; assert(len == 16); // true. I'll admit it's not a perfect analogy. What I'm playing on here is that the .length on char[] and wchar[] returns the /size of/ the string in bytes rather than the /length/ of the string in number of (well-formed) characters. Unfortunately .sizeof is supposed to return the size of the string's reference (8 bytes on x86 systems) and not the size of the string, IIRC. So that's taken. So perhaps a .bytes or .nbytes property. Maybe make it work for arrays of structs and things like that too. A tuple (or any container) of non-homogeneous elements could probably benefit from this property as well. Given such a property being available, I wouldn't miss .length at all. It's quite misleading.
Feb 03 2010
Chad J wrote:Andrei Alexandrescu wrote:I hear you. Actually, to either quench or add to the confusion, .length for wstring returns the length in 16-bit units, not bytes. Andrei... What can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome. AndreiI'm leaning towards (c) here. To me the .length on char[] and wchar[] are kinda like doing this: struct SomePOD { int a, b; double y; } SomePOD pod; auto len = pod.length; assert(len == 16); // true. I'll admit it's not a perfect analogy. What I'm playing on here is that the .length on char[] and wchar[] returns the /size of/ the string in bytes rather than the /length/ of the string in number of (well-formed) characters. Unfortunately .sizeof is supposed to return the size of the string's reference (8 bytes on x86 systems) and not the size of the string, IIRC. So that's taken. So perhaps a .bytes or .nbytes property. Maybe make it work for arrays of structs and things like that too. A tuple (or any container) of non-homogeneous elements could probably benefit from this property as well. Given such a property being available, I wouldn't miss .length at all. It's quite misleading.
Feb 03 2010
Andrei Alexandrescu wrote:... I hear you. Actually, to either quench or add to the confusion, .length for wstring returns the length in 16-bit units, not bytes. Andrei0.o ... kay.
Feb 03 2010
Andrei Alexandrescu wrote:What can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome.Change the type of string literals from char[] (or whatever the string type is in D2) to a wrapper struct defined in object.d: struct string { char[] raw; } Now string.length is invalid, and you don't have to do weird stuff as in (b) or (c). From here on, you could do 2 things: 1. add accessor methods to string like string classes in other languages do 2. leave the wrapper struct as it is (just add the required range foo), and require the user to use either a) the range API (with utf-8 decoding etc.) or b) access the raw "byte" string with string.raw. I really liked how strings were simply char[]'s, but now with immutable, there's a lot of noise around it anyway, and there's no real value to strings being array slices anymore. Making the user deal directly with utf-8 was probably a bad idea to begin with.
Feb 03 2010
Am 04.02.2010, 04:05 Uhr, schrieb grauzone <none example.net>:Andrei Alexandrescu wrote:Definitely against (c)+(d).What can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome.Change the type of string literals from char[] (or whatever the string type is in D2) to a wrapper struct defined in object.d: struct string { char[] raw; }That sounds like a really reasonable way to me.
Feb 04 2010
grauzone <none example.net> wrote:Change the type of string literals from char[] (or whatever the string type is in D2) to a wrapper struct defined in object.d: struct string { char[] raw; } Now string.length is invalid, and you don't have to do weird stuff as in (b) or (c). From here on, you could do 2 things: 1. add accessor methods to string like string classes in other languages do 2. leave the wrapper struct as it is (just add the required range foo), and require the user to use either a) the range API (with utf-8 decoding etc.) or b) access the raw "byte" string with string.raw.This seems to me a job for @disable:

struct string
{
    immutable(char)[] payload;
    alias payload this;

    @disable int length() { return 0; }
}

-- Simen
Feb 04 2010
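The wrapper idea discussed above, keeping the raw code units reachable but refusing the ambiguous length query, can be mimicked dynamically. A rough Python analogue of the concept (the class and method names here are hypothetical, and a runtime error stands in for D's compile-time @disable):

```python
class String:
    """Wraps raw UTF-8 code units; iteration yields code points,
    and the ambiguous len() is deliberately refused."""

    def __init__(self, raw: bytes):
        self.raw = raw  # the underlying code units stay accessible

    def __iter__(self):
        # Iterate by decoded code point, like the proposed range API.
        return iter(self.raw.decode("utf-8"))

    def __len__(self):
        raise TypeError("ambiguous: use len(s.raw) or s.walk_length()")

    def walk_length(self) -> int:
        return len(self.raw.decode("utf-8"))

s = String("héllo".encode("utf-8"))
assert len(s.raw) == 6       # code units, explicitly requested
assert s.walk_length() == 5  # code points
```

The design point is the same as grauzone's and Simen's: the caller must say which of the two lengths is meant.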
Andrei Alexandrescu wrote:It's no secret that string et al. are not a magic recipe for writing correct Unicode code.I'm concerned it would be slow. Most operations on strings do not need to decode the unicode characters, for example, find, startsWith, etc., do not. Decoding then doing find, startsWith, etc., will be considerably slower.
Feb 03 2010
Walter Bright wrote:Andrei Alexandrescu wrote:I thought you were going to say that, but fortunately it's easy to special-case certain algorithms for strings during compilation. In fact I already did - for example, Boyer-Moore searching would be very difficult to rewrite for variable-length characters, but there's no need for it. I special-cased that algorithm. I believe this is a good strategy. AndreiIt's no secret that string et al. are not a magic recipe for writing correct Unicode code.I'm concerned it would be slow. Most operations on strings do not need to decode the unicode characters, for example, find, startsWith, etc., do not. Decoding then doing find, startsWith, etc., will be considerably slower.
Feb 03 2010
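Walter's observation that find and startsWith need no decoding holds because UTF-8 is self-synchronizing: a valid encoded sequence never begins in the middle of another character, so a raw byte-level search finds exactly the code-point-level matches. A Python illustration of that property:

```python
haystack = "naïve café".encode("utf-8")
needle = "café".encode("utf-8")

# bytes.find is a raw byte search, no decoding involved...
pos = haystack.find(needle)
assert pos != -1

# ...yet the match always lands on a character boundary: continuation
# bytes (0x80-0xBF) can never be mistaken for the lead byte that
# starts the needle, so the prefix before the match is valid UTF-8.
assert haystack[:pos].decode("utf-8") == "naïve "
```

This is why same-width searching can be special-cased to plain code-unit comparison with no correctness loss.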
Walter Bright wrote:Andrei Alexandrescu wrote:Oh, one more thing: doing mixed-width searches would require decoding. AndreiIt's no secret that string et al. are not a magic recipe for writing correct Unicode code.I'm concerned it would be slow. Most operations on strings do not need to decode the unicode characters, for example, find, startsWith, etc., do not. Decoding then doing find, startsWith, etc., will be considerably slower.
Feb 03 2010
Andrei Alexandrescu wrote:Oh, one more thing: doing mixed-width searches would require decoding.Or a conversion before the loop starts of the search term.
Feb 03 2010
Walter Bright wrote:Andrei Alexandrescu wrote:That triggers memory allocation plus the same cost. It's not likely to work any better. AndreiOh, one more thing: doing mixed-width searches would require decoding.Or a conversion before the loop starts of the search term.
Feb 03 2010
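Walter's suggestion, transcoding the search term once before the loop so the inner search runs at a single width, might look like the following Python sketch (the function name is hypothetical; Andrei's objection is the temporary allocation the conversion implies):

```python
def find_mixed(haystack_utf8: bytes, needle_utf16: bytes) -> int:
    # One-time conversion of the needle; this allocates a temporary copy,
    # which is the cost Andrei objects to...
    needle_utf8 = needle_utf16.decode("utf-16-le").encode("utf-8")
    # ...after which the inner loop is a plain same-width byte search.
    return haystack_utf8.find(needle_utf8)

haystack = "smörgåsbord".encode("utf-8")
needle = "gås".encode("utf-16-le")
assert find_mixed(haystack, needle) == 5
```

The trade-off is one allocation plus a linear transcode of the (usually short) needle versus decoding the (usually long) haystack on every comparison.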
Andrei Alexandrescu Wrote:Walter Bright wrote:Exactly. Please don't go down the route that Microsoft did for regex case insensitive searching (at least when I last looked at it) - they made the *input string* all lower case... D'oh! Regards, BenAndrei Alexandrescu wrote:That triggers memory allocation plus the same cost. It's not likely to work any better. AndreiOh, one more thing: doing mixed-width searches would require decoding.Or a conversion before the loop starts of the search term.
Feb 04 2010
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s articleIt's no secret that string et al. are not a magic recipe for writing correct Unicode code. However, things are pretty good and could be further improved by operating the following changes in std.array and std.range: - make front() and back() for UTF-8 and UTF-16 automatically decode the first and last Unicode character - make popFront() and popBack() skip one entire Unicode character (instead of just one code unit) - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings - change hasLength to return false for UTF-8 and UTF-16 strings These changes effectively make UTF-8 and UTF-16 bidirectional ranges, with the quirk that you still have a sort of a random-access operator. I'm very strongly in favor of this change. Bidirectional strings allow beautiful correct algorithms to be written that handle encoded strings without any additional effort; with these changes, everything applicable of std.algorithm works out of the box (with the appropriate fixes here and there), which is really remarkable. The remaining WTF is the length property. Traditionally, a range offering length also implies the expectation that a range of length n allows you to call popFront n times and then assert that the range is empty. However, if you check e.g. hasLength!string it will yield false, although the string does have an accessible member by that name and of the appropriate type. Although Phobos always checks its assumptions, people might occasionally write code that just uses .length without checking hasLength. Then, they'll be annoyed when the code fails with UTF-8 and UTF-16 strings. (The "real" length of the range is not stored, but can be computed by using str.walkLength() in std.range.) What can be done about that? I see a number of solutions: (a) Do not operate the change at all. 
(b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome. AndreiI personally would find this extremely annoying because most of the code I write that involves strings is scientific computing code that will never be internationalized, let alone released to the general public. I basically just use ASCII because it's all I need and if your UTF-8 string contains only ASCII characters, it can be treated as random-access. I don't know how many people out there are in similar situations, but I doubt they'll be too happy. On the other hand, I guess it wouldn't be hard to write a simple wrapper struct on top of immutable(ubyte)[] and call it AsciiString. Once alias this gets fully debugged, I could even make it implicitly convert to immutable(char)[].
Feb 03 2010
dsimcha wrote:== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s articleIt's definitely going to be easy to use all sensible algorithms with immutable(ubyte)[]. But even if you go with string, there should be no problem at all. Remember, telling ASCII from UTF is one mask and one test away, and the way Walter and I wrote virtually all related routines was to special-case ASCII. In most cases I don't think you'll notice a decrease in performance. AndreiIt's no secret that string et al. are not a magic recipe for writing correct Unicode code. However, things are pretty good and could be further improved by operating the following changes in std.array and std.range: - make front() and back() for UTF-8 and UTF-16 automatically decode the first and last Unicode character - make popFront() and popBack() skip one entire Unicode character (instead of just one code unit) - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings - change hasLength to return false for UTF-8 and UTF-16 strings These changes effectively make UTF-8 and UTF-16 bidirectional ranges, with the quirk that you still have a sort of a random-access operator. I'm very strongly in favor of this change. Bidirectional strings allow beautiful correct algorithms to be written that handle encoded strings without any additional effort; with these changes, everything applicable of std.algorithm works out of the box (with the appropriate fixes here and there), which is really remarkable. The remaining WTF is the length property. Traditionally, a range offering length also implies the expectation that a range of length n allows you to call popFront n times and then assert that the range is empty. However, if you check e.g. hasLength!string it will yield false, although the string does have an accessible member by that name and of the appropriate type. 
Although Phobos always checks its assumptions, people might occasionally write code that just uses .length without checking hasLength. Then, they'll be annoyed when the code fails with UTF-8 and UTF-16 strings. (The "real" length of the range is not stored, but can be computed by using str.walkLength() in std.range.) What can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome. AndreiI personally would find this extremely annoying because most of the code I write that involves strings is scientific computing code that will never be internationalized, let alone released to the general public. I basically just use ASCII because it's all I need and if your UTF-8 string contains only ASCII characters, it can be treated as random-access. I don't know how many people out there are in similar situations, but I doubt they'll be too happy. On the other hand, I guess it wouldn't be hard to write a simple wrapper struct on top of immutable(ubyte)[] and call it AsciiString. Once alias this gets fully debugged, I could even make it implicitly convert to immutable(char)[].
Feb 03 2010
On Wed, 03 Feb 2010 23:41:02 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:dsimcha wrote:I'm in the same camp as dsimcha, I generally write all my apps assuming ASCII strings (most are internal tools anyways). Can the compiler help making ASCII strings easier to use? i.e., this already works: wstring s = "hello"; // converts to immutable(wchar)[] what about this? asciistring a = "hello"; // converts to immutable(ubyte)[] (or immutable(ASCIIChar)[]) asciistring a = "\uFBCD"; // error, requires cast. The only issue that remains to be resolved then is the upgradability that ascii characters currently enjoy for utf8. I.e. I can call any utf-8 accepting function with an ASCII string, but not an ASCII string accepting function with utf-8 data. Ideally, there should be a 7-bit ASCII character type that implicitly upconverts to char, and can be initialized with a string literal. In addition, you are putting D's utf8 char even further away from C's ASCII char. It would be nice to separate compatible C strings from d strings. At some point, I should be able to designate a function (even a C function) takes only ASCII data, and the compiler should disallow passing general utf8 data into it. This involves either renaming D's char to keep source closer to C, or rewriting C function signatures to reflect the difference. -SteveI personally would find this extremely annoying because most of the code I write that involves strings is scientific computing code that will never be internationalized, let alone released to the general public. I basically just use ASCII because it's all I need and if your UTF-8 string contains only ASCII characters, it can be treated as random-access. I don't know how many people out there are in similar situations, but I doubt they'll be too happy. On the other hand, I guess it wouldn't be hard to write a simple wrapper struct on top of immutable(ubyte)[] and call it AsciiString. 
Once alias this gets fully debugged, I could even make it implicitly convert to immutable(char)[].It's definitely going to be easy to use all sensible algorithms with immutable(ubyte)[]. But even if you go with string, there should be no problem at all. Remember, telling ASCII from UTF is one mask and one test away, and the way Walter and I wrote virtually all related routines was to special-case ASCII. In most cases I don't think you'll notice a decrease in performance.
Feb 08 2010
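The ASCII fast path Andrei mentions ("one mask and one test away") is indeed cheap. A sketch of the idea in Python, with hypothetical helper names; in D the same check would be a bitwise test on the char code unit, assuming well-formed UTF-8 input:

```python
def is_ascii_unit(b: int) -> bool:
    # ASCII code units have the high bit clear: one mask, one test.
    return (b & 0x80) == 0

def next_char_fast(data: bytes, i: int):
    """Return (code_point, next_index); the ASCII path skips decoding."""
    b = data[i]
    if is_ascii_unit(b):
        return chr(b), i + 1  # fast path: the byte is the character
    # Slow path: find the end of the multi-byte sequence, then decode.
    # Continuation bytes match the pattern 10xxxxxx.
    j = i + 1
    while j < len(data) and (data[j] & 0xC0) == 0x80:
        j += 1
    return data[i:j].decode("utf-8"), j

assert next_char_fast(b"abc", 0) == ("a", 1)
assert next_char_fast("é!".encode("utf-8"), 0) == ("é", 2)
```

For mostly-ASCII scientific data of the kind dsimcha describes, nearly every iteration takes the first branch, which is why the measured slowdown can stay small.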
On 2010-02-03 21:00:21 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:It's no secret that string et al. are not a magic recipe for writing correct Unicode code. [...] What would you do? Any ideas are welcome.UTF-8 and UTF-16 encodings are interesting beasts. If you have a UTF-8 string and want to search for an occurrence of that string in another UTF-8 string, you don't have to decode each multi-byte code point: a binary comparison is enough. If you're counting the number of code points, all you need is to count the code units that are not continuation bytes (i.e., not of the form 10xxxxxx). If on the other hand you're applying a character-by-character transformation, then you need to fully decode each character, unless you're only interested in transforming characters from the lower non-multibyte subrange of the encoding (which happens quite often).

Clearly, I don't think there's a one-size-fits-all way to iterate over string arrays. Fully decoding each code unit is clearly the most costly method; it shouldn't be required when it's not necessary. I think we need to be able to represent char[] and wchar[] as a range of dchar to deal with cases where you want to iterate over Unicode code points, but I'd let the programmer ultimately decide what to do.

As for .length, I'll say that removing this property would make it hard to write low-level code. For instance, if I copy a string into a buffer, I need to know the length in bytes (array.length * sizeof(array[0])), not the number of characters. So it doesn't make much sense to disable .length.

So my answer would be mostly to leave things as they are. Perhaps the char[] and wchar[] as dchar ranges could be aliased to string and wstring, but that'd definitely be a blow to the philosophy of strings as simple arrays. You'd also still need to be able to access the actual array underneath. And will all the implicit conversions still work? I'm really not sure it's worth it, but perhaps.
-- Michel Fortin michel.fortin michelf.com http://michelf.com/
Feb 03 2010
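Michel's point about counting code points without decoding rests on a structural fact of UTF-8: every code point contributes exactly one byte that is not a continuation byte. A minimal Python sketch of that count, assuming well-formed input:

```python
def count_code_points(data: bytes) -> int:
    # Continuation bytes match the pattern 10xxxxxx (0x80-0xBF);
    # each code point has exactly one byte that is NOT of that form
    # (either an ASCII byte or a multi-byte lead byte).
    return sum(1 for b in data if (b & 0xC0) != 0x80)

assert count_code_points("hello".encode("utf-8")) == 5
assert count_code_points("héllo".encode("utf-8")) == 5  # é is 2 bytes
assert count_code_points("漢字".encode("utf-8")) == 2   # 3 bytes each
```

This is a single linear pass with one mask and one compare per byte, with no branching on sequence length, so it is far cheaper than full decoding.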
Michel Fortin wrote:... As for .length, I'll say that removing this property would make it hard to write low-level code. For instance, if I copy a string into a buffer, I need to know the length in bytes (array.length * sizeof(array[0])), not the number of characters. So it doesn't make much sense to disable .length. ...This would be under the condition that there is another property that does the same (or similar) thing. It's the part where he said, "and define a different name for that". You could also just reinterpret them as byte[] or ubyte[]. Then they will behave in a low level way. No surprises.
Feb 04 2010
Andrei Alexandrescu wrote:- make front() and back() for UTF-8 and UTF-16 automatically decode the first and last Unicode character - make popFront() and popBack() skip one entire Unicode character (instead of just one code unit) - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings - change hasLength to return false for UTF-8 and UTF-16 stringsThese are all fine for a dedicated string type. They're horrible for generic arrays, for the following reasons: - They break generic code. - They make it impossible to manipulate an array of code units as an array of code units. -- Rainer Deyke - rainerd eldwood.com
Feb 03 2010
Rainer Deyke wrote:Andrei Alexandrescu wrote:Arrays of char and wchar are not quite generic - they are definitely UTF strings. Andrei- make front() and back() for UTF-8 and UTF-16 automatically decode the first and last Unicode character - make popFront() and popBack() skip one entire Unicode character (instead of just one code unit) - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings - change hasLength to return false for UTF-8 and UTF-16 stringsThese are all fine for a dedicated string type. They're horrible for generic arrays, for the following reasons: - They break generic code. - They make it impossible to manipulate an array of code units as an array of code units.
Feb 03 2010
Andrei Alexandrescu wrote:Arrays of char and wchar are not quite generic - they are definitely UTF strings.A 'char' is a single utf-8 code unit. A 'char[]' is (or should be) a generic array of utf-8 code units. Sometimes these code units line up to form valid unicode code points, sometimes they don't. If you want a data type that always contains a valid utf-8 string, don't call it 'char[]'. It's misleading, it breaks generic code, and it renders built-in arrays useless for when you actually want an array of utf-8 code units. It's the same mistake as std::vector<bool> in C++, but much worse. -- Rainer Deyke - rainerd eldwood.com
Feb 03 2010
Rainer Deyke wrote:Andrei Alexandrescu wrote:I agree up to the assessment of the size of the problem and a couple of other points. I've had a great time writing utf code in D with string. Getting back to C++'s std::string really put things in perspective. If your purpose is to store some disparate utf-8 code units (a need that I've never had), I see no problem with storing them as ubyte[]. AndreiArrays of char and wchar are not quite generic - they are definitely UTF strings.A 'char' is a single utf-8 code unit. A 'char[]' is (or should be) a generic array of utf-8 code units. Sometimes these code units line up to form valid unicode code points, sometimes they don't. If you want a data type that always contains a valid utf-8 string, don't call it 'char[]'. It's misleading, it breaks generic code, and it renders built-in arrays useless for when you actually want an array of utf-8 code units. It's the same mistake as std::vector<bool> in C++, but much worse.
Feb 03 2010
Andrei Alexandrescu wrote:It's no secret that string et al. are not a magic recipe for writing correct Unicode code. However, things are pretty good and could be further improved by operating the following changes in std.array and std.range: - make front() and back() for UTF-8 and UTF-16 automatically decode the first and last Unicode characterThey would yield dchar, right? Wouldn't that cause trouble in templated code?- make popFront() and popBack() skip one entire Unicode character (instead of just one code unit)That's perfectly fine, because the opposite operations do "encode": string s = "aÄŸ"; assert(s.length == 3); s ~= 'ÅŸ'; assert(s.length == 5);- alter isRandomAccessRange to return false for UTF-8 and UTF-16 stringsOk.- change hasLength to return false for UTF-8 and UTF-16 stringsI don't understand that one. strings have lengths. Adding and removing does not alter length by 1 for those types. I don't think it's a big deal. It is already so in the language for those types. dstring does not have that problem and could be used when by-1 change is desired.(b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count".The change sounds ok and hasLength should yield true. Or... can it return an enum { no, kind_of, yes } ;) Current utf.decode takes the index by reference and modifies it by the amount. Could popFront() do something similar? I think that's it: front() and popFront() are separated for cohesion. What is causing trouble here is the separation of "by-N" from popFront(). You are concerned that the user makes the assumption and popFront() will reduce by 1. I think that is the problem here. How about something like: // returns the amount that the next popFront() will reduce length int nextStep(); Ali
Feb 03 2010
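Ali's numbers can be checked directly; the sketch below assumes std.range's walkLength, which counts a range's elements by walking it:

```d
import std.range : walkLength;

void main()
{
    string s = "aÄŸ";
    assert(s.length == 3);      // .length counts code units (bytes)
    assert(s.walkLength == 2);  // element count: an O(n) walk over code points
    s ~= 'ÅŸ';                   // appending a dchar encodes it (2 bytes here)
    assert(s.length == 5);
    assert(s.walkLength == 3);
}
```

This is exactly the mismatch driving the hasLength debate: .length exists and is O(1), but it does not equal the number of elements popFront will yield.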
Ali Çehreli wrote:Andrei Alexandrescu wrote: > It's no secret that string et al. are not a magic recipe for writing > correct Unicode code. However, things are pretty good and could be > further improved by operating the following changes in std.array and > std.range: > > - make front() and back() for UTF-8 and UTF-16 automatically decode the > first and last Unicode character They would yield dchar, right? Wouldn't that cause trouble in templated code?Yes, dchar. There was some figuring out in parts of Phobos, but the gains are well worth it. The simplifications are enormous. Until now, Phobos didn't hit the nail on the head with simple encoding/decoding/transcoding primitives. There were many attempts in std.utf, std.encoding, and std.string - all very clunky to use. Now I can just write s.front to get the first dchar of any string, and s.popFront to drop it. Very simple!> - make popFront() and popBack() skip one entire Unicode character > (instead of just one code unit) That's perfectly fine, because the opposite operations do "encode": string s = "ağ"; assert(s.length == 3); s ~= 'ş'; assert(s.length == 5); > - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings Ok. > - change hasLength to return false for UTF-8 and UTF-16 strings I don't understand that one. strings have lengths. Adding and removing does not alter length by 1 for those types. I don't think it's a big deal. It is already so in the language for those types. dstring does not have that problem and could be used when by-1 change is desired.hasLength is a property used by range algorithms to tell them that a range stores the length with a particular meaning (the number of elements). It is perfectly fine that strings don't obey hasLength but do expose .length - it's just that it has different semantics.> (b) Operate the change and mention that in range algorithms you should > check hasLength and only then use "length" under the assumption that it > really means "elements count". 
The change sounds ok and hasLength should yield true. Or... can it return an enum { no, kind_of, yes } ;) Current utf.decode takes the index by reference and modifies it by the amount. Could popFront() do something similar?I think we could dedicate a special function for that. In fact it does exist I think - it's called stride(). Andrei
Feb 03 2010
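The primitives under discussion compose as follows; a sketch using std.utf.stride to query the per-step code-unit count that popFront will consume:

```d
import std.range : empty, front, popFront;
import std.utf : stride;

void main()
{
    string s = "aÄŸc";
    assert(s.front == 'a');    // front decodes a whole dchar
    assert(stride(s, 0) == 1); // the next popFront will drop 1 code unit
    s.popFront();
    assert(s.front == 'ÄŸ');
    assert(stride(s, 0) == 2); // this character occupies 2 code units
    s.popFront();              // drops both bytes at once
    assert(s.front == 'c');
    s.popFront();
    assert(s.empty);
}
```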
Ali Çehreli wrote:... The change sounds ok and hasLength should yield true. Or... can it return an enum { no, kind_of, yes } ;) ... AliI believe what you're after is this: enum Bool { True, False, FileNotFound }
Feb 04 2010
I'd like D to give the right results and not to be bug-prone by default (even if it's a bit slower), and to do something more optimized and faster when I want it and I know what I am doing. So I like the idea of making strings more correct. I liked the idea of D strings acting like normal arrays, but they are different data structures, so it's better to accept them as something different, even if similar (especially UTF32 ones).

So a specific struct, named something like String, can be used to represent a string. String literals too return such a struct. In the C language strlen() is O(n), so people can learn that finding the length of most strings is O(n) in D too (UTF32 strings today are not common. If D succeeds, in the future UTF32 strings may become the only strings used. Such things have happened many times in computer history).

The String structs can cache inside themselves the length and hash value the first time such values are computed (so each String is a struct 4 words long). Some time ago you (Andrei) told me that you don't like a function like len() to have a different computational complexity, like O(n) or O(1), according to the input. But a simple possibility is for the "foo".length property to return the length of the walk over all chars, which is O(n) the first time you call it (it's O(1) on UTF32 strings).

Some algorithms or some parts of code (and string literals, etc.) can know the length of the string they return or use, so they can put this length value inside the String's cached value and there's never a need to compute it.

The struct String that represents the string supports the [] operator too (there can be a MutableString too, that is not used often), but it usually walks the string to find the right character to return. Then a method like "foo".ubyte(i) can be used to return the i-th ubyte of the string. 
So initially this [] is done with just a O(n) walk, later some performance optimization can be added, like a simple skip tree based on indexes. Bye, bearophile
Feb 04 2010
The String structs can cache inside themselves the length and hash value the first time such values are computed (so each String is a struct 4 words long).So it's not really immutable. It contains an immutable dynamic array that can be called "str", plus two mutable words. To access the underlying 8/16/32 bit units it's composed of, you can use the [] of the str attribute: "foo".str[i] Bye, bearophile
Feb 04 2010
bearophile wrote:Time ago you (Andrei) told me that you don't like a function like len() to have a different computational complexity, like O(n) or O(1), according to the input. This reminds me of an excellent talk by Matt Austern on STL's singly-linked lists. One of the interesting points of the design was around the length of the singly-linked list. In the end, he decides not to provide one. I think his point was that getting the length of the data structure was not one of the main operations of a singly-linked list. I agree that the users who really need length can wrap it in a struct that stores the length. Although there doesn't seem to be a slide dedicated to that point, the presentation is here: http://www.accu-usa.org/Slides/SinglyLinkedLists.ppt Ali
Feb 04 2010
Andrei Alexandrescu Wrote:It's no secret that string et al. are not a magic recipe for writing correct Unicode code. However, things are pretty good and could be further improved by operating the following changes in std.array and std.range: These changes effectively make UTF-8 and UTF-16 bidirectional ranges, with the quirk that you still have a sort of a random-access operator. I'm very strongly in favor of this change. Bidirectional strings allow beautiful correct algorithms to be written that handle encoded strings without any additional effort; with these changes, everything applicable in std.algorithm works out of the box (with the appropriate fixes here and there), which is really remarkable. The remaining WTF is the length property. Traditionally, a range offering length also implies the expectation that a range of length n allows you to call popFront n times and then assert that the range is empty. However, if you check e.g. hasLength!string it will yield false, although the string does have an accessible member by that name and of the appropriate type. Although Phobos always checks its assumptions, people might occasionally write code that just uses .length without checking hasLength. Then, they'll be annoyed when the code fails with UTF-8 and UTF-16 strings. (The "real" length of the range is not stored, but can be computed by using str.walkLength() in std.range.) What can be done about that? I see a number of solutions:The underlying array of byte-sized data fragments is an implementation detail. hasLength is a kludge. Follow good OO design and hide the implementation details from the standard interface! I would use a struct for UTF8 and UTF16 strings, and add a method to get the raw array. That allows simple, compiler-enforced usage while still allowing special casing to use raw data. As an added bonus, this method can generalize for other variable-width range elements.
Feb 04 2010
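A minimal sketch of the struct the post describes; the names UtfString, raw and count are hypothetical, not an actual Phobos type:

```d
import std.range : walkLength;

// Hypothetical wrapper: the code-unit array is an implementation
// detail, reachable only through an explicit accessor.
struct UtfString
{
    private string data;

    // Escape hatch for code that really wants raw code units.
    string raw() const { return data; }

    // The public notion of length is the element count, even
    // though computing it costs an O(n) walk.
    size_t count() const { return data.walkLength; }
}

void main()
{
    auto s = UtfString("aÄŸ");
    assert(s.raw.length == 3); // code units, via the explicit accessor
    assert(s.count == 2);      // code points, the interface-level length
}
```

With .length hidden, generic algorithms can no longer accidentally treat the code-unit count as the element count, which is the "compiler-enforced usage" the poster is after.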
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:What can be done about that? I see a number of solutions: (a) Do not operate the change at all. (b) Operate the change and mention that in range algorithms you should check hasLength and only then use "length" under the assumption that it really means "elements count". (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define a different name for that. Any other name (codeUnits, codes etc.) would do. The entire point is to not make algorithms believe strings have a .length property. (d) Have std.range define a distinct property called e.g. "count" and then specialize it appropriately. Then change all references to .length in std.algorithm and elsewhere to .count. What would you do? Any ideas are welcome.Of the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn. -- Simen
Feb 04 2010
Simen kjaeraas:Of the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn.Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-) Bye, bearophile
Feb 04 2010
bearophile wrote:Simen kjaeraas:I am ready to throw away the implementation as soon as a better idea comes around. As other times, I operated the change to see how things feel with the new approach. Generally it feels like the new state of affairs is a solid improvement. One recurring problem has been that some code has assumed that ElementType!SomeString has the width of one encoding unit. That assumption is no longer true so I had to change such code with typeof(SomeString.init[0]). Probably I'll abstract that as CodeUnit!SomeString in std.traits. I also found some bugs; for example Levenshtein distance was erroneous because it didn't operate at character level. The fix using front and popFront was very simple. Regarding defining an entire new struct for strings, I think that's a sensible approach. With the new operators in tow, UString (universal string) that traffics in dchar and makes representation a detail would be nicely implementable. It could even have mutable elements at dchar granularity. My feeling is, however, that at this point too much toothpaste is out of the tube for that to happen in D2. That would be offset if current strings were unbearable, but I think they're working very well. AndreiOf the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn.Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-)
Feb 04 2010
On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:bearophile wrote:Simen kjaeraas:I am ready to throw away the implementation as soon as a better idea comes around. As other times, I operated the change to see how things feel with the new approach.Of the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn.Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-)Has any thought been given to foreach? Currently all these work for strings: foreach (c; "abc") { } // typeof(c) is 'char' foreach (char c; "abc") { } foreach (wchar c; "abc") { } foreach (dchar c; "abc") { } I'm concerned about the first case where the element type is implicit. The implicit element type is (currently) the code units. If the range uses code points 'dchar' as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (in the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should be allowed to iterate using 'char' and 'wchar' too. I think this would fit nicely. I was surprised at first when learning D and I noticed that foreach didn't do this, that I had to explicitly ask for it. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Feb 04 2010
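The inconsistency Michel points out is easy to exhibit: the implicit element type iterates code units, while an explicit dchar decodes code points.

```d
void main()
{
    string s = "aÄŸ";

    size_t units;
    foreach (c; s)       // implicit element type: char, i.e. code units
        ++units;
    assert(units == 3);

    size_t points;
    foreach (dchar c; s) // explicit dchar: decodes code points
        ++points;
    assert(points == 2);
}
```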
Michel Fortin wrote:On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:This is a good point. I'm in favor of changing the language to make the implicit type dchar. Andreibearophile wrote:Has any thought been given to foreach? Currently all these work for strings: foreach (c; "abc") { } // typeof(c) is 'char' foreach (char c; "abc") { } foreach (wchar c; "abc") { } foreach (dchar c; "abc") { } I'm concerned about the first case where the element type is implicit. The implicit element type is (currently) the code units. If the range use code points 'dchar' as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (in the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should be allowed to iterate using 'char' and 'wchar' too. I think this would fit nicely. I was surprised at first when learning D and I noticed that foreach didn't do this, that I had to explicitly has for it.Simen kjaeraas:I am ready to throw away the implementation as soon as a better idea comes around. As other times, I operated the change to see how things feel with the new approach.Of the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn.Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-)
Feb 04 2010
Andrei Alexandrescu wrote:Michel Fortin wrote:We seem to be approaching the point where char[], wchar[] and dchar[] are all arrays of dchar, but with different levels of compression. It makes me wonder if the char, wchar types actually make any sense. If char[] is actually a UTF string, then char[] ~ char should be permitted ONLY if char can be implicitly converted to dchar. Otherwise, you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will not necessarily result in a valid unicode string. I suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:This is a good point. I'm in favor of changing the language to make the implicit type dchar. Andreibearophile wrote:Has any thought been given to foreach? Currently all these work for strings: foreach (c; "abc") { } // typeof(c) is 'char' foreach (char c; "abc") { } foreach (wchar c; "abc") { } foreach (dchar c; "abc") { } I'm concerned about the first case where the element type is implicit. The implicit element type is (currently) the code units. If the range use code points 'dchar' as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (in the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should be allowed to iterate using 'char' and 'wchar' too. I think this would fit nicely. I was surprised at first when learning D and I noticed that foreach didn't do this, that I had to explicitly has for it.Simen kjaeraas:I am ready to throw away the implementation as soon as a better idea comes around. 
As other times, I operated the change to see how things feel with the new approach.Of the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn.Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-)
Feb 04 2010
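Don's hazard can be demonstrated with std.utf.validate: appending a lone lead byte type-checks under array-of-code-units semantics, but leaves the array holding invalid UTF-8.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    char[] s = "abc".dup;
    char c = 0xE0; // a UTF-8 lead byte with no trailing bytes
    s ~= c;        // compiles fine: just appending a code unit
    assertThrown!UTFException(validate(s)); // but it is no longer valid UTF-8
}
```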
Don Wrote:We seem to be approaching the point where char[], wchar[] and dchar[] are all arrays of dchar, but with different levels of compression. It makes me wonder if the char, wchar types actually make any sense. If char[] is actually a UTF string, then char[] ~ char should be permitted ONLY if char can be implicitly converted to dchar. Otherwise, you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will not necessarily result in a valid unicode string.Well, if you're working with a LOT of text, you may be mmapping GB's of UTF-8 text. Yes, this does happen. You better be able to handle it in a sane manner, i.e. not reallocating the memory to read the data in. So, there is a definite need for casting to array of char, and dealing with the inevitable stray non-unicode char in that mess. Real-world text processing can be a messy affair. It probably requires walking such an array and returning slices cast to char after they've been validated.
Feb 04 2010
== Quote from Jerry Quinn (jlquinn optonline.net)'s articleDon Wrote:text. Yes, this does happen. You better be able to handle it in a sane manner, i.e. not reallocating the memory to read the data in. So, there is a definite need for casting to array of char, and dealing with the inevitable stray non-unicode char in that mess. Welcome to the world of DNA sequence manipulation.We seem to be approaching the point where char[], wchar[] and dchar[] are all arrays of dchar, but with different levels of compression. It makes me wonder if the char, wchar types actually make any sense. If char[] is actually a UTF string, then char[] ~ char should be permitted ONLY if char can be implicitly converted to dchar. Otherwise, you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will not necessarily result in a valid unicode string.Well, if you're working with a LOT of text, you may be mmapping GB's of UTF-8
Feb 04 2010
Don wrote:I suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.If it's not too late to completely change the semantics of char[], then it's also not too late to dump 'char' completely. If it /is/ too late to remove 'char', then 'char[]' should retain the current semantics and a new string type should be added for the new semantics. -- Rainer Deyke - rainerd eldwood.com
Feb 04 2010
Rainer Deyke wrote:Don wrote:One idea I've had for a while was to have a universal string type: struct UString { union { char[] utf8; wchar[] utf16; dchar[] utf32; } enum Discriminator { utf8, utf16, utf32 }; Discriminator kind; IntervalTree!(size_t) skip; ... } The IntervalTree stores the skip amounts that must be added for a given index in the string. For ASCII strings that would be null. Then its size grows with the number of multibyte characters. Beyond a threshold, representation is transparently switched to utf16 or utf32 as needed and the tree becomes smaller or null again. In an advanced implementation the discriminator and the tree could be stored at negative offset, and the tree could be compressed taking advantage of its limited size. That would make UString quite low-overhead while offering a staunchly dchar-based interface. I don't mind at all using string, but I also think UString would be a good extra abstraction. AndreiI suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.If it's not too late to completely change the semantics of char[], then it's also not too late to dump 'char' completely. If it /is/ too late to remove 'char', then 'char[]' should retain the current semantics and a new string type should be added for the new semantics.
Feb 04 2010
Andrei Alexandrescu wrote:Rainer Deyke wrote:You mean like this? http://www.dprogramming.com/mtext.phpDon wrote:One idea I've had for a while was to have a universal string type: struct UString { union { char[] utf8; wchar[] utf16; dchar[] utf32; } enum Discriminator { utf8, utf16, utf32 }; Discriminator kind; IntervalTree!(size_t) skip; ... }I suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.If it's not too late to completely change the semantics of char[], then it's also not too late to dump 'char' completely. If it /is/ too late to remove 'char', then 'char[]' should retain the current semantics and a new string type should be added for the new semantics.
Feb 04 2010
Andrei Alexandrescu wrote:One idea I've had for a while was to have a universal string type: struct UString { union { char[] utf8; wchar[] utf16; dchar[] utf32; } enum Discriminator { utf8, utf16, utf32 }; Discriminator kind; IntervalTree!(size_t) skip; ... } The IntervalTree stores the skip amounts that must be added for a given index in the string. For ASCII strings that would be null. Then its size grows with the number of multibyte characters. Beyond a threshold, representation is transparently switched to utf16 or utf32 as needed and the tree becomes smaller or null again.Although I see some potential in a universal string type, I don't think this is the right implementation strategy. I'd rather have my short strings in utf-32 (optimized for speed) and my long strings in utf-8/utf-16 (optimized for memory usage). -- Rainer Deyke - rainerd eldwood.com
Feb 04 2010
Rainer Deyke wrote:Andrei Alexandrescu wrote:The definition I outlined does not specify or constrain the strategy of changing the discriminator. AndreiOne idea I've had for a while was to have a universal string type: struct UString { union { char[] utf8; wchar[] utf16; dchar[] utf32; } enum Discriminator { utf8, utf16, utf32 }; Discriminator kind; IntervalTree!(size_t) skip; ... } The IntervalTree stores the skip amounts that must be added for a given index in the string. For ASCII strings that would be null. Then its size grows with the number of multibyte characters. Beyond a threshold, representation is transparently switched to utf16 or utf32 as needed and the tree becomes smaller or null again.Although I see some potential in a universal string type, I don't think this is the right implementation strategy. I'd rather have my short strings in utf-32 (optimized for speed) and my long strings in utf-8/utf-16 (optimized for memory usage).
Feb 04 2010
On Thu, 04 Feb 2010 18:41:48 -0700, Rainer Deyke wrote:Andrei Alexandrescu wrote:Firstly, for such "augmented types" in D, such as strings, bignums or any future ideas, it is great to have facilities for creating them as structs, so that they can be used elsewhere without regard to whether they are built in as compiler specials or provided in the library. What is there for struct now is good and getting better in D2, but I still feel a little insecure about understanding how to make a really optimal implementation that is as good as a built-in type that the compiler understands. TDPL has been a help for this.

Programmers will want to use raw char[], wchar[] and dchar[] for whatever reasons, with their "simple" behaviours, so those should not be made unavailable purely because more sophisticated types for Unicode strings can be created.

I have made a UString implementation, similar to the above, but I played a different trick. I wanted this to also maintain a terminating null char for passing to Windows API functions, in particular the 16-bit W interfaces.

struct UString_char {
    char[] str_;
    /// ... lots of good D type stuff, constructor and assign conversions, access
    size_t length() {
        return str_.length - 1; // hide terminating null
    }
}

struct UString_wchar {
    wchar[] str_; /// ditto D type stuff
}

struct UString_dchar {
    dchar[] str_;
}

// throw in void[] for charity (although no one will need it)
struct UString_void {
    void[] ptr_;
}

enum UStringType { UC_CHAR, UC_WCHAR, UC_DCHAR }

struct UString {
    union {
        UString_void vstr;
        UString_char cstr;
        UString_wchar wstr;
        UString_dchar dstr;
    }
    UStringType ztype; // tracks what we are
}

I could then choose individual components by themselves, where appropriate, and even get them working. In D2 the str_ array cannot be immutable while appending or fiddling with the null terminator. I did not get UString working as an associative array key; have not tried since. 
Also made a class version called VString containing the union. There are a lot of issues. I also must acknowledge the prior art of the mtext code, and its MString structure type. I was partly inspired by seeing this, and by how complex it was to do nearly everything. When I last checked mtext it was kind of broken for recent D1 and D2 compilers, and I did not want to fix it. I admit I did not like the complexity of the direct union { char[], wchar[], dchar[] }. Splitting it up into separately usable structs seems to me to give three times the potential for the same price. The advantage of using struct for such types is that it may help bring about perfection of such a POD-based "type creation" facility. I note from looking at some of the Phobos D2 code, e.g. std.array, that this seems to be attempted in places. Nearly all the more interesting D types, arrays, maps, are equivalent to smallish POD types, at least 2-3 times the machine word size (32/64 bit). Making it all work, keeping it understandable, and avoiding WTF bugs is a big challenge.One idea I've had for a while was to have a universal string type: struct UString { union { char[] utf8; wchar[] utf16; dchar[] utf32; } enum Discriminator { utf8, utf16, utf32 }; Discriminator kind; IntervalTree!(size_t) skip; ... }
Feb 05 2010
On 2010-02-04 18:16:55 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:Rainer Deyke wrote:That's a nice concept, but it seems to me that it adds much overhead to improve a rather niche area. It's not often that you need to access characters by index. Generally when you need to, it's because you've already parsed the string and want to return to a previous location, in which case you'd do better, when you first parse, to just save the range or the index in code units rather than the index in code points. But I have to say I'm quite satisfied with the way D handles strings in general. Easy access to code points and direct access to the data is quite handy. I think it fits very well with a low-level language. I'd say in general when manipulating strings I rarely need to bother about code points. Most of the time I'm just searching for ASCII-range markers when parsing, so I can search for them directly as code units, not bothering at all about multi-byte characters. If I'm looking for a substring then I can search by code units too. It's just for the more fancy stuff (case-insensitive searching, character transformation) that it becomes necessary to work with code points. That's why I'm a little wary about your changes in that area: I fear they'll make the common case of working with code units more difficult to deal with. But I won't judge before I see. -- Michel Fortin michel.fortin michelf.com http://michelf.com/Don wrote:One idea I've had for a while was to have a universal string type: struct UString { union { char[] utf8; wchar[] utf16; dchar[] utf32; } enum Discriminator { utf8, utf16, utf32 }; Discriminator kind; IntervalTree!(size_t) skip; ... } The IntervalTree stores the skip amounts that must be added for a given index in the string. For ASCII strings that would be null. Then its size grows with the number of multibyte characters. Beyond a threshold, representation is transparently switched to utf16 or utf32 as needed and the tree becomes smaller or null again. In an advanced implementation the discriminator and the tree could be stored at negative offset, and the tree could be compressed taking advantage of its limited size. That would make UString quite low-overhead while offering a staunchly dchar-based interface. I don't mind at all using string, but I also think UString would be a good extra abstraction. AndreiI suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. 
The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.If it's not too late to completely change the semantics of char[], then it's also not too late to dump 'char' completely. If it /is/ too late to remove 'char', then 'char[]' should retain the current semantics and a new string type should be added for the new semantics.
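The "search ASCII-range markers directly as code units" point above rests on a UTF-8 design guarantee: bytes 0x00-0x7F occur only as complete ASCII characters, never inside a multi-byte sequence, so a raw byte scan cannot produce a false match. A small sketch (findAscii is an invented helper; std.string.representation gives the byte view without decoding):

```d
import std.stdio : writeln;
import std.string : representation;

// Find the first occurrence of an ASCII marker in a UTF-8 string by
// scanning raw code units -- safe because lead and continuation bytes
// of multi-byte sequences are always >= 0x80 and so can never equal an
// ASCII byte.
size_t findAscii(string s, char marker)
{
    assert(marker < 0x80, "only valid for ASCII-range markers");
    foreach (i, b; s.representation)   // immutable(ubyte)[] view, no decoding
        if (b == marker)
            return i;                  // index in code units, not code points
    return size_t.max;                 // not found
}

void main()
{
    // 'é' occupies two code units (0xC3 0xA9), so ';' sits at byte index 3.
    writeln(findAscii("aé;b", ';'));   // prints "3"
}
```

This is the common case Michel describes: no decoding anywhere on the hot path, and the returned code-unit index can be used directly for slicing.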
Feb 04 2010
Don wrote:Andrei Alexandrescu wrote:That is a good way to look at things.Michel Fortin wrote:We seem to be approaching the point where char[], wchar[] and dchar[] are all arrays of dchar, but with different levels of compression.On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:This is a good point. I'm in favor of changing the language to make the implicit type dchar. Andreibearophile wrote:Has any thought been given to foreach? Currently all these work for strings:

foreach (c; "abc") { } // typeof(c) is 'char'
foreach (char c; "abc") { }
foreach (wchar c; "abc") { }
foreach (dchar c; "abc") { }

I'm concerned about the first case, where the element type is implicit. The implicit element type is (currently) the code unit. If ranges use code points ('dchar') as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (in the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should be allowed to iterate using 'char' and 'wchar' too. I think this would fit nicely. I was surprised at first, when learning D, to notice that foreach didn't do this, and that I had to explicitly ask for it.Simen kjaeraas:I am ready to throw away the implementation as soon as a better idea comes around. As at other times, I made the change to see how things feel with the new approach.Of the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn.Yes, I presume he was mostly looking for a justification of ideas he has already accepted and even partially implemented :-)It makes me wonder whether the char and wchar types actually make any sense. If char[] is actually a UTF string, then char[] ~ char should be permitted ONLY if char can be implicitly converted to dchar. 
Otherwise, you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c), which will not necessarily result in a valid Unicode string.Well, as has been mentioned, sometimes you may assemble a string out of individual characters. Probably that case is rare enough to warrant a cast. Note that today char is already convertible to dchar (there's no checking).I suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.What we can do is have to!(const ubyte[]) work for all UTF8 strings and to!(const ushort[]) work for all UTF16 strings. That view is correct and safe. Also, it's not difficult to add a .codepoints pseudo-property. Andrei
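To make the foreach question above concrete, here is the count each element type produces over the same string (the implicit form currently behaves like the explicit char form):

```d
import std.stdio : writefln;

void main()
{
    size_t units, points;
    foreach (char c; "héllo")  ++units;   // code units: 'é' is two UTF-8 bytes
    foreach (dchar c; "héllo") ++points;  // code points: decoded on the fly
    writefln("%s code units, %s code points", units, points);
    // prints "6 code units, 5 code points"
}
```

The inconsistency bearophile worries about is exactly this: if ranges over char[] yield dchar but the implicit foreach yields char, the same loop counts 6 elements one way and 5 the other.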
Feb 04 2010
Andrei Alexandrescu wrote:I concur. It's great to see consensus moving in this direction. For too long Java has suffered under the error that a short (i.e. a UTF-16 code unit) is just about as good as a full Unicode code point (i.e. a UTF-32 "code unit"). As a result, the near-enough-is-good-enough, 16-bit Java APIs mean that programmers either forget (at best) or become slack (at worst) in dealing with valid Unicode characters. Part of this also stems from the culture that if it ain't ASCII or in a Western character set (BMP), who cares. As a matter of taste, I'd prefer to see a dchar Unicode code point officially acknowledged/ordained as "unichar", though I guess there is always the alias resort for pedants like myself. Cheers Justin JohanssonHas any thought been given to foreach? Currently all these work for strings: foreach (c; "abc") { } // typeof(c) is 'char' foreach (char c; "abc") { } foreach (wchar c; "abc") { } foreach (dchar c; "abc") { } I'm concerned about the first case, where the element type is implicit. The implicit element type is (currently) the code unit. If ranges use code points ('dchar') as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (in the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should be allowed to iterate using 'char' and 'wchar' too. I think this would fit nicely. I was surprised at first, when learning D, to notice that foreach didn't do this, and that I had to explicitly ask for it.This is a good point. I'm in favor of changing the language to make the implicit type dchar. Andrei
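Justin's point about 16-bit APIs can be shown directly in D: a code point outside the BMP occupies two UTF-16 code units (a surrogate pair), so counting wchars -- which is effectively what a code-unit-level UTF-16 API does -- overcounts characters. (walkLength here assumes the code-point iteration behavior being discussed in this thread.)

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // 'a', U+1F600 (outside the BMP), 'b': three characters.
    wstring s = "a\U0001F600b"w;
    writeln(s.length);      // prints "4": UTF-16 code units, U+1F600 needs a surrogate pair
    writeln(s.walkLength);  // prints "3": decoded code points
}
```

An API that hands out the 4 code units as if they were characters lets the two halves of the surrogate pair be split or compared individually, which is precisely the near-enough-is-good-enough failure mode described above.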
Feb 04 2010
Andrei Alexandrescu:I am ready to throw away the implementation as soon as a better idea comes around. As other times, I operated the change to see how things feel with the new approach.And just to be sure: you are better than me because you actually implement things, while I am here just complaining all the time :o) So thank you for your work. Later, bearophile
Feb 04 2010
Andrei Alexandrescu Wrote:What would you do? Any ideas are welcome.Andrei, congratulations on starting the most interesting thread I have seen on dm.D for as long as I can remember. It has managed to stay really focused, and has produced a lot of interesting suggestions. I don't envy you the task of sorting through them - but you did ask. From what you've seen so far, are we talking Phobos or language issues, and if both, in what proportion? Steve.
Feb 05 2010
Steve Teale wrote:Andrei Alexandrescu Wrote:A bit of both, with an emphasis on library. AndreiWhat would you do? Any ideas are welcome.Andrei, congratulations on starting the most interesting thread I have seen on dm.D for as long as I can remember. It has managed to stay really focused, and has produced a lot of interesting suggestions. I don't envy you the task of sorting through them - but you did ask. From what you've seen so far, are we talking Phobos or language issues, and if both, in what proportion?
Feb 05 2010