digitalmars.D - [review] new string type
- Steven Schveighoffer (180/180) Nov 30 2010 In a prior thread, I promised to create a narrow string type that would ...
- Jonathan M Davis (37/245) Nov 30 2010 1. At least until universal function syntax is in the language, you use ...
- Steven Schveighoffer (20/71) Nov 30 2010 Yes, but first, this is not a problem with the string type, it's a probl...
- Ellery Newcomer (7/15) Nov 30 2010 There definitely is value in being able to index and slice into utf
- Steven Schveighoffer (10/29) Dec 01 2010 Maybe :) The other alternative is to throw an exception if you try to
- Ellery Newcomer (21/35) Dec 01 2010 in my mind, the problem isn't so much indexing an intermediate code unit...
- spir (8/14) Dec 02 2010 This would yield several times the same code point / dchar, or do i misi...
- Steven Schveighoffer (6/16) Dec 02 2010 It would only yield the code units that are the start of a code point.
- Steven Schveighoffer (13/53) Dec 02 2010 Yes, it does seem odd, but then again, how often do you need the
- spir (11/15) Dec 03 2010
- Steven Schveighoffer (11/21) Dec 03 2010 Indexing a string is rare, unless you are parsing something (yes it does...
- Simon (9/12) Dec 03 2010 In my XP you never index *especially* when parsing.
- Jerry Quinn (4/23) Dec 03 2010 I tend to do a lot of transforming strings, but I need to track offsets ...
- Steven Schveighoffer (19/25) Dec 03 2010 In my daily usage of strings, I generally use a string as a whole, not
- Jerry Quinn (15/35) Dec 03 2010 That's reasonable. I"m not trying to create invalid strings.
- Steven Schveighoffer (17/39) Dec 04 2010 I'm assuming you are not changing the length of the string, or is that n...
- spir (11/15) Dec 03 2010
- Steven Wawryk (16/30) Jan 12 2011 I don't like either. I'd prefer (from std.algorithm)
- Steven Wawryk (3/4) Jan 12 2011 Sorry, I didn't mean the the codeUnits retuning a size_t. I meant an
- Lars T. Kyllingstad (14/31) Nov 30 2010 It feels extremely weird that the indices refer to code units and not
- spir (23/57) Dec 01 2010 If I understand correctly how _charStart works in combination with index...
- Steven Schveighoffer (18/48) Dec 01 2010 I don't think it's possible to do that with any implementation without
- Lars T. Kyllingstad (15/55) Dec 01 2010 As soon as you add opIndex(), your interface becomes that of a random-
- Steven Schveighoffer (8/62) Dec 02 2010 Then substring (slicing) becomes an O(n) operation. It just doesn't wor...
- Lars T. Kyllingstad (27/90) Dec 03 2010 What I meant wast that opSlice() should be disabled in the same way as
- Steven Schveighoffer (13/102) Dec 03 2010 O(n) slicing and indexing, called by different names, is unacceptable. ...
- Lars T. Kyllingstad (5/18) Dec 03 2010 Ah, you are right. I didn't read the documentation for
- Jonathan M Davis (10/71) Dec 03 2010 A string type without slicing (which must be O(1)) is DOA without questi...
- Jonathan M Davis (64/148) Nov 30 2010 I really don't think that we want to have to be adding string functions ...
- Steven Schveighoffer (40/186) Dec 01 2010 Why? It's a choice between adding them to the struct, and doing func(st...
- foobar (6/19) Dec 01 2010 A string type should always maintain the invariant that it is a valid un...
- spir (22/42) Dec 01 2010
- stephan (19/21) Dec 01 2010 Hi Denis, you might want to consider helping us out.
- spir (49/80) Dec 01 2010 lization and the fact that a single "character" can be constructed from ...
- Ali Çehreli (44/56) Dec 03 2010 a true
- spir (63/131) Dec 03 2010 ion
- Steven Schveighoffer (10/37) Dec 01 2010 Copying is not a good idea, nor is runtime validation. We can only
- Steven Wawryk (14/174) Jan 12 2011 I like the direction you're taking but have some quibbles about details....
- Steven Schveighoffer (23/40) Jan 13 2011 Thanks for taking the time. I will respond to your points, but please
In a prior thread, I promised to create a narrow string type that would enforce the requirements of narrow strings. That is, the new string type should respect the encoding of narrow strings.

Here is a rough draft, tested minimally, but it does compile and pass simple tests. It's pretty simple, which is what I would expect. I copied a lot of stuff from std.array to get this to work.

The point of this type is: if we replace what the compiler considers "strings" with this type, then we get both the compiler *and* phobos agreeing as to what this type is: a bi-directional range of dchar. As a bonus, char[] and wchar[] would become plain arrays, manipulated consistently with other arrays. Used carelessly that could cause problems, but it also provides more flexibility and opportunity for performance, instead of the library fighting you on it.

Anyways, here it is, released under the boost license, commence attack ;)

// Written in the D programming language.
/**
Copyright: Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010
License:   <a href="http://www.boost.org/LICENSE_1_0.txt">Boost License 1.0</a>.
Authors:   $(WEB erdani.org, Andrei Alexandrescu, Steven Schveighoffer)

           Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010.
Distributed under the Boost Software License, Version 1.0.
   (See accompanying file LICENSE_1_0.txt or copy at
         http://www.boost.org/LICENSE_1_0.txt)
*/
import std.utf;
import std.traits;

struct string_t(T) if (is(Unqual!T == char) || is(Unqual!T == wchar))
{
    private T[] _data;

    this(T[] data)
    {
        this._data = data;
    }

    // note, this assumes that idx is a valid index
    private size_t _charStart(size_t idx) const
    {
        static if (is(Unqual!T == wchar))
        {
            immutable c = _data.ptr[idx];
            if (c >= 0xDC00 && c <= 0xDFFF)
            {
                // surrogate pair detected, verify we have at least 2 wchars,
                // and that both wchars are properly encoded.
                assert(idx > 0, "Invalid UTF character at beginning of string");
                return idx - 1;
            }
            else
                return idx;
        }
        else
        {
            const p = _data.ptr + idx;
            if ((p[0] & 0b1100_0000) != 0b1000_0000)
                return idx;
            else if (idx >= 1 && (p[-1] & 0b1100_0000) != 0b1000_0000)
                return idx - 1;
            else if (idx >= 2 && (p[-2] & 0b1100_0000) != 0b1000_0000)
                return idx - 2;
            else if (idx >= 3 && (p[-3] & 0b1100_0000) != 0b1000_0000)
                return idx - 3;
            else
                assert(false, "Invalid UTF character in string");
        }
    }

    void popFront()
    {
        auto nc = std.utf.stride(_data, 0);
        assert(nc <= _data.length && nc != 0xFF,
               "Invalid sequence at beginning of string");
        _data = _data[nc .. $];
    }

    void popBack()
    {
        immutable n = _data.length;
        assert(n, "Attempting to pop back of an empty string");
        _data = _data.ptr[0 .. _charStart(n - 1)];
    }

    @property dchar front() const
    {
        assert(_data.length, "Attempting to fetch the front of an empty string");
        size_t i = 0;
        return decode(_data, i);
    }

    @property dchar back() const
    {
        immutable n = _data.length;
        assert(n, "Attempting to fetch the back of an empty string");
        auto idx = _charStart(n - 1);
        return std.utf.decode(_data, idx);
    }

    @property bool empty() const
    {
        return !_data.length;
    }

    @property typeof(this) save()
    {
        return this;
    }

    // support read-only random access via code unit index.
    dchar opIndex(size_t idx)
    {
        idx = _charStart(idx);
        return std.utf.decode(_data, idx);
    }

    string_t opSlice()
    {
        return this;
    }

    string_t opSlice(size_t start, size_t end)
    {
        if (start != _data.length)
            start = _charStart(start);
        if (end != _data.length)
            end = _charStart(end);
        return string_t(_data[start .. end]);
    }

    // note we don't call this length because length can be assumed to be the
    // number of elements, which this isn't.
    @property size_t codeUnits() const
    {
        return _data.length;
    }

    // support append and concat
    // TODO: need to support appending various types of strings to each other
    // (right now only same-type strings can be appended, or raw arrays)
    ref string_t opOpAssign(string op, U)(U data)
        if (op == "~" && is(U == string_t))
    {
        _data ~= data._data;
        return this;
    }

    ref string_t opOpAssign(string op, U)(U data)
        if (op == "~" && !is(U == string_t) && is(typeof(_data ~= U.init)))
    {
        _data ~= data;
        return this;
    }

    string_t opBinary(string op, U)(U data)
        if (op == "~" && is(U == string_t))
    {
        return string_t(_data ~ data._data);
    }

    string_t opBinary(string op, U)(U data)
        if (op == "~" && !is(U == string_t) && is(typeof(_data ~ U.init)))
    {
        return string_t(_data ~ data);
    }
}

template string_t(T) if (is(Unqual!T == dchar))
{
    alias T[] string_t;
}

/** begin test code **/
import std.stdio;

alias string_t!(immutable char)  mystring;
alias string_t!(immutable wchar) mywstring;
alias string_t!(immutable dchar) mydstring;

void main()
{
    auto str = mystring("hello");
    str ~= " world";
    str ~= mystring("!!!");
    writeln(str._data);
}
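For anyone who wants to poke at the _charStart idea outside of D: it relies on the fact that UTF-8 continuation bytes all match the bit pattern 10xxxxxx, so backing up at most three bytes from any valid index always lands on the first code unit of a code point. Here is a minimal sketch of that logic in Python (the function name and test strings are mine, purely illustrative):

```python
def char_start(data: bytes, idx: int) -> int:
    """Back up from idx to the first code unit of the code point containing it.

    Mirrors the UTF-8 branch of _charStart: continuation bytes match
    0b10xxxxxx, so scan back at most 3 bytes for a non-continuation byte.
    """
    for back in range(4):
        if idx - back >= 0 and (data[idx - back] & 0b1100_0000) != 0b1000_0000:
            return idx - back
    raise AssertionError("Invalid UTF character in string")

s = "héllo".encode("utf-8")   # 'é' occupies code units 1 and 2
assert char_start(s, 2) == 1  # an index into the continuation byte backs up
assert char_start(s, 1) == 1  # the lead byte of 'é' is already a start
assert char_start(s, 0) == 0  # ASCII: every byte is a code point start
```

That bounded backward scan is also why opIndex and opSlice stay O(1): the loop runs at most four times regardless of string length.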
Nov 30 2010
On Tuesday, November 30, 2010 07:48:03 Steven Schveighoffer wrote:
> [the full original post, including the string_t code, quoted in its entirety]

1. At least until universal function syntax is in the language, you lose the ability to do str.func().

2. Functions that would work just fine treating strings as arrays of code units (presumably because they don't care about what the actual data is) lose efficiency, because now a string isn't an array.

3. You have no access to the underlying array unless you're dealing with an actual array of dchar.

4. Indexing is no longer O(1), which violates the guarantees of the index operator.

5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.

What you're doing here is forcing the view of a string as a range of dchar in all cases.
Granted, that's what you want in most cases, but it can degrade efficiency, and the fact that some operations (in particular indexing and slicing) are not O(1) like they're supposed to be means that algorithms which rely on O(1) behavior from them could increase their cost by an order of magnitude. All the cases where treating a string as an actual array is currently valid are left out to dry.

The inherent problem with strings is that we _want_ to be able to view them as both arrays of code units and as ranges of dchar. Trying to view them only as ranges of dchar is inefficient in a number of cases. Even something as simple as getting a substring is less efficient with this code. Only in cases where you don't actually know the length of the substring that you want is this essentially as efficient as what we have now. You're ignoring all cases where viewing a string as an array of code units is correct and desirable. You're only solving half of the problem (albeit the more prevalent half).

It could very well be that we need a better way to handle when a string is treated as a range of dchar and when it is treated as an array of code units, but trying to treat them as ranges of dchar in all cases is not the right answer. Now, making it possible to have a wrapper struct like this which works in many cases where you'd use strings could reduce errors in code in many situations, so giving the programmer the ability to do that could be interesting. But it seems to me that this solution is too limiting to be a viable replacement for strings as they are.

- Jonathan M Davis
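The two views Jonathan describes line up with Python's bytes/str split, which makes them easy to demonstrate concretely (this sketch is mine, not from the thread; it only illustrates the code unit vs. code point distinction):

```python
text = "naïve"                 # 5 code points
units = text.encode("utf-8")   # 6 UTF-8 code units: b'na\xc3\xafve'

assert len(text) == 5 and len(units) == 6

# The code-unit view has O(1) indexing, but an index can land inside a
# character: units[2] is the lead byte of 'ï', not a character of the text.
assert units[2] == 0xC3

# Slicing by code units at an arbitrary offset can split a code point:
try:
    units[:3].decode("utf-8")
    assert False, "unreachable: b'na\\xc3' is not valid UTF-8"
except UnicodeDecodeError:
    pass
```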
Nov 30 2010
On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com> wrote:1. At least until universal function syntax is in the language, you lose the ability to do str.func().Yes, but first, this is not a problem with the string type, it's a problem with the language. Second, any string-specific functions can be added, it is a struct after all, not a builtin.2. Functions that would work just fine treating strings as arrays of code units (presumably because they don't care about what the actual data is) lose efficiency, because now a string isn't an array.Which functions are those? They can be allowed via wrappers.3. You have no access to the underlying array unless you're dealing with an actual array of dchar.I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.4. Indexing is no longer O(1), which violates the guarantees of the index operator.Indexing is still O(1).5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.Slicing is still O(1).What you're doing here is forcing the view of a string as a range of dchar in all cases. Granted, that's what you want in most cases, but it can degrade efficiency, and the fact that some operations (in particular indexing and slicing) are not O(1) like they're supposed to be means that algorithms which rely on O(1) behavior from them could increase their cost by an order of magnitude. All the cases where treating a string as an actual array which are currently valid are left out to dryYou can still use char[] and wchar[].The inherent problem with strings is that we _want_ to be able to view them as both arrays of code units and as ranges of dchar.
Even something as simple as getting a substring is less efficient with this code. Only in cases where you don't actually know the length of the substring that you want is this essentially as efficient as what we have now. You're ignoring all cases where viewing a string as an array of code units is correct and desirable. You're only solving half of the problem (albeit the more prevalent half).I'm not planning on eliminating char and wchar based arrays. In other words, you should be able to get access to the array, but it should not be the default, and it should be considered unsafe.Now, making it possible to have a wrapper struct like this which works in many cases where you'd use strings could reduce errors in code in many situations, so giving the programmer the ability to do that could be interesting. But it seems to me that this solution is too limiting to be a viable replacement for strings as they are.Hopefully you can see that I'm not eliminating the functionality you are looking for, just making it not the default. -Steve
Nov 30 2010
On 11/30/2010 12:52 PM, Steven Schveighoffer wrote:On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M DavisThere definitely is value in being able to index and slice into utf strings without resulting in invalid utf, but I think the fact that it indexes on code unit and returns code point is sufficiently strange that it qualifies as abuse of operator overloading. That said, I like this proposal better than the current state of affairs. I'll try to play with it more on the weekend4. Indexing is no longer O(1), which violates the guarantees of the index operator.Indexing is still O(1).5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.Slicing is still O(1).
Nov 30 2010
On Tue, 30 Nov 2010 18:31:05 -0500, Ellery Newcomer <ellery-newcomer utulsa.edu> wrote:On 11/30/2010 12:52 PM, Steven Schveighoffer wrote:Maybe :) The other alternative is to throw an exception if you try to access a code unit that is not the beginning of a code point. That might actually be less weird, I'll try doing that on the next iteration. I find that iteration over string characters using index is a very rare thing anyways, you either use foreach, which should give you dchars, or you use something like find, which should never give you an invalid index. -SteveOn Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M DavisThere definitely is value in being able to index and slice into utf strings without resulting in invalid utf, but I think the fact that it indexes on code unit and returns code point is sufficiently strange that it qualifies as abuse of operator overloading.4. Indexing is no longer O(1), which violates the guarantees of the index operator.Indexing is still O(1).5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.Slicing is still O(1).
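The throw-on-bad-index alternative could behave roughly like the sketch below (Python for illustration only; the boundary test is the same 10xxxxxx continuation-byte check the D code uses, and all names here are hypothetical):

```python
def strict_index(data: bytes, idx: int) -> str:
    """Decode the code point starting exactly at idx, raising if idx is not a
    code point boundary (i.e. it points at a 10xxxxxx continuation byte)."""
    if (data[idx] & 0b1100_0000) == 0b1000_0000:
        raise IndexError(f"index {idx} is not on a code point boundary")
    # the sequence length follows from the lead byte's high bits
    b = data[idx]
    n = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
    return data[idx:idx + n].decode("utf-8")

s = "héllo".encode("utf-8")
assert strict_index(s, 0) == "h"
assert strict_index(s, 1) == "é"   # index of the lead byte of 'é'
try:
    strict_index(s, 2)             # continuation byte of 'é'
    assert False, "unreachable"
except IndexError:
    pass
```

The trade-off against the back-up behavior is exactly the strictness question Ellery raises: raising surfaces bugs early, backing up tolerates approximate indices.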
Dec 01 2010
On 12/01/2010 03:35 PM, Steven Schveighoffer wrote:On Tue, 30 Nov 2010 18:31:05 -0500, Ellery Newcomerin my mind, the problem isn't so much indexing an intermediate code unit gets you earlier code units (it's a little strange, and I'm not sure whether greater strictness would be better - on the one hand, less strictness would be more tolerant of bugs and make it that much more difficult to detect them, but on the other hand if you were doing something like getting a random or approximate slice into your string, less strictness would mean that much less annoyance, though I have no idea why you would want to do that) as it is just the difference between the two and the confusion that it's bound to cause the noobies.There definitely is value in being able to index and slice into utf strings without resulting in invalid utf, but I think the fact that it indexes on code unit and returns code point is sufficiently strange that it qualifies as abuse of operator overloading.Maybe :) The other alternative is to throw an exception if you try to access a code unit that is not the beginning of a code point. That might actually be less weird, I'll try doing that on the next iteration.I find that iteration over string characters using index is a very rare thing anyways, you either use foreach, which should give you dchars, or you use something like find, which should never give you an invalid index. -Stevefind was the counterargument I had in mind for keeping the operator overload, as something like s[find(s,'\u2729') .. s.codeUnits] is just a bit better than s.codePointSliceAt(find(s,'\u2729'), s.codeUnits); I really don't know. One thing that strikes me, though, if you're going to keep opIndex, is that being able to do foreach(size_t codeuniti, dchar c; s){ } would be nice. Actually, it looks like you can do that with current strings.
Dec 01 2010
On Wed, 01 Dec 2010 20:13:35 -0600 Ellery Newcomer <ellery-newcomer utulsa.edu> wrote:One thing that strikes me, though, if you're going to keep opIndex, is that being able to do foreach(size_t codeuniti, dchar c; s){ }This would yield several times the same code point / dchar, or do i misinterpret? Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Dec 02 2010
On Thu, 02 Dec 2010 05:08:54 -0500, spir <denis.spir gmail.com> wrote:On Wed, 01 Dec 2010 20:13:35 -0600 Ellery Newcomer <ellery-newcomer utulsa.edu> wrote:It would only yield the code units that are the start of a code point. So if the code point that starts at the beginning of the array is 2 code units long, you would iterate 0, 2, 3, ... -SteveOne thing that strikes me, though, if you're going to keep opIndex, is that being able to do foreach(size_t codeuniti, dchar c; s){ }This would yield several times the same code point / dchar, or do i misinterpret?
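Steve's 0, 2, 3, ... sequence can be checked mechanically: the indices such a foreach would visit are exactly the positions of the non-continuation code units. A small Python sketch (illustrative, not from the thread):

```python
def start_indices(data: bytes) -> list[int]:
    """Code unit indices that begin a code point (non-continuation bytes)."""
    return [i for i, b in enumerate(data) if (b & 0b1100_0000) != 0b1000_0000]

s = "éab".encode("utf-8")            # the first code point is 2 code units long
assert start_indices(s) == [0, 2, 3]  # the sequence Steve describes
assert start_indices("abc".encode("utf-8")) == [0, 1, 2]  # pure ASCII
```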
Dec 02 2010
On Wed, 01 Dec 2010 21:13:35 -0500, Ellery Newcomer <ellery-newcomer utulsa.edu> wrote:On 12/01/2010 03:35 PM, Steven Schveighoffer wrote:Yes, it does seem odd, but then again, how often do you need the individual characters of a string? I wrote php code for about 6 months as a full time job before I found I needed to access individual characters, and then I had to look up how to do it :) It's just not a common thing. Typically, the index you use is calculated from something like find, and you don't care what it is, as long as it's storable and persistent.On Tue, 30 Nov 2010 18:31:05 -0500, Ellery Newcomerin my mind, the problem isn't so much indexing an intermediate code unit gets you earlier code units (it's a little strange, and I'm not sure whether greater strictness would be better - on the one hand, less strictness would be more tolerant of bugs and make it that much more difficult to detect them, but on the other hand if you were doing something like getting a random or approximate slice into your string, less strictness would mean that much less annoyance, though I have no idea why you would want to do that) as it is just the difference between the two and the confusion that it's bound to cause the noobies.There definitely is value in being able to index and slice into utf strings without resulting in invalid utf, but I think the fact that it indexes on code unit and returns code point is sufficiently strange that it qualifies as abuse of operator overloading.Maybe :) The other alternative is to throw an exception if you try to access a code unit that is not the beginning of a code point. That might actually be less weird, I'll try doing that on the next iteration.Ugh, yes, and actually, that reminds me I should define opDollar.I find that iteration over string characters using index is a very rare thing anyways, you either use foreach, which should give you dchars, or you use something like find, which should never give you an invalid index. 
-Stevefind was the counterargument I had in mind for keeping the operator overload, as something like s[find(s,'\u2729') .. s.codeUnits] is just a bit better than s.codePointSliceAt(find(s,'\u2729'), s.codeUnits); I really don't know.One thing that strikes me, though, if you're going to keep opIndex, is that being able to do foreach(size_t codeuniti, dchar c; s){ } would be nice. Actually, it looks like you can do that with current strings.At this point, you can't do that except via opApply, and I didn't want to inject that in fear that it would be pointed out as a drawback. It would be nice if we could define a way to do that via ranges... -Steve
Dec 02 2010
On Thu, 02 Dec 2010 16:24:03 -0500 "Steven Schveighoffer" <schveiguy yahoo.com> wrote:Yes, it does seem odd, but then again, how often do you need the individual characters of a string? I wrote php code for about 6 months as a full time job before I found I needed to access individual characters, and then I had to look up how to do it :) It's just not a common thing.Well, I wonder whether I have ever written any piece of code without string indexing ;-) Surely depends on your app domain(s)... Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Dec 03 2010
On Fri, 03 Dec 2010 11:15:36 -0500, spir <denis.spir gmail.com> wrote:On Thu, 02 Dec 2010 16:24:03 -0500 "Steven Schveighoffer" <schveiguy yahoo.com> wrote:Indexing a string is rare, unless you are parsing something (yes it does truly depend on the domain), but even rarer is indexing a string based on a hard-coded value. Really, this last case is where the type would behave in a confusing way. Consider that D's current's strings have even more of a limited means of indexing -- you can only validly index a string using an index where a code-point is encoded as a single code-unit. Accessing other indexes (even ones that start a multi-unit code-point) results in invalid characters. -SteveYes, it does seem odd, but then again, how often do you need the individual characters of a string? I wrote php code for about 6 months as a full time job before I found I needed to access individual characters, and then I had to look up how to do it :) It's just not a common thing.Well, I wonder whether I have ever written any piece of code without string indexing ;-) Surely depends on your app domain(s)...
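The pitfall Steve describes in today's char[] (plain indexing returns a raw code unit, which is only a whole character when the code point is a single unit) shows up immediately on raw UTF-8 bytes. A Python sketch of the same effect, not from the thread:

```python
s = "héllo".encode("utf-8")
assert chr(s[0]) == "h"    # single-unit code point: the index is a character
assert chr(s[1]) != "é"    # s[1] is 0xC3, the lead byte of 'é': not 'é'
assert chr(s[2]) != "é"    # s[2] is 0xA9, a continuation byte: also not 'é'
```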
Dec 03 2010
On 03/12/2010 18:29, Steven Schveighoffer wrote:On Fri, 03 Dec 2010 11:15:36 -0500, spir <denis.spir gmail.com> wrote: Indexing a string is rare, unless you are parsing something (yes it does truly depend on the domain),In my XP you never index *especially* when parsing. Every parser I've written & used treats input as a character stream. XML & D for instance permit code points > 0x7F, which means you must decode them properly when using utf-8, even if you are going to be storing the input back as utf-8. -- My enormous talent is exceeded only by my outrageous laziness. http://www.ssTk.co.uk
Dec 03 2010
Steven Schveighoffer Wrote:On Fri, 03 Dec 2010 11:15:36 -0500, spir <denis.spir gmail.com> wrote:I tend to do a lot of transforming strings, but I need to track offsets back to the original text to maintain alignment between the results and the input. For that, indexes are necessary and we use them a lot. Probably the right thing to do in this case is just pay for the cost of using dchar everywhere, but if you're working with large enough quantities of data, storage efficiency matters. JerryOn Thu, 02 Dec 2010 16:24:03 -0500 "Steven Schveighoffer" <schveiguy yahoo.com> wrote:Indexing a string is rare, unless you are parsing something (yes it does truly depend on the domain), but even rarer is indexing a string based on a hard-coded value. Really, this last case is where the type would behave in a confusing way.Yes, it does seem odd, but then again, how often do you need the individual characters of a string? I wrote php code for about 6 months as a full time job before I found I needed to access individual characters, and then I had to look up how to do it :) It's just not a common thing.Well, I wonder whether I have ever written any piece of code without string indexing ;-) Surely depends on your app domain(s)...
Dec 03 2010
On Fri, 03 Dec 2010 14:40:30 -0500, Jerry Quinn <jlquinn optonline.net> wrote:I tend to do a lot of transforming strings, but I need to track offsets back to the original text to maintain alignment between the results and the input. For that, indexes are necessary and we use them a lot.In my daily usage of strings, I generally use a string as a whole, not individual characters. But I do occasionally use it. Let's also understand that indexing is still present; what is deactivated is the ability to index to arbitrary code-units. It sounds to me like this new type would not affect your ability to store offsets (you can store an index, use it later when referring to the string, etc. just like you can now). My string type does not allow for writeable strings. My plan was to allow you access to the underlying char[] and let you edit that way. Letting someone write a dchar into the middle of a utf-8 string could cause lots of problems, so I just disabled it by default. Not sure how that affects your 'transforming' work: are you actually changing the data, or just lazily transforming? I'm interested to hear whether you think my string type would be a viable alternative.Probably the right thing to do in this case is just pay for the cost of using dchar everywhere, but if you're working with large enough quantities of data, storage efficiency matters.The huge advantage of using utf-8 is backwards compatibility with ASCII for C functions. -Steve
Dec 03 2010
Steven Schveighoffer Wrote:On Fri, 03 Dec 2010 14:40:30 -0500, Jerry Quinn <jlquinn optonline.net> wrote:That's reasonable. I"m not trying to create invalid strings. I'm actually working in C++ but keeping an eye on things going on in D-land. The kind of stuff we do is to normalize text in preparation for natural language processing. As a simple example, let's say you want to use a set of regexes to identify patterns in text. You want to return the offsets of each regex that matches. However, before the regexes run, you replace all html tags with a placeholder, so they can easily span tags without worrying about the contents. Before you return the results to the user, though, you need to translate those offsets back to ones for the original string. Everything is unicode of course and we care about processing unicode code points, but want to maintain UTF-8 storage underneath to keep size down. In reality, we're often doing things like single character normalizations as well as larger spans, but still need to maintain alignment to the original data. As long as this is reasonable to do, I'm fine. I just wasn't sure from the descriptions I was seeing. JerryI tend to do a lot of transforming strings, but I need to track offsets back to the original text to maintain alignment between the results and the input. For that, indexes are necessary and we use them a lot.In my daily usage of strings, I generally use a string as a whole, not individual characters. But I do occasionally use it. Let's also understand that indexing is still present, what is deactivated is the ability to index to arbitrary code-units. It sounds to me like this new type would not affect your ability to store offsets (you can store an index, use it later when referring to the string, etc. just like you can now). My string type does not allow for writeable strings. My plan was to allow you access to the underlying char[] and let you edit that way. 
Letting someone write a dchar into the middle of a utf-8 string could cause lots of problems, so I just disabled it by default.
Dec 03 2010
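Jerry's offset-alignment requirement is independent of the string type's indexing rules; here is a hedged sketch of the idea (function and variable names invented here, not from any real codebase) that records, for each position in the normalized text, the offset it came from in the original:

```d
import std.string : indexOf;

// Sketch: replace each HTML tag with a one-byte placeholder while
// building a map from normalized positions back to original offsets.
void normalizeTags(const(char)[] input, out char[] output, out size_t[] toOrig)
{
    size_t i = 0;
    while (i < input.length)
    {
        if (input[i] == '<')
        {
            auto close = indexOf(input[i .. $], '>');
            if (close >= 0)
            {
                output ~= '\x01';  // placeholder standing in for the tag
                toOrig ~= i;       // placeholder maps to the tag's start
                i += close + 1;
                continue;
            }
        }
        output ~= input[i];  // ordinary byte copied through unchanged
        toOrig ~= i;
        ++i;
    }
}
```

A regex match at normalized offsets [a, b) then maps back to roughly original offsets toOrig[a] through toOrig[b - 1]; the mapping works the same whether the offsets are plain array indices or the proposed type's code-unit indices.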
On Fri, 03 Dec 2010 22:08:37 -0500, Jerry Quinn <jlquinn optonline.net> wrote:I'm actually working in C++ but keeping an eye on things going on in D-land. The kind of stuff we do is to normalize text in preparation for natural language processing. As a simple example, let's say you want to use a set of regexes to identify patterns in text. You want to return the offsets of each regex that matches. However, before the regexes run, you replace all html tags with a placeholder, so they can easily span tags without worrying about the contents.I'm assuming you are not changing the length of the string, or is that not correct?Before you return the results to the user, though, you need to translate those offsets back to ones for the original string.Hm... I guess you must be changing the lengths if the offsets are different. That seems odd, wouldn't you encounter performance issues when processing large documents?Everything is unicode of course and we care about processing unicode code points, but want to maintain UTF-8 storage underneath to keep size down. In reality, we're often doing things like single character normalizations as well as larger spans, but still need to maintain alignment to the original data. As long as this is reasonable to do, I'm fine. I just wasn't sure from the descriptions I was seeing.What you will have is access to the underlying char[] array, which should give you full edit access. I just don't want strings to be easily editable since doing so can be very difficult. Any offsets to dchar code-points in the string will match offsets to char code-units. In effect, you are always indexing by code-unit, even though with the string type you get code-points back. It should be as simple as accessing a member (like str.data) or casting (i.e. cast(char[])str). I'm unsure yet if it's dangerous enough to require casting. -Steve
Dec 04 2010
On Thu, 02 Dec 2010 16:24:03 -0500 "Steven Schveighoffer" <schveiguy yahoo.com> wrote:At this point, you can't do that except via opApply, and I didn't want to inject that in fear that it would be pointed out as a drawback. It would be nice if we could define a way to do that via ranges...According to TDPL p.381, one could define slicing instead of opApply or range methods (since a slice is "rangeable", the compiler should take a slice if nothing else is present) (not tried, yet). Your string type is, indeed, a typical case where this would perfectly make sense. Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Dec 03 2010
On 02/12/10 12:43, Ellery Newcomer wrote:On 12/01/2010 03:35 PM, Steven Schveighoffer wrote:I don't like either. I'd prefer (from std.algorithm) s = find(s, '\u2729'); I still don't see a need for any random-access methods, including slicing, on the code-point range. The ability to convert back and forth between the string_t and T[] would be useful to allow the user to choose between random-access code-unit range and bidirectional code-point range (or grapheme/glyph/etc). For example a T[] accessor property and an opAssign from a T[]. So if there are efficiencies to be had with random-access (and slicing), then the user could choose s = find(s.codeUnits, '\u2729'); Are there any other reasons to slice on a code-point range? Or have any capabilities beyond a bidirectional range? Unless there are compelling reasons, I'd have to say that slicing and indexing on the code-point level is not much more than a hack.I find that iteration over string characters using index is a very rare thing anyways, you either use foreach, which should give you dchars, or you use something like find, which should never give you an invalid index. -Stevefind was the counterargument I had in mind for keeping the operator overload, as something like s[find(s,'\u2729') .. s.codeUnits] is just a bit better than s.codePointSliceAt(find(s,'\u2729'), s.codeUnits); I really don't know.
Jan 12 2011
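For comparison, both views already compose with std.algorithm.find in today's Phobos; a small illustration using the current char[]-based string (the behaviour the posters are discussing generalizing):

```d
import std.algorithm : find;

void main()
{
    string s = "abc \u2729 def";
    // Code-point view: find walks the string as a range of dchar and
    // returns the remainder, never a position inside a code point.
    auto rest = find(s, '\u2729');
    // An O(1) code-unit slice recovering the same tail:
    auto tail = s[s.length - rest.length .. $];
    assert(rest == tail);
}
```

The `s.length - rest.length` arithmetic is exactly the kind of code-unit offset a codeUnits accessor would expose on the proposed type.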
s = find(s.codeUnits, '\u2729');Sorry, I didn't mean the codeUnits returning a size_t. I meant an accessor for the underlying code-unit range. Something like @property T[] codeUnits() { return _data; }
Jan 12 2011
On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com> wrote: [...]It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now. On a side note: It seems to me that the only reason to have char, wchar, and dchar as separate types in the language is that arrays of said types are UTF-encoded strings. If a type such as the proposed one were to become the default string type in D, it might as well wrap an array of ubyte/ushort/uint, since direct user manipulation of the underlying array will generally only happen in the rare cases when one wants to deal directly with code units. -Lars4. Indexing is no longer O(1), which violates the guarantees of the index operator.Indexing is still O(1).5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.Slicing is still O(1). [...]
Nov 30 2010
On Tue, 30 Nov 2010 23:34:11 +0000 (UTC) "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> wrote:On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote: If I understand correctly how _charStart works in combination with indexing and slicing, then here is something wrong in the type's interface. After auto str = mystring("hæ?"); Either one provides a code unit index and gets a code unit: writeln(str[1], " ", str[2]); // "� �" (invalid utf code points) Or one provides a code point index and gets a code point: writeln(str[1], " ", str[2]); // "æ ?" But for string manipulation, wouldn't it be better that your string type systematically wraps a dchar[] array, whatever the original encoding? For indexing, slicing, finding, counting, etc... to be fast, I mean. Decoding being done only once at string creation time.On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com> wrote: [...] It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.4. Indexing is no longer O(1), which violates the guarantees of the index operator. Indexing is still O(1). 5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator. Slicing is still O(1). [...]On a side note: It seems to me that the only reason to have char, wchar, and dchar as separate types in the language is that arrays of said types are UTF-encoded strings. If a type such as the proposed one were to become the default string type in D, it might as well wrap an array of ubyte/ushort/uint, since direct user manipulation of the underlying array will generally only happen in the rare cases when one wants to deal directly with code units.Yes, but then, see remark above. 
Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Dec 01 2010
On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad <public kyllingen.nospamnet> wrote:On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com> wrote: [...]It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.4. Indexing is no longer O(1), which violates the guarantees of the index operator.Indexing is still O(1).5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.Slicing is still O(1). [...]On a side note: It seems to me that the only reason to have char, wchar, and dchar as separate types in the language is that arrays of said types are UTF-encoded strings. If a type such as the proposed one were to become the default string type in D, it might as well wrap an array of ubyte/ushort/uint, since direct user manipulation of the underlying array will generally only happen in the rare cases when one wants to deal directly with code units.I'd still want a char[] array for easy manipulation and eventual printing. Wrapping a ubyte[] with a string just to print would be strange. My goal in this exercise is to try and give control of what to deal with (code-points vs. code-units) back to the user. 
Right now, the library forces you to view them as code-points, and the compiler forces you to view them as code-units (except via foreach). But it is an interesting idea (removing small char types). -Steve
Dec 01 2010
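The throw-on-invalid-index behaviour Steve settles on could look something like this; charAt and the exact exception type are assumptions for illustration, not the actual implementation:

```d
import std.utf : decode, UTFException;

// Sketch: reject an index that points into the middle of a UTF-8
// sequence instead of silently rewinding or decoding garbage.
dchar charAt(const(char)[] data, size_t i)
{
    // UTF-8 continuation bytes have the bit pattern 0b10xxxxxx.
    if ((data[i] & 0xC0) == 0x80)
        throw new UTFException("index is not the start of a code point");
    size_t idx = i;
    return decode(data, idx);  // decodes the code point starting at i
}
```

The continuation-byte test is O(1), so indexing keeps its constant-time guarantee while invalid positions fail loudly.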
On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad <public kyllingen.nospamnet> wrote:As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics. I think it's great that you're doing this, by the way! I haven't made up my mind yet about whether I want char[] or a separate string type, but it is great to have an actual implementation of the latter at hand when debating it. -LarsOn Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com> wrote: [...]It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.4. Indexing is no longer O(1), which violates the guarantees of the index operator.Indexing is still O(1).5. 
Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.Slicing is still O(1). [...]
Dec 01 2010
On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad <public kyllingen.nospamnet> wrote:On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:Then substring (slicing) becomes an O(n) operation. It just doesn't work well. It seems to be awkward at first thought, but the more I think about it, the more I think it's right. When do you ever depend on specific indexes in a string being valid, or to be incrementing always by 1?On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad <public kyllingen.nospamnet> wrote:As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com> wrote: [...]It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.4. 
Indexing is no longer O(1), which violates the guarantees of the index operator.Indexing is still O(1).5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.Slicing is still O(1). [...]I think it's great that you're doing this, by the way! I haven't made up my mind yet about whether I want char[] or a separate string type, but it is great to have an actual implementation of the latter at hand when debating it.Thanks :) -Steve
Dec 02 2010
On Thu, 02 Dec 2010 16:18:52 -0500, Steven Schveighoffer wrote:On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad <public kyllingen.nospamnet> wrote:What I meant wast that opSlice() should be disabled in the same way as opIndex(). What you have now is struct string_t(T) { dchar opIndex(size_t); // Must be O(1) string_t opSlice(size_t, size_t); // Must be O(1) } and what I'm suggesting is struct string_t(T) { dchar character(size_t); // May be O(n) string_t substring(size_t, size_t); // May be O(n) static if (is(T == dchar)) { alias character opIndex; // Must be O(1) alias substring opSlice; // Must be O(1) } }On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:Then substring (slicing) becomes an O(n) operation. It just doesn't work well.On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad <public kyllingen.nospamnet> wrote:As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. 
Or to use a for-loop to access characters, etc.On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com> wrote: [...]It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.4. Indexing is no longer O(1), which violates the guarantees of the index operator.Indexing is still O(1).5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.Slicing is still O(1). [...]It seems to be awkward at first thought, but the more I think about it, the more I think it's right. When do you ever depend on specific indexes in a string being valid, or to be incrementing always by 1?That is a good question indeed. But I'm still not convinced. :) Another thing to consider is that by having opIndex() in there, your string satisfies isRandomAccessRange. Then, some algorithms which work with both bidirectional and random access may choose to go with the latter. This is a quote from the std.algorithm.find() docs: Specializations taking advantage of bidirectional or random access (where present) may accelerate search [...] -Lars
Dec 03 2010
On Fri, 03 Dec 2010 08:13:50 -0500, Lars T. Kyllingstad <public kyllingen.nospamnet> wrote:On Thu, 02 Dec 2010 16:18:52 -0500, Steven Schveighoffer wrote:O(n) slicing and indexing, called by different names, is unacceptable. It is possible to have O(1) indexing and slicing, I don't see why we should disallow it. If nothing else, it cripples this implementation to the point where it won't be accepted, simply because it performs worse than the current solution. Now, if one wants to have that feature in *addition* to O(1) indexing/slicing, I don't think that's an issue.On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad <public kyllingen.nospamnet> wrote:What I meant wast that opSlice() should be disabled in the same way as opIndex(). What you have now is struct string_t(T) { dchar opIndex(size_t); // Must be O(1) string_t opSlice(size_t, size_t); // Must be O(1) } and what I'm suggesting is struct string_t(T) { dchar character(size_t); // May be O(n) string_t substring(size_t, size_t); // May be O(n) static if (is(T == dchar)) { alias character opIndex; // Must be O(1) alias substring opSlice; // Must be O(1) } }On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:Then substring (slicing) becomes an O(n) operation. It just doesn't work well.On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad <public kyllingen.nospamnet> wrote:As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? 
Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com> wrote: [...]It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.4. Indexing is no longer O(1), which violates the guarantees of the index operator.Indexing is still O(1).5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.Slicing is still O(1). [...]I just looked it up, isRandomAccess requires hasLength, which my string type does not. So my string type is not a true random access range (this was intentional). -SteveIt seems to be awkward at first thought, but the more I think about it, the more I think it's right. When do you ever depend on specific indexes in a string being valid, or to be incrementing always by 1?That is a good question indeed. But I'm still not convinced. :) Another thing to consider is that by having opIndex() in there, your string satisfies isRandomAccessRange. Then, some algorithms which work with both bidirectional and random access may choose to go with the latter. 
This is a quote from the std.algorithm.find() docs: Specializations taking advantage of bidirectional or random access (where present) may accelerate search [...]
Dec 03 2010
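Steve's point about hasLength can be checked directly; a minimal hypothetical range with opIndex but no length property fails isRandomAccessRange while still passing isBidirectionalRange:

```d
import std.range : isBidirectionalRange, isRandomAccessRange;

// Minimal bidirectional range with indexing but no length property.
struct NoLength
{
    dchar front() { return 'a'; }
    dchar back() { return 'a'; }
    bool empty() { return true; }
    void popFront() {}
    void popBack() {}
    NoLength save() { return this; }
    dchar opIndex(size_t) { return 'a'; }
}

static assert(isBidirectionalRange!NoLength);
static assert(!isRandomAccessRange!NoLength); // no length => not random-access
```

So algorithms that specialize on isRandomAccessRange would fall back to the bidirectional path for the proposed string type, as intended.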
On Fri, 03 Dec 2010 08:59:52 -0500, Steven Schveighoffer wrote:On Fri, 03 Dec 2010 08:13:50 -0500, Lars T. Kyllingstad <public kyllingen.nospamnet> wrote:Ah, you are right. I didn't read the documentation for isRandomAccessRange, I just looked at the accompanying code snippet, which doesn't mention hasLength. -LarsAnother thing to consider is that by having opIndex() in there, your string satisfies isRandomAccessRange. Then, some algorithms which work with both bidirectional and random access may choose to go with the latter. This is a quote from the std.algorithm.find() docs: Specializations taking advantage of bidirectional or random access (where present) may accelerate search [...]I just looked it up, isRandomAccess requires hasLength, which my string type does not. So my string type is not a true random access range (this was intentional).
Dec 03 2010
On Friday, December 03, 2010 05:13:50 Lars T. Kyllingstad wrote:On Thu, 02 Dec 2010 16:18:52 -0500, Steven Schveighoffer wrote:On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad <public kyllingen.nospamnet> wrote:On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad <public kyllingen.nospamnet> wrote:On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com> wrote: [...] It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.4. Indexing is no longer O(1), which violates the guarantees of the index operator. Indexing is still O(1). 5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator. Slicing is still O(1). [...]A string type without slicing (which must be O(1)) is DOA without question. Slicing is _far_ too useful to lose. Indexing in strings is fairly rare because it's generally a stupid idea, but slicing happens all the time. If nothing else, that is _the_ way to get a substring. - Jonathan M Davis What I meant was that opSlice() should be disabled in the same way as opIndex(). Then substring (slicing) becomes an O(n) operation. It just doesn't work well. I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. 
Or to use a for-loop to access characters, etc. As soon as you add opIndex(), your interface becomes that of a random-access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.
Dec 03 2010
On Tuesday, November 30, 2010 10:52:20 Steven Schveighoffer wrote:On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com> wrote:I really don't think that we want to have to be adding string functions to a string struct. Having them external is far better, particularly when there are so many of them which aren't directly related to strings but use them heavily. Even if all of std.string got tacked onto a struct string type, that would leave out the rest of Phobos and any user code. I really think that universal function syntax is a necessity for a struct solution to be acceptable. I had considered a solution similar to this one a few months back, and the lack of universal function syntax is one of the reasons why I decided that it wasn't really an improvement. Honestly, without uniform function call syntax, I would consider a struct solution to be DOA. The loss in useability would just be too great.1. At least until universal function syntax is in the language, you use the ability to do str.func().Yes, but first, this is not a problem with the string type, it's a problem with the language. Second, any string-specific functions can be added, it is a struct after all, not a builtin.It is my understanding that there are various functions in std.algorithm which are able to treat strings as arrays and therefore process them more efficiently. I haven't looked into which ones are in that camp. I would think that find() might be, but I'd have to look.2. Functions that would work just fine treating strings as arrays of code units (presumably because they don't care about what the actual data is) lose efficiency, because now a string isn't an array.Which functions are those? They can be allowed via wrappers.Something like that would need to be done. It really should be a public property of some kind if not an outright public variable.3. 
You have no access to the underlying array unless you're dealing with an actual array of dchar.I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.You're right. I misread what _charStart() did. However, if I understand it correctly now, you give it a code unit index and yet get a code point back. That worries me. It means that there is no relation between the length of the string and the indices that you'd use to index it. It _is_ related to the number of code units, which you can get with codeUnits(), but that disjoint seems like it could cause problems. It might be the correct solution, but I think that it merits some examination. Returning a code unit would be wrong since that totally breaks with the rest of the API dealing in dchar/code units, and it would be wrong to index by code point, since then indexing and slicing are no longer O(1). So, it's probably a choice between indexing by one thing and returning another or not having indexing and slicing, which would definitely not be good. So, maybe your solution is the best one in this respect, but it worries me. The exact ramifications of that need to be looked into.4. Indexing is no longer O(1), which violates the guarantees of the index operator.Indexing is still O(1).5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.Slicing is still O(1).Except that what if you need to do both with the same type? Right now, you could have a function which treats a string as a range of dchar while another one which can get away with treating it as a range of code units can treat it as an array. You can pass the same string to both, and it works. That should still work if we go for a struct solution. 
Special-casing on strings and specifically using the internal array instead of the struct for them could fix the problem, but it still needs to be possible.What you're doing here is forcing the view of a string as a range of dchar in all cases. Granted, that's what you want in most cases, but it can degrade efficiency, and the fact that some operations (in particular indexing and slicing) are not O(1) like they're supposed to be means that algorithms which rely on O(1) behavior from them could increase their cost by an order of magnitude. All the cases where treating a string as an actual array is currently valid are left out to dry.You can still use char[] and wchar[].There is definitely some value in making strings treated as ranges of dchar by default. But for the most part, that's the case already thanks to how std.array is defined. The only place where that runs into trouble is if you use foreach. It still treats them as arrays of code units unless you tell it to iterate over dchar. Either making foreach over character arrays iterate over dchar by default or making it a warning or error to use foreach with a char or wchar array of any kind without specifying the type to iterate over would fix that problem. I don't think that the basic idea of what you present is necessarily a bad idea (I've considered proposing similar solutions in the past), but I'm not at all convinced that it's an improvement. What we have works quite well overall, with the notable exception of foreach. I'm also worried that trying to bury the fact that strings are arrays of code units is going to cause a leaky abstraction which will cause problems. It's not likely to be as bad as the situation with char literals and std::string in C++, but I think that we need to look at this very carefully and examine what all of the ramifications are before we even consider going for it. There is an inherent but necessary disjoint between having strings be arrays of code units and ranges of dchar. 
Sometimes they need to be treated as one, sometimes as the other. Ideally, the default - whichever it is - would be the one which leads to fewer bugs. But they both need to be there. A struct solution is essentially an attempt to make strings treated as ranges of dchar in more situations by default than is currently the case. As such, it could be better than what we have now, but I'm honestly not convinced that it is. Aside from the foreach problem (which could be fixed with an appropriate warning or error - preferably error), what we have works quite well. - Jonathan M DavisThe inherent problem with strings is that we _want_ to be able to view them as both arrays of code units and as ranges of dchar. Trying to view them only as ranges of dchar is inefficient in a number of cases. Even something as simple as getting a substring is less efficient with this code. Only in cases where you don't actually know the length of the substring that you want is this essentially as efficient as what we have now. You're ignoring all cases where viewing a string as an array of code units is correct and desirable. You're only solving half of the problem (albeit the more prevalent half).I'm not planning on eliminating char and wchar based arrays. In other words, you should be able to get access to the array, but it should not be the default, and it should be considered unsafe.Now, making it possible to have a wrapper struct like this which works in many cases where you'd use strings could reduce errors in code in many situations, so giving the programmer the ability to do that could be interesting. But it seems to me that this solution is too limiting to be a viable replacement for strings as they are.Hopefully you can see that I'm not eliminating the functionality you are looking for, just making it not the default.
Nov 30 2010
On Tue, 30 Nov 2010 18:34:56 -0500, Jonathan M Davis <jmdavisProg gmx.com> wrote:

On Tuesday, November 30, 2010 10:52:20 Steven Schveighoffer wrote:

Why? It's a choice between adding them to the struct, and doing func(str) instead. You already have to do func(R) for arbitrary ranges, I don't see why strings should be specialized. For things that should be considered 'built-in' for strings, they can be included as members of the struct. In other words, I think it's worth fixing strings now, even if it means we cannot call str.func() until the universal syntax is introduced.

On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com> wrote:

I really don't think that we want to have to be adding string functions to a string struct. Having them external is far better, particularly when there are so many of them which aren't directly related to strings but use them heavily. Even if all of std.string got tacked onto a struct string type, that would leave out the rest of Phobos and any user code. I really think that universal function syntax is a necessity for a struct solution to be acceptable.

1. At least until universal function syntax is in the language, you lose the ability to do str.func().

Yes, but first, this is not a problem with the string type, it's a problem with the language. Second, any string-specific functions can be added, it is a struct after all, not a builtin.

I had considered a solution similar to this one a few months back, and the lack of universal function syntax is one of the reasons why I decided that it wasn't really an improvement. Honestly, without uniform function call syntax, I would consider a struct solution to be DOA. The loss in usability would just be too great.

But you can still call the function, it's just func(str). There is no loss of functionality.
Usability isn't even really lost.

Since std.algorithm works exclusively with ranges, those must be special cases for strings because strings are bi-directional ranges of dchar according to phobos. So those can continue to be special cases for the new string types.

It is my understanding that there are various functions in std.algorithm which are able to treat strings as arrays and therefore process them more efficiently. I haven't looked into which ones are in that camp. I would think that find() might be, but I'd have to look.

2. Functions that would work just fine treating strings as arrays of code units (presumably because they don't care about what the actual data is) lose efficiency, because now a string isn't an array.

Which functions are those? They can be allowed via wrappers.

How many times do you use a hard-coded index into a string without knowing the encoding of the string (i.e. I know this string is ascii)? How many times do you iterate the characters of a string via an incrementing index? It's not common to use the indexing operation with values that are not known or computed to be valid starts to code-points. In fact, the language depends on that (otherwise you'd see sliced strings everywhere with invalid data). So while it looks strange, it shouldn't be a common need. That being said, I think Lars pointed out that the strangeness of returning the code point even if you point in the middle would be surprising in some cases, so I think the better solution is to throw an exception.

You're right. I misread what _charStart() did. However, if I understand it correctly now, you give it a code unit index and yet get a code point back. That worries me. It means that there is no relation between the length of the string and the indices that you'd use to index it. It _is_ related to the number of code units, which you can get with codeUnits(), but that disjoint seems like it could cause problems.
It might be the correct solution, but I think that it merits some examination. Returning a code unit would be wrong since that totally breaks with the rest of the API dealing in dchar/code units, and it would be wrong to index by code point, since then indexing and slicing are no longer O(1). So, it's probably a choice between indexing by one thing and returning another or not having indexing and slicing, which would definitely not be good. So, maybe your solution is the best one in this respect, but it worries me. The exact ramifications of that need to be looked into.

4. Indexing is no longer O(1), which violates the guarantees of the index operator.

Indexing is still O(1).

5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.

Slicing is still O(1).

Yeah, it's definitely needed. I'll add access to the data member in the next version.

Except that what if you need to do both with the same type? Right now, you could have a function which treats a string as a range of dchar while another one which can get away with treating it as a range of code units can treat it as an array. You can pass the same string to both, and it works. That should still work if we go for a struct solution. Special-casing on strings and specifically using the internal array instead of the struct for them could fix the problem, but it still needs to be possible.

What you're doing here is forcing the view of a string as a range of dchar in all cases. Granted, that's what you want in most cases, but it can degrade efficiency, and the fact that some operations (in particular indexing and slicing) are not O(1) like they're supposed to be means that algorithms which rely on O(1) behavior from them could increase their cost by an order of magnitude. All the cases where treating a string as an actual array which are currently valid are left out to dry.

You can still use char[] and wchar[].

or indexing.
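For reference, the boundary snapping being debated here can be sketched in a few lines. This is an illustrative Python analogue over raw UTF-8 bytes (char_start is a hypothetical stand-in for the struct's private _charStart): continuation units match the bit pattern 0b10xxxxxx, so finding the start of a code point means backing up past them.

```python
# Sketch (Python, not D) of the boundary snapping _charStart performs.
# UTF-8 continuation units match 0b10xxxxxx; back up until a lead byte.
def char_start(units: bytes, idx: int) -> int:
    while (units[idx] & 0b1100_0000) == 0b1000_0000:
        assert idx > 0, "invalid UTF-8: no lead byte found"
        idx -= 1
    return idx

units = "é!".encode("utf-8")     # b'\xc3\xa9!'
print(char_start(units, 1))      # 1 is a continuation unit -> snaps to 0
print(char_start(units, 2))      # '!' starts its own code point -> 2
```

This keeps indexing O(1) because a UTF-8 sequence is at most 4 units long, so the loop runs a bounded number of times.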
This is a huge problem.

Hopefully you can see that I'm not eliminating the functionality you are looking for, just making it not the default.

There is definitely some value in making strings treated as ranges of dchar by default. But for the most part, that's the case already thanks to how std.array is defined. The only place where that runs into trouble is if you use foreach.

It still treats them as arrays of code units unless you tell it to iterate over dchar. Either making foreach over character arrays iterate over dchar by default or making it a warning or error to use foreach with a char or wchar array of any kind without specifying the type to iterate over would fix that problem.

I agree that would be ideal, but it still doesn't solve the indexing problem.

There is an inherent but necessary disjoint between having strings be arrays of code units and ranges of dchar. Sometimes they need to be treated as one, sometimes as the other. Ideally, the default - whichever it is - would be the one which leads to fewer bugs. But they both need to be there. A struct solution is essentially an attempt to make strings treated as ranges of dchar in more situations by default than is currently the case. As such, it could be better than what we have now, but I'm honestly not convinced that it is. Aside from the foreach problem (which could be fixed with an appropriate warning or error - preferably error), what we have works quite well.

The thing is, the most common use of strings is as a string, not as an array of code-units. The common case is to print, slice, find, etc. on a *string*. When dealing with the string as a whole, either using an array or a specialized type works equally well. The uncommon case is to extract individual characters from the string. In this case, the default needs to be the most common need in that area -- extracting a dchar, not a code-unit. Having the default index operation extract a code-unit is very incorrect.

-Steve
Dec 01 2010
Steven Schveighoffer Wrote: [snipped]

3. You have no access to the underlying array unless you're dealing with an actual array of dchar.

I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.

-Steve

A string type should always maintain the invariant that it is a valid unicode string. Therefore I don't like having an unsafe opCast or providing direct access to the underlying array. I feel that there should be a read-only property for that. Algorithms that manipulate char[]'s should construct a new string instance which will validate that the char[] it is being built from is a valid utf string.

This looks like a great start for a proper string type. There's still the issue of literals that would require compiler/language changes.

There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)
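The validate-on-construction policy suggested here can be sketched as follows. This is an illustrative Python analogue only; ValidatedString is a hypothetical name, not part of any proposed D API. The constructor rejects malformed code-unit buffers, and the buffer is exposed only through a read-only property:

```python
# Sketch of validate-on-construction with a read-only buffer property.
class ValidatedString:
    def __init__(self, units: bytes):
        units.decode("utf-8")   # raises UnicodeDecodeError if malformed
        self._units = units

    @property
    def units(self) -> bytes:
        return self._units

ok = ValidatedString("αβ".encode("utf-8"))
print(len(ok.units))            # 4: two 2-unit code points
try:
    ValidatedString(b"\xc3")    # truncated sequence is rejected
except UnicodeDecodeError:
    print("rejected")
```

The cost of this design is the validation pass on every construction from a raw buffer, which is the runtime-validation objection Steven raises in his reply.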
Dec 01 2010
On Wed, 01 Dec 2010 03:30:07 -0500 foobar <foo bar.com> wrote:

Steven Schveighoffer Wrote: [snipped]

3. You have no access to the underlying array unless you're dealing with an actual array of dchar.

I thought of adding some kind of access. I wasn't sure the best way.

I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters.

But it could just be as easy as making the array itself public.

-Steve

A string type should always maintain the invariant that it is a valid unicode string. Therefore I don't like having an unsafe opCast or providing direct access to the underlying array. I feel that there should be a read-only property for that. Algorithms that manipulate char[]'s should construct a new string instance which will validate the char[] it is being built from is a valid utf string.

But then, why not store a dchar[] array systematically? Validation and decoding is the same job. Once decoded, all methods work as expected (eg s[3] returns the 4th code point) and blitz fast.

This looks like a great start for a proper string type. There's still the issue of literals that would require compiler/language changes.

Yop...

There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)

This is my next little project. May build on Steve's job. (But it's not necessary, dchar is enough as a base, I guess.)

Denis
-- -- -- -- -- -- --
vit esse estrany
☣ spir.wikidot.com
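The trade-off behind storing dchar[] systematically is easy to quantify: decoding once to UTF-32 buys O(1) indexing by code point, but at a fixed 4 bytes per code point, which quadruples storage for ASCII-heavy text. A sizing illustration in Python (not D):

```python
# UTF-8 vs UTF-32 storage for the same text (for sizing only).
text = "hello, world"
print(len(text.encode("utf-8")))       # 12 bytes: 1 unit per character here
print(len(text.encode("utf-32-le")))   # 48 bytes: fixed 4 per code point
```

This memory cost is the usual argument for keeping char[]/UTF-8 as the storage type and decoding lazily, as the proposed struct does.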
Dec 01 2010
Hi Denis, you might want to consider helping us out. We have got a feature-complete Unicode normalization, case-folding, and concatenation implementation passing all test cases in http://unicode.org/Public/6.0.0/ucd/NormalizationTest.txt (and then some) for all recent Unicode versions. This code was part of a bigger project that we have stopped working on. We feel that the Unicode normalization part might be useful to others. Therefore we consider releasing them under an open source license. Before we can do so, we have to clean up things a bit. Some open issues are a) The code still contains some TODOs and FIXMEs (bugs, inefficiencies, some bigger issues like more efficient storing of data etc.). b) No profiling and no benchmarking against the ICU implementation (http://site.icu-project.org/) has been done yet (we expect surprises). c) Implementation of additional Unicode algorithms (e.g. full case mapping, matching, collation). Since we have stopped working on the bigger project, we haven’t made much progress. Any help would be welcome. Let me know whether this would be of interest to you.There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)This is my next little project. May build on Steve's job. (But it's not necessary, dchar is enough as a base, I guess.)
Dec 01 2010
On Wed, 01 Dec 2010 17:41:17 +0100 stephan <none example.com> wrote:

> There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)

> This is my next little project. May build on Steve's job. (But it's not necessary, dchar is enough as a base, I guess.)

Hi Denis, you might want to consider helping us out.

We have got a feature-complete Unicode normalization, case-folding, and concatenation implementation passing all test cases in http://unicode.org/Public/6.0.0/ucd/NormalizationTest.txt (and then some) for all recent Unicode versions. This code was part of a bigger project that we have stopped working on.

We feel that the Unicode normalization part might be useful to others. Therefore we consider releasing them under an open source license. Before we can do so, we have to clean up things a bit. Some open issues are

a) The code still contains some TODOs and FIXMEs (bugs, inefficiencies, some bigger issues like more efficient storing of data etc.).

b) No profiling and no benchmarking against the ICU implementation (http://site.icu-project.org/) has been done yet (we expect surprises).

c) Implementation of additional Unicode algorithms (e.g. full case mapping, matching, collation).

Since we have stopped working on the bigger project, we haven't made much progress. Any help would be welcome. Let me know whether this would be of interest to you.

Yes, of course it would be useful, in any case. Either you wish to go on your project, and I may be of some help. Or it would anyway be a useful base or example of how to implement unicode algorithms. Maybe it's time to give some more information on what I intend to write. I have done it already (partially in Python, nearly completely in Lua).
What I have in mind is a "UText" type that provides the right abstraction for text processing / string manipulation as one has when dealing with ASCII (in fact, any legacy character set). All that is needed is having a true one-to-one mapping between characters (in the common sense) and elements of strings (what I call "code stacks"); one given stack unambiguously denotes one character. To reach this point, in addition to decoding (eg from utf8 to code points), we must:

* group codes into stacks
* normalize (into 'NFD')
* sort points in stacks

That's the base. Then, we can for instance index or slice in O(1) as usual, and get a consistent substring of _characters_ (not "abstract characters"). We can search for substrings by simple, direct comparisons. When dealing with utf32 strings (or worse utf8), simple indexing or counting is O(n) or rather O(k.n) where k represents the (high) cost of "stacking", normalizing and sorting on the fly -- it's not only traversing the whole string instead of random access, it's heavy computation all along the way.

From this base, all kinds of usual routines can be built without any more complexity. That's all I want to implement. I wish to write all general-purpose ones (which means, for instance, nothing like casing).

Precisely, I do not want to deal with anything related to script-, language-, locale-specific issues. It's a completely separate & independent topic. This indeed includes the "compatibility" normalisation forms of unicode (which precisely do not provide a normal form...). It seems part of your project was to cope with such issues.

I would be happy to cooperate if you feel like going on (then, let us communicate off list). I still have the Lua code (which used to run); even if useless as help for implementation (the languages are too different), it could give some more concrete picture of what I have in mind.
Also, it includes several test datasets, reprocessed for usability, from unicode's online files.

Denis
-- -- -- -- -- -- --
vit esse estrany
☣ spir.wikidot.com
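The stacking-plus-normalization steps described above can be sketched with Python's unicodedata module (illustrative only; stacks is a hypothetical helper name for what is called a "code stack" here): decompose to NFD, then attach each run of combining marks to its base character.

```python
# Sketch of "code stack" grouping: NFD decomposition, then each
# combining mark is appended to the stack of its base character.
import unicodedata

def stacks(s: str):
    s = unicodedata.normalize("NFD", s)   # canonical decomposition
    out = []
    for cp in s:
        if out and unicodedata.combining(cp):
            out[-1] += cp                 # combining mark joins the stack
        else:
            out.append(cp)                # new base character
    return out

# Precomposed 'é' and 'e' + COMBINING ACUTE become the same single stack,
# so counting and searching see 4 characters either way:
print(len(stacks("caf\u00e9")))                      # 4
print(stacks("caf\u00e9") == stacks("cafe\u0301"))   # True
```

NFD already sorts combining marks by canonical combining class, which covers the "sort points in stacks" step for this sketch.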
Dec 01 2010
spir wrote:

What I have in mind is a "UText" type that provides the right abstraction for text processing / string manipulation as one has when dealing with ASCII (in fact, any legacy character set). All that is needed is having a true one-to-one mapping between characters (in the common sense) and elements of strings (what I call "code stacks"); one given stack unambiguously denotes one character. To reach this point, in addition to decoding (eg from utf8 to code points), we must:

* group codes into stacks
* normalize (into 'NFD')

Are those operations independent of the context? Is stacking always desired? I guess one would use one of the D string types when grouping or normalization is not desired, right? Makes sense.

* sorts points in stacks

Ok, I see that it is possible with NFD. I am not experienced with Unicode, but I think there will be issues with other types of Unicode normalization. (Judging from your posts, I know that you know all these. :) )

Then, we can for instance index or slice in O(1) as usual, and get a consistent substring of _characters_ [...] I do not want to deal with anything related to script-, language-, locale-specific issues.

Is the concept of _character_ well defined in Unicode outside of the context of an alphabet (I think your "script" covers alphabet.)

It is an interesting decision when we actually want to see an array of code points as characters. When would it be correct to do so? I think the answer is when we start treating the string as a piece of text. For a string to be considered as text, it must be based on an alphabet. ASCII strings are pieces of text, because they are based on the 26-letter alphabet.

I hope I don't sound like I'm saying anything against what you said. I am also thinking about the other common operations that work on pieces of text:

- sorting (e.g. ç is between c and d in many alphabets)
- lowercasing, uppercasing (e.g. i<->İ and ı<->I in many alphabets)

As a part of the Turkish D community, we've played with the idea of such a text type. It took advantage of D's support for Unicode encoded source code, so it's fully in Turkish. Yay! :)

Here is the module that takes care of sorting, capitalization, and producing the base forms of the letters of the alphabets:

http://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d

It is also based on dchar[], as you recommend elsewhere in this thread. It is written with the older D2 operator overloading, doesn't support ranges, etc. But it currently supports ten alphabets (including the 26-letter English, and the Old Irish alphabet).

Going out of the context of this thread, we've also worked on a type that contains pieces of text from different alphabets to make a "text", where a text like "jim & ali" is correctly capitalized as "JIM & ALİ".

I am thinking more than what you describe. But your string would be useful for implementing ours, as we don't have normalization or stacking support at all.

Thanks,
Ali
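The casing rows above are worth checking against Unicode's default (language-independent) case mappings, which is exactly why a general-purpose type cannot bake casing in. A quick check in Python, whose str casing follows those default mappings (the İ result comes from Unicode's SpecialCasing data, which CPython implements):

```python
# Unicode's default case mappings are not Turkish-aware:
print("i".upper())            # I  (Turkish expects İ)
print("\u0131".upper())       # I  (dotless ı -> I, as Turkish expects)
print(len("\u0130".lower()))  # 2  (İ lowers to 'i' + combining dot above)
```

So a Turkish-correct i<->İ and ı<->I pairing has to come from a language-specific layer like the alfabe.d module linked above.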
Dec 03 2010
On Fri, 03 Dec 2010 14:11:35 -0800 Ali Çehreli <acehreli yahoo.com> wrote:

spir wrote:

> What I have in mind is a "UText" type that provides the right abstraction for text processing / string manipulation as one has when dealing with ASCII (in fact, any legacy character set). All that is needed is having a true one-to-one mapping between characters (in the common sense) and elements of strings (what I call "code stacks"); one given stack unambiguously denotes one character. To reach this point, in addition to decoding (eg from utf8 to code points), we must:
> * group codes into stacks
> * normalize (into 'NFD')

Are those operations independent of the context? Is stacking always desired? I guess one would use one of the D string types when grouping or normalization is not desired, right? Makes sense.

We can consider there are two kinds of uses for texts in most applications:

1. just input/output (where input is also from literals in code), possibly with some concatenation.
2. string manipulation / text processing

For the 1st case, there is no need for any sophisticated toolkit like the type I intend to write. We can just read in latin-x, for instance, join it, output it back, without any problems. (As long as all pieces of text share the same encoding.) Problems arise as soon as text is to be manipulated or processed in any other way: indexing, searching, counting, slicing, replacing, etc... all these routines require isolating _characters_ - in the ordinary sense of the word - inside the string of code units or code points.

A true text type, usable like ASCII in old days, would either provide routines that do that in the background, but in a costly way, or first group codes into characters once only -- then every later operation is as cheap as possible. Normalising and sorting are also required so that each character has only one representation.

I intend to write a little article to explain the issue (& misunderstandings created by Unicode's use of "abstract character"), and possible solutions.

To say it again: if all one needs is text input/output, then using such a tool is overkill. Actually, even the string type Steven is implementing is not strictly necessary. But it would have the advantage, if I understand correctly, to present a cleaner interface.

> * sorts points in stacks

Ok, I see that it is possible with NFD. I am not experienced with Unicode, but I think there will be issues with other types of Unicode normalization. (Judging from your posts, I know that you know all these. :) )

Yes, the algorithm comes with Unicode's docs about "canonicalisation".

> Then, we can for instance index or slice in O(1) as usual, and get a
> consistent substring of _characters_ [...] I do not want to deal with
> anything related to script-, language-, locale-specific issues.

Is the concept of _character_ well defined in Unicode outside of the context of an alphabet (I think your "script" covers alphabet.)

It is an interesting decision when we actually want to see an array of code points as characters. When would it be correct to do so? I think the answer is when we start treating the string as a piece of text.

For a string to be considered as text, it must be based on an alphabet. ASCII strings are pieces of text, because they are based on the 26-letter alphabet.

I hope I don't sound like I'm saying anything against what you said. I am also thinking about the other common operations that work on pieces of text:

- sorting (e.g. ç is between c and d in many alphabets)
- lowercasing, uppercasing (e.g. i<->İ and ı<->I in many alphabets)

As a part of the Turkish D community, we've played with the idea of such a text type. It took advantage of D's support for Unicode encoded source code, so it's fully in Turkish. Yay! :)

Here is the module that takes care of sorting, capitalization, and producing the base forms of the letters of the alphabets:

http://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d

It is also based on dchar[], as you recommend elsewhere in this thread.

It is written with the older D2 operator overloading, doesn't support ranges, etc. But it currently supports ten alphabets (including the 26-letter English, and the Old Irish alphabet).

Going out of the context of this thread, we've also worked on a type that contains pieces of text from different alphabets to make a "text", where a text like "jim & ali" is correctly capitalized as "JIM & ALİ".

I am thinking more than what you describe. But your string would be useful for implementing ours, as we don't have normalization or stacking support at all.

As said, I do not wish to enter the huge area of script-, natural language-, culture-specific issues, because it's not general; I just target a general-purpose tool. My type wouldn't even have a default uppercase routine, for instance: first, it's script specific; second, there is no general definition for that (*) -- even if Unicode provides such data. Sorting issues are even less decidable; it goes down to personal preferences (**). It's also too big, too diverse, too complicated an area.

But I guess the type I have in mind would provide a good basis for such a work (or rather, hundreds of language- and domain-specific tools or applications). At least, issues about grouping codes into characters, and multiple forms of characters, are solved: "ALİ" would always be represented the same way, and in a logical way; so that if you count its characters, you get 3 for sure, and if you search for 'İ' you find it for sure (which is not true today).

Thanks, Ali

Thanks to you,
Denis

(*) Is the uppercase of "gâté" "GATE" or "GÂTÉ"?

(**) Search for instance the threads about users complaining when KDE decided to sort file names according to a supposed user-friendly order ("natural", lol!).
-- -- -- -- -- -- --
vit esse estrany
☣ spir.wikidot.com
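The footnote's question can be made concrete: Unicode's default case mapping answers "GÂTÉ", while "GATE" requires a separate accent-stripping pass (NFD plus dropping combining marks), which is exactly the kind of language-specific policy being excluded here. A sketch in Python (not D):

```python
# Default case mapping keeps accents; stripping them is a separate step.
import unicodedata

s = "gâté"
print(s.upper())             # GÂTÉ

stripped = "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))
print(stripped.upper())      # GATE
```

Which of the two a user wants depends on language and context (French typography itself is divided on accented capitals), so a general-purpose type can only reasonably provide the building blocks.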
Dec 03 2010
On Wed, 01 Dec 2010 03:30:07 -0500, foobar <foo bar.com> wrote:

Steven Schveighoffer Wrote: [snipped]

3. You have no access to the underlying array unless you're dealing with an actual array of dchar.

I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.

-Steve

A string type should always maintain the invariant that it is a valid unicode string. Therefore I don't like having an unsafe opCast or providing direct access to the underlying array. I feel that there should be a read-only property for that. Algorithms that manipulate char[]'s should construct a new string instance which will validate the char[] it is being built from is a valid utf string.

Copying is not a good idea, nor is runtime validation. We can only protect the programmer so much. The good news is that the vast majority of strings are literals, which should be properly constructed by the compiler, and immutable.

This looks like a great start for a proper string type. There's still the issue of literals that would require compiler/language changes.

That is essential, the compiler has to defer the type of string literals to the library somehow.

There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)

This is more solvable with a struct, but at this point, I'm not sure if it's worth worrying about. How common is that need?

-Steve
Dec 01 2010
I like the direction you're taking but have some quibbles about details. Specifically, I'd go for a more complete separation into random-access code-unit ranges and bidirectional code-point ranges:

On 01/12/10 02:18, Steven Schveighoffer wrote:

// Written in the D programming language.
/**
Copyright: Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010
License: <a href="http://www.boost.org/LICENSE_1_0.txt">Boost License 1.0</a>.
Authors: $(WEB erdani.org, Andrei Alexandrescu, Steven Schveighoffer)

         Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010.
Distributed under the Boost Software License, Version 1.0.
   (See accompanying file LICENSE_1_0.txt or copy at
         http://www.boost.org/LICENSE_1_0.txt)
*/
import std.utf;
import std.traits;

struct string_t(T) if (is(Unqual!T == char) || is(Unqual!T == wchar))

Is there a reason not to include is(Unqual!T == dchar)?

{
    private T[] _data;

    this(T[] data)
    {
        this._data = data;
    }

An opAssign from a T[] could facilitate conversion back and forth between code-point and code-unit ranges.

    // note, this assumes that idx is a valid index
    private size_t _charStart(size_t idx) const
    {
        static if(is(Unqual!T == wchar))
        {
            immutable c = _data.ptr[idx];
            if(c >= 0xDC00 && c <= 0xDFFF)
            {
                // surrogate pair detected, verify we have at least 2 wchars,
                // and that both wchars are properly encoded.
                assert(idx > 0, "Invalid UTF character at beginning of string");
                return idx-1;
            }
            else
                return idx;
        }
        else
        {
            const p = _data.ptr + idx;
            if ((p[0] & 0b1100_0000) != 0b1000_0000)
            {
                return idx;
            }
            else if (idx >= 1 && (p[-1] & 0b1100_0000) != 0b1000_0000)
            {
                return idx - 1;
            }
            else if (idx >= 2 && (p[-2] & 0b1100_0000) != 0b1000_0000)
            {
                return idx - 2;
            }
            else if (idx >= 3 && (p[-3] & 0b1100_0000) != 0b1000_0000)
            {
                return idx - 3;
            }
            else
            {
                assert(false, "Invalid UTF character in string");
            }
        }
    }

    void popFront()
    {
        auto nc = std.utf.stride(_data, 0);
        assert(nc <= _data.length && nc != 0xFF, "Invalid sequence at beginning of string");
        _data = _data[nc .. $];
    }

    void popBack()
    {
        immutable n = _data.length;
        assert(n, "Attempting to pop back of an empty string");
        _data = _data.ptr[0.._charStart(n-1)];
    }

    @property dchar front() const
    {
        assert(_data.length, "Attempting to fetch the front of an empty string");
        size_t i = 0;
        return decode(_data, i);
    }

    @property dchar back() const
    {
        immutable n = _data.length;
        assert(n, "Attempting to fetch the back of an empty string");
        auto idx = _charStart(n-1);
        return std.utf.decode(_data, idx);
    }

There is the alternative of deferring decoding to the user and returning T[]'s holding exactly 1 code-point instead of dchars. I'm not sure which is best, but I'd be interested in seeing a case for choosing one or the other.

    @property bool empty() const { return !_data.length; }

    @property typeof(this) save() { return this; }

    // support read-only random access via code unit index.
    dchar opIndex(size_t idx)
    {
        idx = _charStart(idx);
        return std.utf.decode(_data, idx);
    }

    string_t opSlice()
    {
        return this;
    }

    string_t opSlice(size_t start, size_t end)
    {
        if(start != _data.length)
            start = _charStart(start);
        if(end != _data.length)
            end = _charStart(end);
        return string_t(_data[start..end]);
    }

    // note we don't call this length because length can be assumed to be the
    // number of elements, which this isn't.
    @property size_t codeUnits() const
    {
        return _data.length;
    }

I don't see a need for _charStart, opIndex, opSlice and codeUnits. If the underlying T[] can be returned by a property, then these can be done through the code-unit array, which is random-access.

    // support append and concat
    // TODO: need to support appending various types of strings to each other
    // (right now only same-type strings can be appended, or raw arrays)
    ref string_t opOpAssign(string op, U)(U data) if (op == "~" && is(U == string_t))
    {
        _data ~= data._data;
        return this;
    }

    ref string_t opOpAssign(string op, U)(U data) if (op == "~" && !is(U == string_t) && is(typeof(_data ~= U.init)))
    {
        _data ~= data;
        return this;
    }

    string_t opBinary(string op, U)(U data) if (op == "~" && is(U == string_t))
    {
        return string_t(_data ~ data._data);
    }

    string_t opBinary(string op, U)(U data) if (op == "~" && !is(U == string_t) && is(typeof(_data ~ U.init)))
    {
        return string_t(_data ~ data);
    }
}

template string_t(T) if (is(Unqual!T == dchar))
{
    alias T[] string_t;
}

/** begin test code **/
import std.stdio;

alias string_t!(immutable char) mystring;
alias string_t!(immutable wchar) mywstring;
alias string_t!(immutable dchar) mydstring;

void main()
{
    auto str = mystring("hello");
    str ~= " world";
    str ~= mystring("!!!");
    writeln(str._data);
}
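For readers less familiar with UTF-16, the wchar branch of _charStart above relies on trailing surrogates occupying the range 0xDC00-0xDFFF, so a code unit in that range always belongs to the preceding lead surrogate. A small illustration in Python (not D):

```python
# UTF-16 surrogate layout: a unit in 0xDC00-0xDFFF is never the start
# of a code point; utf16_units is a helper for this illustration only.
import struct

def utf16_units(s: str):
    data = s.encode("utf-16-le")
    return [u for (u,) in struct.iter_unpack("<H", data)]

units = utf16_units("a\U0001F600")   # 'a' plus a character needing a pair
print([hex(u) for u in units])       # ['0x61', '0xd83d', '0xde00']
print(0xDC00 <= units[2] <= 0xDFFF)  # True: index 2 is mid-character
```

This is why the wchar case only ever has to back up one unit, while the char (UTF-8) case may back up as many as three.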
Jan 12 2011
On Wed, 12 Jan 2011 04:49:26 -0500, Steven Wawryk <stevenw acres.com.au> wrote:

I like the direction you're taking but have some quibbles about details. Specifically, I'd go for a more complete separation into random-access code-unit ranges and bidirectional code-point ranges:

Thanks for taking the time. I will respond to your points, but please make your rebuttals to the new thread I'm about to create with an updated string type.

Is there a reason not to include is(Unqual!T == dchar)?

It's defined at the bottom as a simple alias to T[], since you do not need these features for a dchar[] string (it's correct to treat a dchar string as an array).

An opAssign from a T[] could facilitate conversion back and forth between code-point and code-unit ranges.

Yes, that would be a good feature.

There is the alternative of deferring decoding to the user and returning T[]'s holding exactly 1 code-point instead of dchars. I'm not sure which is best, but I'd be interested in seeing a case for choosing one or the other.

Definitely decoding and returning dchar is best. The whole point of this string type is to handle the decoding for you. If you want to do decoding manually, you can just stick with arrays.

I don't see a need for _charStart, opIndex, opSlice and codeUnits. If the underlying T[] can be returned by a property, then these can be done through the code-unit array, which is random-access.

But that puts extra pain on the user for not much reason. Currently, strings slice in one operation; you are proposing that we slice in three operations:

1. get the underlying array
2. slice it
3. reconstruct a string based on the slice.

Plus, if you remove opIndex, you are restricting the usefulness of the range. Note that this string type already will decode dchars out of the front and back, why not just give that ability to the middle of the string?

template string_t(T) if (is(Unqual!T == dchar))
{
    alias T[] string_t;
}

Note the definition of string_t!dchar here ^^^^

-Steve
Jan 13 2011