
digitalmars.D - [review] new string type

reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
In a prior thread, I promised to create a narrow string type that would  
enforce the requirements of narrow strings.  That is, the new string type  
should respect the encoding of narrow strings.

Here is a rough draft, tested minimally, but it does compile and pass  
simple tests.  It's pretty simple, which is what I would expect.  I copied  
a lot of stuff from std.array to get this to work.

The point of this type is -- if we replace what the compiler considers  
"strings" with this type, then we get both the compiler *and* phobos  
agreeing as to what this type is:  A bi-directional range of dchar.

As a bonus, char[] and wchar[] would now become plain arrays and be manipulated  
consistently with other arrays. If not done correctly that could cause  
problems, but it may provide more flexibility and opportunity for  
performance, instead of the library fighting you on it.

Anyways, here it is, released under the boost license, commence attack ;)


// Written in the D programming language.

/**
Copyright: Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010
License:   <a href="http://www.boost.org/LICENSE_1_0.txt">Boost License 1.0</a>.
Authors:   $(WEB erdani.org, Andrei Alexandrescu, Steven Schveighoffer)

Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010.
Distributed under the Boost Software License, Version 1.0.
    (See accompanying file LICENSE_1_0.txt or copy at
          http://www.boost.org/LICENSE_1_0.txt)
*/
import std.utf;
import std.traits;

struct string_t(T) if (is(Unqual!T == char) || is(Unqual!T == wchar))
{
     private T[] _data;
     this(T[] data)
     {
         this._data = data;
     }

     // note, this assumes that idx is a valid index
     private size_t _charStart(size_t idx) const
     {
         static if(is(Unqual!T == wchar))
         {
             immutable c = _data.ptr[idx];
             if(c >= 0xDC00 && c <= 0xDFFF)
             {
                 // surrogate pair detected; verify we have at least 2 wchars,
                 // and that both wchars are properly encoded.
                 assert(idx > 0, "Invalid UTF character at beginning of string");
                 return idx-1;
             }
             else
                 return idx;
         }
         else
         {
             const p = _data.ptr + idx;
             if ((p[0] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx;
             }
             else if (idx >= 1 && (p[-1] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx - 1;
             }
             else if (idx >= 2 && (p[-2] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx - 2;
             }
             else if (idx >= 3 && (p[-3] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx - 3;
             }
             else
             {
                 assert(false, "Invalid UTF character in string");
             }
         }
     }

     void popFront()
     {
         auto nc = std.utf.stride(_data, 0);
         assert(nc <= _data.length && nc != 0xFF, "Invalid sequence at beginning of string");
         _data = _data[nc .. $];
     }

     void popBack()
     {
         immutable n = _data.length;
         assert(n, "Attempting to pop back of an empty string");
         _data = _data.ptr[0.._charStart(n-1)];
     }

     @property dchar front() const
     {
         assert(_data.length, "Attempting to fetch the front of an empty string");
         size_t i = 0;
         return decode(_data, i);
     }

     @property dchar back() const
     {
         immutable n = _data.length;
         assert(n, "Attempting to fetch the back of an empty string");
         auto idx = _charStart(n-1);
         return std.utf.decode(_data, idx);
     }

     @property bool empty() const
     {
         return !_data.length;
     }

     @property typeof(this) save()
     {
         return this;
     }

     // support read-only random access via code unit index.
     dchar opIndex(size_t idx)
     {
         idx = _charStart(idx);
         return std.utf.decode(_data, idx);
     }

     string_t opSlice()
     {
         return this;
     }

     string_t opSlice(size_t start, size_t end)
     {
         if(start != _data.length)
             start = _charStart(start);
         if(end != _data.length)
             end = _charStart(end);
         return string_t(_data[start..end]);
     }

     // note we don't call this length because length can be assumed to be the
     // number of elements, which this isn't.
     @property size_t codeUnits() const
     {
         return _data.length;
     }

     // support append and concat
     // TODO: need to support appending various types of strings to each other
     // (right now only same-type strings can be appended, or raw arrays)
     ref string_t opOpAssign(string op, U)(U data) if (op == "~" && is(U == string_t))
     {
         _data ~= data._data;
         return this;
     }

     ref string_t opOpAssign(string op, U)(U data) if (op == "~" && !is(U == string_t) && is(typeof(_data ~= U.init)))
     {
         _data ~= data;
         return this;
     }

     string_t opBinary(string op, U)(U data) if (op == "~" && is(U == string_t))
     {
         return string_t(_data ~ data._data);
     }

     string_t opBinary(string op, U)(U data) if (op == "~" && !is(U == string_t) && is(typeof(_data ~ U.init)))
     {
         return string_t(_data ~ data);
     }
}

template string_t(T) if (is(Unqual!T == dchar))
{
     alias T[] string_t;
}

/** begin test code **/
import std.stdio;

alias string_t!(immutable char) mystring;
alias string_t!(immutable wchar) mywstring;
alias string_t!(immutable dchar) mydstring;

void main()
{
     auto str = mystring("hello");
     str ~= " world";
     str ~= mystring("!!!");
     writeln(str._data);
}
Nov 30 2010
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, November 30, 2010 07:48:03 Steven Schveighoffer wrote:
 In a prior thread, I promised to create a narrow string type that would
 enforce the requirements of narrow strings.  That is, the new string type
 should respect the encoding of narrow strings.
 
 [...]
1. At least until universal function syntax is in the language, you lose the ability to do str.func().

2. Functions that would work just fine treating strings as arrays of code units (presumably because they don't care about what the actual data is) lose efficiency, because now a string isn't an array.

3. You have no access to the underlying array unless you're dealing with an actual array of dchar.

4. Indexing is no longer O(1), which violates the guarantees of the index operator.

5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.

What you're doing here is forcing the view of a string as a range of dchar in all cases. Granted, that's what you want in most cases, but it can degrade efficiency, and the fact that some operations (in particular indexing and slicing) are not O(1) like they're supposed to be means that algorithms which rely on O(1) behavior from them could increase their cost by an order of magnitude. All the cases where treating a string as an actual array is currently valid are left out to dry.

The inherent problem with strings is that we _want_ to be able to view them as both arrays of code units and as ranges of dchar. Trying to view them only as ranges of dchar is inefficient in a number of cases. Even something as simple as getting a substring is less efficient with this code. Only in cases where you don't actually know the length of the substring that you want is this essentially as efficient as what we have now. You're ignoring all cases where viewing a string as an array of code units is correct and desirable. You're only solving half of the problem (albeit the more prevalent half).

It could very well be that we need a better way to handle when a string is treated as a range of dchar and when it is treated as an array of code units, but trying to treat them as ranges of dchar in all cases is not the right answer.
Now, making it possible to have a wrapper struct like this which works in many cases where you'd use strings could reduce errors in code in many situations, so giving the programmer the ability to do that could be interesting. But it seems to me that this solution is too limiting to be a viable replacement for strings as they are. - Jonathan M Davis
Nov 30 2010
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:

 1. At least until universal function syntax is in the language, you lose
 the ability to do str.func().
Yes, but first, this is not a problem with the string type, it's a problem with the language. Second, any string-specific functions can be added, it is a struct after all, not a builtin.
 2. Functions that would work just fine treating strings as arrays of  
 code units
 (presumably because they don't care about what the actual data is) lose
 efficiency, because now a string isn't an array.
Which functions are those? They can be allowed via wrappers.
 3. You have no access to the underlying array unless you're dealing with  
 an
 actual array of dchar.
I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.
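A minimal sketch of that opCast idea (the struct and its name are invented here for illustration; this isn't part of the draft above):

```d
struct rawstr
{
    private string _data;

    this(string data) { _data = data; }

    // The raw code units are reachable only through an explicit cast,
    // which reads as a deliberate red flag at the call site.
    string opCast(T)() const if (is(T == string))
    {
        return _data;
    }
}

void main()
{
    auto s = rawstr("hello");
    auto raw = cast(string) s; // crossing into "dangerous waters" on purpose
    assert(raw == "hello");
}
```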
 4. Indexing is no longer O(1), which violates the guarantees of the index
 operator.
Indexing is still O(1) -- _charStart looks back a bounded number of code units.
 5. Slicing (other than a full slice) is no longer O(1), which violates  
 the
 guarantees of the slicing operator.
Slicing is still O(1).
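The reason these stay O(1): a UTF-8 sequence is at most 4 code units, so finding the start of the code point containing an index never scans back more than 3 units. A standalone sketch of that bounded lookback (assuming valid UTF-8; the helper name is invented):

```d
// A UTF-8 continuation byte matches the bit pattern 10xxxxxx, and a
// sequence is at most 4 bytes, so this never loops more than 4 times --
// a constant bound, which is why opIndex and opSlice remain O(1).
size_t charStart(const(char)[] s, size_t idx)
{
    foreach (back; 0 .. 4)
    {
        if (back > idx)
            break;
        if ((s[idx - back] & 0b1100_0000) != 0b1000_0000)
            return idx - back;
    }
    assert(false, "invalid UTF-8");
}

void main()
{
    string s = "a\u00E9b"; // 'é' occupies code units 1 and 2
    assert(charStart(s, 0) == 0);
    assert(charStart(s, 2) == 1); // continuation byte backs up one unit
    assert(charStart(s, 3) == 3);
}
```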
 What you're doing here is forcing the view of a string as a range of  
 dchar in
 all cases. Granted, that's what you want in most cases, but it can  
 degrade
 efficiency, and the fact that some operations (in particular indexing  
 and slicing)
 are not O(1) like they're supposed to be means that algorithms which  
 rely on
 O(1) behavior from them could increase their cost by an order of  
 magnitude. All
 the cases where treating a string as an actual array which are currently  
 valid
 are left out to dry
You can still use char[] and wchar[].
 The inherent problem with strings is that we _want_ to be able to view  
 them as
 both arrays of code units and as ranges of dchar. Trying to view them  
 only as
 ranges of dchar is inefficient in a number of cases. Even something as  
 simple as
 getting a substring is less efficient with this code. Only in cases  
 where you
 don't actually know the length of the substring that you want is this
 essentially as efficient as what we have now. You're ignoring all cases  
 where
 viewing a string as an array of code units is correct and desirable.  
 You're only
 solving half of the problem (albeit the more prevalent half).
I'm not planning on eliminating char and wchar based arrays. In other words, you should be able to get access to the array, but it should not be the default, and it should be considered unsafe.
 Now, making it possible to have a wrapper struct like this which works  
 in many
 cases where you'd use strings could reduce errors in code in many  
 situations, so
 giving the programmer the ability to do that could be interesting. But  
 it seems
 to me that this solution is too limiting to be a viable replacement for  
 strings
 as they are.
Hopefully you can see that I'm not eliminating the functionality you are looking for, just making it not the default. -Steve
Nov 30 2010
next sibling parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 11/30/2010 12:52 PM, Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis

 4. Indexing is no longer O(1), which violates the guarantees of the index
 operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.
Slicing is still O(1).
There definitely is value in being able to index and slice into utf strings without resulting in invalid utf, but I think the fact that it indexes on code unit and returns code point is sufficiently strange that it qualifies as abuse of operator overloading. That said, I like this proposal better than the current state of affairs. I'll try to play with it more on the weekend
Nov 30 2010
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 30 Nov 2010 18:31:05 -0500, Ellery Newcomer  
<ellery-newcomer utulsa.edu> wrote:

 On 11/30/2010 12:52 PM, Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis

 4. Indexing is no longer O(1), which violates the guarantees of the  
 index
 operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.
Slicing is still O(1).
There definitely is value in being able to index and slice into utf strings without resulting in invalid utf, but I think the fact that it indexes on code unit and returns code point is sufficiently strange that it qualifies as abuse of operator overloading.
Maybe :)  The other alternative is to throw an exception if you try to access a code unit that is not the beginning of a code point. That might actually be less weird, I'll try doing that on the next iteration.

I find that iteration over string characters using index is a very rare thing anyways; you either use foreach, which should give you dchars, or you use something like find, which should never give you an invalid index.

-Steve
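A sketch of that throwing alternative (strictIndex is a hypothetical name; only the boundary check differs from the draft's _charStart behavior):

```d
import std.exception : assertThrown, enforce;
import std.utf : decode;

// Hypothetical stricter variant: indexing anywhere other than the start
// of a code point throws instead of silently backing up.
dchar strictIndex(const(char)[] s, size_t idx)
{
    enforce((s[idx] & 0b1100_0000) != 0b1000_0000,
            "index is not at the start of a code point");
    return decode(s, idx); // decode advances idx past the code point
}

void main()
{
    string s = "a\u00E9b";
    assert(strictIndex(s, 1) == '\u00E9');
    assertThrown(strictIndex(s, 2)); // middle of the two-unit 'é'
}
```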
Dec 01 2010
parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 12/01/2010 03:35 PM, Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 18:31:05 -0500, Ellery Newcomer
 There definitely is value in being able to index and slice into utf
 strings without resulting in invalid utf, but I think the fact that it
 indexes on code unit and returns code point is sufficiently strange
 that it qualifies as abuse of operator overloading.
Maybe :) The other alternative is to throw an exception if you try to access a code unit that is not the beginning of a code point. That might actually be less weird, I'll try doing that on the next iteration.
In my mind, the problem isn't so much that indexing an intermediate code unit gets you earlier code units. (That's a little strange, and I'm not sure whether greater strictness would be better. On the one hand, less strictness is more tolerant of bugs and makes them that much harder to detect; on the other hand, if you were doing something like taking a random or approximate slice into your string, less strictness would mean that much less annoyance, though I have no idea why you would want to do that.) It's more the difference between the two and the confusion that it's bound to cause the noobies.
 I find that iteration over string characters using index is a very rare
 thing anyways, you either use foreach, which should give you dchars, or
 you use something like find, which should never give you an invalid index.

 -Steve
find was the counterargument I had in mind for keeping the operator overload, as something like

s[find(s,'\u2729') .. s.codeUnits]

is just a bit better than

s.codePointSliceAt(find(s,'\u2729'), s.codeUnits);

I really don't know.

One thing that strikes me, though, if you're going to keep opIndex, is that being able to do

foreach(size_t codeuniti, dchar c; s){

}

would be nice. Actually, it looks like you can do that with current strings.
Dec 01 2010
next sibling parent reply spir <denis.spir gmail.com> writes:
On Wed, 01 Dec 2010 20:13:35 -0600
Ellery Newcomer <ellery-newcomer utulsa.edu> wrote:

 One thing that strikes me, though, if you're going to keep opIndex, is
 that being able to do

 foreach(size_t codeuniti, dchar c; s){

 }
This would yield several times the same code point / dchar, or do I misinterpret?

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
Dec 02 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 02 Dec 2010 05:08:54 -0500, spir <denis.spir gmail.com> wrote:

 On Wed, 01 Dec 2010 20:13:35 -0600
 Ellery Newcomer <ellery-newcomer utulsa.edu> wrote:

 One thing that strikes me, though, if you're going to keep opIndex, is
 that being able to do

 foreach(size_t codeuniti, dchar c; s){

 }
This would yield several times the same code point / dchar, or do i misinterpret?
It would only yield the code units that are the start of a code point. So if the code point that starts at the beginning of the array is 2 code units long, you would iterate 0, 2, 3, ... -Steve
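With D's built-in foreach over narrow strings this is already observable:

```d
void main()
{
    string s = "\u00E9hi"; // 'é' is two UTF-8 code units; 'h' and 'i' are one each
    size_t[] indexes;
    foreach (size_t i, dchar c; s)
        indexes ~= i;
    assert(indexes == [0, 2, 3]); // only the start of each code point is yielded
}
```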
Dec 02 2010
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 01 Dec 2010 21:13:35 -0500, Ellery Newcomer  
<ellery-newcomer utulsa.edu> wrote:

 On 12/01/2010 03:35 PM, Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 18:31:05 -0500, Ellery Newcomer
 There definitely is value in being able to index and slice into utf
 strings without resulting in invalid utf, but I think the fact that it
 indexes on code unit and returns code point is sufficiently strange
 that it qualifies as abuse of operator overloading.
Maybe :) The other alternative is to throw an exception if you try to access a code unit that is not the beginning of a code point. That might actually be less weird, I'll try doing that on the next iteration.
in my mind, the problem isn't so much indexing an intermediate code unit gets you earlier code units (it's a little strange, and I'm not sure whether greater strictness would be better - on the one hand, less strictness would be more tolerant of bugs and make it that much more difficult to detect them, but on the other hand if you were doing something like getting a random or approximate slice into your string, less strictness would mean that much less annoyance, though I have no idea why you would want to do that) as it is just the difference between the two and the confusion that it's bound to cause the noobies.
Yes, it does seem odd, but then again, how often do you need the individual characters of a string? I wrote php code for about 6 months as a full time job before I found I needed to access individual characters, and then I had to look up how to do it :) It's just not a common thing. Typically, the index you use is calculated from something like find, and you don't care what it is, as long as it's storable and persistent.
 I find that iteration over string characters using index is a very rare
 thing anyways, you either use foreach, which should give you dchars, or
 you use something like find, which should never give you an invalid  
 index.

 -Steve
find was the counterargument I had in mind for keeping the operator overload, as something like s[find(s,'\u2729') .. s.codeUnits] is just a bit better than s.codePointSliceAt(find(s,'\u2729'), s.codeUnits); I really don't know.
Ugh, yes, and actually, that reminds me I should define opDollar.
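A minimal sketch of what opDollar would buy (the struct and names are invented for illustration): `$` inside the brackets maps to the code-unit length, so s[i .. $] slices to the end.

```d
struct tail
{
    private string _data;
    this(string data) { _data = data; }
    @property size_t codeUnits() const { return _data.length; }
    size_t opDollar() const { return codeUnits; } // `$` in s[a .. $]
    string opSlice(size_t a, size_t b) const { return _data[a .. b]; }
}

void main()
{
    auto s = tail("hello world");
    assert(s[6 .. $] == "world");
}
```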
 One thing that strikes me, though, if you're going to keep opIndex, is  
 that being able to do

 foreach(size_t codeuniti, dchar c; s){

 }

 would be nice. Actually, it looks like you can do that with current  
 strings.
At this point, you can't do that except via opApply, and I didn't want to inject that in fear that it would be pointed out as a drawback. It would be nice if we could define a way to do that via ranges... -Steve
Dec 02 2010
next sibling parent reply spir <denis.spir gmail.com> writes:
On Thu, 02 Dec 2010 16:24:03 -0500
"Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 Yes, it does seem odd, but then again, how often do you need the
 individual characters of a string?  I wrote php code for about 6 months as
 a full time job before I found I needed to access individual characters,
 and then I had to look up how to do it :)  It's just not a common thing.
Well, I wonder whether I have ever written any piece of code without string indexing ;-) Surely depends on your app domain(s)...

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
Dec 03 2010
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 03 Dec 2010 11:15:36 -0500, spir <denis.spir gmail.com> wrote:

 On Thu, 02 Dec 2010 16:24:03 -0500
 "Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 Yes, it does seem odd, but then again, how often do you need the
 individual characters of a string?  I wrote php code for about 6 months  
 as
 a full time job before I found I needed to access individual characters,
 and then I had to look up how to do it :)  It's just not a common thing.
Well, I wonder whether I have ever written any piece of code without string indexing ;-) Surely depends on your app domain(s)...
Indexing a string is rare, unless you are parsing something (yes it does truly depend on the domain), but even rarer is indexing a string based on a hard-coded value. Really, this last case is where the type would behave in a confusing way. Consider that D's current's strings have even more of a limited means of indexing -- you can only validly index a string using an index where a code-point is encoded as a single code-unit. Accessing other indexes (even ones that start a multi-unit code-point) results in invalid characters. -Steve
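For example, with today's built-in strings:

```d
void main()
{
    string s = "\u00E9"; // U+00E9 encodes as two UTF-8 code units: 0xC3 0xA9
    assert(s.length == 2);
    // Indexing the second code unit yields a bare continuation byte
    // (bit pattern 10xxxxxx), not a valid character:
    assert((s[1] & 0b1100_0000) == 0b1000_0000);
}
```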
Dec 03 2010
next sibling parent Simon <s.d.hammett gmail.com> writes:
On 03/12/2010 18:29, Steven Schveighoffer wrote:
 On Fri, 03 Dec 2010 11:15:36 -0500, spir <denis.spir gmail.com> wrote:

 Indexing a string is rare, unless you are parsing something (yes it does
 truly depend on the domain),
In my XP you never index *especially* when parsing. Every parser I've written & used treats input as a character stream. XML & D for instance permit code points > 0x7F, which means you must decode them properly when using utf-8, even if you are going to be storing the input back as utf-8.

--
My enormous talent is exceeded only by my outrageous laziness.
http://www.ssTk.co.uk
Dec 03 2010
prev sibling parent reply Jerry Quinn <jlquinn optonline.net> writes:
Steven Schveighoffer Wrote:

 On Fri, 03 Dec 2010 11:15:36 -0500, spir <denis.spir gmail.com> wrote:
 
 On Thu, 02 Dec 2010 16:24:03 -0500
 "Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 Yes, it does seem odd, but then again, how often do you need the
 individual characters of a string?  I wrote php code for about 6 months  
 as
 a full time job before I found I needed to access individual characters,
 and then I had to look up how to do it :)  It's just not a common thing.
Well, I wonder whether I have ever written any piece of code without string indexing ;-) Surely depends on your app domain(s)...
Indexing a string is rare, unless you are parsing something (yes it does truly depend on the domain), but even rarer is indexing a string based on a hard-coded value. Really, this last case is where the type would behave in a confusing way.
I tend to do a lot of transforming strings, but I need to track offsets back to the original text to maintain alignment between the results and the input. For that, indexes are necessary and we use them a lot. Probably the right thing to do in this case is just pay for the cost of using dchar everywhere, but if you're working with large enough quantities of data, storage efficiency matters. Jerry
Dec 03 2010
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 03 Dec 2010 14:40:30 -0500, Jerry Quinn <jlquinn optonline.net>  
wrote:

 I tend to do a lot of transforming strings, but I need to track offsets  
 back to the original text to maintain alignment between the results and  
 the input.  For that, indexes are necessary and we use them a lot.
In my daily usage of strings, I generally use a string as a whole, not individual characters. But I do occasionally use it.

Let's also understand that indexing is still present; what is deactivated is the ability to index to arbitrary code-units. It sounds to me like this new type would not affect your ability to store offsets (you can store an index, use it later when referring to the string, etc. just like you can now).

My string type does not allow for writeable strings. My plan was to allow you access to the underlying char[] and let you edit that way. Letting someone write a dchar into the middle of a utf-8 string could cause lots of problems, so I just disabled it by default. Not sure how that affects your 'transforming' work, are you actually changing the data or just lazily transforming? I'm interested to hear whether you think my string type would be a viable alternative.
 Probably the right thing to do in this case is just pay for the cost of  
 using dchar everywhere, but if you're working with large enough  
 quantities of data, storage efficiency matters.
The huge advantage of using utf-8 is backwards compatibility with ASCII for C functions. -Steve
Dec 03 2010
parent reply Jerry Quinn <jlquinn optonline.net> writes:
Steven Schveighoffer Wrote:

 On Fri, 03 Dec 2010 14:40:30 -0500, Jerry Quinn <jlquinn optonline.net>  
 wrote:
 
 I tend to do a lot of transforming strings, but I need to track offsets  
 back to the original text to maintain alignment between the results and  
 the input.  For that, indexes are necessary and we use them a lot.
In my daily usage of strings, I generally use a string as a whole, not individual characters. But I do occasionally use it. Let's also understand that indexing is still present, what is deactivated is the ability to index to arbitrary code-units. It sounds to me like this new type would not affect your ability to store offsets (you can store an index, use it later when referring to the string, etc. just like you can now). My string type does not allow for writeable strings. My plan was to allow you access to the underlying char[] and let you edit that way. Letting someone write a dchar into the middle a utf-8 string could cause lots of problems, so I just disabled it by default.
That's reasonable. I'm not trying to create invalid strings. I'm actually working in C++ but keeping an eye on things going on in D-land. The kind of stuff we do is to normalize text in preparation for natural language processing. As a simple example, let's say you want to use a set of regexes to identify patterns in text. You want to return the offsets of each regex that matches. However, before the regexes run, you replace all html tags with a placeholder, so they can easily span tags without worrying about the contents. Before you return the results to the user, though, you need to translate those offsets back to ones for the original string. Everything is unicode of course and we care about processing unicode code points, but want to maintain UTF-8 storage underneath to keep size down. In reality, we're often doing things like single character normalizations as well as larger spans, but still need to maintain alignment to the original data. As long as this is reasonable to do, I'm fine. I just wasn't sure from the descriptions I was seeing. Jerry
Dec 03 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 03 Dec 2010 22:08:37 -0500, Jerry Quinn <jlquinn optonline.net>  
wrote:

 I'm actually working in C++ but keeping an eye on things going on
 in D-land.  The kind of stuff we do is to normalize text in preparation
 for natural language processing.

 As a simple example, let's say you want to use a set of regexes to  
 identify patterns
 in text.  You want to return the offsets of each regex that matches.   
 However, before the regexes
 run, you replace all html tags with a placeholder, so they can easily  
 span
 tags without worrying about the contents.
I'm assuming you are not changing the length of the string, or is that not correct?
 Before you return the results to the
 user, though, you need to translate those offsets back to ones for the  
 original
 string.
Hm... I guess you must be changing the lengths if the offsets are different. That seems odd, wouldn't you encounter performance issues when processing large documents?
 Everything is unicode of course and we care about processing unicode  
 code points, but want to maintain UTF-8 storage underneath to keep size  
 down.

 In reality, we're often doing things like single character  
 normalizations as well as larger spans, but still need to maintain  
 alignment to the original data.

 As long as this is reasonable to do, I'm fine.  I just wasn't sure from  
 the descriptions I was seeing.
What you will have is access to the underlying char[] array, which should give you full edit access. I just don't want strings to be easily editable, since doing so correctly can be very difficult. Any offsets to dchar code-points in the string will match offsets to char code-units. In effect, you are always indexing by code-unit, even though with the string type you get code-points back. It should be as simple as accessing a member (like str.data) or casting (i.e. cast(char[])str). I'm unsure yet if it's dangerous enough to require casting. -Steve
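To make Steve's description concrete, here is a minimal sketch of the escape hatch he mentions. The member name `data` and the struct shape are illustrative, taken from the post rather than from any shipped API:

```d
// Hypothetical sketch: the string type iterates by code point, but hands
// out the raw code-unit array on request for direct editing.
struct string_t(T)
{
    private T[] _data;
    this(T[] data) { _data = data; }

    // Escape hatch: direct, unchecked access to the code units.
    // Editing through this bypasses any UTF validity guarantees.
    @property inout(T)[] data() inout { return _data; }
}

void main()
{
    auto s = string_t!(immutable char)("hæ?");
    // Offsets into s.data are code-unit offsets, exactly as in a char[].
    assert(s.data.length == 4); // 'æ' is two UTF-8 code units
}
```

Whether this access should instead require an explicit cast(char[]) is the open question in the post.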
Dec 04 2010
prev sibling parent spir <denis.spir gmail.com> writes:
On Thu, 02 Dec 2010 16:24:03 -0500
"Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 At this point, you can't do that except via opApply, and I didn't want to
 inject that in fear that it would be pointed out as a drawback.

 It would be nice if we could define a way to do that via ranges...
According to TDPL p.381, one could define slicing instead of opApply or range methods (since a slice is "rangeable", the compiler should take a slice if nothing else is present) (not tried, yet). Your string type is, indeed, a typical case where this would perfectly make sense. Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Dec 03 2010
prev sibling parent reply Steven Wawryk <stevenw acres.com.au> writes:
On 02/12/10 12:43, Ellery Newcomer wrote:
 On 12/01/2010 03:35 PM, Steven Schveighoffer wrote:
 I find that iteration over string characters using index is a very rare
 thing anyways, you either use foreach, which should give you dchars, or
 you use something like find, which should never give you an invalid
 index.

 -Steve
find was the counterargument I had in mind for keeping the operator overload, as something like s[find(s,'\u2729') .. s.codeUnits] is just a bit better than s.codePointSliceAt(find(s,'\u2729'), s.codeUnits); I really don't know.
I don't like either. I'd prefer (from std.algorithm) s = find(s, '\u2729'); I still don't see a need for any random-access methods, including slicing, on the code-point range. The ability to convert back and forth between the string_t and T[] would be useful to allow the user to choose between random-access code-unit range and bidirectional code-point range (or grapheme/glyph/etc). For example a T[] accessor property and an opAssign from a T[]. So if there are efficiencies to be had with random-access (and slicing), then the user could choose s = find(s.codeUnits, '\u2729'); Are there any other reasons to slice on a code-point range? Or have any capabilities beyond a bidirectional range? Unless there are compelling reasons, I'd have to say that slicing and indexing on the code-point level is not much more than a hack.
Jan 12 2011
parent Steven Wawryk <stevenw acres.com.au> writes:
 s = find(s.codeUnits, '\u2729');
Sorry, I didn't mean the codeUnits returning a size_t. I meant an accessor for the underlying code-unit range. Something like @property T[] codeUnits() { return _data; }
Jan 12 2011
prev sibling next sibling parent reply "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:
 
 [...]
 
 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.
Slicing is still O(1). [...]
It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now. On a side note: It seems to me that the only reason to have char, wchar, and dchar as separate types in the language is that arrays of said types are UTF-encoded strings. If a type such as the proposed one were to become the default string type in D, it might as well wrap an array of ubyte/ushort/uint, since direct user manipulation of the underlying array will generally only happen in the rare cases when one wants to deal directly with code units. -Lars
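For the record, the behaviour Lars describes falls straight out of UTF-8's layout; a small check with plain std.utf (no custom string type involved) shows why indices 1 and 2 both land inside 'æ':

```d
import std.utf : count, stride;

void main()
{
    string s = "hæ?";
    // 'æ' encodes as two UTF-8 code units (0xC3 0xA6), so the array
    // holds 4 code units but only 3 code points.
    assert(s.length == 4);
    assert(count(s) == 3);
    // The code point starting at code-unit index 1 is 2 units wide,
    // so code-unit index 2 falls in the middle of 'æ'.
    assert(stride(s, 1) == 2);
}
```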
Nov 30 2010
next sibling parent spir <denis.spir gmail.com> writes:
On Tue, 30 Nov 2010 23:34:11 +0000 (UTC)
"Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:
 
  On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
  <jmdavisProg gmx.com> wrote:
 
  [...]
 
  4. Indexing is no longer O(1), which violates the guarantees of the
  index operator.
 Indexing is still O(1).
  5. Slicing (other than a full slice) is no longer O(1), which violates
  the
  guarantees of the slicing operator.
 Slicing is still O(1). [...]
 It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.
If I understand correctly how _charStart works in combination with indexing and slicing, then here is something wrong in the type's interface. After auto str = mystring("hæ?"); either one provides a code-unit index and gets a code unit: writeln(str[1], " ", str[2]); // "� �" (invalid utf code points) or one provides a code-point index and gets a code point: writeln(str[1], " ", str[2]); // "æ ?" But for string manipulation, wouldn't it be better that your string type systematically wraps a dchar[] array, whatever the original encoding? For indexing, slicing, finding, counting, etc. to be fast, I mean. Decoding being done only once at string creation time.
 On a side note:  It seems to me that the only reason to have char, wchar,
 and dchar as separate types in the language is that arrays of said types
 are UTF-encoded strings.  If a type such as the proposed one were to
 become the default string type in D, it might as well wrap an array of
 ubyte/ushort/uint, since direct user manipulation of the underlying array
 will generally only happen in the rare cases when one wants to deal
 directly with code units.
Yes, but then, see remark above. Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Dec 01 2010
prev sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad  
<public kyllingen.nospamnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.
Slicing is still O(1). [...]
It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.
I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.
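The behaviour Steve proposes here (throw on an index that lands mid-sequence) can be sketched with std.utf: a UTF-8 continuation byte always has the bit pattern 10xxxxxx, so detecting an invalid start index is O(1). The function name `charAt` below is illustrative, not from the posted code:

```d
import std.utf : decode, UTFException;

// Hypothetical opIndex body: index by code unit, return the code point,
// throw if the index does not start a code point.
dchar charAt(const(char)[] data, size_t i)
{
    // Continuation bytes look like 0b10xxxxxx.
    if ((data[i] & 0xC0) == 0x80)
        throw new UTFException("index is not on a code-point boundary");
    size_t idx = i;
    return decode(data, idx); // O(1): decodes only this one code point
}

void main()
{
    auto s = "hæ?";
    assert(charAt(s, 1) == 'æ');    // index 1 starts the 'æ' sequence
    bool threw = false;
    try charAt(s, 2); catch (UTFException) threw = true;
    assert(threw);                   // index 2 is mid-sequence
}
```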
 On a side note:  It seems to me that the only reason to have char, wchar,
 and dchar as separate types in the language is that arrays of said types
 are UTF-encoded strings.  If a type such as the proposed one were to
 become the default string type in D, it might as well wrap an array of
 ubyte/ushort/uint, since direct user manipulation of the underlying array
 will generally only happen in the rare cases when one wants to deal
 directly with code units.
I'd still want a char[] array for easy manipulation and eventual printing. Wrapping a ubyte[] with a string just to print would be strange. My goal in this exercise is to try and give control of what to deal with (code-points vs. code-units) back to the user. Right now, the library forces you to view them as code-points, and the compiler forces you to view them as code-units (except via foreach). But it is an interesting idea (removing small char types). -Steve
Dec 01 2010
parent reply "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:
 
 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which
 violates the
 guarantees of the slicing operator.
Slicing is still O(1). [...]
It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.
I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.
As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics. I think it's great that you're doing this, by the way! I haven't made up my mind yet about whether I want char[] or a separate string type, but it is great to have an actual implementation of the latter at hand when debating it. -Lars
Dec 01 2010
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad  
<public kyllingen.nospamnet> wrote:

 On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which
 violates the
 guarantees of the slicing operator.
Slicing is still O(1). [...]
It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.
I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.
As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.
Then substring (slicing) becomes an O(n) operation. It just doesn't work well. It seems to be awkward at first thought, but the more I think about it, the more I think it's right. When do you ever depend on specific indexes in a string being valid, or to be incrementing always by 1?
 I think it's great that you're doing this, by the way!  I haven't made up
 my mind yet about whether I want char[] or a separate string type, but it
 is great to have an actual implementation of the latter at hand when
 debating it.
Thanks :) -Steve
Dec 02 2010
parent reply "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Thu, 02 Dec 2010 16:18:52 -0500, Steven Schveighoffer wrote:

 On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:
 
 On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which
 violates the
 guarantees of the slicing operator.
Slicing is still O(1). [...]
It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.
I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.
As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.
Then substring (slicing) becomes an O(n) operation. It just doesn't work well.
What I meant was that opSlice() should be disabled in the same way as opIndex(). What you have now is

struct string_t(T)
{
    dchar opIndex(size_t);               // Must be O(1)
    string_t opSlice(size_t, size_t);    // Must be O(1)
}

and what I'm suggesting is

struct string_t(T)
{
    dchar character(size_t);             // May be O(n)
    string_t substring(size_t, size_t);  // May be O(n)

    static if (is(T == dchar))
    {
        alias character opIndex;         // Must be O(1)
        alias substring opSlice;         // Must be O(1)
    }
}
 It seems to be awkward at first thought, but the more I
 think about it, the more I think it's right.  When do you ever depend on
 specific indexes in a string being valid, or to be incrementing always
 by 1?
That is a good question indeed. But I'm still not convinced. :) Another thing to consider is that by having opIndex() in there, your string satisfies isRandomAccessRange. Then, some algorithms which work with both bidirectional and random access may choose to go with the latter. This is a quote from the std.algorithm.find() docs: Specializations taking advantage of bidirectional or random access (where present) may accelerate search [...] -Lars
Dec 03 2010
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 03 Dec 2010 08:13:50 -0500, Lars T. Kyllingstad  
<public kyllingen.nospamnet> wrote:

 On Thu, 02 Dec 2010 16:18:52 -0500, Steven Schveighoffer wrote:

 On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:

 On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which
 violates the
 guarantees of the slicing operator.
Slicing is still O(1). [...]
It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.
I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.
As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.
Then substring (slicing) becomes an O(n) operation. It just doesn't work well.
What I meant was that opSlice() should be disabled in the same way as opIndex(). What you have now is

struct string_t(T)
{
    dchar opIndex(size_t);               // Must be O(1)
    string_t opSlice(size_t, size_t);    // Must be O(1)
}

and what I'm suggesting is

struct string_t(T)
{
    dchar character(size_t);             // May be O(n)
    string_t substring(size_t, size_t);  // May be O(n)

    static if (is(T == dchar))
    {
        alias character opIndex;         // Must be O(1)
        alias substring opSlice;         // Must be O(1)
    }
}
O(n) slicing and indexing, called by different names, is unacceptable. It is possible to have O(1) indexing and slicing, I don't see why we should disallow it. If nothing else, it cripples this implementation to the point where it won't be accepted, simply because it performs worse than the current solution. Now, if one wants to have that feature in *addition* to O(1) indexing/slicing, I don't think that's an issue.
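Steve's O(1) claim holds because slicing by code-unit offsets is plain array slicing underneath: no decoding is performed, only the pointer and length are adjusted. A minimal illustration with a raw string:

```d
void main()
{
    string s = "hællo";
    // 'æ' spans code units 1 and 2, so this slice is exactly "æ".
    // The slice is O(1): it reuses the same memory, adjusting only
    // the pointer and the length.
    auto sub = s[1 .. 3];
    assert(sub == "æ");
    assert(sub.ptr == s.ptr + 1); // no copy was made
}
```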
 It seems to be awkward at first thought, but the more I
 think about it, the more I think it's right.  When do you ever depend on
 specific indexes in a string being valid, or to be incrementing always
 by 1?
That is a good question indeed. But I'm still not convinced. :) Another thing to consider is that by having opIndex() in there, your string satisfies isRandomAccessRange. Then, some algorithms which work with both bidirectional and random access may choose to go with the latter. This is a quote from the std.algorithm.find() docs: Specializations taking advantage of bidirectional or random access (where present) may accelerate search [...]
I just looked it up, isRandomAccess requires hasLength, which my string type does not. So my string type is not a true random access range (this was intentional). -Steve
Dec 03 2010
parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Fri, 03 Dec 2010 08:59:52 -0500, Steven Schveighoffer wrote:

 On Fri, 03 Dec 2010 08:13:50 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:
 Another thing to consider is that by having opIndex() in there, your
 string satisfies isRandomAccessRange.  Then, some algorithms which work
 with both bidirectional and random access may choose to go with the
 latter.  This is a quote from the std.algorithm.find() docs:

     Specializations taking advantage of bidirectional or random access
     (where present) may accelerate search [...]
I just looked it up, isRandomAccess requires hasLength, which my string type does not. So my string type is not a true random access range (this was intentional).
Ah, you are right. I didn't read the documentation for isRandomAccessRange, I just looked at the accompanying code snippet, which doesn't mention hasLength. -Lars
Dec 03 2010
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, December 03, 2010 05:13:50 Lars T. Kyllingstad wrote:
 On Thu, 02 Dec 2010 16:18:52 -0500, Steven Schveighoffer wrote:
  On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad
  <public kyllingen.nospamnet> wrote:
  On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:
  On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
  <public kyllingen.nospamnet> wrote:
  On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:
  On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
  <jmdavisProg gmx.com> wrote:
 
  [...]
 
  4. Indexing is no longer O(1), which violates the guarantees of the
  index operator.
 Indexing is still O(1).
  5. Slicing (other than a full slice) is no longer O(1), which
  violates the
  guarantees of the slicing operator.
 Slicing is still O(1). [...]
 It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.
 I don't think it's possible to do that with any implementation without
 making indexing not O(1).  This just isn't possible, unless you want
 to use dchar[].
 
 But your point is well taken.  I think what I'm going to do is throw
 an exception when accessing an invalid index.  While also surprising,
 it doesn't result in "extra data".  I feel it's probably very rare to
 just access hard-coded indexes like that unless you are sure of the
 data in the string.  Or to use a for-loop to access characters, etc.
 As soon as you add opIndex(), your interface becomes that of a random-access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.
 Then substring (slicing) becomes an O(n) operation. It just doesn't work well.
 What I meant was that opSlice() should be disabled in the same way as opIndex().
A string type without slicing (which must be O(1)) is DOA without question. Slicing is _far_ too useful to lose. Indexing in strings is fairly rare because it's generally a stupid idea, but slicing happens all the time. If nothing else, that is _the_ way to get a substring.

- Jonathan M Davis
Dec 03 2010
prev sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, November 30, 2010 10:52:20 Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com>
 
 wrote:
 1. At least until universal function syntax is in the language, you use
 the
 ability to do str.func().
Yes, but first, this is not a problem with the string type, it's a problem with the language. Second, any string-specific functions can be added, it is a struct after all, not a builtin.
I really don't think that we want to have to be adding string functions to a string struct. Having them external is far better, particularly when there are so many of them which aren't directly related to strings but use them heavily. Even if all of std.string got tacked onto a struct string type, that would leave out the rest of Phobos and any user code. I really think that universal function syntax is a necessity for a struct solution to be acceptable. I had considered a solution similar to this one a few months back, and the lack of universal function syntax is one of the reasons why I decided that it wasn't really an improvement. Honestly, without uniform function call syntax, I would consider a struct solution to be DOA. The loss in usability would just be too great.
 2. Functions that would work just fine treating strings as arrays of
 code units
 (presumably because they don't care about what the actual data is) lose
 efficiency, because now a string isn't an array.
Which functions are those? They can be allowed via wrappers.
It is my understanding that there are various functions in std.algorithm which are able to treat strings as arrays and therefore process them more efficiently. I haven't looked into which ones are in that camp. I would think that find() might be, but I'd have to look.
 3. You have no access to the underlying array unless you're dealing with
 an
 actual array of dchar.
I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.
Something like that would need to be done. It really should be a public property of some kind if not an outright public variable.
 4. Indexing is no longer O(1), which violates the guarantees of the index
 operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.
Slicing is still O(1).
You're right. I misread what _charStart() did. However, if I understand it correctly now, you give it a code unit index and yet get a code point back. That worries me. It means that there is no relation between the length of the string and the indices that you'd use to index it. It _is_ related to the number of code units, which you can get with codeUnits(), but that disjoint seems like it could cause problems. It might be the correct solution, but I think that it merits some examination. Returning a code unit would be wrong since that totally breaks with the rest of the API dealing in dchar/code units, and it would be wrong to index by code point, since then indexing and slicing are no longer O(1). So, it's probably a choice between indexing by one thing and returning another or not having indexing and slicing, which would definitely not be good. So, maybe your solution is the best one in this respect, but it worries me. The exact ramifications of that need to be looked into.
 What you're doing here is forcing the view of a string as a range of
 dchar in
 all cases. Granted, that's what you want in most cases, but it can
 degrade
 efficiency, and the fact that some operations (in particular indexing
 and slicing)
 are not O(1) like they're supposed to be means that algorithms which
 rely on
 O(1) behavior from them could increase their cost by an order of
 magnitude. All
 the cases where treating a string as an actual array which are currently
 valid
 are left out to dry
You can still use char[] and wchar[].
Except that what if you need to do both with the same type? Right now, you could have a function which treats a string as a range of dchar while another one which can get away with treating it as a range of code units can treat it as an array. You can pass the same string to both, and it works. That should still work if we go for a struct solution. Special-casing on strings and specifically using the internal array instead of the struct for them could fix the problem, but it still needs to be possible.
 The inherent problem with strings is that we _want_ to be able to view them as both arrays of code units and as ranges of dchar. Trying to view them only as ranges of dchar is inefficient in a number of cases. Even something as simple as getting a substring is less efficient with this code. Only in cases where you don't actually know the length of the substring that you want is this essentially as efficient as what we have now. You're ignoring all cases where viewing a string as an array of code units is correct and desirable. You're only solving half of the problem (albeit the more prevalent half).
I'm not planning on eliminating char and wchar based arrays. In other words, you should be able to get access to the array, but it should not be the default, and it should be considered unsafe.
 Now, making it possible to have a wrapper struct like this which works in many cases where you'd use strings could reduce errors in code in many situations, so giving the programmer the ability to do that could be interesting. But it seems to me that this solution is too limiting to be a viable replacement for strings as they are.
Hopefully you can see that I'm not eliminating the functionality you are looking for, just making it not the default.
There is definitely some value in making strings treated as ranges of dchar by default. But for the most part, that's the case already thanks to how std.array is defined. The only place where that runs into trouble is if you use foreach. It still treats them as arrays of code units unless you tell it to iterate over dchar. Either making foreach over character arrays iterate over dchar by default or making it a warning or error to use foreach with a char or wchar array of any kind without specifying the type to iterate over would fix that problem.

I don't think that the basic idea of what you present is necessarily a bad idea (I've considered proposing similar solutions in the past), but I'm not at all convinced that it's an improvement. What we have works quite well overall, with the notable exception of foreach. I'm also worried that trying to bury the fact that strings are arrays of code units is going to cause a leaky abstraction which will cause problems. It's not likely to be as bad as the situation with char literals and std::string in C++, but I think that we need to look at this very carefully and examine what all of the ramifications are before we even consider going for it.

There is an inherent but necessary disconnect between having strings be arrays of code units and ranges of dchar. Sometimes they need to be treated as one, sometimes as the other. Ideally, the default - whichever it is - would be the one which leads to fewer bugs. But they both need to be there. A struct solution is essentially an attempt to make strings treated as ranges of dchar in more situations by default than is currently the case. As such, it could be better than what we have now, but I'm honestly not convinced that it is. Aside from the foreach problem (which could be fixed with an appropriate warning or error - preferably error), what we have works quite well.

- Jonathan M Davis
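The foreach behavior discussed above can be seen in a short sketch (assuming current D semantics, where the iteration type defaults to the code unit; the string and variable names are just illustrative):

```d
import std.stdio;

void main()
{
    string s = "héllo"; // 'é' takes two UTF-8 code units

    // Default: foreach iterates code units (char),
    // so 'é' shows up as two separate bytes.
    size_t units;
    foreach (c; s)
        ++units;

    // Explicit dchar: foreach decodes, so 'é' is one element.
    size_t points;
    foreach (dchar c; s)
        ++points;

    writefln("%s code units, %s code points", units, points); // 6 and 5
}
```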
Nov 30 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 30 Nov 2010 18:34:56 -0500, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:

 On Tuesday, November 30, 2010 10:52:20 Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 1. At least until universal function syntax is in the language, you lose the ability to do str.func().
Yes, but first, this is not a problem with the string type, it's a problem with the language. Second, any string-specific functions can be added, it is a struct after all, not a builtin.
I really don't think that we want to have to be adding string functions to a string struct. Having them external is far better, particularly when there are so many of them which aren't directly related to strings but use them heavily. Even if all of std.string got tacked onto a struct string type, that would leave out the rest of Phobos and any user code. I really think that universal function syntax is a necessity for a struct solution to be acceptable.
Why? It's a choice between adding them to the struct, and doing func(str) instead. You already have to do func(R) for arbitrary ranges, I don't see why strings should be specialized. For things that should be considered 'built-in' for strings, they can be included as members of the struct. In other words, I think it's worth fixing strings now, even if it means we cannot call str.func() until the universal syntax is introduced.
 I had considered a solution similar to this one a few months back, and the lack of universal function syntax is one of the reasons why I decided that it wasn't really an improvement. Honestly, without uniform function call syntax, I would consider a struct solution to be DOA. The loss in usability would just be too great.
But you can still call the function, it's just func(str). There is no loss of functionality. Usability isn't even really lost.
 2. Functions that would work just fine treating strings as arrays of code units (presumably because they don't care about what the actual data is) lose efficiency, because now a string isn't an array.
Which functions are those? They can be allowed via wrappers.
It is my understanding that there are various functions in std.algorithm which are able to treat strings as arrays and therefore process them more efficiently. I haven't looked into which ones are in that camp. I would think that find() might be, but I'd have to look.
Since std.algorithm works exclusively with ranges, those must be special cases for strings because strings are bi-directional ranges of dchar according to phobos. So those can continue to be special cases for the new string types.
 4. Indexing is no longer O(1), which violates the guarantees of the index operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.
Slicing is still O(1).
You're right. I misread what _charStart() did. However, if I understand it correctly now, you give it a code unit index and yet get a code point back. That worries me. It means that there is no relation between the length of the string and the indices that you'd use to index it. It _is_ related to the number of code units, which you can get with codeUnits(), but that disconnect seems like it could cause problems. It might be the correct solution, but I think that it merits some examination. Returning a code unit would be wrong, since that totally breaks with the rest of the API dealing in dchar/code points, and it would be wrong to index by code point, since then indexing and slicing are no longer O(1). So, it's probably a choice between indexing by one thing and returning another or not having indexing and slicing, which would definitely not be good. So, maybe your solution is the best one in this respect, but it worries me. The exact ramifications of that need to be looked into.
How many times do you use a hard-coded index into a string without knowing the encoding of the string (i.e. "I know this string is ASCII")? How many times do you iterate the characters of a string via an incrementing index? It's not common to use the indexing operation with values that are not known or computed to be valid starts of code points. In fact, the language depends on that (otherwise you'd see sliced strings everywhere with invalid data). So while it looks strange, it shouldn't be a common need.

That being said, I think Lars pointed out that the strangeness of returning the code point even if you point in the middle would be surprising in some cases, so I think the better solution is to throw an exception.
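The exception-throwing variant suggested here could look roughly like this (a sketch, not the posted code: the wrapper name, the boundary test, and the use of UTFException are assumptions):

```d
import std.utf;

// Hypothetical variant: opIndex throws when the index lands in the
// middle of a code point, instead of decoding from the preceding
// boundary as the posted _charStart-based version does.
struct ThrowingString
{
    immutable(char)[] _data;

    dchar opIndex(size_t idx) const
    {
        // A UTF-8 continuation byte has the form 10xxxxxx.
        if ((_data[idx] & 0b1100_0000) == 0b1000_0000)
            throw new UTFException("index does not start a code point");
        size_t i = idx;
        return std.utf.decode(_data, i);
    }
}

unittest
{
    auto s = ThrowingString("héllo");
    assert(s[0] == 'h');
    assert(s[1] == 'é');  // index 1 starts the two-unit sequence
    // s[2] throws: it points into the middle of 'é'.
}
```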
 What you're doing here is forcing the view of a string as a range of dchar in all cases. Granted, that's what you want in most cases, but it can degrade efficiency, and the fact that some operations (in particular indexing and slicing) are not O(1) like they're supposed to be means that algorithms which rely on O(1) behavior from them could increase their cost by an order of magnitude. All the cases where treating a string as an actual array which are currently valid are left out to dry
You can still use char[] and wchar[].
But what if you need to do both with the same type? Right now, you could have a function which treats a string as a range of dchar while another one which can get away with treating it as a range of code units can treat it as an array. You can pass the same string to both, and it works. That should still work if we go for a struct solution. Special-casing on strings and specifically using the internal array instead of the struct for them could fix the problem, but it still needs to be possible.
Yeah, it's definitely needed. I'll add access to the data member in the next version.
 Hopefully you can see that I'm not eliminating the functionality you are looking for, just making it not the default.
There is definitely some value in making strings treated as ranges of dchar by default. But for the most part, that's the case already thanks to how std.array is defined. The only place where that runs into trouble is if you use foreach.
or indexing. This is a huge problem.
 It still treats them as arrays of code units unless you tell it to iterate over dchar. Either making foreach over character arrays iterate over dchar by default or making it a warning or error to use foreach with a char or wchar array of any kind without specifying the type to iterate over would fix that problem.
I agree that would be ideal, but it still doesn't solve the indexing problem.
 There is an inherent but necessary disconnect between having strings be arrays of code units and ranges of dchar. Sometimes they need to be treated as one, sometimes as the other. Ideally, the default - whichever it is - would be the one which leads to fewer bugs. But they both need to be there. A struct solution is essentially an attempt to make strings treated as ranges of dchar in more situations by default than is currently the case. As such, it could be better than what we have now, but I'm honestly not convinced that it is. Aside from the foreach problem (which could be fixed with an appropriate warning or error - preferably error), what we have works quite well.
The thing is, the most common use of strings is as a string, not as an array of code-units. The common case is to print, slice, find, etc. on a *string*. When dealing with the string as a whole, either using an array or a specialized type works equally well.

The uncommon case is to extract individual characters from the string. In this case, the default needs to be the most common need in that area -- extracting a dchar, not a code-unit. Having the default index operation extract a code-unit is very incorrect.

-Steve
Dec 01 2010
parent reply foobar <foo bar.com> writes:
Steven Schveighoffer Wrote:
[snipped]
 3. You have no access to the underlying array unless you're dealing with an actual array of dchar.
I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.
 -Steve
A string type should always maintain the invariant that it is a valid unicode string. Therefore I don't like having an unsafe opCast or providing direct access to the underlying array. I feel that there should be a read-only property for that. Algorithms that manipulate char[]'s should construct a new string instance which will validate the char[] it is being built from is a valid utf string.

This looks like a great start for a proper string type. There's still the issue of literals that would require compiler/language changes.

There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)
Dec 01 2010
parent reply spir <denis.spir gmail.com> writes:
On Wed, 01 Dec 2010 03:30:07 -0500
foobar <foo bar.com> wrote:

 Steven Schveighoffer Wrote:
 [snipped]
 3. You have no access to the underlying array unless you're dealing with an actual array of dchar.
 I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters.
 But it could just be as easy as making the array itself public.
 -Steve
 A string type should always maintain the invariant that it is a valid unicode string. Therefore I don't like having an unsafe opCast or providing direct access to the underlying array. I feel that there should be a read-only property for that. Algorithms that manipulate char[]'s should construct a new string instance which will validate the char[] it is being built from is a valid utf string.
But then, why not store a dchar[] array systematically? Validation and decoding is the same job. Once decoded, all methods work as expected (eg s[3] returns the 4th code point) and blitz fast.
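The dchar[] suggestion above, as a minimal sketch (the trade-off being up to four times the storage of UTF-8 for mostly-ASCII text; toUTF32 is the assumed conversion route, and it validates as it decodes):

```d
import std.utf;

void main()
{
    string utf8 = "héllo";            // UTF-8, 6 code units
    dchar[] wide = toUTF32(utf8).dup; // decode (and validate) once, up front

    assert(wide.length == 5);  // one element per code point
    assert(wide[1] == 'é');    // O(1) indexing by code point
}
```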
 This looks like a great start for a proper string type. There's still the issue of literals that would require compiler/language changes.
Yop...
 There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)
This is my next little project. May build on Steve's job. (But it's not necessary, dchar is enough as a base, I guess.)

Denis
-- -- -- -- -- -- --
vit esse estrany
☣ spir.wikidot.com
Dec 01 2010
parent reply stephan <none example.com> writes:
 There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)
 This is my next little project. May build on Steve's job. (But it's not necessary, dchar is enough as a base, I guess.)
Hi Denis, you might want to consider helping us out. We have got a feature-complete Unicode normalization, case-folding, and concatenation implementation passing all test cases in http://unicode.org/Public/6.0.0/ucd/NormalizationTest.txt (and then some) for all recent Unicode versions. This code was part of a bigger project that we have stopped working on. We feel that the Unicode normalization part might be useful to others. Therefore we consider releasing them under an open source license. Before we can do so, we have to clean up things a bit. Some open issues are a) The code still contains some TODOs and FIXMEs (bugs, inefficiencies, some bigger issues like more efficient storing of data etc.). b) No profiling and no benchmarking against the ICU implementation (http://site.icu-project.org/) has been done yet (we expect surprises). c) Implementation of additional Unicode algorithms (e.g. full case mapping, matching, collation). Since we have stopped working on the bigger project, we haven’t made much progress. Any help would be welcome. Let me know whether this would be of interest to you.
Dec 01 2010
parent reply spir <denis.spir gmail.com> writes:
On Wed, 01 Dec 2010 17:41:17 +0100
stephan <none example.com> wrote:

 There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)
 This is my next little project. May build on Steve's job. (But it's not necessary, dchar is enough as a base, I guess.)
 Hi Denis, you might want to consider helping us out.

 We have got a feature-complete Unicode normalization, case-folding, and concatenation implementation passing all test cases in http://unicode.org/Public/6.0.0/ucd/NormalizationTest.txt (and then some) for all recent Unicode versions. This code was part of a bigger project that we have stopped working on.

 We feel that the Unicode normalization part might be useful to others. Therefore we consider releasing them under an open source license. Before we can do so, we have to clean up things a bit. Some open issues are

 a) The code still contains some TODOs and FIXMEs (bugs, inefficiencies, some bigger issues like more efficient storing of data etc.).

 b) No profiling and no benchmarking against the ICU implementation (http://site.icu-project.org/) has been done yet (we expect surprises).

 c) Implementation of additional Unicode algorithms (e.g. full case mapping, matching, collation).

 Since we have stopped working on the bigger project, we haven't made much progress. Any help would be welcome. Let me know whether this would be of interest to you.
Yes, of course it would be useful, in any case. Either you wish to go on with your project, and I may be of some help; or it would anyway be a useful base or example of how to implement unicode algorithms. Maybe it's time to give some more information on what I intend to write. I have done it already (partially in Python, nearly completely in Lua).

What I have in mind is a "UText" type that provides the right abstraction for text processing / string manipulation as one has when dealing with ASCII (in fact, any legacy character set). All that is needed is having a true one-to-one mapping between characters (in the common sense) and elements of strings (what I call "code stacks"); one given stack unambiguously denotes one character. To reach this point, in addition to decoding (e.g. from utf8 to code points), we must:
* group codes into stacks
* normalize (into 'NFD')
* sorts points in stacks
That's the base.

Then, we can for instance index or slice in O(1) as usual, and get a consistent substring of _characters_ (not "abstract characters"). We can search for substrings by simple, direct comparisons. When dealing with utf32 strings (or worse, utf8), simple indexing or counting is O(n), or rather O(k.n) where k represents the (high) cost of "stacking", normalizing, and sorting on the fly -- it's not only traversing the whole string instead of random access, it's heavy computation all along the way.

From this base, all kinds of usual routines can be built without any more complexity. That's all I want to implement. I wish to write all general-purpose ones (which means, for instance, nothing like casing).

Precisely, I do not want to deal with anything related to script-, language-, or locale-specific issues. It's a completely separate & independent topic. This indeed includes the "compatibility" normalisation forms of unicode (which precisely do not provide a normal form...). It seems part of your project was to cope with such issues.

I would be happy to cooperate if you feel like going on (then, let us communicate off list). I still have the Lua code (which used to run); even if useless as help for implementation (the languages are too different), it could give some more concrete picture of what I have in mind. Also, it includes several test datasets, reprocessed for usability, from unicode's online files.

Denis
-- -- -- -- -- -- --
vit esse estrany
☣ spir.wikidot.com
Dec 01 2010
parent reply Ali Çehreli <acehreli yahoo.com> writes:
spir wrote:

 What I have in mind is a "UText" type that provides the right abstraction for text processing / string manipulation as one has when dealing with ASCII (in fact, any legacy character set). All that is needed is having a true one-to-one mapping between characters (in the common sense) and elements of strings (what I call "code stacks"); one given stack unambiguously denotes one character. To reach this point, in addition to decoding (e.g. from utf8 to code points), we must:
 * group codes into stacks
 * normalize (into 'NFD')
Are those operations independent of the context? Is stacking always desired? I guess one would use one of the D string types when grouping or normalization is not desired, right? Makes sense.
 * sorts points in stacks
Ok, I see that it is possible with NFD. I am not experienced with Unicode, but I think there will be issues with other types of Unicode normalization. (Judging from your posts, I know that you know all these. :) )
 Then, we can for instance index or slice in O(1) as usual, and get a consistent substring of _characters_ [...] I do not want to deal with anything related to script-, language-, locale- specific issues.

Is the concept of _character_ well defined in Unicode outside of the context of an alphabet? (I think your "script" covers alphabet.)

It is an interesting decision when we actually want to see an array of code points as characters. When would it be correct to do so? I think the answer is when we start treating the string as a piece of text. For a string to be considered as text, it must be based on an alphabet. ASCII strings are pieces of text, because they are based on the 26-letter alphabet.

I hope I don't sound like I'm arguing against anything that you said. I am also thinking about the other common operations that work on pieces of text:

- sorting (e.g. ç is between c and d in many alphabets)
- lowercasing, uppercasing (e.g. i<->İ and ı<->I in many alphabets)

As a part of the Turkish D community, we've played with the idea of such a text type. It took advantage of D's support for Unicode encoded source code, so it's fully in Turkish. Yay! :)

Here is the module that takes care of sorting, capitalization, and producing the base forms of the letters of the alphabets:

    http://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d

It is also based on dchar[], as you recommend elsewhere in this thread. It is written with the older D2 operator overloading, doesn't support ranges, etc. But it currently supports ten alphabets (including the 26-letter English, and the Old Irish alphabet).

Going out of the context of this thread, we've also worked on a type that contains pieces of text from different alphabets to make a "text", where a text like "jim & ali" is correctly capitalized as "JIM & ALİ".

I am thinking of more than what you describe. But your string would be useful for implementing ours, as we don't have normalization or stacking support at all.

Thanks,
Ali
Dec 03 2010
parent spir <denis.spir gmail.com> writes:
On Fri, 03 Dec 2010 14:11:35 -0800
Ali Çehreli <acehreli yahoo.com> wrote:

 spir wrote:

  > What I have in mind is a "UText" type that provides the right abstraction for text processing / string manipulation as one has when dealing with ASCII (in fact, any legacy character set). All that is needed is having a true one-to-one mapping between characters (in the common sense) and elements of strings (what I call "code stacks"); one given stack unambiguously denotes one character. To reach this point, in addition to decoding (e.g. from utf8 to code points), we must:
  > * group codes into stacks
  > * normalize (into 'NFD')

 Are those operations independent of the context? Is stacking always desired?

 I guess one would use one of the D string types when grouping or normalization is not desired, right? Makes sense.
We can consider there are two kinds of uses for texts in most applications:
1. just input/output (where input is also from literals in code), possibly with some concatenation.
2. string manipulation / text processing

For the 1st case, there is no need for any sophisticated toolkit like the type I intend to write. We can just read in latin-x, for instance, join it, output it back, without any problems. (As long as all pieces of text share the same encoding.) Problems arise as soon as text is to be manipulated or processed in any other way: indexing, searching, counting, slicing, replacing, etc... all these routines require isolating _characters_ in the ordinary sense of the word, inside the string of code units or code points.

A true text type, usable like ASCII in old days, would either provide routines that do that in the background, but in a costly way, or first group codes into characters once only -- then every later operation is as cheap as possible. Normalising and sorting are also required so that each character has only one representation.

I intend to write a little article to explain the issue (& misunderstandings created by Unicode's use of "abstract character"), and possible solutions.

To say it again: if all one needs is text input/output, then using such a tool is overkill. Actually, even the string type Steven is implementing is not strictly necessary. But it would have the advantage, if I understand correctly, to present a cleaner interface.
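The "grouping codes into characters" step described above can be illustrated with a toy sketch (the combining-mark test here is a crude assumption limited to U+0300..U+036F; real code would consult the Unicode combining-class tables, and normalization and sorting are left out):

```d
// Group code points into "stacks": one stack per user-perceived
// character, with combining marks attached to the preceding base.
dchar[][] stack(const(dchar)[] s)
{
    dchar[][] result;
    foreach (c; s)
    {
        // Simplified test: U+0300..U+036F are combining diacritics.
        if (c >= 0x0300 && c <= 0x036F && result.length)
            result[$ - 1] ~= c;   // attach to the previous base
        else
            result ~= [c];        // start a new stack
    }
    return result;
}

unittest
{
    // "e" + combining acute + "x": two characters, three code points
    auto stacks = stack("e\u0301x"d);
    assert(stacks.length == 2);
    assert(stacks[0].length == 2);  // base + combining mark
}
```

Once stacks are built, counting, indexing, and comparing by character become plain array operations, which is the point of doing the work up front.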
  > * sorts points in stacks

 Ok, I see that it is possible with NFD. I am not experienced with Unicode, but I think there will be issues with other types of Unicode normalization. (Judging from your posts, I know that you know all these. :) )

Yes, the algorithm comes with Unicode's docs about "canonicalisation".
  > Then, we can for instance index or slice in O(1) as usual, and get a
  > consistent substring of _characters_ [...] I do not want to deal with
  > anything related to script-, language-, locale- specific issues.

 Is the concept of _character_ well defined in Unicode outside of the context of an alphabet? (I think your "script" covers alphabet.)

 It is an interesting decision when we actually want to see an array of code points as characters. When would it be correct to do so? I think the answer is when we start treating the string as a piece of text.

 For a string to be considered as text, it must be based on an alphabet. ASCII strings are pieces of text, because they are based on the 26-letter alphabet.

 I hope I don't sound like I'm arguing against anything that you said. I am also thinking about the other common operations that work on pieces of text:

 - sorting (e.g. ç is between c and d in many alphabets)
 - lowercasing, uppercasing (e.g. i<->İ and ı<->I in many alphabets)

 As a part of the Turkish D community, we've played with the idea of such a text type. It took advantage of D's support for Unicode encoded source code, so it's fully in Turkish. Yay! :)

 Here is the module that takes care of sorting, capitalization, and producing the base forms of the letters of the alphabets:

      http://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d

 It is also based on dchar[], as you recommend elsewhere in this thread.

 It is written with the older D2 operator overloading, doesn't support ranges, etc. But it currently supports ten alphabets (including the 26-letter English, and the Old Irish alphabet).

 Going out of the context of this thread, we've also worked on a type that contains pieces of text from different alphabets to make a "text", where a text like "jim & ali" is correctly capitalized as "JIM & ALİ".

 I am thinking of more than what you describe. But your string would be useful for implementing ours, as we don't have normalization or stacking support at all.

As said, I do not wish to enter the huge area of script-, natural language-, or culture-specific issues, because it's not general; I just target a general-purpose tool. My type wouldn't even have a default uppercase routine, for instance: first, it's script specific; second, there is no general definition for that (*) -- even if Unicode provides such data. Sorting issues are even less decidable; it goes down to personal preferences (**). It's also too big, too diverse, too complicated an area.

But I guess the type I have in mind would provide a good basis for such a work (or rather, hundreds of language- and domain-specific tools or applications). At least, issues about grouping codes into characters, and multiple forms of characters, are solved: "ALİ" would always be represented the same way, and in a logical way; so that if you count its characters, you get 3 for sure, and if you search for 'İ' you find it for sure (which is not true today).
 Thanks,
 Ali

Thank to you,
Denis

(*) Is the uppercase of "gâté" "GATE" or "GÂTÉ"?
(**) Search for instance the threads about users complaining when KDE decided to sort file names according to a supposedly user-friendly order ("natural", lol!).
-- -- -- -- -- -- --
vit esse estrany
☣ spir.wikidot.com
Dec 03 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 01 Dec 2010 03:30:07 -0500, foobar <foo bar.com> wrote:

 Steven Schveighoffer Wrote:
 [snipped]
 3. You have no access to the underlying array unless you're dealing with an actual array of dchar.
I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.
 -Steve
A string type should always maintain the invariant that it is a valid unicode string. Therefore I don't like having an unsafe opCast or providing direct access to the underlying array. I feel that there should be a read-only property for that. Algorithms that manipulate char[]'s should construct a new string instance which will validate the char[] it is being built from is a valid utf string.
Copying is not a good idea, nor is runtime validation. We can only protect the programmer so much. The good news is that the vast majority of strings are literals, which should be properly constructed by the compiler, and immutable.
 This looks like a great start for a proper string type. There's still the issue of literals that would require compiler/language changes.
That is essential, the compiler has to defer the type of string literals to the library somehow.
 There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)
This is more solvable with a struct, but at this point, I'm not sure if it's worth worrying about. How common is that need?

-Steve
Dec 01 2010
parent reply Steven Wawryk <stevenw acres.com.au> writes:
I like the direction you're taking but have some quibbles about details. Specifically, I'd go for a more complete separation into random-access code-unit ranges and bidirectional code-point ranges:


On 01/12/10 02:18, Steven Schveighoffer wrote:
 // Written in the D programming language.

 /**
 Copyright: Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010
 License: <a href="http://www.boost.org/LICENSE_1_0.txt">Boost License
 1.0</a>.
 Authors: $(WEB erdani.org, Andrei Alexandrescu, Steven Schveighoffer)

 Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010.
 Distributed under the Boost Software License, Version 1.0.
 (See accompanying file LICENSE_1_0.txt or copy at
 http://www.boost.org/LICENSE_1_0.txt)
 */
 import std.utf;
 import std.traits;

 struct string_t(T) if (is(Unqual!T == char) || is(Unqual!T == wchar))
Is there a reason not to include is(Unqual!T == dchar)?
 {
 private T[] _data;
 this(T[] data)
 {
 this._data = data;
 }
An opAssign from a T[] could facilitate conversion back and forth between code-point and code-unit ranges.
 // note, this assumes that idx is a valid index
 private size_t _charStart(size_t idx) const
 {
 static if(is(Unqual!T == wchar))
 {
 immutable c = _data.ptr[idx];
 if(c >= 0xDC00 && c <= 0xDFFF)
 {
 // surrogate pair detected, verify we have at least 2 wchars,
 // and that both wchars are properly encoded.
 assert(idx > 0, "Invalid UTF character at beginning of string");
 return idx-1;
 }
 else
 return idx;
 }
 else
 {
 const p = _data.ptr + idx;
 if ((p[0] & 0b1100_0000) != 0b1000_0000)
 {
 return idx;
 }
 else if (idx >= 1 && (p[-1] & 0b1100_0000) != 0b1000_0000)
 {
 return idx - 1;
 }
 else if (idx >= 2 && (p[-2] & 0b1100_0000) != 0b1000_0000)
 {
 return idx - 2;
 }
 else if (idx >= 3 && (p[-3] & 0b1100_0000) != 0b1000_0000)
 {
 return idx - 3;
 }
 else
 {
 assert(false, "Invalid UTF character in string");
 }
 }
 }

 void popFront()
 {
 auto nc = std.utf.stride(_data, 0);
 assert(nc <= _data.length && nc != 0xFF, "Invalid sequence at beginning
 of string");
 _data = _data[nc .. $];
 }

 void popBack()
 {
 immutable n = _data.length;
 assert(n, "Attempting to pop back of an empty string");
 _data = _data.ptr[0.._charStart(n-1)];
 }

  @property dchar front() const
 {
 assert(_data.length, "Attempting to fetch the front of an empty string");
 size_t i = 0;
 return decode(_data, i);
 }

  @property dchar back() const
 {
 immutable n = _data.length;
 assert(n, "Attempting to fetch the back of an empty string");
 auto idx = _charStart(n-1);
 return std.utf.decode(_data, idx);
 }
There is the alternative of deferring decoding to the user and returning T[]'s holding exactly 1 code-point instead of dchars. I'm not sure which is best, but I'd be interested in seeing a case for choosing one or the other.
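The alternative could be sketched as a free function (hypothetical name, built on std.utf.stride) that hands back exactly the code units of the front code point and leaves decoding to the caller:

```d
import std.utf : stride;

// Hypothetical alternative to a decoding front(): return a slice
// holding exactly the code units of the first code point.
inout(char)[] frontUnits(inout(char)[] data)
{
    assert(data.length, "empty string");
    return data[0 .. stride(data, 0)];
}

void main()
{
    auto s = "h\u00E9llo"; // "héllo"; é occupies two code units in UTF-8
    assert(frontUnits(s) == "h");
    assert(frontUnits(s[1 .. $]) == "\u00E9"); // two code units, one code point
}
```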
  @property bool empty() const
 {
 return !_data.length;
 }

  @property typeof(this) save()
 {
 return this;
 }

 // support read-only random access via code unit index.
 dchar opIndex(size_t idx)
 {
 idx = _charStart(idx);
 return std.utf.decode(_data, idx);
 }

 string_t opSlice()
 {
 return this;
 }

 string_t opSlice(size_t start, size_t end)
 {
 if(start != _data.length)
 start = _charStart(start);
 if(end != _data.length)
 end = _charStart(end);
 return string_t(_data[start..end]);
 }

 // note we don't call this length because length can be assumed to be the
 // number of elements, which this isn't.
  @property size_t codeUnits() const
 {
 return _data.length;
 }
I don't see a need for _charStart, opIndex, opSlice and codeUnits. If the underlying T[] can be returned by a property, then these can be done through the code-unit array, which is random-access.
 // support append and concat
 // TODO: need to support appending various types of strings to each other
 // (right now only same-type strings can be appended, or raw arrays)
 ref string_t opOpAssign(string op, U)(U data) if (op == "~" && is(U ==
 string_t))
 {
 _data ~= data._data;
 return this;
 }

 ref string_t opOpAssign(string op, U)(U data) if (op == "~" && !is(U ==
 string_t) && is(typeof(_data ~= U.init)))
 {
 _data ~= data;
 return this;
 }

 string_t opBinary(string op, U)(U data) if (op == "~" && is(U == string_t))
 {
 return string_t(_data ~ data._data);
 }

 string_t opBinary(string op, U)(U data) if (op == "~" && !is(U ==
 string_t) && is(typeof(_data ~ U.init)))
 {
 return string_t(_data ~ data);
 }
 }

 template string_t(T) if (is(Unqual!T == dchar))
 {
 alias T[] string_t;
 }

 /** begin test code **/
 import std.stdio;

 alias string_t!(immutable char) mystring;
 alias string_t!(immutable wchar) mywstring;
 alias string_t!(immutable dchar) mydstring;

 void main()
 {
 auto str = mystring("hello");
 str ~= " world";
 str ~= mystring("!!!");
 writeln(str._data);
 }
Jan 12 2011
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 12 Jan 2011 04:49:26 -0500, Steven Wawryk <stevenw acres.com.au>  
wrote:

 I like the direction you're taking but have some quibbles about details.  
   Specifically, I'd go for a more complete separation into random-access  
 code-unit ranges and bidirectional code-point ranges:
Thanks for taking the time. I will respond to your points, but please make your rebuttals to the new thread I'm about to create with an updated string type.
 Is there a reason not to include is(Unqual!T == dchar)?
It's defined at the bottom as a simple alias to T[], since you do not need these features for a dchar[] string (it's correct to treat a dchar string as an array).
 An opAssign from a T[] could facilitate conversion back and forth  
 between code-point and code-unit ranges.
Yes, that would be a good feature.
 There is the alternative of deferring decoding to the user and returning  
 T[]'s holding exactly 1 code-point instead of dchars.  I'm not sure  
 which is best, but I'd be interested in seeing a case for choosing one  
 or the other.
Definitely decoding and returning dchar is best. The whole point of this string type is to handle the decoding for you. If you want to do decoding manually, you can just stick with arrays.
 I don't see a need for _charStart, opIndex, opSlice and codeUnits.  If  
 the underlying T[] can be returned by a property, then these can be done  
 through the code-unit array, which is random-access.
But that puts extra pain on the user for not much reason. Currently, strings slice in one operation; you are proposing that we slice in three operations: 1. get the underlying array, 2. slice it, 3. reconstruct a string based on the slice. Plus, if you remove opIndex, you are restricting the usefulness of the range. Note that this string type already will decode dchars out of the front and back; why not just give that ability to the middle of the string?
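The ergonomic difference can be shown with a minimal stand-in struct (simplified from the posted string_t; the data property exposing the code units is hypothetical, and _charStart handling is omitted since only the call shape matters here):

```d
// Minimal stand-in for string_t, just enough to compare slicing styles.
struct S
{
    private immutable(char)[] _data;
    this(immutable(char)[] d) { _data = d; }
    @property immutable(char)[] data() { return _data; } // hypothetical accessor
    S opSlice(size_t a, size_t b) { return S(_data[a .. b]); }
}

void main()
{
    auto str = S("hello world");

    // One operation, with opSlice kept on the type:
    auto sub = str[1 .. 5];

    // Three operations, if slicing is pushed onto the code-unit array:
    auto units = str.data;      // 1. fetch the underlying array
    auto part  = units[1 .. 5]; // 2. slice the code units
    auto sub2  = S(part);       // 3. wrap the slice back up

    assert(sub.data == sub2.data);
}
```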
 template string_t(T) if (is(Unqual!T == dchar))
 {
 alias T[] string_t;
 }
Note the definition of string_t!dchar here ^^^^ -Steve
Jan 13 2011