www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - [review] new string type

reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
In a prior thread, I promised to create a narrow string type that would  
enforce the requirements of narrow strings.  That is, the new string type  
should respect the encoding of narrow strings.

Here is a rough draft, tested minimally, but it does compile and pass  
simple tests.  It's pretty simple, which is what I would expect.  I copied  
a lot of stuff from std.array to get this to work.

The point of this type is -- if we replace what the compiler considers  
"strings" with this type, then we get both the compiler *and* phobos  
agreeing as to what this type is:  A bi-directional range of dchar.

As a bonus, char[] and wchar[] now would become arrays and be manipulated  
consistently with other arrays, which if not done correctly could cause  
problems, but may provide more flexibility and opportunity for  
performance.  Instead of the library fighting you on it.

Anyways, here it is, released under the boost license, commence attack ;)


// Written in the D programming language.

/**
Copyright: Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010
License:   <a href="http://www.boost.org/LICENSE_1_0.txt">Boost License  
1.0</a>.
Authors:   $(WEB erdani.org, Andrei Alexandrescu, Steven Schveighoffer)

Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010.
Distributed under the Boost Software License, Version 1.0.
    (See accompanying file LICENSE_1_0.txt or copy at
          http://www.boost.org/LICENSE_1_0.txt)
*/
import std.utf;
import std.traits;

struct string_t(T) if (is(Unqual!T == char) || is(Unqual!T == wchar))
{
     private T[] _data;
     this(T[] data)
     {
         this._data = data;
     }

     // note, this assumes that idx is a valid index
     private size_t _charStart(size_t idx) const
     {
         static if(is(Unqual!T == wchar))
         {
             immutable c = _data.ptr[idx];
             if(c >= 0xDC00 && c <= 0xDFFF)
             {
                 // surrogate pair detected, verify we have at least 2  
wchars,
                 // and that both wchars are properly encoded.
                 assert(idx > 0, "Invalid UTF character at beginning of  
string");
                 return idx-1;
             }
             else
                 return idx;
         }
         else
         {
             const p = _data.ptr + idx;
             if ((p[0] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx;
             }
             else if (idx >= 1 && (p[-1] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx - 1;
             }
             else if (idx >= 2 && (p[-2] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx - 2;
             }
             else if (idx >= 3 && (p[-3] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx - 3;
             }
             else
             {
                 assert(false, "Invalid UTF character in string");
             }
         }
     }

     void popFront()
     {
         auto nc = std.utf.stride(_data, 0);
         assert(nc <= _data.length && nc != 0xFF, "Invalid sequence at  
beginning of string");
         _data = _data[nc .. $];
     }

     void popBack()
     {
         immutable n = _data.length;
         assert(n, "Attempting to pop back of an empty string");
         _data = _data.ptr[0.._charStart(n-1)];
     }

      property dchar front() const
     {
         assert(_data.length, "Attempting to fetch the front of an empty  
string");
         size_t i = 0;
         return decode(_data, i);
     }

      property dchar back() const
     {
         immutable n = _data.length;
         assert(n, "Attempting to fetch the back of an empty string");
         auto idx = _charStart(n-1);
         return std.utf.decode(_data, idx);
     }

      property bool empty() const
     {
         return !_data.length;
     }

      property typeof(this) save()
     {
         return this;
     }

     // support read-only random access via code unit index.
     dchar opIndex(size_t idx)
     {
         idx = _charStart(idx);
         return std.utf.decode(_data, idx);
     }

     string_t opSlice()
     {
         return this;
     }

     string_t opSlice(size_t start, size_t end)
     {
         if(start != _data.length)
             start = _charStart(start);
         if(end != _data.length)
             end = _charStart(end);
         return string_t(_data[start..end]);
     }

     // note we don't call this length because length can be assumed to be  
the
     // number of elements, which this isn't.
      property size_t codeUnits() const
     {
         return _data.length;
     }

     // support append and concat
     // TODO: need to support appending various types of strings to  
eachother
     // (right now only same-type strings can be appended, or raw arrays)
     ref string_t opOpAssign(string op, U)(U data) if (op == "~" && is(U ==  
string_t))
     {
         _data ~= data._data;
         return this;
     }

     ref string_t opOpAssign(string op, U)(U data) if (op == "~" && !is(U  
== string_t) && is(typeof(_data ~= U.init)))
     {
         _data ~= data;
         return this;
     }

     string_t opBinary(string op, U)(U data) if (op == "~" && is(U ==  
string_t))
     {
         return string_t(_data ~ data._data);
     }

     string_t opBinary(string op, U)(U data) if (op == "~" && !is(U ==  
string_t) && is(typeof(_data ~ U.init)))
     {
         return string_t(_data ~ data);
     }
}

template string_t(T) if (is(Unqual!T == dchar))
{
     alias T[] string_t;
}

/** begin test code **/
import std.stdio;

alias string_t!(immutable char) mystring;
alias string_t!(immutable wchar) mywstring;
alias string_t!(immutable dchar) mydstring;

void main()
{
     auto str = mystring("hello");
     str ~= " world";
     str ~= mystring("!!!");
     writeln(str._data);
}
Nov 30 2010
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, November 30, 2010 07:48:03 Steven Schveighoffer wrote:
 In a prior thread, I promised to create a narrow string type that would
 enforce the requirements of narrow strings.  That is, the new string type
 should respect the encoding of narrow strings.
 
 Here is a rough draft, tested minimally, but it does compile and pass
 simple tests.  It's pretty simple, which is what I would expect.  I copied
 a lot of stuff from std.array to get this to work.
 
 The point of this type is -- if we replace what the compiler considers
 "strings" with this type, then we get both the compiler *and* phobos
 agreeing as to what this type is:  A bi-directional range of dchar.
 
 As a bonus, char[] and wchar[] now would become arrays and be manipulated
 consistently with other arrays, which if not done correctly could cause
 problems, but may provide more flexibility and opportunity for
 performance.  Instead of the library fighting you on it.
 
 Anyways, here it is, released under the boost license, commence attack ;)
 
 
 // Written in the D programming language.
 
 /**
 Copyright: Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010
 License:   <a href="http://www.boost.org/LICENSE_1_0.txt">Boost License
 1.0</a>.
 Authors:   $(WEB erdani.org, Andrei Alexandrescu, Steven Schveighoffer)
 
 Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010.
 Distributed under the Boost Software License, Version 1.0.
     (See accompanying file LICENSE_1_0.txt or copy at
           http://www.boost.org/LICENSE_1_0.txt)
 */
 import std.utf;
 import std.traits;
 
 struct string_t(T) if (is(Unqual!T == char) || is(Unqual!T == wchar))
 {
      private T[] _data;
      this(T[] data)
      {
          this._data = data;
      }
 
      // note, this assumes that idx is a valid index
      private size_t _charStart(size_t idx) const
      {
          static if(is(Unqual!T == wchar))
          {
              immutable c = _data.ptr[idx];
              if(c >= 0xDC00 && c <= 0xDFFF)
              {
                  // surrogate pair detected, verify we have at least 2
 wchars,
                  // and that both wchars are properly encoded.
                  assert(idx > 0, "Invalid UTF character at beginning of
 string");
                  return idx-1;
              }
              else
                  return idx;
          }
          else
          {
              const p = _data.ptr + idx;
              if ((p[0] & 0b1100_0000) != 0b1000_0000)
              {
                  return idx;
              }
              else if (idx >= 1 && (p[-1] & 0b1100_0000) != 0b1000_0000)
              {
                  return idx - 1;
              }
              else if (idx >= 2 && (p[-2] & 0b1100_0000) != 0b1000_0000)
              {
                  return idx - 2;
              }
              else if (idx >= 3 && (p[-3] & 0b1100_0000) != 0b1000_0000)
              {
                  return idx - 3;
              }
              else
              {
                  assert(false, "Invalid UTF character in string");
              }
          }
      }
 
      void popFront()
      {
          auto nc = std.utf.stride(_data, 0);
          assert(nc <= _data.length && nc != 0xFF, "Invalid sequence at
 beginning of string");
          _data = _data[nc .. $];
      }
 
      void popBack()
      {
          immutable n = _data.length;
          assert(n, "Attempting to pop back of an empty string");
          _data = _data.ptr[0.._charStart(n-1)];
      }
 
       property dchar front() const
      {
          assert(_data.length, "Attempting to fetch the front of an empty
 string");
          size_t i = 0;
          return decode(_data, i);
      }
 
       property dchar back() const
      {
          immutable n = _data.length;
          assert(n, "Attempting to fetch the back of an empty string");
          auto idx = _charStart(n-1);
          return std.utf.decode(_data, idx);
      }
 
       property bool empty() const
      {
          return !_data.length;
      }
 
       property typeof(this) save()
      {
          return this;
      }
 
      // support read-only random access via code unit index.
      dchar opIndex(size_t idx)
      {
          idx = _charStart(idx);
          return std.utf.decode(_data, idx);
      }
 
      string_t opSlice()
      {
          return this;
      }
 
      string_t opSlice(size_t start, size_t end)
      {
          if(start != _data.length)
              start = _charStart(start);
          if(end != _data.length)
              end = _charStart(end);
          return string_t(_data[start..end]);
      }
 
      // note we don't call this length because length can be assumed to be
 the
      // number of elements, which this isn't.
       property size_t codeUnits() const
      {
          return _data.length;
      }
 
      // support append and concat
      // TODO: need to support appending various types of strings to
 eachother
      // (right now only same-type strings can be appended, or raw arrays)
      ref string_t opOpAssign(string op, U)(U data) if (op == "~" && is(U ==
 string_t))
      {
          _data ~= data._data;
          return this;
      }
 
      ref string_t opOpAssign(string op, U)(U data) if (op == "~" && !is(U
 == string_t) && is(typeof(_data ~= U.init)))
      {
          _data ~= data;
          return this;
      }
 
      string_t opBinary(string op, U)(U data) if (op == "~" && is(U ==
 string_t))
      {
          return string_t(_data ~ data._data);
      }
 
      string_t opBinary(string op, U)(U data) if (op == "~" && !is(U ==
 string_t) && is(typeof(_data ~ U.init)))
      {
          return string_t(_data ~ data);
      }
 }
 
 template string_t(T) if (is(Unqual!T == dchar))
 {
      alias T[] string_t;
 }
 
 /** begin test code **/
 import std.stdio;
 
 alias string_t!(immutable char) mystring;
 alias string_t!(immutable wchar) mywstring;
 alias string_t!(immutable dchar) mydstring;
 
 void main()
 {
      auto str = mystring("hello");
      str ~= " world";
      str ~= mystring("!!!");
      writeln(str._data);
 }

1. At least until universal function syntax is in the language, you use the ability to do str.func(). 2. Functions that would work just fine treating strings as arrays of code units (presumably because they don't care about what the actual data is) lose efficiency, because now a string isn't an array. 3. You have no access to the underlying array unless you're dealing with an actual array of dchar. 4. Indexing is no longer O(1), which violates the guarantees of the index operator. 5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator. What you're doing here is forcing the view of a string as a range of dchar in all cases. Granted, that's what you want in most cases, but it can degrade efficiency, and the fact that some operations (in particular indexing and slicing) are not O(1) like they're supposed to be means that algorithms which rely on O(1) behavior from them could increase their cost by an order of magnitude. All the cases where treating a string as an actual array which are currently valid are left out to dry The inherent problem with strings is that we _want_ to be able to view them as both arrays of code units and as ranges of dchar. Trying to view them only as ranges of dchar is inefficient in a number of cases. Even something as simple as getting a substring is less efficient with this code. Only in cases where you don't actually know the length of the substring that you want is this essentially as efficient as what we have now. You're ignoring all cases where viewing a string as an array of code units is correct and desirable. You're only solving half of the problem (albeit the more prevalent half). It could very well be that we need a better way to handle when a string is treated as a range of dchar and when it is treated as an array of code units, but trying to treat them as ranges of dchar in all cases is not the right answer. Now, making it possible to have a wrapper struct like this which works in many cases where you'd use strings could reduce errors in code in many situations, so giving the programmer the ability to do that could be interesting. But it seems to me that this solution is too limiting to be a viable replacement for strings as they are. - Jonathan M Davis
Nov 30 2010
parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 11/30/2010 12:52 PM, Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis

 4. Indexing is no longer O(1), which violates the guarantees of the index
 operator.

Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.

Slicing is still O(1).

There definitely is value in being able to index and slice into utf strings without resulting in invalid utf, but I think the fact that it indexes on code unit and returns code point is sufficiently strange that it qualifies as abuse of operator overloading. That said, I like this proposal better than the current state of affairs. I'll try to play with it more on the weekend
Nov 30 2010
parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 12/01/2010 03:35 PM, Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 18:31:05 -0500, Ellery Newcomer
 There definitely is value in being able to index and slice into utf
 strings without resulting in invalid utf, but I think the fact that it
 indexes on code unit and returns code point is sufficiently strange
 that it qualifies as abuse of operator overloading.

Maybe :) The other alternative is to throw an exception if you try to access a code unit that is not the beginning of a code point. That might actually be less weird, I'll try doing that on the next iteration.

in my mind, the problem isn't so much indexing an intermediate code unit gets you earlier code units (it's a little strange, and I'm not sure whether greater strictness would be better - on the one hand, less strictness would be more tolerant of bugs and make it that much more difficult to detect them, but on the other hand if you were doing something like getting a random or approximate slice into your string, less strictness would mean that much less annoyance, though I have no idea why you would want to do that) as it is just the difference between the two and the confusion that it's bound to cause the noobies.
 I find that iteration over string characters using index is a very rare
 thing anyways, you either use foreach, which should give you dchars, or
 you use something like find, which should never give you an invalid index.

 -Steve

find was the counterargument I had in mind for keeping the operator overload, as something like s[find(s,'\u2729') .. s.codeUnits] is just a bit better than s.codePointSliceAt(find(s,'\u2729'), s.codeUnits); I really don't know. One thing that strikes me, though, if you're going to keep opIndex, is that being able to do foreach(size_t codeuniti, dchar c; s){ } would be nice. Actually, it looks like you can do that with current strings.
Dec 01 2010
next sibling parent Simon <s.d.hammett gmail.com> writes:
On 03/12/2010 18:29, Steven Schveighoffer wrote:
 On Fri, 03 Dec 2010 11:15:36 -0500, spir <denis.spir gmail.com> wrote:

 Indexing a string is rare, unless you are parsing something (yes it does
 truly depend on the domain),

In my XP you never index *especially* when parsing. Every parser I've written & used treats input as a character stream. XML & D for instance permit code points > 0x7x which means you must decode them properly when using utf-8, even if you are going to be storing the input back as utf-8. -- My enormous talent is exceeded only by my outrageous laziness. http://www.ssTk.co.uk
Dec 03 2010
prev sibling next sibling parent Jerry Quinn <jlquinn optonline.net> writes:
Steven Schveighoffer Wrote:

 On Fri, 03 Dec 2010 11:15:36 -0500, spir <denis.spir gmail.com> wrote:
 
 On Thu, 02 Dec 2010 16:24:03 -0500
 "Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 Yes, it does seem odd, but then again, how often do you need the
 individual characters of a string?  I wrote php code for about 6 months  
 as
 a full time job before I found I needed to access individual characters,
 and then I had to look up how to do it :)  It's just not a common thing.

Well, I wonder whether I have ever written any piece of code without string indexing ;-) Surely depends on your app domain(s)...

Indexing a string is rare, unless you are parsing something (yes it does truly depend on the domain), but even rarer is indexing a string based on a hard-coded value. Really, this last case is where the type would behave in a confusing way.

I tend to do a lot of transforming strings, but I need to track offsets back to the original text to maintain alignment between the results and the input. For that, indexes are necessary and we use them a lot. Probably the right thing to do in this case is just pay for the cost of using dchar everywhere, but if you're working with large enough quantities of data, storage efficiency matters. Jerry
Dec 03 2010
prev sibling parent reply Steven Wawryk <stevenw acres.com.au> writes:
On 02/12/10 12:43, Ellery Newcomer wrote:
 On 12/01/2010 03:35 PM, Steven Schveighoffer wrote:
 I find that iteration over string characters using index is a very rare
 thing anyways, you either use foreach, which should give you dchars, or
 you use something like find, which should never give you an invalid
 index.

 -Steve

find was the counterargument I had in mind for keeping the operator overload, as something like s[find(s,'\u2729') .. s.codeUnits] is just a bit better than s.codePointSliceAt(find(s,'\u2729'), s.codeUnits); I really don't know.

I don't like either. I'd prefer (from std.algorithm) s = find(s, '\u2729'); I still don't see a need for any random-access methods, including slicing, on the code-point range. The ability to convert back and forth between the string_t and T[] would be useful to allow the user to choose between random-access code-unit range and bidirectional code-point range (or grapheme/glyph/etc). For example a T[] accessor property and an opAssign from a T[]. So if there are efficiencies to be had with random-access (and slicing), then the user could choose s = find(s.codeUnits, '\u2729'); Are there any other reasons to slice on a code-point range? Or have any capabilities beyond a bidirectional range? Unless there are compelling reasons, I'd have to say that slicing and indexing on the code-point level is not much more than a hack.
Jan 12 2011
parent Steven Wawryk <stevenw acres.com.au> writes:
 s = find(s.codeUnits, '\u2729');

Sorry, I didn't mean the the codeUnits retuning a size_t. I meant an accessor for the underlying code-unit range. Something like property T[] codeUnits() { return _data; }
Jan 12 2011
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:

 1. At least until universal function syntax is in the language, you use  
 the
 ability to do str.func().

Yes, but first, this is not a problem with the string type, it's a problem with the language. Second, any string-specific functions can be added, it is a struct after all, not a builtin.
 2. Functions that would work just fine treating strings as arrays of  
 code units
 (presumably because they don't care about what the actual data is) lose
 efficiency, because now a string isn't an array.

Which functions are those? They can be allowed via wrappers.
 3. You have no access to the underlying array unless you're dealing with  
 an
 actual array of dchar.

I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.
 4. Indexing is no longer O(1), which violates the guarantees of the index
 operator.

Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates  
 the
 guarantees of the slicing operator.

Slicing is still O(1).
 What you're doing here is forcing the view of a string as a range of  
 dchar in
 all cases. Granted, that's what you want in most cases, but it can  
 degrade
 efficiency, and the fact that some operations (in particular indexing  
 and slicing)
 are not O(1) like they're supposed to be means that algorithms which  
 rely on
 O(1) behavior from them could increase their cost by an order of  
 magnitude. All
 the cases where treating a string as an actual array which are currently  
 valid
 are left out to dry

You can still use char[] and wchar[].
 The inherent problem with strings is that we _want_ to be able to view  
 them as
 both arrays of code units and as ranges of dchar. Trying to view them  
 only as
 ranges of dchar is inefficient in a number of cases. Even something as  
 simple as
 getting a substring is less efficient with this code. Only in cases  
 where you
 don't actually know the length of the substring that you want is this
 essentially as efficient as what we have now. You're ignoring all cases  
 where
 viewing a string as an array of code units is correct and desirable.  
 You're only
 solving half of the problem (albeit the more prevalent half).

I'm not planning on eliminating char and wchar based arrays. In other words, you should be able to get access to the array, but it should not be the default, and it should be considered unsafe.
 Now, making it possible to have a wrapper struct like this which works  
 in many
 cases where you'd use strings could reduce errors in code in many  
 situations, so
 giving the programmer the ability to do that could be interesting. But  
 it seems
 to me that this solution is too limiting to be a viable replacement for  
 strings
 as they are.

Hopefully you can see that I'm not eliminating the functionality you are looking for, just making it not the default. -Steve
Nov 30 2010
prev sibling next sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:
 
 [...]
 
 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.

Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.

Slicing is still O(1). [...]

It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now. On a side note: It seems to me that the only reason to have char, wchar, and dchar as separate types in the language is that arrays of said types are UTF-encoded strings. If a type such as the proposed one were to become the default string type in D, it might as well wrap an array of ubyte/ushort/uint, since direct user manipulation of the underlying array will generally only happen in the rare cases when one wants to deal directly with code units. -Lars
Nov 30 2010
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, November 30, 2010 10:52:20 Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com>
 
 wrote:
 1. At least until universal function syntax is in the language, you use
 the
 ability to do str.func().

Yes, but first, this is not a problem with the string type, it's a problem with the language. Second, any string-specific functions can be added, it is a struct after all, not a builtin.

I really don't think that we want to have to be adding string functions to a string struct. Having them external is far better, particularly when there are so many of them which aren't directly related to strings but use them heavily. Even if all of std.string got tacked onto a struct string type, that would leave out the rest of Phobos and any user code. I really think that universal function syntax is a necessity for a struct solution to be acceptable. I had considered a solution similar to this one a few months back, and the lack of universal function syntax is one of the reasons why I decided that it wasn't really an improvement. Honestly, without uniform function call syntax, I would consider a struct solution to be DOA. The loss in useability would just be too great.
 2. Functions that would work just fine treating strings as arrays of
 code units
 (presumably because they don't care about what the actual data is) lose
 efficiency, because now a string isn't an array.

Which functions are those? They can be allowed via wrappers.

It is my understanding that there are various functions in std.algorithm which are able to treat strings as arrays and therefore process them more efficiently. I haven't looked into which ones are in that camp. I would think that find() might be, but I'd have to look.
 3. You have no access to the underlying array unless you're dealing with
 an
 actual array of dchar.

I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.

Something like that would need to be done. It really should be a public property of some kind if not an outright public variable.
 4. Indexing is no longer O(1), which violates the guarantees of the index
 operator.

Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.

Slicing is still O(1).

You're right. I misread what _charStart() did. However, if I understand it correctly now, you give it a code unit index and yet get a code point back. That worries me. It means that there is no relation between the length of the string and the indices that you'd use to index it. It _is_ related to the number of code units, which you can get with codeUnits(), but that disjoint seems like it could cause problems. It might be the correct solution, but I think that it merits some examination. Returning a code unit would be wrong since that totally breaks with the rest of the API dealing in dchar/code units, and it would be wrong to index by code point, since then indexing and slicing are no longer O(1). So, it's probably a choice between indexing by one thing and returning another or not having indexing and slicing, which would definitely not be good. So, maybe your solution is the best one in this respect, but it worries me. The exact ramifications of that need to be looked into.
 What you're doing here is forcing the view of a string as a range of
 dchar in
 all cases. Granted, that's what you want in most cases, but it can
 degrade
 efficiency, and the fact that some operations (in particular indexing
 and slicing)
 are not O(1) like they're supposed to be means that algorithms which
 rely on
 O(1) behavior from them could increase their cost by an order of
 magnitude. All
 the cases where treating a string as an actual array which are currently
 valid
 are left out to dry

You can still use char[] and wchar[].

Except that what if you need to do both with the same type? Right now, you could have a function which treats a string as a range of dchar while another one which can get away with treating it as a range of code units can treat it as an array. You can pass the same string to both, and it works. That should still work if we go for a struct solution. Special-casing on strings and specifically using the internal array instead of the struct for them could fix the problem, but it still needs to be possible.
 The inherent problem with strings is that we _want_ to be able to view
 them as
 both arrays of code units and as ranges of dchar. Trying to view them
 only as
 ranges of dchar is inefficient in a number of cases. Even something as
 simple as
 getting a substring is less efficient with this code. Only in cases
 where you
 don't actually know the length of the substring that you want is this
 essentially as efficient as what we have now. You're ignoring all cases
 where
 viewing a string as an array of code units is correct and desirable.
 You're only
 solving half of the problem (albeit the more prevalent half).

I'm not planning on eliminating char and wchar based arrays. In other words, you should be able to get access to the array, but it should not be the default, and it should be considered unsafe.
 Now, making it possible to have a wrapper struct like this which works
 in many
 cases where you'd use strings could reduce errors in code in many
 situations, so
 giving the programmer the ability to do that could be interesting. But
 it seems
 to me that this solution is too limiting to be a viable replacement for
 strings
 as they are.

Hopefully you can see that I'm not eliminating the functionality you are looking for, just making it not the default.

There is definitely some value in making strings treated as ranges of dchar by default. But for the most part, that's the case already thanks to how std.array is defined. The only place where that runs into trouble is if you use foreach. It still treats them as arrays of code units unless you tell it to iterate over dchar. Either making foreach over character arrays iterate over dchar by default or making it a warning or error to use foreach with a char or wchar array of any kind without specifying the type to iterate over would fix that problem. I don't think that the basic idea of what you present is necessarily a bad idea (I've considered proposing similar solutions in the past), but I'm not at all convinced that it's an improvement. What we have works quite well overall, with the notable exception of foreach. I'm also worried that trying to bury the fact that strings are arrays of code units is going to cause a leaky abstraction which will case problems. It's not likely to be as bad as the situation with char literals and std::string in C++, but I think that we need to look at this very carefully and examine what all of the ramifications are before we even consider going for it. There is an inherent but necessary disjoint between having strings be arrays of code units and ranges of dchar. Sometimes they need to be treated as one, sometimes as the other. Ideally, the default - whichever it is - would be the one which leads to fewer bugs. But they both need to be there. A struct solution is essentially an attempt to make strings treated as ranges of dchar in more situations by default than is currenly the case. As such, it could be better than what we have now, but I'm honestly not convinced that it is. Aside from the foreach problem (which could be fixed with an appropriate warning or error - preferrably error), what we have works quite well. - Jonathan M Davis
Nov 30 2010
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On Tue, 30 Nov 2010 23:34:11 +0000 (UTC)
"Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:
=20
 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:
=20
 [...]
=20
 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.

Indexing is still O(1). =20
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.

Slicing is still O(1). =20 [...]

It feels extremely weird that the indices refer to code units and not=20 code points. If I write =20 auto str =3D mystring("h=C3=A6?"); writeln(str[1], " ", str[2]); =20 I expect it to print "=C3=A6 ?", not "=C3=A6 =C3=A6" like it does now.

If I understand correctly how _charStart works in combination with indexing= and slicing, then here is something wrong in the type's interface. After auto str =3D mystring("h=C3=A6?"); Either one provides a code unit index and gets a code unit: writeln(str[1], " ", str[2]); // "=EF=BF=BD =EF=BF=BD" (invalid utf code p= oints) Or one provides a code point index and gets a code point: writeln(str[1], " ", str[2]); // "=C3=A6 ?" But for string manipulation, wouldn't it be better that your string type sy= stematically wraps a dchar[] array, whatever the original encoding? For ind= exing, slicing, finding, counting, etc... to be fast, I mean. Decoding beei= ng done only once at string creation time.
 On a side note:  It seems to me that the only reason to have char, wchar,=

 and dchar as separate types in the language is that arrays of said types=

 are UTF-encoded strings.  If a type such as the proposed one were to=20
 become the default string type in D, it might as well wrap an array of=20
 ubyte/ushort/uint, since direct user manipulation of the underlying array=

 will generally only happen in the rare cases when one wants to deal=20
 directly with code units.

Yes, but then, see remark above. Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Dec 01 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 30 Nov 2010 18:31:05 -0500, Ellery Newcomer  
<ellery-newcomer utulsa.edu> wrote:

 On 11/30/2010 12:52 PM, Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis

 4. Indexing is no longer O(1), which violates the guarantees of the  
 index
 operator.

Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.

Slicing is still O(1).

There definitely is value in being able to index and slice into utf strings without resulting in invalid utf, but I think the fact that it indexes on code unit and returns code point is sufficiently strange that it qualifies as abuse of operator overloading.

Maybe :) The other alternative is to throw an exception if you try to access a code unit that is not the beginning of a code point. That might actually be less weird, I'll try doing that on the next iteration. I find that iteration over string characters using index is a very rare thing anyways, you either use foreach, which should give you dchars, or you use something like find, which should never give you an invalid index. -Steve
Dec 01 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad  
<public kyllingen.nospamnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.

Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.

Slicing is still O(1). [...]

It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.

I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.
 On a side note:  It seems to me that the only reason to have char, wchar,
 and dchar as separate types in the language is that arrays of said types
 are UTF-encoded strings.  If a type such as the proposed one were to
 become the default string type in D, it might as well wrap an array of
 ubyte/ushort/uint, since direct user manipulation of the underlying array
 will generally only happen in the rare cases when one wants to deal
 directly with code units.

I'd still want a char[] array for easy manipulation and eventual printing. Wrapping a ubyte[] with a string just to print would be strange. My goal in this exercise is to try and give control of what to deal with (code-points vs. code-units) back to the user. Right now, the library forces you to view them as code-points, and the compiler forces you to view them as code-units (except via foreach). But it is an interesting idea (removing small char types). -Steve
Dec 01 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 30 Nov 2010 18:34:56 -0500, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:

 On Tuesday, November 30, 2010 10:52:20 Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis  
 <jmdavisProg gmx.com>

 wrote:
 1. At least until universal function syntax is in the language, you  

 the
 ability to do str.func().

Yes, but first, this is not a problem with the string type, it's a problem with the language. Second, any string-specific functions can be added, it is a struct after all, not a builtin.

I really don't think that we want to have to be adding string functions to a string struct. Having them external is far better, particularly when there are so many of them which aren't directly related to strings but use them heavily. Even if all of std.string got tacked onto a struct string type, that would leave out the rest of Phobos and any user code. I really think that universal function syntax is a necessity for a struct solution to be acceptable.

Why? It's a choice between adding them to the struct, and doing func(str) instead. You already have to do func(R) for arbitrary ranges, I don't see why strings should be specialized. For things that should be considered 'built-in' for strings, they can be included as members of the struct. In other words, I think it's worth fixing strings now, even if it means we cannot call str.func() until the universal syntax is introduced.
 I had considered a solution similar to this one a few months back, and  
 the lack
 of universal function syntax is one of the reasons why I decided that it  
 wasn't
 really an improvement. Honestly, without uniform function call syntax, I  
 would
 consider a struct solution to be DOA. The loss in useability would just  
 be too
 great.

But you can still call the function, it's just func(str). There is no loss of functionality. Usability isn't even really lost.
 2. Functions that would work just fine treating strings as arrays of
 code units
 (presumably because they don't care about what the actual data is)  

 efficiency, because now a string isn't an array.

Which functions are those? They can be allowed via wrappers.

It is my understanding that there are various functions in std.algorithm which are able to treat strings as arrays and therefore process them more efficiently. I haven't looked into which ones are in that camp. I would think that find() might be, but I'd have to look.

Since std.algorithm works exclusively with ranges, those must be special cases for strings because strings are bi-directional ranges of dchar according to phobos. So those can continue to be special cases for the new string types.
 4. Indexing is no longer O(1), which violates the guarantees of the  

 operator.

Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.

Slicing is still O(1).

You're right. I misread what _charStart() did. However, if I understand it correctly now, you give it a code unit index and yet get a code point back. That worries me. It means that there is no relation between the length of the string and the indices that you'd use to index it. It _is_ related to the number of code units, which you can get with codeUnits(), but that disjoint seems like it could cause problems. It might be the correct solution, but I think that it merits some examination. Returning a code unit would be wrong since that totally breaks with the rest of the API dealing in dchar/code units, and it would be wrong to index by code point, since then indexing and slicing are no longer O(1). So, it's probably a choice between indexing by one thing and returning another or not having indexing and slicing, which would definitely not be good. So, maybe your solution is the best one in this respect, but it worries me. The exact ramifications of that need to be looked into.

How many times do you use a hard-coded index into a string without knowing the encoding of the string (i.e. I know this string is ascii)? How many times do you iterate the characters of a string via an incrementing index? It's not common to use the indexing operation with values that are not known or computed to be valid starts to code-points. In fact, the language depends on that (otherwise you'd see sliced strings everywhere with invalid data). So while it looks strange, it shouldn't be a common need. That being said, I think Lars pointed out that the strangeness of returning the code point even if you point in the middle would be surprising in some cases, so I think the better solution is to throw an exception.
 What you're doing here is forcing the view of a string as a range of
 dchar in
 all cases. Granted, that's what you want in most cases, but it can
 degrade
 efficiency, and the fact that some operations (in particular indexing
 and slicing)
 are not O(1) like they're supposed to be means that algorithms which
 rely on
 O(1) behavior from them could increase their cost by an order of
 magnitude. All
 the cases where treating a string as an actual array which are  

 valid
 are left out to dry

You can still use char[] and wchar[].

Except that what if you need to do both with the same type? Right now, you could have a function which treats a string as a range of dchar while another one which can get away with treating it as a range of code units can treat it as an array. You can pass the same string to both, and it works. That should still work if we go for a struct solution. Special-casing on strings and specifically using the internal array instead of the struct for them could fix the problem, but it still needs to be possible.

Yeah, it's definitely needed. I'll add access to the data member in the next version.
 Hopefully you can see that I'm not eliminating the functionality you are
 looking for, just making it not the default.

There is definitely some value in making strings treated as ranges of dchar by default. But for the most part, that's the case already thanks to how std.array is defined. The only place where that runs into trouble is if you use foreach.

or indexing. This is a huge problem.
 It
 still treats them as arrays of code units unless you tell it to iterate  
 over
 dchar. Either making foreach over character arrays iterate over dchar by  
 default
 or making it a warning or error to use foreach with a char or wchar  
 array of any
 kind without specifying the type to iterate over would fix that problem.

I agree that would be ideal, but it still doesn't solve the indexing problem.
 There is an inherent but necessary disjoint between having strings be  
 arrays of
 code units and ranges of dchar. Sometimes they need to be treated as one,
 sometimes as the other. Ideally, the default - whichever it is - would  
 be the
 one which leads to fewer bugs. But they both need to be there. A struct  
 solution
 is essentially an attempt to make strings treated as ranges of dchar in  
 more
 situations by default than is currenly the case. As such, it could be  
 better
 than what we have now, but I'm honestly not convinced that it is. Aside  
 from the
 foreach problem (which could be fixed with an appropriate warning or  
 error -
 preferrably error), what we have works quite well.

The thing is, the most common use of strings is as a string, not as an array of code-units. The common case is to print, slice, find, etc. on a *string*. When dealing with the string as a whole, either using an array or a specialized type works equally well. The uncommon case is to extract individual characters from the string. In this case, the default needs to be the most common need in that area -- extracting a dchar, not a code-unit. Having the default index operation extract a code-unit is very incorrect. -Steve
Dec 01 2010
prev sibling next sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:
 
 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.

Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which
 violates the
 guarantees of the slicing operator.

Slicing is still O(1). [...]

It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.

I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.

As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics. I think it's great that you're doing this, by the way! I haven't made up my mind yet about whether I want char[] or a separate string type, but it is great to have an actual implementation of the latter at hand when debating it. -Lars
Dec 01 2010
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On Wed, 01 Dec 2010 20:13:35 -0600
Ellery Newcomer <ellery-newcomer utulsa.edu> wrote:

 One thing that strikes me, though, if you're going to keep opIndex, is=20
 that being able to do
=20
 foreach(size_t codeuniti, dchar c; s){
=20
 }

This would yield several times the same code point / dchar, or do i misinte= rpret? Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Dec 02 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 02 Dec 2010 05:08:54 -0500, spir <denis.spir gmail.com> wrote:

 On Wed, 01 Dec 2010 20:13:35 -0600
 Ellery Newcomer <ellery-newcomer utulsa.edu> wrote:

 One thing that strikes me, though, if you're going to keep opIndex, is
 that being able to do

 foreach(size_t codeuniti, dchar c; s){

 }

This would yield several times the same code point / dchar, or do i misinterpret?

It would only yield the code units that are the start of a code point. So if the code point that starts at the beginning of the array is 2 code units long, you would iterate 0, 2, 3, ... -Steve
Dec 02 2010
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad  
<public kyllingen.nospamnet> wrote:

 On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.

Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which
 violates the
 guarantees of the slicing operator.

Slicing is still O(1). [...]

It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.

I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.

As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.

Then substring (slicing) becomes an O(n) operation. It just doesn't work well. It seems to be awkward at first thought, but the more I think about it, the more I think it's right. When do you ever depend on specific indexes in a string being valid, or to be incrementing always by 1?
 I think it's great that you're doing this, by the way!  I haven't made up
 my mind yet about whether I want char[] or a separate string type, but it
 is great to have an actual implementation of the latter at hand when
 debating it.

Thanks :) -Steve
Dec 02 2010
parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, December 03, 2010 05:13:50 Lars T. Kyllingstad wrote:
 On Thu, 02 Dec 2010 16:18:52 -0500, Steven Schveighoffer wrote:
 On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad
=20
 <public kyllingen.nospamnet> wrote:
 On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
=20
 <public kyllingen.nospamnet> wrote:
 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:
=20
 [...]
=20
 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.

Indexing is still O(1). =20
 5. Slicing (other than a full slice) is no longer O(1), which
 violates the
 guarantees of the slicing operator.

Slicing is still O(1). =20 [...]

It feels extremely weird that the indices refer to code units and not code points. If I write =20 auto str =3D mystring("h=C3=A6?"); writeln(str[1], " ", str[2]); =20 I expect it to print "=C3=A6 ?", not "=C3=A6 =C3=A6" like it does no=





=20
 I don't think it's possible to do that with any implementation without
 making indexing not O(1).  This just isn't possible, unless you want
 to use dchar[].
=20
 But your point is well taken.  I think what I'm going to do is throw
 an exception when accessing an invalid index.  While also surprising,
 it doesn't result in "extra data".  I feel it's probably very rare to
 just access hard-coded indexes like that unless you are sure of the
 data in the string.  Or to use a for-loop to access characters, etc.

As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. =20 How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.

Then substring (slicing) becomes an O(n) operation. It just doesn't work well.

What I meant wast that opSlice() should be disabled in the same way as opIndex().

A string type without slicing (which must be O(1)) is DOA without question.= =20 Slicing is _far_ too useful to lose. Indexing in strings is fairly rare bec= ause=20 it's generally stupid idea, but slicing happens all the time. If nothing e= lse,=20 that is _the_ way to get a substring. =2D Jonathan M Davis
Dec 03 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 01 Dec 2010 21:13:35 -0500, Ellery Newcomer  
<ellery-newcomer utulsa.edu> wrote:

 On 12/01/2010 03:35 PM, Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 18:31:05 -0500, Ellery Newcomer
 There definitely is value in being able to index and slice into utf
 strings without resulting in invalid utf, but I think the fact that it
 indexes on code unit and returns code point is sufficiently strange
 that it qualifies as abuse of operator overloading.

Maybe :) The other alternative is to throw an exception if you try to access a code unit that is not the beginning of a code point. That might actually be less weird, I'll try doing that on the next iteration.

in my mind, the problem isn't so much indexing an intermediate code unit gets you earlier code units (it's a little strange, and I'm not sure whether greater strictness would be better - on the one hand, less strictness would be more tolerant of bugs and make it that much more difficult to detect them, but on the other hand if you were doing something like getting a random or approximate slice into your string, less strictness would mean that much less annoyance, though I have no idea why you would want to do that) as it is just the difference between the two and the confusion that it's bound to cause the noobies.

Yes, it does seem odd, but then again, how often do you need the individual characters of a string? I wrote php code for about 6 months as a full time job before I found I needed to access individual characters, and then I had to look up how to do it :) It's just not a common thing. Typically, the index you use is calculated from something like find, and you don't care what it is, as long as it's storable and persistent.
 I find that iteration over string characters using index is a very rare
 thing anyways, you either use foreach, which should give you dchars, or
 you use something like find, which should never give you an invalid  
 index.

 -Steve

find was the counterargument I had in mind for keeping the operator overload, as something like s[find(s,'\u2729') .. s.codeUnits] is just a bit better than s.codePointSliceAt(find(s,'\u2729'), s.codeUnits); I really don't know.

Ugh, yes, and actually, that reminds me I should define opDollar.
 One thing that strikes me, though, if you're going to keep opIndex, is  
 that being able to do

 foreach(size_t codeuniti, dchar c; s){

 }

 would be nice. Actually, it looks like you can do that with current  
 strings.

At this point, you can't do that except via opApply, and I didn't want to inject that in fear that it would be pointed out as a drawback. It would be nice if we could define a way to do that via ranges... -Steve
Dec 02 2010
prev sibling next sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Thu, 02 Dec 2010 16:18:52 -0500, Steven Schveighoffer wrote:

 On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:
 
 On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.

Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which
 violates the
 guarantees of the slicing operator.

Slicing is still O(1). [...]

It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.

I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.

As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.

Then substring (slicing) becomes an O(n) operation. It just doesn't work well.

What I meant wast that opSlice() should be disabled in the same way as opIndex(). What you have now is struct string_t(T) { dchar opIndex(size_t); // Must be O(1) string_t opSlice(size_t, size_t); // Must be O(1) } and what I'm suggesting is struct string_t(T) { dchar character(size_t); // May be O(n) string_t substring(size_t, size_t); // May be O(n) static if (is(T == dchar)) { alias character opIndex; // Must be O(1) alias substring opSlice; // Must be O(1) } }
 It seems to be awkward at first thought, but the more I
 think about it, the more I think it's right.  When do you ever depend on
 specific indexes in a string being valid, or to be incrementing always
 by 1?

That is a good question indeed. But I'm still not convinced. :) Another thing to consider is that by having opIndex() in there, your string satisfies isRandomAccessRange. Then, some algorithms which work with both bidirectional and random access may choose to go with the latter. This is a quote from the std.algorithm.find() docs: Specializations taking advantage of bidirectional or random access (where present) may accelerate search [...] -Lars
Dec 03 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 03 Dec 2010 08:13:50 -0500, Lars T. Kyllingstad  
<public kyllingen.nospamnet> wrote:

 On Thu, 02 Dec 2010 16:18:52 -0500, Steven Schveighoffer wrote:

 On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:

 On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.

Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which
 violates the
 guarantees of the slicing operator.

Slicing is still O(1). [...]

It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.

I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.

As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.

Then substring (slicing) becomes an O(n) operation. It just doesn't work well.

What I meant wast that opSlice() should be disabled in the same way as opIndex(). What you have now is struct string_t(T) { dchar opIndex(size_t); // Must be O(1) string_t opSlice(size_t, size_t); // Must be O(1) } and what I'm suggesting is struct string_t(T) { dchar character(size_t); // May be O(n) string_t substring(size_t, size_t); // May be O(n) static if (is(T == dchar)) { alias character opIndex; // Must be O(1) alias substring opSlice; // Must be O(1) } }

O(n) slicing and indexing, called by different names, is unacceptable. It is possible to have O(1) indexing and slicing, I don't see why we should disallow it. If nothing else, it cripples this implementation to the point where it won't be accepted, simply because it performs worse than the current solution. Now, if one wants to have that feature in *addition* to O(1) indexing/slicing, I don't think that's an issue.
 It seems to be awkward at first thought, but the more I
 think about it, the more I think it's right.  When do you ever depend on
 specific indexes in a string being valid, or to be incrementing always
 by 1?

That is a good question indeed. But I'm still not convinced. :) Another thing to consider is that by having opIndex() in there, your string satisfies isRandomAccessRange. Then, some algorithms which work with both bidirectional and random access may choose to go with the latter. This is a quote from the std.algorithm.find() docs: Specializations taking advantage of bidirectional or random access (where present) may accelerate search [...]

I just looked it up, isRandomAccess requires hasLength, which my string type does not. So my string type is not a true random access range (this was intentional). -Steve
Dec 03 2010
prev sibling next sibling parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Fri, 03 Dec 2010 08:59:52 -0500, Steven Schveighoffer wrote:

 On Fri, 03 Dec 2010 08:13:50 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:
 Another thing to consider is that by having opIndex() in there, your
 string satisfies isRandomAccessRange.  Then, some algorithms which work
 with both bidirectional and random access may choose to go with the
 latter.  This is a quote from the std.algorithm.find() docs:

     Specializations taking advantage of bidirectional or random access
     (where present) may accelerate search [...]

I just looked it up, isRandomAccess requires hasLength, which my string type does not. So my string type is not a true random access range (this was intentional).

Ah, you are right. I didn't read the documentation for isRandomAccessRange, I just looked at the accompanying code snippet, which doesn't mention hasLength. -Lars
Dec 03 2010
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On Thu, 02 Dec 2010 16:24:03 -0500
"Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 Yes, it does seem odd, but then again, how often do you need the =20
 individual characters of a string?  I wrote php code for about 6 months a=

 a full time job before I found I needed to access individual characters, =

 and then I had to look up how to do it :)  It's just not a common thing.

Well, I wonder whether I have ever written any piece of code without string= indexing ;-) Surely depends on your app domain(s)... Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Dec 03 2010
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On Thu, 02 Dec 2010 16:24:03 -0500
"Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 At this point, you can't do that except via opApply, and I didn't want to=

 inject that in fear that it would be pointed out as a drawback.
=20
 It would be nice if we could define a way to do that via ranges...

According to TDPL p.381, one could define slicing instead of opApply or ran= ge methods (since a slice is "rangeable", the compiler should take a slice = if nothing else is present) (not tried, yet). Your string type is, indeed, = a typical case where this would perfectly make sense. Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Dec 03 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 03 Dec 2010 11:15:36 -0500, spir <denis.spir gmail.com> wrote:

 On Thu, 02 Dec 2010 16:24:03 -0500
 "Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 Yes, it does seem odd, but then again, how often do you need the
 individual characters of a string?  I wrote php code for about 6 months  
 as
 a full time job before I found I needed to access individual characters,
 and then I had to look up how to do it :)  It's just not a common thing.

Well, I wonder whether I have ever written any piece of code without string indexing ;-) Surely depends on your app domain(s)...

Indexing a string is rare, unless you are parsing something (yes it does truly depend on the domain), but even rarer is indexing a string based on a hard-coded value. Really, this last case is where the type would behave in a confusing way. Consider that D's current's strings have even more of a limited means of indexing -- you can only validly index a string using an index where a code-point is encoded as a single code-unit. Accessing other indexes (even ones that start a multi-unit code-point) results in invalid characters. -Steve
Dec 03 2010
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 03 Dec 2010 14:40:30 -0500, Jerry Quinn <jlquinn optonline.net>  
wrote:

 I tend to do a lot of transforming strings, but I need to track offsets  
 back to the original text to maintain alignment between the results and  
 the input.  For that, indexes are necessary and we use them a lot.

In my daily usage of strings, I generally use a string as a whole, not individual characters. But I do occasionally use it. Let's also understand that indexing is still present, what is deactivated is the ability to index to arbitrary code-units. It sounds to me like this new type would not affect your ability to store offsets (you can store an index, use it later when referring to the string, etc. just like you can now). My string type does not allow for writeable strings. My plan was to allow you access to the underlying char[] and let you edit that way. Letting someone write a dchar into the middle a utf-8 string could cause lots of problems, so I just disabled it by default. Not sure how that affects your 'transforming' work, are you actually changing the data or just lazily transforming? I'm interested to hear whether you think my string type would be a viable alternative.
 Probably the right thing to do in this case is just pay for the cost of  
 using dchar everywhere, but if you're working with large enough  
 quantities of data, storage efficiency matters.

The huge advantage of using utf-8 is backwards compatibility with ASCII for C functions. -Steve
Dec 03 2010
next sibling parent Jerry Quinn <jlquinn optonline.net> writes:
Steven Schveighoffer Wrote:

 On Fri, 03 Dec 2010 14:40:30 -0500, Jerry Quinn <jlquinn optonline.net>  
 wrote:
 
 I tend to do a lot of transforming strings, but I need to track offsets  
 back to the original text to maintain alignment between the results and  
 the input.  For that, indexes are necessary and we use them a lot.

In my daily usage of strings, I generally use a string as a whole, not individual characters. But I do occasionally use it. Let's also understand that indexing is still present, what is deactivated is the ability to index to arbitrary code-units. It sounds to me like this new type would not affect your ability to store offsets (you can store an index, use it later when referring to the string, etc. just like you can now). My string type does not allow for writeable strings. My plan was to allow you access to the underlying char[] and let you edit that way. Letting someone write a dchar into the middle a utf-8 string could cause lots of problems, so I just disabled it by default.

That's reasonable. I"m not trying to create invalid strings. I'm actually working in C++ but keeping an eye on things going on in D-land. The kind of stuff we do is to normalize text in preparation for natural language processing. As a simple example, let's say you want to use a set of regexes to identify patterns in text. You want to return the offsets of each regex that matches. However, before the regexes run, you replace all html tags with a placeholder, so they can easily span tags without worrying about the contents. Before you return the results to the user, though, you need to translate those offsets back to ones for the original string. Everything is unicode of course and we care about processing unicode code points, but want to maintain UTF-8 storage underneath to keep size down. In reality, we're often doing things like single character normalizations as well as larger spans, but still need to maintain alignment to the original data. As long as this is reasonable to do, I'm fine. I just wasn't sure from the descriptions I was seeing. Jerry
Dec 03 2010
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 03 Dec 2010 22:08:37 -0500, Jerry Quinn <jlquinn optonline.net>  
wrote:

 I'm actually working in C++ but keeping an eye on things going on
 in D-land.  The kind of stuff we do is to normalize text in preparation
 for natural language processing.

 As a simple example, let's say you want to use a set of regexes to  
 identify patterns
 in text.  You want to return the offsets of each regex that matches.   
 However, before the regexes
 run, you replace all html tags with a placeholder, so they can easily  
 span
 tags without worrying about the contents.

I'm assuming you are not changing the length of the string, or is that not correct?
 Before you return the results to the
 user, though, you need to translate those offsets back to ones for the  
 original
 string.

Hm... I guess you must be changing the lengths if the offsets are different. That seems odd, wouldn't you encounter performance issues when processing large documents?
 Everything is unicode of course and we care about processing unicode  
 code points, but want to maintain UTF-8 storage underneath to keep size  
 down.

 In reality, we're often doing things like single character  
 normalizations as well as larger spans, but still need to maintain  
 alignment to the original data.

 As long as this is reasonable to do, I'm fine.  I just wasn't sure from  
 the descriptions I was seeing.

What you will have is access to the underlying char[] array, which should give you full edit access. I just don't want strings to be easily editable since doing so can be very difficult. Any offsets to dchar code-points in the string will match offsets to char code-units. In effect, you are always indexing by code-unit, even though with the string type you get code-points back. It should be as simple as accessing a member (like str.data) or casting (i.e. cast(char[])str). I'm unsure yet if it's dangerous enough to require casting. -Steve
Dec 04 2010
prev sibling parent reply Steven Wawryk <stevenw acres.com.au> writes:
I like the direction you're taking but have some quibbles about details. 
  Specifically, I'd go for a more complete separation into random-access 
code-unit ranges and bidirectional code-point ranges:


On 01/12/10 02:18, Steven Schveighoffer wrote:
 // Written in the D programming language.

 /**
 Copyright: Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010
 License: <a href="http://www.boost.org/LICENSE_1_0.txt">Boost License
 1.0</a>.
 Authors: $(WEB erdani.org, Andrei Alexandrescu, Steven Schveighoffer)

 Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010.
 Distributed under the Boost Software License, Version 1.0.
 (See accompanying file LICENSE_1_0.txt or copy at
 http://www.boost.org/LICENSE_1_0.txt)
 */
 import std.utf;
 import std.traits;

 struct string_t(T) if (is(Unqual!T == char) || is(Unqual!T == wchar))

Is there a reason not to include is(Unqual!T == dchar)?
 {
 private T[] _data;
 this(T[] data)
 {
 this._data = data;
 }

An opAssign from a T[] could facilitate conversion back and forth between code-point and code-unit ranges.
 // note, this assumes that idx is a valid index
 private size_t _charStart(size_t idx) const
 {
 static if(is(Unqual!T == wchar))
 {
 immutable c = _data.ptr[idx];
 if(c >= 0xDC00 && c <= 0xDFFF)
 {
 // surrogate pair detected, verify we have at least 2 wchars,
 // and that both wchars are properly encoded.
 assert(idx > 0, "Invalid UTF character at beginning of string");
 return idx-1;
 }
 else
 return idx;
 }
 else
 {
 const p = _data.ptr + idx;
 if ((p[0] & 0b1100_0000) != 0b1000_0000)
 {
 return idx;
 }
 else if (idx >= 1 && (p[-1] & 0b1100_0000) != 0b1000_0000)
 {
 return idx - 1;
 }
 else if (idx >= 2 && (p[-2] & 0b1100_0000) != 0b1000_0000)
 {
 return idx - 2;
 }
 else if (idx >= 3 && (p[-3] & 0b1100_0000) != 0b1000_0000)
 {
 return idx - 3;
 }
 else
 {
 assert(false, "Invalid UTF character in string");
 }
 }
 }

 void popFront()
 {
 auto nc = std.utf.stride(_data, 0);
 assert(nc <= _data.length && nc != 0xFF, "Invalid sequence at beginning
 of string");
 _data = _data[nc .. $];
 }

 void popBack()
 {
 immutable n = _data.length;
 assert(n, "Attempting to pop back of an empty string");
 _data = _data.ptr[0.._charStart(n-1)];
 }

  property dchar front() const
 {
 assert(_data.length, "Attempting to fetch the front of an empty string");
 size_t i = 0;
 return decode(_data, i);
 }

  property dchar back() const
 {
 immutable n = _data.length;
 assert(n, "Attempting to fetch the back of an empty string");
 auto idx = _charStart(n-1);
 return std.utf.decode(_data, idx);
 }

There is the alternative of deferring decoding to the user and returning T[]'s holding exactly 1 code-point instead of dchars. I'm not sure which is best, but I'd be interested in seeing a case for choosing one or the other.
  property bool empty() const
 {
 return !_data.length;
 }

  property typeof(this) save()
 {
 return this;
 }

 // support read-only random access via code unit index.
 dchar opIndex(size_t idx)
 {
 idx = _charStart(idx);
 return std.utf.decode(_data, idx);
 }

 string_t opSlice()
 {
 return this;
 }

 string_t opSlice(size_t start, size_t end)
 {
 if(start != _data.length)
 start = _charStart(start);
 if(end != _data.length)
 end = _charStart(end);
 return string_t(_data[start..end]);
 }

 // note we don't call this length because length can be assumed to be the
 // number of elements, which this isn't.
  property size_t codeUnits() const
 {
 return _data.length;
 }

I don't see a need for _charStart, opIndex, opSlice and codeUnits. If the underlying T[] can be returned by a property, then these can be done through the code-unit array, which is random-access.
 // support append and concat
 // TODO: need to support appending various types of strings to eachother
 // (right now only same-type strings can be appended, or raw arrays)
 ref string_t opOpAssign(string op, U)(U data) if (op == "~" && is(U ==
 string_t))
 {
 _data ~= data._data;
 return this;
 }

 ref string_t opOpAssign(string op, U)(U data) if (op == "~" && !is(U ==
 string_t) && is(typeof(_data ~= U.init)))
 {
 _data ~= data;
 return this;
 }

 string_t opBinary(string op, U)(U data) if (op == "~" && is(U == string_t))
 {
 return string_t(_data ~ data._data);
 }

 string_t opBinary(string op, U)(U data) if (op == "~" && !is(U ==
 string_t) && is(typeof(_data ~ U.init)))
 {
 return string_t(_data ~ data);
 }
 }

 template string_t(T) if (is(Unqual!T == dchar))
 {
 alias T[] string_t;
 }

 /** begin test code **/
 import std.stdio;

 alias string_t!(immutable char) mystring;
 alias string_t!(immutable wchar) mywstring;
 alias string_t!(immutable dchar) mydstring;

 void main()
 {
 auto str = mystring("hello");
 str ~= " world";
 str ~= mystring("!!!");
 writeln(str._data);
 }

Jan 12 2011
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 12 Jan 2011 04:49:26 -0500, Steven Wawryk <stevenw acres.com.au>  
wrote:

 I like the direction you're taking but have some quibbles about details.  
   Specifically, I'd go for a more complete separation into random-access  
 code-unit ranges and bidirectional code-point ranges:

Thanks for taking the time. I will respond to your points, but please make your rebuttals to the new thread I'm about to create with an updated string type.
 Is there a reason not to include is(Unqual!T == dchar)?

It's defined at the bottom as a simple alias to T[], since you do not need these features for a dchar[] string (it's correct to treat a dchar string as an array).
 An opAssign from a T[] could facilitate conversion back and forth  
 between code-point and code-unit ranges.

Yes, that would be a good feature.
 There is the alternative of deferring decoding to the user and returning  
 T[]'s holding exactly 1 code-point instead of dchars.  I'm not sure  
 which is best, but I'd be interested in seeing a case for choosing one  
 or the other.

Definitely returning decoding and returning dchar is best. The whole point of this string type is to handle the decoding for you. If you want to do decoding manually, you can just stick with arrays.
 I don't see a need for _charStart, opIndex, opSlice and codeUnits.  If  
 the underlying T[] can be returned by a property, then these can be done  
 through the code-unit array, which is random-access.

But that puts extra pain on the user for not much reason. Currently, strings slice in one operation, you are proposing that we slice in three operations: 1. get the underlying array 2. slice it 3. reconstruct a string based on the slice. Plus, if you remove opIndex, you are restricting the usefulness of the range. Note that this string type already will decode dchars out of the front and back, why not just give that ability to the middle of the string?
 template string_t(T) if (is(Unqual!T == dchar))
 {
 alias T[] string_t;
 }


Note the definition of string_t!dchar here ^^^^ -Steve
Jan 13 2011