
digitalmars.D - [review] new string type

reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
In a prior thread, I promised to create a narrow string type that would  
enforce the requirements of narrow strings.  That is, the new string type  
should respect the encoding of narrow strings.

Here is a rough draft, tested minimally, but it does compile and pass  
simple tests.  It's pretty simple, which is what I would expect.  I copied  
a lot of stuff from std.array to get this to work.

The point of this type is -- if we replace what the compiler considers  
"strings" with this type, then we get both the compiler *and* phobos  
agreeing as to what this type is:  A bi-directional range of dchar.

As a bonus, char[] and wchar[] would now become plain arrays and be manipulated  
consistently with other arrays. If not done correctly that could cause  
problems, but it may provide more flexibility and opportunity for  
performance, instead of the library fighting you on it.

Anyways, here it is, released under the boost license, commence attack ;)


// Written in the D programming language.

/**
Copyright: Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010
License:   <a href="http://www.boost.org/LICENSE_1_0.txt">Boost License 1.0</a>.
Authors:   $(WEB erdani.org, Andrei Alexandrescu, Steven Schveighoffer)

Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010.
Distributed under the Boost Software License, Version 1.0.
    (See accompanying file LICENSE_1_0.txt or copy at
          http://www.boost.org/LICENSE_1_0.txt)
*/
import std.utf;
import std.traits;

struct string_t(T) if (is(Unqual!T == char) || is(Unqual!T == wchar))
{
     private T[] _data;
     this(T[] data)
     {
         this._data = data;
     }

     // note, this assumes that idx is a valid index
     private size_t _charStart(size_t idx) const
     {
         static if(is(Unqual!T == wchar))
         {
             immutable c = _data.ptr[idx];
             if(c >= 0xDC00 && c <= 0xDFFF)
             {
                 // surrogate pair detected; verify we have at least 2 wchars,
                 // and that both wchars are properly encoded.
                 assert(idx > 0, "Invalid UTF character at beginning of string");
                 return idx-1;
             }
             else
                 return idx;
         }
         else
         {
             const p = _data.ptr + idx;
             if ((p[0] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx;
             }
             else if (idx >= 1 && (p[-1] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx - 1;
             }
             else if (idx >= 2 && (p[-2] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx - 2;
             }
             else if (idx >= 3 && (p[-3] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx - 3;
             }
             else
             {
                 assert(false, "Invalid UTF character in string");
             }
         }
     }

     void popFront()
     {
         auto nc = std.utf.stride(_data, 0);
         assert(nc <= _data.length && nc != 0xFF, "Invalid sequence at beginning of string");
         _data = _data[nc .. $];
     }

     void popBack()
     {
         immutable n = _data.length;
         assert(n, "Attempting to pop back of an empty string");
         _data = _data.ptr[0.._charStart(n-1)];
     }

     @property dchar front() const
     {
         assert(_data.length, "Attempting to fetch the front of an empty string");
         size_t i = 0;
         return decode(_data, i);
     }

     @property dchar back() const
     {
         immutable n = _data.length;
         assert(n, "Attempting to fetch the back of an empty string");
         auto idx = _charStart(n-1);
         return std.utf.decode(_data, idx);
     }

     @property bool empty() const
     {
         return !_data.length;
     }

     @property typeof(this) save()
     {
         return this;
     }

     // support read-only random access via code unit index.
     dchar opIndex(size_t idx)
     {
         idx = _charStart(idx);
         return std.utf.decode(_data, idx);
     }

     string_t opSlice()
     {
         return this;
     }

     string_t opSlice(size_t start, size_t end)
     {
         if(start != _data.length)
             start = _charStart(start);
         if(end != _data.length)
             end = _charStart(end);
         return string_t(_data[start..end]);
     }

     // note we don't call this length because length can be assumed to be the
     // number of elements, which this isn't.
     @property size_t codeUnits() const
     {
         return _data.length;
     }

     // support append and concat
     // TODO: need to support appending various types of strings to each other
     // (right now only same-type strings can be appended, or raw arrays)
     ref string_t opOpAssign(string op, U)(U data) if (op == "~" && is(U == string_t))
     {
         _data ~= data._data;
         return this;
     }

     ref string_t opOpAssign(string op, U)(U data) if (op == "~" && !is(U == string_t) && is(typeof(_data ~= U.init)))
     {
         _data ~= data;
         return this;
     }

     string_t opBinary(string op, U)(U data) if (op == "~" && is(U == string_t))
     {
         return string_t(_data ~ data._data);
     }

     string_t opBinary(string op, U)(U data) if (op == "~" && !is(U == string_t) && is(typeof(_data ~ U.init)))
     {
         return string_t(_data ~ data);
     }
}

template string_t(T) if (is(Unqual!T == dchar))
{
     alias T[] string_t;
}

/** begin test code **/
import std.stdio;

alias string_t!(immutable char) mystring;
alias string_t!(immutable wchar) mywstring;
alias string_t!(immutable dchar) mydstring;

void main()
{
     auto str = mystring("hello");
     str ~= " world";
     str ~= mystring("!!!");
     writeln(str._data);
}
Nov 30 2010
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, November 30, 2010 07:48:03 Steven Schveighoffer wrote:
 In a prior thread, I promised to create a narrow string type that would
 enforce the requirements of narrow strings.  That is, the new string type
 should respect the encoding of narrow strings.
 
 [...]
1. At least until universal function syntax is in the language, you lose the ability to do str.func().

2. Functions that would work just fine treating strings as arrays of code units (presumably because they don't care about what the actual data is) lose efficiency, because now a string isn't an array.

3. You have no access to the underlying array unless you're dealing with an actual array of dchar.

4. Indexing is no longer O(1), which violates the guarantees of the index operator.

5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.

What you're doing here is forcing the view of a string as a range of dchar in all cases. Granted, that's what you want in most cases, but it can degrade efficiency, and the fact that some operations (in particular indexing and slicing) are not O(1) like they're supposed to be means that algorithms which rely on O(1) behavior from them could increase their cost by an order of magnitude. All the cases where treating a string as an actual array is currently valid are left out to dry.

The inherent problem with strings is that we _want_ to be able to view them as both arrays of code units and as ranges of dchar. Trying to view them only as ranges of dchar is inefficient in a number of cases. Even something as simple as getting a substring is less efficient with this code. Only in cases where you don't actually know the length of the substring that you want is this essentially as efficient as what we have now. You're ignoring all cases where viewing a string as an array of code units is correct and desirable. You're only solving half of the problem (albeit the more prevalent half).

It could very well be that we need a better way to handle when a string is treated as a range of dchar and when it is treated as an array of code units, but trying to treat them as ranges of dchar in all cases is not the right answer.
Now, making it possible to have a wrapper struct like this which works in many cases where you'd use strings could reduce errors in code in many situations, so giving the programmer the ability to do that could be interesting. But it seems to me that this solution is too limiting to be a viable replacement for strings as they are. - Jonathan M Davis
Nov 30 2010
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:

 1. At least until universal function syntax is in the language, you lose
 the ability to do str.func().
Yes, but first, this is not a problem with the string type, it's a problem with the language. Second, any string-specific functions can be added, it is a struct after all, not a builtin.
 2. Functions that would work just fine treating strings as arrays of  
 code units
 (presumably because they don't care about what the actual data is) lose
 efficiency, because now a string isn't an array.
Which functions are those? They can be allowed via wrappers.
 3. You have no access to the underlying array unless you're dealing with  
 an
 actual array of dchar.
I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.
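A minimal sketch of that opCast idea (the struct and its name are invented here for illustration; this isn't part of the draft above):

```d
struct rawstr
{
    private string _data;

    this(string data) { _data = data; }

    // The raw code units are reachable only through an explicit cast,
    // which reads as a deliberate red flag at the call site.
    string opCast(T)() const if (is(T == string))
    {
        return _data;
    }
}

void main()
{
    auto s = rawstr("hello");
    auto raw = cast(string) s; // crossing into "dangerous waters" on purpose
    assert(raw == "hello");
}
```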
 4. Indexing is no longer O(1), which violates the guarantees of the index
 operator.
Indexing is still O(1) -- _charStart looks back a bounded number of code units.
 5. Slicing (other than a full slice) is no longer O(1), which violates  
 the
 guarantees of the slicing operator.
Slicing is still O(1).
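The reason these stay O(1): a UTF-8 sequence is at most 4 code units, so finding the start of the code point containing an index never scans back more than 3 units. A standalone sketch of that bounded lookback (assuming valid UTF-8; the helper name is invented):

```d
// A UTF-8 continuation byte matches the bit pattern 10xxxxxx, and a
// sequence is at most 4 bytes, so this never loops more than 4 times --
// a constant bound, which is why opIndex and opSlice remain O(1).
size_t charStart(const(char)[] s, size_t idx)
{
    foreach (back; 0 .. 4)
    {
        if (back > idx)
            break;
        if ((s[idx - back] & 0b1100_0000) != 0b1000_0000)
            return idx - back;
    }
    assert(false, "invalid UTF-8");
}

void main()
{
    string s = "a\u00E9b"; // 'é' occupies code units 1 and 2
    assert(charStart(s, 0) == 0);
    assert(charStart(s, 2) == 1); // continuation byte backs up one unit
    assert(charStart(s, 3) == 3);
}
```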
 What you're doing here is forcing the view of a string as a range of  
 dchar in
 all cases. Granted, that's what you want in most cases, but it can  
 degrade
 efficiency, and the fact that some operations (in particular indexing  
 and slicing)
 are not O(1) like they're supposed to be means that algorithms which  
 rely on
 O(1) behavior from them could increase their cost by an order of  
 magnitude. All
 the cases where treating a string as an actual array which are currently  
 valid
 are left out to dry
You can still use char[] and wchar[].
 The inherent problem with strings is that we _want_ to be able to view  
 them as
 both arrays of code units and as ranges of dchar. Trying to view them  
 only as
 ranges of dchar is inefficient in a number of cases. Even something as  
 simple as
 getting a substring is less efficient with this code. Only in cases  
 where you
 don't actually know the length of the substring that you want is this
 essentially as efficient as what we have now. You're ignoring all cases  
 where
 viewing a string as an array of code units is correct and desirable.  
 You're only
 solving half of the problem (albeit the more prevalent half).
I'm not planning on eliminating char and wchar based arrays. In other words, you should be able to get access to the array, but it should not be the default, and it should be considered unsafe.
 Now, making it possible to have a wrapper struct like this which works  
 in many
 cases where you'd use strings could reduce errors in code in many  
 situations, so
 giving the programmer the ability to do that could be interesting. But  
 it seems
 to me that this solution is too limiting to be a viable replacement for  
 strings
 as they are.
Hopefully you can see that I'm not eliminating the functionality you are looking for, just making it not the default. -Steve
Nov 30 2010
next sibling parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 11/30/2010 12:52 PM, Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis

 4. Indexing is no longer O(1), which violates the guarantees of the index
 operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.
Slicing is still O(1).
There definitely is value in being able to index and slice into utf strings without resulting in invalid utf, but I think the fact that it indexes on code unit and returns code point is sufficiently strange that it qualifies as abuse of operator overloading. That said, I like this proposal better than the current state of affairs. I'll try to play with it more on the weekend
Nov 30 2010
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 30 Nov 2010 18:31:05 -0500, Ellery Newcomer  
<ellery-newcomer utulsa.edu> wrote:

 On 11/30/2010 12:52 PM, Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis

 4. Indexing is no longer O(1), which violates the guarantees of the  
 index
 operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.
Slicing is still O(1).
There definitely is value in being able to index and slice into utf strings without resulting in invalid utf, but I think the fact that it indexes on code unit and returns code point is sufficiently strange that it qualifies as abuse of operator overloading.
Maybe :)  The other alternative is to throw an exception if you try to access a code unit that is not the beginning of a code point. That might actually be less weird, I'll try doing that on the next iteration.

I find that iteration over string characters using index is a very rare thing anyways; you either use foreach, which should give you dchars, or you use something like find, which should never give you an invalid index.

-Steve
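A sketch of that throwing alternative (strictIndex is a hypothetical name; only the boundary check differs from the draft's _charStart behavior):

```d
import std.exception : assertThrown, enforce;
import std.utf : decode;

// Hypothetical stricter variant: indexing anywhere other than the start
// of a code point throws instead of silently backing up.
dchar strictIndex(const(char)[] s, size_t idx)
{
    enforce((s[idx] & 0b1100_0000) != 0b1000_0000,
            "index is not at the start of a code point");
    return decode(s, idx); // decode advances idx past the code point
}

void main()
{
    string s = "a\u00E9b";
    assert(strictIndex(s, 1) == '\u00E9');
    assertThrown(strictIndex(s, 2)); // middle of the two-unit 'é'
}
```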
Dec 01 2010
parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 12/01/2010 03:35 PM, Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 18:31:05 -0500, Ellery Newcomer
 There definitely is value in being able to index and slice into utf
 strings without resulting in invalid utf, but I think the fact that it
 indexes on code unit and returns code point is sufficiently strange
 that it qualifies as abuse of operator overloading.
Maybe :) The other alternative is to throw an exception if you try to access a code unit that is not the beginning of a code point. That might actually be less weird, I'll try doing that on the next iteration.
In my mind, the problem isn't so much that indexing an intermediate code unit gets you earlier code units. (That's a little strange, and I'm not sure whether greater strictness would be better. On the one hand, less strictness is more tolerant of bugs and makes them that much harder to detect; on the other hand, if you were doing something like taking a random or approximate slice into your string, less strictness would mean that much less annoyance, though I have no idea why you would want to do that.) It's more the difference between the two and the confusion that it's bound to cause the noobies.
 I find that iteration over string characters using index is a very rare
 thing anyways, you either use foreach, which should give you dchars, or
 you use something like find, which should never give you an invalid index.

 -Steve
find was the counterargument I had in mind for keeping the operator overload, as something like

s[find(s,'\u2729') .. s.codeUnits]

is just a bit better than

s.codePointSliceAt(find(s,'\u2729'), s.codeUnits);

I really don't know.

One thing that strikes me, though, if you're going to keep opIndex, is that being able to do

foreach(size_t codeuniti, dchar c; s){

}

would be nice. Actually, it looks like you can do that with current strings.
Dec 01 2010
next sibling parent reply spir <denis.spir gmail.com> writes:
On Wed, 01 Dec 2010 20:13:35 -0600
Ellery Newcomer <ellery-newcomer utulsa.edu> wrote:

 One thing that strikes me, though, if you're going to keep opIndex, is
 that being able to do

 foreach(size_t codeuniti, dchar c; s){

 }
This would yield several times the same code point / dchar, or do I misinterpret?

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
Dec 02 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 02 Dec 2010 05:08:54 -0500, spir <denis.spir gmail.com> wrote:

 On Wed, 01 Dec 2010 20:13:35 -0600
 Ellery Newcomer <ellery-newcomer utulsa.edu> wrote:

 One thing that strikes me, though, if you're going to keep opIndex, is
 that being able to do

 foreach(size_t codeuniti, dchar c; s){

 }
This would yield several times the same code point / dchar, or do i misinterpret?
It would only yield the code units that are the start of a code point. So if the code point that starts at the beginning of the array is 2 code units long, you would iterate 0, 2, 3, ... -Steve
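With D's built-in foreach over narrow strings this is already observable:

```d
void main()
{
    string s = "\u00E9hi"; // 'é' is two UTF-8 code units; 'h' and 'i' are one each
    size_t[] indexes;
    foreach (size_t i, dchar c; s)
        indexes ~= i;
    assert(indexes == [0, 2, 3]); // only the start of each code point is yielded
}
```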
Dec 02 2010
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 01 Dec 2010 21:13:35 -0500, Ellery Newcomer  
<ellery-newcomer utulsa.edu> wrote:

 On 12/01/2010 03:35 PM, Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 18:31:05 -0500, Ellery Newcomer
 There definitely is value in being able to index and slice into utf
 strings without resulting in invalid utf, but I think the fact that it
 indexes on code unit and returns code point is sufficiently strange
 that it qualifies as abuse of operator overloading.
Maybe :) The other alternative is to throw an exception if you try to access a code unit that is not the beginning of a code point. That might actually be less weird, I'll try doing that on the next iteration.
in my mind, the problem isn't so much indexing an intermediate code unit gets you earlier code units (it's a little strange, and I'm not sure whether greater strictness would be better - on the one hand, less strictness would be more tolerant of bugs and make it that much more difficult to detect them, but on the other hand if you were doing something like getting a random or approximate slice into your string, less strictness would mean that much less annoyance, though I have no idea why you would want to do that) as it is just the difference between the two and the confusion that it's bound to cause the noobies.
Yes, it does seem odd, but then again, how often do you need the individual characters of a string? I wrote php code for about 6 months as a full time job before I found I needed to access individual characters, and then I had to look up how to do it :) It's just not a common thing. Typically, the index you use is calculated from something like find, and you don't care what it is, as long as it's storable and persistent.
 I find that iteration over string characters using index is a very rare
 thing anyways, you either use foreach, which should give you dchars, or
 you use something like find, which should never give you an invalid  
 index.

 -Steve
find was the counterargument I had in mind for keeping the operator overload, as something like s[find(s,'\u2729') .. s.codeUnits] is just a bit better than s.codePointSliceAt(find(s,'\u2729'), s.codeUnits); I really don't know.
Ugh, yes, and actually, that reminds me I should define opDollar.
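A minimal sketch of what opDollar would buy (the struct and names are invented for illustration): `$` inside the brackets maps to the code-unit length, so s[i .. $] slices to the end.

```d
struct tail
{
    private string _data;
    this(string data) { _data = data; }
    @property size_t codeUnits() const { return _data.length; }
    size_t opDollar() const { return codeUnits; } // `$` in s[a .. $]
    string opSlice(size_t a, size_t b) const { return _data[a .. b]; }
}

void main()
{
    auto s = tail("hello world");
    assert(s[6 .. $] == "world");
}
```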
 One thing that strikes me, though, if you're going to keep opIndex, is  
 that being able to do

 foreach(size_t codeuniti, dchar c; s){

 }

 would be nice. Actually, it looks like you can do that with current  
 strings.
At this point, you can't do that except via opApply, and I didn't want to inject that in fear that it would be pointed out as a drawback. It would be nice if we could define a way to do that via ranges... -Steve
Dec 02 2010
next sibling parent reply spir <denis.spir gmail.com> writes:
On Thu, 02 Dec 2010 16:24:03 -0500
"Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 Yes, it does seem odd, but then again, how often do you need the
 individual characters of a string?  I wrote php code for about 6 months as
 a full time job before I found I needed to access individual characters,
 and then I had to look up how to do it :)  It's just not a common thing.
Well, I wonder whether I have ever written any piece of code without string indexing ;-) Surely depends on your app domain(s)...

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
Dec 03 2010
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 03 Dec 2010 11:15:36 -0500, spir <denis.spir gmail.com> wrote:

 On Thu, 02 Dec 2010 16:24:03 -0500
 "Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 Yes, it does seem odd, but then again, how often do you need the
 individual characters of a string?  I wrote php code for about 6 months  
 as
 a full time job before I found I needed to access individual characters,
 and then I had to look up how to do it :)  It's just not a common thing.
Well, I wonder whether I have ever written any piece of code without string indexing ;-) Surely depends on your app domain(s)...
Indexing a string is rare, unless you are parsing something (yes it does truly depend on the domain), but even rarer is indexing a string based on a hard-coded value. Really, this last case is where the type would behave in a confusing way. Consider that D's current's strings have even more of a limited means of indexing -- you can only validly index a string using an index where a code-point is encoded as a single code-unit. Accessing other indexes (even ones that start a multi-unit code-point) results in invalid characters. -Steve
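For example, with today's built-in strings:

```d
void main()
{
    string s = "\u00E9"; // U+00E9 encodes as two UTF-8 code units: 0xC3 0xA9
    assert(s.length == 2);
    // Indexing the second code unit yields a bare continuation byte
    // (bit pattern 10xxxxxx), not a valid character:
    assert((s[1] & 0b1100_0000) == 0b1000_0000);
}
```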
Dec 03 2010
next sibling parent Simon <s.d.hammett gmail.com> writes:
On 03/12/2010 18:29, Steven Schveighoffer wrote:
 On Fri, 03 Dec 2010 11:15:36 -0500, spir <denis.spir gmail.com> wrote:

 Indexing a string is rare, unless you are parsing something (yes it does
 truly depend on the domain),
In my XP you never index *especially* when parsing. Every parser I've written & used treats input as a character stream. XML & D for instance permit code points > 0x7F, which means you must decode them properly when using utf-8, even if you are going to be storing the input back as utf-8.

--
My enormous talent is exceeded only by my outrageous laziness.
http://www.ssTk.co.uk
Dec 03 2010
prev sibling parent reply Jerry Quinn <jlquinn optonline.net> writes:
Steven Schveighoffer Wrote:

 On Fri, 03 Dec 2010 11:15:36 -0500, spir <denis.spir gmail.com> wrote:
 
 On Thu, 02 Dec 2010 16:24:03 -0500
 "Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 Yes, it does seem odd, but then again, how often do you need the
 individual characters of a string?  I wrote php code for about 6 months  
 as
 a full time job before I found I needed to access individual characters,
 and then I had to look up how to do it :)  It's just not a common thing.
Well, I wonder whether I have ever written any piece of code without string indexing ;-) Surely depends on your app domain(s)...
Indexing a string is rare, unless you are parsing something (yes it does truly depend on the domain), but even rarer is indexing a string based on a hard-coded value. Really, this last case is where the type would behave in a confusing way.
I tend to do a lot of transforming strings, but I need to track offsets back to the original text to maintain alignment between the results and the input. For that, indexes are necessary and we use them a lot. Probably the right thing to do in this case is just pay for the cost of using dchar everywhere, but if you're working with large enough quantities of data, storage efficiency matters. Jerry
Dec 03 2010
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 03 Dec 2010 14:40:30 -0500, Jerry Quinn <jlquinn optonline.net>  
wrote:

 I tend to do a lot of transforming strings, but I need to track offsets  
 back to the original text to maintain alignment between the results and  
 the input.  For that, indexes are necessary and we use them a lot.
In my daily usage of strings, I generally use a string as a whole, not individual characters. But I do occasionally use it.

Let's also understand that indexing is still present; what is deactivated is the ability to index to arbitrary code-units. It sounds to me like this new type would not affect your ability to store offsets (you can store an index, use it later when referring to the string, etc. just like you can now).

My string type does not allow for writeable strings. My plan was to allow you access to the underlying char[] and let you edit that way. Letting someone write a dchar into the middle of a utf-8 string could cause lots of problems, so I just disabled it by default. Not sure how that affects your 'transforming' work, are you actually changing the data or just lazily transforming? I'm interested to hear whether you think my string type would be a viable alternative.
 Probably the right thing to do in this case is just pay for the cost of  
 using dchar everywhere, but if you're working with large enough  
 quantities of data, storage efficiency matters.
The huge advantage of using utf-8 is backwards compatibility with ASCII for C functions. -Steve
Dec 03 2010
parent reply Jerry Quinn <jlquinn optonline.net> writes:
Steven Schveighoffer Wrote:

 On Fri, 03 Dec 2010 14:40:30 -0500, Jerry Quinn <jlquinn optonline.net>  
 wrote:
 
 I tend to do a lot of transforming strings, but I need to track offsets  
 back to the original text to maintain alignment between the results and  
 the input.  For that, indexes are necessary and we use them a lot.
In my daily usage of strings, I generally use a string as a whole, not individual characters. But I do occasionally use it. Let's also understand that indexing is still present, what is deactivated is the ability to index to arbitrary code-units. It sounds to me like this new type would not affect your ability to store offsets (you can store an index, use it later when referring to the string, etc. just like you can now). My string type does not allow for writeable strings. My plan was to allow you access to the underlying char[] and let you edit that way. Letting someone write a dchar into the middle a utf-8 string could cause lots of problems, so I just disabled it by default.
That's reasonable. I'm not trying to create invalid strings. I'm actually working in C++ but keeping an eye on things going on in D-land. The kind of stuff we do is to normalize text in preparation for natural language processing. As a simple example, let's say you want to use a set of regexes to identify patterns in text. You want to return the offsets of each regex that matches. However, before the regexes run, you replace all html tags with a placeholder, so they can easily span tags without worrying about the contents. Before you return the results to the user, though, you need to translate those offsets back to ones for the original string. Everything is unicode of course and we care about processing unicode code points, but want to maintain UTF-8 storage underneath to keep size down. In reality, we're often doing things like single character normalizations as well as larger spans, but still need to maintain alignment to the original data. As long as this is reasonable to do, I'm fine. I just wasn't sure from the descriptions I was seeing. Jerry
Dec 03 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 03 Dec 2010 22:08:37 -0500, Jerry Quinn <jlquinn optonline.net>  
wrote:

 I'm actually working in C++ but keeping an eye on things going on
 in D-land.  The kind of stuff we do is to normalize text in preparation
 for natural language processing.

 As a simple example, let's say you want to use a set of regexes to  
 identify patterns
 in text.  You want to return the offsets of each regex that matches.   
 However, before the regexes
 run, you replace all html tags with a placeholder, so they can easily  
 span
 tags without worrying about the contents.
I'm assuming you are not changing the length of the string, or is that not correct?
 Before you return the results to the
 user, though, you need to translate those offsets back to ones for the  
 original
 string.
Hm... I guess you must be changing the lengths if the offsets are different. That seems odd, wouldn't you encounter performance issues when processing large documents?
 Everything is unicode of course and we care about processing unicode  
 code points, but want to maintain UTF-8 storage underneath to keep size  
 down.

 In reality, we're often doing things like single character  
 normalizations as well as larger spans, but still need to maintain  
 alignment to the original data.

 As long as this is reasonable to do, I'm fine.  I just wasn't sure from  
 the descriptions I was seeing.
What you will have is access to the underlying char[] array, which should give you full edit access. I just don't want strings to be easily editable, since doing so correctly can be very difficult. Any offsets to dchar code-points in the string will match offsets to char code-units. In effect, you are always indexing by code-unit, even though with the string type you get code-points back. It should be as simple as accessing a member (like str.data) or casting (i.e. cast(char[])str). I'm unsure yet if it's dangerous enough to require casting. -Steve
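To make Steve's description concrete, here is a minimal sketch of the escape hatch he mentions. The member name `data` and the struct shape are illustrative, taken from the post rather than from any shipped API:

```d
// Hypothetical sketch: the string type iterates by code point, but hands
// out the raw code-unit array on request for direct editing.
struct string_t(T)
{
    private T[] _data;
    this(T[] data) { _data = data; }

    // Escape hatch: direct, unchecked access to the code units.
    // Editing through this bypasses any UTF validity guarantees.
    @property inout(T)[] data() inout { return _data; }
}

void main()
{
    auto s = string_t!(immutable char)("hæ?");
    // Offsets into s.data are code-unit offsets, exactly as in a char[].
    assert(s.data.length == 4); // 'æ' is two UTF-8 code units
}
```

Whether this access should instead require an explicit cast(char[]) is the open question in the post.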
Dec 04 2010
prev sibling parent spir <denis.spir gmail.com> writes:
On Thu, 02 Dec 2010 16:24:03 -0500
"Steven Schveighoffer" <schveiguy yahoo.com> wrote:

 At this point, you can't do that except via opApply, and I didn't want to
 inject that in fear that it would be pointed out as a drawback.

 It would be nice if we could define a way to do that via ranges...
According to TDPL p.381, one could define slicing instead of opApply or range methods (since a slice is "rangeable", the compiler should take a slice if nothing else is present) (not tried, yet). Your string type is, indeed, a typical case where this would perfectly make sense. Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Dec 03 2010
prev sibling parent reply Steven Wawryk <stevenw acres.com.au> writes:
On 02/12/10 12:43, Ellery Newcomer wrote:
 On 12/01/2010 03:35 PM, Steven Schveighoffer wrote:
 I find that iteration over string characters using index is a very rare
 thing anyways, you either use foreach, which should give you dchars, or
 you use something like find, which should never give you an invalid
 index.

 -Steve
find was the counterargument I had in mind for keeping the operator overload, as something like s[find(s,'\u2729') .. s.codeUnits] is just a bit better than s.codePointSliceAt(find(s,'\u2729'), s.codeUnits); I really don't know.
I don't like either. I'd prefer (from std.algorithm) s = find(s, '\u2729'); I still don't see a need for any random-access methods, including slicing, on the code-point range. The ability to convert back and forth between the string_t and T[] would be useful to allow the user to choose between random-access code-unit range and bidirectional code-point range (or grapheme/glyph/etc). For example a T[] accessor property and an opAssign from a T[]. So if there are efficiencies to be had with random-access (and slicing), then the user could choose s = find(s.codeUnits, '\u2729'); Are there any other reasons to slice on a code-point range? Or have any capabilities beyond a bidirectional range? Unless there are compelling reasons, I'd have to say that slicing and indexing on the code-point level is not much more than a hack.
Jan 12 2011
parent Steven Wawryk <stevenw acres.com.au> writes:
 s = find(s.codeUnits, '\u2729');
Sorry, I didn't mean the codeUnits returning a size_t. I meant an accessor for the underlying code-unit range. Something like @property T[] codeUnits() { return _data; }
Jan 12 2011
prev sibling next sibling parent reply "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:
 
 [...]
 
 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.
Slicing is still O(1). [...]
It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now. On a side note: It seems to me that the only reason to have char, wchar, and dchar as separate types in the language is that arrays of said types are UTF-encoded strings. If a type such as the proposed one were to become the default string type in D, it might as well wrap an array of ubyte/ushort/uint, since direct user manipulation of the underlying array will generally only happen in the rare cases when one wants to deal directly with code units. -Lars
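For the record, the behaviour Lars describes falls straight out of UTF-8's layout; a small check with plain std.utf (no custom string type involved) shows why indices 1 and 2 both land inside 'æ':

```d
import std.utf : count, stride;

void main()
{
    string s = "hæ?";
    // 'æ' encodes as two UTF-8 code units (0xC3 0xA6), so the array
    // holds 4 code units but only 3 code points.
    assert(s.length == 4);
    assert(count(s) == 3);
    // The code point starting at code-unit index 1 is 2 units wide,
    // so code-unit index 2 falls in the middle of 'æ'.
    assert(stride(s, 1) == 2);
}
```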
Nov 30 2010
next sibling parent spir <denis.spir gmail.com> writes:
On Tue, 30 Nov 2010 23:34:11 +0000 (UTC)
"Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:
 
  On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
  <jmdavisProg gmx.com> wrote:
 
  [...]
 
  4. Indexing is no longer O(1), which violates the guarantees of the
  index operator.
 Indexing is still O(1).
  5. Slicing (other than a full slice) is no longer O(1), which violates
  the
  guarantees of the slicing operator.
 Slicing is still O(1). [...]
 It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.
If I understand correctly how _charStart works in combination with indexing and slicing, then here is something wrong in the type's interface. After auto str = mystring("hæ?"); either one provides a code-unit index and gets a code unit: writeln(str[1], " ", str[2]); // "� �" (invalid utf code points) or one provides a code-point index and gets a code point: writeln(str[1], " ", str[2]); // "æ ?" But for string manipulation, wouldn't it be better that your string type systematically wraps a dchar[] array, whatever the original encoding? For indexing, slicing, finding, counting, etc. to be fast, I mean. Decoding being done only once at string creation time.
 On a side note:  It seems to me that the only reason to have char, wchar,
 and dchar as separate types in the language is that arrays of said types
 are UTF-encoded strings.  If a type such as the proposed one were to
 become the default string type in D, it might as well wrap an array of
 ubyte/ushort/uint, since direct user manipulation of the underlying array
 will generally only happen in the rare cases when one wants to deal
 directly with code units.
Yes, but then, see remark above. Denis -- -- -- -- -- -- -- vit esse estrany ☣ spir.wikidot.com
Dec 01 2010
prev sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad  
<public kyllingen.nospamnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.
Slicing is still O(1). [...]
It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.
I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.
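The behaviour Steve proposes here (throw on an index that lands mid-sequence) can be sketched with std.utf: a UTF-8 continuation byte always has the bit pattern 10xxxxxx, so detecting an invalid start index is O(1). The function name `charAt` below is illustrative, not from the posted code:

```d
import std.utf : decode, UTFException;

// Hypothetical opIndex body: index by code unit, return the code point,
// throw if the index does not start a code point.
dchar charAt(const(char)[] data, size_t i)
{
    // Continuation bytes look like 0b10xxxxxx.
    if ((data[i] & 0xC0) == 0x80)
        throw new UTFException("index is not on a code-point boundary");
    size_t idx = i;
    return decode(data, idx); // O(1): decodes only this one code point
}

void main()
{
    auto s = "hæ?";
    assert(charAt(s, 1) == 'æ');    // index 1 starts the 'æ' sequence
    bool threw = false;
    try charAt(s, 2); catch (UTFException) threw = true;
    assert(threw);                   // index 2 is mid-sequence
}
```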
 On a side note:  It seems to me that the only reason to have char, wchar,
 and dchar as separate types in the language is that arrays of said types
 are UTF-encoded strings.  If a type such as the proposed one were to
 become the default string type in D, it might as well wrap an array of
 ubyte/ushort/uint, since direct user manipulation of the underlying array
 will generally only happen in the rare cases when one wants to deal
 directly with code units.
I'd still want a char[] array for easy manipulation and eventual printing. Wrapping a ubyte[] with a string just to print would be strange. My goal in this exercise is to try and give control of what to deal with (code-points vs. code-units) back to the user. Right now, the library forces you to view them as code-points, and the compiler forces you to view them as code-units (except via foreach). But it is an interesting idea (removing small char types). -Steve
Dec 01 2010
parent reply "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:
 
 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which
 violates the
 guarantees of the slicing operator.
Slicing is still O(1). [...]
It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.
I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.
As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics. I think it's great that you're doing this, by the way! I haven't made up my mind yet about whether I want char[] or a separate string type, but it is great to have an actual implementation of the latter at hand when debating it. -Lars
Dec 01 2010
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad  
<public kyllingen.nospamnet> wrote:

 On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which
 violates the
 guarantees of the slicing operator.
Slicing is still O(1). [...]
It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.
I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.
As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.
Then substring (slicing) becomes an O(n) operation. It just doesn't work well. It seems to be awkward at first thought, but the more I think about it, the more I think it's right. When do you ever depend on specific indexes in a string being valid, or to be incrementing always by 1?
 I think it's great that you're doing this, by the way!  I haven't made up
 my mind yet about whether I want char[] or a separate string type, but it
 is great to have an actual implementation of the latter at hand when
 debating it.
Thanks :) -Steve
Dec 02 2010
parent reply "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Thu, 02 Dec 2010 16:18:52 -0500, Steven Schveighoffer wrote:

 On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:
 
 On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which
 violates the
 guarantees of the slicing operator.
Slicing is still O(1). [...]
It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.
I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.
As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.
Then substring (slicing) becomes an O(n) operation. It just doesn't work well.
What I meant was that opSlice() should be disabled in the same way as opIndex(). What you have now is

struct string_t(T)
{
    dchar opIndex(size_t);               // Must be O(1)
    string_t opSlice(size_t, size_t);    // Must be O(1)
}

and what I'm suggesting is

struct string_t(T)
{
    dchar character(size_t);             // May be O(n)
    string_t substring(size_t, size_t);  // May be O(n)

    static if (is(T == dchar))
    {
        alias character opIndex;         // Must be O(1)
        alias substring opSlice;         // Must be O(1)
    }
}
 It seems to be awkward at first thought, but the more I
 think about it, the more I think it's right.  When do you ever depend on
 specific indexes in a string being valid, or to be incrementing always
 by 1?
That is a good question indeed. But I'm still not convinced. :) Another thing to consider is that by having opIndex() in there, your string satisfies isRandomAccessRange. Then, some algorithms which work with both bidirectional and random access may choose to go with the latter. This is a quote from the std.algorithm.find() docs: Specializations taking advantage of bidirectional or random access (where present) may accelerate search [...] -Lars
Dec 03 2010
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 03 Dec 2010 08:13:50 -0500, Lars T. Kyllingstad  
<public kyllingen.nospamnet> wrote:

 On Thu, 02 Dec 2010 16:18:52 -0500, Steven Schveighoffer wrote:

 On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:

 On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:

 On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:

 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
 <jmdavisProg gmx.com> wrote:

 [...]

 4. Indexing is no longer O(1), which violates the guarantees of the
 index operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which
 violates the
 guarantees of the slicing operator.
Slicing is still O(1). [...]
It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.
I don't think it's possible to do that with any implementation without making indexing not O(1). This just isn't possible, unless you want to use dchar[]. But your point is well taken. I think what I'm going to do is throw an exception when accessing an invalid index. While also surprising, it doesn't result in "extra data". I feel it's probably very rare to just access hard-coded indexes like that unless you are sure of the data in the string. Or to use a for-loop to access characters, etc.
As soon as you add opIndex(), your interface becomes that of a random- access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.
Then substring (slicing) becomes an O(n) operation. It just doesn't work well.
What I meant was that opSlice() should be disabled in the same way as opIndex(). What you have now is

struct string_t(T)
{
    dchar opIndex(size_t);               // Must be O(1)
    string_t opSlice(size_t, size_t);    // Must be O(1)
}

and what I'm suggesting is

struct string_t(T)
{
    dchar character(size_t);             // May be O(n)
    string_t substring(size_t, size_t);  // May be O(n)

    static if (is(T == dchar))
    {
        alias character opIndex;         // Must be O(1)
        alias substring opSlice;         // Must be O(1)
    }
}
O(n) slicing and indexing, called by different names, is unacceptable. It is possible to have O(1) indexing and slicing, I don't see why we should disallow it. If nothing else, it cripples this implementation to the point where it won't be accepted, simply because it performs worse than the current solution. Now, if one wants to have that feature in *addition* to O(1) indexing/slicing, I don't think that's an issue.
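Steve's O(1) claim holds because slicing by code-unit offsets is plain array slicing underneath: no decoding is performed, only the pointer and length are adjusted. A minimal illustration with a raw string:

```d
void main()
{
    string s = "hællo";
    // 'æ' spans code units 1 and 2, so this slice is exactly "æ".
    // The slice is O(1): it reuses the same memory, adjusting only
    // the pointer and the length.
    auto sub = s[1 .. 3];
    assert(sub == "æ");
    assert(sub.ptr == s.ptr + 1); // no copy was made
}
```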
 It seems to be awkward at first thought, but the more I
 think about it, the more I think it's right.  When do you ever depend on
 specific indexes in a string being valid, or to be incrementing always
 by 1?
That is a good question indeed. But I'm still not convinced. :) Another thing to consider is that by having opIndex() in there, your string satisfies isRandomAccessRange. Then, some algorithms which work with both bidirectional and random access may choose to go with the latter. This is a quote from the std.algorithm.find() docs: Specializations taking advantage of bidirectional or random access (where present) may accelerate search [...]
I just looked it up, isRandomAccess requires hasLength, which my string type does not. So my string type is not a true random access range (this was intentional). -Steve
Dec 03 2010
parent "Lars T. Kyllingstad" <public kyllingen.NOSPAMnet> writes:
On Fri, 03 Dec 2010 08:59:52 -0500, Steven Schveighoffer wrote:

 On Fri, 03 Dec 2010 08:13:50 -0500, Lars T. Kyllingstad
 <public kyllingen.nospamnet> wrote:
 Another thing to consider is that by having opIndex() in there, your
 string satisfies isRandomAccessRange.  Then, some algorithms which work
 with both bidirectional and random access may choose to go with the
 latter.  This is a quote from the std.algorithm.find() docs:

     Specializations taking advantage of bidirectional or random access
     (where present) may accelerate search [...]
I just looked it up, isRandomAccess requires hasLength, which my string type does not. So my string type is not a true random access range (this was intentional).
Ah, you are right. I didn't read the documentation for isRandomAccessRange, I just looked at the accompanying code snippet, which doesn't mention hasLength. -Lars
Dec 03 2010
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, December 03, 2010 05:13:50 Lars T. Kyllingstad wrote:
 On Thu, 02 Dec 2010 16:18:52 -0500, Steven Schveighoffer wrote:
  On Thu, 02 Dec 2010 02:09:51 -0500, Lars T. Kyllingstad
  <public kyllingen.nospamnet> wrote:
  On Wed, 01 Dec 2010 16:44:42 -0500, Steven Schveighoffer wrote:
  On Tue, 30 Nov 2010 18:34:11 -0500, Lars T. Kyllingstad
  <public kyllingen.nospamnet> wrote:
  On Tue, 30 Nov 2010 13:52:20 -0500, Steven Schveighoffer wrote:
  On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis
  <jmdavisProg gmx.com> wrote:
 
  [...]
 
  4. Indexing is no longer O(1), which violates the guarantees of the
  index operator.
 Indexing is still O(1).
  5. Slicing (other than a full slice) is no longer O(1), which
  violates the
  guarantees of the slicing operator.
 Slicing is still O(1). [...]
 It feels extremely weird that the indices refer to code units and not code points. If I write auto str = mystring("hæ?"); writeln(str[1], " ", str[2]); I expect it to print "æ ?", not "æ æ" like it does now.
 I don't think it's possible to do that with any implementation without
 making indexing not O(1).  This just isn't possible, unless you want
 to use dchar[].
 
 But your point is well taken.  I think what I'm going to do is throw
 an exception when accessing an invalid index.  While also surprising,
 it doesn't result in "extra data".  I feel it's probably very rare to
 just access hard-coded indexes like that unless you are sure of the
 data in the string.  Or to use a for-loop to access characters, etc.
 As soon as you add opIndex(), your interface becomes that of a random-access range, something which narrow strings are not. In fact, the distinction between random access and bidirectional range access for strings is in many ways the reason we're having this discussion. How about dropping opIndex() for UTF-8 and UTF-16 strings, and instead adding a characterAt(i) function that retrieves the i'th code point, and which is not required to be O(1)? Then, if someone wants O(1) indexing they are forced to use string_t!dchar or just plain ol' arrays, both of which have clear, predictable indexing semantics.
 Then substring (slicing) becomes an O(n) operation. It just doesn't work well.
 What I meant was that opSlice() should be disabled in the same way as opIndex().
A string type without slicing (which must be O(1)) is DOA without question. Slicing is _far_ too useful to lose. Indexing in strings is fairly rare because it's generally a stupid idea, but slicing happens all the time. If nothing else, that is _the_ way to get a substring.

- Jonathan M Davis
Dec 03 2010
prev sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, November 30, 2010 10:52:20 Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com>
 
 wrote:
 1. At least until universal function syntax is in the language, you use
 the
 ability to do str.func().
Yes, but first, this is not a problem with the string type, it's a problem with the language. Second, any string-specific functions can be added, it is a struct after all, not a builtin.
I really don't think that we want to have to be adding string functions to a string struct. Having them external is far better, particularly when there are so many of them which aren't directly related to strings but use them heavily. Even if all of std.string got tacked onto a struct string type, that would leave out the rest of Phobos and any user code. I really think that universal function syntax is a necessity for a struct solution to be acceptable. I had considered a solution similar to this one a few months back, and the lack of universal function syntax is one of the reasons why I decided that it wasn't really an improvement. Honestly, without uniform function call syntax, I would consider a struct solution to be DOA. The loss in usability would just be too great.
 2. Functions that would work just fine treating strings as arrays of
 code units
 (presumably because they don't care about what the actual data is) lose
 efficiency, because now a string isn't an array.
Which functions are those? They can be allowed via wrappers.
It is my understanding that there are various functions in std.algorithm which are able to treat strings as arrays and therefore process them more efficiently. I haven't looked into which ones are in that camp. I would think that find() might be, but I'd have to look.
 3. You have no access to the underlying array unless you're dealing with
 an
 actual array of dchar.
I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.
Something like that would need to be done. It really should be a public property of some kind if not an outright public variable.
 4. Indexing is no longer O(1), which violates the guarantees of the index
 operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates
 the
 guarantees of the slicing operator.
Slicing is still O(1).
You're right. I misread what _charStart() did. However, if I understand it correctly now, you give it a code unit index and yet get a code point back. That worries me. It means that there is no relation between the length of the string and the indices that you'd use to index it. It _is_ related to the number of code units, which you can get with codeUnits(), but that disjoint seems like it could cause problems. It might be the correct solution, but I think that it merits some examination. Returning a code unit would be wrong since that totally breaks with the rest of the API dealing in dchar/code units, and it would be wrong to index by code point, since then indexing and slicing are no longer O(1). So, it's probably a choice between indexing by one thing and returning another or not having indexing and slicing, which would definitely not be good. So, maybe your solution is the best one in this respect, but it worries me. The exact ramifications of that need to be looked into.
 What you're doing here is forcing the view of a string as a range of
 dchar in
 all cases. Granted, that's what you want in most cases, but it can
 degrade
 efficiency, and the fact that some operations (in particular indexing
 and slicing)
 are not O(1) like they're supposed to be means that algorithms which
 rely on
 O(1) behavior from them could increase their cost by an order of
 magnitude. All
 the cases where treating a string as an actual array which are currently
 valid
 are left out to dry
You can still use char[] and wchar[].
Except that what if you need to do both with the same type? Right now, you could have a function which treats a string as a range of dchar while another one which can get away with treating it as a range of code units can treat it as an array. You can pass the same string to both, and it works. That should still work if we go for a struct solution. Special-casing on strings and specifically using the internal array instead of the struct for them could fix the problem, but it still needs to be possible.
 The inherent problem with strings is that we _want_ to be able to view them as both arrays of code units and as ranges of dchar. Trying to view them only as ranges of dchar is inefficient in a number of cases. Even something as simple as getting a substring is less efficient with this code. Only in cases where you don't actually know the length of the substring that you want is this essentially as efficient as what we have now. You're ignoring all cases where viewing a string as an array of code units is correct and desirable. You're only solving half of the problem (albeit the more prevalent half).
I'm not planning on eliminating char and wchar based arrays. In other words, you should be able to get access to the array, but it should not be the default, and it should be considered unsafe.
 Now, making it possible to have a wrapper struct like this which works in many cases where you'd use strings could reduce errors in code in many situations, so giving the programmer the ability to do that could be interesting. But it seems to me that this solution is too limiting to be a viable replacement for strings as they are.
Hopefully you can see that I'm not eliminating the functionality you are looking for, just making it not the default.
There is definitely some value in making strings treated as ranges of dchar by default. But for the most part, that's the case already thanks to how std.array is defined. The only place where that runs into trouble is if you use foreach. It still treats them as arrays of code units unless you tell it to iterate over dchar. Either making foreach over character arrays iterate over dchar by default or making it a warning or error to use foreach with a char or wchar array of any kind without specifying the type to iterate over would fix that problem.

I don't think that the basic idea of what you present is necessarily a bad idea (I've considered proposing similar solutions in the past), but I'm not at all convinced that it's an improvement. What we have works quite well overall, with the notable exception of foreach. I'm also worried that trying to bury the fact that strings are arrays of code units is going to cause a leaky abstraction which will cause problems. It's not likely to be as bad as the situation with char literals and std::string in C++, but I think that we need to look at this very carefully and examine what all of the ramifications are before we even consider going for it.

There is an inherent but necessary disconnect between having strings be arrays of code units and ranges of dchar. Sometimes they need to be treated as one, sometimes as the other. Ideally, the default - whichever it is - would be the one which leads to fewer bugs. But they both need to be there. A struct solution is essentially an attempt to make strings treated as ranges of dchar in more situations by default than is currently the case. As such, it could be better than what we have now, but I'm honestly not convinced that it is. Aside from the foreach problem (which could be fixed with an appropriate warning or error - preferably error), what we have works quite well.

- Jonathan M Davis
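The foreach behavior discussed above can be seen in a short sketch (assuming current D semantics, where the iteration type defaults to the code unit; the string and variable names are just illustrative):

```d
import std.stdio;

void main()
{
    string s = "héllo"; // 'é' takes two UTF-8 code units

    // Default: foreach iterates code units (char),
    // so 'é' shows up as two separate bytes.
    size_t units;
    foreach (c; s)
        ++units;

    // Explicit dchar: foreach decodes, so 'é' is one element.
    size_t points;
    foreach (dchar c; s)
        ++points;

    writefln("%s code units, %s code points", units, points); // 6 and 5
}
```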
Nov 30 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 30 Nov 2010 18:34:56 -0500, Jonathan M Davis <jmdavisProg gmx.com>  
wrote:

 On Tuesday, November 30, 2010 10:52:20 Steven Schveighoffer wrote:
 On Tue, 30 Nov 2010 13:34:50 -0500, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 1. At least until universal function syntax is in the language, you lose the ability to do str.func().
Yes, but first, this is not a problem with the string type, it's a problem with the language. Second, any string-specific functions can be added, it is a struct after all, not a builtin.
I really don't think that we want to have to be adding string functions to a string struct. Having them external is far better, particularly when there are so many of them which aren't directly related to strings but use them heavily. Even if all of std.string got tacked onto a struct string type, that would leave out the rest of Phobos and any user code. I really think that universal function syntax is a necessity for a struct solution to be acceptable.
Why? It's a choice between adding them to the struct, and doing func(str) instead. You already have to do func(R) for arbitrary ranges, I don't see why strings should be specialized. For things that should be considered 'built-in' for strings, they can be included as members of the struct. In other words, I think it's worth fixing strings now, even if it means we cannot call str.func() until the universal syntax is introduced.
 I had considered a solution similar to this one a few months back, and the lack of universal function syntax is one of the reasons why I decided that it wasn't really an improvement. Honestly, without uniform function call syntax, I would consider a struct solution to be DOA. The loss in usability would just be too great.
But you can still call the function, it's just func(str). There is no loss of functionality. Usability isn't even really lost.
 2. Functions that would work just fine treating strings as arrays of code units (presumably because they don't care about what the actual data is) lose efficiency, because now a string isn't an array.
Which functions are those? They can be allowed via wrappers.
It is my understanding that there are various functions in std.algorithm which are able to treat strings as arrays and therefore process them more efficiently. I haven't looked into which ones are in that camp. I would think that find() might be, but I'd have to look.
Since std.algorithm works exclusively with ranges, those must be special cases for strings because strings are bi-directional ranges of dchar according to phobos. So those can continue to be special cases for the new string types.
 4. Indexing is no longer O(1), which violates the guarantees of the index operator.
Indexing is still O(1).
 5. Slicing (other than a full slice) is no longer O(1), which violates the guarantees of the slicing operator.
Slicing is still O(1).
You're right. I misread what _charStart() did. However, if I understand it correctly now, you give it a code unit index and yet get a code point back. That worries me. It means that there is no relation between the length of the string and the indices that you'd use to index it. It _is_ related to the number of code units, which you can get with codeUnits(), but that disconnect seems like it could cause problems. It might be the correct solution, but I think that it merits some examination. Returning a code unit would be wrong, since that totally breaks with the rest of the API dealing in dchar/code points, and it would be wrong to index by code point, since then indexing and slicing are no longer O(1). So, it's probably a choice between indexing by one thing and returning another or not having indexing and slicing, which would definitely not be good. So, maybe your solution is the best one in this respect, but it worries me. The exact ramifications of that need to be looked into.
How many times do you use a hard-coded index into a string without knowing the encoding of the string (i.e. "I know this string is ASCII")? How many times do you iterate the characters of a string via an incrementing index? It's not common to use the indexing operation with values that are not known or computed to be valid starts of code points. In fact, the language depends on that (otherwise you'd see sliced strings everywhere with invalid data). So while it looks strange, it shouldn't be a common need.

That being said, I think Lars pointed out that the strangeness of returning the code point even if you point in the middle would be surprising in some cases, so I think the better solution is to throw an exception.
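The exception-throwing variant suggested here could look roughly like this (a sketch, not the posted code: the wrapper name, the boundary test, and the use of UTFException are assumptions):

```d
import std.utf;

// Hypothetical variant: opIndex throws when the index lands in the
// middle of a code point, instead of decoding from the preceding
// boundary as the posted _charStart-based version does.
struct ThrowingString
{
    immutable(char)[] _data;

    dchar opIndex(size_t idx) const
    {
        // A UTF-8 continuation byte has the form 10xxxxxx.
        if ((_data[idx] & 0b1100_0000) == 0b1000_0000)
            throw new UTFException("index does not start a code point");
        size_t i = idx;
        return std.utf.decode(_data, i);
    }
}

unittest
{
    auto s = ThrowingString("héllo");
    assert(s[0] == 'h');
    assert(s[1] == 'é');  // index 1 starts the two-unit sequence
    // s[2] throws: it points into the middle of 'é'.
}
```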
 What you're doing here is forcing the view of a string as a range of dchar in all cases. Granted, that's what you want in most cases, but it can degrade efficiency, and the fact that some operations (in particular indexing and slicing) are not O(1) like they're supposed to be means that algorithms which rely on O(1) behavior from them could increase their cost by an order of magnitude. All the cases where treating a string as an actual array which are currently valid are left out to dry
You can still use char[] and wchar[].
But what if you need to do both with the same type? Right now, you could have a function which treats a string as a range of dchar while another one which can get away with treating it as a range of code units can treat it as an array. You can pass the same string to both, and it works. That should still work if we go for a struct solution. Special-casing on strings and specifically using the internal array instead of the struct for them could fix the problem, but it still needs to be possible.
Yeah, it's definitely needed. I'll add access to the data member in the next version.
 Hopefully you can see that I'm not eliminating the functionality you are looking for, just making it not the default.
There is definitely some value in making strings treated as ranges of dchar by default. But for the most part, that's the case already thanks to how std.array is defined. The only place where that runs into trouble is if you use foreach.
or indexing. This is a huge problem.
 It still treats them as arrays of code units unless you tell it to iterate over dchar. Either making foreach over character arrays iterate over dchar by default or making it a warning or error to use foreach with a char or wchar array of any kind without specifying the type to iterate over would fix that problem.
I agree that would be ideal, but it still doesn't solve the indexing problem.
 There is an inherent but necessary disconnect between having strings be arrays of code units and ranges of dchar. Sometimes they need to be treated as one, sometimes as the other. Ideally, the default - whichever it is - would be the one which leads to fewer bugs. But they both need to be there. A struct solution is essentially an attempt to make strings treated as ranges of dchar in more situations by default than is currently the case. As such, it could be better than what we have now, but I'm honestly not convinced that it is. Aside from the foreach problem (which could be fixed with an appropriate warning or error - preferably error), what we have works quite well.
The thing is, the most common use of strings is as a string, not as an array of code-units. The common case is to print, slice, find, etc. on a *string*. When dealing with the string as a whole, either using an array or a specialized type works equally well.

The uncommon case is to extract individual characters from the string. In this case, the default needs to be the most common need in that area -- extracting a dchar, not a code-unit. Having the default index operation extract a code-unit is very incorrect.

-Steve
Dec 01 2010
parent reply foobar <foo bar.com> writes:
Steven Schveighoffer Wrote:
[snipped]
 3. You have no access to the underlying array unless you're dealing with an actual array of dchar.
I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.
 -Steve
A string type should always maintain the invariant that it is a valid unicode string. Therefore I don't like having an unsafe opCast or providing direct access to the underlying array. I feel that there should be a read-only property for that. Algorithms that manipulate char[]'s should construct a new string instance which will validate the char[] it is being built from is a valid utf string.

This looks like a great start for a proper string type. There's still the issue of literals that would require compiler/language changes.

There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)
Dec 01 2010
parent reply spir <denis.spir gmail.com> writes:
On Wed, 01 Dec 2010 03:30:07 -0500
foobar <foo bar.com> wrote:

 Steven Schveighoffer Wrote:
 [snipped]
 3. You have no access to the underlying array unless you're dealing with an actual array of dchar.
 I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters.
 But it could just be as easy as making the array itself public.
 -Steve
 A string type should always maintain the invariant that it is a valid unicode string. Therefore I don't like having an unsafe opCast or providing direct access to the underlying array. I feel that there should be a read-only property for that. Algorithms that manipulate char[]'s should construct a new string instance which will validate the char[] it is being built from is a valid utf string.
But then, why not store a dchar[] array systematically? Validation and decoding is the same job. Once decoded, all methods work as expected (eg s[3] returns the 4th code point) and blitz fast.
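The dchar[] suggestion above, as a minimal sketch (the trade-off being up to four times the storage of UTF-8 for mostly-ASCII text; toUTF32 is the assumed conversion route, and it validates as it decodes):

```d
import std.utf;

void main()
{
    string utf8 = "héllo";            // UTF-8, 6 code units
    dchar[] wide = toUTF32(utf8).dup; // decode (and validate) once, up front

    assert(wide.length == 5);  // one element per code point
    assert(wide[1] == 'é');    // O(1) indexing by code point
}
```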
 This looks like a great start for a proper string type. There's still the issue of literals that would require compiler/language changes.
Yop...
 There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)
This is my next little project. May build on Steve's job. (But it's not necessary, dchar is enough as a base, I guess.)

Denis
-- -- -- -- -- -- --
vit esse estrany
☣ spir.wikidot.com
Dec 01 2010
parent reply stephan <none example.com> writes:
 There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)
 This is my next little project. May build on Steve's job. (But it's not necessary, dchar is enough as a base, I guess.)
Hi Denis, you might want to consider helping us out. We have got a feature-complete Unicode normalization, case-folding, and concatenation implementation passing all test cases in http://unicode.org/Public/6.0.0/ucd/NormalizationTest.txt (and then some) for all recent Unicode versions. This code was part of a bigger project that we have stopped working on. We feel that the Unicode normalization part might be useful to others. Therefore we consider releasing them under an open source license. Before we can do so, we have to clean up things a bit. Some open issues are a) The code still contains some TODOs and FIXMEs (bugs, inefficiencies, some bigger issues like more efficient storing of data etc.). b) No profiling and no benchmarking against the ICU implementation (http://site.icu-project.org/) has been done yet (we expect surprises). c) Implementation of additional Unicode algorithms (e.g. full case mapping, matching, collation). Since we have stopped working on the bigger project, we haven’t made much progress. Any help would be welcome. Let me know whether this would be of interest to you.
Dec 01 2010
parent reply spir <denis.spir gmail.com> writes:
On Wed, 01 Dec 2010 17:41:17 +0100
stephan <none example.com> wrote:

 There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)
 This is my next little project. May build on Steve's job. (But it's not necessary, dchar is enough as a base, I guess.)
 Hi Denis, you might want to consider helping us out.

 We have got a feature-complete Unicode normalization, case-folding, and concatenation implementation passing all test cases in http://unicode.org/Public/6.0.0/ucd/NormalizationTest.txt (and then some) for all recent Unicode versions. This code was part of a bigger project that we have stopped working on.

 We feel that the Unicode normalization part might be useful to others. Therefore we consider releasing them under an open source license. Before we can do so, we have to clean up things a bit. Some open issues are

 a) The code still contains some TODOs and FIXMEs (bugs, inefficiencies, some bigger issues like more efficient storing of data etc.).

 b) No profiling and no benchmarking against the ICU implementation (http://site.icu-project.org/) has been done yet (we expect surprises).

 c) Implementation of additional Unicode algorithms (e.g. full case mapping, matching, collation).

 Since we have stopped working on the bigger project, we haven't made much progress. Any help would be welcome. Let me know whether this would be of interest to you.
Yes, of course it would be useful, in any case. Either you wish to go on with your project, and I may be of some help; or it would anyway be a useful base or example of how to implement unicode algorithms. Maybe it's time to give some more information on what I intend to write. I have done it already (partially in Python, nearly completely in Lua).

What I have in mind is a "UText" type that provides the right abstraction for text processing / string manipulation as one has when dealing with ASCII (in fact, any legacy character set). All that is needed is having a true one-to-one mapping between characters (in the common sense) and elements of strings (what I call "code stacks"); one given stack unambiguously denotes one character. To reach this point, in addition to decoding (e.g. from utf8 to code points), we must:
* group codes into stacks
* normalize (into 'NFD')
* sorts points in stacks
That's the base.

Then, we can for instance index or slice in O(1) as usual, and get a consistent substring of _characters_ (not "abstract characters"). We can search for substrings by simple, direct comparisons. When dealing with utf32 strings (or worse, utf8), simple indexing or counting is O(n), or rather O(k.n) where k represents the (high) cost of "stacking", normalizing, and sorting on the fly -- it's not only traversing the whole string instead of random access, it's heavy computation all along the way.

From this base, all kinds of usual routines can be built without any more complexity. That's all I want to implement. I wish to write all general-purpose ones (which means, for instance, nothing like casing).

Precisely, I do not want to deal with anything related to script-, language-, or locale-specific issues. It's a completely separate & independent topic. This indeed includes the "compatibility" normalisation forms of unicode (which precisely do not provide a normal form...). It seems part of your project was to cope with such issues.

I would be happy to cooperate if you feel like going on (then, let us communicate off list). I still have the Lua code (which used to run); even if useless as help for implementation (the languages are too different), it could give some more concrete picture of what I have in mind. Also, it includes several test datasets, reprocessed for usability, from unicode's online files.

Denis
-- -- -- -- -- -- --
vit esse estrany
☣ spir.wikidot.com
Dec 01 2010
parent reply Ali Çehreli <acehreli yahoo.com> writes:
spir wrote:

 What I have in mind is a "UText" type that provides the right abstraction for text processing / string manipulation as one has when dealing with ASCII (in fact, any legacy character set). All that is needed is having a true one-to-one mapping between characters (in the common sense) and elements of strings (what I call "code stacks"); one given stack unambiguously denotes one character. To reach this point, in addition to decoding (e.g. from utf8 to code points), we must:
 * group codes into stacks
 * normalize (into 'NFD')
Are those operations independent of the context? Is stacking always desired? I guess one would use one of the D string types when grouping or normalization is not desired, right? Makes sense.
 * sorts points in stacks
Ok, I see that it is possible with NFD. I am not experienced with Unicode, but I think there will be issues with other types of Unicode normalization. (Judging from your posts, I know that you know all these. :) )
 Then, we can for instance index or slice in O(1) as usual, and get a consistent substring of _characters_ [...] I do not want to deal with anything related to script-, language-, locale- specific issues.

Is the concept of _character_ well defined in Unicode outside of the context of an alphabet? (I think your "script" covers alphabet.)

It is an interesting decision when we actually want to see an array of code points as characters. When would it be correct to do so? I think the answer is when we start treating the string as a piece of text. For a string to be considered as text, it must be based on an alphabet. ASCII strings are pieces of text, because they are based on the 26-letter alphabet.

I hope I don't sound like I'm arguing against anything that you said. I am also thinking about the other common operations that work on pieces of text:

- sorting (e.g. ç is between c and d in many alphabets)
- lowercasing, uppercasing (e.g. i<->İ and ı<->I in many alphabets)

As a part of the Turkish D community, we've played with the idea of such a text type. It took advantage of D's support for Unicode encoded source code, so it's fully in Turkish. Yay! :)

Here is the module that takes care of sorting, capitalization, and producing the base forms of the letters of the alphabets:

    http://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d

It is also based on dchar[], as you recommend elsewhere in this thread. It is written with the older D2 operator overloading, doesn't support ranges, etc. But it currently supports ten alphabets (including the 26-letter English, and the Old Irish alphabet).

Going out of the context of this thread, we've also worked on a type that contains pieces of text from different alphabets to make a "text", where a text like "jim & ali" is correctly capitalized as "JIM & ALİ".

I am thinking of more than what you describe. But your string would be useful for implementing ours, as we don't have normalization or stacking support at all.

Thanks,
Ali
Dec 03 2010
parent spir <denis.spir gmail.com> writes:
On Fri, 03 Dec 2010 14:11:35 -0800
Ali Çehreli <acehreli yahoo.com> wrote:

 spir wrote:

  > What I have in mind is a "UText" type that provides the right abstraction for text processing / string manipulation as one has when dealing with ASCII (in fact, any legacy character set). All that is needed is having a true one-to-one mapping between characters (in the common sense) and elements of strings (what I call "code stacks"); one given stack unambiguously denotes one character. To reach this point, in addition to decoding (e.g. from utf8 to code points), we must:
  > * group codes into stacks
  > * normalize (into 'NFD')

 Are those operations independent of the context? Is stacking always desired?

 I guess one would use one of the D string types when grouping or normalization is not desired, right? Makes sense.
We can consider there are two kinds of uses for texts in most applications:
1. just input/output (where input is also from literals in code), possibly with some concatenation.
2. string manipulation / text processing

For the 1st case, there is no need for any sophisticated toolkit like the type I intend to write. We can just read in latin-x, for instance, join it, output it back, without any problems. (As long as all pieces of text share the same encoding.) Problems arise as soon as text is to be manipulated or processed in any other way: indexing, searching, counting, slicing, replacing, etc... all these routines require isolating _characters_ in the ordinary sense of the word, inside the string of code units or code points.

A true text type, usable like ASCII in old days, would either provide routines that do that in the background, but in a costly way, or first group codes into characters once only -- then every later operation is as cheap as possible. Normalising and sorting are also required so that each character has only one representation.

I intend to write a little article to explain the issue (& misunderstandings created by Unicode's use of "abstract character"), and possible solutions.

To say it again: if all one needs is text input/output, then using such a tool is overkill. Actually, even the string type Steven is implementing is not strictly necessary. But it would have the advantage, if I understand correctly, to present a cleaner interface.
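The "grouping codes into characters" step described above can be illustrated with a toy sketch (the combining-mark test here is a crude assumption limited to U+0300..U+036F; real code would consult the Unicode combining-class tables, and normalization and sorting are left out):

```d
// Group code points into "stacks": one stack per user-perceived
// character, with combining marks attached to the preceding base.
dchar[][] stack(const(dchar)[] s)
{
    dchar[][] result;
    foreach (c; s)
    {
        // Simplified test: U+0300..U+036F are combining diacritics.
        if (c >= 0x0300 && c <= 0x036F && result.length)
            result[$ - 1] ~= c;   // attach to the previous base
        else
            result ~= [c];        // start a new stack
    }
    return result;
}

unittest
{
    // "e" + combining acute + "x": two characters, three code points
    auto stacks = stack("e\u0301x"d);
    assert(stacks.length == 2);
    assert(stacks[0].length == 2);  // base + combining mark
}
```

Once stacks are built, counting, indexing, and comparing by character become plain array operations, which is the point of doing the work up front.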
  > * sorts points in stacks

 Ok, I see that it is possible with NFD. I am not experienced with Unicode, but I think there will be issues with other types of Unicode normalization. (Judging from your posts, I know that you know all these. :) )

Yes, the algorithm comes with Unicode's docs about "canonicalisation".
  > Then, we can for instance index or slice in O(1) as usual, and get a
  > consistent substring of _characters_ [...] I do not want to deal with
  > anything related to script-, language-, locale- specific issues.

 Is the concept of _character_ well defined in Unicode outside of the context of an alphabet? (I think your "script" covers alphabet.)

 It is an interesting decision when we actually want to see an array of code points as characters. When would it be correct to do so? I think the answer is when we start treating the string as a piece of text.

 For a string to be considered as text, it must be based on an alphabet. ASCII strings are pieces of text, because they are based on the 26-letter alphabet.

 I hope I don't sound like I'm arguing against anything that you said. I am also thinking about the other common operations that work on pieces of text:

 - sorting (e.g. ç is between c and d in many alphabets)
 - lowercasing, uppercasing (e.g. i<->İ and ı<->I in many alphabets)

 As a part of the Turkish D community, we've played with the idea of such a text type. It took advantage of D's support for Unicode encoded source code, so it's fully in Turkish. Yay! :)

 Here is the module that takes care of sorting, capitalization, and producing the base forms of the letters of the alphabets:

      http://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d

 It is also based on dchar[], as you recommend elsewhere in this thread.

 It is written with the older D2 operator overloading, doesn't support ranges, etc. But it currently supports ten alphabets (including the 26-letter English, and the Old Irish alphabet).

 Going out of the context of this thread, we've also worked on a type that contains pieces of text from different alphabets to make a "text", where a text like "jim & ali" is correctly capitalized as "JIM & ALİ".

 I am thinking of more than what you describe. But your string would be useful for implementing ours, as we don't have normalization or stacking support at all.

As said, I do not wish to enter the huge area of script-, natural language-, or culture-specific issues, because it's not general; I just target a general-purpose tool. My type wouldn't even have a default uppercase routine, for instance: first, it's script specific; second, there is no general definition for that (*) -- even if Unicode provides such data. Sorting issues are even less decidable; it goes down to personal preferences (**). It's also too big, too diverse, too complicated an area.

But I guess the type I have in mind would provide a good basis for such a work (or rather, hundreds of language- and domain-specific tools or applications). At least, issues about grouping codes into characters, and multiple forms of characters, are solved: "ALİ" would always be represented the same way, and in a logical way; so that if you count its characters, you get 3 for sure, and if you search for 'İ' you find it for sure (which is not true today).
 Thanks,
 Ali

Thank to you,
Denis

(*) Is the uppercase of "gâté" "GATE" or "GÂTÉ"?
(**) Search for instance the threads about users complaining when KDE decided to sort file names according to a supposedly user-friendly order ("natural", lol!).
-- -- -- -- -- -- --
vit esse estrany
☣ spir.wikidot.com
Dec 03 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 01 Dec 2010 03:30:07 -0500, foobar <foo bar.com> wrote:

 Steven Schveighoffer Wrote:
 [snipped]
 3. You have no access to the underlying array unless you're dealing with an actual array of dchar.
I thought of adding some kind of access. I wasn't sure the best way. I was thinking of allowing direct access via opCast, because I think casting might be a sufficient red flag to let you know you are crossing into dangerous waters. But it could just be as easy as making the array itself public.
 -Steve
A string type should always maintain the invariant that it is a valid unicode string. Therefore I don't like having an unsafe opCast or providing direct access to the underlying array. I feel that there should be a read-only property for that. Algorithms that manipulate char[]'s should construct a new string instance which will validate the char[] it is being built from is a valid utf string.
Copying is not a good idea, nor is runtime validation. We can only protect the programmer so much. The good news is that the vast majority of strings are literals, which should be properly constructed by the compiler, and immutable.
 This looks like a great start for a proper string type. There's still the issue of literals that would require compiler/language changes.
That is essential, the compiler has to defer the type of string literals to the library somehow.
 There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)
This is more solvable with a struct, but at this point, I'm not sure if it's worth worrying about. How common is that need?

-Steve
Dec 01 2010
parent reply Steven Wawryk <stevenw acres.com.au> writes:
I like the direction you're taking but have some quibbles about details. Specifically, I'd go for a more complete separation into random-access code-unit ranges and bidirectional code-point ranges:


On 01/12/10 02:18, Steven Schveighoffer wrote:
 // Written in the D programming language.

 /**
 Copyright: Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010
 License: <a href="http://www.boost.org/LICENSE_1_0.txt">Boost License
 1.0</a>.
 Authors: $(WEB erdani.org, Andrei Alexandrescu, Steven Schveighoffer)

 Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010.
 Distributed under the Boost Software License, Version 1.0.
 (See accompanying file LICENSE_1_0.txt or copy at
 http://www.boost.org/LICENSE_1_0.txt)
 */
 import std.utf;
 import std.traits;

 struct string_t(T) if (is(Unqual!T == char) || is(Unqual!T == wchar))
Is there a reason not to include is(Unqual!T == dchar)?
 {
 private T[] _data;
 this(T[] data)
 {
 this._data = data;
 }
An opAssign from a T[] could facilitate conversion back and forth between code-point and code-unit ranges.
 // note, this assumes that idx is a valid index
 private size_t _charStart(size_t idx) const
 {
 static if(is(Unqual!T == wchar))
 {
 immutable c = _data.ptr[idx];
 if(c >= 0xDC00 && c <= 0xDFFF)
 {
 // surrogate pair detected, verify we have at least 2 wchars,
 // and that both wchars are properly encoded.
 assert(idx > 0, "Invalid UTF character at beginning of string");
 return idx-1;
 }
 else
 return idx;
 }
 else
 {
 const p = _data.ptr + idx;
 if ((p[0] & 0b1100_0000) != 0b1000_0000)
 {
 return idx;
 }
 else if (idx >= 1 && (p[-1] & 0b1100_0000) != 0b1000_0000)
 {
 return idx - 1;
 }
 else if (idx >= 2 && (p[-2] & 0b1100_0000) != 0b1000_0000)
 {
 return idx - 2;
 }
 else if (idx >= 3 && (p[-3] & 0b1100_0000) != 0b1000_0000)
 {
 return idx - 3;
 }
 else
 {
 assert(false, "Invalid UTF character in string");
 }
 }
 }

 void popFront()
 {
 auto nc = std.utf.stride(_data, 0);
 assert(nc <= _data.length && nc != 0xFF, "Invalid sequence at beginning
 of string");
 _data = _data[nc .. $];
 }

 void popBack()
 {
 immutable n = _data.length;
 assert(n, "Attempting to pop back of an empty string");
 _data = _data.ptr[0.._charStart(n-1)];
 }

  @property dchar front() const
 {
 assert(_data.length, "Attempting to fetch the front of an empty string");
 size_t i = 0;
 return decode(_data, i);
 }

  @property dchar back() const
 {
 immutable n = _data.length;
 assert(n, "Attempting to fetch the back of an empty string");
 auto idx = _charStart(n-1);
 return std.utf.decode(_data, idx);
 }
There is the alternative of deferring decoding to the user and returning T[]'s holding exactly 1 code-point instead of dchars. I'm not sure which is best, but I'd be interested in seeing a case for choosing one or the other.
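The alternative could be sketched as a free function (hypothetical name, built on std.utf.stride) that hands back exactly the code units of the front code point and leaves decoding to the caller:

```d
import std.utf : stride;

// Hypothetical alternative to a decoding front(): return a slice
// holding exactly the code units of the first code point.
inout(char)[] frontUnits(inout(char)[] data)
{
    assert(data.length, "empty string");
    return data[0 .. stride(data, 0)];
}

void main()
{
    auto s = "h\u00E9llo"; // "héllo"; é occupies two code units in UTF-8
    assert(frontUnits(s) == "h");
    assert(frontUnits(s[1 .. $]) == "\u00E9"); // two code units, one code point
}
```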
  @property bool empty() const
 {
 return !_data.length;
 }

  @property typeof(this) save()
 {
 return this;
 }

 // support read-only random access via code unit index.
 dchar opIndex(size_t idx)
 {
 idx = _charStart(idx);
 return std.utf.decode(_data, idx);
 }

 string_t opSlice()
 {
 return this;
 }

 string_t opSlice(size_t start, size_t end)
 {
 if(start != _data.length)
 start = _charStart(start);
 if(end != _data.length)
 end = _charStart(end);
 return string_t(_data[start..end]);
 }

 // note we don't call this length because length can be assumed to be the
 // number of elements, which this isn't.
  @property size_t codeUnits() const
 {
 return _data.length;
 }
I don't see a need for _charStart, opIndex, opSlice and codeUnits. If the underlying T[] can be returned by a property, then these can be done through the code-unit array, which is random-access.
 // support append and concat
 // TODO: need to support appending various types of strings to each other
 // (right now only same-type strings can be appended, or raw arrays)
 ref string_t opOpAssign(string op, U)(U data) if (op == "~" && is(U ==
 string_t))
 {
 _data ~= data._data;
 return this;
 }

 ref string_t opOpAssign(string op, U)(U data) if (op == "~" && !is(U ==
 string_t) && is(typeof(_data ~= U.init)))
 {
 _data ~= data;
 return this;
 }

 string_t opBinary(string op, U)(U data) if (op == "~" && is(U == string_t))
 {
 return string_t(_data ~ data._data);
 }

 string_t opBinary(string op, U)(U data) if (op == "~" && !is(U ==
 string_t) && is(typeof(_data ~ U.init)))
 {
 return string_t(_data ~ data);
 }
 }

 template string_t(T) if (is(Unqual!T == dchar))
 {
 alias T[] string_t;
 }

 /** begin test code **/
 import std.stdio;

 alias string_t!(immutable char) mystring;
 alias string_t!(immutable wchar) mywstring;
 alias string_t!(immutable dchar) mydstring;

 void main()
 {
 auto str = mystring("hello");
 str ~= " world";
 str ~= mystring("!!!");
 writeln(str._data);
 }
Jan 12 2011
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 12 Jan 2011 04:49:26 -0500, Steven Wawryk <stevenw acres.com.au>  
wrote:

 I like the direction you're taking but have some quibbles about details.  
   Specifically, I'd go for a more complete separation into random-access  
 code-unit ranges and bidirectional code-point ranges:
Thanks for taking the time. I will respond to your points, but please make your rebuttals to the new thread I'm about to create with an updated string type.
 Is there a reason not to include is(Unqual!T == dchar)?
It's defined at the bottom as a simple alias to T[], since you do not need these features for a dchar[] string (it's correct to treat a dchar string as an array).
 An opAssign from a T[] could facilitate conversion back and forth  
 between code-point and code-unit ranges.
Yes, that would be a good feature.
 There is the alternative of deferring decoding to the user and returning  
 T[]'s holding exactly 1 code-point instead of dchars.  I'm not sure  
 which is best, but I'd be interested in seeing a case for choosing one  
 or the other.
Definitely decoding and returning dchar is best. The whole point of this string type is to handle the decoding for you. If you want to do decoding manually, you can just stick with arrays.
 I don't see a need for _charStart, opIndex, opSlice and codeUnits.  If  
 the underlying T[] can be returned by a property, then these can be done  
 through the code-unit array, which is random-access.
But that puts extra pain on the user for not much reason. Currently, strings slice in one operation; you are proposing that we slice in three operations: 1. get the underlying array, 2. slice it, 3. reconstruct a string based on the slice. Plus, if you remove opIndex, you are restricting the usefulness of the range. Note that this string type already will decode dchars out of the front and back; why not just give that ability to the middle of the string?
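The ergonomic difference can be shown with a minimal stand-in struct (simplified from the posted string_t; the data property exposing the code units is hypothetical, and _charStart handling is omitted since only the call shape matters here):

```d
// Minimal stand-in for string_t, just enough to compare slicing styles.
struct S
{
    private immutable(char)[] _data;
    this(immutable(char)[] d) { _data = d; }
    @property immutable(char)[] data() { return _data; } // hypothetical accessor
    S opSlice(size_t a, size_t b) { return S(_data[a .. b]); }
}

void main()
{
    auto str = S("hello world");

    // One operation, with opSlice kept on the type:
    auto sub = str[1 .. 5];

    // Three operations, if slicing is pushed onto the code-unit array:
    auto units = str.data;      // 1. fetch the underlying array
    auto part  = units[1 .. 5]; // 2. slice the code units
    auto sub2  = S(part);       // 3. wrap the slice back up

    assert(sub.data == sub2.data);
}
```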
 template string_t(T) if (is(Unqual!T == dchar))
 {
 alias T[] string_t;
 }
Note the definition of string_t!dchar here ^^^^ -Steve
Jan 13 2011