digitalmars.D - [review] new string type (take 2)

Steven Schveighoffer (402/402) Jan 13 2011 Based on the suggestions of others, I have updated my string type.

Steven Wawryk (18/39) Jan 13 2011 Same for both.

Steven Schveighoffer (19/66) Jan 14 2011 Exactly, that is what my string type allows. You can either do it at th...

Andrei Alexandrescu (4/66) Jan 14 2011 I equally respectfully disagree. I think random access is defined as

Steven Schveighoffer (11/29) Jan 15 2011 I'd actually go as far as saying that any container you can do a quick

Andrei Alexandrescu (6/36) Jan 15 2011 Here you are trying to equate a dictionary (map/hashtable) with an

Steven Schveighoffer (17/56) Jan 15 2011 I see a lot of similarity. Both have sparsely populated indexes. Both ...

Andrei Alexandrescu (21/78) Jan 15 2011 I think this is way off. We're comparing a highly structured topology

Steven Wawryk (22/30) Jan 16 2011 Andrei,

Andrei Alexandrescu (21/43) Jan 16 2011 One thing that I don't think is understood is that Walter is the key

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

Based on the suggestions of others, I have updated my string type.

changes:

* opApply for iterating with foreach w/ index
* indexing the string at an invalid location (i.e. not the start of a code  
point) throws a RangeError (does that make sense, or should it be an  
exception?).
* charStart is now public so you can use it to ensure you are accessing  
the start of a code point
* validIdx new function that tells you if your index is at the start of a  
code point.
* data property which gets the underlying T[]
* Added free functions for string_t!dchar to make it have the same  
properties as the other string types.
* Added an ability to assign to a T[] array for ease of use.
* Added a ptr property so it works seamlessly with code that currently  
uses strings (but we still need $, however it appears this isn't  
implemented yet in dmd).
* fully documented

Here it is:


// Written in the D programming language.

/**
Copyright: Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010
License:   <a href="http://www.boost.org/LICENSE_1_0.txt">Boost License  
1.0</a>.
Authors:   $(WEB erdani.org, Andrei Alexandrescu, Steven Schveighoffer)

Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010.
Distributed under the Boost Software License, Version 1.0.
    (See accompanying file LICENSE_1_0.txt or copy at
          http://www.boost.org/LICENSE_1_0.txt)
*/
import std.utf;
import std.traits;
import core.exception;

// BUG
version = norangeopapplyoverload;

/**
  * Narrow string type.  This string implements utf-8 or utf-16 depending  
on the
  * character type.
  *
  * The string type is a bi-directional range of dchars, with a monotonic
  * indexing scheme.  Essentially, because dchars are encoded at variable
  * widths, indexing and slicing based on dchars would be an O(n) operation.
  * Therefore, we allow indexing and slicing, but based on the 'code-unit'  
or
  * unit of encoding (wchar or char).  This means some indexes are valid and
  * some are not (those which do not point to the start of an encoded dchar  
are
  * not).
  *
  * While this might seem confusing, it is very rare that one needs  
arbitrary
  * index access.  In order to achieve this you can either ensure to  
yourself
  * that the string's code-unit to dchar ratio is 1 to 1 (never more than  
one
  * code-unit to encode a dchar), or you can use the charStart funciton to
  * ensure a valid index.
  */
struct string_t(T) if (is(Unqual!T == char) || is(Unqual!T == wchar))
{
     private T[] _data;

     /**
      * Constructor, builds a string based on given array data.
      */
     this(T[] data)
     {
         this._data = data;
     }

     /**
      * Provide access to underlying array data.
      */
      property T[] data()
     {
         return _data;
     }

     /**
      * forward access to the ptr part of the array.  This allows the most
      * efficient operations when one knows what he is doing.
      */
      property T* ptr()
     {
         return _data.ptr;
     }


     /**
      * Finds the largest valid index in the string that is <= idx.
      * Essentially, this can be used to convert arbitrary indexes into  
valid
      * indexes.
      */
     size_t charStart(size_t idx) const
     {
         static if(is(Unqual!T == wchar))
         {
             immutable c = _data.ptr[idx];
             if(c >= 0xDC00 && c <= 0xDFFF)
             {
                 // surrogate pair detected, verify we have at least 2  
wchars,
                 // and that both wchars are properly encoded.
                 assert(idx > 0, "Invalid UTF character at beginning of  
string");
                 return idx-1;
             }
             else
                 return idx;
         }
         else
         {
             const p = _data.ptr + idx;
             if ((p[0] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx;
             }
             else if (idx >= 1 && (p[-1] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx - 1;
             }
             else if (idx >= 2 && (p[-2] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx - 2;
             }
             else if (idx >= 3 && (p[-3] & 0b1100_0000) != 0b1000_0000)
             {
                 return idx - 3;
             }
             else
             {
                 assert(false, "Invalid UTF character in string");
             }
         }
     }

     /**
      * Returns true if the given index starts an encoded dchar.
      */
     bool validIdx(size_t idx)
     {
         if(idx >= _data.length)
         {
             if(idx is _data.length)
                 return true; // index one beyond the string is valid.
             return false;
         }
         immutable c = _data[idx];
         static if(is(Unqual!T == wchar))
         {
             // make sure this isn't the second character of a surrogate  
pair
             return (c < 0xDC00 || c > 0xDFFF);
         }
         else // char
         {
             return ((c & 0b1100_0000) != 0b1000_0000);
         }
     }

     /**
      * remove the first code-point from the string.
      */
     void popFront()
     {
         auto nc = std.utf.stride(_data, 0);
         assert(nc <= _data.length && nc != 0xFF, "Invalid sequence at  
beginning of string");
         _data = _data[nc .. $];
     }

     /**
      * Remove the last code-point from the string.
      */
     void popBack()
     {
         immutable n = _data.length;
         assert(n, "Attempting to pop back of an empty string");
         _data = _data.ptr[0..charStart(n-1)];
     }

     /**
      * Get the first code-point in the string
      */
      property dchar front() const
     {
         assert(_data.length, "Attempting to fetch the front of an empty  
string");
         size_t i = 0;
         return decode(_data, i);
     }

     /**
      * Get the last code-point in the string
      */
      property dchar back() const
     {
         immutable n = _data.length;
         assert(n, "Attempting to fetch the back of an empty string");
         auto idx = charStart(n-1);
         return std.utf.decode(_data, idx);
     }

     /**
      * Does the string contain any data?
      */
      property bool empty() const
     {
         return !_data.length;
     }

     /**
      * Copy the string (trivial function, needed for range definitions)
      */
      property typeof(this) save()
     {
         return this;
     }

     /**
      * support read-only random access via code unit index.
      *
      * Note that an invalid idx (one that does not start a code point)  
results
      * in an exception
      */
     dchar opIndex(size_t idx)
     {
         if(idx is _data.length || !validIdx(idx))
             throw new RangeError(__FILE__, __LINE__);

         return std.utf.decode(_data, idx);
     }

     /**
      * slice the whole string
      */
     string_t opSlice()
     {
         return this;
     }

     /**
      * Slice based on valid start and end indexes.
      *
      * Throws RangeError if start or end are not valid indexes
      */
     string_t opSlice(size_t start, size_t end)
     {
         if(end < start || !validIdx(start) || !validIdx(end))
             throw new RangeError(__FILE__, __LINE__);
         return string_t(_data[start..end]);
     }

     /**
      * Get the number of code units in the string.
      *
      * Note that this is specifically not called length because length
      * generally implies the number of elements in a range.  Since dchar  
is our
      * element type, and the number of dchars cannot be determined in O(1)
      * time, using the name length would be incorrect.
      */
      property size_t codeUnits() const
     {
         return _data.length;
     }

     /**
      * Append a string to this string.
      *
      * TODO: support appending any string type to this string type.
      */
     ref string_t opOpAssign(string op, U)(U data) if (op == "~" && is(U ==  
string_t))
     {
         _data ~= data._data;
         return this;
     }

     /**
      * Support appending any types that the underlying array can support.
      */
     ref string_t opOpAssign(string op, U)(U data) if (op == "~" && !is(U  
== string_t) && is(typeof(_data ~= U.init)))
     {
         _data ~= data;
         return this;
     }

     /**
      * Concatenation between two strings
      */
     string_t opBinary(string op, U)(U data) if (op == "~" && is(U ==  
string_t))
     {
         return string_t(_data ~ data._data);
     }

     /**
      * Support any concatenation that is supported by the underlying data
      * array.
      */
     string_t opBinary(string op, U)(U data) if (op == "~" && !is(U ==  
string_t) && is(typeof(_data ~ U.init)))
     {
         return string_t(_data ~ data);
     }

     // note, this should not be required, it should use the range  
interface but
     // the compiler doesn't allow both to coexist for foreach.
     version(norangeopapplyoverload)
     {
         /**
          * Foreach with just dchars.  Note, you cannot actually change the
          * data, despite the argument being ref.  opApply requires ref.
          */
         int opApply(scope int delegate(ref dchar d) dg)
         {
             dchar d;
             size_t idx = 0;
             immutable len = _data.length;
             int result = 0;
             while(result == 0 && idx < len)
             {
                 d = std.utf.decode(_data, idx);
                 result = dg(d);
             }
             return result;
         }
     }

     /**
      * Foreach over the string with accompanied index.
      *
      * Note, the refs are required for foreach, you cannot change the  
index or
      * the data in the string.
      */
     int opApply(scope int delegate(ref size_t idx, ref dchar d) dg)
     {
         dchar d;
         size_t idx = 0;
         immutable len = _data.length;
         int result = 0;
         while(result == 0 && idx < len)
         {
             size_t tmpidx = idx;
             d = std.utf.decode(_data, idx);
             result = dg(tmpidx, d);
         }
         return result;
     }

     /**
      * Assign a string
      */
     string_t opAssign(U)(U u) if(is(U == string_t))
     {
         this._data = u._data;
         return this;
     }

     /**
      * Assign a string from another type
      */
     string_t opAssign(U)(U u) if (!is(U == string_t) && is(typeof(_data =  
u)))
     {
         _data = u;
         return this;
     }
}

/**
  * String type for dchar
  */
template string_t(T) if (is(Unqual!T == dchar))
{
     alias T[] string_t;
}

// support string functions for dchar that aren't already defined.

// TODO: do we need this one?
// TODO: should be inout instead of a template
 property T[] data(T)(T[] t) if (is(Unqual!T == dchar))
{
     return t;
}

/**
  * Finds the largest valid index in the string that is <= idx.
  * Essentially, this can be used to convert arbitrary indexes into valid
  * indexes.
  */
size_t charStart(const(dchar)[] t, size_t idx)
{
     return idx;
}

/**
  * Returns true if the given index starts an encoded dchar.
  */
bool validIdx(const(dchar)[] t, size_t idx)
{
     return idx <= t.length;
}

// TODO: do we need this one?
 property size_t codeUnits(const(dchar)[] t)
{
     return t.length;
}

/** begin test code **/
import std.stdio;

alias string_t!(immutable char) mystring;
alias string_t!(immutable wchar) mywstring;
alias string_t!(immutable dchar) mydstring;

void main()
{
     auto str = mystring("hello");
     foreach(dchar d; str) { }
     str ~= " world";
     str ~= mystring("!!!");
     writeln(str.data);
     mystring str2 = "blah blah";
     str2 = str;
     str2 = "blah blah";
}

Jan 13 2011

Steven Wawryk <stevenw acres.com.au> writes:

On 14/01/11 02:25, Steven Schveighoffer wrote:
 On Wed, 12 Jan 2011 04:49:26 -0500, Steven Wawryk <stevenw acres.com.au>
 wrote:

 I like the direction you're taking but have some quibbles about
 details. Specifically, I'd go for a more complete separation into
 random-access code-unit ranges and bidirectional code-point ranges:

 Thanks for taking the time. I will respond to your points, but please
 make your rebuttals to the new thread I'm about to create with an
 updated string type.

 I don't see a need for _charStart, opIndex, opSlice and codeUnits. If
 the underlying T[] can be returned by a property, then these can be
 done through the code-unit array, which is random-access.

 But that puts extra pain on the user for not much reason. Currently,
 strings slice in one operation, you are proposing that we slice in three
 operations:

 1. get the underlying array

myString vs myString.data

 2. slice it

Same for both.

 3. reconstruct a string based on the slice.

myOtherString = find(myString, 'x');
vs
myOtherString = find(myString.data, 'x');

You may see extra pain.  I see extra control.  The user is making it 
explicit at what level (code-unit/code-point/grapheme/whatever) of range 
he/she wants the called algorithm to be working on.

 Plus, if you remove opIndex, you are restricting the usefulness of the
 range. Note that this string type already will decode dchars out of the
 front and back, why not just give that ability to the middle of the 

string?

Because at the code-point level it *isn't* a random-access range and the 
index makes no sense at the code-point level, only at the code-unit 
level.  It's encouraging the confusion of 2 distinctly different 
abstractions or "views" of the same data.  All the slicing and indexing 
you're artificially putting in the code-point range is already available 
in the code-unit range, and its only benefit is to allow the user to 
save typing ".data".

- other Steve

Jan 13 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 13 Jan 2011 23:03:35 -0500, Steven Wawryk <stevenw acres.com.au>  
wrote:

 On 14/01/11 02:25, Steven Schveighoffer wrote:
  > On Wed, 12 Jan 2011 04:49:26 -0500, Steven Wawryk  
 <stevenw acres.com.au>
  > wrote:
  >
  >>
  >> I like the direction you're taking but have some quibbles about
  >> details. Specifically, I'd go for a more complete separation into
  >> random-access code-unit ranges and bidirectional code-point ranges:
  >
  > Thanks for taking the time. I will respond to your points, but please
  > make your rebuttals to the new thread I'm about to create with an
  > updated string type.
  >
  >> I don't see a need for _charStart, opIndex, opSlice and codeUnits. If
  >> the underlying T[] can be returned by a property, then these can be
  >> done through the code-unit array, which is random-access.
  >
  > But that puts extra pain on the user for not much reason. Currently,
  > strings slice in one operation, you are proposing that we slice in  
 three
  > operations:
  >
  > 1. get the underlying array

 myString vs myString.data

  > 2. slice it

 Same for both.

  > 3. reconstruct a string based on the slice.

 myOtherString = find(myString, 'x');
 vs
 myOtherString = find(myString.data, 'x');

 You may see extra pain.  I see extra control.  The user is making it  
 explicit at what level (code-unit/code-point/grapheme/whatever) of range  
 he/she wants the called algorithm to be working on.

Exactly, that is what my string type allows.  You can either do it at the  
code-point (and probably grapheme, discussion in progress) level, or you  
can do it at the code-unit level.  I don't see how restricting the user to  
only doing it at the code-unit level is not more painful.

  > Plus, if you remove opIndex, you are restricting the usefulness of the
  > range. Note that this string type already will decode dchars out of  
 the
  > front and back, why not just give that ability to the middle of the  
 string?

 Because at the code-point level it *isn't* a random-access range and the  
 index makes no sense at the code-point level, only at the code-unit  
 level.  It's encouraging the confusion of 2 distinctly different  
 abstractions or "views" of the same data.  All the slicing and indexing  
 you're artificially putting in the code-point range is already available  
 in the code-unit range, and its only benefit is to allow the user to  
 save typing ".data".

I respectfully disagree.  A stream built on fixed-sized units, but with  
variable length elements, where you can determine the start of an element  
in O(1) time given a random index absolutely provides random-access.  It  
just doesn't provide length.

You are also forgetting one thing, the main reason why a string type is  
better than the array -- it's possible to slice a code-unit array using  
indexes that create an invalid range.  With my type it is not possible to  
do that (it throws an exception).  We want the basic user to use strings  
properly (and inform them of their errors at the site of the error), and  
if an advanced user wants more control, they can jump down to the  
code-unit level by accessing the data property.

 - other Steve

hehe, you can be Steve' :)

-Steve

Jan 14 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/14/11 5:06 AM, Steven Schveighoffer wrote:
 On Thu, 13 Jan 2011 23:03:35 -0500, Steven Wawryk <stevenw acres.com.au>
 wrote:

 On 14/01/11 02:25, Steven Schveighoffer wrote:
 On Wed, 12 Jan 2011 04:49:26 -0500, Steven Wawryk

 <stevenw acres.com.au>
 wrote:

 I like the direction you're taking but have some quibbles about
 details. Specifically, I'd go for a more complete separation into
 random-access code-unit ranges and bidirectional code-point ranges:

 Thanks for taking the time. I will respond to your points, but please
 make your rebuttals to the new thread I'm about to create with an
 updated string type.

 I don't see a need for _charStart, opIndex, opSlice and codeUnits. If
 the underlying T[] can be returned by a property, then these can be
 done through the code-unit array, which is random-access.

 But that puts extra pain on the user for not much reason. Currently,
 strings slice in one operation, you are proposing that we slice in

 three
 operations:

 1. get the underlying array

 myString vs myString.data

 2. slice it

 Same for both.

 3. reconstruct a string based on the slice.

 myOtherString = find(myString, 'x');
 vs
 myOtherString = find(myString.data, 'x');

 You may see extra pain. I see extra control. The user is making it
 explicit at what level (code-unit/code-point/grapheme/whatever) of
 range he/she wants the called algorithm to be working on.

 Exactly, that is what my string type allows. You can either do it at the
 code-point (and probably grapheme, discussion in progress) level, or you
 can do it at the code-unit level. I don't see how restricting the user
 to only doing it at the code-unit level is not more painful.

 Plus, if you remove opIndex, you are restricting the usefulness of the
 range. Note that this string type already will decode dchars out of the
 front and back, why not just give that ability to the middle of the

 string?

 Because at the code-point level it *isn't* a random-access range and
 the index makes no sense at the code-point level, only at the
 code-unit level. It's encouraging the confusion of 2 distinctly
 different abstractions or "views" of the same data. All the slicing
 and indexing you're artificially putting in the code-point range is
 already available in the code-unit range, and its only benefit is to
 allow the user to save typing ".data".

 I respectfully disagree. A stream built on fixed-sized units, but with
 variable length elements, where you can determine the start of an
 element in O(1) time given a random index absolutely provides
 random-access. It just doesn't provide length.

I equally respectfully disagree. I think random access is defined as 
accessing the ith element in O(1) time. That's not the case here.

Andrei

Jan 14 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Fri, 14 Jan 2011 12:03:28 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/14/11 5:06 AM, Steven Schveighoffer wrote:
 On Thu, 13 Jan 2011 23:03:35 -0500, Steven Wawryk <stevenw acres.com.au>
 wrote:

 Because at the code-point level it *isn't* a random-access range and
 the index makes no sense at the code-point level, only at the
 code-unit level. It's encouraging the confusion of 2 distinctly
 different abstractions or "views" of the same data. All the slicing
 and indexing you're artificially putting in the code-point range is
 already available in the code-unit range, and its only benefit is to
 allow the user to save typing ".data".

 I respectfully disagree. A stream built on fixed-sized units, but with
 variable length elements, where you can determine the start of an
 element in O(1) time given a random index absolutely provides
 random-access. It just doesn't provide length.

 I equally respectfully disagree. I think random access is defined as  
 accessing the ith element in O(1) time. That's not the case here.

I'd actually go as far as saying that any container you can do a quick  
lookup given an index (O(lgn) or better) is random-access.  For example, a  
dictionary file of english words sorted alphabetically can look up any  
word definition in O(lgn) time, but you can't look up the nth word in  
O(lgn) time.

What do you call this type of container if not random access?  Indexed  
maybe?  Whatever term you call it, I'd call it that, and say opIndex  
belongs for that container.

-Steve

Jan 15 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/15/11 11:15 CST, Steven Schveighoffer wrote:
 On Fri, 14 Jan 2011 12:03:28 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 1/14/11 5:06 AM, Steven Schveighoffer wrote:
 On Thu, 13 Jan 2011 23:03:35 -0500, Steven Wawryk <stevenw acres.com.au>
 wrote:

 Because at the code-point level it *isn't* a random-access range and
 the index makes no sense at the code-point level, only at the
 code-unit level. It's encouraging the confusion of 2 distinctly
 different abstractions or "views" of the same data. All the slicing
 and indexing you're artificially putting in the code-point range is
 already available in the code-unit range, and its only benefit is to
 allow the user to save typing ".data".

 I respectfully disagree. A stream built on fixed-sized units, but with
 variable length elements, where you can determine the start of an
 element in O(1) time given a random index absolutely provides
 random-access. It just doesn't provide length.

 I equally respectfully disagree. I think random access is defined as
 accessing the ith element in O(1) time. That's not the case here.

 I'd actually go as far as saying that any container you can do a quick
 lookup given an index (O(lgn) or better) is random-access. For example,
 a dictionary file of english words sorted alphabetically can look up any
 word definition in O(lgn) time, but you can't look up the nth word in
 O(lgn) time.

Here you are trying to equate a dictionary (map/hashtable) with an 
otherwise unstructured variable length encoded array. I don't think 
there is much similarity.

 What do you call this type of container if not random access? Indexed
 maybe? Whatever term you call it, I'd call it that, and say opIndex
 belongs for that container.

 -Steve

I'm not sure how you can call it, but it's definitely not random access.


Andrei

Jan 15 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Sat, 15 Jan 2011 13:19:34 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 1/15/11 11:15 CST, Steven Schveighoffer wrote:
 On Fri, 14 Jan 2011 12:03:28 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 1/14/11 5:06 AM, Steven Schveighoffer wrote:
 On Thu, 13 Jan 2011 23:03:35 -0500, Steven Wawryk  
 <stevenw acres.com.au>
 wrote:

 Because at the code-point level it *isn't* a random-access range and
 the index makes no sense at the code-point level, only at the
 code-unit level. It's encouraging the confusion of 2 distinctly
 different abstractions or "views" of the same data. All the slicing
 and indexing you're artificially putting in the code-point range is
 already available in the code-unit range, and its only benefit is to
 allow the user to save typing ".data".

 I respectfully disagree. A stream built on fixed-sized units, but with
 variable length elements, where you can determine the start of an
 element in O(1) time given a random index absolutely provides
 random-access. It just doesn't provide length.

 I equally respectfully disagree. I think random access is defined as
 accessing the ith element in O(1) time. That's not the case here.

 I'd actually go as far as saying that any container you can do a quick
 lookup given an index (O(lgn) or better) is random-access. For example,
 a dictionary file of english words sorted alphabetically can look up any
 word definition in O(lgn) time, but you can't look up the nth word in
 O(lgn) time.

 Here you are trying to equate a dictionary (map/hashtable) with an  
 otherwise unstructured variable length encoded array. I don't think  
 there is much similarity.

I see a lot of similarity.  Both have sparsely populated indexes.  Both  
cannot determine their length unless you save it separately.  Both are  
encoded on an array of fixed-length elements.  Both store their data  
sorted by index.

The difference is that the index is a size_t for one and a string for the  
other.

 What do you call this type of container if not random access? Indexed
 maybe? Whatever term you call it, I'd call it that, and say opIndex
 belongs for that container.

 -Steve

 I'm not sure how you can call it, but it's definitely not random access.

The former complaint was "because this != random access, opIndex doesn't  
belong."

If we can get past the terminology of what this is, I think at least  
opSlice belongs and I still think opIndex belongs.  Can you not see the  
value of 'give me the element at index X' giving you back a full element  
and not just a code-unit?

If opIndex doesn't belong unless it's a random access container (i.e.  
array), then we should get rid of the operator for hashes and sorted trees.

-Steve

Jan 15 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/15/11 2:32 PM, Steven Schveighoffer wrote:
 On Sat, 15 Jan 2011 13:19:34 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 1/15/11 11:15 CST, Steven Schveighoffer wrote:
 On Fri, 14 Jan 2011 12:03:28 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 1/14/11 5:06 AM, Steven Schveighoffer wrote:
 On Thu, 13 Jan 2011 23:03:35 -0500, Steven Wawryk
 <stevenw acres.com.au>
 wrote:

 Because at the code-point level it *isn't* a random-access range and
 the index makes no sense at the code-point level, only at the
 code-unit level. It's encouraging the confusion of 2 distinctly
 different abstractions or "views" of the same data. All the slicing
 and indexing you're artificially putting in the code-point range is
 already available in the code-unit range, and its only benefit is to
 allow the user to save typing ".data".

 I respectfully disagree. A stream built on fixed-sized units, but with
 variable length elements, where you can determine the start of an
 element in O(1) time given a random index absolutely provides
 random-access. It just doesn't provide length.

 I equally respectfully disagree. I think random access is defined as
 accessing the ith element in O(1) time. That's not the case here.

 I'd actually go as far as saying that any container you can do a quick
 lookup given an index (O(lgn) or better) is random-access. For example,
 a dictionary file of english words sorted alphabetically can look up any
 word definition in O(lgn) time, but you can't look up the nth word in
 O(lgn) time.

 Here you are trying to equate a dictionary (map/hashtable) with an
 otherwise unstructured variable length encoded array. I don't think
 there is much similarity.

 I see a lot of similarity. Both have sparsely populated indexes. Both
 cannot determine their length unless you save it separately. Both are
 encoded on an array of fixed-length elements. Both store their data
 sorted by index.

 The difference is that the index is a size_t for one and a string for
 the other.

I think this is way off. We're comparing a highly structured topology 
with a simple linear encoding.

 What do you call this type of container if not random access? Indexed
 maybe? Whatever term you call it, I'd call it that, and say opIndex
 belongs for that container.

 -Steve

 I'm not sure how you can call it, but it's definitely not random access.

 The former complaint was "because this != random access, opIndex doesn't
 belong."

 If we can get past the terminology of what this is, I think at least
 opSlice belongs and I still think opIndex belongs. Can you not see the
 value of 'give me the element at index X' giving you back a full element
 and not just a code-unit?

I do see the value. I think it's essential to reckon that (1) that is 
not random access, (2) that allows people to write wrong code because 
they believe they have random access.

 If opIndex doesn't belong unless it's a random access container (i.e.
 array), then we should get rid of the operator for hashes and sorted trees.

The existence of opIndex for hashes and ordered trees is not being 
contended, and throws red herrings into the discussion.

It's all very simple. At the end of the day, a string type must make a 
Boolean decision: the user is aware of an underlying variable-length 
encoding, or not.

Allow me to enumerate the ways I know that hide the variable encoding 
100% from the user:

1. Expose a bidirectional range.

Now, for a variety of reasons, it is efficient to have access to the 
index in the supporting structure for the encoding. That's great, but it 
should be understood that offering such access allows people who don't 
understand the underlying variable-length encoding to write wrong code.

Bottom line: you can't abstract away the variable-length encoding and 
offer indexes at the same time.


Andrei

Jan 15 2011

Steven Wawryk <stevenw acres.com.au> writes:

Andrei,

I note that your specific criticisms of Steve's proposal all relate to 
the "random-access" of the bidirectional code-point range.  This is 
essentially my objection too.

As you are the key person to determine whether it is adopted, what do 
you think of the remainder of his proposal?

Apart from the indexing and slicing of code-points I see a lot of merit 
to the proposal, especially that it would mean that std.algorithm would 
not need to special-case for strings any more.  It would need only be 
concerned about whether it has a random-access range or a bidirectional 
range.  The user could choose whether to pass to std.algorithm any of:

* a random-access code-unit range
* a bidirectional code-point range
* a bidirectional grapheme range, etc according to the outcome of the 
other discussions

In the interest of moving this on, would it become acceptable to you if:

1. indexing and slicing of the code-point range were removed?
2. any additional ranges are exposed to the user according to decisions 
made about graphemes, etc?
3. other constructive criticisms were accommodated?

Steve


On 15/01/11 03:33, Andrei Alexandrescu wrote:
 On 1/14/11 5:06 AM, Steven Schveighoffer wrote:
 I respectfully disagree. A stream built on fixed-sized units, but with
 variable length elements, where you can determine the start of an
 element in O(1) time given a random index absolutely provides
 random-access. It just doesn't provide length.

 I equally respectfully disagree. I think random access is defined as
 accessing the ith element in O(1) time. That's not the case here.

 Andrei

Jan 16 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/16/11 6:35 PM, Steven Wawryk wrote:
 Andrei,

 I note that your specific criticisms of Steve's proposal all relate to
 the "random-access" of the bidirectional code-point range. This is
 essentially my objection too.

 As you are the key person to determine whether it is adopted, what do
 you think of the remainder of his proposal?

One thing that I don't think is understood is that Walter is the key 
person here. This is a compiler and druntime change, and a big one.

My only possible role is that I can call Walter on Skype with a strong 
case. Walter has a strong "wanking detector". This whole changing the 
string type, which is working well, with a type that has its own 
problems, breaks a lot of code, and at the end of the day makes small 
improvements - that will trigger the wanking detector for sure, and I 
simply don't have enough good arguments to counter that.

 Apart from the indexing and slicing of code-points I see a lot of merit
 to the proposal, especially that it would mean that std.algorithm would
 not need to special-case for strings any more.

This is a misunderstanding. std.algorithm needs no special casing for 
strings. It uses isXxxRange which work correctly for strings. Functions 
in std.algorithm are special-cased for strings when they could gain in 
efficiency. This is because there is no VLERange abstraction.

 It would need only be
 concerned about whether it has a random-access range or a bidirectional
 range. The user could choose whether to pass to std.algorithm any of:

 * a random-access code-unit range
 * a bidirectional code-point range
 * a bidirectional grapheme range, etc according to the outcome of the
 other discussions

 In the interest of moving this on, would it become acceptable to you if:

 1. indexing and slicing of the code-point range were removed?
 2. any additional ranges are exposed to the user according to decisions
 made about graphemes, etc?
 3. other constructive criticisms were accommodated?

 Steve

I believe the proposed scheme:

1. Changes the language in a major way;

2. Is highly disruptive;

3. Improves the status quo in only minor ways.

I'd be much more willing to improve things by e.g. defining the 
representation() function I talked about a bit ago, and other less 
disruptive additions.


Andrei

Jan 16 2011

D Programming

C/C++ Programming

Other

digitalmars.D - [review] new string type (take 2)