digitalmars.D - Fixing std.string

dsimcha <dsimcha yahoo.com> writes:

As I mentioned buried deep in another thread, std.string is in serious need of
fixing, for two reasons:

1.  Most of it doesn't work with UTF-16/UTF-32 strings.

2.  Much of it requires the input to be immutable even when there's no good
reason for this constraint.

I'm trying to understand a few things before I dive into fixing it:

1.  How did it get to be this way?  Why did it seem like a good idea at the
time to only support UTF-8 and only immutable strings?

2.  Is there any "deep" design/technical issue that makes these hard to fix,
or is it basically just lack of manpower and other priorities?

3.  Is there any good reason to avoid just templating everything to work with
all 9 string types (mutable/const/immutable char/wchar/dchar[]) or whatever
subset is reasonable for the given function?

Aug 19 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 08/19/2010 09:22 PM, dsimcha wrote:
 As I mentioned buried deep in another thread, std.string is in serious need of
 fixing, for two reasons:

 1.  Most of it doesn't work with UTF-16/UTF-32 strings.

 2.  Much of it requires the input to be immutable even when there's no good
 reason for this constraint.

Absolutely. Thanks for looking into this!

 I'm trying to understand a few things before I dive into fixing it:

 1.  How did it get to be this way?  Why did it seem like a good idea at the
 time to only support UTF-8 and only immutable strings?

I don't know - my guess is that UTF-8 is widespread in English-speaking 
countries and this is one.

 2.  Is there any "deep" design/technical issue that makes these hard to fix,
 or is it basically just lack of manpower and other priorities?

The latter. I wanted to get to this for the longest time, and I think 
it's awesome that you're looking into it.

 3.  Is there any good reason to avoid just templating everything to work with
 all 9 string types (mutable/const/immutable char/wchar/dchar[]) or whatever
 subset is reasonable for the given function?

There's no reason. But I hope we'd go a step further:

a) Aggressively make everything string-specific more general and move it 
into std.algorithm.

b) After (a) ideally std.string should contain only a modicum of 
string-specific stuff such as case and whitespace information. I believe 
the functionality of the following functions could easily be generalized 
and moved to std.algorithm or std.range, perhaps consolidated with 
existing functionality and under a different name: cmp, indexOf, 
lastIndexOf, repeat, join, split, stripl, stripr, strip, chomp, 
chompPrefix, replace, replaceSlice, insert, count, maketrans, translate, 
squeeze, munch, succ, tr.

The other functions (or certain overloads of the above) stay put in 
std.string and should be indeed templated by input with the constraint

if (isSomeString!Str)

or better yet allow any input, forward, or bidirectional range (as the 
algorithm needs) constrained by

if (isXxxRange!R && is(ElementType!R : dchar)).
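
A minimal sketch of what the two constraints above could look like in practice. The function names countCodePoint and countCodePointInRange are hypothetical and are not Phobos functions; isInputRange stands in for the "isXxxRange" placeholder.

```d
import std.range;                 // ElementType, isInputRange, empty/front/popFront
import std.traits : isSomeString;

// Hypothetical helper (not part of Phobos): counts occurrences of a code
// point in any mutable/const/immutable char/wchar/dchar array.
size_t countCodePoint(Str)(Str s, dchar needle)
if (isSomeString!Str)
{
    size_t n = 0;
    foreach (dchar c; s)          // foreach with dchar decodes UTF-8/16/32
        if (c == needle) ++n;
    return n;
}

// Hypothetical range-based variant: accepts any input range whose elements
// convert implicitly to dchar.
size_t countCodePointInRange(R)(R r, dchar needle)
if (isInputRange!R && is(ElementType!R : dchar))
{
    size_t n = 0;
    for (; !r.empty; r.popFront())
        if (r.front == needle) ++n;
    return n;
}

unittest
{
    char[] mutableUtf8 = "héllo".dup;   // mutable UTF-8
    wstring utf16 = "héllo"w;           // immutable UTF-16
    assert(countCodePoint(mutableUtf8, 'l') == 2);
    assert(countCodePoint(utf16, 'é') == 1);
    assert(countCodePointInRange("héllo"d, 'l') == 2);
}
```

The two functions have different names here only to avoid overload ambiguity, since strings also satisfy the range constraint.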

Thanks again for looking into this, it's important and rewarding work.


Andrei

Aug 19 2010

Russel Winder <russel russel.org.uk> writes:

On Fri, 2010-08-20 at 02:22 +0000, dsimcha wrote:
[ . . . ]
 1.  How did it get to be this way?  Why did it seem like a good idea at the
 time to only support UTF-8 and only immutable strings?

But isn't the thinking these days that immutable strings are a good
thing?

Immutability is generally a good thing for all parallel, and indeed
concurrent, computations.

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

Aug 19 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Russel Winder wrote:
 On Fri, 2010-08-20 at 02:22 +0000, dsimcha wrote:
 [ . . . ]
 1.  How did it get to be this way?  Why did it seem like a good idea at the
 time to only support UTF-8 and only immutable strings?

 
 But isn't the thinking these days that immutable strings are a good
 thing?
 
 Immutability is generally a good thing for all parallel, and indeed
 concurrent, computations.

Hey Russel,

The idea is for the algorithms to impose as little as possible on their 
inputs. If you're searching for a character in a string you wouldn't care 
whether the string is mutable or not - the algorithm is the same. Currently 
many algorithms in std.string require (a) immutable and (b) UTF-8 strings as 
inputs. Either or both limitations should be relaxed as much as possible.

char[] thisIsMutable = new char[100];
wchar[] thisIsMutableW = new wchar[100];
...
assert(indexOf(thisIsMutable, "abc") != -1); // should work
assert(indexOf(thisIsMutableW, "abc") != -1); // should work
assert(indexOf(thisIsMutableW, "abc"w) != -1); // even this should work
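
A rough sketch of a search that imposes neither immutability nor a particular encoding on its inputs. The name indexOfGeneric is hypothetical; this is one possible shape built on std.algorithm's startsWith, not the std.string implementation.

```d
import std.algorithm : startsWith;
import std.range;                 // empty, popFront for arrays
import std.traits : isSomeString;

// Hypothetical helper: returns the code-unit index of the first occurrence
// of needle in haystack, or -1. The two arguments may differ in mutability
// and in character width.
ptrdiff_t indexOfGeneric(S1, S2)(S1 haystack, S2 needle)
if (isSomeString!S1 && isSomeString!S2)
{
    auto rest = haystack;
    while (true)
    {
        // startsWith compares decoded code points, so widths may differ.
        if (rest.startsWith(needle))
            return cast(ptrdiff_t)(haystack.length - rest.length);
        if (rest.empty)
            return -1;
        rest.popFront();          // advances by one whole code point
    }
}

unittest
{
    char[] m = "xxabc".dup;
    wchar[] w = "xxabc"w.dup;
    assert(indexOfGeneric(m, "abc") == 2);
    assert(indexOfGeneric(w, "abc") == 2);    // UTF-16 haystack, UTF-8 needle
    assert(indexOfGeneric(w, "abc"w) == 2);
    assert(indexOfGeneric(m, "zzz") == -1);
}
```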


Andrei

Aug 20 2010

Ezneh <petitv.isat gmail.com> writes:

There's also this in std.string which requires a fix :

http://d.puremagic.com/issues/show_bug.cgi?id=4673

Aug 20 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Ezneh wrote:
 There's also this in std.string which requires a fix :
 
 http://d.puremagic.com/issues/show_bug.cgi?id=4673

Sure. On the face of it, I think isNumeric is a silly function because 
the effort expended on doing a good prediction is almost the same as 
doing the actual conversion - so why not just try it.

I guess

   return collectException(to!real(input)) is null;

should be a fine replacement for isNumeric.
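
For illustration, that one-liner could be wrapped as follows. The function name looksNumeric is hypothetical, chosen so as not to clash with std.string.isNumeric, and the accepted syntax is whatever to!real accepts, which may differ from the existing isNumeric.

```d
import std.conv : to;
import std.exception : collectException;

// Hypothetical wrapper: true when the whole input converts to real.
// collectException returns null when the expression does not throw.
bool looksNumeric(const(char)[] input)
{
    return collectException(to!real(input)) is null;
}

unittest
{
    assert(looksNumeric("3.14"));
    assert(looksNumeric("-2e10"));
    assert(!looksNumeric("abc"));
    assert(!looksNumeric(""));
}
```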


Andrei

Aug 20 2010

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:
 Sure. On the face of it, I think isNumeric is a silly function because 
 the effort expended on doing a good prediction is almost the same as 
 doing the actual conversion - so why not just try it.

The difference is that a well designed isNumeric doesn't need to use
exceptions, this may make it faster (never forget that DMD exceptions are
something like 12 times slower than Java-Sun ones). On the other hand I think
isNumeric() is not used in situations where high performance is needed.


 I guess
    return collectException(to!real(input)) is null;
 should be a fine replacement for isNumeric.

Do you mean that such ugly replacement is meant to be used in user code, or do
you mean to replace the contents of the isNumeric() function with that code?

Bye,
bearophile

Aug 20 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 08/20/2010 07:21 AM, bearophile wrote:
 Andrei Alexandrescu:
 Sure. On the face of it, I think isNumeric is a silly function because
 the effort expended on doing a good prediction is almost the same as
 doing the actual conversion - so why not just try it.

 The difference is that a well designed isNumeric doesn't need to use
exceptions, this may make it faster (never forget that DMD exceptions are
something like 12 times slower than Java-Sun ones). On the other hand I think
isNumeric() is not used in situations where high performance is needed.

Good point.

 I guess
     return collectException(to!real(input)) is null;
 should be a fine replacement for isNumeric.

 Do you mean that such ugly replacement is meant to be used in user code, or do
you mean to replace the contents of the isNumeric() function with that code?

Put it in the body.


Andrei

Aug 20 2010

bearophile <bearophileHUGS lycos.com> writes:

bearophile Wrote:
 The difference is that a well designed isNumeric doesn't need to use
 exceptions,

A possible design is to use only one template function, like:

private auto _realConvert(bool useExceptions)(string txt) {
    ...
    if (some_error_condition) {
        static if (useExceptions)
            throw new ConversionError(...);
        else
            return false;
    }
    ...
    static if (useExceptions)
        return result;
    else
        return true;
}

And then create isNumeric() and to!real() calling _realConvert!(false) and
_realConvert!(true).

(But maybe the simple implementation with collectException(to!real(input)) is
enough because it's uncommon to use isNumeric where speed matters a lot).
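
A compilable sketch of that single-template design, reduced to unsigned decimal digits to keep it short. All names (parseDigits, isDigitsOnly, toUlong) are hypothetical, a plain Exception stands in for the conversion error, and overflow is deliberately ignored.

```d
import std.exception : assertThrown;

// One implementation, two behaviours selected at compile time.
// With useExceptions == true it returns the parsed value or throws;
// with useExceptions == false it returns only a bool and never throws.
private auto parseDigits(bool useExceptions)(const(char)[] txt)
{
    ulong result = 0;
    bool bad = txt.length == 0;
    foreach (char c; txt)
    {
        if (c < '0' || c > '9') { bad = true; break; }
        result = result * 10 + (c - '0');   // overflow not checked in this sketch
    }
    if (bad)
    {
        static if (useExceptions)
            throw new Exception("not an unsigned decimal number");
        else
            return false;
    }
    static if (useExceptions)
        return result;
    else
        return true;
}

bool isDigitsOnly(const(char)[] txt) { return parseDigits!false(txt); }
ulong toUlong(const(char)[] txt)     { return parseDigits!true(txt); }

unittest
{
    assert(isDigitsOnly("123"));
    assert(!isDigitsOnly("12a"));
    assert(!isDigitsOnly(""));
    assert(toUlong("42") == 42);
    assertThrown(toUlong("4x2"));
}
```

Each instantiation compiles only one of the static if branches, so the return type is deduced as bool in one case and ulong in the other.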

Bye,
bearophile

Aug 20 2010

Ezneh <petitv.isat gmail.com> writes:

Andrei Alexandrescu Wrote:

 
 I guess
 
    return collectException(to!real(input)) is null;
 
 should be a fine replacement for isNumeric.
 
 
 Andrei


I found another way to improve the isNumeric function.

I'm doing it with an (ugly) regular expression. It works very well, but maybe
we could (should!) improve it a bit more.


See the attached file for how I did it.
There are some asserts and "old tests". I have probably forgotten something, but
this can be a new base for the isNumeric function.


Give me any feedback and tell me what you think about it.

Aug 20 2010

Jonathan M Davis <jmdavisprog gmail.com> writes:

On Thursday 19 August 2010 23:27:33 Russel Winder wrote:
 On Fri, 2010-08-20 at 02:22 +0000, dsimcha wrote:
 [ . . . ]
 
 1.  How did it get to be this way?  Why did it seem like a good idea at
 the time to only support UTF-8 and only immutable strings?

 
 But isn't the thinking these days that immutable strings are a good
 thing?
 
 Immutability is generally a good thing for all parallel, and indeed
 concurrent, computations.

Oh, the immutability can definitely be a good thing. That's why string is 
immutable(char)[]. However, forcing people to use string instead of the other 
possible string types is unnecessarily restrictive. There are cases where you 
can't use immutable stuff or where it's inefficient to do so. Making 
std.string handle all of the various string types as much as possible makes 
it much more flexible. But since string, wstring, and dstring are all 
immutable, most string processing will likely be on immutable. It's just that 
you won't be forced to do it that way if you want to take advantage of 
std.string.

- Jonathan M Davis

Aug 20 2010

Michael Rynn <michaelrynn optusnet.com.au> writes:

On Fri, 20 Aug 2010 02:22:56 +0000, dsimcha wrote:

 As I mentioned buried deep in another thread, std.string is in serious
 need of fixing, for two reasons:
 
 1.  Most of it doesn't work with UTF-16/UTF-32 strings.
 
 2.  Much of it requires the input to be immutable even when there's no
 good reason for this constraint.
 
 I'm trying to understand a few things before I dive into fixing it:
 
 1.  How did it get to be this way?  Why did it seem like a good idea at
 the time to only support UTF-8 and only immutable strings?
 
 2.  Is there any "deep" design/technical issue that makes these hard to
 fix, or is it basically just lack of manpower and other priorities?
 

The problems are combinatorial, because of encoding schemes.
I imagine that when someone wants a function that is missing from 
std.string, they might write one, and might even add to it.

I also found std.utf to not contain exactly what I needed.
The functions toUTF16 and toUTF8 have signatures like
wstring toUTF16(const(dchar)[] s).

But when hacking a class I found I wanted functions that
would almost have the very same innards, but could also append mutable 
character arrays of any sort.

// Does almost the same as toUTF16, but creates or appends to a mutable
// array.

void append_UTF16m(ref wchar[] r, const(dchar)[] s) {...}

At the expense of another nested function call, which I imagine most 
people would not want to pay, toUTF16 becomes a call to append_UTF16m.

wstring toUTF16(const(dchar)[] s)
{
	wchar[] temp = null;
	append_UTF16m(temp, s);
	return assumeUnique(temp);
}
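
One way the elided body could be filled in, as a sketch only. It assumes std.utf.encode, which appends the UTF-16 encoding of a code point to a wchar[]; the names appendUTF16 and toUTF16Sketch are hypothetical, chosen to avoid clashing with the real std.utf.toUTF16.

```d
import std.exception : assumeUnique;
import std.utf : encode;

// Appends the UTF-16 encoding of s to r, creating r if it is null.
void appendUTF16(ref wchar[] r, const(dchar)[] s)
{
    foreach (dchar c; s)
        encode(r, c);             // appends one or two code units per code point
}

// The immutable-returning version becomes a thin wrapper.
wstring toUTF16Sketch(const(dchar)[] s)
{
    wchar[] temp = null;
    appendUTF16(temp, s);
    return assumeUnique(temp);    // safe: temp has no other references
}

unittest
{
    assert(toUTF16Sketch("héllo"d) == "héllo"w);
    // A code point outside the BMP needs a surrogate pair (2 code units).
    assert(toUTF16Sketch("a\U0001F600"d).length == 3);

    wchar[] buf = "x"w.dup;
    appendUTF16(buf, "yz"d);
    assert(buf == "xyz"w);
}
```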



For me, isNumeric required a parsing function, since I was religiously 
trying to use ranges and needed to know what sort of conversion function to 
call afterwards. I know it's really simple-minded, but it did the required job.

enum NumberClass {
	NUM_ERROR = -1,
	NUM_EMPTY,
	NUM_INTEGER,
	NUM_REAL
}

/// R is an input range, P is an output range (put).
/// Return a NumberClass value.
/// Collect characters in P for later processing.
/// Does not handle NaN or INF, only checks for error, empty, integer, or real.
/// E or e might be an exponent, or just the end of a number.

NumberClass
getNumberString(R, P)(R ipt, P opt, int recurse = 0 )
{
int   digitct = 0;
bool  done = ipt.empty;
bool  decPoint = false;
for(;;)
{
  if (ipt.empty)
    break;
  auto test = ipt.front;
  ipt.popFront;
  switch(test)
  {
  case '-':
  case '+':
    if (digitct > 0)
    {
      done = true;
    }
    break;
  case '.':
    if (!decPoint)
      decPoint = true;
    else
      done = true;
    break;
  default:
    if (!isdigit(test))
    {
      done = true;
      if (test == 'e' || test == 'E')
      {
        // Ambiguous end of number, or exponent?
        if (recurse == 0)
        {
          opt.put(test);
          if (getNumberString(ipt, opt, recurse+1) == NumberClass.NUM_INTEGER)
            return NumberClass.NUM_REAL;
          else 
            return NumberClass.NUM_ERROR;
        }
        // assume end of number
      }
    }
    else
      digitct++;
    break;
  }
  if (done)
    break;
  opt.put(test);
}

if (digitct == 0)
	return NumberClass.NUM_EMPTY;
if (decPoint)
	return NumberClass.NUM_REAL;
return NumberClass.NUM_INTEGER;
}


A string class.
http://dsource.org/projects/xmlp/trunk/alt/ustring.d

The component structures maintain a terminating null character and 
pretend it is not there. It seemed a good idea at the time when I was 
doing a lot of windows API calls which expected null terminated C-strings 
of char or wchar. The UString class does conversions on accessing cstr(), 
wstr() or dstr(), on the assumption that last used will be most frequent, 
and ideally caches a decent hash value.  I only have some limited uses of 
UString so far, because character arrays are so powerful.

struct cstext {
	char[]	  str_ = null;
...
}


struct wstext {
	wchar[]	  str_ = null;
...
}

struct dstext {
	dchar[]	  str_ = null;
...
}

class UString {

	private {
		union {
			vstruc vstr; // not fully supported?
			cstext cstr;
			wstext wstr;
			dstext dstr;
		}

		UStringType ztype;
		hash_t		hash_;
	}

	...

Aug 23 2010

Jonathan M Davis <jmdavisprog gmail.com> writes:

On Monday 23 August 2010 23:16:25 Michael Rynn wrote:
 The problems are combinatorial, because of encoding schemes.
 I imagine that when someone wants a function that is missing from
 std.string, they might write one, and might even add to it.

A lot of functions in Phobos are templated on string type, so you don't have to 
define multiple versions of them. Very few, if any, are actually defined for 
multiple string types. Now, because each template instantiation results in 
another version of the function in the resulting binary, if you try and use all 
of the functions with all of the string types, then you do get combinatorial 
problems. But thanks to the templates, you don't have to worry about it 
directly, and it's not like it's going to be a typical use case for most string 
functions to be used by multiple string types in the same program. It will 
happen, but not enough to generally be an issue.

- Jonathan M Davis

Aug 24 2010

Norbert Nemec <Norbert Nemec-online.de> writes:

On 20/08/10 03:22, dsimcha wrote:
 3.  Is there any good reason to avoid just templating everything to work with
 all 9 string types (mutable/const/immutable char/wchar/dchar[]) or whatever
 subset is reasonable for the given function?

Wouldn't it be sufficient to take const as input? IIRC, both mutable and 
immutable can be implicitly converted to const and this is exactly the 
purpose that const is designed for: data that I can't change but that 
other code may be able to change. Or am I mixing something up here?

Aug 24 2010

"Simen kjaeraas" <simen.kjaras gmail.com> writes:

Norbert Nemec <Norbert nemec-online.de> wrote:

 On 20/08/10 03:22, dsimcha wrote:
 3.  Is there any good reason to avoid just templating everything to  
 work with
 all 9 string types (mutable/const/immutable char/wchar/dchar[]) or  
 whatever
 subset is reasonable for the given function?

 Wouldn't it be sufficient to take const as input? IIRC, both mutable and  
 immutable can be implicitly converted to const and this exactly the  
 purpose that const is designed for: data that I can't change but that  
 other code may be able to change. Or am I mixing something up here?

What should the functions return, then? If the output is always
const(char)[], I need to cast it to make it immutable(char)[] or char[],
and casting is an unsafe operation.
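
To make the return-type problem concrete: D's inout qualifier lets a single function carry the caller's qualifier through to the result, so no cast is needed. The sketch below assumes a compiler with working inout support, and skipSpaces is a hypothetical function, not something from std.string.

```d
// The result has the same mutability as the argument:
// char[] in gives char[] out, string in gives string out.
inout(char)[] skipSpaces(inout(char)[] s)
{
    size_t i = 0;
    while (i < s.length && s[i] == ' ')
        ++i;
    return s[i .. $];
}

unittest
{
    char[] m = "  ab".dup;
    string im = "  ab";
    const(char)[] c = im;

    static assert(is(typeof(skipSpaces(m)) == char[]));
    static assert(is(typeof(skipSpaces(im)) == string));
    static assert(is(typeof(skipSpaces(c)) == const(char)[]));

    assert(skipSpaces(m) == "ab");
    assert(skipSpaces(im) == "ab");
}
```

Templating on the string type achieves the same effect at the cost of one instantiation per type.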

-- 
Simen

Aug 24 2010
