
digitalmars.D - Fixing std.string

dsimcha <dsimcha yahoo.com> writes:
As I mentioned buried deep in another thread, std.string is in serious need of
fixing, for two reasons:

1.  Most of it doesn't work with UTF-16/UTF-32 strings.

2.  Much of it requires the input to be immutable even when there's no good
reason for this constraint.

I'm trying to understand a few things before I dive into fixing it:

1.  How did it get to be this way?  Why did it seem like a good idea at the
time to only support UTF-8 and only immutable strings?

2.  Is there any "deep" design/technical issue that makes these hard to fix,
or is it basically just lack of manpower and other priorities?

3.  Is there any good reason to avoid just templating everything to work with
all 9 string types (mutable/const/immutable char/wchar/dchar[]) or whatever
subset is reasonable for the given function?
Aug 19 2010
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/19/2010 09:22 PM, dsimcha wrote:
 As I mentioned buried deep in another thread, std.string is in serious need of
 fixing, for two reasons:

 1.  Most of it doesn't work with UTF-16/UTF-32 strings.

 2.  Much of it requires the input to be immutable even when there's no good
 reason for this constraint.

Absolutely. Thanks for looking into this!
 I'm trying to understand a few things before I dive into fixing it:

 1.  How did it get to be this way?  Why did it seem like a good idea at the
 time to only support UTF-8 and only immutable strings?

I don't know - my guess is that UTF-8 is widespread in English-speaking countries and this is one.
 2.  Is there any "deep" design/technical issue that makes these hard to fix,
 or is it basically just lack of manpower and other priorities?

The latter. I wanted to get to this for the longest time, and I think it's awesome that you're looking into it.
 3.  Is there any good reason to avoid just templating everything to work with
 all 9 string types (mutable/const/immutable char/wchar/dchar[]) or whatever
 subset is reasonable for the given function?

There's no reason. But I hope we'd go a step further:

a) Aggressively make everything string-specific more general and move it into std.algorithm.

b) After (a), ideally std.string should contain only a modicum of string-specific stuff such as case and whitespace information.

I believe the functionality of the following functions could easily be generalized and moved to std.algorithm or std.range, perhaps consolidated with existing functionality and under a different name: cmp, indexOf, lastIndexOf, repeat, join, split, stripl, stripr, strip, chomp, chompPrefix, replace, replaceSlice, insert, count, maketrans, translate, squeeze, munch, succ, tr.

The other functions (or certain overloads of the above) stay put in std.string and should indeed be templated on the input type with the constraint if (isSomeString!Str) or, better yet, allow any input, forward, or bidirectional range (as the algorithm needs) constrained by if (isXxxRange!R && is(ElementType!R : dchar)).

Thanks again for looking into this, it's important and rewarding work.

Andrei
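
A minimal sketch of the range-constrained style described above, assuming a recent Phobos in which std.range.primitives and std.uni are available. countWhite is a hypothetical helper written only for illustration (it is not a Phobos function); the point is that one body serves mutable, const, and immutable strings of any character width.

```d
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;
import std.uni : isWhite;

/// Hypothetical helper (not in Phobos): counts whitespace code points in
/// any input range whose elements implicitly convert to dchar.
size_t countWhite(R)(R r)
if (isInputRange!R && is(ElementType!R : dchar))
{
    size_t n = 0;
    for (; !r.empty; r.popFront())
    {
        if (isWhite(r.front))
            ++n;
    }
    return n;
}

unittest
{
    char[] m = "a b\tc".dup;        // mutable UTF-8
    wstring w = "a b\tc"w;          // immutable UTF-16
    const(dchar)[] d = "a b\tc"d;   // const UTF-32
    assert(countWhite(m) == 2);
    assert(countWhite(w) == 2);
    assert(countWhite(d) == 2);
}
```

The constraint only asks for what the loop actually uses (empty, front, popFront), so nothing about mutability or encoding leaks into the signature.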
Aug 19 2010
Russel Winder <russel russel.org.uk> writes:

On Fri, 2010-08-20 at 02:22 +0000, dsimcha wrote:
[ . . . ]
 1.  How did it get to be this way?  Why did it seem like a good idea at the
 time to only support UTF-8 and only immutable strings?

But isn't the thinking these days that immutable strings are a good thing? Immutability is generally a good thing for all parallel, and indeed concurrent, computations.

-- 
Russel.
Aug 19 2010
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Russel Winder wrote:
 On Fri, 2010-08-20 at 02:22 +0000, dsimcha wrote:
 [ . . . ]
 1.  How did it get to be this way?  Why did it seem like a good idea at the
 time to only support UTF-8 and only immutable strings?

But isn't the thinking these days that immutable strings are a good thing? Immutability is generally a good thing for all parallel, and indeed concurrent, computations.

Hey Russel,

The idea is for the algorithms to impose as little as possible on their inputs. If you're searching for a character in a string you wouldn't care whether the string is mutable or not - the algorithm is the same. Currently many algorithms in std.string require (a) immutable and (b) UTF-8 strings as inputs. Either or both limitations should be relaxed as much as possible.

    char[] thisIsMutable = new char[100];
    wchar[] thisIsMutableW = new wchar[100];
    ...
    assert(indexOf(thisIsMutable, "abc") != -1); // should work
    assert(indexOf(thisIsMutableW, "abc") != -1); // should work
    assert(indexOf(thisIsMutableW, "abc"w) != -1); // even this should work

Andrei
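
For illustration, a sketch of the relaxed interface in use, assuming a current Phobos in which std.string.indexOf is templated on both character types. The function name checkMutableIndexOf and the array contents are made up for this example; the arrays hold real text so that the asserts also pass at run time.

```d
import std.string : indexOf;

/// Illustrative only: exercises indexOf on mutable UTF-8 and UTF-16 arrays.
void checkMutableIndexOf()
{
    char[] thisIsMutable = "xxabcxx".dup;     // mutable UTF-8
    wchar[] thisIsMutableW = "xxabcxx"w.dup;  // mutable UTF-16

    assert(indexOf(thisIsMutable, "abc") == 2);    // char[] haystack, string needle
    assert(indexOf(thisIsMutableW, "abc") == 2);   // wchar[] haystack, string needle
    assert(indexOf(thisIsMutableW, "abc"w) == 2);  // wchar[] haystack, wstring needle
}

unittest
{
    checkMutableIndexOf();
}
```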
Aug 20 2010
Ezneh <petitv.isat gmail.com> writes:
There's also this in std.string which requires a fix :

http://d.puremagic.com/issues/show_bug.cgi?id=4673
Aug 20 2010
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Ezneh wrote:
 There's also this in std.string which requires a fix :
 
 http://d.puremagic.com/issues/show_bug.cgi?id=4673

Sure. On the face of it, I think isNumeric is a silly function because the effort expended on doing a good prediction is almost the same as doing the actual conversion - so why not just try it. I guess

    return collectException(to!real(input)) is null;

should be a fine replacement for isNumeric.

Andrei
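
A sketch of that one-liner wrapped in a function, assuming std.conv.to and std.exception.collectException from a recent Phobos. The name looksNumeric is hypothetical (chosen to avoid clashing with the real isNumeric), and by construction it accepts exactly what to!real accepts, nothing more.

```d
import std.conv : to;
import std.exception : collectException;

/// Hypothetical stand-in for isNumeric: attempt the conversion and report
/// whether it completed without throwing.
bool looksNumeric(const(char)[] input)
{
    // collectException returns the caught exception, or null if none was thrown.
    return collectException(to!real(input)) is null;
}

unittest
{
    assert(looksNumeric("3.14"));
    assert(looksNumeric("-2e10"));
    assert(!looksNumeric("abc"));
    assert(!looksNumeric(""));
}
```

Taking const(char)[] means mutable and immutable inputs both work without a cast.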
Aug 20 2010
bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 Sure. On the face of it, I think isNumeric is a silly function because 
 the effort expended on doing a good prediction is almost the same as 
 doing the actual conversion - so why not just try it.

The difference is that a well designed isNumeric doesn't need to use exceptions, this may make it faster (never forget that DMD exceptions are something like 12 times slower than Java-Sun ones). On the other hand I think isNumeric() is not used in situations where high performance is needed.
 I guess
    return collectException(to!real(input)) is null;
 should be a fine replacement for isNumeric.

Do you mean that such an ugly replacement is meant to be used in user code, or do you mean to replace the contents of the isNumeric() function with that code?

Bye,
bearophile
Aug 20 2010
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/20/2010 07:21 AM, bearophile wrote:
 Andrei Alexandrescu:
 Sure. On the face of it, I think isNumeric is a silly function because
 the effort expended on doing a good prediction is almost the same as
 doing the actual conversion - so why not just try it.

The difference is that a well designed isNumeric doesn't need to use exceptions, this may make it faster (never forget that DMD exceptions are something like 12 times slower than Java-Sun ones). On the other hand I think isNumeric() is not used in situations where high performance is needed.

Good point.
 I guess
     return collectException(to!real(input)) is null;
 should be a fine replacement for isNumeric.

Do you mean that such ugly replacement is meant to be used in user code, or do you mean to replace the contents of the isNumeric() function with that code?

Put it in the body. Andrei
Aug 20 2010
bearophile <bearophileHUGS lycos.com> writes:
bearophile Wrote:
 The difference is that a well designed isNumeric doesn't need to use
 exceptions,

A possible design is to use only one template function, like:

    private auto _realConvert(bool useExceptions)(string txt) {
        ...
        if (some_error_condition) {
            static if (useExceptions)
                throw new ConversionError(...);
            else
                return false;
        }
        ...
        static if (useExceptions)
            return result;
        else
            return true;
    }

And then create isNumeric() and to!real() calling _realConvert!(false) and _realConvert!(true).

(But maybe the simple implementation with collectException(to!real(input)) is enough because it's uncommon to use isNumeric where speed matters a lot).

Bye,
bearophile
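
To make the compile-time flag idea concrete, here is a small runnable sketch. It is not the real conversion code: parseDigits, isAllDigits and toLong are hypothetical names, the parser handles only unsigned decimal digits, and it throws a plain Exception rather than any particular Phobos error type.

```d
/// Hypothetical shared implementation: one body, two behaviours chosen at
/// compile time. Overflow is deliberately ignored to keep the sketch short.
private auto parseDigits(bool useExceptions)(const(char)[] txt)
{
    long result = 0;
    if (txt.length == 0)
    {
        static if (useExceptions)
            throw new Exception("empty input");
        else
            return false;
    }
    foreach (char c; txt)
    {
        if (c < '0' || c > '9')
        {
            static if (useExceptions)
                throw new Exception("not a digit");
            else
                return false;
        }
        result = result * 10 + (c - '0');
    }
    static if (useExceptions)
        return result;   // converting variant returns the value
    else
        return true;     // predicate variant only reports success
}

/// Predicate: never throws on malformed input.
bool isAllDigits(const(char)[] txt) { return parseDigits!false(txt); }

/// Converter: throws on malformed input.
long toLong(const(char)[] txt) { return parseDigits!true(txt); }

unittest
{
    import std.exception : assertThrown;
    assert(isAllDigits("123"));
    assert(!isAllDigits("12a"));
    assert(toLong("123") == 123);
    assertThrown(toLong("12a"));
}
```

Because static if discards the unused branch, the predicate instantiation contains no throw at all, which is the speed argument made above.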
Aug 20 2010
Ezneh <petitv.isat gmail.com> writes:

Andrei Alexandrescu Wrote:

 
 I guess
 
    return collectException(to!real(input)) is null;
 
 should be a fine replacement for isNumeric.
 
 
 Andrei

I found another way to improve the isNumeric function. I'm doing it with an (ugly) regular expression; it works very well, but maybe we could (should!) improve it a bit more. See the attached file for how I did it. There are some asserts and "old tests". I have probably forgotten something, but this can be a new base for the isNumeric function. Give me any feedback and tell me what you think about it.
Aug 20 2010
Jonathan M Davis <jmdavisprog gmail.com> writes:
On Thursday 19 August 2010 23:27:33 Russel Winder wrote:
 On Fri, 2010-08-20 at 02:22 +0000, dsimcha wrote:
 [ . . . ]
 
 1.  How did it get to be this way?  Why did it seem like a good idea at
 the time to only support UTF-8 and only immutable strings?

But isn't the thinking these days that immutable strings are a good thing? Immutability is generally a good thing for all parallel, and indeed concurrent, computations.

Oh, the immutability can definitely be a good thing. That's why string is immutable(char)[]. However, forcing people to use string instead of the other possible string types is unnecessarily restrictive. There are cases where you can't use immutable stuff or where it's inefficient to do so. By making std.string handle all of the various string types as much as possible, it makes it much more flexible. But since string, wstring, and dstring are all immutable, most string processing will likely be on immutable. It's just that you won't be forced to do it that way if you want to take advantage of std.string.

- Jonathan M Davis
Aug 20 2010
Michael Rynn <michaelrynn optusnet.com.au> writes:
On Fri, 20 Aug 2010 02:22:56 +0000, dsimcha wrote:

 As I mentioned buried deep in another thread, std.string is in serious
 need of fixing, for two reasons:
 
 1.  Most of it doesn't work with UTF-16/UTF-32 strings.
 
 2.  Much of it requires the input to be immutable even when there's no
 good reason for this constraint.
 
 I'm trying to understand a few things before I dive into fixing it:
 
 1.  How did it get to be this way?  Why did it seem like a good idea at
 the time to only support UTF-8 and only immutable strings?
 
 2.  Is there any "deep" design/technical issue that makes these hard to
 fix, or is it basically just lack of manpower and other priorities?
 

The problems are combinatorial, because of encoding schemes. I imagine that when someone wants a function that is missing from std.string, they might write one, and might even add to it.

I also found std.utf to not contain exactly what I needed. The functions toUTF16 and toUTF8 have signatures like wstring toUTF16(const(dchar)[] s). But when hacking a class I found I wanted functions that would have almost the very same innards, but could also append to mutable character arrays of any sort.

    // Does almost the same as toUTF16, but creates or appends a mutable array.
    void append_UTF16m(ref wchar[] r, const(dchar)[] s) {...}

At the expense of another nested function call, which I imagine most people would not want to pay, toUTF16 becomes a call to append_UTF16m.

    wstring toUTF16(const(dchar)[] s)
    {
        wchar[] temp = null;
        append_UTF16m(temp, s);
        return assumeUnique(temp);
    }

But isNumeric for me required a parsing function, when I was religiously trying to use ranges, and know what sort of conversion function to call afterwards. I know it's really simple-minded, but it did the required job.

    enum NumberClass {
        NUM_ERROR = -1,
        NUM_EMPTY,
        NUM_INTEGER,
        NUM_REAL
    }

    /// R is an input range, P is an output range (put).
    /// Return a NumberClass value.
    /// Collect characters in P for later processing.
    /// Does no NAN or INF, only checks for error, empty, integer, or real.
    /// E or e might be an exponent, or just the end of a number.
    NumberClass getNumberString(R, P)(R ipt, P opt, int recurse = 0)
    {
        int digitct = 0;
        bool done = ipt.empty;
        bool decPoint = false;
        for (;;)
        {
            if (ipt.empty)
                break;
            auto test = ipt.front;
            ipt.popFront;
            switch (test)
            {
            case '-':
            case '+':
                if (digitct > 0)
                {
                    done = true;
                }
                break;
            case '.':
                if (!decPoint)
                    decPoint = true;
                else
                    done = true;
                break;
            default:
                if (!isdigit(test))
                {
                    done = true;
                    if (test == 'e' || test == 'E')
                    {
                        // Ambiguous end of number, or exponent?
                        if (recurse == 0)
                        {
                            opt.put(test);
                            if (getNumberString(ipt, opt, recurse + 1) == NumberClass.NUM_INTEGER)
                                return NumberClass.NUM_REAL;
                            else
                                return NumberClass.NUM_ERROR;
                        }
                        // assume end of number
                    }
                }
                else
                    digitct++;
                break;
            }
            if (done)
                break;
            opt.put(test);
        }
        if (digitct == 0)
            return NumberClass.NUM_EMPTY;
        if (decPoint)
            return NumberClass.NUM_REAL;
        return NumberClass.NUM_INTEGER;
    }

A string class.
http://dsource.org/projects/xmlp/trunk/alt/ustring.d

The component structures maintain a terminating null character and pretend it is not there. It seemed a good idea at the time when I was doing a lot of Windows API calls which expected null terminated C-strings of char or wchar. The UString class does conversions on accessing cstr(), wstr() or dstr(), on the assumption that last used will be most frequent, and ideally caches a decent hash value. I only have some limited uses of UString so far, because character arrays are so powerful.

    struct cstext { char[] str_ = null; ... }
    struct wstext { wchar[] str_ = null; ... }
    struct dstext { dchar[] str_ = null; ... }

    class UString {
        private {
            union {
                vstruc vstr; // not fully supported?
                cstext cstr;
                wstext wstr;
                dstext dstr;
            }
            UStringType ztype;
            hash_t hash_;
        }
        ...
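
The append-then-freeze pattern from the toUTF16 discussion earlier in this post can be sketched as follows, assuming the array-appending std.utf.encode overload and std.exception.assumeUnique from a recent Phobos. The names appendUTF16 and toUTF16Sketch are hypothetical and are not the actual std.utf implementation.

```d
import std.exception : assumeUnique;
import std.utf : encode;

/// Hypothetical helper: appends the UTF-16 encoding of s to a mutable buffer.
void appendUTF16(ref wchar[] r, const(dchar)[] s)
{
    foreach (dchar c; s)
        encode(r, c);   // appends one or two UTF-16 code units to r
}

/// Hypothetical immutable-returning wrapper built on the appending helper.
wstring toUTF16Sketch(const(dchar)[] s)
{
    wchar[] temp;
    appendUTF16(temp, s);
    // temp is the only reference to the buffer, so freezing it is safe here.
    return assumeUnique(temp);
}

unittest
{
    assert(toUTF16Sketch("hello"d) == "hello"w);
    assert(toUTF16Sketch("\U0001F600"d).length == 2);  // surrogate pair

    wchar[] buf = "ab"w.dup;
    appendUTF16(buf, "cd"d);
    assert(buf == "abcd"w);
}
```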
Aug 23 2010
Jonathan M Davis <jmdavisprog gmail.com> writes:
On Monday 23 August 2010 23:16:25 Michael Rynn wrote:
 The problems are combinatorial, because of encoding schemes.
 I imagine that when someone wants a function that is missing from
 std.string, they might write one, and might even add to it.

A lot of functions in Phobos are templated on string type, so you don't have to define multiple versions of them. Very few, if any, are actually defined for multiple string types. Now, because each template instantiation results in another version of the function in the resulting binary, if you try and use all of the functions with all of the string types, then you do get combinatorial problems. But thanks to the templates, you don't have to worry about it directly, and it's not like it's going to be a typical use case for most string functions to be used by multiple string types in the same program. It will happen, but not enough to generally be an issue. - Jonathan M Davis
Aug 24 2010
Norbert Nemec <Norbert Nemec-online.de> writes:
On 20/08/10 03:22, dsimcha wrote:
 3.  Is there any good reason to avoid just templating everything to work with
 all 9 string types (mutable/const/immutable char/wchar/dchar[]) or whatever
 subset is reasonable for the given function?

Wouldn't it be sufficient to take const as input? IIRC, both mutable and immutable can be implicitly converted to const and this is exactly the purpose that const is designed for: data that I can't change but that other code may be able to change. Or am I mixing something up here?
Aug 24 2010
"Simen kjaeraas" <simen.kjaras gmail.com> writes:
Norbert Nemec <Norbert nemec-online.de> wrote:

 On 20/08/10 03:22, dsimcha wrote:
 3.  Is there any good reason to avoid just templating everything to  
 work with
 all 9 string types (mutable/const/immutable char/wchar/dchar[]) or  
 whatever
 subset is reasonable for the given function?

Wouldn't it be sufficient to take const as input? IIRC, both mutable and immutable can be implicitly converted to const and this is exactly the purpose that const is designed for: data that I can't change but that other code may be able to change. Or am I mixing something up here?

What should the functions return, then? If the output is always const(char)[], I need to cast it to make it immutable(char)[] or char[], and casting is an unsafe operation. -- Simen
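
A sketch of how templating on the element type sidesteps the return-type problem, assuming std.traits.isSomeChar. The function skipSpaces is a hypothetical example, not a std.string function: because the qualifier is part of the deduced type C, the slice that comes back has the same mutability as the slice that went in, with no cast.

```d
import std.traits : isSomeChar;

/// Hypothetical example: drops leading ASCII spaces and returns a slice of
/// the same type (qualifier and character width) as its argument.
C[] skipSpaces(C)(C[] s)
if (isSomeChar!C)
{
    size_t i = 0;
    while (i < s.length && s[i] == ' ')
        ++i;
    return s[i .. $];
}

unittest
{
    char[] m = "  abc".dup;
    string im = "  abc";
    const(wchar)[] cw = "  abc"w;

    char[] rm = skipSpaces(m);            // stays mutable
    string ri = skipSpaces(im);           // stays immutable
    const(wchar)[] rc = skipSpaces(cw);   // stays const

    assert(rm == "abc" && ri == "abc" && rc == "abc"w);
    static assert(is(typeof(skipSpaces(im)) == string));
}
```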
Aug 24 2010