digitalmars.D - std.stringbuffer

reply "Janice Caron" <caron800 googlemail.com> writes:
Hi all,

More than one person has complained about the lack of string functions
in Phobos which operate on mutable chars. In the thread titled "Is all
this Invariant ****....", I suggested creating a new module,
std.stringbuffer, to contain two things:

(1) a StringBuffer class
(2) parallel mutable versions of the functions in std.string.

Walter OKed the idea, so it looks like that's a go. To that end, I've
looked through the functions in std.string and sorted them into
different groups. I think it's important to get the API right so
comments are welcome on all of the below:

The following functions are incorrectly declared in std.string because
they are currently declared to take strings, not const(char)[]. They
should be:

	long atoi(in char[] s)
	real atof(in char[] s)
	size_t count(in char[] s, in char[] sub)
	bool inPattern(dchar c, in char[] pattern)
	int inPattern(dchar c, in char[][] patterns)
	size_t countchars(in char[] s, in char[] pattern)
	bool isNumeric(in char[] s, in bool bAllowSep = false)
	size_t column(in char[] str, int tabsize = 8)

The following functions are badly declared in std.string because they
are declared to take and return strings. With the following change,
they become type-agnostic:

	size_t isEmail(in char[] s)
	size_t isURL(in char[] s)

The following function is the /only/ function currently in std.string
which takes an optional mutable buffer to use instead of allocating on
the heap. For consistency, let's put the mutable version into
std.stringbuffer, and let std.string have an invariant version, as
follows:

	string soundex(string s)

The remaining functions go in std.stringbuffer.

The following functions all take an optional mutable buffer as input
into which to write the return value to avoid allocation.

	char[] tolower(in char[] s, char[] buffer=null)
	char[] toupper(in char[] s, char[] buffer=null)
	char[] capitalize(in char[] s, char[] buffer=null)
	char[] capwords(in char[] s, char[] buffer=null)
	char[] repeat(in char[] s, size_t n, char[] buffer=null)
	char[] join(in char[][] words, char[] sep, char[] buffer=null)
	char[] ljustify(in char[] s, int width, char[] buffer=null)
	char[] rjustify(in char[] s, int width, char[] buffer=null)
	char[] center(in char[] s, int width, char[] buffer=null)
	char[] zfill(in char[] s, int width, char[] buffer=null)
	char[] replace(in char[] s, in char[] from, in char[] to, char[] buffer=null)
	char[] replaceSlice(in char[] s, in char[] slice, in char[] replacement, char[] buffer=null)
	char[] insert(in char[] s, size_t index, in char[] sub, char[] buffer=null)
	char[] expandtabs(in char[] str, int tabsize=8, char[] buffer=null)
	char[] entab(in char[] s, int tabsize=8, char[] buffer=null) // in place?
	char[] maketrans(in char[] from, in char[] to, char[] buffer=null)
	char[] translate(in char[] s, in char[] transtab, in char[] delchars, char[] buffer=null)
	char[] succ(in char[] s, char[] buffer=null)
	char[] soundex(in char[] s, char[] buffer=null)
	char[] wrap(in char[] s, int columns = 80, in char[] firstindent = null, in char[] indent = null, int tabsize = 8, char[] buffer=null)
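
The optional-buffer convention in the list above can be illustrated with a minimal sketch. It assumes an ASCII-only conversion, and the name asciiToUpperInto is hypothetical (it is not a Phobos function). If the supplied buffer is large enough the result is written into it and nothing is allocated; otherwise a new array is allocated.

```d
/// Hypothetical ASCII-only variant of the proposed
/// `char[] toupper(in char[] s, char[] buffer=null)`.
char[] asciiToUpperInto(const(char)[] s, char[] buffer = null)
{
    // Reuse the caller's buffer when it is big enough; otherwise allocate.
    char[] result = buffer.length >= s.length
        ? buffer[0 .. s.length]
        : new char[s.length];

    foreach (i, char c; s)
    {
        // Only ASCII letters change; every other code unit is copied as-is.
        result[i] = (c >= 'a' && c <= 'z') ? cast(char)(c - ('a' - 'A')) : c;
    }
    return result;
}
```

The returned slice has the length of the input, so a caller can tell how much of an oversized buffer was actually used.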

The following functions I am uncertain about. They could be declared
to take a mutable buffer as input, consistent with the above. /Or/
they could operate on data in place. Opinions are welcome.

	char[] removechars(in char[] s, in char[] pattern, char[] buffer=null) // in place?
	char[] squeeze(in char[] s, in char[] pattern = null, char[] buffer=null) // in place?
	char[] tr(in char[] str, in char[] from, in char[] to, in char[] modifiers=null, char[] buffer=null) // in place?

The following functions need to be overloaded for both const and mutable input

	char[][] split(char[] s)
	const(char)[][] split(const(char)[] s)
	char[][] split(char[] s, in char[] delim)
	const(char)[][] split(const(char)[] s, in char[] delim)
	char[][] splitlines(char[] s)
	const(char)[][] splitlines(const(char)[] s)

	char[] stripl(char[] s)
	const(char)[] stripl(const(char)[] s)
	char[] stripr(char[] s)
	const(char)[] stripr(const(char)[] s)
	char[] strip(char[] s)
	const(char)[] strip(const(char)[] s)
	char[] chop(char[] s)
	const(char)[] chop(const(char)[] s)
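
For comparison, later versions of D2 can express each of these overload pairs as one declaration by using inout as a qualifier wildcard. This is an assumption about a newer compiler than the one discussed in this thread, where inout was still a synonym for ref. The name striplSketch and the whitespace set are illustrative only.

```d
/// Hypothetical single declaration standing in for the paired
/// mutable/const overloads of stripl.
inout(char)[] striplSketch(inout(char)[] s)
{
    size_t i = 0;
    // Skip leading ASCII whitespace.
    while (i < s.length &&
           (s[i] == ' ' || s[i] == '\t' || s[i] == '\n' || s[i] == '\r'))
        ++i;
    return s[i .. $]; // a slice of the input, so nothing is allocated
}
```

Called with a char[] it returns a char[], with a const(char)[] it returns a const(char)[], and with a string it returns a string.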

Not sure what to do about the following one. AAs of mutable arrays are
notoriously difficult to get bug free. Should we bother with this one?

	char[][char[]] abbrev(in char[][] values) // May be impractical

Finally - what do we all think about the inconsistent capitalization
throughout std.string? (toupper versus toString, capwords versus
endsWith, etc.)
Apr 29 2008
next sibling parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Janice Caron" <caron800 googlemail.com> wrote in message 
news:mailman.508.1209497029.2351.digitalmars-d puremagic.com...
 Hi all,

 More than one person has complained about the lack of string functions
 in Phobos which operate on mutable chars. In the thread titled "Is all
 this Invariant ****....", I suggested creating a new module,
 std.stringbuffer, to contain two things:

 (1) a StringBuffer class

Might I ask why a StringBuffer class would be necessary?
Apr 29 2008
prev sibling next sibling parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
A couple of thoughts:

Janice Caron wrote:
 Hi all,
 
 More than one person has complained about the lack of string functions
 in Phobos which operate on mutable chars. In the thread titled "Is all
 this Invariant ****....", I suggested creating a new module,
 std.stringbuffer, to contain two things:
 
 (1) a StringBuffer class
 (2) parallel mutable versions of the functions in std.string.

I'm with Jarrett here, why the hell do we need a StringBuffer class? 'string' is not a class either, so just use char[]. I would recommend aliasing char[] to 'mstring' (short for mutable string). I think such an alias is more readable than 'char[]'.

Also, is there a reason why these mutable functions shouldn't be in std.string, together with their invariant/const brethren? I don't think it makes sense to have another package if one opts for solution (2).

-- 
Bruno Medeiros - Software Developer, MSc. in CS/E graduate
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Apr 29 2008
next sibling parent reply "Me Here" <p9e883002 sneakemail.com> writes:
Bruno Medeiros wrote:

A couple of thoughts:

std.stringbuffer, to contain two things:

(1) a StringBuffer class
(2) parallel mutable versions of the functions in std.string.

I'm with Jarrett here, why the hell do we need a StringBuffer class? 'string' is not a class either, so just use char[]. I would recommend aliasing char[] to 'mstring' (short for mutable string). I think such an alias is more readable than 'char[]'.

Also, is there a reason why these mutable functions shouldn't be in std.string, together with their invariant/const brethren? I don't think it makes sense to have another package if one opts for solution (2).

As one of those who has requested "a standard library of string functions that accept and return mutable strings, i.e. char[]", I see no reason it should be a class; free functions seem to work just fine. A class would just be bloat.

I would be perfectly happy for these to co-exist in the std.string space. Indeed I would prefer it.

If a separate namespace is deemed /essential/, then I see no reason to go with, and certainly did not "ask for" it to be called, the misleading name of std.StringBuffer. As far as I recall, that was Janice's own suggestion. For preference, if a separate namespace is absolutely necessary, I'd go for:

	std.string.mutable

The right namespace, and it does what it says on the tin.

Further, /if/ I had any input to the design, then the suggestion for me to have to pass in preallocated buffers to accommodate the mutated data if it needs to grow would be scotched forthwith. They should look and work in exactly the same way as the existing v2 std.string functions, taking the same number and order of parameters. Just char[] (or compatible alias) instead of string.

If the buffers need to grow, then allocate space from wherever (I assume the heap) that std.string allocates from now. If they do not change size, then return the original intact. If they shrink, and if the D array internals permit this, then adjust the .length attribute whilst leaving the actual allocation unchanged. That way it is there for use should further mutations cause it to grow again. This also helps prevent heap fragmentation if the functions are called on heap allocated data.

Finally, if the retention of unused but allocated space in an array is a feature of the current design, then I would add a debug time warning indicating when a char[] has had to be grown. These could be used during development to adjust the preallocated size of arrays to be large enough to accommodate all (most? typical?) requirements.

In summary: mutable string functions should do exactly the same as the invariant functions do now, except only reallocate if necessary and (optionally) issue a warning under debug if they have to.

Seems almost as if a template solution could be used, except that I think the additional conditional code would hamper the performance of both instantiations. Unless D's templating is capable of optimising away branches of code that relate to the /other/ type instantiations? I've had no occasion to use templates in D yet, so that might be pie in the sky.

--
Apr 29 2008
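
On the question of whether a template can drop the branches that belong to the other instantiation: D's static if is resolved at compile time, so only the selected branch is compiled into each instantiation. A minimal sketch, assuming ASCII-only uppercasing and a hypothetical name upperSketch:

```d
/// Hypothetical template: one body serving both mutable and
/// const/immutable input. The branch is chosen at compile time.
char[] upperSketch(C)(C[] s)
    if (is(immutable C == immutable char))
{
    static if (is(C == char))
        char[] result = s;      // mutable input: work in place
    else
        char[] result = s.dup;  // const or immutable input: work on a copy

    foreach (ref c; result)
        if (c >= 'a' && c <= 'z')
            c = cast(char)(c - ('a' - 'A'));
    return result;
}
```

The char[] instantiation contains no call to dup at all, and the string instantiation never writes to its argument.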
next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from Janice Caron (caron800 googlemail.com)'s article
 I would support the addition of some function like
     gc.minimise(char[])
 which returned all the unused space following the end of the array
 back to the gc, without any copying of the used part. I wouldn't be
 able to write that though - the gc is not my area of expertise.

This would only work for large arrays I'm afraid, given the GC implementation for D--it uses fixed-size blocks until the block size is 4096 bytes or larger. Also, the shrinking would be done in chunks of 4096 bytes, so a fairly substantial size change would have to occur for anything to happen at all. That said, things get a lot easier if moving the block is allowed. Tango even exposes a GC.realloc() routine which will do this for you. Sean
Apr 30 2008
parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Janice Caron (caron800 googlemail.com)'s article
 2008/4/30 Sean Kelly <sean invisibleduck.org>:
  Tango even exposes
  a GC.realloc() routine which will do this for you.

However, realloc() doesn't promise not to copy, and not copying is the objective. Thanks for all the cool info, but I just think programmers would just feel more "comfortable" if, after they've done all their in-place string manipulations, they can call some minimizing function, even if only to give them a warm fuzzy feeling that they're not wasting any more memory than is necessary.

It's perhaps worth noting here that C++ objects don't typically minimize either. That's why Scott Meyers (?) proposed the idiom: std::vector<T>(myVector).swap(myVector);
 Frankly, it could even be implemented as a do-nothing function. That way,
 at least "blame" for excessive memory use passes from the programmer
 to Phobos, and future gc implementations might do things differently.

Fair enough. Sean
Apr 30 2008
prev sibling next sibling parent reply "Me Here" <p9e883002 sneakemail.com> writes:
Janice Caron wrote:


The name can be anything we want it to be.
...
Except "std.string.anything" :-)

I did laugh. Not quite "any colour you like so long as it's black", but close :)
"std.string" is a module, so it can't also be a package. That's a
limitation of the D language.

Now. This is where you show me up to be nothing but a pretender in this forum. I have no idea what the distinction is between those two in D.
  Finally, if the retention of unused but allocated space in an array is a feature of the current design, then I would add a debug time warning indicating when a char[] has had to be grown. These could be used during development to adjust the preallocated size of arrays to be large enough to accommodate all (most? typical?) requirements.

I would support the addition of some function like gc.minimise(char[]) which returned all the unused space following the end of the array back to the gc, without any copying of the used part. I wouldn't be able to write that though - the gc is not my area of expertise.

I /think/ you may have misunderstood my intent here. Unsurprising cos it was badly outlined. And I'm not at all sure that D works this way.

In, for example, Perl, an array can be pre-sized but then set to be empty. That is, it can have space preallocated to it, but contain nothing. Likewise strings have two length attributes internally.
 - one denotes the length of the contents, as would be returned to the program by the length() function.
 - one indicates the actual length of the ram allocated to it.

This allows, for example, chomp() to simply adjust a number (the program visible length) and do no adjustment or reallocation at all. It can also adjust the left hand end of the contents, effectively foreshortening the string, again without adjusting the allocation.

So visually, a scalar holding a string might at some point in its life look something like: (this ascii art is going to come out a mess on the server but...)

	header
	[ offset    ] |--------+
	[ actualLen ]-------------------------------------------------------------------------->
	[ pgmVisible] |------------------------------------------------>
	[ pointer   ]----v
	                 |
	[][][][][][][the contents the program can see is here][][][][][][]

Basically, it starts out with offset zero and only as much padding (if any) as is required to bring it to suitable alignment. But if you remove characters at the end (chomp or chop) then the padding grows as the content shrinks and nothing is allocated. If you remove characters from the front of the string the offset accommodates that and the allocation doesn't change. And if further mutations expand the string, then these spaces are reused before a new allocation is made.

If, for example, you know you are going to be building a long string up piecewise from small appendages, you can initialise it to some length big enough for the expected final length and then truncate it (assign '' to it) and it will retain its allocation, even though the program visible length is zero. Then, as you add stuff to it, it grows into the allocation.

My point was that /if/ D's arrays have a similar capability, to be preallocated large and empty and grow into the space, then when a mutation requires a reallocation of a mutable array because it has outgrown its original allocation, a debug-enabled warning saying by how much might allow the programmer to preallocate the initial mutable array larger and so avoid reallocation at runtime.

There's a whole heap of speculation about what might be going on inside D that I have no real knowledge of at all.

Note: There is no suggestion here that D should work this way. Only that if it does allow preallocation of array sizes, then a warning when a mutation causes allocation would allow the programmer to best use that facility.

Cheers, b.
--
Apr 30 2008
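
The offset and visible-length bookkeeping described above has a rough counterpart in D slices: a slice is a pointer plus a length, so trimming either end only produces a new pointer/length pair over the same memory. A minimal sketch, with hypothetical helper names dropFront and dropBack:

```d
/// Drops `n` code units from the front without copying.
char[] dropFront(char[] s, size_t n)
{
    return s[n .. $];
}

/// Drops `n` code units from the back without copying.
char[] dropBack(char[] s, size_t n)
{
    return s[0 .. $ - n];
}
```

For example, dropBack(s, 1) on "hello world\n" yields a slice that starts at the same address as s, and dropFront of that by 6 yields "world" starting 6 code units further in.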
next sibling parent reply "Me Here" <p9e883002 sneakemail.com> writes:
Janice Caron wrote:

But <shrugs> - if people don't want StringBuffers, who am I to argue?

What's in a name? Pre-conceptions of other worlds and other tools. Specifically Java. Additionally, the casing suggests a class?

For my part, I simply want string functions that operate on char[]s. Because I perceive that for the type of mutations I am currently doing, Invariant strings would incur too high a cost.

If your StringBuffer concept would accept and manipulate char[]s and not require the instantiation, initialisation and syntax of an object. By which I mean that if, having used a string function upon my char[], I can still apply slice operations to it using the standard syntax. And then apply another string function, and then another slice. Or even, apply a string function to a slice of a larger string and mutate that larger string, in-place through the slice:

    char[] a = ...2000 chars from somewhere.

    char[] field1 = a[ 312 .. 357 ];
    field1.toUpper();

    char[] checksum = a[ $-16 .. $ ];
    checksum = md5hex( a );
    ...

Then I will be very happy. Beyond that, I have no requirements :)

All the stuff about warnings and internal and external lengths was just speculation about what might be going on inside on the basis of what I know, have seen (Perl) and have personally implemented. (Not Perl).

Cheers, b.

Ps. Is there a paper/article/reference on the reasoning behind Invariant strings somewhere?
--
Apr 30 2008
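
The "mutate the larger string through a slice" usage in the post above can be sketched for the ASCII-only case, where every letter keeps its size. The helper name asciiUpperInPlace and the sample data are hypothetical.

```d
/// Hypothetical helper: uppercases ASCII letters in place and leaves
/// every other code unit (including multi-byte UTF-8 sequences) untouched.
void asciiUpperInPlace(char[] s)
{
    foreach (ref c; s)
        if (c >= 'a' && c <= 'z')
            c = cast(char)(c - ('a' - 'A'));
}
```

Because a slice shares memory with the array it was taken from, calling the helper on a[3 .. 9] changes those six code units of a itself and nothing else.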
next sibling parent reply Matti Niemenmaa <see_signature for.real.address> writes:
Janice Caron wrote:
 If any of you have plans to uppercase or lowercase UTF-8 in place,
 forget that now. It just ain't possible. (You can uppercase ASCII,
 UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition,
 is UTF-8).

It's possible that, in some obscure case, you can't uppercase UTF-16 in place either. A code point in the private use area (U+E000 to U+F8FF), which can be represented with one UTF-16 code unit, may uppercase to something in the supplementary private use areas (U+F0000 upwards), whose code points require two UTF-16 code units each. Of course the toUpper function in question must be aware of this configuration of the private use areas. This is an extremely contrived case and I doubt it'll ever come up in practice, anywhere, but in theory it might. <g> -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Apr 30 2008
next sibling parent terranium <spam here.lot> writes:
Matti Niemenmaa Wrote:

 A code point in the private use area (U+E000 to U+F8FF), which can be 
 represented with one UTF-16 code unit, may uppercase to something in the 

does this have any practical use?
Apr 30 2008
prev sibling parent Matti Niemenmaa <see_signature for.real.address> writes:
Janice Caron wrote:
 OK, so private use characters might be a contrived exception. BUT,
 nobody expects toUpper() to acknowledge private use characters. That
 would require a run-time extensibility mechanism which is way beyond
 what toUpper() does now, and likely beyond anything it's ever likely
 to do any time soon.

You're right, of course. I was referring more to some hypothetical toUpper() function rather than one which I would expect to find in any standard library---the generic case of "uppercasing a character" as opposed to std.string[buffer].toUpper. -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
Apr 30 2008
prev sibling next sibling parent terranium <spam here.lot> writes:
Janice Caron Wrote:

 You cannot uppercase in place, because for any given dchar, c, the
 number of UTF-8 bytes required to express c may be different from the
 number of UTF-8 bytes required to express toupper(c).

really?
Apr 30 2008
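
The claim about differing byte counts can be checked with a small sketch. It assumes std.utf.codeLength from Phobos; the helper name utf8Growth is hypothetical, and the lower/upper pairs mentioned below are taken from the Unicode case mappings rather than from this thread.

```d
import std.utf : codeLength;

/// Difference in UTF-8 code units between an upper-case code point and its
/// lower-case counterpart (positive means the upper-case form is longer).
int utf8Growth(dchar lower, dchar upper)
{
    return cast(int) codeLength!char(upper) - cast(int) codeLength!char(lower);
}
```

With this, 'a'/'A' gives 0, dotless i (U+0131) to 'I' gives -1, and turned a (U+0250) to its capital (U+2C6F) gives +1, so an in-place conversion can both shrink and grow.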
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Janice Caron (caron800 googlemail.com)'s article
 2008/4/30 Me Here <p9e883002 sneakemail.com>:
     char[] a = ...2000 chars from somewhere.

     char[] field1 = a[ 312 .. 357 ];
     field1.toUpper();

recent days, but... You cannot uppercase in place, because for any given dchar, c, the number of UTF-8 bytes required to express c may be different from the number of UTF-8 bytes required to express toupper(c). If any of you have plans to uppercase or lowercase UTF-8 in place, forget that now. It just ain't possible. (You can uppercase ASCII, UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition, is UTF-8).

In all fairness, you can uppercase UTF-8 in place so long as none of the characters within the string require a multi-byte capital. Thus one questionable strategy would be to uppercase in place until the first multibyte conversion is required. The obvious downside being that the original buffer may end up partially capitalized, with the fully capitalized result returned in a new buffer. I'm sure people processing ASCII text would love this, but I can see it causing problems elsewhere. Sean
Apr 30 2008
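
A minimal sketch of the strategy described in the post above, simplified so that it falls back at the first non-ASCII code unit rather than at the first conversion that actually changes size. The name upperInPlaceIfPossible is hypothetical; std.uni.toUpper and std.utf.decode are assumed from Phobos.

```d
import std.uni : toUpper;
import std.utf : decode;

/// Mutates in place while only ASCII is seen, and switches to a freshly
/// allocated result at the first non-ASCII code unit.
char[] upperInPlaceIfPossible(char[] s)
{
    foreach (i, ref c; s)
    {
        if (c >= 0x80)
        {
            // s[0 .. i] has already been upper-cased in place; copy it and
            // convert the remainder code point by code point.
            char[] result = s[0 .. i].dup;
            size_t pos = i;
            while (pos < s.length)
                result ~= toUpper(decode(s, pos)); // decode advances pos
            return result;
        }
        if (c >= 'a' && c <= 'z')
            c = cast(char)(c - ('a' - 'A'));
    }
    return s; // pure ASCII: same memory, no allocation
}
```

The downside mentioned in the post is visible here: when the fallback is taken, the caller's original buffer is left partially capitalized.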
prev sibling next sibling parent "Me Here" <p9e883002 sneakemail.com> writes:
Janice Caron wrote:

2008/4/30 Me Here <p9e883002 sneakemail.com>:
     char[] a = ...2000 chars from somewhere.

     char[] field1 = a[ 312 .. 357 ];
     field1.toUpper();

I've kind of lost track of the number of times I've said this in recent days, but... You cannot uppercase in place, because for any given dchar, c, the number of UTF-8 bytes required to express c may be different from the number of UTF-8 bytes required to express toupper(c). If any of you have plans to uppercase or lowercase UTF-8 in place, forget that now. It just ain't possible. (You can uppercase ASCII, UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition, is UTF-8).

Ignoring for the moment Matti's pronouncement that this is an obscure and unlikely event, it really depends upon how the library is coded. For example, if the case change is effected in place for the majority of cases when it can be, when the occasion occurs that it cannot, and raises a runtime exception, catch the error and use replaceSlice to handle it:

    import std.stdio;
    import std.string;

    int main( char[][] args ) {
        char [] s = "the quick brown fox";
        try{
            s[ 8 .. 9 ] = \u1234;
        }
        catch {
            s = s.replaceSlice( s[ 8 .. 9 ], \u1234 );
        }
        writefln( s );
        return 0;
    }

Though it would be (much) nicer if the builtin lvalue slice handled this for us. I was just disappointed for the second time to (re)discover this limitation of D's slicing. I had forgotten because other languages I use do not have it. This is one of those things where I doubt I will ever agree with the decision. But I'm just another jerk on the internet with an opinion, and we all know what that is analogous to.

If the language doesn't handle it, then the library should. If it doesn't, then I will have to. And you, and Bill and Fred and Sue ...

Cheers, b.
--
Apr 30 2008
prev sibling next sibling parent Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Janice Caron wrote:
 2008/4/30 Me Here <p9e883002 sneakemail.com>:
     char[] a = ...2000 chars from somewhere.

     char[] field1 = a[ 312 .. 357 ];
     field1.toUpper();

I've kind of lost track of the number of times I've said this in recent days, but... You cannot uppercase in place, because for any given dchar, c, the number of UTF-8 bytes required to express c may be different from the number of UTF-8 bytes required to express toupper(c). If any of you have plans to uppercase or lowercase UTF-8 in place, forget that now. It just ain't possible. (You can uppercase ASCII, UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition, is UTF-8).

Actually, you can't uppercase UTF-16 and UTF-32 in-place either if you want to be entirely correct. For example: \u00df ("ß") --> \u0053 \u0053 ("SS"). This increases the byte count for both UTF-16 and UTF-32. (This does work for UTF-8 though, since \u00df happens to require 2 UTF-8 code units, and both \u0053s only one each) (See <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt> for what should be a complete list of characters with similar annoying casing properties)
Apr 30 2008
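
The code-unit counts in the ß example can be verified with a short sketch. It assumes std.utf.toUTF16 and std.utf.toUTF32 from Phobos; the helper name unitCounts is hypothetical.

```d
import std.utf : toUTF16, toUTF32;

/// Code-unit counts of a string in UTF-8, UTF-16 and UTF-32, in that order.
size_t[3] unitCounts(string s)
{
    return [s.length, toUTF16(s).length, toUTF32(s).length];
}
```

For "\u00df" the counts are 2, 1 and 1, while for "SS" they are 2, 2 and 2: the UTF-8 form keeps its size, and the UTF-16 and UTF-32 forms each grow by one code unit.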
prev sibling parent Spacen Jasset <spacenjasset yahoo.co.uk> writes:
Janice Caron wrote:
 2008/4/30 Me Here <p9e883002 sneakemail.com>:
     char[] a = ...2000 chars from somewhere.

     char[] field1 = a[ 312 .. 357 ];
     field1.toUpper();

I've kind of lost track of the number of times I've said this in recent days, but... You cannot uppercase in place, because for any given dchar, c, the number of UTF-8 bytes required to express c may be different from the number of UTF-8 bytes required to express toupper(c). If any of you have plans to uppercase or lowercase UTF-8 in place, forget that now. It just ain't possible. (You can uppercase ASCII, UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition, is UTF-8).

I think uppercasing non-ascii (english) characters is more of a specialised business anyway (some languages have no notion of upper case, and yet others depend on context), which often should be performed by a presentation layer. People need a toupper/lower all the time, and 90% of the time they use it on strings that are in the ascii range, often because they deal with protocols, file formats and other such things. In which case phobos's string.toupper shouldn't really be doing work outside of ascii, in my opinion anyway. This also means that a string can be uppercased in place.
May 01 2008
prev sibling next sibling parent Robert Fraser <fraserofthenight gmail.com> writes:
Janice Caron wrote:
 2008/4/30 Me Here <p9e883002 sneakemail.com>:
 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

forum. I have no idea what the distinction is between those two in D.

One is file, the other is a folder. std.string is a file, so it can't also be a folder.
  I /think/ you may have misunderstood my intent here. Unsurprising cos it
 was badly outlined.
  And I'm not at all sure that D works this way.

  In, for example, Perl, an array can be pre-sized but then set to be empty.
  That is, it can have space preallocated to it, but contain nothing.
  Likewise strings have two length attributes internally.
  - one denotes the length of the contents, as woudl be returned to the
 program by the length() function.
  - one indicated the actual length of the ram allocated to it.

Well, that's what a StringBuffer would do, but nobody seemed to like the idea. A string contains two pieces of information: (1) ptr, and (2) length. A StringBuffer would carry a third piece of information: (3) capacity. (Actually, in general it would be Buffer!(T), with StringBuffer just being a special case).

Built-in strings do have a capacity, but it's not carried round in a field. Instead, to find the capacity of an array, you have to call std.gc.capacity(array) - and I can't see how there can not be a performance hit there.

Increasing the length of a D array doesn't necessarily mean reallocating (although as noted above, the code has to do some work to find out the capacity), but it /does/ mean re-initialising the newly exposed elements. Again, that has to be a performance hit. With a Buffer!(), you could increase the length (up to capacity) not only without reallocating but also without reinitializing, just by changing the value of an int.

But <shrugs> - if people don't want StringBuffers, who am I to argue?

I like StringBuffers :-). Did Walter veto the idea completely or did he say "not a class"? I'd use a struct - there's no extra bloat, the interface can be encapsulated, and people can use a pointer if they're passing between functions (since it will most often be used within the scope of a single function anyway). Or just pass it on the stack, if it's guaranteed to only be 3 DWORDs.

My suggestion (grain of salt) is to represent them similarly to the way mtext does, by using two bits somewhere to hold the character type (char, wchar, dchar) and change character types as needed.
Apr 30 2008
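
A minimal sketch of the struct-based Buffer!(T) idea discussed above, assuming a design in which the backing array's length serves as the capacity and a separate counter holds the logical length. All names here (Buffer, resize, put) are hypothetical; nothing like this is confirmed to exist in Phobos.

```d
/// Hypothetical Buffer!(T): backing storage plus a logical length.
struct Buffer(T)
{
    private T[] store;   // store.length plays the role of the capacity
    private size_t used; // logical length visible to the caller

    this(size_t capacity) { store = new T[capacity]; }

    size_t capacity() const { return store.length; }
    size_t length() const { return used; }

    /// Within capacity this only updates an integer; beyond it the
    /// backing array grows and may reallocate.
    void resize(size_t n)
    {
        if (n > store.length)
            store.length = n;
        used = n;
    }

    /// The visible contents as an ordinary slice.
    inout(T)[] opSlice() inout { return store[0 .. used]; }

    /// Appends items, growing only when capacity is exceeded.
    void put(const(T)[] items)
    {
        immutable start = used;
        resize(used + items.length);
        store[start .. used] = items[];
    }
}
```

Since opSlice returns a plain T[], the contents can still be passed to any function that takes an ordinary array.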
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Me Here (p9e883002 sneakemail.com)'s article
 My point was that /if/ Ds arrays have a similar capability, to be
 preallocated large and empty and grow into the space then
 when a mutation requires a reallocation of a mutable array because it has
 outgrown its original allocation,
 then a debug-enabled warning saying by how much, might allow the
 programmer to preallocate the initial mutable array larger and
 so avoid reallocation at runtime.

D arrays do have this feature, thanks to a suggestion by Derek Parnell. That is, reducing the array's length property does not cause a reallocation, even when length is set to zero. Thus it is possible to do:

    void fn( inout char[] buf )
    {
        buf.length = 1024; // preallocate 1024 bytes of storage
        buf.length = 0;
        buf ~= "hello"; // will copy into preallocated buffer
    }

Thus the proper way to discard a buffer is to do:

    buf = null;

I think for specific buffers it's probably enough to print their length when you're done filling them and then explicitly preallocate the next run based on this info. Tango also offers a means of performing program-level preallocation via GC.reserve() for people so inclined.

Sean
Apr 30 2008
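
For readers on a later D2 compiler than the one discussed here, the same preallocate-and-refill pattern can be sketched as follows. The assumption is that the runtime provides reserve, capacity and assumeSafeAppend in the object module, and that appending to a shrunken array reallocates unless assumeSafeAppend has been called. The helper name refill is hypothetical.

```d
/// Replaces the contents of `buf` with `text`, reusing the existing
/// allocation when it is large enough.
void refill(ref char[] buf, const(char)[] text)
{
    buf.length = 0;         // keep the pointer, drop the contents
    buf.assumeSafeAppend(); // promise that no other slice needs the old data
    buf ~= text;            // appends into the retained block if it fits
}
```

A caller would typically do `char[] buf; buf.reserve(1024);` once and then call refill repeatedly; buf.capacity can be printed during development to see whether the reservation was large enough.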
prev sibling parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
Janice Caron wrote:
 2008/4/30 Me Here <p9e883002 sneakemail.com>:
 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

forum. I have no idea what the distinction is between those two in D.

One is file, the other is a folder. std.string is a file, so it can't also be a folder.

Herein lies the genius in Tango's naming conventions. You *can* have both a package std.string, and a module named std.String. If you consistently use different case for package and module names, then you can have your cake and eat it too. --bb
Apr 30 2008
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Bill Baxter" wrote
 Janice Caron wrote:
 2008/4/30 Me Here :
 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

forum. I have no idea what the distinction is between those two in D.

One is file, the other is a folder. std.string is a file, so it can't also be a folder.

Herein lies the genius in Tango's naming conventions. You *can* have both a package std.string, and a module named std.String. If you consistently use different case for package and module names, then you can have your cake and eat it too.

Not on Windoze :) -Steve
Apr 30 2008
next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
 "Bill Baxter" wrote
 Janice Caron wrote:
 2008/4/30 Me Here :
 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

forum. I have no idea what the distinction is between those two in D.

One is file, the other is a folder. std.string is a file, so it can't also be a folder.

Herein lies the genius in Tango's naming conventions. You *can* have both a package std.string, and a module named std.String. If you consistently use different case for package and module names, then you can have your cake and eat it too.


It should still work, I believe. The source file will have a .d extension and the folder won't, so there shouldn't be a filesystem collision. Or are you saying that the compiler does some checking behind the scenes anyway? I'll admit I've never actually tried this. Sean
Apr 30 2008
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Sean Kelly" wrote
 == Quote from Steven Schveighoffer
 "Bill Baxter" wrote
 Janice Caron wrote:
 2008/4/30 Me Here :
 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

this forum. I have no idea what the distinction is between those two in D.

One is file, the other is a folder. std.string is a file, so it can't also be a folder.

Herein lies the genius in Tango's naming conventions. You *can* have both a package std.string, and a module named std.String. If you consistently use different case for package and module names, then you can have your cake and eat it too.


It should still work, I believe. The source file will have a .d extension and the folder won't, so there shouldn't be a filesystem collision. Or are you saying that the compiler does some checking behind the scenes anyway? I'll admit I've never actually tried this.

Excellent point, I completely forgot that even though you import std.String, you are really looking at the file std/String.d. In that case, I think you are right, it would work on Windoze. -Steve
Apr 30 2008
parent Bill Baxter <dnewsgroup billbaxter.com> writes:
Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 == Quote from Steven Schveighoffer
 "Bill Baxter" wrote
 Janice Caron wrote:
 2008/4/30 Me Here :
 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

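Now. This is where you show me up to be nothing but a pretender in this forum. I have no idea what the distinction is between those two in D.

One is a file, the other is a folder. std.string is a file, so it can't also be a folder.

Herein lies the genius in Tango's naming conventions. You *can* have both a package std.string, and a module named std.String. If you consistently use different case for package and module names, then you can have your cake and eat it too.


It should still work, I believe. The source file will have a .d extension and the folder won't, so there shouldn't be a filesystem collision. Or are you saying that the compiler does some checking behind the scenes anyway? I'll admit I've never actually tried this.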

Excellent point, I completely forgot that even though you import std.String, you are really looking at the file std/String.d. In that case, I think you are right, it would work on Windoze. -Steve

Yes, it works fine on Windows too. I pretty much work only on Windows, testing things occasionally on VMware Linux. --bb
May 01 2008
prev sibling parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
Steven Schveighoffer wrote:
 "Bill Baxter" wrote
 
 Not on Windoze :)
 
 -Steve 
 
 

It would be completely unacceptable for something like this not to work on Windows. -- Bruno Medeiros - Software Developer, MSc. in CS/E graduate http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
May 01 2008
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Bruno Medeiros" wrote
 Steven Schveighoffer wrote:
 Not on Windoze :)

 -Steve

It would be completely unacceptable for something like this not to work on Windows.

I was wrong, look at my response to Sean. Sorry about that. -Steve
May 01 2008
prev sibling next sibling parent "Me Here" <p9e883002 sneakemail.com> writes:
As my ASCII art was screwed up by the time it got to the server, here is a 
better illustration of what goes on.
This is long and wordy and maybe of no interest, but it does illustrate 
the point I was trying to make.

[0] Perl> use Devel::Peek;;
[0] Perl> Dump $s;;              ### Uninitialised scalar -- no space allocated.
SV = NULL(0x0) at 0x194a9cc
   REFCNT = 1
   FLAGS = ()

[0] Perl> $s = 'abcdefghijklmnopqrstuvwxyz';;        ### Assign it a string
[0] Perl> Dump $s;;
SV = PV(0x2252e8) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,pPOK)
   PV = 0x191e1a4 "abcdefghijklmnopqrstuvwxyz"\0   ### the data
   CUR = 26                                        ### user-visible length
   LEN = 27                                        ### +1 for null in case we pass it to C

[0] Perl> substr( $s, 0, 5 ) = '';;                           ### Remove the first 5 characters
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,OOK,pPOK)
   IV = 5  (OFFSET)                 ### offset of 5 from the start of the buffer
   PV = 0x191e1a9 ( "abcde" . ) "fghijklmnopqrstuvwxyz"\0    ### Still there but not visible
   CUR = 21                         ### User visible length
   LEN = 22                         ### Internal length (+ offset above)

[0] Perl> substr( $s, -10 ) = '';;   ##' chop off the last 10 chars
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,OOK,pPOK)
   IV = 5  (OFFSET)
   PV = 0x191e1a9 ( "abcde" . ) "fghijklmnop"\0
   CUR = 11                       ### User visible length changes
   LEN = 22                        ### Internal length doesn't.

[0] Perl> $s = 'XX' . $s;;          ### Prepend some new stuff back
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,OOK,pPOK)
   IV = 5  (OFFSET)
   PV = 0x191e1a9 ( "abcde" . ) "XXfghijklmnop"\0
   CUR = 13                      ### User length grows
   LEN = 22                       ### internal length doesn't

[0] Perl> $s .= 'XX';;          ### Append some new chars
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,OOK,pPOK)
   IV = 5  (OFFSET)
   PV = 0x191e1a9 ( "abcde" . ) "XXfghijklmnopXX"\0
   CUR = 15                     ### Ditto the above
   LEN = 22

[0] Perl> $s .= '??????';;       ### Fill it to the limit of the not offset space
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,OOK,pPOK)
   IV = 5  (OFFSET)            ### Offset unchanged
   PV = 0x191e1a9 ( "abcde" . ) "XXfghijklmnopXX??????"\0
   CUR = 21
   LEN = 22

[0] Perl> $s .= '##';;         ### Push it beyond that limit
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,pPOK)
   IV = 0                       ### Offset reclaimed
   PV = 0x191e1a4 "XXfghijklmnopXX??????##"\0
   CUR = 23
   LEN = 27

[0] Perl> $s .= '###';;     ### Up to the original allocation
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,pPOK)
   IV = 0                 ### Still the same address (below)
   PV = 0x191e1a4 "XXfghijklmnopXX??????#####"\0
   CUR = 26
   LEN = 27

[0] Perl> $s .= '!';;    ### Push it beyond the original allocation
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,pPOK)
   IV = 0          ### Reallocation occurs now,
                    ### though in place because nothing else has allocated memory.
   PV = 0x191e1a4 "XXfghijklmnopXX??????#####!"\0
   CUR = 27
   LEN = 28

-- 
Apr 30 2008
prev sibling parent Pedro Ferreira <ask me.pt> writes:
Janice Caron escreveu:
 2008/4/30 Janice Caron <caron800 googlemail.com>:
  I would support the addition of some function like

     gc.minimise(char[])

  which returned all the unused space following the end of the array
  back to the gc, without any copying of the used part. I wouldn't be
  able to write that though - the gc is not my area of expertise.

Sorry, I meant

     std.gc.minimise(void[] array)

This function doesn't exist right now.

Weren't 'void[]'s banned?
May 02 2008
prev sibling parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
Janice Caron wrote:
 2008/4/30 Bruno Medeiros <brunodomedeiros+spam com.gmail>:
  Also, is there a reason why these mutable functions shouldn't be in
 std.string, together with their invariant/const brethren?

That's why we're having this discussion. The idea is that std.string can be optimised for invariant strings, while std.stringbuffer could be optimised for mutable strings. There are pros and cons for separate modules. I don't think Walter wants std.string "polluted" by all these functions he doesn't much care for. Also, it would be bad if mutable versions were called "by mistake" with consequent unexpected behavior.

"mutable versions were called "by mistake" "? I don't think that point applies to D, after all, the purpose of the immutability system is for the compiler to check that this won't happen, so unless there is some compiler bug, that shouldn't happen in D.
 But keep discussing. The people I want to hear from most are the
 people calling for mutable string functions.

You may find that a large segment of those people are using Tango, and so they might not participate much in this Phobos design issue discussion. -- Bruno Medeiros - Software Developer, MSc. in CS/E graduate http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
May 01 2008
parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Bruno Medeiros wrote:
 Janice Caron wrote:
 2008/4/30 Bruno Medeiros <brunodomedeiros+spam com.gmail>:
  Also, is there a reason why these mutable functions shouldn't be in
 std.string, together with their invariant/const brethren?

That's why we're having this discussion. The idea is that std.string can be optimised for invariant strings, while std.stringbuffer could be optimised for mutable strings. There are pros and cons for separate modules. I don't think Walter wants std.string "polluted" by all these functions he doesn't much care for. Also, it would be bad if mutable versions were called "by mistake" with consequent unexpected behavior.

"mutable versions were called "by mistake" "? I don't think that point applies to D, after all, the purpose of the immutability system is for the compiler to check that this won't happen, so unless there is some compiler bug, that shouldn't happen in D.

What if you wanted a modified copy of the input, but that input happened to be mutable? The modifying versions should have some distinguishing characteristic to separate them from the COW versions. I'd say either a different function name or an extra out-buffer parameter (as long as they still work if the buffer is the same array as the normal input).
May 01 2008
next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Frits van Bommel" wrote
 Bruno Medeiros wrote:
 Janice Caron wrote:
 2008/4/30 Bruno Medeiros:
  Also, is there a reason why these mutable functions shouldn't be in
 std.string, together with their invariant/const brethren?

That's why we're having this discussion. The idea is that std.string can be optimised for invariant strings, while std.stringbuffer could be optimised for mutable strings. There are pros and cons for separate modules. I don't think Walter wants std.string "polluted" by all these functions he doesn't much care for. Also, it would be bad if mutable versions were called "by mistake" with consequent unexpected behavior.

"mutable versions were called "by mistake" "? I don't think that point applies to D, after all, the purpose of the immutability system is for the compiler to check that this won't happen, so unless there is some compiler bug, that shouldn't happen in D.

What if you wanted a modified copy of the input, but that input happened to be mutable? The modifying versions should have some distinguishing characteristic to separate them from the COW versions. I'd say either a different function name or an extra out-buffer parameter (as long as they still work if the buffer is the same array as the normal input).

Any modifying versions would take mutable strings, COW version would require invariant strings. They would be able to go in the same module, because there would be no ambiguity. But if you have non-modifying versions that you want to use on mutable strings, those would most likely take a const pointer. Those would have to be named differently than the invariant versions, because invariant implicitly casts to const. Besides all this, it is good to separate them into 2 different modules because the linker includes all functions that are in a module, not just ones that are used. So if you are of the persuasion to only use mutable or only use COW functions, then you probably don't want to link in the other versions if you can help it. -Steve
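
The point about mutable and invariant versions sharing a name without ambiguity can be sketched as below. This is only an illustration: describe() is a made-up name that reports which overload the compiler picked, and it is not a Phobos function.

```d
// Hypothetical names: describe() only reports which overload was chosen.
string describe(string s) { return "invariant (COW) overload"; }
string describe(char[] s) { return "mutable (in-place) overload"; }

void main()
{
    string fixed = "abc";        // immutable(char)[]
    char[] scratch = "abc".dup;  // mutable copy

    // An invariant string cannot convert to char[], and a char[] cannot
    // convert to string, so each call has exactly one candidate.
    assert(describe(fixed) == "invariant (COW) overload");
    assert(describe(scratch) == "mutable (in-place) overload");
}
```

Both overloads can live in one module here because neither argument type converts implicitly to the other.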
May 01 2008
prev sibling parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
Frits van Bommel wrote:
 Bruno Medeiros wrote:
 Janice Caron wrote:
 2008/4/30 Bruno Medeiros <brunodomedeiros+spam com.gmail>:
  Also, is there a reason why these mutable functions shouldn't be in
 std.string, together with their invariant/const brethren?

That's why we're having this discussion. The idea is that std.string can be optimised for invariant strings, while std.stringbuffer could be optimised for mutable strings. There are pros and cons for separate modules. I don't think Walter wants std.string "polluted" by all these functions he doesn't much care for. Also, it would be bad if mutable versions were called "by mistake" with consequent unexpected behavior.

"mutable versions were called "by mistake" "? I don't think that point applies to D, after all, the purpose of the immutability system is for the compiler to check that this won't happen, so unless there is some compiler bug, that shouldn't happen in D.

What if you wanted a modified copy of the input, but that input happened to be mutable?

Hum, I see what you mean, yes, that could happen.
 The modifying versions should have some distinguishing characteristic to 
 separate them from the COW versions. I'd say either a different function 
 name or an extra out-buffer parameter (as long as they still work if the 
 buffer is the same array as the normal input).

Yes, the idea to distinguish them with a different name sounds good (names like "doToUpper", maybe?). So that means you agree it should be in the same package? :P -- Bruno Medeiros - Software Developer, MSc. in CS/E graduate http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
May 01 2008
parent Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Bruno Medeiros wrote:
 Frits van Bommel wrote:
 The modifying versions should have some distinguishing characteristic 
 to separate them from the COW versions. I'd say either a different 
 function name or an extra out-buffer parameter (as long as they still 
 work if the buffer is the same array as the normal input).

Yes, the idea to distinguish them with a different name sounds good (names like "doToUpper", maybe?). So that means you agree it should be in the same package? :P

I don't like 'doToUpper', but something like 'makeUpper' could be a good convention. That makes it pretty clear they're modifying the input, I think.

I don't particularly care what package they're in, but their names should make it clear what they do. Especially if you're working with both sets of functions in the same module...

Looking at the Phobos2 std.string docs, I do think some of those functions could benefit from at least a const(char)[] overload so they'll work with non-invariant parameters too. The ones that don't even return string data[1] should probably just replace all invariant parameters with const ones. Of course, for the rest the return type of const overloads could be debated. (First question: should they ever return a slice? If not, should the return type be mutable or invariant[2]?)

[1]: In particular: inPattern(), size_t count*(), bool is*() and size_t column() are the ones I saw.
[2]: It shouldn't be const though, that'd be pointless: returning newly allocated memory as const means it's effectively invariant anyway.
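
As a rough illustration of replacing invariant parameters with const ones for functions that return no string data, here is a minimal sketch. countChar is a hypothetical stand-in for functions such as count(); it is not the Phobos implementation.

```d
// countChar is a hypothetical stand-in for read-only functions such as
// count(): it only inspects its argument and returns no string data.
size_t countChar(const(char)[] s, char c)
{
    size_t n = 0;
    foreach (ch; s)
        if (ch == c)
            ++n;
    return n;
}

void main()
{
    string fixed = "banana";        // invariant
    char[] scratch = "banana".dup;  // mutable
    const(char)[] view = scratch;   // const view of mutable data

    // One const signature accepts all three kinds of argument.
    assert(countChar(fixed, 'a') == 3);
    assert(countChar(scratch, 'a') == 3);
    assert(countChar(view, 'a') == 3);
}
```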
May 01 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
2008/4/29 Jarrett Billingsley <kb3ctd2 yahoo.com>:
  Might I ask why a StringBuffer class would be necessary?

Walter has vetoed that one, so it's moot now. :-)
Apr 29 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
2008/4/30 Bruno Medeiros <brunodomedeiros+spam com.gmail>:
  Also, is there a reason why these mutable functions shouldn't be in
 std.string, together with their invariant/const brethren?

That's why we're having this discussion. The idea is that std.string can be optimised for invariant strings, while std.stringbuffer could be optimised for mutable strings. There are pros and cons for separate modules. I don't think Walter wants std.string "polluted" by all these functions he doesn't much care for. Also, it would be bad if mutable versions were called "by mistake" with consequent unexpected behavior. But keep discussing. The people I want to hear from most are the people calling for mutable string functions.
Apr 29 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
2008/4/30 Me Here <p9e883002 sneakemail.com>:
  If a separate namespace is deemed /essential/, then I see no reason to go
 with, and certainly did not "ask for" it to be called,
  the misleading name of std.StringBuffer. As far as I recall, that was
 Janice's own suggestion.

Yeah, I got that from an earlier post when someone said "What you need is a string buffer" in response to some question. The name can be anything we want it to be.
  For preference, if a separate namespace is absolutely necessary, I go for:

     std.string.mutable

Except "std.string.anything" :-) "std.string" is a module, so it can't also be a package. That's a limitation of the D language.
  Finally, if the retention of unused but allocated space in an array is a
 feature of the current design, then I would add
  a debug time warning indicating when a char[] has had to be grown. These
 could be used during development to adjust
  the preallocated size of arrays to be large enough to accommodate all (most?
 typical?) requirements.

I would support the addition of some function like gc.minimise(char[]) which returned all the unused space following the end of the array back to the gc, without any copying of the used part. I wouldn't be able to write that though - the gc is not my area of expertise.
Apr 30 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
2008/4/30 Janice Caron <caron800 googlemail.com>:
  I would support the addition of some function like

     gc.minimise(char[])

  which returned all the unused space following the end of the array
  back to the gc, without any copying of the used part. I wouldn't be
  able to write that though - the gc is not my area of expertise.

Sorry, I meant

     std.gc.minimise(void[] array)

This function doesn't exist right now.
Apr 30 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
2008/4/30 Sean Kelly <sean invisibleduck.org>:
  Tango even exposes
  a GC.realloc() routine which will do this for you.

So does Phobos. std.gc.realloc(). However, realloc() doesn't promise not to copy, and not copying is the objective.

Thanks for all the cool info, but I just think programmers would feel more "comfortable" if, after they've done all their in-place string manipulations, they can call some minimizing function, even if only to give them a warm fuzzy feeling that they're not wasting any more memory than is necessary. Frankly, it could even be implemented as a do-nothing function. That way, at least "blame" for excessive memory use passes from the programmer to Phobos, and future gc implementations might do things differently.
Apr 30 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
2008/4/30 Me Here <p9e883002 sneakemail.com>:
 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

Now. This is where you show me up to be nothing but a pretender in this forum. I have no idea what the distinction is between those two in D.

One is a file, the other is a folder. std.string is a file, so it can't also be a folder.
  I /think/ you may have misunderstood my intent here. Unsurprising cos it
 was badly outlined.
  And I'm not at all sure that D works this way.

  In, for example, Perl, an array can be pre-sized but then set to be empty.
  That is, it can have space preallocated to it, but contain nothing.
  Likewise strings have two length attributes internally.
  - one denotes the length of the contents, as would be returned to the
 program by the length() function.
  - one indicates the actual length of the RAM allocated to it.

Well, that's what a StringBuffer would do, but nobody seemed to like the idea. A string contains two pieces of information: (1) ptr, and (2) length. A StringBuffer would carry a third piece of information: (3) capacity. (Actually, in general it would be Buffer!(T), with StringBuffer just being a special case).

Built-in strings do have a capacity, but it's not carried round in a field. Instead, to find the capacity of an array, you have to call std.gc.capacity(array) - and I can't see how there can not be a performance hit there.

Increasing the length of a D array doesn't necessarily mean reallocating (although as noted above, the code has to do some work to find out the capacity), but it /does/ mean re-initialising the newly exposed elements. Again, that has to be a performance hit. With a Buffer!(), you could increase the length (up to capacity) not only without reallocating but also without reinitializing, just by changing the value of an int.

But <shrugs> - if people don't want StringBuffers, who am I to argue?
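
A minimal sketch of the Buffer!(T) idea might look like the following. Every name in it (Buffer, setLength, data) is hypothetical and chosen only for illustration; it is not a proposed Phobos API.

```d
// A sketch of the Buffer!(T) idea; every name here is hypothetical.
struct Buffer(T)
{
    private T[] store;    // the whole allocation; its length is the capacity
    private size_t used;  // the user-visible length

    this(size_t cap) { store = new T[cap]; }

    size_t length() const { return used; }
    size_t capacity() const { return store.length; }

    // Changing the length within capacity only updates an integer:
    // nothing is reallocated and no element is reinitialised.
    void setLength(size_t n)
    {
        assert(n <= store.length, "n exceeds capacity");
        used = n;
    }

    // The currently visible contents.
    T[] data() { return store[0 .. used]; }
}

void main()
{
    auto buf = Buffer!char(16);
    assert(buf.length == 0 && buf.capacity == 16);

    buf.setLength(5);
    auto view = buf.data();
    view[] = "hello";                  // fill the visible part
    const before = view.ptr;

    buf.setLength(2);                  // shrink...
    buf.setLength(5);                  // ...and grow again
    assert(buf.data() == "hello");     // old contents are still there
    assert(buf.data().ptr is before);  // same memory, no reallocation
}
```

Growing past the capacity is deliberately left out of the sketch; a real type would have to reallocate at that point.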
Apr 30 2008
prev sibling next sibling parent reply "Janice Caron" <caron800 googlemail.com> writes:
2008/4/30 Me Here <p9e883002 sneakemail.com>:
     char[] a = ...2000 chars from somewhere.

     char[] field1 = a[ 312 .. 357 ];
     field1.toUpper();

I've kind of lost track of the number of times I've said this in recent days, but... You cannot uppercase in place, because for any given dchar, c, the number of UTF-8 bytes required to express c may be different from the number of UTF-8 bytes required to express toupper(c). If any of you have plans to uppercase or lowercase UTF-8 in place, forget that now. It just ain't possible. (You can uppercase ASCII, UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition, is UTF-8).
Apr 30 2008
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"Janice Caron" wrote
 2008/4/30 Me Here:
     char[] a = ...2000 chars from somewhere.

     char[] field1 = a[ 312 .. 357 ];
     field1.toUpper();

I've kind of lost track of the number of times I've said this in recent days, but... You cannot uppercase in place, because for any given dchar, c, the number of UTF-8 bytes required to express c may be different from the number of UTF-8 bytes required to express toupper(c). If any of you have plans to uppercase or lowercase UTF-8 in place, forget that now. It just ain't possible. (You can uppercase ASCII, UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition, is UTF-8).

What about inPlaceToUpperASCII(char[] str)? In other words, yeah, toUpper can use a UTF-8 string, and return a UTF-8 string, but I can see use in having a function that expects to receive ASCII and uppercases in-place. The function would be a lot simpler in any case :) -Steve
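
For illustration, such an ASCII-only function could be as small as the sketch below. The body is an assumption about what was meant, not code from any library.

```d
// Sketch of the suggested function; the body is an assumption.
void inPlaceToUpperASCII(char[] str)
{
    foreach (ref c; str)
    {
        // Only the 26 ASCII lower-case letters are touched, so bytes that
        // belong to multi-byte UTF-8 sequences (all >= 0x80) are left alone.
        if (c >= 'a' && c <= 'z')
            c = cast(char)(c - ('a' - 'A'));
    }
}

void main()
{
    char[] a = "name=alice;city=paris".dup;
    char[] field1 = a[5 .. 10];  // a slice shares memory with a
    inPlaceToUpperASCII(field1);
    assert(a == "name=ALICE;city=paris");
}
```

Because the slice aliases the original array, upper-casing the field changes the larger buffer without any allocation.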
May 01 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
2008/4/30 Matti Niemenmaa <see_signature for.real.address>:
 It's possible that, in some obscure case, you can't uppercase UTF-16 in
 place either.

Perhaps surprisingly, that's not so. This is because the alphabets of *ALL* living languages exist within Unicode's "Basic Multilingual Plane" (...which is to say, they can be encoded in a single wchar). The characters outside the BMP (...those which need a dchar, not a wchar...) are the letters of dead languages, or other special symbols. The probability that a letter from a living language will uppercase to a letter of a dead language is as near to zero as makes no odds.
Apr 30 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
Oh, sorry, I didn't read your whole post before replying. <embarrassed>.

OK, so private use characters might be a contrived exception. BUT,
nobody expects toUpper() to acknowledge private use characters. That
would require a run-time extensibility mechanism which is way beyond
what toUpper() does now, and likely beyond anything it's ever likely
to do any time soon. Maybe some future Unicode library with a
registerPrivateUseCharacters() function might cover that
functionality, but there are no plans for that on the table right now.
(And even then - as you say - it's a /very/ contrived case).
Apr 30 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
2008/4/30 Robert Fraser <fraserofthenight gmail.com>:
  I like StringBuffers :-). Did Walter veto the idea completely or did he say
 "not a class".

Yeah, he said not a class. And that was probably my fault because in my first post on this thread I used the word "class". Janice
Apr 30 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
2008/4/30 terranium <spam here.lot>:
 Matti Niemenmaa Wrote:

  > A code point in the private use area (U+E000 to U+F8FF), which can be
  > represented with one UTF-16 code unit, may uppercase to something in the

  does this have any practical use?

Private use characters can be used for invented alphabets, e.g. Klingon, or my-made-up-funky-alphabet. You can define them to be whatever you want. However the mechanism for /interpreting/ such characters is outside the scope of Unicode. All co-operating applications have to have the same knowledge of what those characters "mean".
Apr 30 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
2008/4/30 terranium <spam here.lot>:
 Janice Caron Wrote:

  > You cannot uppercase in place, because for any given dchar, c, the
  > number of UTF-8 bytes required to express c may be different from the
  > number of UTF-8 bytes required to express toupper(c).

  really?

Yes really.

    toUpper( '\u2C65' ) == '\u023A'
    toLower( '\u023A' ) == '\u2C65'

    '\u023A' requires two bytes in UTF-8
    '\u2C65' requires three bytes in UTF-8

Not a problem in UTF-16, of course.
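
The claim can be checked with a few assertions. The sketch below assumes a present-day Phobos, where std.uni provides toUpper/toLower for a single dchar and std.utf provides codeLength; these are not the 2008 std.string functions under discussion.

```d
import std.uni : toLower, toUpper;
import std.utf : codeLength;

void main()
{
    dchar lower = '\u2C65';
    dchar upper = '\u023A';

    // The two code points are a simple-casing pair.
    assert(toUpper(lower) == upper);
    assert(toLower(upper) == lower);

    // Their UTF-8 encodings differ in length...
    assert(codeLength!char(lower) == 3);
    assert(codeLength!char(upper) == 2);

    // ...while each fits in a single UTF-16 code unit.
    assert(codeLength!wchar(lower) == 1);
    assert(codeLength!wchar(upper) == 1);
}
```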
Apr 30 2008
prev sibling next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Thu, May 01, 2008 at 02:19:51AM +0900, Bill Baxter wrote:
 Herein lies the genius in Tango's naming conventions.  You *can* have 
 both a package std.string, and a module named std.String.  If you 
 consistently use different case for package and module names, then you 
 can have your cake and eat it too.

Does that work on Windows?
 --bb

-- Adam D. Ruppe http://arsdnet.net
Apr 30 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
2008/5/1 Frits van Bommel <fvbommel remwovexcapss.nl>:
  Actually, you can't uppercase UTF-16 and UTF-32 in-place either if you want
 to be entirely correct. For example: \u00df ("ß") --> \u0053 \u0053 ("SS").

I know about that, and for the future I have plans for a proper unicode lib with normalisation, full casing, etc. However - none of that is the job of std.string.toUpper() or std.string.toLower(). These functions only need to do /simple/ casing, not /full/ casing, and in /simple/ casing, one dchar always maps to one dchar. In particular '\u00DF' maps to '\u00DF'.

In full casing, toLower('\u1E9E') (LATIN CAPITAL LETTER SHARP S) is '\u00DF' (LATIN SMALL LETTER SHARP S), but the converse is not true. What fun! :-). But full casing is not the concern of std.string (nor of std.stringbuffer, or whatever we end up calling it), so we don't need to worry about that here.
Apr 30 2008
prev sibling next sibling parent "Janice Caron" <caron800 googlemail.com> writes:
On 01/05/2008, Spacen Jasset <spacenjasset yahoo.co.uk> wrote:
  I think uppercasing non-ASCII (English) characters is more of a specialised
 business anyway (some languages have no notion of upper case, and yet others
 depend on context), which often should be performed by a presentation layer.

The Unicode Standard defines casing unambiguously for all characters. Yes, toupper() of a Chinese character will leave it unchanged, but it's still defined, and that is /not/ locale dependent.

However, casing in place is possible for UTF-8 if you're prepared to throw an exception for those (extremely rare) cases when the sequence length changes. So that means, you'd need two versions, the in-place version

    toUpperInPlace(char[] s) // might throw

and the general version

    char[] toUpper(const(char)[] s, char[] buffer=null)

That could be done.
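
One way the throwing in-place version might be written is sketched below. It assumes a present-day Phobos (std.utf.decode, std.utf.encode, std.utf.codeLength and the single-dchar std.uni.toUpper); the function body, the exception type and the message are all assumptions made for illustration.

```d
import std.exception : assertThrown;
import std.uni : toUpper;
import std.utf : codeLength, decode, encode;

// Hypothetical sketch. Note that code points before the offending one
// have already been upper-cased by the time the exception is thrown.
void toUpperInPlace(char[] s)
{
    size_t i = 0;
    while (i < s.length)
    {
        immutable start = i;
        immutable dchar c = decode(s, i);  // advances i past the sequence
        immutable dchar u = toUpper(c);    // simple (1:1) casing
        if (u == c)
            continue;
        if (codeLength!char(u) != i - start)
            throw new Exception("upper-case form has a different UTF-8 length");

        char[4] buf;
        immutable n = encode(buf, u);
        s[start .. start + n] = buf[0 .. n];  // overwrite the same bytes
    }
}

void main()
{
    char[] ok = "caf\u00E9".dup;  // e-acute: two bytes in either case
    toUpperInPlace(ok);
    assert(ok == "CAF\u00C9");

    char[] bad = "\u2C65".dup;    // three bytes lower, two bytes upper
    assertThrown!Exception(toUpperInPlace(bad));
}
```

A stricter variant could validate the whole string first, so that nothing is modified when the exception is thrown.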
May 01 2008
prev sibling parent Pedro Ferreira <ask me.pt> writes:
Janice Caron escreveu:
 Hi all,
 
 More than one person has complained about the lack of string functions
 in Phobos which operate on mutable chars. In the thread titled "Is all
 this Invariant ****....", I suggested creating a new module,
 std.stringbuffer, to contain two things:
 
 (1) a StringBuffer class
 (2) parallel mutable versions of the functions in std.string.
 
 Walter OKed the idea, so it looks like that's a go. To that end, I've
 looked through the functions in std.string and sorted them into
 different groups. I think it's important to get the API right so
 comments are welcome on all of the below:

I agree with this and will welcome the module. I've had to do some ugly .idup and .dup around a compiler I coded to accommodate various functions around Phobos (such as writeLine from OutputStream).

I'd like to suggest, though, the usage of template code:

    T[] split(T)(in data)

and perform a static if inside. It'd save the hassle of maintaining two separate modules, which are bound to have different functions some day. For example, say that a function is added to std.string and not to std.stringbuffer. Also, it would be easier to maintain documentation consistency.

On an extra note, ASCII UTF variants could be taken care of in a single function. That would require a lot of work though.

Well, should you require assistance, gimme a shout.
Cheers
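
The template-plus-static-if suggestion could be sketched roughly as follows. kind() is a hypothetical example that only reports which branch was compiled in; a real function would put its mutable and invariant code paths in those branches.

```d
// kind() is a hypothetical example, not a Phobos function.
string kind(T)(T[] data)
{
    static if (is(T == immutable))
        return "invariant";
    else static if (is(T == const))
        return "const";
    else
        return "mutable";
}

void main()
{
    string fixed = "abc";
    char[] scratch = "abc".dup;
    const(char)[] view = scratch;

    // The element type is deduced per call, so one template covers
    // all three kinds of string.
    assert(kind(fixed) == "invariant");
    assert(kind(view) == "const");
    assert(kind(scratch) == "mutable");
}
```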
May 02 2008