Hi all,
More than one person has complained about the lack of string functions
in Phobos which operate on mutable chars. In the thread titled "Is all
this Invariant ****....", I suggested creating a new module,
std.stringbuffer, to contain two things:
(1) a StringBuffer class
(2) parallel mutable versions of the functions in std.string.
Walter OKed the idea, so it looks like that's a go. To that end, I've
looked through the functions in std.string and sorted them into
different groups. I think it's important to get the API right so
comments are welcome on all of the below:
The following functions are incorrectly declared in std.string because
they are currently declared to take strings, not const(char)[]. They
should be:
long atoi(in char[] s)
real atof(in char[] s)
size_t count(in char[] s, in char[] sub)
bool inPattern(dchar c, in char[] pattern)
int inPattern(dchar c, in char[][] patterns)
size_t countchars(in char[] s, in char[] pattern)
bool isNumeric(in char[] s, in bool bAllowSep = false)
size_t column(in char[] str, int tabsize = 8)
The following functions are badly declared in std.string because they
are declared to take and return strings. With the following change,
they become type agnostic
size_t isEmail(in char[] s)
size_t isURL(in char[] s)
The following function is the /only/ function currently in std.string
which takes an optional mutable buffer to use instead of allocating on
the heap. For consistency, let's put the mutable version into
std.stringbuffer, and let std.string have an invariant version, as
follows:
string soundex(string s)
The remaining functions go in std.stringbuffer.
The following functions all take an optional mutable buffer as input
into which to write the return value to avoid allocation.
char[] tolower(in char[] s, char[] buffer=null)
char[] toupper(in char[] s, char[] buffer=null)
char[] capitalize(in char[] s, char[] buffer=null)
char[] capwords(in char[] s, char[] buffer=null)
char[] repeat(in char[] s, size_t n, char[] buffer=null)
char[] join(in char[][] words, char[] sep, char[] buffer=null)
char[] ljustify(in char[] s, int width, char[] buffer=null)
char[] rjustify(in char[] s, int width, char[] buffer=null)
char[] center(in char[] s, int width, char[] buffer=null)
char[] zfill(in char[] s, int width, char[] buffer=null)
char[] replace(in char[] s, in char[] from, in char[] to, char[] buffer=null)
char[] replaceSlice(in char[] s, in char[] slice, in char[] replacement, char[] buffer=null)
char[] insert(in char[] s, size_t index, in char[] sub, char[] buffer=null)
char[] expandtabs(in char[] str, int tabsize=8, char[] buffer=null)
char[] entab(in char[] s, int tabsize=8, char[] buffer=null) // in place?
char[] maketrans(in char[] from, in char[] to, char[] buffer=null)
char[] translate(in char[] s, in char[] transtab, in char[] delchars, char[] buffer=null)
char[] succ(in char[] s, char[] buffer=null)
char[] soundex(in char[] s, char[] buffer=null)
char[] wrap(in char[] s, int columns = 80, in char[] firstindent = null, in char[] indent = null, int tabsize = 8, char[] buffer=null)
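A minimal sketch of the optional-buffer convention used by the signatures above, written against current D rather than the 2008 compiler. The name repeatInto is hypothetical and this is not the Phobos implementation. It assumes the caller's buffer is used only when it is already large enough, and that the function falls back to a heap allocation otherwise.

```d
// Hypothetical illustration of the "optional buffer" convention; not Phobos code.
char[] repeatInto(in char[] s, size_t n, char[] buffer = null)
{
    immutable needed = s.length * n;
    // Reuse the caller's storage if it is big enough, otherwise allocate.
    char[] result = buffer.length >= needed ? buffer[0 .. needed]
                                            : new char[needed];
    // Copy s into each consecutive chunk of the result.
    foreach (i; 0 .. n)
        result[i * s.length .. (i + 1) * s.length] = s[];
    return result;
}
```

With a stack array passed as the buffer, the returned slice aliases the caller's storage, so the caller can tell whether an allocation happened by comparing pointers.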
The following functions I am uncertain about. They could be declared
to take a mutable buffer as input, consistent with the above. /Or/
they could operate on data in place. Opinions are welcome.
char[] removechars(in char[] s, in char[] pattern, char[] buffer=null) // in place?
char[] squeeze(in char[] s, in char[] pattern = null, char[] buffer=null) // in place?
char[] tr(in char[] str, in char[] from, in char[] to, in char[] modifiers=null, char[] buffer=null) // in place?
The following functions need to be overloaded for both const and mutable input
char[][] split(char[] s)
const(char)[][] split(const(char)[] s)
char[][] split(char[] s, in char[] delim)
const(char)[][] split(const(char)[] s, in char[] delim)
char[][] splitlines(char[] s)
const(char)[][] splitlines(const(char)[] s)
char[] stripl(char[] s)
const(char)[] stripl(const(char)[] s)
char[] stripr(char[] s)
const(char)[] stripr(const(char)[] s)
char[] strip(char[] s)
const(char)[] strip(const(char)[] s)
char[] chop(char[] s)
const(char)[] chop(const(char)[] s)
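In current D the paired overloads above can, as far as I know, be collapsed into one declaration with the inout qualifier, which did not exist in this form when the list was written. The sketch below assumes ASCII whitespace only, and the name striplSketch is made up for illustration; it is not the Phobos stripl.

```d
import std.ascii : isWhite;

// Hypothetical single declaration covering mutable, const and immutable input.
// The qualifier of the argument is carried through to the return type.
inout(char)[] striplSketch(inout(char)[] s)
{
    size_t i = 0;
    while (i < s.length && isWhite(s[i]))
        ++i;
    return s[i .. $]; // a slice of the input; nothing is copied
}
```

Passing a char[] yields a char[], and passing a string yields a string, so the caller keeps whatever mutability it started with.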
Not sure what to do about the following one. AAs of mutable arrays are
notoriously difficult to get bug free. Should we bother with this one?
char[][char[]] abbrev(in char[][] values) // May be impractical
Finally - what do we all think about the inconsistent capitalization
throughout std.string? (toupper versus toString, capwords versus
endsWith, etc.)
"Janice Caron" <caron800 googlemail.com> wrote in message
news:mailman.508.1209497029.2351.digitalmars-d puremagic.com...
Hi all,
More than one person has complained about the lack of string functions
in Phobos which operate on mutable chars. In the thread titled "Is all
this Invariant ****....", I suggested creating a new module,
std.stringbuffer, to contain two things:
(1) a StringBuffer class
Might I ask why a StringBuffer class would be necessary?
Apr 29 2008
Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
A couple of thoughts:
Janice Caron wrote:
Hi all,
More than one person has complained about the lack of string functions
in Phobos which operate on mutable chars. In the thread titled "Is all
this Invariant ****....", I suggested creating a new module,
std.stringbuffer, to contain two things:
(1) a StringBuffer class
(2) parallel mutable versions of the functions in std.string.
I'm with Jarret here, why the hell do we need a StringBuffer class?
'string' is not a class either, so just use char[].
I would recommend aliasing char[] to 'mstring' (short for mutable
string). I think such an alias is more readable than 'char[]'.
Also, is there a reason why these mutable functions shouldn't be in
std.string, together with their invariant/const brethren? I don't think
it makes sense to have another package if one opts for solution (2).
--
Bruno Medeiros - Software Developer, MSc. in CS/E graduate
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
As one of those who has requested "a standard library of string functions
that accept and return mutable strings, i.e. char[]",
I see no reason it should be a class; free functions seem to work just
fine. A class would just be bloat.
I would be perfectly happy for these to co-exist in the std.string space.
Indeed I would prefer it.
If a separate namespace is deemed /essential/, then I see no reason to go
with, and certainly did not "ask for", the misleading name of
std.StringBuffer. As far as I recall, that was Janice's own suggestion.
For preference, if a separate namespace is absolutely necessary, I'd go for:
std.string.mutable
The right namespace, and it does what it says on the tin.
Further, /if/ I had any input to the design, then the suggestion that I
have to pass in preallocated buffers to accommodate the mutated data if it
needs to grow would be scotched forthwith.
They should look and work in exactly the same way as the existing v2
std.string functions, taking the same number and order of parameters. Just
char[] (or a compatible alias) instead of string.
If the buffers need to grow, then allocate space from wherever (I assume
the heap) std.string allocates from now.
If they do not change size, then return the original intact.
If they shrink, and if the D array internals permit this, then adjust the
.length attribute whilst leaving the actual allocation unchanged. That way
it is there for use should further mutations cause it to grow again.
This also helps prevent heap fragmentation if the functions are called on
heap-allocated data.
Finally, if the retention of unused but allocated space in an array is a
feature of the current design, then I would add a debug-time warning
indicating when a char[] has had to be grown. These could be used during
development to adjust the preallocated size of arrays to be large enough to
accommodate all (most? typical?) requirements.
In summary: mutable string functions should do exactly the same as the
invariant functions do now, except only reallocate if necessary and
(optionally) issue a warning under debug if they have to.
It seems almost as if a template solution could be used, except that I think
the additional conditional code would hamper the performance of both
instantiations. Unless D's templating is capable of optimising away branches
of code that relate to the /other/ type instantiations? I've had no
occasion to use templates in D yet, so that might be pie in the sky.
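On the template question just raised: D's static if is evaluated at compile time, so each instantiation contains only the branch that applies to its type. The sketch below is a hypothetical helper (upperFirst is not a Phobos function) that handles ASCII only; it mutates in place when given char[] and copies when given const or immutable input.

```d
import std.traits : isMutable;

// Hypothetical template: one source, two differently compiled instantiations.
char[] upperFirst(C)(C[] s) if (is(immutable C == immutable char))
{
    static if (isMutable!C)
    {
        // Mutable input: change the caller's data directly, no allocation.
        if (s.length && s[0] >= 'a' && s[0] <= 'z')
            s[0] -= 32;
        return s;
    }
    else
    {
        // const/immutable input: work on a fresh copy.
        char[] r = s.dup;
        if (r.length && r[0] >= 'a' && r[0] <= 'z')
            r[0] -= 32;
        return r;
    }
}
```

Because the choice is made per instantiation, neither version pays a runtime branch for the other's behaviour.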
--
== Quote from Janice Caron (caron800 googlemail.com)'s article
I would support the addition of some function like
gc.minimise(char[])
which returned all the unused space following the end of the array
back to the gc, without any copying of the used part. I wouldn't be
able to write that though - the gc is not my area of expertise.
This would only work for large arrays I'm afraid, given the GC
implementation for D--it uses fixed-size blocks until the block
size is 4096 bytes or larger. Also, the shrinking would be done
in chunks of 4096 bytes, so a fairly substantial size change would
have to occur for anything to happen at all. That said, things get
a lot easier if moving the block is allowed. Tango even exposes
a GC.realloc() routine which will do this for you.
Sean
Apr 30 2008
Sean Kelly <sean invisibleduck.org> writes:
== Quote from Janice Caron (caron800 googlemail.com)'s article
2008/4/30 Sean Kelly <sean invisibleduck.org>:
Tango even exposes
a GC.realloc() routine which will do this for you.
However, realloc() doesn't promise not to copy, and not copying is the
objective. Thanks for all the cool info, but I just think programmers
would just feel more "comfortable" if, after they've done all their
in-place string manipulations, they can call some minimizing function,
even if only to give them a warm fuzzy feeling that they're not
wasting any more memory than is necessary.
It's perhaps worth noting here that C++ objects don't typically minimize
either. That's why Scott Meyers (?) proposed the idiom:
std::vector<T>(myVector).swap(myVector);
Frankly, it could even be implemented as a do-nothing function. That way,
at least "blame" for excessive memory use passes from the programmer
to Phobos, and future gc implementations might do things differently.
The name can be anything we want it to be.
...
Except "std.string.anything" :-)
I did laugh. Not quite "any colour you like so long as it's black", but
close :)
"std.string" is a module, so it can't also be a package. That's a
limitation of the D language.
Now. This is where you show me up to be nothing but a pretender in this
forum.
I have no idea what the distinction is between those two in D.
Finally, if the retention of unused but allocated space in an array is a
feature of the current design, then I would add a debug-time warning
indicating when a char[] has had to be grown. These could be used during
development to adjust the preallocated size of arrays to be large enough to
accommodate all (most? typical?) requirements.
I would support the addition of some function like
gc.minimise(char[])
which returned all the unused space following the end of the array
back to the gc, without any copying of the used part. I wouldn't be
able to write that though - the gc is not my area of expertise.
I /think/ you may have misunderstood my intent here. Unsurprising, cos it
was badly outlined.
And I'm not at all sure that D works this way.
In, for example, Perl, an array can be pre-sized but then set to be empty.
That is, it can have space preallocated to it, but contain nothing.
Likewise strings have two length attributes internally:
- one denotes the length of the contents, as would be returned to the
program by the length() function.
- one indicates the actual length of the RAM allocated to it.
This allows, for example, chomp() to simply adjust a number (the
program-visible length) and do no adjustment or reallocation at all. It can
also adjust the left-hand end of the contents, effectively foreshortening
the string, again without adjusting the allocation.
So visually, a scalar holding a string might at some point in its life
look something like:
(this ascii art is going to come out a mess on the server but...)
header
[ offset     ]  distance from the start of the allocation to the visible contents
[ actualLen  ]  length of the whole allocation
[ pgmVisible ]  length of the contents the program can see
[ pointer    ]  points at the buffer below
[][][][][][][the contents the program can see is here][][][][][][]
Basically, it starts out with offset zero and only as much padding (if any)
as is required to bring it to suitable alignment.
But if you remove characters at the end (chomp or chop) then the padding
grows as the contents shrink and nothing is allocated.
If you remove characters from the front of the string, the offset
accommodates that and the allocation doesn't change.
And if further mutations expand the string, then these spaces are reused
before a new allocation is made.
If, for example, you know you are going to build a long string up
piecewise from small appendages, you can initialise it to some
length big enough for the expected final length and then truncate it
(assign '' to it) and it will retain its allocation, even though the
program-visible length is zero. Then, as you add stuff to it, it grows
into the allocation.
My point was that /if/ D's arrays have a similar capability, to be
preallocated large and empty and grow into the space, then
when a mutation requires a reallocation of a mutable array because it has
outgrown its original allocation, a debug-enabled warning saying by how
much might allow the programmer to preallocate the initial mutable array
larger and so avoid reallocation at runtime.
There's a whole heap of speculation here about what might be going on inside D
that I have no real knowledge of at all.
Note: There is no suggestion here that D should work this way. Only that if
it does allow preallocation of array sizes,
then a warning when a mutation causes allocation would allow the
programmer to best use that facility.
Cheers, b.
--
But <shrugs> - if people don't want StringBuffers, who am I to argue?
What's in a name? Pre-conceptions of other worlds and other tools.
Specifically Java.
Additionally, the casing suggests a class?
For my part, I simply want string functions that operate on char[]s.
Because I perceive that, for the type of mutations I am currently doing,
invariant strings would incur too high a cost.
If your StringBuffer concept would accept and manipulate char[]s
and not require the instantiation, initialisation and syntax of an object.
By which I mean that if having used a string function upon my char[]
I can still apply slice operations to it using the standard syntax.
And then apply another string function, and then another slice.
Or even, apply a string function to a slice of a larger string and
mutate that larger string, in-place through the slice:
char[] a = ...2000 chars from somewhere.
char[] field1 = a[ 312 .. 357 ];
field1.toUpper();
char[] checksum = a[ $-16 .. $ ];
checksum = md5hex( a );
...
Then I will be very happy.
Beyond that, I have no requirements :)
All the stuff about warnings and internal and external lengths was just
speculation about what might be going on inside, on the basis of what I know,
have seen (Perl) and have personally implemented (not Perl).
Cheers, b.
Ps. Is there a paper/article/reference on the reasoning behind Invariant
strings somewhere?
--
Apr 30 2008
Matti Niemenmaa <see_signature for.real.address> writes:
Janice Caron wrote:
If any of you have plans to uppercase or lowercase UTF-8 in place,
forget that now. It just ain't possible. (You can uppercase ASCII,
UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition,
is UTF-8).
It's possible that, in some obscure case, you can't uppercase UTF-16 in place
either.
A code point in the private use area (U+E000 to U+F8FF), which can be
represented with one UTF-16 code unit, may uppercase to something in the
supplementary private use areas (U+F0000 upwards), whose code points require
two UTF-16 code units each. Of course the toUpper function in question must be
aware of this configuration of the private use areas.
This is an extremely contrived case and I doubt it'll ever come up in practice,
anywhere, but in theory it might. <g>
--
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
A code point in the private use area (U+E000 to U+F8FF), which can be
represented with one UTF-16 code unit, may uppercase to something in the
does this have any practical use?
Apr 30 2008
Matti Niemenmaa <see_signature for.real.address> writes:
Janice Caron wrote:
OK, so private use characters might be a contrived exception. BUT,
nobody expects toUpper() to acknowledge private use characters. That
would require a run-time extensibility mechanism which is way beyond
what toUpper() does now, and likely beyond anything it's ever likely
to do any time soon.
You're right, of course. I was referring more to some hypothetical toUpper()
function rather than one which I would expect to find in any standard
library---the generic case of "uppercasing a character" as opposed to
std.string[buffer].toUpper.
--
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
== Quote from Janice Caron (caron800 googlemail.com)'s article
2008/4/30 Me Here <p9e883002 sneakemail.com>:
char[] a = ...2000 chars from somewhere.
char[] field1 = a[ 312 .. 357 ];
field1.toUpper();
I've kind of lost track of the number of times I've said this in
recent days, but...
You cannot uppercase in place, because for any given dchar, c, the
number of UTF-8 bytes required to express c may be different from the
number of UTF-8 bytes required to express toupper(c).
If any of you have plans to uppercase or lowercase UTF-8 in place,
forget that now. It just ain't possible. (You can uppercase ASCII,
UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition,
is UTF-8).
In all fairness, you can uppercase UTF-8 in place so long as none of
the characters within the string require a multi-byte capital. Thus
one questionable strategy would be to uppercase in place until the
first multibyte conversion is required. The obvious downside being
that the original buffer may end up partially capitalized, with the
fully capitalized result returned in a new buffer. I'm sure people
processing ASCII text would love this, but I can see it causing
problems elsewhere.
Sean
char[] a = ...2000 chars from somewhere.
char[] field1 = a[ 312 .. 357 ];
field1.toUpper();
I've kind of lost track of the number of times I've said this in
recent days, but...
You cannot uppercase in place, because for any given dchar, c, the
number of UTF-8 bytes required to express c may be different from the
number of UTF-8 bytes required to express toupper(c).
If any of you have plans to uppercase or lowercase UTF-8 in place,
forget that now. It just ain't possible. (You can uppercase ASCII,
UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition,
is UTF-8).
Ignoring for the moment Matti's pronouncement that this is an obscure and
unlikely event,
it really depends upon how the library is coded.
For example, if the case change is effected in place for the majority of
cases when it can be,
when the occasion occurs that it cannot, and raises a runtime exception,
catch the error
and use replaceSlice to handle it:
import std.stdio;
import std.string;
int main( char[][] args ) {
    char [] s = "the quick brown fox";
    try{
        s[ 8 .. 9 ] = \u1234;
    }
    catch {
        s = s.replaceSlice( s[ 8 .. 9 ], \u1234 );
    }
    writefln( s );
    return 0;
}
Though it would be (much) nicer if the built-in lvalue slice handled this
for us.
I was just disappointed for a second to (re)discover this limitation of
D's slicing. I had forgotten because other languages I use do not have it.
This is one of those decisions that I doubt I will ever agree with.
But I'm just another jerk on the internet with an opinion, and we all know
what that is analogous to.
If the language doesn't handle it, then the library should.
If it doesn't, then I will have to. And you, and Bill and Fred and Sue...
Cheers, b.
--
Apr 30 2008
Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Janice Caron wrote:
2008/4/30 Me Here <p9e883002 sneakemail.com>:
char[] a = ...2000 chars from somewhere.
char[] field1 = a[ 312 .. 357 ];
field1.toUpper();
I've kind of lost track of the number of times I've said this in
recent days, but...
You cannot uppercase in place, because for any given dchar, c, the
number of UTF-8 bytes required to express c may be different from the
number of UTF-8 bytes required to express toupper(c).
If any of you have plans to uppercase or lowercase UTF-8 in place,
forget that now. It just ain't possible. (You can uppercase ASCII,
UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition,
is UTF-8).
Actually, you can't uppercase UTF-16 and UTF-32 in-place either if you
want to be entirely correct. For example: \u00df ("ß") --> \u0053 \u0053
("SS"). This increases the byte count for both UTF-16 and UTF-32.
(This does work for UTF-8 though, since \u00df happens to require 2
UTF-8 code units, and both \u0053s only one each)
(See <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt> for what
should be a complete list of characters with similar annoying casing
properties)
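The code-unit counts discussed above can be checked directly from literal lengths in D. This is only a sketch: the constant names are made up, and the second pair (U+0250 uppercasing to U+2C6F) is taken from my recollection of UnicodeData.txt as an example of a UTF-8 sequence that grows when uppercased.

```d
// Sharp s (U+00DF) and its full uppercase mapping "SS", in all three encodings.
enum sharpS8  = "\u00df";   // UTF-8
enum sharpS16 = "\u00df"w;  // UTF-16
enum sharpS32 = "\u00df"d;  // UTF-32

static assert(sharpS8.length == 2  && "SS".length == 2);   // same size in UTF-8
static assert(sharpS16.length == 1 && "SS"w.length == 2);  // grows in UTF-16
static assert(sharpS32.length == 1 && "SS"d.length == 2);  // grows in UTF-32

// Assumed simple case pair: U+0250 (lower) and U+2C6F (upper).
enum turnedALower = "\u0250";
enum turnedAUpper = "\u2c6f";
static assert(turnedALower.length == 2 && turnedAUpper.length == 3); // grows in UTF-8
```

The static asserts run at compile time, so the file compiling at all confirms the counts.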
char[] a = ...2000 chars from somewhere.
char[] field1 = a[ 312 .. 357 ];
field1.toUpper();
I've kind of lost track of the number of times I've said this in
recent days, but...
You cannot uppercase in place, because for any given dchar, c, the
number of UTF-8 bytes required to express c may be different from the
number of UTF-8 bytes required to express toupper(c).
If any of you have plans to uppercase or lowercase UTF-8 in place,
forget that now. It just ain't possible. (You can uppercase ASCII,
UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition,
is UTF-8).
I think uppercasing non-ASCII (non-English) characters is more of a
specialised business anyway (some languages have no notion of upper
case, and yet others depend on context), which often should be performed
by a presentation layer.
People need a toupper/tolower all the time, and 90% of the time they use
it on strings that are in the ASCII range, often because they deal with
protocols, file formats and other such things.
In which case Phobos's string.toupper shouldn't really be doing work
outside of ASCII, in my opinion anyway. This also means that a string
can be uppercased in place.
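A minimal sketch of the ASCII-only, in-place uppercasing argued for above. The function name is hypothetical. It relies on the fact that every byte of a multi-byte UTF-8 sequence has its high bit set, so bytes outside 'a' to 'z' are left untouched and the string stays valid UTF-8.

```d
// Hypothetical helper: uppercases ASCII letters in place, ignores everything else.
void asciiToUpperInPlace(char[] s)
{
    foreach (ref c; s)          // iterates code units, not decoded characters
        if (c >= 'a' && c <= 'z')
            c -= 32;            // 'a' - 'A' == 32 in ASCII
}
```

Because the argument is a slice, calling it on a sub-range such as a[4 .. 9] changes the larger array through that slice, which is the usage asked for earlier in the thread.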
May 01 2008
Robert Fraser <fraserofthenight gmail.com> writes:
Janice Caron wrote:
2008/4/30 Me Here <p9e883002 sneakemail.com>:
"std.string" is a module, so it can't also be a package. That's a
limitation of the D language.
Now. This is where you show me up to be nothing but a pretender in this
forum.
I have no idea what the distinction is between those two in D.
One is file, the other is a folder. std.string is a file, so it can't
also be a folder.
I /think/ you may have misunderstood my intent here. Unsurprising cos it
was badly outlined.
And I'm not at all sure that D works this way.
In, for example, Perl, an array can be pre-sized but then set to be empty.
That is, it can have space preallocated to it, but contain nothing.
Likewise strings have two length attributes internally.
- one denotes the length of the contents, as would be returned to the
program by the length() function.
- one indicates the actual length of the RAM allocated to it.
Well, that's what a StringBuffer would do, but nobody seemed to like
the idea. A string contains two pieces of information: (1) ptr, and
(2) length. A StringBuffer would carry a third piece of information:
(3) capacity. (Actually, in general it would be Buffer!(T), with
StringBuffer just being a special case).
Built-in strings do have a capacity, but it's not carried round in a
field. Instead, to find the capacity of an array, you have to call
std.gc.capacity(array) - and I can't see how there can not be a
performance hit there.
Increasing the length of a D array doesn't necessarily mean
reallocating (although as noted above, the code has to do some work to
find out the capacity), but it /does/ mean re-initialising the newly
exposed elements. Again, that has to be a performance hit. With a
Buffer!(), you could increase the length (up to capacity) not only
without reallocating but also without reinitializing, just by changing
the value of an int.
But <shrugs> - if people don't want StringBuffers, who am I to argue?
I like StringBuffers :-). Did Walter veto the idea completely or did he
say "not a class"? I'd use a struct - there's no extra bloat, the
interface can be encapsulated, and people can use a pointer if they're
passing between functions (since it will most often be used within the
scope of a single function anyway). Or just pass it on the stack, if
it's guaranteed to only be 3 DWORDs.
My suggestion (grain of salt) is to represent them similarly to the way
mtext does by using two bits somewhere to hold the character type (char,
wchar, dchar) and change character types as needed.
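A rough sketch of the struct-based Buffer!(T) idea discussed above, carrying a capacity alongside the visible length. Everything here (the member names, the put method, the doubling growth policy) is an assumption for illustration, not a proposed Phobos API. Storage is allocated and initialised once up front; changing the length within capacity afterwards only updates a counter.

```d
// Hypothetical buffer: store.ptr/store.length act as ptr/capacity, used as length.
struct Buffer(T)
{
    private T[] store;
    private size_t used;

    this(size_t capacity) { store = new T[capacity]; }

    @property size_t capacity() const { return store.length; }
    @property size_t length() const { return used; }

    // Changing the length within capacity touches no memory at all.
    @property void length(size_t n)
    {
        assert(n <= store.length);
        used = n;
    }

    // b[] gives an ordinary slice, so normal slice syntax keeps working.
    inout(T)[] opSlice() inout { return store[0 .. used]; }

    void put(const(T)[] s)
    {
        if (used + s.length > store.length)
            store.length = (used + s.length) * 2; // grow only when needed
        store[used .. used + s.length] = s[];
        used += s.length;
    }
}
```

A Buffer!char with capacity 16 can be filled, truncated and refilled without any further allocation as long as it stays within those 16 elements.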
== Quote from Me Here (p9e883002 sneakemail.com)'s article
My point was that /if/ Ds arrays have a similar capability, to be
preallocated large and empty and grow into the space then
when a mutation requires a reallocation of a mutable array because it has
outgrown its original allocation,
then a debug-enabled warning saying by how much, might allow the
programmer to preallocate the initial mutable array larger and
so avoid reallocation at runtime.
D arrays do have this feature, thanks to a suggestion by Derek Parnell. That
is, reducing the array's length property does not cause a reallocation, even
when length is set to zero. Thus it is possible to do:
void fn( inout char[] buf )
{
    buf.length = 1024; // preallocate 1024 bytes of storage
    buf.length = 0;
    buf ~= "hello"; // will copy into preallocated buffer
}
Thus the proper way to discard a buffer is to do:
buf = null;
I think for specific buffers it's probably enough to print their length when
you're done filling them and then explicitly preallocate the next run based on
this info.
Tango also offers a means of performing program-level preallocation via
GC.reserve() for people so inclined.
Sean
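For readers on a current D compiler, a hedged sketch of the same preallocation idea using the built-in reserve and capacity array properties. As far as I know, modern runtimes no longer append in place after length is shrunk unless assumeSafeAppend is called first, so reserving up front is the simpler route. The function name is made up, and ref is used where the snippet above uses the older inout parameter syntax.

```d
// Hypothetical helper: preallocate, then append into the reserved space.
void appendGreeting(ref char[] buf)
{
    buf.reserve(1024); // ask the runtime for room for at least 1024 chars
    buf ~= "hello";    // fits in the reserved block, so no further allocation
}
```

Printing buf.capacity before and after a filling loop is one way to get the "how much did it grow" information discussed earlier in the thread.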
Apr 30 2008
Bill Baxter <dnewsgroup billbaxter.com> writes:
Janice Caron wrote:
2008/4/30 Me Here <p9e883002 sneakemail.com>:
"std.string" is a module, so it can't also be a package. That's a
limitation of the D language.
Now. This is where you show me up to be nothing but a pretender in this
forum.
I have no idea what the distinction is between those two in D.
One is file, the other is a folder. std.string is a file, so it can't
also be a folder.
Herein lies the genius in Tango's naming conventions. You *can* have
both a package std.string, and a module named std.String. If you
consistently use different case for package and module names, then you
can have your cake and eat it too.
--bb
== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
"Bill Baxter" wrote
Janice Caron wrote:
2008/4/30 Me Here :
"std.string" is a module, so it can't also be a package. That's a
limitation of the D language.
Now. This is where you show me up to be nothing but a pretender in this
forum.
I have no idea what the distinction is between those two in D.
One is file, the other is a folder. std.string is a file, so it can't
also be a folder.
Herein lies the genius in Tango's naming conventions. You *can* have both
a package std.string, and a module named std.String. If you consistently
use different case for package and module names, then you can have your
cake and eat it too.
It should still work, I believe. The source file will have a .d extension and
the folder won't, so there shouldn't be a filesystem collision. Or are you
saying that the compiler does some checking behind the scenes anyway? I'll
admit I've never actually tried this.
Sean
"std.string" is a module, so it can't also be a package. That's a
limitation of the D language.
Now. This is where you show me up to be nothing but a pretender in this
forum.
I have no idea what the distinction is between those two in D.
One is file, the other is a folder. std.string is a file, so it can't
also be a folder.
Herein lies the genius in Tango's naming conventions. You *can* have both
a package std.string, and a module named std.String. If you consistently
use different case for package and module names, then you can have your
cake and eat it too.
It should still work, I believe. The source file will have a .d extension and
the folder won't, so there shouldn't be a filesystem collision. Or are you
saying that the compiler does some checking behind the scenes anyway? I'll
admit I've never actually tried this.
Excellent point, I completely forgot that even though you import std.String,
you are really looking at the file
std/String.d.
In that case, I think you are right, it would work on Windoze.
-Steve
Apr 30 2008
Bill Baxter <dnewsgroup billbaxter.com> writes:
Steven Schveighoffer wrote:
"Sean Kelly" wrote
== Quote from Steven Schveighoffer
"Bill Baxter" wrote
Janice Caron wrote:
2008/4/30 Me Here :
"std.string" is a module, so it can't also be a package. That's a
limitation of the D language.
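Now. This is where you show me up to be nothing but a pretender in this
forum.
I have no idea what the distinction is between those two in D.
One is file, the other is a folder. std.string is a file, so it can't
also be a folder.
Herein lies the genius in Tango's naming conventions. You *can* have both
a package std.string, and a module named std.String. If you consistently
use different case for package and module names, then you can have your
cake and eat it too.
It should still work, I believe. The source file will have a .d extension and
the folder won't, so there shouldn't be a filesystem collision. Or are you
saying that the compiler does some checking behind the scenes anyway? I'll
admit I've never actually tried this.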
Excellent point, I completely forgot that even though you import std.String,
you are really looking at the file
std/String.d.
In that case, I think you are right, it would work on Windoze.
-Steve
Yes it works fine on Windows too. I pretty much work only on Windows
testing things occasionally on VMWare Linux.
--bb
May 01 2008
Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
As my ascii art was screwed by the time it got to the server, here is a
better illustration of what goes on:
This is long and wordy and maybe of no interest. But it does illustrate
the point I was trying to make.
[0] Perl> use Devel::Peek;;
[0] Perl> Dump $s;; ### Uninitialised scalar -- no space allocated.
SV = NULL(0x0) at 0x194a9cc
REFCNT = 1
FLAGS = ()
[0] Perl> $s = 'abcdefghijklmnopqrstuvwxyz';; ### Assign it a string
[0] Perl> Dump $s;;
SV = PV(0x2252e8) at 0x194a9cc
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x191e1a4 "abcdefghijklmnopqrstuvwxyz"\0 ### the data
CUR = 26 ### user visible length
LEN = 27 ### +1 for null in case we pass it to C
[0] Perl> substr( $s, 0, 5 ) = '';; ### Remove the first 5 characters
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
REFCNT = 1
FLAGS = (POK,OOK,pPOK)
IV = 5 (OFFSET) ### offset of 5 from the start of the buffer
PV = 0x191e1a9 ( "abcde" . ) "fghijklmnopqrstuvwxyz"\0 ### Still there but not visible
CUR = 21 ### User visible length
LEN = 22 ### Internal length (+ offset above)
[0] Perl> substr( $s, -10 ) = '';; ### chop off the last 10 chars
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
REFCNT = 1
FLAGS = (POK,OOK,pPOK)
IV = 5 (OFFSET)
PV = 0x191e1a9 ( "abcde" . ) "fghijklmnop"\0
CUR = 11 ### User visible length changes
LEN = 22 ### Internal length doesn't.
[0] Perl> $s = 'XX' . $s;; ### Prepend some new stuff back
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
REFCNT = 1
FLAGS = (POK,OOK,pPOK)
IV = 5 (OFFSET)
PV = 0x191e1a9 ( "abcde" . ) "XXfghijklmnop"\0
CUR = 13 ### User length grows
LEN = 22 ### internal length doesn't
[0] Perl> $s .= 'XX';; ### Append some new chars
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
REFCNT = 1
FLAGS = (POK,OOK,pPOK)
IV = 5 (OFFSET)
PV = 0x191e1a9 ( "abcde" . ) "XXfghijklmnopXX"\0
CUR = 15 ### Ditto the above
LEN = 22
[0] Perl> $s .= '??????';; ### Fill it to the limit of the not offset space
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
REFCNT = 1
FLAGS = (POK,OOK,pPOK)
IV = 5 (OFFSET) ### Offset unchanged
PV = 0x191e1a9 ( "abcde" . ) "XXfghijklmnopXX??????"\0
CUR = 21
LEN = 22
[0] Perl> $s .= '##';; ### Push it beyond that limit
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
REFCNT = 1
FLAGS = (POK,pPOK)
IV = 0 ### Offset reclaimed
PV = 0x191e1a4 "XXfghijklmnopXX??????##"\0
CUR = 23
LEN = 27
[0] Perl> $s .= '###';; ### Up to the original allocation
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
REFCNT = 1
FLAGS = (POK,pPOK)
IV = 0 ### Still the same address (below)
PV = 0x191e1a4 "XXfghijklmnopXX??????#####"\0
CUR = 26
LEN = 27
[0] Perl> $s .= '!';; ### Push it beyond the original allocation
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
REFCNT = 1
FLAGS = (POK,pPOK)
IV = 0 ### Reallocation occurs now
### Though in place because nothing else has allocated memory.
PV = 0x191e1a4 "XXfghijklmnopXX??????#####!"\0
CUR = 27
LEN = 28
--
I would support the addition of some function like
gc.minimise(char[])
which returned all the unused space following the end of the array
back to the gc, without any copying of the used part. I wouldn't be
able to write that though - the gc is not my area of expertise.
Sorry, I meant
std.gc.minimise(void[] array)
This function doesn't exist right now.
Weren't 'void[]'s banned?
May 02 2008
Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
Janice Caron wrote:
2008/4/30 Bruno Medeiros <brunodomedeiros+spam com.gmail>:
Also, is there a reason why these mutable functions shouldn't be in
std.string, together with their invariant/const brethren?
That's why we're having this discussion.
The idea is that std.string can be optimised for invariant strings,
while std.stringbuffer could be optimised for mutable strings. There
are pros and cons for separate modules. I don't think Walter wants
std.string "polluted" by all these functions he doesn't much care for.
Also, it would be bad if mutable versions were called "by mistake"
with consequent unexpected behavior.
"mutable versions were called "by mistake" "? I don't think that point
applies to D, after all, the purpose of the immutability system is for
the compiler to check that this won't happen, so unless there is some
compiler bug, that shouldn't happen in D.
But keep discussing. The people I want to hear from most are the
people calling for mutable string functions.
You may find that a large segment of those people are using Tango, and
so they might not participate much in this Phobos design issue discussion.
--
Bruno Medeiros - Software Developer, MSc. in CS/E graduate
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
May 01 2008
Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Bruno Medeiros wrote:
Janice Caron wrote:
2008/4/30 Bruno Medeiros <brunodomedeiros+spam com.gmail>:
Also, is there a reason why these mutable functions shouldn't be in
std.string, together with their invariant/const brethren?
That's why we're having this discussion.
The idea is that std.string can be optimised for invariant strings,
while std.stringbuffer could be optimised for mutable strings. There
are pros and cons for separate modules. I don't think Walter wants
std.string "polluted" by all these functions he doesn't much care for.
Also, it would be bad if mutable versions were called "by mistake"
with consequent unexpected behavior.
"mutable versions were called "by mistake" "? I don't think that point
applies to D, after all, the purpose of the immutability system is for
the compiler to check that this won't happen, so unless there is some
compiler bug, that shouldn't happen in D.
What if you wanted a modified copy of the input, but that input happened
to be mutable?
The modifying versions should have some distinguishing characteristic to
separate them from the COW versions. I'd say either a different function
name or an extra out-buffer parameter (as long as they still work if the
buffer is the same array as the normal input).
Any modifying versions would take mutable strings, COW version would require
invariant strings. They would be able to go in the same module, because
there would be no ambiguity.
But if you have non-modifying versions that you want to use on mutable
strings, those would most likely take a const pointer. Those would have to
be named differently than the invariant versions, because invariant
implicitly casts to const.
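To make the overloading question concrete, here is a minimal sketch, assuming a current D compiler (where "invariant" is spelled immutable and string is immutable(char)[]); the 2008 compiler may well have behaved differently. The function name which is made up purely for illustration. It shows that a mutable argument quietly selects the mutable overload, which is the "called by mistake" case being discussed, and that in today's D an exact match is preferred over the conversion to const.

```d
// Hypothetical function name, for illustration only.
string which(string s)        { return "invariant"; }
string which(char[] s)        { return "mutable"; }
string which(const(char)[] s) { return "const"; }

void main()
{
    char[] m = "abc".dup;      // mutable copy of a literal
    const(char)[] c = m;       // const view of the same data

    assert(which("abc") == "invariant"); // string literals are invariant
    assert(which(m) == "mutable");       // exact match wins over const
    assert(which(c) == "const");         // only the const overload fits
}
```

So a caller holding a char[] who wants a modified copy has to ask for it explicitly (for example by passing m.idup), which is an argument for giving the modifying versions their own names.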
Besides all this, it is good to separate them into 2 different modules
because the linker includes all functions that are in a module, not just
ones that are used. So if you are of the persuasion to only use mutable or
only use COW functions, then you probably don't want to link in the other
versions if you can help it.
-Steve
May 01 2008
Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
Frits van Bommel wrote:
Bruno Medeiros wrote:
Janice Caron wrote:
2008/4/30 Bruno Medeiros <brunodomedeiros+spam com.gmail>:
Also, is there a reason why these mutable functions shouldn't be in
std.string, together with their invariant/const brethren?
That's why we're having this discussion.
The idea is that std.string can be optimised for invariant strings,
while std.stringbuffer could be optimised for mutable strings. There
are pros and cons for separate modules. I don't think Walter wants
std.string "polluted" by all these functions he doesn't much care for.
Also, it would be bad if mutable versions were called "by mistake"
with consequent unexpected behavior.
"mutable versions were called "by mistake" "? I don't think that point
applies to D, after all, the purpose of the immutability system is for
the compiler to check that this won't happen, so unless there is some
compiler bug, that shouldn't happen in D.
What if you wanted a modified copy of the input, but that input happened
to be mutable?
Hum, I see what you mean, yes, that could happen.
The modifying versions should have some distinguishing characteristic to
separate them from the COW versions. I'd say either a different function
name or an extra out-buffer parameter (as long as they still work if the
buffer is the same array as the normal input).
Yes, the idea to distinguish them with a different name sounds good
(names like "doToUpper", maybe?). So that means you agree it should be
in the same package? :P
--
Bruno Medeiros - Software Developer, MSc. in CS/E graduate
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
May 01 2008
Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Bruno Medeiros wrote:
Frits van Bommel wrote:
The modifying versions should have some distinguishing characteristic
to separate them from the COW versions. I'd say either a different
function name or an extra out-buffer parameter (as long as they still
work if the buffer is the same array as the normal input).
Yes, the idea to distinguish them with a different name sounds good
(names like "doToUpper", maybe?). So that means you agree it should be
in the same package? :P
I don't like 'doToUpper', but something like 'makeUpper' could be a good
convention. That makes it pretty clear they're modifying the input, I think.
I don't particularly care what package they're in, but their names
should make it clear what they do. Especially if you're working with
both sets of functions in the same module...
Looking at the Phobos2 std.string docs, I do think some of those
functions could benefit from at least a const(char)[] overload so
they'll work with non-invariant parameters too. The ones that don't even
return string data[1] should probably just replace all invariant
parameters with const ones.
Of course, for the rest the return type of const overloads could be
debated. (First question: should they ever return a slice? If not,
should the return type be mutable or invariant[2]?)
[1]: In particular: inPattern(), size_t count*(), bool is*() and size_t
column() are the ones I saw.
[2]: It shouldn't be const though, that'd be pointless: returning newly
allocated memory as const means it's effectively invariant anyway.
2008/4/30 Bruno Medeiros <brunodomedeiros+spam com.gmail>:
Also, is there a reason why these mutable functions shouldn't be in
std.string, together with their invariant/const brethren?
That's why we're having this discussion.
The idea is that std.string can be optimised for invariant strings,
while std.stringbuffer could be optimised for mutable strings. There
are pros and cons for separate modules. I don't think Walter wants
std.string "polluted" by all these functions he doesn't much care for.
Also, it would be bad if mutable versions were called "by mistake"
with consequent unexpected behavior.
But keep discussing. The people I want to hear from most are the
people calling for mutable string functions.
If a separate namespace is deemed /essential/, then I see no reason to go
with, and certainly did not "ask for" it to be called, the misleading name
of std.StringBuffer. As far as I recall, that was Janice's own suggestion.
Yeah, I got that from an earlier post when someone said "What you need
is a string buffer" in response to some question.
The name can be anything we want it to be.
For preference, if a separate namespace is absolutely necessary, I go for:
std.string.mutable
Except "std.string.anything" :-)
"std.string" is a module, so it can't also be a package. That's a
limitation of the D language.
Finally, if the retention of unused but allocated space in an array is a
feature of the current design, then I would add a debug-time warning
indicating when a char[] has had to be grown. These could be used during
development to adjust the preallocated size of arrays to be large enough
to accommodate all (most? typical?) requirements.
I would support the addition of some function like
gc.minimise(char[])
which returned all the unused space following the end of the array
back to the gc, without any copying of the used part. I wouldn't be
able to write that though - the gc is not my area of expertise.
Sorry, I meant
std.gc.minimise(void[] array)
This function doesn't exist right now.
Tango even exposes
a GC.realloc() routine which will do this for you.
So does Phobos. std.gc.realloc().
However, realloc() doesn't promise not to copy, and not copying is the
objective. Thanks for all the cool info, but I just think programmers
would just feel more "comfortable" if, after they've done all their
in-place string manipulations, they can call some minimizing function,
even if only to give them a warm fuzzy feeling that they're not
wasting any more memory than is necessary.
Frankly, it could even be implemented as a do-nothing function. That way,
at least "blame" for excessive memory use passes from the programmer
to Phobos, and future gc implementations might do things differently.
"std.string" is a module, so it can't also be a package. That's a
limitation of the D language.
Now. This is where you show me up to be nothing but a pretender in this
forum.
I have no idea what the distinction is between those two in D.
One is a file, the other is a folder. std.string is a file, so it can't
also be a folder.
I /think/ you may have misunderstood my intent here. Unsurprising cos it
was badly outlined.
And I'm not at all sure that D works this way.
In, for example, Perl, an array can be pre-sized but then set to be empty.
That is, it can have space preallocated to it, but contain nothing.
Likewise strings have two length attributes internally.
- one denotes the length of the contents, as would be returned to the
program by the length() function.
- one indicates the actual length of the RAM allocated to it.
Well, that's what a StringBuffer would do, but nobody seemed to like
the idea. A string contains two pieces of information: (1) ptr, and
(2) length. A StringBuffer would carry a third piece of information:
(3) capacity. (Actually, in general it would be Buffer!(T), with
StringBuffer just being a special case).
Built-in strings do have a capacity, but it's not carried round in a
field. Instead, to find the capacity of an array, you have to call
std.gc.capacity(array) - and I can't see how there can not be a
performance hit there.
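For readers on a current compiler, a small sketch of the same idea. It assumes today's druntime, where the lookup is spelled as the built-in capacity property and reserve function rather than std.gc.capacity; that spelling postdates this thread. It only checks that an append which fits in the reserved block leaves the data where it was.

```d
void main()
{
    char[] s;
    s.reserve(64);              // ask the runtime for room for at least 64 chars
    assert(s.length == 0);      // nothing is visible yet
    assert(s.capacity >= 64);   // but the space is there (looked up at run time)

    s ~= "hello";
    auto p = s.ptr;
    s ~= ", world";             // fits inside the reserved block
    assert(s.ptr is p);         // so the data did not move
    assert(s == "hello, world");
}
```

Each read of capacity is a query to the runtime, not a field access, which is the cost being worried about above.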
Increasing the length of a D array doesn't necessarily mean
reallocating (although as noted above, the code has to do some work to
find out the capacity), but it /does/ mean re-initialising the newly
exposed elements. Again, that has to be a performance hit. With a
Buffer!(), you could increase the length (up to capacity) not only
without reallocating but also without reinitializing, just by changing
the value of an int.
But <shrugs> - if people don't want StringBuffers, who am I to argue?
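For what it's worth, the three-field idea is small enough to sketch. This is only an illustration under my own assumptions (the member names store, used and data are invented here, and a real Buffer!(T) would need appending, bounds policy and so on), not a proposal for the actual Phobos type.

```d
struct Buffer(T)
{
    private T[] store;    // backing storage; store.length plays the role of capacity
    private size_t used;  // user-visible length

    this(size_t cap) { store = new T[cap]; }

    @property size_t length() const { return used; }
    @property size_t capacity() const { return store.length; }

    // Changing the length within capacity only updates an integer:
    // no reallocation and no re-initialisation of the exposed elements.
    @property void length(size_t n)
    {
        if (n > store.length)
            store.length = n;   // only this path can reallocate
        used = n;
    }

    // The currently visible part of the buffer.
    inout(T)[] data() inout { return store[0 .. used]; }
}

void main()
{
    auto buf = Buffer!char(32);
    buf.length = 5;
    auto p = buf.data.ptr;
    buf.length = 20;                 // still within capacity
    assert(buf.data.ptr is p);       // same memory
    assert(buf.length == 20 && buf.capacity == 32);
}
```

The capacity here is a plain field read, which is the difference from the built-in array being pointed out above.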
char[] a = ...2000 chars from somewhere.
char[] field1 = a[ 312 .. 357 ];
field1.toUpper();
I've kind of lost track of the number of times I've said this in
recent days, but...
You cannot uppercase in place, because for any given dchar, c, the
number of UTF-8 bytes required to express c may be different from the
number of UTF-8 bytes required to express toupper(c).
If any of you have plans to uppercase or lowercase UTF-8 in place,
forget that now. It just ain't possible. (You can uppercase ASCII,
UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition,
is UTF-8).
What about inPlaceToUpperASCII(char[] str)?
in other words, yeah, toUpper can use a UTF-8 string, and return a UTF-8
string, but I can see use in having a function that expects to receive ASCII
and uppercases in-place. The function would be a lot simpler in any case :)
-Steve
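A minimal sketch of what such a function might look like, assuming the name suggested above (it is not an existing Phobos function). It touches only the bytes 'a' to 'z', so any multi-byte UTF-8 sequences in the slice pass through unchanged and the string stays valid UTF-8.

```d
// Hypothetical helper named after the suggestion above.
void inPlaceToUpperASCII(char[] str)
{
    foreach (ref c; str)
        if (c >= 'a' && c <= 'z')
            c -= 'a' - 'A';   // ASCII upper and lower case differ by 32
}

void main()
{
    char[] a = "name=walter;lang=d".dup;
    char[] field1 = a[5 .. 11];   // a slice shares memory with a
    inPlaceToUpperASCII(field1);
    assert(a == "name=WALTER;lang=d");
}
```

Because field1 is a slice of a, the change shows up in a without any allocation, which is the use case from the a[ 312 .. 357 ] example.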
2008/4/30 Matti Niemenmaa <see_signature for.real.address>:
It's possible that, in some obscure case, you can't uppercase UTF-16 in
place either.
Perhaps surprisingly, that's not so. This is because the alphabets of
*ALL* living languages exist within Unicode's "Basic Multilingual
Plane" (...which is to say, they can be encoded in a single wchar).
The characters outside the BMP (...those which need a dchar, not a
wchar...) are the letters of dead languages, or other special symbols.
The probability that a letter from a living language will uppercase to
a letter of a dead language is as near to zero as makes no odds.
Oh, sorry, I didn't read your whole post before replying. <embarrassed>.
OK, so private use characters might be a contrived exception. BUT,
nobody expects toUpper() to acknowledge private use characters. That
would require a run-time extensibility mechanism which is way beyond
what toUpper() does now, and likely beyond anything it's ever likely
to do any time soon. Maybe some future Unicode library with a
registerPrivateUseCharacters() function might cover that
functionality, but there are no plans for that on the table right now.
(And even then - as you say - it's a /very/ contrived case).
Matti Niemenmaa Wrote:
> A code point in the private use area (U+E000 to U+F8FF), which can be
> represented with one UTF-16 code unit, may uppercase to something in the
does this have any practical use?
Private use characters can be used for invented alphabets, e.g.
Klingon, or my-made-up-funky-alphabet. You can define them to be
whatever you want. However the mechanism for /interpreting/ such
characters is outside the scope of Unicode. All co-operating
applications have to have the same knowledge of what those characters
"mean".
Janice Caron Wrote:
> You cannot uppercase in place, because for any given dchar, c, the
> number of UTF-8 bytes required to express c may be different from the
> number of UTF-8 bytes required to express toupper(c).
really?
Yes really.
toUpper( '\u2C65' ) == '\u023A'
toLower( '\u023A' ) == '\u2C65'
'\u023A' requires two bytes in UTF-8
'\u2C65' requires three bytes in UTF-8
Not a problem in UTF-16, of course.
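The byte counts are easy to check. A small sketch, assuming today's Phobos, where the per-character functions live in std.uni and std.utf.codeLength reports the number of code units (those module locations postdate this thread):

```d
import std.uni : toLower, toUpper;
import std.utf : codeLength;

void main()
{
    dchar lower = '\u2C65';   // LATIN SMALL LETTER A WITH STROKE
    dchar upper = '\u023A';   // LATIN CAPITAL LETTER A WITH STROKE

    assert(toUpper(lower) == upper);
    assert(toLower(upper) == lower);

    assert(codeLength!char(upper) == 2);   // two bytes in UTF-8
    assert(codeLength!char(lower) == 3);   // three bytes in UTF-8

    // One code unit each in UTF-16, so no length change there.
    assert(codeLength!wchar(upper) == 1);
    assert(codeLength!wchar(lower) == 1);
}
```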
Apr 30 2008
"Adam D. Ruppe" <destructionator gmail.com> writes:
On Thu, May 01, 2008 at 02:19:51AM +0900, Bill Baxter wrote:
Herein lies the genius in Tango's naming conventions. You *can* have
both a package std.string, and a module named std.String. If you
consistently use different case for package and module names, then you
can have your cake and eat it too.
2008/5/1 Frits van Bommel <fvbommel remwovexcapss.nl>:
Actually, you can't uppercase UTF-16 and UTF-32 in-place either if you want
to be entirely correct. For example: \u00df ("ß") --> \u0053 \u0053 ("SS").
I know about that, and for the future I have plans for a proper
unicode lib with normalisation, full casing, etc. However - none of
that is the job of std.string.toUpper() or std.string.toLower(). These
functions only need to do /simple/ casing, not /full/ casing, and in
/simple/ casing, one dchar always maps to one dchar. In particular
'\u00DF' maps to '\u00DF'.
In full casing, toLower('\u1E9E') (LATIN CAPITAL LETTER SHARP S) is
'\u00DF' (LATIN SMALL LETTER SHARP S), but the converse is not true.
What fun! :-). But full casing is not the concern of std.string (nor
of std.stringbuffer, or whatever we end up calling it), so we don't
need to worry about that here.
On 01/05/2008, Spacen Jasset <spacenjasset yahoo.co.uk> wrote:
I think uppercasing non-ASCII (English) characters is more of a specialised
business anyway (some languages have no notion of upper case, and yet others
depend on context), which often should be performed by a presentation layer.
The Unicode Standard defines casing unambiguously for all characters.
Yes, toupper() of a Chinese character will leave it unchanged, but
it's still defined, and that is /not/ locale dependent.
However, casing in place is possible for UTF-8 if you're prepared to
throw an exception for those (extremely rare) cases when the sequence
length changes. So that means, you'd need two versions, the in-place
version
toUpperInPlace(char[] s) // might throw
and the general version
char[] toUpper(const(char)[] s, char[] buffer=null)
That could be done
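A rough sketch of the throwing in-place version, assuming today's std.utf decode/encode and std.uni.toUpper (simple, one-to-one casing). The function body is my own illustration of the signature above, not an agreed implementation, and it uses a plain Exception for brevity.

```d
import std.uni : toUpper;
import std.utf : decode, encode;

// Hypothetical implementation of the in-place signature proposed above.
void toUpperInPlace(char[] s)
{
    size_t i = 0;
    while (i < s.length)
    {
        immutable start = i;
        immutable dchar c = decode(s, i);       // advances i past one code point
        char[4] buf;
        immutable n = encode(buf, toUpper(c));  // UTF-8 length of the uppercase form
        if (n != i - start)
            throw new Exception("uppercase form has a different UTF-8 length");
        s[start .. i] = buf[0 .. n];            // same length, so overwrite in place
    }
}

void main()
{
    char[] ok = "straße".dup;
    toUpperInPlace(ok);
    assert(ok == "STRAßE");   // simple casing leaves U+00DF alone

    char[] bad = "\u2C65".dup;
    bool threw = false;
    try toUpperInPlace(bad);
    catch (Exception) threw = true;
    assert(threw);            // 3 bytes would have to become 2
}
```

One thing to decide for a real version: when it throws partway through, everything before the offending character has already been uppercased.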
Hi all,
More than one person has complained about the lack of string functions
in Phobos which operate on mutable chars. In the thread titled "Is all
this Invariant ****....", I suggested creating a new module,
std.stringbuffer, to contain two things:
(1) a StringBuffer class
(2) parallel mutable versions of the functions in std.string.
Walter OKed the idea, so it looks like that's a go. To that end, I've
looked through the functions in std.string and sorted them into
different groups. I think it's important to get the API right so
comments are welcome on all of the below:
I agree with this and will welcome the module. I've had to do some ugly
.idup and .dup around a compiler I coded to accommodate various
functions around Phobos (such as writeLine from OutputStream).
I'd like to suggest, though, the usage of template code:
T[] split(T)(in data)
and perform a static if inside. It'd save the hassle of maintaining two
modules separately, which are bound to have different functions some
day. For example, say that a function is added to std.string and not to
std.stringbuffer.
Also, it would be easier to maintain documentation consistency.
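To show what I mean by the static if, a minimal sketch assuming a current compiler and std.traits. The function name asciiUpper and the ASCII-only behaviour are just placeholders to keep it short; the point is one template that works in place for mutable input and on a copy otherwise.

```d
import std.traits : isMutable, Unqual;

// Hypothetical name; uppercases ASCII letters only.
auto asciiUpper(C)(C[] s) if (is(Unqual!C == char))
{
    static if (isMutable!C)
    {
        // Mutable input: modify in place and hand the same slice back.
        foreach (ref c; s)
            if (c >= 'a' && c <= 'z') c -= 32;
        return s;
    }
    else
    {
        // const or invariant input: work on a private copy.
        auto copy = s.dup;
        foreach (ref c; copy)
            if (c >= 'a' && c <= 'z') c -= 32;
        return copy;
    }
}

void main()
{
    char[] m = "abc".dup;
    auto r = asciiUpper(m);
    assert(r is m && m == "ABC");      // same memory, changed in place

    string s = "abc";
    auto u = asciiUpper(s);
    assert(u == "ABC" && s == "abc");  // original untouched
}
```

Whether silently switching behaviour on mutability is a good idea is exactly what is being argued elsewhere in the thread, so take it as a sketch of the mechanism only.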
On an extra note, ASCII and UTF variants could be taken care of in a single
function.
That would require a lot of work though. Well, should you require
assistance, gimme a shout.
Cheers