|
Archives
D Programming
digitalmars.D
digitalmars.D.bugs
digitalmars.D.dtl
digitalmars.D.ide
digitalmars.D.dwt
digitalmars.D.announce
digitalmars.D.learn
digitalmars.D.debugger
D.gnu
D
C/C++ Programming
c++
c++.announce
c++.atl
c++.beta
c++.chat
c++.command-line
c++.dos
c++.dos.16-bits
c++.dos.32-bits
c++.idde
c++.mfc
c++.rtl
c++.stl
c++.stl.hp
c++.stl.port
c++.stl.sgi
c++.stlsoft
c++.windows
c++.windows.16-bits
c++.windows.32-bits
c++.wxwindows
digitalmars.empire
digitalmars.DMDScript
electronics
|
digitalmars.D - Making all strings UTF ranges has some risk of WTF
It's no secret that string et al. are not a magic recipe for writing
correct Unicode code. However, things are pretty good and could be
further improved by operating the following changes in std.array and
std.range:
- make front() and back() for UTF-8 and UTF-16 automatically decode the
first and last Unicode character
- make popFront() and popBack() skip one entire Unicode character
(instead of just one code unit)
- alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
- change hasLength to return false for UTF-8 and UTF-16 strings
These changes effectively make UTF-8 and UTF-16 bidirectional ranges,
with the quirk that you still have a sort of a random-access operator.
I'm very strongly in favor of this change. Bidirectional strings allow
beautiful correct algorithms to be written that handle encoded strings
without any additional effort; with these changes, everything applicable
of std.algorithm works out of the box (with the appropriate fixes here
and there), which is really remarkable.
The remaining WTF is the length property. Traditionally, a range
offering length also implies the expectation that a range of length n
allows you to call popFront n times and then assert that the range is
empty. However, if you check e.g. hasLength!string it will yield false,
although the string does have an accessible member by that name and of
the appropriate type.
Although Phobos always checks its assumptions, people might occasionally
write code that just uses .length without checking hasLength. Then,
they'll be annoyed when the code fails with UTF-8 and UTF-16 strings.
(The "real" length of the range is not stored, but can be computed by
using str.walkLength() in std.range.)
What can be done about that? I see a number of solutions:
(a) Do not operate the change at all.
(b) Operate the change and mention that in range algorithms you should
check hasLength and only then use "length" under the assumption that it
really means "elements count".
(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
a different name for that. Any other name (codeUnits, codes etc.) would
do. The entire point is to not make algorithms believe strings have a
.length property.
(d) Have std.range define a distinct property called e.g. "count" and
then specialize it appropriately. Then change all references to .length
in std.algorithm and elsewhere to .count.
What would you do? Any ideas are welcome.
Andrei
On Wed, 03 Feb 2010 21:00:21 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
(a) Do not operate the change at all.
(b) Operate the change and mention that in range algorithms you should
check hasLength and only then use "length" under the assumption that it
really means "elements count".
(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
a different name for that. Any other name (codeUnits, codes etc.) would
do. The entire point is to not make algorithms believe strings have a
.length property.
(d) Have std.range define a distinct property called e.g. "count" and
then specialize it appropriately. Then change all references to .length
in std.algorithm and elsewhere to .count.
What would you do? Any ideas are welcome.
I like b) and d), with a slight preference for d. I think the benefits of
strings being encoding correct and able to use std.algorithm outweighs the
disadvantages. And making char[] different from T[] is going to play havoc
with templated algorithms. Another alternative is to remove the char types
from the language and implement them as library ranges.
Andrei Alexandrescu Wrote:
What can be done about that? I see a number of solutions:
(a) Do not operate the change at all.
(b) Operate the change and mention that in range algorithms you should
check hasLength and only then use "length" under the assumption that it
really means "elements count".
(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
a different name for that. Any other name (codeUnits, codes etc.) would
do. The entire point is to not make algorithms believe strings have a
.length property.
(d) Have std.range define a distinct property called e.g. "count" and
then specialize it appropriately. Then change all references to .length
in std.algorithm and elsewhere to .count.
What would you do? Any ideas are welcome.
Andrei
I choose (b)
ZY Zhou wrote:
Andrei Alexandrescu Wrote:
What can be done about that? I see a number of solutions:
(a) Do not operate the change at all.
(b) Operate the change and mention that in range algorithms you should
check hasLength and only then use "length" under the assumption that it
really means "elements count".
(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
a different name for that. Any other name (codeUnits, codes etc.) would
do. The entire point is to not make algorithms believe strings have a
.length property.
(d) Have std.range define a distinct property called e.g. "count" and
then specialize it appropriately. Then change all references to .length
in std.algorithm and elsewhere to .count.
What would you do? Any ideas are welcome.
Andrei
I choose (b)
Perfect. Then you'll be glad to know that I changed all of Phobos to
support it. All unittests pass now, but I suspect there are a couple of
bugs left.
The change will be part of the next release. I'll commit during the
weekend. This is a very exciting development, I was very unhappy about
the interaction between std.algorithm and strings.
Andrei
Andrei Alexandrescu wrote:
...
What can be done about that? I see a number of solutions:
(a) Do not operate the change at all.
(b) Operate the change and mention that in range algorithms you should
check hasLength and only then use "length" under the assumption that it
really means "elements count".
(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
a different name for that. Any other name (codeUnits, codes etc.) would
do. The entire point is to not make algorithms believe strings have a
.length property.
(d) Have std.range define a distinct property called e.g. "count" and
then specialize it appropriately. Then change all references to .length
in std.algorithm and elsewhere to .count.
What would you do? Any ideas are welcome.
Andrei
I'm leaning towards (c) here.
To me the .length on char[] and wchar[] are kinda like doing this:
struct SomePOD
{
int a, b;
double y;
}
SomePOD pod;
auto len = pod.length;
assert(len == 16); // true.
I'll admit it's not a perfect analogy. What I'm playing on here is that
the .length on char[] and wchar[] returns the /size of/ the string in
bytes rather than the /length/ of the string in number of (well-formed)
characters.
Unfortunately .sizeof is supposed to return the size of the string's
reference (8 bytes on x86 systems) and not the size of the string, IIRC.
So that's taken.
So perhaps a .bytes or .nbytes property. Maybe make it work for arrays
of structs and things like that too. A tuple (or any container) of
non-homogeneous elements could probably benefit from this property as well.
Given such a property being available, I wouldn't miss .length at all.
It's quite misleading.
Chad J wrote:
Andrei Alexandrescu wrote:
...
What can be done about that? I see a number of solutions:
(a) Do not operate the change at all.
(b) Operate the change and mention that in range algorithms you should
check hasLength and only then use "length" under the assumption that it
really means "elements count".
(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
a different name for that. Any other name (codeUnits, codes etc.) would
do. The entire point is to not make algorithms believe strings have a
.length property.
(d) Have std.range define a distinct property called e.g. "count" and
then specialize it appropriately. Then change all references to .length
in std.algorithm and elsewhere to .count.
What would you do? Any ideas are welcome.
Andrei
I'm leaning towards (c) here.
To me the .length on char[] and wchar[] are kinda like doing this:
struct SomePOD
{
int a, b;
double y;
}
SomePOD pod;
auto len = pod.length;
assert(len == 16); // true.
I'll admit it's not a perfect analogy. What I'm playing on here is that
the .length on char[] and wchar[] returns the /size of/ the string in
bytes rather than the /length/ of the string in number of (well-formed)
characters.
Unfortunately .sizeof is supposed to return the size of the string's
reference (8 bytes on x86 systems) and not the size of the string, IIRC.
So that's taken.
So perhaps a .bytes or .nbytes property. Maybe make it work for arrays
of structs and things like that too. A tuple (or any container) of
non-homogeneous elements could probably benefit from this property as well.
Given such a property being available, I wouldn't miss .length at all.
It's quite misleading.
I hear you. Actually, to either quench or add to the confusion, .length
for wstring returns the length in 16-bit units, not bytes.
Andrei
Andrei Alexandrescu wrote:
...
I hear you. Actually, to either quench or add to the confusion, .length
for wstring returns the length in 16-bit units, not bytes.
Andrei
0.o
...
kay.
Andrei Alexandrescu wrote:
What can be done about that? I see a number of solutions:
(a) Do not operate the change at all.
(b) Operate the change and mention that in range algorithms you should
check hasLength and only then use "length" under the assumption that it
really means "elements count".
(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
a different name for that. Any other name (codeUnits, codes etc.) would
do. The entire point is to not make algorithms believe strings have a
.length property.
(d) Have std.range define a distinct property called e.g. "count" and
then specialize it appropriately. Then change all references to .length
in std.algorithm and elsewhere to .count.
What would you do? Any ideas are welcome.
Change the type of string literals from char[] (or whatever the string
type is in D2) to a wrapper struct defined in object.d:
struct string {
char[] raw;
}
Now string.length is invalid, and you don't have to do weird stuff as in
(b) or (c).
From here on, you could do 2 things:
1. add accessor methods to string like string classes in other languages do
2. leave the wrapper struct as it is (just add the required range foo),
and require the user to use either a) the range API (with utf-8 decoding
etc.) or b) access the raw "byte" string with string.raw.
I really liked how strings were simply char[]'s, but now with immutable,
there's a lot of noise around it anyway, and there's no real value to
strings being array slices anymore. Making the user deal directly with
utf-8 was probably a bad idea to begin with.
Am 04.02.2010, 04:05 Uhr, schrieb grauzone <none example.net>:
Andrei Alexandrescu wrote:
What can be done about that? I see a number of solutions:
(a) Do not operate the change at all.
(b) Operate the change and mention that in range algorithms you should
check hasLength and only then use "length" under the assumption that it
really means "elements count".
(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and
define a different name for that. Any other name (codeUnits, codes
etc.) would do. The entire point is to not make algorithms believe
strings have a .length property.
(d) Have std.range define a distinct property called e.g. "count" and
then specialize it appropriately. Then change all references to .length
in std.algorithm and elsewhere to .count.
What would you do? Any ideas are welcome.
Definitely against (c)+(d).
Change the type of string literals from char[] (or whatever the string
type is in D2) to a wrapper struct defined in object.d:
struct string {
char[] raw;
}
That sounds like a really reasonable way to me.
grauzone <none example.net> wrote:
Change the type of string literals from char[] (or whatever the string
type is in D2) to a wrapper struct defined in object.d:
struct string {
char[] raw;
}
Now string.length is invalid, and you don't have to do weird stuff as in
(b) or (c).
From here on, you could do 2 things:
1. add accessor methods to string like string classes in other languages
do
2. leave the wrapper struct as it is (just add the required range foo),
and require the user to use either a) the range API (with utf-8 decoding
etc.) or b) access the raw "byte" string with string.raw.
This seems to me a job for disable:
struct string {
immutable( char )[] payload;
alias payload this;
disable int length( ) { return 0; }
}
--
Simen
Andrei Alexandrescu wrote:
It's no secret that string et al. are not a magic recipe for writing
correct Unicode code.
I'm concerned it would be slow. Most operations on strings do not need
to decode the unicode characters, for example, find, startsWith, etc.,
do not. Decoding then doing find, startsWith, etc., will be considerably
slower.
Walter Bright wrote:
Andrei Alexandrescu wrote:
It's no secret that string et al. are not a magic recipe for writing
correct Unicode code.
I'm concerned it would be slow. Most operations on strings do not need
to decode the unicode characters, for example, find, startsWith, etc.,
do not. Decoding then doing find, startsWith, etc., will be considerably
slower.
I thought you're going to say that, but fortunately it's easy to
special-case certain algorithms for strings during compilation. In fact
I already did - for example, Boyer-Moore searching would be very
difficult to rewrite for variable-length characters, but there's no need
for it. I special-cased that algorithm.
I believe this is a good strategy.
Andrei
Walter Bright wrote:
Andrei Alexandrescu wrote:
It's no secret that string et al. are not a magic recipe for writing
correct Unicode code.
I'm concerned it would be slow. Most operations on strings do not need
to decode the unicode characters, for example, find, startsWith, etc.,
do not. Decoding then doing find, startsWith, etc., will be considerably
slower.
Oh, one more thing: doing mixed-width searches would require decoding.
Andrei
Andrei Alexandrescu wrote:
Oh, one more thing: doing mixed-width searches would require decoding.
Or a conversion before the loop starts of the search term.
Walter Bright wrote:
Andrei Alexandrescu wrote:
Oh, one more thing: doing mixed-width searches would require decoding.
Or a conversion before the loop starts of the search term.
That triggers memory allocation plus the same cost. It's not likely to
work any better.
Andrei
Andrei Alexandrescu Wrote:
Walter Bright wrote:
Andrei Alexandrescu wrote:
Oh, one more thing: doing mixed-width searches would require decoding.
Or a conversion before the loop starts of the search term.
That triggers memory allocation plus the same cost. It's not likely to
work any better.
Andrei
Exactly. Please don't go down the route that Microsoft did for regex case
insensitive searching (at least when I last looked at it) - they made the
*input string* all lower case... D'oh!
Regards,
Ben
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
It's no secret that string et al. are not a magic recipe for writing
correct Unicode code. However, things are pretty good and could be
further improved by operating the following changes in std.array and
std.range:
- make front() and back() for UTF-8 and UTF-16 automatically decode the
first and last Unicode character
- make popFront() and popBack() skip one entire Unicode character
(instead of just one code unit)
- alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
- change hasLength to return false for UTF-8 and UTF-16 strings
These changes effectively make UTF-8 and UTF-16 bidirectional ranges,
with the quirk that you still have a sort of a random-access operator.
I'm very strongly in favor of this change. Bidirectional strings allow
beautiful correct algorithms to be written that handle encoded strings
without any additional effort; with these changes, everything applicable
of std.algorithm works out of the box (with the appropriate fixes here
and there), which is really remarkable.
The remaining WTF is the length property. Traditionally, a range
offering length also implies the expectation that a range of length n
allows you to call popFront n times and then assert that the range is
empty. However, if you check e.g. hasLength!string it will yield false,
although the string does have an accessible member by that name and of
the appropriate type.
Although Phobos always checks its assumptions, people might occasionally
write code that just uses .length without checking hasLength. Then,
they'll be annoyed when the code fails with UTF-8 and UTF-16 strings.
(The "real" length of the range is not stored, but can be computed by
using str.walkLength() in std.range.)
What can be done about that? I see a number of solutions:
(a) Do not operate the change at all.
(b) Operate the change and mention that in range algorithms you should
check hasLength and only then use "length" under the assumption that it
really means "elements count".
(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
a different name for that. Any other name (codeUnits, codes etc.) would
do. The entire point is to not make algorithms believe strings have a
.length property.
(d) Have std.range define a distinct property called e.g. "count" and
then specialize it appropriately. Then change all references to .length
in std.algorithm and elsewhere to .count.
What would you do? Any ideas are welcome.
Andrei
I personally would find this extremely annoying because most of the code I write
that involves strings is scientific computing code that will never be
internationalized, let alone released to the general public. I basically just
use
ASCII because it's all I need and if your UTF-8 string contains only ASCII
characters, it can be treated as random-access. I don't know how many people
out
there are in similar situations, but I doubt they'll be too happy.
On the other hand, I guess it wouldn't be hard to write a simple wrapper struct
on
top of immutable(ubyte)[] and call it AsciiString. Once alias this gets fully
debugged, I could even make it implicitly convert to immutable(char)[].
dsimcha wrote:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
It's no secret that string et al. are not a magic recipe for writing
correct Unicode code. However, things are pretty good and could be
further improved by operating the following changes in std.array and
std.range:
- make front() and back() for UTF-8 and UTF-16 automatically decode the
first and last Unicode character
- make popFront() and popBack() skip one entire Unicode character
(instead of just one code unit)
- alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
- change hasLength to return false for UTF-8 and UTF-16 strings
These changes effectively make UTF-8 and UTF-16 bidirectional ranges,
with the quirk that you still have a sort of a random-access operator.
I'm very strongly in favor of this change. Bidirectional strings allow
beautiful correct algorithms to be written that handle encoded strings
without any additional effort; with these changes, everything applicable
of std.algorithm works out of the box (with the appropriate fixes here
and there), which is really remarkable.
The remaining WTF is the length property. Traditionally, a range
offering length also implies the expectation that a range of length n
allows you to call popFront n times and then assert that the range is
empty. However, if you check e.g. hasLength!string it will yield false,
although the string does have an accessible member by that name and of
the appropriate type.
Although Phobos always checks its assumptions, people might occasionally
write code that just uses .length without checking hasLength. Then,
they'll be annoyed when the code fails with UTF-8 and UTF-16 strings.
(The "real" length of the range is not stored, but can be computed by
using str.walkLength() in std.range.)
What can be done about that? I see a number of solutions:
(a) Do not operate the change at all.
(b) Operate the change and mention that in range algorithms you should
check hasLength and only then use "length" under the assumption that it
really means "elements count".
(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
a different name for that. Any other name (codeUnits, codes etc.) would
do. The entire point is to not make algorithms believe strings have a
.length property.
(d) Have std.range define a distinct property called e.g. "count" and
then specialize it appropriately. Then change all references to .length
in std.algorithm and elsewhere to .count.
What would you do? Any ideas are welcome.
Andrei
I personally would find this extremely annoying because most of the code I
write
that involves strings is scientific computing code that will never be
internationalized, let alone released to the general public. I basically just
use
ASCII because it's all I need and if your UTF-8 string contains only ASCII
characters, it can be treated as random-access. I don't know how many people
out
there are in similar situations, but I doubt they'll be too happy.
On the other hand, I guess it wouldn't be hard to write a simple wrapper
struct on
top of immutable(ubyte)[] and call it AsciiString. Once alias this gets fully
debugged, I could even make it implicitly convert to immutable(char)[].
It's definitely going to be easy to use all sensible algorithms with
immutable(ubyte)[]. But even if you go with string, there should be no
problem at all. Remember, telling ASCII from UTF is one mask and one
test away, and the way Walter and I wrote virtually all related routines
was to special-case ASCII. In most cases I don't think you'll notice a
decrease in performance.
Andrei
On Wed, 03 Feb 2010 23:41:02 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> wrote:
dsimcha wrote:
I personally would find this extremely annoying because most of the
code I write
that involves strings is scientific computing code that will never be
internationalized, let alone released to the general public. I
basically just use
ASCII because it's all I need and if your UTF-8 string contains only
ASCII
characters, it can be treated as random-access. I don't know how many
people out
there are in similar situations, but I doubt they'll be too happy.
On the other hand, I guess it wouldn't be hard to write a simple
wrapper struct on
top of immutable(ubyte)[] and call it AsciiString. Once alias this
gets fully
debugged, I could even make it implicitly convert to immutable(char)[].
It's definitely going to be easy to use all sensible algorithms with
immutable(ubyte)[]. But even if you go with string, there should be no
problem at all. Remember, telling ASCII from UTF is one mask and one
test away, and the way Walter and I wrote virtually all related routines
was to special-case ASCII. In most cases I don't think you'll notice a
decrease in performance.
I'm in the same camp as dsimcha, I generally write all my apps assuming
ASCII strings (most are internal tools anyways).
Can the compiler help making ASCII strings easier to use? i.e., this
already works:
wstring s = "hello"; // converts to immutable(wchar)[]
what about this?
asciistring a = "hello"; // converts to immutable(ubyte)[] (or
immutable(ASCIIChar)[])
asciistring a = "\uFBCD"; // error, requires cast.
The only issue that remains to be resolved then is the upgradability that
ascii characters currently enjoy for utf8. I.e. I can call any utf-8
accepting function with an ASCII string, but not an ASCII string accepting
function with utf-8 data.
Ideally, there should be a 7-bit ASCII character type that implicitly
upconverts to char, and can be initialized with a string literal.
In addition, you are putting D's utf8 char even further away from C's
ASCII char. It would be nice to separate compatible C strings from d
strings. At some point, I should be able to designate a function (even a
C function) takes only ASCII data, and the compiler should disallow
passing general utf8 data into it. This involves either renaming D's char
to keep source closer to C, or rewriting C function signatures to reflect
the difference.
-Steve
On 2010-02-03 21:00:21 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> said:
It's no secret that string et al. are not a magic recipe for writing
correct Unicode code.
[...]
What would you do? Any ideas are welcome.
UTF-8 and UTF-16 encodings are interesting beasts. If you have a UTF-8
string and want to search for an occurrence of that string in another
UTF-8 string, you don't have to decode each multi-byte code-points: a
binary comparison is enough. If you're counting counting the number of
code points, then all you need is to count the number of code unit with
the most significant bit set to zero. If on the other hand you're
applying a character-by-character transformation, then you need to
fully decode each character, unless you're only interested in
transforming characters from the lower non-multibyte subrange of the
encoding (which happens quite often).
Clearly, I don't think there's a one-size-fit-all way to iterate over
string arrays. Fully decoding each code unit is clearly the most costly
method; it shouldn't be required when its not necessary.
I think we need to be able to represent char[] and wchar[] as a range
of dchar to deal with cases where you want to iterate over Unicode code
points, but I'd let the programmer ultimately decide what to do.
As for .length, I'll say that removing this property would make it hard
to write low-level code. For instance, if I copy a string into a
buffer, I need to know the length in bytes (array.length *
sizeof(array[0])), not the number of characters. So it doesn't make
much sense to disable .length.
So my answer would be mostly to leave things as they are.
Perhaps the char[] and wchar[] as dchar ranges could be aliased to
string and wstring, but that'd definitely be a blow to the philosophy
of strings as simple arrays. You'd also still need to be able to access
the actual array underneath. And will all the implicit conversions
still work? I'm really not sure it's worth it, but perhaps.
--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Michel Fortin wrote:
...
As for .length, I'll say that removing this property would make it hard
to write low-level code. For instance, if I copy a string into a buffer,
I need to know the length in bytes (array.length * sizeof(array[0])),
not the number of characters. So it doesn't make much sense to disable
.length.
...
This would be under the condition that there is another property that
does the same (or similar) thing. It's the part where he said, "and
define a different name for that".
You could also just reinterpret them as byte[] or ubyte[]. Then they
will behave in a low level way. No surprises.
Andrei Alexandrescu wrote:
- make front() and back() for UTF-8 and UTF-16 automatically decode the
first and last Unicode character
- make popFront() and popBack() skip one entire Unicode character
(instead of just one code unit)
- alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
- change hasLength to return false for UTF-8 and UTF-16 strings
These are all fine for a dedicated string type. They're horrible for
generic arrays, for the following reasons:
- They break generic code.
- They make it impossible to manipulate an array of code units as an
array of code units.
--
Rainer Deyke - rainerd eldwood.com
Rainer Deyke wrote:
Andrei Alexandrescu wrote:
- make front() and back() for UTF-8 and UTF-16 automatically decode the
first and last Unicode character
- make popFront() and popBack() skip one entire Unicode character
(instead of just one code unit)
- alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
- change hasLength to return false for UTF-8 and UTF-16 strings
These are all fine for a dedicated string type. They're horrible for
generic arrays, for the following reasons:
- They break generic code.
- They make it impossible to manipulate an array of code units as an
array of code units.
Arrays of char and wchar are not quite generic - they are definitely UTF
strings.
Andrei
Andrei Alexandrescu wrote:
Arrays of char and wchar are not quite generic - they are definitely UTF
strings.
A 'char' is a single utf-8 code unit. A 'char[]' is (or should be) a
generic array of utf-8 code units. Sometimes these code units line up
to form valid unicode code points, sometimes they don't.
If you want a data type that always contains a valid utf-8 string, don't
call it 'char[]'. It's misleading, it breaks generic code, and it
renders built-in arrays useless for when you actually want an array of
utf-8 code units. It's the same mistake as std::vector<bool> in C++,
but much worse.
--
Rainer Deyke - rainerd eldwood.com
Rainer Deyke wrote:
Andrei Alexandrescu wrote:
Arrays of char and wchar are not quite generic - they are definitely UTF
strings.
A 'char' is a single utf-8 code unit. A 'char[]' is (or should be) a
generic array of utf-8 code units. Sometimes these code units line up
to form valid unicode code points, sometimes they don't.
If you want a data type that always contains a valid utf-8 string, don't
call it 'char[]'. It's misleading, it breaks generic code, and it
renders built-in arrays useless for when you actually want an array of
utf-8 code units. It's the same mistake as std::vector<bool> in C++,
but much worse.
I agree up to the assessment of the size of the problem and a couple of
other points. I've had a great time writing utf code in D with string.
Getting back to C++'s std::string really put things in perspective.
If your purpose is to store some disparate utf-8 code units (a need that
I've never had), I see no problem with storing then as ubyte[].
Andrei
Andrei Alexandrescu wrote:
It's no secret that string et al. are not a magic recipe for writing
correct Unicode code. However, things are pretty good and could be
further improved by operating the following changes in std.array and
std.range:
- make front() and back() for UTF-8 and UTF-16 automatically decode the
first and last Unicode character
They would yield dchar, right? Wouldn't that cause trouble in templated
code?
- make popFront() and popBack() skip one entire Unicode character
(instead of just one code unit)
That's perfectly fine, because the opposite operations do "encode":
string s = "aÄŸ";
assert(s.length == 3);
s ~= 'ÅŸ';
assert(s.length == 5);
- alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
Ok.
- change hasLength to return false for UTF-8 and UTF-16 strings
I don't understand that one. strings have lengths. Adding and removing
does not alter length by 1 for those types. I don't think it's a big
deal. It is already so in the language for those types. dstring does not
have that problem and could be used when by-1 change is desired.
(b) Operate the change and mention that in range algorithms you should
check hasLength and only then use "length" under the assumption that it
really means "elements count".
The change sounds ok and hasLength should yield true. Or... can it
return an enum { no, kind_of, yes } ;)
Current utf.decode takes the index by reference and modifies it by the
amount. Could popFront() do something similar?
I think that's it: front() and popFront() are separated for cohesion.
What is causing trouble here is the separation of "by-N" from popFront().
You are concerned that the user makes the assumption and popFront() will
reduce by 1. I think that is the problem here.
How about something like:
// returns the amount that the next popFront() will reduce length
int nextStep();
Ali
Ali Çehreli wrote:
Andrei Alexandrescu wrote:
> It's no secret that string et al. are not a magic recipe for writing
> correct Unicode code. However, things are pretty good and could be
> further improved by operating the following changes in std.array and
> std.range:
>
> - make front() and back() for UTF-8 and UTF-16 automatically decode the
> first and last Unicode character
They would yield dchar, right? Wouldn't that cause trouble in templated
code?
Yes, dchar. There was some figuring out in parts of Phobos, but the
gains are well worth it.
The simplifications are enormous. Until now, Phobos didn't hit the nail
on the head with simple encoding/decoding/transcoding primitives. There
were many attempts in std.utf, std.encoding, and std.string - all very
clunky to use. Now I can just write s.front to get the first dchar of
any string, and s.popFront to drop it. Very simple!
> - make popFront() and popBack() skip one entire Unicode character
> (instead of just one code unit)
That's perfectly fine, because the opposite operations do "encode":
string s = "aÄŸ";
assert(s.length == 3);
s ~= 'ÅŸ';
assert(s.length == 5);
> - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
Ok.
> - change hasLength to return false for UTF-8 and UTF-16 strings
I don't understand that one. strings have lengths. Adding and removing
does not alter length by 1 for those types. I don't think it's a big
deal. It is already so in the language for those types. dstring does not
have that problem and could be used when by-1 change is desired.
hasLength is a property used by range algorithms to tell them that a
range stores the length with a particular meaning (the number of
elements). It is perfectly fine that strings don't obey hasLength but do
expose .length - it's just that it has different semantics.
> (b) Operate the change and mention that in range algorithms you should
> check hasLength and only then use "length" under the assumption that it
> really means "elements count".
The change sounds ok and hasLength should yield true. Or... can it
return an enum { no, kind_of, yes } ;)
Current utf.decode takes the index by reference and modifies it by the
amount. Could popFront() do something similar?
I think we could dedicate a special function for that. In fact it does
exist I think - it's called stride().
Andrei
Ali Çehreli wrote:
...
The change sounds ok and hasLength should yield true. Or... can it
return an enum { no, kind_of, yes } ;)
...
Ali
I believe what you're after is this:
enum Bool
{
True,
False,
FileNotFound
}
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
What can be done about that? I see a number of solutions:
(a) Do not operate the change at all.
(b) Operate the change and mention that in range algorithms you should
check hasLength and only then use "length" under the assumption that it
really means "elements count".
(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
a different name for that. Any other name (codeUnits, codes etc.) would
do. The entire point is to not make algorithms believe strings have a
.length property.
(d) Have std.range define a distinct property called e.g. "count" and
then specialize it appropriately. Then change all references to .length
in std.algorithm and elsewhere to .count.
What would you do? Any ideas are welcome.
Of the above, I feel (b) is the correct solution, and I understand it
has already been implemented in svn.
--
Simen
Simen kjaeraas:
Of the above, I feel (b) is the correct solution, and I understand it
has already been implemented in svn.
Yes, I presume he was mostly looking for a justification of his ideas he has
already accepted and even partially implemented :-)
Bye,
bearophile
bearophile wrote:
Simen kjaeraas:
Of the above, I feel (b) is the correct solution, and I understand
it has already been implemented in svn.
Yes, I presume he was mostly looking for a justification of his ideas
he has already accepted and even partially implemented :-)
I am ready to throw away the implementation as soon as a better idea
comes around. As other times, I operated the change to see how things
feel with the new approach.
Generally it feels like the new state of affairs is a solid improvement.
One recurring problem has been that some code has assumed that
ElementType!SomeString has the width of one encoding unit. That
assumption is no longer true so I had to change such code with
typeof(SomeString.init[0]). Probably I'll abstract that as
CodeUnit!SomeString in std.traits.
I also found some bugs; for example Levenshtein distance was erroneous
because it didn't operate at character level. The fix using front and
popFront was very simple.
Regarding defining an entire new struct for strings, I think that's a
sensible approach. With the new operators in tow, UString (universal
string) that traffics in dchar and makes representation a detail would
be nicely implementable. It could even have mutable elements at dchar
granularity. My feeling is, however, that at this point too much
toothpaste is out of the tube for that to happen in D2. That would be
offset if current strings were unbearable, but I think they're working
very well.
Andrei
On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> said:
bearophile wrote:
Simen kjaeraas:
Of the above, I feel (b) is the correct solution, and I understand
it has already been implemented in svn.
Yes, I presume he was mostly looking for a justification of his ideas
he has already accepted and even partially implemented :-)
I am ready to throw away the implementation as soon as a better idea
comes around. As other times, I operated the change to see how things
feel with the new approach.
Has any thought been given to foreach? Currently all these work for strings:
foreach (c; "abc") { } // typeof(c) is 'char'
foreach (char c; "abc") { }
foreach (wchar c; "abc") { }
foreach (dchar c; "abc") { }
I'm concerned about the first case where the element type is implicit.
The implicit element type is (currently) the code units. If the range
use code points 'dchar' as the element type, then I think foreach needs
to be changed so that the default element type is 'dchar' too (in the
first line of my example). Having ranges and foreach disagree on this
would be very inconsistent. Of course you should be allowed to iterate
using 'char' and 'wchar' too.
I think this would fit nicely. I was surprised at first when learning D
and I noticed that foreach didn't do this, that I had to explicitly has
for it.
--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Michel Fortin wrote:
On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> said:
bearophile wrote:
Simen kjaeraas:
Of the above, I feel (b) is the correct solution, and I understand
it has already been implemented in svn.
Yes, I presume he was mostly looking for a justification of his ideas
he has already accepted and even partially implemented :-)
I am ready to throw away the implementation as soon as a better idea
comes around. As other times, I operated the change to see how things
feel with the new approach.
Has any thought been given to foreach? Currently all these work for
strings:
foreach (c; "abc") { } // typeof(c) is 'char'
foreach (char c; "abc") { }
foreach (wchar c; "abc") { }
foreach (dchar c; "abc") { }
I'm concerned about the first case where the element type is implicit.
The implicit element type is (currently) the code units. If the range
use code points 'dchar' as the element type, then I think foreach needs
to be changed so that the default element type is 'dchar' too (in the
first line of my example). Having ranges and foreach disagree on this
would be very inconsistent. Of course you should be allowed to iterate
using 'char' and 'wchar' too.
I think this would fit nicely. I was surprised at first when learning D
and I noticed that foreach didn't do this, that I had to explicitly has
for it.
This is a good point. I'm in favor of changing the language to make the
implicit type dchar.
Andrei
Andrei Alexandrescu wrote:
Michel Fortin wrote:
On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> said:
bearophile wrote:
Simen kjaeraas:
Of the above, I feel (b) is the correct solution, and I understand
it has already been implemented in svn.
Yes, I presume he was mostly looking for a justification of his ideas
he has already accepted and even partially implemented :-)
I am ready to throw away the implementation as soon as a better idea
comes around. As other times, I operated the change to see how things
feel with the new approach.
Has any thought been given to foreach? Currently all these work for
strings:
foreach (c; "abc") { } // typeof(c) is 'char'
foreach (char c; "abc") { }
foreach (wchar c; "abc") { }
foreach (dchar c; "abc") { }
I'm concerned about the first case where the element type is implicit.
The implicit element type is (currently) the code units. If the range
use code points 'dchar' as the element type, then I think foreach
needs to be changed so that the default element type is 'dchar' too
(in the first line of my example). Having ranges and foreach disagree
on this would be very inconsistent. Of course you should be allowed to
iterate using 'char' and 'wchar' too.
I think this would fit nicely. I was surprised at first when learning
D and I noticed that foreach didn't do this, that I had to explicitly
has for it.
This is a good point. I'm in favor of changing the language to make the
implicit type dchar.
Andrei
We seem to be approaching the point where char[], wchar[] and dchar[]
are all arrays of dchar, but with different levels of compression.
It makes me wonder if the char, wchar types actually make any sense.
If char[] is actually a UTF string, then char[] ~ char should be
permitted ONLY if char can be implicitly converted to dchar. Otherwise,
you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will
not necessarily result in a valid unicode string.
I suspect that string, wstring should have been the primary types and
had a .codepoints property, which returned a ubyte[] resp. ushort[]
reference to the data. It's too late, of course. The extra value you get
by having a specific type for 'this is a code point for a UTF8 string'
seems to be very minor, compared to just using a ubyte.
Don Wrote:
We seem to be approaching the point where char[], wchar[] and dchar[]
are all arrays of dchar, but with different levels of compression.
It makes me wonder if the char, wchar types actually make any sense.
If char[] is actually a UTF string, then char[] ~ char should be
permitted ONLY if char can be implicitly converted to dchar. Otherwise,
you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will
not necessarily result in a valid unicode string.
Well, if you're working with a LOT of text, you may be mmapping GB's of UTF-8
text. Yes, this does happen. You better be able to handle it in a sane
manner, i.e. not reallocating the memory to read the data in. So, there is a
definite need for casting to array of char, and dealing with the inevitable
stray non-unicode char in that mess.
Real-world text processing can be a messy affair. It probably requires walking
such an array and returning slices cast to char after they've been validated.
== Quote from Jerry Quinn (jlquinn optonline.net)'s article
Don Wrote:
We seem to be approaching the point where char[], wchar[] and dchar[]
are all arrays of dchar, but with different levels of compression.
It makes me wonder if the char, wchar types actually make any sense.
If char[] is actually a UTF string, then char[] ~ char should be
permitted ONLY if char can be implicitly converted to dchar. Otherwise,
you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will
not necessarily result in a valid unicode string.
i.e. not reallocating the memory to read the data in. So, there is a definite
need for casting to array of char, and dealing with the inevitable stray
non-unicode char in that mess.
Welcome to the world of DNA sequence manipulation.
Don wrote:
I suspect that string, wstring should have been the primary types and
had a .codepoints property, which returned a ubyte[] resp. ushort[]
reference to the data. It's too late, of course. The extra value you get
by having a specific type for 'this is a code point for a UTF8 string'
seems to be very minor, compared to just using a ubyte.
If it's not too late to completely change the semantics of char[], then
it's also not too late to dump 'char' completely. If it /is/ too late
to remove 'char', then 'char[]' should retain the current semantics and
a new string type should be added for the new semantics.
--
Rainer Deyke - rainerd eldwood.com
Rainer Deyke wrote:
Don wrote:
I suspect that string, wstring should have been the primary types and
had a .codepoints property, which returned a ubyte[] resp. ushort[]
reference to the data. It's too late, of course. The extra value you get
by having a specific type for 'this is a code point for a UTF8 string'
seems to be very minor, compared to just using a ubyte.
If it's not too late to completely change the semantics of char[], then
it's also not too late to dump 'char' completely. If it /is/ too late
to remove 'char', then 'char[]' should retain the current semantics and
a new string type should be added for the new semantics.
One idea I've had for a while was to have a universal string type:
struct UString {
union {
char[] utf8;
wchar[] utf16;
dchar[] utf32;
}
enum Discriminator { utf8, utf16, utf32 };
Discriminator kind;
IntervalTree!(size_t) skip;
...
}
The IntervalTree stores the skip amounts that must be added for a given
index in the string. For ASCII strings that would be null. Then its size
grows with the number of multibyte characters. Beyond a threshold,
representation is transparently switched to utf16 or utf32 as needed and
the tree becomes smaller or null again.
In an advanced implementation the discriminator and the tree could be
stored at negative offset, and the tree could be compressed taking
advantage of its limited size. That would make UString quite
low-overhead while offering a staunchly dchar-based interface.
I don't mind at all using string, but I also think UString would be a
good extra abstraction.
Andrei
Andrei Alexandrescu wrote:
Rainer Deyke wrote:
Don wrote:
I suspect that string, wstring should have been the primary types and
had a .codepoints property, which returned a ubyte[] resp. ushort[]
reference to the data. It's too late, of course. The extra value you get
by having a specific type for 'this is a code point for a UTF8 string'
seems to be very minor, compared to just using a ubyte.
If it's not too late to completely change the semantics of char[], then
it's also not too late to dump 'char' completely. If it /is/ too late
to remove 'char', then 'char[]' should retain the current semantics and
a new string type should be added for the new semantics.
One idea I've had for a while was to have a universal string type:
struct UString {
union {
char[] utf8;
wchar[] utf16;
dchar[] utf32;
}
enum Discriminator { utf8, utf16, utf32 };
Discriminator kind;
IntervalTree!(size_t) skip;
...
}
You mean like this?
http://www.dprogramming.com/mtext.php
Andrei Alexandrescu wrote:
One idea I've had for a while was to have a universal string type:
struct UString {
union {
char[] utf8;
wchar[] utf16;
dchar[] utf32;
}
enum Discriminator { utf8, utf16, utf32 };
Discriminator kind;
IntervalTree!(size_t) skip;
...
}
The IntervalTree stores the skip amounts that must be added for a given
index in the string. For ASCII strings that would be null. Then its size
grows with the number of multibyte characters. Beyond a threshold,
representation is transparently switched to utf16 or utf32 as needed and
the tree becomes smaller or null again.
Although I see some potential in a universal string type, I don't think
this is the right implementation strategy. I'd rather have my short
strings in utf-32 (optimized for speed) and my long strings in
utf-8/utf-16 (optimized for memory usage).
--
Rainer Deyke - rainerd eldwood.com
Rainer Deyke wrote:
Andrei Alexandrescu wrote:
One idea I've had for a while was to have a universal string type:
struct UString {
union {
char[] utf8;
wchar[] utf16;
dchar[] utf32;
}
enum Discriminator { utf8, utf16, utf32 };
Discriminator kind;
IntervalTree!(size_t) skip;
...
}
The IntervalTree stores the skip amounts that must be added for a given
index in the string. For ASCII strings that would be null. Then its size
grows with the number of multibyte characters. Beyond a threshold,
representation is transparently switched to utf16 or utf32 as needed and
the tree becomes smaller or null again.
Although I see some potential in a universal string type, I don't think
this is the right implementation strategy. I'd rather have my short
strings in utf-32 (optimized for speed) and my long strings in
utf-8/utf-16 (optimized for memory usage).
The definition I outlined does not specify or constrain the strategy of
changing the discriminator.
Andrei
On 2010-02-04 18:16:55 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> said:
Rainer Deyke wrote:
Don wrote:
I suspect that string, wstring should have been the primary types and
had a .codepoints property, which returned a ubyte[] resp. ushort[]
reference to the data. It's too late, of course. The extra value you get
by having a specific type for 'this is a code point for a UTF8 string'
seems to be very minor, compared to just using a ubyte.
If it's not too late to completely change the semantics of char[], then
it's also not too late to dump 'char' completely. If it /is/ too late
to remove 'char', then 'char[]' should retain the current semantics and
a new string type should be added for the new semantics.
One idea I've had for a while was to have a universal string type:
struct UString {
union {
char[] utf8;
wchar[] utf16;
dchar[] utf32;
}
enum Discriminator { utf8, utf16, utf32 };
Discriminator kind;
IntervalTree!(size_t) skip;
...
}
That's a nice concept, but it seems to me that it adds much overhead to
improve a rather niche area. It's not often that you need to access
characters by index. Generally when you need to it's because you've
already parsed the string and want to return to a previous location, in
which case you'd better when you first parse to just save the range or
the index in code units rather than the index in code point.
But I have to say quite satisfied in the way D handle strings in
general. Easy access to code points and direct access to the data is
quite handy. I think it fits very well with a low-level language.
I'd say in general when manipulating strings I rarely need to bother
about code points. Most of the time I'm just searching for ASCII-range
markers when parsing so I can search for them directly as code units,
not bothering at all about multi-byte characters. That's why I'm a
little wary about your changes. If I'm looking for a substring then I
can search by code units too. It's just for the more fancy stuff
(case-insensitive searching, character transformation) that it becomes
necessary to work with code points.
That's why I'm a little wary about your changes in that area: I fear
it'll make the common case of working with code units more difficult to
deal with. But I won't judge before I see.
--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Don wrote:
Andrei Alexandrescu wrote:
Michel Fortin wrote:
On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail erdani.org> said:
bearophile wrote:
Simen kjaeraas:
Of the above, I feel (b) is the correct solution, and I understand
it has already been implemented in svn.
Yes, I presume he was mostly looking for a justification of his ideas
he has already accepted and even partially implemented :-)
I am ready to throw away the implementation as soon as a better idea
comes around. As other times, I operated the change to see how
things feel with the new approach.
Has any thought been given to foreach? Currently all these work for
strings:
foreach (c; "abc") { } // typeof(c) is 'char'
foreach (char c; "abc") { }
foreach (wchar c; "abc") { }
foreach (dchar c; "abc") { }
I'm concerned about the first case where the element type is
implicit. The implicit element type is (currently) the code units. If
the range use code points 'dchar' as the element type, then I think
foreach needs to be changed so that the default element type is
'dchar' too (in the first line of my example). Having ranges and
foreach disagree on this would be very inconsistent. Of course you
should be allowed to iterate using 'char' and 'wchar' too.
I think this would fit nicely. I was surprised at first when learning
D and I noticed that foreach didn't do this, that I had to explicitly
has for it.
This is a good point. I'm in favor of changing the language to make
the implicit type dchar.
Andrei
We seem to be approaching the point where char[], wchar[] and dchar[]
are all arrays of dchar, but with different levels of compression.
That is a good way to look at things.
It makes me wonder if the char, wchar types actually make any sense.
If char[] is actually a UTF string, then char[] ~ char should be
permitted ONLY if char can be implicitly converted to dchar. Otherwise,
you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will
not necessarily result in a valid unicode string.
Well as it's been mentioned, sometimes you may assemble a string out of
individual characters. Probably that case is rare enough to warrant a
cast. Note that today char is already convertible to dchar (there's no
checking).
I suspect that string, wstring should have been the primary types and
had a .codepoints property, which returned a ubyte[] resp. ushort[]
reference to the data. It's too late, of course. The extra value you get
by having a specific type for 'this is a code point for a UTF8 string'
seems to be very minor, compared to just using a ubyte.
What we can do is to have to!(const ubyte[]) work for all UTF8 strings
and to!(const ushort[]) work for all UTF16 strings. That view is correct
and safe. Also, it's not difficult to add a .codepoints pseudo-property.
Andrei
Andrei Alexandrescu wrote:
Has any thought been given to foreach? Currently all these work for
strings:
foreach (c; "abc") { } // typeof(c) is 'char'
foreach (char c; "abc") { }
foreach (wchar c; "abc") { }
foreach (dchar c; "abc") { }
I'm concerned about the first case where the element type is implicit.
The implicit element type is (currently) the code units. If the range
use code points 'dchar' as the element type, then I think foreach
needs to be changed so that the default element type is 'dchar' too
(in the first line of my example). Having ranges and foreach disagree
on this would be very inconsistent. Of course you should be allowed to
iterate using 'char' and 'wchar' too.
I think this would fit nicely. I was surprised at first when learning
D and I noticed that foreach didn't do this, that I had to explicitly
has for it.
This is a good point. I'm in favor of changing the language to make the
implicit type dchar.
Andrei
I concur. It's great to see consensus moving in this direction. For
too long Java has suffered the err that a short (i.e. UTF-16 codeunit)
is just about as good as a full Unicode codepoint (i.e. UTF-32
"codeunit"). As a result, the near-enough is good-enough, 16-bit Java
API's means that programmers either forget (as best) or become slack (at
worse) in the dealing of valid Unicode characters. Part of this
also stems from the culture that if it ain't ASCII or in a Western
character set (BMP), who cares.
As a matter of taste, I'd prefer to see a dchar Unicode codepoint
officially acknowledged/ordained as "unichar", though I guess there
is always the alias resort for pedants like myself.
Cheers
Justin Johansson
Andrei Alexandrescu:
I am ready to throw away the implementation as soon as a better idea
comes around. As other times, I operated the change to see how things
feel with the new approach.
And just to be sure: you are better than me because you actually implement
things, while I am here just complaining all the time :o) So thank you for your
work.
Later,
bearophile
On Thu, 04 Feb 2010 18:41:48 -0700, Rainer Deyke wrote:
Andrei Alexandrescu wrote:
One idea I've had for a while was to have a universal string type:
struct UString {
union {
char[] utf8;
wchar[] utf16;
dchar[] utf32;
}
enum Discriminator { utf8, utf16, utf32 }; Discriminator kind;
IntervalTree!(size_t) skip;
...
}
Firstly, for such "augmented types" in D, such as strings, bignums or any
future ideas , it is great to have the facilities of creating them using
the struct, so that they can be used elsewhere without regards to whether
they are built in as compiler specials or in the library.
What is there for struct now is good and getting better in D2, but I
still feel a little insecure with understanding how to make a really
optimal implementation that is as good as a built in type that the
compiler understands. The DPL is being been a help for this.
Programmers will want to use raw char[] wchar[] dchar[] for whatever
reasons with their ?simple? behaviours, so they should not be made
unavailable because more sophisticated types are creatable, purely for
unicode strings.
I have made a UString implementation, similar to above. But I played a
different trick. I was interested for this to also maintain a terminating
null char for conversion passing to Windows API functions, in particular
16 bit W. interfaces.
struct UString_char {
char[] str_;
/// ... lots of good D type stuff, constructor and assign
conversions, access
size_t length() {
return str_.length - 1; // hide terminating null
}
}
struct UString_wchar {
wchar[] str_;
/// ditto D type stuff
}
struct UString_dchar {
dchar[] str_;
}
// throw in void[] for charity. (although no one will need it)
struct UString_void {
void[] ptr_;
}
enum UStringType { UC_CHAR, UC_WCHAR, UC_DCHAR }
struct UString {
union {
UString_void vstr;
UString_char cstr;
UString_wchar wstr;
UString_dchar dstr;
}
UStringType ztype;
// type things to track what we are.
}
I could then choose individual components by themselves, where
appropriate, even get them working.
In D2 immutable not for str_ array, while appending or fiddling null
terminator.
I did not get associative array working as a key using UString, have not
tried since.
Also made a class version called VString containing the union.
There's a lot of issues.
I also must acknowledge the prior art of the mtext code, and its MString
structure type. I was partly inspired by seeing this, and how complex it
was to do nearly everything.
When last I checked mtext it was kind of broken for recent D1 and D2
compilers, and I did not want to fix. I admit I did not like the
complexity of the direct union { char[], wchar[], dchar[] } Splitting up
into seperatedly usable structs seems to me to give 3 times the potential
for the same price.
The advantage of using struct for such types is it may help bring about
perfection of such a POD based "type creation" facility. I note from
looking at some of the phobos D2 code, eg std.array, this seems to be
attempted in places.
Nearly all the more interesting D types, arrays, maps, are all equivalent
to smallish POD types, with at least 2-3 times the machine word size
(32/64 bit).
Making it all work and understandable and avoiding WTFbug is a big
challenge.
Andrei Alexandrescu Wrote:
What would you do? Any ideas are welcome.
Andrei, congratulations on starting the most interesting thread I have seen on
dm.D for as long as I can remember. It has managed to stay really focused, and
has produced a lot of interesting suggestions.
I don't envy you the task of sorting through them - but you did ask. From what
you've seen so far, are we talking Phobos or language issues, and if both, in
what proportion?
Steve.
Steve Teale wrote:
Andrei Alexandrescu Wrote:
What would you do? Any ideas are welcome.
Andrei, congratulations on starting the most interesting thread I have seen on
dm.D for as long as I can remember. It has managed to stay really focused, and
has produced a lot of interesting suggestions.
I don't envy you the task of sorting through them - but you did ask. From what
you've seen so far, are we talking Phobos or language issues, and if both, in
what proportion?
A bit of both, with an emphasis on library.
Andrei
|
|