digitalmars.D - First Impressions
Geoff Carlton <gcarlton iinet.net.au> writes:
Hi,
I'm a C++ user who's just tried D and I wanted to give my first
impressions. I can't really justify moving any of my codebase over to
D, so I wrote a quick tool to parse a dictionary file and make a
histogram - a bit like the wc demo in the dmd package.
1.)
I was a bit underwhelmed by the syntax of char[]. I've used lua which
also has strings, functions and maps as basic primitives, so going back
to array notation seems a bit low level. Also, char[][] is not the best
start in the main() declaration. Is it a 2D array, an array of
arrays? Then there is the char[][char[]]. What a mouthful for a simple
map!
Well, now I need to find elements.. I'd use std::string's find() here,
but the wc example has all array operations. Even isalpha is done as
'a', 'z' comparisons on an indexed array. Back to low level C stuff.
A simple alias of char[] to string would simplify the first glance code.
string x; // yep, a string
main (string[]) // an array of strings
string[string] m; // map of string to string
I believe single functions get pulled in as member functions? e.g.
find(string) can be used as string.find()? If so, it means that all the
string functionality can be added and then used naturally as member
functions on this "string" (which is really just the plain old char[] in
disguise).
This is a small thing, but I think it would help in terms of the mindset
of strings being a first class primitive, and clear up simple "hello
world" examples at the same time. Put simply, every modern language has
a first class string primitive type, except D - at least in terms of
nomenclature.
2.)
I liked the more powerful for loop. I'm curious: is there any ability to
use delegates in the same way as lua does? I was blown away the first
time I realised how simple it was for custom iteration in lua. In
short, you write a function that returns a delegate (a closure?) that
itself returns arguments, terminating in nil.
e.g. for r in rooms_in_level(lvl) // custom function
As lua can handle multiple return arguments, it can also do a key,value
sort of thing that D can do. What a wonderful way of allowing any sort
of iteration.
It beats pages of code in C++ to write an iterator that can go forwards,
or one that can go backwards (wow, the power of C++!). C++09 still
isn't much of an improvement here, it only sugars the awful iterator syntax.
3.)
From the newsgroups, it seems like 'auto' as local raii and 'auto' as
automatic type deduction are still linked to the one keyword. Well in
lua, 'local' is pretty intuitive for locally scoped variables. Also
'auto' will soon mean automatic type deduction in C++. So those make
sense to me personally. Looks like this has been discussed to death,
but that's my 2c.
4.)
The D version of Scintilla and d-build was nice, very easy to use.
Personally I would have preferred the default behaviour of dbuild to put
object files in an /obj subdirectory and the final exe in the original
directory dbuild is run from.
This way, it could be run from a root directory, operate on a /src
subdirectory, and not clutter up the source with object files. There is
a switch for that, of course, but I can't imagine when you would want
object files sitting in the same directory as the source.
Well, as first impressions go, I was pleased by D, and am interested to
see how well it fares as time goes on. It's just a shame that the
tools/library/IDE are all in C++!
Thanks,
Geoff
"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Geoff Carlton" <gcarlton iinet.net.au> wrote in message
news:efhp1r$1r9s$1 digitaldaemon.com...
Hi,
I'm a C++ user who's just tried D and I wanted to give my first
impressions. I can't really justify moving any of my codebase over to
D, so I wrote a quick tool to parse a dictionary file and make a
histogram - a bit like the wc demo in the dmd package.
1.)
I was a bit underwhelmed by the syntax of char[]. I've used lua which
also has strings, functions and maps as basic primitives, so going back
to array notation seems a bit low level. Also, char[][] is not the best
start in the main() declaration. Is it a 2D array, an array of
arrays? Then there is the char[][char[]]. What a mouthful for a simple
map!
Well, now I need to find elements.. I'd use std::string's find() here, but
the wc example has all array operations. Even isalpha is done as 'a', 'z'
comparisons on an indexed array. Back to low level C stuff.
A simple alias of char[] to string would simplify the first glance code.
string x; // yep, a string
main (string[]) // an array of strings
string[string] m; // map of string to string
I believe single functions get pulled in as member functions? e.g.
find(string) can be used as string.find()? If so, it means that all the
string functionality can be added and then used naturally as member
functions on this "string" (which is really just the plain old char[] in
disguise).
They're more just syntactic sugar than member functions. You can, in fact,
do this with any array type, e.g.
void foo(int[] arr)
{
...
}
int[] x; x = [4, 5, 6, 7]; // bug in the new array literals ;)
x.foo();
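A minimal, compilable sketch of that sugar, assuming a current D compiler rather than the 2006 one. The helper name `total` is made up for illustration and is not a Phobos function:

```d
// Hypothetical free function whose first parameter is an array.
int total(int[] arr)
{
    int sum = 0;
    foreach (v; arr)
        sum += v;
    return sum;
}

void main()
{
    int[] x = [4, 5, 6, 7];

    // Both spellings call the same free function; the second is only
    // syntactic sugar, no member function is added to int[].
    assert(total(x) == 22);
    assert(x.total() == 22);
}
```

The same rewrite applies to char[] arguments, which is what would let a free find(char[], ...) read like a method call.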
This is a small thing, but I think it would help in terms of the mindset
of strings being a first class primitive, and clear up simple "hello
world" examples at the same time. Put simply, every modern language has
a first class string primitive type, except D - at least in terms of
nomenclature.
It does look nicer. I suppose the counterargument would be that having an
alias char[] string might not be portable -- what about wchar[] and dchar[]?
Would they be wstring and dstring? Or would we choose wchar[] or dchar[] to
be string to be more forward-thinking (since languages like Java and C#
already use UTF-16 as the default string type)?
I've never been too incredibly put off by char[], but of course other people
have other opinions.
2.)
I liked the more powerful for loop. I'm curious: is there any ability to
use delegates in the same way as lua does? I was blown away the first
time I realised how simple it was for custom iteration in lua. In short,
you write a function that returns a delegate (a closure?) that itself
returns arguments, terminating in nil.
e.g. for r in rooms_in_level(lvl) // custom function
As lua can handle multiple return arguments, it can also do a key,value
sort of thing that D can do. What a wonderful way of allowing any sort of
iteration.
Unfortunately the way Lua does "foreach" iteration is exactly the inverse of
how D does it. Lua gets an iterator and keeps calling it in the loop; D
gives the loop (the entire body!) to the iterator function, which runs the
loop. So it's something like a "true" iterator as described in the Lua
book:
level.each(function(r) print("Room: " .. r) end)
D does it this way I guess to make it easier to write iterators. Since
you're limited to one return value, it's simpler to make the iterator a
callback and pass the indices into the foreach body than it is to make the
iterator return multiple parameters through "out" parameters. That, and
it's easier to keep track of state with a callback iterator. (I'm going
through which to use in a Lua-like language that I'm designing too!)
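For comparison, here is a rough sketch of the callback style on the D side, assuming a current D compiler. `Level` and its `rooms` field are hypothetical names borrowed from the rooms_in_level example; only `opApply` itself is the language hook:

```d
/// Hypothetical container, named after the rooms_in_level example.
struct Level
{
    string[] rooms;

    // foreach hands the whole loop body to this function as the delegate dg.
    // A non-zero result from dg means the body executed break or return,
    // so iteration stops and the value is passed back out.
    int opApply(scope int delegate(ref string) dg)
    {
        foreach (ref r; rooms)
        {
            if (auto result = dg(r))
                return result;
        }
        return 0;
    }
}

void main()
{
    auto lvl = Level(["hall", "vault", "cellar"]);
    string[] seen;

    // The compiler turns this body into a delegate and passes it to opApply.
    foreach (r; lvl)
    {
        if (r == "cellar")
            break;
        seen ~= r;
    }
    assert(seen == ["hall", "vault"]);
}
```

The iterator keeps its state in ordinary local variables of opApply, which is the "easier to keep track of state" point made above.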
It beats pages of code in C++ to write an iterator that can go forwards,
or one that can go backwards (wow, the power of C++!). C++09 still isn't
much of an improvement here, it only sugars the awful iterator syntax.
Weeeeeeeee! C++
3.)
From the newsgroups, it seems like 'auto' as local raii and 'auto' as
automatic type deduction are still linked to the one keyword. Well in
lua, 'local' is pretty intuitive for locally scoped variables. Also
'auto' will soon mean automatic type deduction in C++. So those make
sense to me personally. Looks like this has been discussed to death, but
that's my 2c.
I don't even wanna get into it ;) _Technically_ speaking, auto isn't really
"used" in type deduction; instead, the syntax is just <storage class>
<identifier>, skipping the type. Since the default storage class is auto,
it looks like auto is being used to determine the type, but it also works
like e.g.
static x = 5;
I think a better way to do it would be to have a special "stand-in" type,
such as
var x = 5;
static var y = 20;
auto var f = new Foo(); // this will be RAII and automatically
                        // type-determined
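A small sketch of the "storage class plus identifier" rule as it can be checked today, assuming a current D compiler (where auto no longer carries the RAII meaning discussed here). The function name `inferred` is made up for illustration:

```d
// Hypothetical helper; each declaration below omits the type and
// lets the initializer decide it.
int inferred()
{
    static x = 5;        // storage class only, type inferred as int
    auto y = 20;         // auto as the default storage class, also int
    immutable z = x + y; // inferred as immutable(int)

    static assert(is(typeof(x) == int));
    static assert(is(typeof(y) == int));
    static assert(is(typeof(z) == immutable(int)));
    return z;
}

void main()
{
    assert(inferred() == 25);
}
```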
4.)
The D version of Scintilla and d-build was nice, very easy to use.
Personally I would have preferred the default behaviour of dbuild to put
object files in an /obj subdirectory and the final exe in the original
directory dbuild is run from.
This way, it could be run from a root directory, operate on a /src
subdirectory, and not clutter up the source with object files. There is a
switch for that, of course, but I can't imagine when you would want object
files sitting in the same directory as the source.
Well, as first impressions go, I was pleased by D, and am interested to
see how well it fares as time goes on. It's just a shame that the
tools/library/IDE are all in C++!
Thanks,
Geoff
Geoff Carlton <gcarlton iinet.net.au> writes:
Jarrett Billingsley wrote:
It does look nicer. I suppose the counterargument would be that having an
alias char[] string might not be portable -- what about wchar[] and dchar[]?
Would they be wstring and dstring? Or would we choose wchar[] or dchar[] to
be string to be more forward-thinking (since languages like Java and C#
already use UTF-16 as the default string type)?
I'm a fan of utf-8 so it would seem natural to have string, wstring, and
dstring. IMO utf-16 is backward thinking, and has the dubious property
of being mostly fixed width, except when it's not. And even utf-32 isn't
one-to-one in terms of glyphs rendered on screen.
Anyway, as a low level programmer, I appreciate that it's all based on
very powerful and flexible arrays. But as a high level programmer, I
don't want to be reminded of that fact every time I need to use a string.
Unfortunately the way Lua does "foreach" iteration is exactly the inverse of
how D does it. Lua gets an iterator and keeps calling it in the loop; D
gives the loop (the entire body!) to the iterator function, which runs the
loop. So it's something like a "true" iterator as described in the Lua
book:
Ok, although the advantage of the first method is that you write the
iterator once, and then it's easy to use for all clients. Wrapping up
the loop in a function is just backward, although it is much more
palatable in the inline format than a clunky out of line functor or
using _1, _2 hackery magic.
As an example, I love the fact that I can do this in lua:
for r1 in rooms_in_level(lvl) do
for r2 in rooms_in_level(lvl) do
for c in connections(r1, r2) do
print("got connection " .. c)
end
end
end
I wrote Floyd's algorithm in lua in the time it would take me in C++ to
not even finish thinking about what structures, classes, vectors I would
use. I imagine D would be as easy, although not as nice as the above style.
D does it this way I guess to make it easier to write iterators. Since
you're limited to one return value, it's simpler to make the iterator a
callback and pass the indices into the foreach body than it is to make the
iterator return multiple parameters through "out" parameters. That, and
it's easier to keep track of state with a callback iterator. (I'm going
through which to use in a Lua-like language that I'm designing too!)
Multiple returns would be tricky. C++ looks like it's getting there with
std::tuple and std::tie, but as always the downside is the sheer
clunkiness. As heterogeneous arrays aren't in the core language for
either C++ or D, it's tricky to come up with a clean solution.
Designing a language would be great fun, and I think lua has done a
great many things right. Not sure about the typeless state though, it
gets messy with large projects. Still, no templates (or rather, every
function is like a template).
Lutger <lutger.blijdestijn gmail.com> writes:
Geoff Carlton wrote:
Hi,
I'm a C++ user who's just tried D and I wanted to give my first
impressions. I can't really justify moving any of my codebase over to
D, so I wrote a quick tool to parse a dictionary file and make a
histogram - a bit like the wc demo in the dmd package.
You'll sure be pleased with D coming from C++.
1.)
I was a bit underwhelmed by the syntax of char[]...
Yes, I was too. But although it doesn't look very nice at first sight, D's
arrays are nothing like C++ arrays. Strings are first class and array
notation is consistent. Once I got used to them, together with the
concatenation and slicing operators, I found they are quite powerful yet
simple to use.
2.)
I liked the more powerful for loop. I'm curious: is there any ability to
use delegates in the same way as lua does? I was blown away the first
time I realised how simple it was for custom iteration in lua. In
short, you write a function that returns a delegate (a closure?) that
itself returns arguments, terminating in nil.
You can enable a class to use the foreach statement.
http://www.digitalmars.com/d/statement.html#foreach
4.)
The D version of Scintilla and d-build was nice, very easy to use.
Personally I would have preferred the default behaviour of dbuild to put
object files in an /obj subdirectory and the final exe in the original
directory dbuild is run from.
This way, it could be run from a root directory, operate on a /src
subdirectory, and not clutter up the source with object files. There is
a switch for that, of course, but I can't imagine when you would want
object files sitting in the same directory as the source.
Check out build: http://www.dsource.org/projects/build
Well, as first impressions go, I was pleased by D, and am interested to
see how well it fares as time goes on. It's just a shame that the
tools/library/IDE are all in C++!
Thanks,
Geoff
Derek Parnell <derek nomail.afraid.org> writes:
On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:
Hi,
I'm a C++ user who's just tried D
I was a bit underwhelmed by the syntax of char[].
Yes. It isn't very 'nice' for a modern language. Though as you note below a
simple alias can help a lot.
alias char[] string;
I believe single functions get pulled in as member functions? e.g.
find(string) can be used as string.find()?
This syntax sugar works for all arrays.
func(T[] x, a)
x.func(a)
are equivalent.
2.)
I liked the more powerful for loop. I'm curious is there any ability to
use delegates in the same way as lua does?
Yes it can use anonymous delegates. You can also overload it in classes.
3.)
From the newsgroups, it seems like 'auto' as local raii and 'auto' as
automatic type deduction are still linked to the one keyword.
There are lots of D users hoping that this wart will be repaired before too
long.
4.)
The D version of Scintilla and d-build was nice, very easy to use.
Personally I would have preferred the default behaviour of dbuild to put
object files in an /obj subdirectory and the final exe in the original
directory dbuild is run from.
This way, it could be run from a root directory, operate on a /src
subdirectory, and not clutter up the source with object files. There is
a switch for that, of course, but I can't imagine when you would want
object files sitting in the same directory as the source.
Thanks for the Build comments. One unfortunate thing I find is that one
person's defaults are another's exceptions. That is why you can tailor
Build to your 'default' behaviour requirements. In this case, create a text
file in the same directory that Build.exe is installed in, called
'build.cfg' and place in it the line ...
CMDLINE=-od./obj
Then when you run the tool, the command line switch is applied every time
you run it.
--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
29/09/2006 4:44:52 PM
Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:
I was a bit underwhelmed by the syntax of char[].
Yes. It isn't very 'nice' for a modern language. Though as you note below a
simple alias can help a lot.
alias char[] string;
On the other hand, the reasons other languages have strings as classes
is because they just don't support arrays very well. C++'s std::string
combines the worst of core functionality and libraries, and has the
advantages of neither.
An early design goal for D was to upgrade arrays to the point where
string classes weren't necessary.
Anders F Björklund <afb algonet.se> writes:
Walter Bright wrote:
An early design goal for D was to upgrade arrays to the point where
string classes weren't necessary.
A string alias might still be, just as the bool alias was.
--anders
Derek Parnell <derek psyc.ward> writes:
On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:
Derek Parnell wrote:
On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:
I was a bit underwhelmed by the syntax of char[].
Yes. It isn't very 'nice' for a modern language. Though as you note below a
simple alias can help a lot.
alias char[] string;
On the other hand, the reasons other languages have strings as classes
is because they just don't support arrays very well. C++'s std::string
combines the worst of core functionality and libraries, and has the
advantages of neither.
An early design goal for D was to upgrade arrays to the point where
string classes weren't necessary.
And is it there yet? I mean, given that a string is just a lump of text, is
there any text processing operation that cannot be simply done to a char[]
item? I can't think of any but maybe somebody else can.
And if a char[] is just as capable as a std::string, then why not have an
official alias in Phobos? Will 'alias char[] string' cause anyone any
problems?
--
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:
An early design goal for D was to upgrade arrays to the point where
string classes weren't necessary.
And is it there yet? I mean, given that a string is just a lump of text
The string you're talking about is not just a lump of text.
More specifically it's a lump of text, irregularly interspersed with
short non-ascii ubyte sequences.
The latter being of course the tails of UTF-8 "characters".
David Medlock <noone nowhere.com> writes:
Derek Parnell wrote:
On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:
Derek Parnell wrote:
On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:
I was a bit underwhelmed by the syntax of char[].
Yes. It isn't very 'nice' for a modern language. Though as you note below a
simple alias can help a lot.
alias char[] string;
On the other hand, the reasons other languages have strings as classes
is because they just don't support arrays very well. C++'s std::string
combines the worst of core functionality and libraries, and has the
advantages of neither.
An early design goal for D was to upgrade arrays to the point where
string classes weren't necessary.
And is it there yet? I mean, given that a string is just a lump of text, is
there any text processing operation that cannot be simply done to a char[]
item? I can't think of any but maybe somebody else can.
And if a char[] is just as capable as a std::string, then why not have an
official alias in Phobos? Will 'alias char[] string' cause anyone any
problems?
string array types.
-DavidM
Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
And is it there yet? I mean, given that a string is just a lump of text, is
there any text processing operation that cannot be simply done to a char[]
item? I can't think of any but maybe somebody else can.
I believe it's there. I don't think std::string or java.lang.String have
anything over it.
And if a char[] is just as capable as a std::string, then why not have an
official alias in Phobos? Will 'alias char[] string' cause anyone any
problems?
I don't think it'll cause problems, it just seems pointless.
Matthias Spycher <matthias coware.com> writes:
Immutability and some guarantees about the validity of the state of an
immutable string in a concurrent setting are what set Java strings
apart. Garbage collection without immutable strings in the standard
library is quite out of the ordinary.
Walter Bright wrote:
Derek Parnell wrote:
And is it there yet? I mean, given that a string is just a lump of
text, is
there any text processing operation that cannot be simply done to a
char[]
item? I can't think of any but maybe somebody else can.
I believe it's there. I don't think std::string or java.lang.String have
anything over it.
And if a char[] is just as capable as a std::string, then why not have an
official alias in Phobos? Will 'alias char[] string' cause anyone any
problems?
I don't think it'll cause problems, it just seems pointless.
Derek Parnell <derek psyc.ward> writes:
On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:
Derek Parnell wrote:
And is it there yet? I mean, given that a string is just a lump of text, is
there any text processing operation that cannot be simply done to a char[]
item? I can't think of any but maybe somebody else can.
I believe it's there. I don't think std::string or java.lang.String have
anything over it.
I'm pretty sure that the phobos routines for search and replace only work
for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
always fail to deliver the correct result. It finds the first occurrence of
the byte value for the letter 'a' which may well be inside a Japanese
character. It looks for byte-subsets rather than character sub-sets.
And if a char[] is just as capable as a std::string, then why not have an
official alias in Phobos? Will 'alias char[] string' cause anyone any
problems?
I don't think it'll cause problems, it just seems pointless.
It may very well be pointless for your way of thinking, but your language
is also for people who may not necessarily think in the same manner as
yourself. I, for example, think there is a point to having my code read
like it's dealing with strings rather than arrays of characters. I suspect
I'm not alone. We could all write the alias in all our code, but you could
also be helpful and do it for us - like you did with bit/bool.
--
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
I'm pretty sure that the phobos routines for search and replace only work
for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
always fail to deliver the correct result. It finds the first occurrence of
the byte value for the letter 'a' which may well be inside a Japanese
character. It looks for byte-subsets rather than character sub-sets.
I take it that you mean that the bit pattern, or byte, 'a' (as in 0x61)
may be found within a Japanese multibyte glyph? Or even a very long
Japanese text.
That is not correct.
The designers of UTF-8 knew that this would be dangerous, and created
UTF-8 so that such _will_not_happen_. Ever.
Therefore, something like std.string.find() doesn't even have to know
about it.
Basically, std.string.find() and comparable functions, only have to
receive two octet sequences, and see where one of them first occurs in
the other. No need to be aware of UTF or ASCII. For all we know, the
strings may even be in EBCDIC. Still works.
If the strings themselves are valid (in whichever encoding you have
chosen to use), then the result will also be valid.
((For the sake of completeness, here I've restricted the discussion to
the version of such functions that accept ubyte[] compatible input
(obviously including char[]). Those taking 16 or 32 bits, and especially
if we deliberately feed input of wrong width to any of these, then of
course the results will be more complicated.))
Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
I'm pretty sure that the phobos routines for search and replace only work
for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
always fail to deliver the correct result. It finds the first occurrence of
the byte value for the letter 'a' which may well be inside a Japanese
character.
That cannot happen, because multibyte sequences *always* have the high
bit set, and 'a' does not. That's one of the things that sets UTF-8
apart from other multibyte formats. You might be thinking of the older
Shift-JIS multibyte encoding, which did suffer from such problems.
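The high-bit property can be checked directly. The following is only a sketch, assuming a current D compiler and Phobos (where string is an alias for immutable(char)[]); `asciiBytesAreStandalone` is a made-up helper name, and the sample text is arbitrary:

```d
import std.string : indexOf, representation;
import std.utf : stride;

/// Hypothetical check: returns true if no byte below 0x80 appears
/// inside any multibyte UTF-8 sequence of s.
bool asciiBytesAreStandalone(string s)
{
    size_t i = 0;
    while (i < s.length)
    {
        // Number of bytes used by the code point starting at index i.
        immutable n = stride(s, i);
        if (n > 1)
        {
            foreach (b; s.representation[i .. i + n])
            {
                if (b < 0x80)
                    return false;
            }
        }
        i += n;
    }
    return true;
}

void main()
{
    string text = "日本語のa文字";

    // Every byte of the Japanese characters has its high bit set,
    // so the byte 0x61 ('a') cannot hide inside one of them.
    assert(asciiBytesAreStandalone(text));

    // The four leading characters take 3 bytes each in UTF-8,
    // so a plain byte-wise search finds 'a' at byte offset 12.
    assert(text.indexOf('a') == 12);
}
```

The index returned is a byte (code unit) offset, not a character count, which is what slicing a char[] expects anyway.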
It looks for byte-subsets rather than character sub-sets.
I don't think it's broken, but if it is, those are bugs, not fundamental
problems with char[], and should be filed in bugzilla.
It may very well be pointless for your way of thinking, but your language
is also for people who may not necessarily think in the same manner as
yourself. I, for example, think there is a point to having my code read
like it's dealing with strings rather than arrays of characters. I suspect
I'm not alone. We could all write the alias in all our code, but you could
also be helpful and do it for us - like you did with bit/bool.
I'm concerned about just adding more names that don't add real value. As
I wrote in a private email discussion about C++ typedefs, they should
only be used when:
1) they provide an abstraction against the presumption that the
underlying type may change
2) they provide a self-documentation purpose
(1) certainly doesn't apply to string. (2) may, but char[] has no use
other than that of being a string, as a char[] is always a string and a
string is always a char[]. So I don't think string fits (2).
And lastly, there's the inevitable confusion. People learning the
language will see char[] and string, and wonder which should be used
when. I can't think of any consistent understandable rule for that. So
it just winds up being wishy-washy. Adding more names into the global
space (which is what names in object.d are) should be done extremely
conservatively.
If someone wants to use the string alias as their personal or company
style, I have no issue with that, as other people *do* think differently
than me (which is abundantly clear here!).
Derek Parnell <derek psyc.ward> writes:
On Fri, 29 Sep 2006 23:11:37 -0700, Walter Bright wrote:
Derek Parnell wrote:
I'm pretty sure that the phobos routines for search and replace only work
for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
always fail to deliver the correct result. It finds the first occurrence of
the byte value for the letter 'a' which may well be inside a Japanese
character.
That cannot happen, because multibyte sequences *always* have the high
bit set, and 'a' does not. That's one of the things that sets UTF-8
apart from other multibyte formats. You might be thinking of the older
Shift-JIS multibyte encoding, which did suffer from such problems.
Thanks. That has cleared up some misconceptions and preconceptions that I
had with utf encoding. I can reduce some of my home-grown routines now and
reduce the number of times that I (think I) need dchar[] ;-)
It may very well be pointless for your way of thinking, but your language
is also for people who may not necessarily think in the same manner as
yourself. I, for example, think there is a point to having my code read
like it's dealing with strings rather than arrays of characters. I suspect
I'm not alone. We could all write the alias in all our code, but you could
also be helpful and do it for us - like you did with bit/bool.
I'm concerned about just adding more names that don't add real value. As
I wrote in a private email discussion about C++ typedefs, they should
only be used when:
1) they provide an abstraction against the presumption that the
underlying type may change
2) they provide a self-documentation purpose
(1) certainly doesn't apply to string.
No argument there.
(2) may, but char[] has no use
other than that of being a string, as a char[] is always a string and a
string is always a char[]. So I don't think string fits (2).
This is a little more debatable, but not worth generating hostility.
A string of text contains characters whose position in the string is
significant - there are semantics to be applied to the entire text. It is
quite possible to conceive of an application in which the characters in the
char[] array have no importance attached to their relative position within
the array *when compared to neighboring characters*. The order of
characters in text is significant but not necessarily so in an arbitrary
character array.
Conceptually a string is different from a char[], even though they are
implemented using the same technology.
And lastly, there's the inevitable confusion. People learning the
language will see char[] and string, and wonder which should be used
when. I can't think of any consistent understandable rule for that. So
it just winds up being wishy-washy. Adding more names into the global
space (which is what names in object.d are) should be done extremely
conservatively.
And yet we have "toString" and not "toCharArray" or "toUTF"!
And we still have the "printf" in object.d too!
If someone wants to use the string alias as their personal or company
style, I have no issue with that, as other people *do* think differently
than me (which is abundantly clear here!).
I'll revert Build to string again as it is a lot easier to read. It started
out that way but I converted it to char[] to appease you (why I thought you
needed appeasing is lost though). :-)
--
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
(2) may, but char[] has no use
other than that of being a string, as a char[] is always a string and a
string is always a char[]. So I don't think string fits (2).
This is a little more debatable, but not worth generating hostility.
I certainly hope this thread doesn't degenerate into that like some of
the others.
A string of text contains characters whose position in the string is
significant - there are semantics to be applied to the entire text. It is
quite possible to conceive of an application in which the characters in the
char[] array have no importance attached to their relative position within
the array *when compared to neighboring characters*. The order of
characters in text is significant but not necessarily so in an arbitrary
character array.
Conceptually a string is different from a char[], even though they are
implemented using the same technology.
You do have a point there.
And lastly, there's the inevitable confusion. People learning the
language will see char[] and string, and wonder which should be used
when. I can't think of any consistent understandable rule for that. So
it just winds up being wishy-washy. Adding more names into the global
space (which is what names in object.d are) should be done extremely
conservatively.
And yet we have "toString" and not "toCharArray" or "toUTF"!
True, and some have called for renaming char to utf8. While that would
be technically more correct (as toUTF would be, too), it just looks awful.
I suppose that since I grew up with char* meaning string, using char[]
seems perfectly natural. I tried typedef'ing char* to string now and
then, but always wound up going back to just using char*.
And we still have the "printf" in object.d too!
I know many feel that printf doesn't belong there. It certainly isn't
there for purity or consistency. It's there purely (!) for the
convenience of writing short quickie programs. I tend to use it for
quick debugging test cases, because it doesn't rely on the rest of D
working.
If someone wants to use the string alias as their personal or company
style, I have no issue with that, as other people *do* think differently
than me (which is abundantly clear here!).
I'll revert Build to string again as it is a lot easier to read. It started
out that way but I converted it to char[] to appease you (why I thought you
need appeasing is lost though). :-)
No, you certainly don't need to appease me! I do care about maintaining
a reasonably consistent style in Phobos, but I don't believe a language
should enforce a particular style beyond the standard library. Vive la
différence.
P.S. I did say to not 'enforce', but that doesn't mean I am above
recommending a particular style, as in
http://www.digitalmars.com/d/dstyle.html
↑ ↓ ← → Derek Parnell <derek psyc.ward> writes:
On Sat, 30 Sep 2006 21:18:02 -0700, Walter Bright wrote:
P.S. I did say to not 'enforce', but that doesn't mean I am above
recommending a particular style, as in
http://www.digitalmars.com/d/dstyle.html
Oh, I threw that away ages ago ;-)
--
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
↑ ↓ ← → Lars Ivar Igesund <larsivar igesund.net> writes:
Walter Bright wrote:
And yet we have "toString" and not "toCharArray" or "toUTF"!
True, and some have called for renaming char to utf8. While that would
be technically more correct (as toUTF would be, too), it just looks awful.
Nope, it just looks correct.
--
Lars Ivar Igesund
blog at http://larsivi.net
DSource & #D: larsivi
↑ ↓ ← → Lionello Lunesu <lio lunesu.remove.com> writes:
Lars Ivar Igesund wrote:
Walter Bright wrote:
And yet we have "toString" and not "toCharArray" or "toUTF"!
be technically more correct (as toUTF would be, too), it just looks awful.
Nope, it just looks correct.
I don't think renaming toString to toUTF gets rid of any confusion.
AFAIK, toString is meant for debugging and char[] should be enough, and
yet flexible enough for unicode strings.
In fact, "string toString()" would be a good solution too.
---
My 4 reasons for the "string" aliases:
* readability: less [] pairs;
* safety: char[] is not zero-terminated, so let's not pretend there's a
relation with C's char*. In fact: let's hide any relation;
* clarity: a char[] should not be iterated 1 char at a time, which makes
it different from an int[].
* consistency: "string toString()"
L.
↑ ↓ ← → Georg Wrede <georg.wrede nospam.org> writes:
Walter Bright wrote:
True, and some have called for renaming char to utf8. While that would
be technically more correct (as toUTF would be, too), it just looks awful.
Let's just say it would be a first step in lessening the confusion _we_
create in newcomers' heads.
↑ ↓ ← → Kevin Bealer <kevinbealer gmail.com> writes:
Georg Wrede wrote:
Walter Bright wrote:
True, and some have called for renaming char to utf8. While that would
be technically more correct (as toUTF would be, too), it just looks
awful.
Let's just say it would be a first step in lessening the confusion _we_
create in newcomers' heads.
I would kind of agree with this, but I think it's a two-edged knife.
If we say 'char[]' then users don't know it's a string until they read
the 'why D arrays are great' page (which they should read, but...)
If we say 'string' then we hide the fact that [] can be applied and that
other array-like operations can work.
For instance, from a Java perspective:
char[] : Users don't know that it's "String"; users see it as low-level.
Some will try to write things like 'find()' by hand since they
will figure arrays are low level and not expect this to exist.
string : Users will think it's immutable, special; they will ask "how do
I get one of the characters out of a string", "how do I convert
string to char[]?", and other things that would be obvious
without the alias.
Kevin
↑ ↓ ← → Georg Wrede <georg.wrede nospam.org> writes:
Kevin Bealer wrote:
Georg Wrede wrote:
Walter Bright wrote:
True, and some have called for renaming char to utf8. While that
would be technically more correct (as toUTF would be, too), it just
looks awful.
Let's just say it would be a first step in lessening the confusion
_we_ create in newcomers' heads.
I would kind of agree with this, but I think it's a two-edged knife.
If we say 'char[]' then users don't know it's a string until they read
the 'why D arrays are great' page (which they should read, but...)
If we say 'string' then we hide the fact that [] can be applied and that
other array-like operations can work.
For instance, from a Java perspective:
char[] : Users don't know that it's "String"; users see it as low-level.
Some will try to write things like 'find()' by hand since they
will figure arrays are low level and not expect this to exist.
Yes.
string : Users will think it's immutable, special; they will ask "how do
I get one of the characters out of a string", "how do I convert
string to char[]?", and other things that would be obvious
without the alias.
Well, with string, folks would at least be inclined to search for the
library function to do it.
---
Overall, having string instead of char[] should result in folks learning
and doing more with D _before_ they get tangled with UTF issues. (I
guess, getting tangled with UTF is unavoidable.) But the later
folks stumble on this, the better they can handle it. If it happens too
soon, then they will just run away from D.
But substituting string for char[] in D is not enough. More than half
the issue is the wording in the docs.
---
Another thing intimately connected with this is whether we should have
char[] or utf8[] (string or no string, this is an important thing anyway).
I understand that "char" is one of the words that a seasoned
programmer's fingers know by heart. So it would feel simply disgusting
to have to learn (and bother) to write "utf8" which I admit is a lot
more work to type. (Seriously.)
Now, "string" is easy for the fingers, and then you get to skip "[]",
which makes it all a little more palatable.
Having string would let us have the underlying type be utf8[], which
really emphasizes and calls your attention to the fact that it's not
byte-by-byte stuff we have there.
↑ ↓ ← → Anders F Björklund <afb algonet.se> writes:
Kevin Bealer wrote:
If we say 'char[]' then users don't know it's a string until they read
the 'why D arrays are great' page (which they should read, but...)
If we say 'string' then we hide the fact that [] can be applied and that
other array-like operations can work.
Which could be a *good* thing, since it would stop users from hurting
themselves by pretending that the D strings are arrays of characters ?
And when they have read up that they are "arrays of Unicode code units",
they should be OK with interpreting the "string" alias as char[] arrays.
For instance, from a Java perspective:
char[] : Users don't know that it's "String"; users see it as low-level.
Some will try to write things like 'find()' by hand since they
will figure arrays are low level and not expect this to exist.
string : Users will think it's immutable, special; they will ask "how do
I get one of the characters out of a string", "how do I convert
string to char[]?", and other things that would be obvious
without the alias.
I think the best answer would be: "to get a char[] from the string,
use the std.utf.toUTF8 function", since this also works even if you
redeclare the "string" alias to be something else - like wchar_t[] ?
Earlier* I suggested adding the alias utf8_t for "char", just like
we have int8_t for "byte", but I wouldn't rename the actual D types.
Just a little std.stdutf module with some aliases, if ever needed...
string std.string.toString( )
utf8_t[] std.utf.toUTF8( )
utf16_t[] std.utf.toUTF16( )
utf32_t[] std.utf.toUTF32( )
--anders
* digitalmars.D/11821, 2004-10-15
↑ ↓ ← → Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
Derek Parnell wrote:
(2) may, but char[] has no use
other than that of being a string, as a char[] is always a string and a
string is always a char[]. So I don't think string fits (2).
This is a little more debatable, but not worth generating hostility.
A string of text contains characters whose position in the string is
significant - there are semantics to be applied to the entire text. It is
quite possible to conceive of an application in which the characters in the
char[] array have no importance attached to their relative position within
the array *where compared to neighboring characters*. The order of
characters in text is significant but not necessarily so in an arbitrary
character array.
Conceptually a string is different from a char[], even though they are
implemented using the same technology.
Precisely! And even if such conceptual difference didn't exist, or is
very rare, 'string' is nonetheless more readable than 'char[]', a fact I
am constantly reminded of when I see 'int main(char[][] args)' instead
of 'int main(string[] args)', which translates much more quickly into
the brain as 'array of strings' than its current counterpart.
--
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
↑ ↓ ← → Geoff Carlton <gcarlton iinet.net.au> writes:
Bruno Medeiros wrote:
Precisely! And even if such conceptual difference didn't exist, or is
very rare, 'string' is nonetheless more readable than 'char[]', a fact I
am constantly reminded of when I see 'int main(char[][] args)' instead
of 'int main(string[] args)', which translates much more quickly into
the brain as 'array of strings' than its current counterpart.
There are also many cases where char arrays are not strings:
Single array of characters, not strings:
char GAME_10PT_LETTERS[] = { 'x', 'z' };
Two-dimensional array of characters, not string arrays:
char GAME_LETTERS[][] = { GAME_0PT_LETTERS, GAME_1PT_LETTERS, .. };
char m_scrabbleBoard[20][20];
↑ ↓ ← → Thomas Kuehne <thomas-dloop kuehne.cn> writes:
Derek Parnell schrieb am 2006-09-30:
On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:
Derek Parnell wrote:
And is it there yet? I mean, given that a string is just a lump of text, is
there any text processing operation that cannot be simply done to a char[]
item? I can't think of any but maybe somebody else can.
I believe it's there. I don't think std::string or java.lang.String have
anything over it.
I'm pretty sure that the phobos routines for search and replace only work
for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
always fail to deliver the correct result. It finds the first occurrence of
the byte value for the letter 'a' which may well be inside a Japanese
character. It looks for byte-subsets rather than character sub-sets.
~wow~
Have a look at std.string.find's source and try to stop giggling *g*
The correct implementation would be:
# import std.string;
# import std.c.string;
# import std.utf;
#
# int find(char[] s, dchar c)
# {
# if (c <= 0x7F)
# { // Plain old ASCII
# auto p = cast(char*)memchr(s, c, s.length);
# if (p)
# return p - cast(char *)s;
# else
# return -1;
# }
#
# // c is a universal character
# return std.string.find(s, toUTF8([c]));
# }
The same applies to ifind and the like.
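
A minimal sketch of the same idea for a current D compiler, assuming `std.utf.encode` from present-day Phobos. `findCodePoint` is a hypothetical name, not a Phobos function. It encodes the code point to UTF-8 once and then compares code-unit slices, so the value it returns is a code-unit offset, not a character count.

```d
import std.stdio : writeln;
import std.utf : encode;

// Hypothetical helper: offset (in code units) of the first occurrence
// of code point c in the UTF-8 text s, or -1 if it does not occur.
ptrdiff_t findCodePoint(const(char)[] s, dchar c)
{
    char[4] buf;                        // a code point needs at most 4 UTF-8 code units
    immutable len = encode(buf, c);     // number of code units actually written
    const needle = buf[0 .. len];

    if (s.length < len)
        return -1;

    // Compare whole encoded sequences, never a lone byte of a multi-byte one.
    foreach (i; 0 .. s.length - len + 1)
        if (s[i .. i + len] == needle)
            return cast(ptrdiff_t) i;
    return -1;
}

void main()
{
    // "a", then U+00E9 (two code units), then "b"
    writeln(findCodePoint("a\u00E9b", '\u00E9')); // 1
    writeln(findCodePoint("a\u00E9b", 'b'));      // 3
    writeln(findCodePoint("abc", 'x'));           // -1
}
```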
Thomas
↑ ↓ ← → Thomas Kuehne <thomas-dloop kuehne.cn> writes:
Thomas Kuehne schrieb am 2006-09-30:
Derek Parnell schrieb am 2006-09-30:
On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:
Derek Parnell wrote:
And is it there yet? I mean, given that a string is just a lump of text, is
there any text processing operation that cannot be simply done to a char[]
item? I can't think of any but maybe somebody else can.
I believe it's there. I don't think std::string or java.lang.String have
anything over it.
I'm pretty sure that the phobos routines for search and replace only work
for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
always fail to deliver the correct result. It finds the first occurrence of
the byte value for the letter 'a' which may well be inside a Japanese
character. It looks for byte-subsets rather than character sub-sets.
~wow~
Have a look at std.string.find's source and try to stop giggling *g*
The correct implementation would be:
As it seems, the original code depends on the undocumented index behavior
with regards to silent transcoding in foreach.
Thomas
↑ ↓ ← → Sean Kelly <sean f4.ca> writes:
Thomas Kuehne wrote:
As it seems, the original code depends on the undocumented index behavior
with regards to silent transcoding in foreach.
The wording could be more explicit, but I think the current
documentation implies the actual behavior:
"The index must be of int or uint type, it cannot be inout, and it is
set to be the index of the array element."
The docs should probably also be revised to allow for 64-bit indices,
where the index would be long or ulong. Something along the lines of:
"The index must be an integer type of size equal to size_t.sizeof. . ."
Sean
↑ ↓ ← → Geoff Carlton <gcarlton iinet.net.au> writes:
Walter Bright wrote:
Derek Parnell wrote:
And is it there yet? I mean, given that a string is just a lump of
text, is
there any text processing operation that cannot be simply done to a
char[]
item? I can't think of any but maybe somebody else can.
I believe it's there. I don't think std::string or java.lang.String have
anything over it.
And if a char[] is just as capable as a std::string, then why not have an
official alias in Phobos? Will 'alias char[] string' cause anyone any
problems?
I don't think it'll cause problems, it just seems pointless.
Hi,
The main reasons I think are these:
It simplifies the initial examples, particularly main(string[]), and
maps such as string[string]. More complex examples are a map of words
to text lines, string[][string], rather than char[][][char[]].
It clarifies the actual use of the entity. It is a text string, not
just a jumbled array of characters. Arrays of char can be used for
other things, such as the set of player letters in a Scrabble game. A
string has the additional meaning that we know it is a text string. The
alias reflects that intent.
Given a user wants to use a string, there is no need to expose the
implementation detail of how strings are done in D. Perhaps in perl,
strings are a linked list of shorts, but it doesn't mean that you'd have
list<short> all over the place.
Use of char[] and char[][] looks like low level C. It has also been
noted that it encourages char based indexing, which is not a good thing
for utf8.
Anyway, hope one of those points grabbed you!
Geoff
↑ ↓ ← → David Medlock <noone nowhere.com> writes:
Walter Bright wrote:
Derek Parnell wrote:
And is it there yet? I mean, given that a string is just a lump of
text, is
there any text processing operation that cannot be simply done to a
char[]
item? I can't think of any but maybe somebody else can.
I believe it's there. I don't think std::string or java.lang.String have
anything over it.
And if a char[] is just as capable as a std::string, then why not have an
official alias in Phobos? Will 'alias char[] string' cause anyone any
problems?
I don't think it'll cause problems, it just seems pointless.
The reason *I* want it is _alias_ does not respect the private:
visibility modifier.
So when I pull out an old piece of code which says
alias char[] string
and import it in my newer module I get conflicts when I compile.
Then I must do this silly hack where I include the newer file from the
old or vice versa.
If you don't add this to Phobos, at least adopt a method to
discriminate between more than one alias with the same name to resolve
the issue.
-DavidM
↑ ↓ ← → Anders F Björklund <afb algonet.se> writes:
Geoff Carlton wrote:
A simple alias of char[] to string would simplify the first glance code.
string x; // yep, a string
main (string[]) // an array of strings
string[string] m; // map of string to string
I believe single functions get pulled in as member functions? e.g.
find(string) can be used as string.find()? If so, it means that all the
string functionality can be added and then used naturally as member
functions on this "string" (which is really just the plain old char[] in
disguise).
Problem of "char[]" is both that it hides the fact that "char" is UTF-8
while at the same time it exposes the fact that it's stored as an array.
You can "improve" upon that readability with aliases, like declaring say
utf8_t -> char and string -> utf8_t[], but you still need to understand
Unicode and Arrays in order to use it outside of the provided methods...
I think "hides the implementation" was the biggest argument against it ?
http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues
This is a small thing, but I think it would help in terms of the mindset
of strings being a first class primitive, and clear up simple "hello
world" examples at the same time. Put simply, every modern language has
a first class string primitive type, except D - at least in terms of
nomenclature.
I made the big mistake of thinking it would be a good thing to be able to
switch between "ANSI" and "UNICODE" builds (of wxD), and so did it like:
version(UNICODE)
alias char[] string;
else // version(ANSI)
alias wchar_t[] string; // wchar[] on Windows, dchar[] on Unix
Still trying to sort out all the code problems with that idea, as there
is a ton of toUTF8 and other conversions to make strings work together.
In retrospect it would have been much easier to have stuck with char[],
and do the conversion from UTF-8 to the local encoding on the C++ side.
(since there were no guarantees that the "char" and "wchar_t" types in
C++ used UTF encodings, even if they did so in Unix/GTK+ for instance)
Any (minor) performance issues of having to do the UTF-8 <-> UTF-32
conversions were not worth the hassle of doing it on the D side, IMHO.
So I agree with the "alias char[] string;" and the string[string] args.
It's going to be used as wx.common.string for instance, in wxD library.
--anders
↑ ↓ ← → Anders F Björklund <afb algonet.se> writes:
I made the big mistake of thinking it would be a good thing to be able to
switch between "ANSI" and "UNICODE" builds (of wxD), and so did it like:
version(UNICODE)
alias char[] string;
else // version(ANSI)
alias wchar_t[] string; // wchar[] on Windows, dchar[] on Unix
Except the other way around, of course!
version(UNICODE)
alias wchar_t[] string;
else // version(ANSI)
alias char[] string;
Now, to get me some more coffee... :-P
--anders
↑ ↓ ← → Lionello Lunesu <lio lunesu.remove.com> writes:
I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish
they would be included by default in Phobos.
alias char[] string;
alias wchar[] wstring;
alias dchar[] dstring;
Perhaps, using string instead of char[], it's more obvious that it's not
zero-terminated. I've seen D examples online that just cast a char[] to
char* for use in MessageBox and the like (which worked since they were
string constants.)
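
A short sketch of the safer route, assuming a current D compiler and Phobos: `std.string.toStringz` returns a zero-terminated copy that can be handed to a C function. `cLength` is a hypothetical helper used only to show the call, and `strlen` stands in for any C API that expects a terminator.

```d
import core.stdc.string : strlen;
import std.stdio : writeln;
import std.string : toStringz;

// Hypothetical helper: length as seen by C code after proper termination.
size_t cLength(const(char)[] s)
{
    // A slice carries its own length and has no terminator of its own,
    // so make a zero-terminated copy before crossing into C.
    return strlen(toStringz(s));
}

void main()
{
    const(char)[] text = "hello";
    const(char)[] part = text[0 .. 3]; // "hel": the next byte in memory is 'l', not '\0'
    writeln(cLength(part));            // 3
}
```

Casting `part` straight to `char*` would let the C side read past the slice until it happened to meet a zero byte.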
L.
↑ ↓ ← → Anders F Björklund <afb algonet.se> writes:
Lionello Lunesu wrote:
Perhaps, using string instead of char[], it's more obvious that it's not
zero-terminated. I've seen D examples online that just cast a char[] to
char* for use in MessageBox and the like (which worked since they were
string constants.)
And probably only for ASCII string constants, at that...
--anders
↑ ↓ ← → Lionello Lunesu <lio lunesu.remove.com> writes:
Anders F Björklund wrote:
Lionello Lunesu wrote:
Perhaps, using string instead of char[], it's more obvious that it's
not zero-terminated. I've seen D examples online that just cast a
char[] to char* for use in MessageBox and the like (which worked since
they were string constants.)
And probably only for ASCII string constants, at that...
Right, that too!
char[] somestring = "....";
func( somestring[0] ); // WRONG: somestring[x] is not 1 character!
Using "string" would make it less obvious:
string somestring = ".....";
func( somestring[0] ); // [0] means what?
This goes for iteration as well. DMD will still deduce 'char' as the
element type, but at least one's less likely to type foreach(char c;str).
If you want to iterate the UNICODE characters in a string, you'll
specify "dchar" as the type and you won't worry about "how come I can
use dchar when it's a char[]":
foreach(dchar c; somestring)
func(c); // correct
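
To make the difference concrete, here is a small sketch for a current D compiler. The two function names are hypothetical. The literal uses the escape `\u00E9` so the result does not depend on the source file's encoding.

```d
import std.stdio : writeln;

// What .length and indexing see: UTF-8 code units.
size_t countCodeUnits(const(char)[] s)
{
    return s.length;
}

// What foreach with a dchar loop variable sees: decoded code points.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s) // the compiler decodes UTF-8 while iterating
        ++n;
    return n;
}

void main()
{
    const(char)[] s = "h\u00E9llo";  // U+00E9 occupies two code units in UTF-8
    writeln(countCodeUnits(s));      // 6
    writeln(countCodePoints(s));     // 5
}
```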
L.
↑ ↓ ← → Georg Wrede <georg.wrede nospam.org> writes:
Lionello Lunesu wrote:
I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish
they would be included by default in Phobos.
alias char[] string;
alias wchar[] wstring;
alias dchar[] dstring;
Perhaps, using string instead of char[], it's more obvious that it's not
zero-terminated. I've seen D examples online that just cast a char[] to
char* for use in MessageBox and the like (which worked since they were
string constants.)
Using char[] as long as you don't know about UTF seems to work pretty
well in D. But the moment you realise that we're having potential
multibyte characters in what essentially is a ubyte[], you get scared to
death, and start to wonder how on earth you haven't yet blown up your
hard disk.
You start having nightmares about slicing char arrays at the wrong
place, extracting single chars that might not be storable in a char, and
all of a sudden you decide to stick with your old language "till things
calm down".
The only medicine to this is simply to shut your eyes and keep coding on
like you never did realise anything.
It's a little like when you first realised Daddy isn't holding your
bike: you instantly fall hurting yourself, instead of realizing that
he's probably let go ages ago, and you still haven't fallen, so simply
keep going.
---
This doesn't mean I'm happy with this either, but I don't have the
energy to conjure up a significantly better solution _and_ fight for it
till it gets accepted. (Some things are just too hard to fix, like
"bit=bool" was, and now "auto/auto".)
↑ ↓ ← → Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Georg Wrede wrote:
Lionello Lunesu wrote:
I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish
they would be included by default in Phobos.
alias char[] string;
alias wchar[] wstring;
alias dchar[] dstring;
Perhaps, using string instead of char[], it's more obvious that it's
not zero-terminated. I've seen D examples online that just cast a
char[] to char* for use in MessageBox and the like (which worked since
they were string constants.)
Using char[] as long as you don't know about UTF seems to work pretty
well in D. But the moment you realise that we're having potential
multibyte characters in what essentially is a ubyte[], you get scared to
death, and start to wonder how on earth you haven't yet blown up your
hard disk.
You start having nightmares about slicing char arrays at the wrong
place, extracting single chars that might not be storable in a char, and
all of a sudden you decide to stick with your old language "till things
calm down".
The only medicine to this is simply to shut your eyes and keep coding on
like you never did realise anything.
It's a little like when you first realised Daddy isn't holding your
bike: you instantly fall hurting yourself, instead of realizing that
he's probably let go ages ago, and you still haven't fallen, so simply
keep going.
---
This doesn't mean I'm happy with this either, but I don't have the
energy to conjure up a significantly better solution _and_ fight for it
till it gets accepted. (Some things are just too hard to fix, like
"bit=bool" was, and now "auto/auto".)
haha too true.
I experienced this too as I read this ng. It hasn't been THAT traumatic
for me though, since everything seems to work as long as you stick to
English. I don't have the resources to even begin thinking about
non-English text (ex: paying people to translate stuff), so I don't lose
any sleep about it, at least not yet.
Perhaps there should be a string struct/class that has an undefined
underlying type (it could be UTF-8, 16, 32, you dunno really), and you
could index it to get the *complete* character at any position in the
string. Basically, it is like char[], but it /just works/ in all cases.
I'd almost rather have the size of a char be undefined, and just have
char[] be the said magic string type. If you want something with a
.size of 1, then there is byte/ubyte. There would probably have to be
some stuff in the phobos internals to handle such a string in a correct
manner.
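
One way such a type could look as a library struct, sketched for a current D compiler. `Text` is a hypothetical name, UTF-8 storage is an assumption, and `std.utf.stride` and `std.utf.decode` are the present-day Phobos functions. Indexing walks the text from the start, so it costs O(n) per lookup rather than the O(1) of a plain array.

```d
import std.stdio : writeln;
import std.utf : decode, stride;

// Hypothetical wrapper: indexes by code point instead of by code unit.
struct Text
{
    const(char)[] data; // assumed to hold valid UTF-8

    // Returns the n-th code point. Walks from the start, so O(n).
    dchar opIndex(size_t n) const
    {
        const(char)[] s = data;
        size_t i = 0;
        foreach (_; 0 .. n)
            i += stride(s, i); // skip one whole encoded character
        return decode(s, i);   // decode the character at offset i
    }
}

void main()
{
    auto t = Text("a\u00E9b");
    writeln(t[1] == '\u00E9'); // true
    writeln(t[2] == 'b');      // true
}
```

An out-of-range index is not handled gracefully here; a real type would need to check bounds and decide what to do with invalid input.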
Going even further... if you could make char[] be such a magic string
type, then wchar[] and dchar[] could probably be deprecated - use ushort
and uint instead. Then add the following aliases to phobos:
alias ubyte utf8;
alias ushort utf16;
alias uint utf32;
Just a thought. I'm no expert on UTF, but maybe this can start a
discussion that will result in the nightmares ending :)
↑ ↓ ← → Johan Granberg <lijat.meREM OVEgmail.com> writes:
Chad J wrote:
Perhaps there should be a string struct/class that has an undefined
underlying type (it could be UTF-8, 16, 32, you dunno really), and you
could index it to get the *complete* character at any position in the
string. Basically, it is like char[], but it /just works/ in all cases.
I'd almost rather have the size of a char be undefined, and just have
char[] be the said magic string type. If you want something with a
.size of 1, then there is byte/ubyte. There would probably have to be
some stuff in the phobos internals to handle such a string in a correct
manner.
I have thought about this too.
Going even further... if you could make char[] be such a magic string
type, then wchar[] and dchar[] could probably be deprecated - use ushort
and uint instead. Then add the following aliases to phobos:
alias ubyte utf8;
alias ushort utf16;
alias uint utf32;
I completely agree, char should hold a character independently of
encoding and NOT a code unit or something else. I think it would be
beneficial to D in the long term if chars were done right (meaning that
they can store any character). How it is implemented is not important, and
I believe performance is not a problem here, so ease of use and
correctness would be appreciated.
↑ ↓ ← → BCS <BCS pathlink.com> writes:
Johan Granberg wrote:
I completely agree, char should hold a character independently of
encoding and NOT a code unit or something else. I think it would be
beneficial to D in the long term if chars were done right (meaning that
they can store any character). How it is implemented is not important, and
I believe performance is not a problem here, so ease of use and
correctness would be appreciated.
Why isn't performance a problem?
If you are saying that this won't cause performance hits in run times or
memory space, I might be able to buy it, but I'm not yet convinced.
If you are saying that causing a performance hit in run times or memory
space is not a problem... in that case I think you are dead wrong and
you will not convince me otherwise.
In my opinion, any compiled language should allow fairly direct access
to the most efficient practical means of doing something*. If I didn't
care about speed and memory I would use some sort of scripting language.
A good set of libs should make most of this moot. Leave the char as is
and define a typedef struct or whatever that provides the added
functionality that you want.
* OTOH a language should not mandate code to be efficient at the expense
of ease of coding.
↑ ↓ ← → Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
BCS wrote:
Johan Granberg wrote:
I completely agree, char should hold a character independently of
encoding and NOT a code unit or something else. I think it would be
beneficial to D in the long term if chars were done right (meaning
that they can store any character). How it is implemented is not
important, and I believe performance is not a problem here, so ease of
use and correctness would be appreciated.
Why isn't performance a problem?
If you are saying that this won't cause performance hits in run times or
memory space, I might be able to buy it, but I'm not yet convinced.
If you are saying that causing a performance hit in run times or memory
space is not a problem... in that case I think you are dead wrong and
you will not convince me otherwise.
In my opinion, any compiled language should allow fairly direct access
to the most efficient practical means of doing something*. If I didn't
care about speed and memory I would use some sort of scripting language.
A good set of libs should make most of this moot. Leave the char as is
and define a typedef struct or whatever that provides the added
functionality that you want.
* OTOH a language should not mandate code to be efficient at the expense
of ease of coding.
I will go ahead and say that the current state of char[] is incorrect.
That is, if you write a program manipulating char[] strings, then run it
in China, you will be disappointed with the results. It won't matter
how fast the program runs, because bad stuff will happen like entire
strings becoming unreadable to the user.
Technically if you follow UTF and do your char[] manipulations very
carefully, it is correct, but realistically few if any people will do
such things (I won't). Also, if you do this, your program will probably
run as slow as one with the proposed char/string solution, maybe slower
(since language/stdlib level support can be heavily optimized).
What I'd like then, is a program that is correct and as fast as possible
while still being correct.
Sure you can get some speed gains by just using ASCII and saying to hell
with UTF, but you should probably only do that when profiling has shown
that such speed gains are actually useful/needed in your program.
Ultimately we have to decide whether we want D to default to UTF code
which might run slightly slower but allow better localization and
international friendliness, or if we want it to default to ASCII or some
such encoding that runs slightly faster but is mostly limited to English.
I'd like the default to be UTF. Then we can have a base of code to
correctly manipulate UTF strings (in phobos and language supported).
Writing correct ASCII manipulation routines without good library/language
support is a lot easier than writing good UTF manipulation routines
without good library/language support, and UTF will probably be used
much more than ASCII.
Also, if we move over to full blown UTF, we won't have to give up ASCII.
It seems to me like the phobos std.string functions are pretty much
ASCII string manipulating functions (no multibyte string support). So
just copy those out to a separate library, call it "ASCII lib", and
there's your library support for ASCII. That leaves string literals,
which is a slight problem, but I suppose easily fixed:
ubyte[] hi = "hello!"a;
Just add a postfix 'a' for strings which makes the string an ASCII
literal, of type ubyte[]. D arrays don't seem powerful enough to do UTF
manipulations without special attention, but they are powerful enough to
do ASCII manipulations without special attention, so using ubyte[] as an
ASCII string should give full language support for these. Given that
and ASCIILIB you pretty much have the current D string manipulation
capabilities afaik, and it will be fast.
↑ ↓ ← → Anders F Björklund <afb algonet.se> writes:
Chad J wrote:
I'd like the default to be UTF. Then we can have a base of code to
correctly manipulate UTF strings (in phobos and language supported).
Writing correct ASCII manipulation routines without good library/language
support is a lot easier than writing good UTF manipulation routines
without good library/language support, and UTF will probably be used
much more than ASCII.
But D already uses Unicode for all strings, encoded as UTF ?
When you say "ASCII", do you mean 8-bit encodings perhaps ?
(since all proper 7-bit ASCII are already valid UTF-8 too)
Also, if we move over to full blown UTF, we won't have to give up ASCII.
It seems to me like the phobos std.string functions are pretty much
ASCII string manipulating functions (no multibyte string support). So
just copy those out to a separate library, call it "ASCII lib", and
there's your library support for ASCII. That leaves string literals,
which is a slight problem, but I suppose easily fixed:
ubyte[] hi = "hello!"a;
I don't understand this, why can't you use UTF-8 for this ?
char[] hi = "hello!";
Just add a postfix 'a' for strings which makes the string an ASCII
literal, of type ubyte[]. D arrays don't seem powerful enough to do UTF
manipulations without special attention, but they are powerful enough to
do ASCII manipulations without special attention, so using ubyte[] as an
ASCII string should give full language support for these. Given that
and ASCIILIB you pretty much have the current D string manipulation
capabilities afaik, and it will be fast.
What is not powerful enough about the foreach(dchar c; str) ?
It will step through that UTF-8 array one codepoint at a time.
--anders
↑ ↓ ← → Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Anders F Björklund wrote:
Chad J wrote:
I'd like the default to be UTF. Then we can have a base of code to
correctly manipulate UTF strings (in phobos and language supported).
Writing correct ASCII manipulation routine without good library/language
support is a lot easier than writing good UTF manipulation routines
without good library/language support, and UTF will probably be used
much more than ASCII.
But D already uses Unicode for all strings, encoded as UTF ?
When you say "ASCII", do you mean 8-bit encodings perhaps ?
(since all proper 7-bit ASCII are already valid UTF-8 too)
Probably 7-bit. Anything where the size of one character is ALWAYS one
byte. I am already assuming that ASCII is a subset or at least is
mostly a subset of UTF8. However, I talk about it in an exclusive
manner because if you handle UTF8 strings properly then the code will
probably run at least slightly slower than with ASCII-only strings.
Also, if we move over to full blown UTF, we won't have to give up
ASCII. It seems to me like the phobos std.string functions are pretty
much ASCII string manipulating functions (no multibyte string
support). So just copy those out to a separate library, call it
"ASCII lib", and there's your library support for ASCII. That leaves
string literals, which is a slight problem, but I suppose easily fixed:
ubyte[] hi = "hello!"a;
I don't understand this, why can't you use UTF-8 for this ?
char[] hi = "hello!";
I was talking about IF we made char[] into a datatype that handles all
of those odd corner cases correctly (slices into multibyte strings, for
instance) then it will no longer be the same fast ASCII-only routines.
So for those who want the fast ASCII-only stuff, it would be nice to
specify a way to make string literals such that each character in the
literal takes only one byte, without ugly casting. To get an ASCII
monobyte string from a string literal in D I would have to do the following:
ubyte[] hi = cast(ubyte[])"hello!";
hmmm, yuck.
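For comparison, a minimal sketch of how the cast can be avoided, assuming a present-day Phobos where std.string.representation is available (it postdates this thread). The helper name isPureAscii is made up for illustration and is not a library function.

```d
import std.algorithm.searching : all;
import std.string : representation;

// Hypothetical helper: true if every code unit is 7-bit ASCII,
// i.e. the string really is one byte per character throughout.
bool isPureAscii(const(char)[] s)
{
    return s.representation.all!(b => b < 0x80);
}

void main()
{
    // View the literal's UTF-8 code units as raw bytes without a cast.
    immutable(ubyte)[] hi = "hello!".representation;
    assert(hi.length == 6);
    assert(hi[0] == 'h');

    assert(isPureAscii("hello!"));
    assert(!isPureAscii("Björklund")); // 'ö' needs two code units
}
```

Since 7-bit ASCII is already valid UTF-8, a check like this is enough to tell when byte indexing and character indexing coincide.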
Just add a postfix 'a' for strings which makes the string an ASCII
literal, of type ubyte[]. D arrays don't seem powerful enough to do
UTF manipulations without special attention, but they are powerful
enough to do ASCII manipulations without special attention, so using
ubyte[] as an ASCII string should give full language support for
these. Given that and ASCIILIB you pretty much have the current D
string manipulation capabilities afaik, and it will be fast.
What is not powerful enough about the foreach(dchar c; str) ?
It will step through that UTF-8 array one codepoint at a time.
I'm assuming 'str' is a char[], which would make that very nice. But it
doesn't solve correctly slicing or indexing into a char[]. If nothing
was done about this and I absolutely needed UTF support, I'd probably
make a class like so:
class String
{
char[] data;
...
dchar opIndex( int index )
{
foreach( int i, dchar c; data )
{
if ( i == index )
return c;
i++;
}
}
// similar thing for opSlice down here
...
}
Which is probably slower than could be done.
All in all it is a drag that we should have to learn all of this UTF
stuff. I want char[] to just work!
↑ ↓ ← → Anders F Björklund <afb algonet.se> writes:
Chad J wrote:
Probably 7-bit. Anything where the size of one character is ALWAYS one
byte. I am already assuming that ASCII is a subset or at least is
mostly a subset of UTF8. However, I talk about it in an exclusive
manner because if you handle UTF8 strings properly then the code will
probably run at least slightly slower than with ASCII-only strings.
It's mostly about looking out for the UTF "control" characters, which is
not more than a simple assertion in your ASCII-only functions really...
I don't think handling UTF-8 properly is a burden for string functions,
when you compare it with the enormous gain that it has over ASCII-only.
What is not powerful enough about the foreach(dchar c; str) ?
It will step through that UTF-8 array one codepoint at a time.
I'm assuming 'str' is a char[], which would make that very nice. But it
doesn't solve correctly slicing or indexing into a char[].
Well, it's also a lot "trickier" than that... For instance, my last name
can be written in Unicode as Björklund or Bj¨orklund, both of which are
valid - only that in one of them, the 'ö' occupies two full code points!
It's still a single character, which is why Unicode avoids that term...
As you know, if you need to access your strings by codepoint (something
that the Unicode group explicitly recommends against, in their FAQ) then
char[] isn't a very nice format - because of the conversion overhead...
But it's still possible to translate, transform, and translate back ?
If nothing was done about this and I absolutely needed UTF support,
I'd probably make a class like so: [...]
In my own mock String class, I cached the dchar[] codepoints on demand.
(viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)
All in all it is a drag that we should have to learn all of this UTF
stuff. I want char[] to just work!
Using Unicode strings and characters does require a little learning...
(where http://www.unicode.org/faq/utf_bom.html is a very good page)
And D does force you to think about string implementation, no question.
This has both pros and cons, but it is a deliberate language decision.
If you're willing to handle the "surrogates", then UTF-16 is a rather
good trade-off between the default UTF-8 and wasteful UTF-32 formats ?
A downside is that it is not "ascii-compatible" (has embedded NUL chars)
and that it is endian-dependent unlike the more universal UTF-8 format.
--anders
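The two spellings of the name can be compared with a short sketch, assuming a present-day Phobos with std.uni (normalize, byGrapheme) and assuming the second form is 'o' followed by the combining diaeresis U+0308. It counts code units, code points and graphemes for each form.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    string composed   = "Bj\u00F6rklund";  // 'ö' as one code point (U+00F6)
    string decomposed = "Bjo\u0308rklund"; // 'o' plus combining diaeresis (U+0308)

    // UTF-8 code units (bytes)
    assert(composed.length == 10);
    assert(decomposed.length == 11);

    // code points (walkLength decodes narrow strings)
    assert(composed.walkLength == 9);
    assert(decomposed.walkLength == 10);

    // graphemes, the closest thing to what a reader calls a character
    assert(composed.byGrapheme.walkLength == 9);
    assert(decomposed.byGrapheme.walkLength == 9);

    // The two forms differ byte for byte until normalized.
    assert(composed != decomposed);
    assert(normalize!NFC(decomposed) == composed);
}
```

So even indexing by code point would not give a stable notion of "the third character" unless the input is normalized first.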
↑ ↓ ← → Georg Wrede <georg.wrede nospam.org> writes:
Anders F Björklund wrote:
If you're willing to handle the "surrogates", then UTF-16 is a rather
good trade-off between the default UTF-8 and wasteful UTF-32 formats ?
A downside is that it is not "ascii-compatible" (has embedded NUL chars)
and that it is endian-dependent unlike the more universal UTF-8 format.
Problem is, using 16-bit you sort-of get away with _almost_ all of it.
But as a pay-back, the day your 16 bits don't suffice, you're in deep
crap. And that day _will_ come.
↑ ↓ ← → Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Anders F Björklund wrote:
What is not powerful enough about the foreach(dchar c; str) ?
It will step through that UTF-8 array one codepoint at a time.
I'm assuming 'str' is a char[], which would make that very nice. But
it doesn't solve correctly slicing or indexing into a char[].
Well, it's also a lot "trickier" than that... For instance, my last name
can be written in Unicode as Björklund or Bj¨orklund, both of which are
valid - only that in one of them, the 'ö' occupies two full code points!
It's still a single character, which is why Unicode avoids that term...
So it seems to me the problem is that those 2 bytes are both 2
characters and 1 character at the same time.
In this case, I'd prefer being able to index to a safe default (like the
ö, instead of the umlauts next to the o), or not being able to index at
all.
As you know, if you need to access your strings by codepoint (something
that the Unicode group explicitly recommends against, in their FAQ) then
char[] isn't a very nice format - because of the conversion overhead...
But it's still possible to translate, transform, and translate back ?
I read that FAQ at the bottom of this post, and didn't see anything
about accessing strings by codepoint. Maybe you mean a different FAQ
here, in which case, could I have a link please? I've been to the
unicode site before and all I remember was being confused and having a
hard time finding the info I wanted :(
Also I still am not sure exactly what a code point is. And that FAQ at
the bottom used the word "surrogate" a lot; I'm not sure about that one
either.
When you say char[] isn't a nice format, I wasn't thinking about having
the string class I mentioned earlier store the data ONLY as char[]. It
might be wchar[]. Or dchar[]. Then it would be automatically converted
between the two either at compile time (when possible) or dynamically at
runtime (hopefully only when needed). So if someone throws a Chinese
character literal at it, there is a very big clue there to use UTF32 or
something that can store all of the characters in a uniform width sort
of way, to speed indexing. Algorithms could be used so that a program
'learns' at runtime what kind of strings are dominating the program, and
uses algorithms optimized for those. Maybe this is a bit too complex,
but I can dream, hehe.
If nothing was done about this and I absolutely needed UTF support,
I'd probably make a class like so: [...]
In my own mock String class, I cached the dchar[] codepoints on demand.
(viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)
All in all it is a drag that we should have to learn all of this UTF
stuff. I want char[] to just work!
Using Unicode strings and characters does require a little learning...
(where http://www.unicode.org/faq/utf_bom.html is a very good page)
And D does force you to think about string implementation, no question.
This has both pros and cons, but it is a deliberate language decision.
If you're willing to handle the "surrogates", then UTF-16 is a rather
good trade-off between the default UTF-8 and wasteful UTF-32 formats ?
A downside is that it is not "ascii-compatible" (has embedded NUL chars)
and that it is endian-dependent unlike the more universal UTF-8 format.
--anders
My impression has gone from being quite scared of UTF to being not so
worried, but only for myself. D seems to be good at handling UTF, but
only if someone tells you to never handle strings as arrays of
characters. Unfortunately, the first thing you see in a lot of D
programs is "int main( char[][] args )" and there are some arrays of
characters being used as strings. This also means that some array
capabilities like indexing and the braggable slicing are more dangerous
than useful for string handling. It's a newbie trap.
Like I said earlier, I either want to be able to index/slice strings
safely, or not at all (or better yet, not by any intuitive means).
↑ ↓ ← → Anders F Björklund <afb algonet.se> writes:
Chad J wrote:
I read that FAQ at the bottom of this post, and didn't see anything
about accessing strings by codepoint. Maybe you mean a different FAQ
here, in which case, could I have a link please? I've been to the
unicode site before and all I remember was being confused and having a
hard time finding the info I wanted :(
I meant http://www.unicode.org/faq/utf_bom.html#12
Also I still am not sure exactly what a code point is. And that FAQ at
the bottom used the word "surrogate" a lot; I'm not sure about that one
either.
Code point is the closest thing to a "character", although it might take
more than one Unicode code point to represent a single Unicode grapheme.
Surrogates are used with UTF-16, to represent "too large" code points...
i.e. they always occur in "surrogate pairs", which combine to a single
When you say char[] isn't a nice format, I wasn't thinking about having
the string class I mentioned earlier store the data ONLY as char[]. It
might be wchar[]. Or dchar[]. Then it would be automatically converted
between the two either at compile time (when possible) or dynamically at
runtime (hopefully only when needed). So if someone throws a Chinese
character literal at it, there is a very big clue there to use UTF32 or
something that can store all of the characters in a uniform width sort
of way, to speed indexing. Algorithms could be used so that a program
'learns' at runtime what kind of strings are dominating the program, and
uses algorithms optimized for those. Maybe this is a bit too complex,
but I can dream, hehe.
Actually I said that dchar[] (i.e. UTF-32) wasn't ideal, but anyway...
(UTF-8 or UTF-16 is preferable, for the reasons in the UTF FAQ above)
We already have char[] as the string default in D, but most models for
a String class use wchar[] (i.e. UTF-16), for instance Mango or Java:
* http://mango.dsource.org/classUString.html (uses the ICU lib)
* http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html
All formats do use Unicode, so converting from one UTF to another is
mostly a question of memory/performance and not about any data loss.
However, it is not converted at compile time (without using templates)
so mixing and matching different representations is somewhat of a pain.
I think that char[] for string and wchar[] for String are good defaults.
My impression has gone from being quite scared of UTF to being not so
worried, but only for myself. D seems to be good at handling UTF, but
only if someone tells you to never handle strings as arrays of
characters. Unfortunately, the first thing you see in a lot of D
programs is "int main( char[][] args )" and there are some arrays of
characters being used as strings. This also means that some array
capabilities like indexing and the braggable slicing are more dangerous
than useful for string handling. It's a newbie trap.
It is, since it isn't really "arrays of characters" but "arrays of code
units". What muddies the waters further is that sometimes they're equal.
That is, with ASCII characters each character fits into a D char unit.
Without surrogates, each character (from BMP) fits into one wchar unit.
However, all code that handles the shorter formats should be prepared to
handle non-ASCII (for UTF-8) and surrogates (for UTF-16), or use UTF-32:
bool isAscii(char c) { return (c <= 0x7f); }
bool isSurrogate(wchar c) { return (c >= 0xD800 && c <= 0xDFFF); }
But a warning that D uses multi-byte strings might be in order, yes...
Another warning that it only supports UTF-8 platforms* might also be ?
--anders
* "main(char[][] args)" does not work for any non-UTF consoles,
as you will get invalid UTF sequences for the non-ASCII chars.
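To make the surrogate case concrete, here is a small sketch assuming a present-day Phobos with std.utf (toUTF8, toUTF16, toUTF32). It encodes one code point from outside the BMP in all three formats and reuses the isSurrogate test from above; the choice of U+1D11E is only an example.

```d
import std.utf : toUTF16, toUTF32, toUTF8;

// Same test as in the post above.
bool isSurrogate(wchar c) { return c >= 0xD800 && c <= 0xDFFF; }

void main()
{
    string  s8  = "\U0001D11E"; // MUSICAL SYMBOL G CLEF, outside the BMP
    wstring s16 = toUTF16(s8);
    dstring s32 = toUTF32(s8);

    assert(s8.length  == 4); // four UTF-8 code units
    assert(s16.length == 2); // one surrogate pair
    assert(s32.length == 1); // one UTF-32 code unit

    // Both halves of the pair fall in the surrogate range.
    assert(isSurrogate(s16[0]) && isSurrogate(s16[1]));
    assert(s16[0] == 0xD834 && s16[1] == 0xDD1E);

    // Converting between the encodings loses nothing.
    assert(toUTF8(s16) == s8);
}
```

Only the UTF-32 form has a one-to-one mapping between array elements and code points.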
↑ ↓ ← → Anders F Björklund <afb algonet.se> writes:
Chad J wrote:
char[] data;
dchar opIndex( int index )
{
foreach( int i, dchar c; data )
{
if ( i == index )
return c;
i++;
}
}
This code probably does not work as you think it does...
If you loop through a char[] using dchars (with a foreach),
then the int will get the codeunit index - *not* codepoint.
(the ++ in your code above looks more like a typo though,
since it needs to *either* foreach i, or do it "manually")
import std.stdio;
void main()
{
    char[] str = "Björklund";
    foreach(int i, dchar c; str)
    {
        writefln("%4d \\U%08X '%s'", i, c, ""d ~ c);
    }
}
Will print the following sequence:
0 \U00000042 'B'
1 \U0000006A 'j'
2 \U000000F6 'ö'
4 \U00000072 'r'
5 \U0000006B 'k'
6 \U0000006C 'l'
7 \U00000075 'u'
8 \U0000006E 'n'
9 \U00000064 'd'
Notice how the non-ASCII character takes *two* code units ?
(if you expect indexing to use characters, that'd be wrong)
More at http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs
--anders
↑ ↓ ← → Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Anders F Björklund wrote:
Chad J wrote:
char[] data;
dchar opIndex( int index )
{
foreach( int i, dchar c; data )
{
if ( i == index )
return c;
i++;
}
}
This code probably does not work as you think it does...
If you loop through a char[] using dchars (with a foreach),
then the int will get the codeunit index - *not* codepoint.
(the ++ in your code above looks more like a typo though,
since it needs to *either* foreach i, or do it "manually")
import std.stdio;
void main()
{
char[] str = "Björklund";
foreach(int i, dchar c; str)
{
writefln("%4d \\U%08X '%s'", i, c, ""d ~ c);
}
}
Will print the following sequence:
0 \U00000042 'B'
1 \U0000006A 'j'
2 \U000000F6 'ö'
4 \U00000072 'r'
5 \U0000006B 'k'
6 \U0000006C 'l'
7 \U00000075 'u'
8 \U0000006E 'n'
9 \U00000064 'd'
Notice how the non-ASCII character takes *two* code units ?
(if you expect indexing to use characters, that'd be wrong)
More at http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs
--anders
ah. And yep the i++ was a typo (oops).
So maybe something like:
dchar opIndex( int index )
{
    int i;
    foreach( dchar c; data )
    {
        if ( i == index )
            return c;
        i++;
    }
    // index is past the last code point
    throw new Exception( "String index out of range" );
}
The i is no longer the foreach's index, so the i++ isn't a typo anymore.
Thanks for the info. I'll check out that faq a little later, gotta go.
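The same linear scan can also be sketched with std.utf.decode, assuming a present-day Phobos. The free function codePointAt and its signature are illustrative only, not part of any library.

```d
import std.utf : decode;

// Hypothetical helper: returns the index-th code point of a UTF-8 string.
dchar codePointAt(const(char)[] data, size_t index)
{
    size_t pos = 0; // code unit offset into data
    while (pos < data.length)
    {
        dchar c = decode(data, pos); // decodes one code point, advances pos
        if (index == 0)
            return c;
        --index;
    }
    throw new Exception("code point index out of range");
}

void main()
{
    assert(codePointAt("Björklund", 2) == 'ö');
    assert(codePointAt("Björklund", 3) == 'r'); // code unit 4, code point 3
}
```

Each lookup walks the string from the start, so this is O(n) per index, which is the overhead mentioned earlier in the thread.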
↑ ↓ ← → Georg Wrede <georg.wrede nospam.org> writes:
Chad J wrote:
I will go ahead and say that the current state of char[] is incorrect.
That is, if you write a program manipulating char[] strings, then run it
in China, you will be disappointed with the results. It won't matter
how fast the program runs, because bad stuff will happen like entire
strings becoming unreadable to the user.
Wrong.
And that's precisely what I meant about the Daddy holding bike allegory
a few messages back.
The current system seems to work "by magic". So, if you do go to China,
it'll "just work".
At this point you _should_ not believe me. :-) But it still works.
---
The secret is, there actually is a delicate balance between UTF-8 and
the library string operations. As long as you use library functions to
extract substrings, join or manipulate them, everything is OK. And very
few of us actually either need to, or see the effort of bit-twiddling
individual octets in these "char" arrays.
So things just keep on working.
---
Not convinced yet? Well, a lot of folks here are from Europe, and our
languages contain "non-ASCII" characters. Our text manipulating programs
still work all right. And, actually D is pretty popular in Japan. Every
once in a while some Japanese guys pop on-and-off here, and some of them
don't even speak English, so they use a machine translator(!) to talk
with us. Just guess if they use ASCII in their programs. And you know
what, most of these guys even use their own characters for variable
names in D!
And not one of them has complained about "disappointing results".
---
That's why I continued with: keep your eyes shut and keep on coding.
↑ ↓ ← → Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Georg Wrede wrote:
The secret is, there actually is a delicate balance between UTF-8 and
the library string operations. As long as you use library functions to
extract substrings, join or manipulate them, everything is OK. And very
few of us actually either need to, or see the effort of bit-twiddling
individual octets in these "char" arrays.
But this is what I'm talking about... you can't slice them or index
them. I might actually index a character out of an array from time to
time. If I don't know about UTF, and I do just keep on coding, and I do
something like this:
char[] str = "some string in nonenglish text";
for ( int i = 0; i < str.length; i++ )
{
str[i] = doSomething( str[i] );
}
and this will fail right?
If it does fail, then everything is not alright. You do have to worry
about UTF. Someone has to tell you to use a foreach there.
↑ ↓ ← → Georg Wrede <georg.wrede nospam.org> writes:
Chad J wrote:
Georg Wrede wrote:
The secret is, there actually is a delicate balance between UTF-8 and
the library string operations. As long as you use library functions to
extract substrings, join or manipulate them, everything is OK. And
very few of us actually either need to, or see the effort of
bit-twiddling individual octets in these "char" arrays.
But this is what I'm talking about... you can't slice them or index
them. I might actually index a character out of an array from time to
time. If I don't know about UTF, and I do just keep on coding, and I do
something like this:
char[] str = "some string in nonenglish text";
for ( int i = 0; i < str.length; i++ )
{
str[i] = doSomething( str[i] );
}
and this will fail right?
If it does fail, then everything is not alright. You do have to worry
about UTF. Someone has to tell you to use a foreach there.
Yes. That's why I talked about you falling down once you realise Daddy's
not holding the bike.
Part of UTF-8's magic lies in that it is amazingly easy to get working
smoothly with truly minor tweaks to "formerly ASCII-only" libraries --
so that even the most exotic languages have no problem.
Your concerns about the for loop are valid, and expected. Now, IMHO, the
standard library should take care of "all" the situations where you
would ever need to split, join, examine, or otherwise use strings,
"non-ASCII" or not. (And I really have no complaint (Walter!) about
this.) Therefore, in no normal circumstances should you have to twiddle
them yourself -- unless.
And this "unless" is exactly why I'm unhappy with the situation, too.
Problem is, _technology_wise_ the existing setup may actually be the
best, both considering ease of writing the library, ease of using it,
robustness of both the library and users' code, and the headaches saved
from programmers who, either haven't heard of the issue (whether they're
American or Chinese!), or who simply trust their lives with the machinery.
So, where's the actual problem???
At this point I'm inclined to say: the documentation, and the stage
props! The latter meaning: exposing the fact that our "strings" are just
arrays is psychologically wrong, and even more so is the fact that we're
shamelessly storing entities of variable length in arrays which have no
notion of such -- even worse, while we brag with slices!
If this had been a university course assignment, we'd have been thrown
out of class, for both half baked work, and for arrogance towards our
client, victimizing the coder.
The former meaning: we should not be like "we're bad enough to overtly
use plain arrays for variable-length data, now if you have a problem
with it, then go home and learn stuff, or then just trust us".
Both "documentation" and "stage props" ultimately meaning that the
largest problem here is psychology, pedagogy, and education.
---
A lot would already be won by:
merely aliasing char[] to string, and discouraging other than guru-level
folks from screwing with their internals. This alone would save a lot of
Fear, Uncertainty and D-phobia.
The documentation should take pains in explaining up front that if you
_really_ want to do Character-by-Character ops _and_ you live outside of
America, then the Right way to do it (ehh, actually the Canonical Way),
is to first convert the string to dchar[]. Period.
Then, if somebody else knows enough of UTF-8 and knows he can handle bit
twiddling more efficiently than using the Canonical Way, with plain
char[] and "foreignish", then let him. But let that be undocumented and
Un-Discussed in the docs. Precisely like a lot of other things are. (And
should be.) And will be. He's on his own, and he ought to know it.
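A minimal sketch of that convert-first approach, assuming a present-day Phobos with std.utf (toUTF32, toUTF8) and std.uni.toUpper. Upper-casing is only a stand-in for whatever per-character operation is wanted.

```d
import std.uni : toUpper;
import std.utf : toUTF32, toUTF8;

void main()
{
    string s = "Björklund";

    // Convert once: one array element per code point.
    dchar[] wide = toUTF32(s).dup;
    assert(wide.length == 9);
    assert(wide[2] == 'ö'); // plain indexing is now safe

    // Character-by-character work on the dchar[] copy.
    foreach (ref c; wide)
        c = toUpper(c);

    // Convert back to UTF-8 when done.
    string back = toUTF8(wide);
    assert(back == "BJÖRKLUND");
}
```

This costs four bytes per code point while the copy is alive, which is the memory trade-off raised elsewhere in the thread.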
---
In other words, the normal programmer should believe he's working with
black-box Strings, and he will be happy with it. That way he'll survive
whether he's in Urduland or Boise, Idaho -- without ever needing
to have heard about UTF or other crap.
Not until in Appendix Z of the manual should we ever admit that the
Emperor's Clothes are just plain arrays, and we apologize for the breach
of manners of storing variable length data in simple naked arrays. And
here would be the right place to explain how come this hasn't blown up
in our faces already. And, exactly how you'll avoid it too. (This
_needs_ to contain an adequate explanation about the actual format of
UTF-8.)
---
TO RECAP
The _single_ biggest strings-related disservice to our pilgrims is to
lead them to believe, that D stores
strings in something like utf8[]
internally.
Now that's an oxymoron, if I ever saw one. (If utf8[] was _actually_
implemented, it would probably have to be an alias of char[][]. Right?
Right? What we have instead is ubyte[], which is _not_ the same as
utf8[].) (Oh, and if it ever becomes obvious that not _everybody_
understood this, then that in itself simply proves my point here.)
(*1)
And the fault lies in the documentation, not the implementation!
This results, in braincell-hours wasted, precisely as much as everybody
has to waste them, before they realise that the acronym RAII is a filthy
lie. Akin only to the former "German _Democratic_ Republic". Only a
politician should be capable of this kind of deception.
Ok, nobody is doing it on purpose. Things being too clear to oneself
often result in difficulties to find ways to express them to new people.
(Happens every day at the Math department! :-( ) And since all
in-the-know are unable to see it, and all not-in-the-know are too, then
both groups might think it's the thing itself that is "the problem", and
not merely the chosen _presentation_ of it.
#################
Sorry for sounding Righteous, arrogant and whatever. But this really is a
5 minute thing for one person to fix for good, while it wastes entire
days or months _per_person_, from _every_ non-defoiled victim who
approaches the issue. Originally I was one of them: hence the aggression.
-------------------------------------------
(*1) Even I am not simultaneously both literally and theoretically right
here. Those who saw it right away, probably won't mind, since it's the
point that is the issue here.
Now, having to write this disclaimer, IMHO simply again underlines the
very point attempted here.
↑ ↓ ← → Walter Bright <newshound digitalmars.com> writes:
Chad J wrote:
But this is what I'm talking about... you can't slice them or index
them. I might actually index a character out of an array from time to
time. If I don't know about UTF, and I do just keep on coding, and I do
something like this:
char[] str = "some string in nonenglish text";
for ( int i = 0; i < str.length; i++ )
{
str[i] = doSomething( str[i] );
}
and this will fail right?
If it does fail, then everything is not alright. You do have to worry
about UTF. Someone has to tell you to use a foreach there.
Yes, you do have to be aware of it being UTF, just like in C you have to
be aware that strings are 0 terminated. But once aware of it, there is
plenty of support for it in the core language and in std.utf.
You can also simply use dchar[], which has a one to one mapping between
characters and indices, if you prefer.
Contrast that with C++, which has no usable or portable support for
UTF-8, UTF-16, or any Unicode. All your carefully coded use of
std::string needs to be totally scrapped and redone with your own custom
classes, should you decide your app needs to support unicode.
You can also wrap char[] inside a class that provides a view of the data
as if it were dchar's. But I don't think the performance of such a
class would be competitive. Interestingly, it turns out that most string
operations do not need to be concerned with the number of char's in a
character (like "find this substring"), and forcing them to care just
makes for inefficiency.
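The substring case can be sketched as follows, assuming a present-day Phobos with std.string.indexOf. The search compares code units only, and the index it returns is a code unit offset that is always safe to slice at, because a valid UTF-8 sequence cannot match starting in the middle of another character.

```d
import std.string : indexOf;

void main()
{
    string name = "Björklund";

    // Plain code unit search; no decoding is needed to find the match.
    auto found = name.indexOf("klund");
    assert(found == 5); // code unit offset ('ö' occupies units 2 and 3)

    // Slicing at a match boundary never splits a multibyte character.
    size_t at = cast(size_t) found;
    assert(name[at .. $] == "klund");
    assert(name[0 .. at] == "Björ");
}
```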
↑ ↓ ← → Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
Contrast that with C++, which has no usable or portable support for
UTF-8, UTF-16, or any Unicode. All your carefully coded use of
std::string needs to be totally scrapped and redone with your own custom
classes, should you decide your app needs to support unicode.
As long as you're aware that you are working in UTF-8 I think
std::string could still be used. It just may be strange to use
substring searches to find multibyte characters with no built-in support
for dchar-type searching.
You can also wrap char[] inside a class that provides a view of the data
as if it were dchar's. But I don't think the performance of such a
class would be competitive. Interestingly, it turns out that most string
operations do not need to be concerned with the number of char's in a
character (like "find this substring"), and forcing them to care just
makes for inefficiency.
Yup. I realized this while working on array operations and it came as a
surprise--when I began I figured I would have to provide overloads for
char strings, but in most cases it simply isn't necessary.
Sean
↑ ↓ ← → Walter Bright <newshound digitalmars.com> writes:
Sean Kelly wrote:
Walter Bright wrote:
Contrast that with C++, which has no usable or portable support for
UTF-8, UTF-16, or any Unicode. All your carefully coded use of
std::string needs to be totally scrapped and redone with your own
custom classes, should you decide your app needs to support unicode.
As long as you're aware that you are working in UTF-8 I think
std::string could still be used. It just may be strange to use
substring searches to find multibyte characters with no built-in support
for dchar-type searching.
It's so broken that there are proposals to reengineer core C++ to add
support for UTF types.
1) implementation-defined whether a char is signed or unsigned, so
you've got to cast the result of any string[i]
2) none of the iteration, insertion, appending, etc., operations can
handle multibyte
3) no UTF conversion or transliteration
4) C++ source text encoding is implementation-defined, so no using UTF
characters in source code (have to use \u or \U notation)
↑ ↓ ← → Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
Sean Kelly wrote:
Walter Bright wrote:
Contrast that with C++, which has no usable or portable support for
UTF-8, UTF-16, or any Unicode. All your carefully coded use of
std::string needs to be totally scrapped and redone with your own
custom classes, should you decide your app needs to support unicode.
As long as you're aware that you are working in UTF-8 I think
std::string could still be used. It just may be strange to use
substring searches to find multibyte characters with no built-in
support for dchar-type searching.
It's so broken that there are proposals to reengineer core C++ to add
support for UTF types.
1) implementation-defined whether a char is signed or unsigned, so
you've got to cast the result of any string[i]
Oops, forgot about this.
2) none of the iteration, insertion, appending, etc., operations can
handle multibyte
True. And I hinted at this above.
3) no UTF conversion or transliteration
4) C++ source text encoding is implementation-defined, so no using UTF
characters in source code (have to use \u or \U notation)
Personally, I see this as a language deficiency more than a deficiency
in std::string. std::string is really just a vector with some search
capabilities thrown in. It's not that great for a string class, but it
works well enough as a general sequence container. And it will work a
tad better once they impose the same data contiguity guarantee that
vector has (I believe that's one of the issues set to be resolved for 0x).
Overall, I do agree with you. Though I suppose that's obvious as I'm a
former C++ advocate who now uses D quite a bit :-)
Sean
↑ ↓ ← → Walter Bright <newshound digitalmars.com> writes:
Sean Kelly wrote:
3) no UTF conversion or transliteration
4) C++ source text encoding is implementation-defined, so no using UTF
characters in source code (have to use \u or \U notation)
Personally, I see this as a language deficiency more than a deficiency
in std::string.
That's why the proposals to fix it are rewriting some of the *core* C++
language.
std::string is really just a vector with some search
capabilities thrown in.
Another difficulty with it is it doesn't have a connection with
std::vector<char>.
It's not that great for a string class, but it
works well enough as a general sequence container. And it will work a
tad better once they impose the same data contiguity guarantee that
vector has (I believe that's one of the issues set to be resolved for 0x).
Overall, I do agree with you. Though I suppose that's obvious as I'm a
former C++ advocate who now uses D quite a bit :-)
:-)
↑ ↓ ← → Johan Granberg <lijat.meREM OVEgmail.com> writes:
Georg Wrede wrote:
Wrong.
And that's precisely what I meant about the Daddy holding bike allegory
a few messages back.
The current system seems to work "by magic". So, if you do go to China,
it'll "just work".
At this point you _should_ not believe me. :-) But it still works.
---
But is this not a needless source of confusion, that could be eliminated
by defining char as "big enough to hold a unicode code point" or
something else that eliminates the possibility to incorrectly divide utf
tokens.
I will have to try using char[] with non-ASCII characters, though I have
been using dchar for that up till now.
↑ ↓ ← → Georg Wrede <georg.wrede nospam.org> writes:
Johan Granberg wrote:
Georg Wrede wrote:
Wrong.
And that's precisely what I meant about the Daddy holding bike
allegory a few messages back.
The current system seems to work "by magic". So, if you do go to
China, it'll "just work".
At this point you _should_ not believe me. :-) But it still works.
---
But is this not a needless source of confusion, that could be eliminated
by defining char as "big enough to hold a unicode code point" or
something else that eliminates the possibility to incorrectly divide utf
tokens.
I will have to try using char[] with non-ASCII characters, though I have
been using dchar for that up till now.
You might begin with pasting this and compiling it:
import std.stdio;

void main()
{
    int öylätti;
    int ШеФФ;
    öylätti = 37;
    ШеФФ = 19;
    writefln("Köyhyys 1 on %d ja nöyrä 2 on %d, että näin.", öylätti, ШеФФ);
}
It will compile, and run just fine. (The source file having been read
into DMD as a single big string, and then having gone through comment
removal, tokenizing, parsing, lexing, compiling, optimizing, and finally
the variable names having found their way into the executable. Even
though the front end has been written in D itself, with simply char[]
all over the place.)
(Then you might see that the Windows "command prompt window" renders the
output wrong, but it's only from the fact that Windows itself doesn't
handle UTF-8 right in the Command Window.)
The next thing you might do is to write a grep program (that takes as
input a file and as output writes the lines found). Write the program as
if you had never heard this discussion. Then feed it the Kalevala in
Finnish, or Mao's Red Book in Chinese. Should still work.
As long as you don't start tampering with the individual octets in
strings, you should be just fine. Don't think about UTF and you'll prosper.
↑ ↓ ← → Derek Parnell <derek psyc.ward> writes:
On Sat, 30 Sep 2006 03:03:02 +0300, Georg Wrede wrote:
As long as you don't start tampering with the individual octets in
strings, you should be just fine. Don't think about UTF and you'll prosper.
The Build program does lots of 'tampering'. I had to rewrite many standard
routines and create some new ones to deal with unicode characters because
the standard ones just don't work. And Build still fails to do some things
correctly (e.g. case insensitive compares) but that's on the TODO list.
I have to think about UTF because it doesn't work unless I do that.
--
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
↑ ↓ ← → Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
On Sat, 30 Sep 2006 03:03:02 +0300, Georg Wrede wrote:
As long as you don't start tampering with the individual octets in
strings, you should be just fine. Don't think about UTF and you'll prosper.
The Build program does lots of 'tampering'. I had to rewrite many standard
routines and create some new ones to deal with unicode characters because
the standard ones just don't work.
Do you still remember which they were?
And Build still fails to do some things
correctly (e.g. case insensitive compares) but that's on the TODO list.
Yes, case insensitive compares are difficult if you want to cater for
non-ASCII strings. While it may not be unreasonably difficult to get
American, European and Russian strings right, there will always be
languages and character sets where even the Unicode guys aren't sure
what is right. Unfortunately.
↑ ↓ ← → Geoff Carlton <gcarlton iinet.net.au> writes:
Georg Wrede wrote:
The secret is, there actually is a delicate balance between UTF-8 and
the library string operations. As long as you use library functions to
extract substrings, join or manipulate them, everything is OK. And very
few of us actually either need to, or see the effort of bit-twiddling
individual octets in these "char" arrays.
So things just keep on working.
I agree, but I disagree that there is a problem, or that utf-8 is a bad
choice, or that perhaps char[] or string should be called utf8 instead.
As a note here, I actually had a page of text localised into Chinese
last week - it came back as a utf8 text file.
The only thing with utf8 is that glyphs aren't represented by a single
char. But utf16 is no better! And even utf32 codepoints can be
combined into a single rendered glyph. So truncating a string at an
arbitrary index is not going to slice on a glyph boundary.
However, it doesn't mean utf8 is ASCII mixed with "garbage" bytes. That
garbage is a unique series of bytes that represent a codepoint. This is
a property not found in any other encoding.
As such, everything works, strstr, strchr, strcat, printf, scanf - for
ASCII, normal unicode, and the "Astral planes". It all just works. The
only thing that breaks is if you tried to index or truncate the data by
hand.
But even that mostly works, you can iterate through, looking for ASCII
sequences, chop out ASCII and string together more stuff, it all works
because you can just ignore the higher order bytes. Pretty much the
only thing that fails is if you said "I don't know what's in the string,
but chop it off at index 12".
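To make the two cases above concrete, here is a minimal sketch, assuming a current D2 compiler and Phobos (where `string` is an alias for immutable UTF-8 code units). `isValidUtf8` is a hypothetical helper name, not something from the thread. It shows a substring search working on code units, and a slice at an arbitrary index leaving a broken sequence.

```d
import std.string : indexOf;
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);        // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string s = "Motörhead";           // 'ö' takes two UTF-8 code units
    assert(s.length == 10);           // .length counts code units, not characters

    // Substring search compares code units and still lands in the right place.
    assert(s.indexOf("head") == 6);

    // A slice that ends on a code point boundary is still valid UTF-8...
    assert(isValidUtf8(s[0 .. 3]));   // "Mot"
    // ...but cutting through the middle of 'ö' is not.
    assert(!isValidUtf8(s[0 .. 4]));
}
```

The index returned by the search is a code unit index, so it can be used directly for slicing without splitting a character.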
↑ ↓ ← → Georg Wrede <georg.wrede nospam.org> writes:
Geoff Carlton wrote:
Georg Wrede wrote:
The secret is, there actually is a delicate balance between UTF-8 and
the library string operations. As long as you use library functions to
extract substrings, join or manipulate them, everything is OK. And
very few of us actually either need to, or see the effort of
bit-twiddling individual octets in these "char" arrays.
So things just keep on working.
I agree, but I disagree that there is a problem, or that utf-8 is a bad
choice, or that perhaps char[] or string should be called utf8 instead.
As a note here, I actually had a page of text localised into Chinese
last week - it came back as a utf8 text file.
The only thing with utf8 is that glyphs aren't represented by a single
char. But utf16 is no better! And even utf32 codepoints can be
combined into a single rendered glyph. So truncating a string at an
arbitrary index is not going to slice on a glyph boundary.
However, it doesn't mean utf8 is ASCII mixed with "garbage" bytes. That
garbage is a unique series of bytes that represent a codepoint. This is
a property not found in any other encoding.
As such, everything works, strstr, strchr, strcat, printf, scanf - for
ASCII, normal unicode, and the "Astral planes". It all just works. The
only thing that breaks is if you tried to index or truncate the data by
hand.
But even that mostly works, you can iterate through, looking for ASCII
sequences, chop out ASCII and string together more stuff, it all works
because you can just ignore the higher order bytes. Pretty much the
only thing that fails is if you said "I don't know what's in the string,
but chop it off at index 12".
Yes.
↑ ↓ ← → Johan Granberg <lijat.meREM OVEgmail.com> writes:
Georg Wrede wrote:
Geoff Carlton wrote:
But even that mostly works, you can iterate through, looking for ASCII
sequences, chop out ASCII and string together more stuff, it all works
because you can just ignore the higher order bytes. Pretty much the
only thing that fails is if you said "I don't know what's in the
string, but chop it off at index 12".
Yes.
How should we chop strings on character boundaries?
I have a text rendering function that uses freetype and want to restrict
the width of the rendered string (I have to use some sort of search
here, binary or linear) by truncating it. Right now I use dchar but if
char is sufficient it would save me conversions all over the place.
↑ ↓ ← → Walter Bright <newshound digitalmars.com> writes:
Johan Granberg wrote:
How should we chop strings on character boundaries?
std.utf.toUTFindex() should do the trick.
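A minimal sketch of that suggestion, assuming a current D2 compiler and Phobos. `truncateChars` is a hypothetical helper name. `std.utf.toUTFindex` maps a code point count to a code unit index, which can then be used as a slice bound.

```d
import std.utf : count, toUTFindex;

// Hypothetical helper: keeps at most maxChars code points of s,
// cutting only on a code point boundary.
string truncateChars(string s, size_t maxChars)
{
    // count() returns the number of code points; the check keeps
    // toUTFindex from walking past the end of a short string.
    if (count(s) <= maxChars)
        return s;
    return s[0 .. toUTFindex(s, maxChars)];
}

void main()
{
    assert(truncateChars("José Motörhead", 4) == "José");
    assert(truncateChars("abc", 10) == "abc");
}
```

For the width-limited rendering case, a search over `maxChars` could call a helper like this on each step. Note that this cuts on code point boundaries only; a combining sequence that renders as one glyph can still be split.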
↑ ↓ ← → Johan Granberg <lijat.meREM OVEgmail.com> writes:
BCS wrote:
Why isn't performance a problem?
If you are saying that this won't cause performance hits in run times or
memory space, I might be able to buy it, but I'm not yet convinced.
If you are saying that causing a performance hit in run times or memory
space is not a problem... in that case I think you are dead wrong and
you will not convince me otherwise.
In my opinion, any compiled language should allow fairly direct access
to the most efficient practical means of doing something*. If I didn't
care about speed and memory I would use some sort of scripting language.
A good set of libs should make most of this moot. Leave the char as is
and define a typedef struct or whatever that provides the added
functionality that you want.
* OTOH a language should not mandate code to be efficient at the expense
of ease of coding.
I don't think any performance hit will be so big that it causes problems
(max x4 memory and negligible computation overhead). Hope that made
clear what I meant.
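As a rough illustration of the memory side of this estimate, here is a small sketch, assuming a current D2 compiler and Phobos. `byteSize` is a hypothetical helper name. It compares the storage taken by the same text as UTF-8, UTF-16 and UTF-32.

```d
import std.conv : to;

// Hypothetical helper: storage used by the elements of an array, in bytes.
size_t byteSize(T)(const(T)[] a)
{
    return a.length * T.sizeof;
}

void main()
{
    string  s8  = "Köyhyys";          // 7 characters, 'ö' needs 2 code units
    wstring s16 = s8.to!wstring;
    dstring s32 = s8.to!dstring;

    assert(byteSize(s8)  == 8);       // 6 * 1 + 1 * 2
    assert(byteSize(s16) == 14);      // 7 * 2
    assert(byteSize(s32) == 28);      // 7 * 4

    // Pure ASCII text is the worst case for the ratio: exactly 4x.
    assert(byteSize("abc".to!dstring) == 4 * byteSize("abc"));
}
```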
↑ ↓ ← → BCS <BCS pathlink.com> writes:
Johan Granberg wrote:
BCS wrote:
Why isn't performance a problem?
If you are saying that causing a performance hit in run times or
memory space is not a problem... in that case I think you are dead
wrong and you will not convince me otherwise.
I don't think any performance hit will be so big that it causes problems
(max x4 memory and negligible computation overhead). Hope that made
clear what I meant.
If you will note, I said nothing about the size of the hit. While some
may disagree, I think that any unneeded hit is a problem.
One alternative that I could live with would use 4 character types:
char one codeunit in whatever encoding the runtime uses
schar one 8 bit code unit (ASCII or utf-8)
wchar one 16 bit code unit (same as before)
dchar one 32 bit code unit (same as before)
(using the same thing for ASCII and UTF-8 may be a problem, but this
isn't my field)
The point being that char, wchar and dchar are not representing numbers
and should be their own type. This also preserves direct access to 8, 16
and 32 bit types.
↑ ↓ ← → Anders F Björklund <afb algonet.se> writes:
BCS wrote:
One alternative that I could live with would use 4 character types:
char one codeunit in whatever encoding the runtime uses
schar one 8 bit code unit (ASCII or utf-8)
wchar one 16 bit code unit (same as before)
dchar one 32 bit code unit (same as before)
We have that already:
ubyte one codeunit in whatever encoding the runtime uses
char one 8 bit code unit (ASCII or utf-8)
There is no support in Phobos for runtime/native encodings,
but you can use the "iconv" library to do such conversions ?
(using the same thing for ASCII and UTF-8 may be a problem, but this isn't my
field)
All ASCII characters are valid UTF-8 code units, so it's OK.
--anders
↑ ↓ ← → BCS <BCS pathlink.com> writes:
Anders F Björklund wrote:
BCS wrote:
One alternative that I could live with would use 4 character types:
char one codeunit in whatever encoding the runtime uses
schar one 8 bit code unit (ASCII or utf-8)
wchar one 16 bit code unit (same as before)
dchar one 32 bit code unit (same as before)
We have that already:
ubyte one codeunit in whatever encoding the runtime uses
char one 8 bit code unit (ASCII or utf-8)
ubyte is an 8 bit unsigned number not a character encoding.
[after some more reading]
I may be just rambling but...
how about have the type of the value denote the encoding. One for ASCII
would only ever store ASCII (UTF-8 is invalid), same for UTF-8,16 and
32. Direct assignment would be illegal (as with, say int[] -> Object) or
implicitly converted (as with int -> real). Casts would be provided.
Indexing would be by codepoint. Non-array variables would be big enough
to store any codepoint (ASCII -> 8bit, !ASCII -> 32-bit). Some sort of
"whatever the system uses" data type (à la C's int) could be used for
actual output, maybe even escaping anything that won't get displayed
correctly.
This all sort of follows the idea of "call it what it is and don't hide
the overhead". 1) Characters are a different type of data than numbers
(see the threads on bool) and as such, that should be reflected in the
type system. 2) I have no problem with high overhead operations as long
as I can avoid using them when I don't want to.
There is no support in Phobos for runtime/native encodings,
but you can use the "iconv" library to do such conversions ?
(using the same thing for ASCII and UTF-8 may be a problem, but this
isn't my field)
All ASCII characters are valid UTF-8 code units, so it's OK.
But UTF-8 is not ASCII.
--anders
↑ ↓ ← → Georg Wrede <georg.wrede nospam.org> writes:
BCS wrote:
I may be just rambling but...
how about have the type of the value denote the encoding. One for ASCII
would only ever store ASCII (UTF-8 is invalid)
Then all Americans would use that instead of UTF-8.
This is natural, since first you code for yourself, later maybe for your
boss, etc. And, you'd only become aware of any problems when a Latino
tries to use his own name José, talk about Motörhead, or Anaïs the
fragrance. And the mail and newsreader you wrote in D simply would not work.
Guess if anybody would heed the warning "Only use this new ASCII
encoding when you are absolutely positive the program never will
encounter a single foreign sentence or letter".
So, better not.
---
D's current setup and documentation encourage this kind of suggestion,
and I don't blame you.
Things being like they are, a programmer who wants to write a crossword
puzzle generator, would of course begin with:
char[20][20] theGrid;
It's a shame that an otherwise so excellent language ( + the wording in
its docs) downright leads you to do this.
The guy naturally assumes that D being a "UTF-8" language, this would
work even in Chinese. (Hey, char[] foo = "José Motörhead from the band
Anaïs is on stage!"; works, so why wouldn't theGrid?) Poor guy.
I can't blame anyone then wanting to stay within ASCII for the rest of
D's life.
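For the crossword case, one way around the problem is to store code points rather than code units in the grid. This is only a sketch, assuming a current D2 compiler and Phobos; `gridSize` and `placeAcross` are hypothetical names, and the grid element type is dchar instead of char.

```d
import std.utf : count;

enum gridSize = 20;

// Hypothetical helper: writes word left to right into the grid, one code
// point per cell, starting at (row, col). Returns false if it does not fit.
bool placeAcross(ref dchar[gridSize][gridSize] grid,
                 size_t row, size_t col, string word)
{
    if (row >= gridSize || col >= gridSize)
        return false;
    if (count(word) > gridSize - col)   // count() gives code points
        return false;
    foreach (dchar c; word)             // foreach over dchar decodes the UTF-8
        grid[row][col++] = c;
    return true;
}

void main()
{
    dchar[gridSize][gridSize] theGrid;
    assert(placeAcross(theGrid, 0, 0, "nöyrä"));
    assert(theGrid[0][1] == 'ö');       // each cell holds a whole character
    assert(theGrid[0][4] == 'ä');
    assert(!placeAcross(theGrid, 0, 18, "nöyrä"));  // too long for the row
}
```

With char[20][20], the same word would occupy seven cells, and the two-byte letters would be split across neighbouring cells.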
↑ ↓ ← → Anders F Björklund <afb algonet.se> writes:
BCS wrote:
ubyte is an 8 bit unsigned number not a character encoding.
Right, I actually meant ubyte[] but void[] might have been
more accurate for representing any (even non-UTF) encoding.
(I used ubyte[] in my mapping functions, since they only
used legacy 8-bit encodings like "cp1252" or "macroman")
Re-reading your post, it seems to me that you were more talking
about doing an alias to the UTF type most suitable for the OS ?
I guess UTF-8 would be a good choice if the operating system
doesn't use Unicode, since then it'll have to do lookups anyway.
Otherwise the existing "wchar_t" isn't bad for such an UTF type,
it will be UTF-16 on Windows and UTF-32 on Unix (linux,darwin,...)
All ASCII characters are valid UTF-8 code units, so it's OK.
But UTF-8 is not ASCII.
So you would like a char "type" that would only take ASCII ?
I guess that is *one* way of dealing with it, you could also
have a wchar type that wouldn't accept surrogates (BMP only)
Then it would be OK to index them by code unit / character...
(since each allowed character would fit into one code unit)
Sounds a little like signed vs. unsigned integers actually ?
Then again, 5 character types is even worse than the 3 now.
--anders
↑ ↓ ← → BCS <BCS pathlink.com> writes:
Anders F Björklund wrote:
[...]
Then again, 5 character types is even worse than the 3 now.
--anders
The more I think about it the worse this gets.
What I really would like is a system that allows O(1) operations on
strings (slice out char 7 to 27), allows somewhat compact encoding
(8bit) and allows safe operations on UTF (if I do something dumb, it
complains). All at the same time would be nice, but is not needed.
Come to think about it, a lib that will do good FAST conversion between
buffers:
//note: "in" is intentional, it won't allocate anything
UTF8to16(in char[], in wchar[]);
UTF8to32(in char[], in dchar[]);
UTF16to32(in wchar[], in dchar[]);
...
would get most of what I want.
<sarcasm>
And while I'm at it, I'd like a million bucks please.
</sarcasm>
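A buffer-to-buffer conversion of the kind wished for above can be sketched with what Phobos already offers. This assumes a current D2 compiler; `utf8to32` is a hypothetical name, and the caller supplies the destination buffer so the success path does not allocate. Only `std.utf.decode` and `std.exception.enforce` are library calls.

```d
import std.exception : enforce;
import std.utf : decode;

// Hypothetical helper: decodes UTF-8 from src into the caller's buffer dst
// and returns the filled part of dst. Throws if dst is too small or if src
// is malformed (decode throws UTFException in that case).
dchar[] utf8to32(const(char)[] src, dchar[] dst)
{
    size_t i = 0;   // code unit index into src
    size_t n = 0;   // code points written to dst
    while (i < src.length)
    {
        enforce(n < dst.length, "destination buffer too small");
        dst[n++] = decode(src, i);   // decode advances i past the code point
    }
    return dst[0 .. n];
}

void main()
{
    dchar[16] buf;
    auto r = utf8to32("Ш€", buf[]);
    assert(r.length == 2);
    assert(r[0] == 'Ш' && r[1] == '€');
    assert(r.ptr is buf.ptr);        // result is a slice of the caller's buffer
}
```

The result can then be sliced by character index in O(1), at the cost of four bytes per character while it is in that form.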