
digitalmars.D - First Impressions

reply Geoff Carlton <gcarlton iinet.net.au> writes:
Hi,
I'm a C++ user who's just tried D and I wanted to give my first
impressions.  I can't really justify moving any of my codebase over to
D, so I wrote a quick tool to parse a dictionary file and make a
histogram - a bit like the wc demo in the dmd package.

1.)
I was a bit underwhelmed by the syntax of char[].  I've used Lua, which
also has strings, functions, and maps as basic primitives, so going back
to array notation seems a bit low level.  Also, char[][] is not the best
start in the main() declaration.  Is it a 2D array, or an array of
arrays?  Then there is char[][char[]].  What a mouthful for a simple
map!

Well, now I need to find elements.  I'd use std::string's find() here,
but the wc example uses raw array operations everywhere.  Even isalpha is
done as 'a'..'z' comparisons on an indexed array.  Back to low-level C stuff.

A simple alias of char[] to string would simplify first-glance code:
   string x;         // yep, a string
   main(string[])    // an array of strings
   string[string] m; // map of string to string

I believe single functions get pulled in as member functions?  e.g.
find(string) can be used as string.find()?  If so, it means that all the
string functionality can be added and then used naturally as member
functions on this "string" (which is really just the plain old char[] in
disguise).
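A minimal D1-style sketch of that idea (the string alias and the vowels helper are hypothetical, not part of Phobos):

```d
import std.stdio;
import std.string;

// Hypothetical alias, as suggested above; not in Phobos.
alias char[] string;

// A free function whose first parameter is an array...
int vowels(string s)
{
    int n = 0;
    foreach (char c; s)
        if (std.string.find("aeiou", c) != -1)
            n++;
    return n;
}

void main()
{
    string word = "histogram";
    // ...can be called with member syntax on the array:
    writefln(word.vowels());                 // same as vowels(word): 3
    writefln(std.string.find(word, "gram")); // substring search: 5
}
```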

This is a small thing, but I think it would help in terms of the mindset
of strings being a first class primitive, and clear up simple "hello
world" examples at the same time.  Put simply, every modern language has
a first class string primitive type, except D - at least in terms of
nomenclature.

2.)
I liked the more powerful for loop.  I'm curious: is there any ability to 
use delegates the same way Lua does?  I was blown away the first 
time I realised how simple custom iteration is in Lua.  In 
short, you write a function that returns a delegate (a closure?) that 
itself returns values, terminating in nil.

   e.g. for r in rooms_in_level(lvl) // custom function

As Lua can handle multiple return values, it can also do a key,value 
sort of thing, like D can.  What a wonderful way of allowing any sort 
of iteration.
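For what it's worth, D's foreach already gives that key,value pairing directly, both over associative arrays and over plain arrays.  A hedged sketch (the capital map is invented for illustration):

```d
import std.stdio;

void main()
{
    // The char[][char[]] "mouthful" used as a map (example data made up).
    char[][char[]] capital;
    capital["France"] = "Paris";
    capital["Japan"] = "Tokyo";

    // foreach yields key and value together, much like Lua's pairs().
    foreach (char[] country, char[] city; capital)
        writefln("%s -> %s", country, city);

    // Over a plain array, the "key" is the index.
    foreach (int i, char c; "abc")
        writefln("%d: %s", i, c);
}
```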

It beats pages of code in C++ to write an iterator that can go forwards, 
or one that can go backwards (wow, the power of C++!).  C++09 still 
isn't much of an improvement here, it only sugars the awful iterator syntax.

3.)
From the newsgroups, it seems 'auto' as local RAII and 'auto' as
automatic type deduction are still tied to the one keyword.  Well, in 
Lua, 'local' is pretty intuitive for locally scoped variables, and 
'auto' will soon mean automatic type deduction in C++.  So those make 
sense to me personally.  Looks like this has been discussed to death, 
but that's my 2c.

4.)
The D version of Scintilla and d-build were nice, very easy to use.
Personally I would have preferred the default behaviour of dbuild to put 
object files in an /obj subdirectory and the final exe in the 
directory dbuild is run from.

This way, it could be run from a root directory, operate on a /src 
subdirectory, and not clutter up the source with object files.  There is 
a switch for that, of course, but I can't imagine when you would want 
object files sitting in the same directory as the source.

Well, as first impressions go, I was pleased by D, and am interested to 
see how well it fares as time goes on.  It's just a shame that the 
tools, libraries, and IDEs are all in C++!

Thanks,
Geoff
Sep 28 2006
next sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Geoff Carlton" <gcarlton iinet.net.au> wrote in message 
news:efhp1r$1r9s$1 digitaldaemon.com...
 Hi,
 I'm a C++ user who's just tried D and I wanted to give my first
 impressions.  I can't really justify moving any of my codebase over to
 D, so I wrote a quick tool to parse a dictionary file and make a
 histogram - a bit like the wc demo in the dmd package.

 1.)
 I was a bit underwhelmed by the syntax of char[].  I've used lua which
 also has strings,functions and maps as basic primitives, so going back
 to array notation seems a bit low level.  Also, char[][] is not the best
 start in the main() declaration.  Is it a 2D array, an array of
 arrays?  Then there is the char[][char[]].  What a mouthful for a simple
 map!

 Well, now I need to find elements.. I'd use std::string's find() here, but 
 the wc example has all array operations. Even isalpha is done as 'a', 'z' 
 comparisons on an indexed array. Back to low level C stuff.

 A simple alias of char[] to string would simplify the first glance code.
   string x;    // yep, a string
   main (string[]) // an array of strings
   string[string] m; // map of string to string

 I believe single functions get pulled in as member functions?  e.g.
 find(string) can be used as string.find()?  If so, it means that all the
 string functionality can be added and then used naturally as member
 functions on this "string" (which is really just the plain old char[] in
 disguise).

They're more just syntactic sugar than member functions.  You can, in fact, do this with any array type, e.g.

   void foo(int[] arr) { ... }

   int[] x;
   x = [4, 5, 6, 7]; // bug in the new array literals ;)
   x.foo();
 This is a small thing, but I think it would help in terms of the mindset
 of strings being a first class primitive, and clear up simple "hello
 world" examples at the same time.  Put simply, every modern language has
 a first class string primitive type, except D - at least in terms of
 nomenclature.

It does look nicer. I suppose the counterargument would be that having an alias char[] string might not be portable -- what about wchar[] and dchar[]? Would they be wstring and dstring? Or would we choose wchar[] or dchar[] to be string to be more forward-thinking (since languages like Java and C# already use UTF-16 as the default string type)? I've never been too incredibly put off by char[], but of course other people have other opinions.
 2.)
 I liked the more powerful for loop.  I'm curious is there any ability to 
 use delegates in the same way as lua does?  I was blown away the first 
 time I realised how simple it was for custom iteration in lua.  In short, 
 you write a function that returns a delegate (a closure?) that itself 
 returns arguments, terminating in nil.

   e.g. for r in rooms_in_level(lvl) // custom function

 As lua can handle multiple return arguments, it can also do a key,value 
 sort of thing that D can do.  What a wonderful way of allowing any sort of 
 iteration.

Unfortunately the way Lua does "foreach" iteration is exactly the inverse of how D does it.  Lua gets an iterator and keeps calling it in the loop; D gives the loop (the entire body!) to the iterator function, which runs the loop.  So it's something like a "true" iterator as described in the Lua book:

   level.each(function(r) print("Room: " .. r) end)

D does it this way, I guess, to make it easier to write iterators.  Since you're limited to one return value, it's simpler to make the iterator a callback and pass the indices into the foreach body than it is to make the iterator return multiple values through "out" parameters.  That, and it's easier to keep track of state with a callback iterator.  (I'm going through which to use in a Lua-like language that I'm designing too!)
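To make the inversion concrete, here is a hedged D1-style sketch of such a callback iterator via opApply (the Level class and its rooms are invented for illustration; D1 spelled the delegate's ref parameter as inout):

```d
import std.stdio;

// Hypothetical container: opApply receives the foreach body as a
// delegate and drives the loop itself.
class Level
{
    char[][] rooms;

    int opApply(int delegate(inout char[]) dg)
    {
        foreach (char[] r; rooms)
        {
            int result = dg(r);   // run the foreach body once
            if (result)
                return result;    // non-zero means break/return from the loop
        }
        return 0;
    }
}

void main()
{
    Level lvl = new Level;
    lvl.rooms ~= "hall";
    lvl.rooms ~= "vault";

    foreach (char[] room; lvl)    // rewritten by the compiler into opApply
        writefln("Room: %s", room);
}
```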
 It beats pages of code in C++ to write an iterator that can go forwards, 
 or one that can go backwards (wow, the power of C++!).  C++09 still isn't 
 much of an improvement here, it only sugars the awful iterator syntax.

Weeeeeeeee! C++
 3.)
 From the newsgroups, it seems like 'auto' as local raii and 'auto' as
 automatic type deduction are still linked to the one keyword.  Well in 
 lua, 'local' is pretty intuitive for locally scoped variables.  Also 
 'auto' will soon mean automatic type deduction in C++.  So those make 
 sense to me personally.  Looks like this has been discussed to death, but 
 thats my 2c.

I don't even wanna get into it ;)  _Technically_ speaking, auto isn't really "used" in type deduction; instead, the syntax is just <storage class> <identifier>, skipping the type.  Since the default storage class is auto, it looks like auto is being used to determine the type, but it also works with e.g.

   static x = 5;

I think a better way to do it would be to have a special "stand-in" type, such as

   var x = 5;
   static var y = 20;
   auto var f = new Foo(); // this will be RAII and automatically type-determined
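That is, the deduction hangs off any storage class, not off auto specifically.  A small hedged sketch of the current (D1-era) behaviour:

```d
import std.stdio;

class Foo { }

void main()
{
    // Any storage class with the type omitted triggers inference:
    auto x = 5;       // x is int (auto is just the default storage class)
    static y = 20;    // static works just as well

    // But 'auto' with an explicit class type means scoped (RAII)
    // destruction: this is the double duty complained about above.
    auto Foo f = new Foo();  // destroyed at end of scope

    writefln(x + y);  // 25
}
```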
 4.)
 The D version of Scintilla and d-build was nice, very easy to use.
 Personally I would have preferred the default behaviour of dbuild to put 
 object files in an /obj subdirectory and the final exe in the original 
 directory dbuild is run from.

 This way, it could be run from a root directory, operate on a /src 
 subdirectory, and not clutter up the source with object files.  There is a 
 switch for that, of course, but I can't imagine when you would want object 
 files sitting in the same directory as the source.

 Well, as first impressions go, I was pleased by D, and am interested to 
 see how well it fares as time goes on.  Its just a shame that all the 
 tools/library/IDE is all in C++!

 Thanks,
 Geoff 

Sep 28 2006
parent Geoff Carlton <gcarlton iinet.net.au> writes:
Jarrett Billingsley wrote:

 It does look nicer.  I suppose the counterargument would be that having an 
 alias char[] string might not be portable -- what about wchar[] and dchar[]? 
 Would they be wstring and dstring?  Or would we choose wchar[] or dchar[] to 
 be string to be more forward-thinking (since languages like Java and C# 
 already use UTF-16 as the default string type)?
 

I'm a fan of utf-8, so it would seem natural to have string, wstring, and dstring.  IMO utf-16 is backward thinking, and has the dubious property of being mostly fixed width, except when it's not.  And even utf-32 isn't one-to-one in terms of glyphs rendered on screen.

Anyway, as a low-level programmer, I appreciate that it's all based on very powerful and flexible arrays.  But as a high-level programmer, I don't want to be reminded of that fact every time I need to use a string.
 Unfortunately the way Lua does "foreach" iteration is exactly the inverse of 
 how D does it.  Lua gets an iterator and keeps calling it in the loop; D 
 gives the loop (the entire body!) to the iterator function, which runs the 
 loop.  So it's something like a "true" iterator as described in the Lua 
 book:

Ok, although the advantage of the first method is that you write the iterator once, and then it's easy to use for all clients.  Wrapping up the loop in a function is just backward, although it is much more palatable in the inline format than a clunky out-of-line functor or using _1, _2 hackery magic.  As an example, I love the fact that I can do this in Lua:

   for r1 in rooms_in_level(lvl) do
     for r2 in rooms_in_level(lvl) do
       for c in connections(r1, r2) do
         print("got connection " .. c)
       end
     end
   end

I wrote Floyd's algorithm in Lua in the time it would take me in C++ to not even finish thinking about what structures, classes, and vectors I would use.  I imagine D would be as easy, although not as nice as the above style.
 
 D does it this way I guess to make it easier to write iterators.  Since 
 you're limited to one return value, it's simpler to make the iterator a 
 callback and pass the indices into the foreach body than it is to make the 
 iterator return multiple parameters through "out" parameters.  That, and 
 it's easier to keep track of state with a callback iterator.  (I'm going 
 through which to use in a Lua-like language that I'm designing too!)

Multiple returns would be tricky.  C++ looks like it's getting there with std::tuple and std::tie, but as always the downside is the sheer clunkiness.  As heterogeneous arrays aren't in the core language for either C++ or D, it's tricky to come up with a clean solution.

Designing a language would be great fun, and I think Lua has done a great many things right.  Not sure about the typeless state though; it gets messy with large projects.  Still, no templates (or rather, every function is like a template).
Sep 28 2006
prev sibling next sibling parent Lutger <lutger.blijdestijn gmail.com> writes:
Geoff Carlton wrote:
 Hi,
 I'm a C++ user who's just tried D and I wanted to give my first
 impressions.  I can't really justify moving any of my codebase over to
 D, so I wrote a quick tool to parse a dictionary file and make a
 histogram - a bit like the wc demo in the dmd package.

You'll sure be pleased with D coming from C++.
 1.)
 I was a bit underwhelmed by the syntax of char[]...

Yes, I was too.  But although it doesn't look very nice at first sight, D's arrays are nothing like C++ arrays.  Strings are first class, array notation is consistent, and once I got used to them, together with the concatenation and slicing operators, I found they are quite powerful yet simple to use.
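A small sketch of what that buys you in practice (D1-style; the example text is made up):

```d
import std.stdio;
import std.string;

void main()
{
    char[] s = "first impressions";

    // Slicing is built in and cheap: a slice is a view, not a copy.
    char[] first = s[0 .. 5];
    writefln(first);                       // first

    // Concatenation with ~ builds a new array; ~= appends in place.
    char[] t = first ~ " class";
    t ~= "!";
    writefln(t);                           // first class!

    // And the library routines take plain char[]:
    writefln(std.string.find(t, "class")); // 6
}
```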
 2.)
 I liked the more powerful for loop.  I'm curious is there any ability to 
 use delegates in the same way as lua does?  I was blown away the first 
 time I realised how simple it was for custom iteration in lua.  In 
 short, you write a function that returns a delegate (a closure?) that 
 itself returns arguments, terminating in nil.

You can enable a class to use the foreach statement. http://www.digitalmars.com/d/statement.html#foreach
 4.)
 The D version of Scintilla and d-build was nice, very easy to use.
 Personally I would have preferred the default behaviour of dbuild to put 
 object files in an /obj subdirectory and the final exe in the original 
 directory dbuild is run from.
 
 This way, it could be run from a root directory, operate on a /src 
 subdirectory, and not clutter up the source with object files.  There is 
 a switch for that, of course, but I can't imagine when you would want 
 object files sitting in the same directory as the source.

Check out build: http://www.dsource.org/projects/build
 Well, as first impressions go, I was pleased by D, and am interested to 
 see how well it fares as time goes on.  Its just a shame that all the 
 tools/library/IDE is all in C++!
 
 Thanks,
 Geoff

Sep 28 2006
prev sibling next sibling parent reply Derek Parnell <derek nomail.afraid.org> writes:
On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:

 Hi,
 I'm a C++ user who's just tried D

 I was a bit underwhelmed by the syntax of char[].

Yes.  It isn't very 'nice' for a modern language.  Though as you note below, a simple alias can help a lot:

   alias char[] string;
 I believe single functions get pulled in as member functions?  e.g.
 find(string) can be used as string.find()? 

This syntactic sugar works for all arrays: given func(T[] x, a), the calls func(x, a) and x.func(a) are equivalent.
 2.)
 I liked the more powerful for loop.  I'm curious is there any ability to 
 use delegates in the same way as lua does?

Yes it can use anonymous delegates. You can also overload it in classes.
 
 3.)
  From the newsgroups, it seems like 'auto' as local raii and 'auto' as
 automatic type deduction are still linked to the one keyword.

There are lots of D users hoping that this wart will be repaired before too long.
 4.)
 The D version of Scintilla and d-build was nice, very easy to use.
 Personally I would have preferred the default behaviour of dbuild to put 
 object files in an /obj subdirectory and the final exe in the original 
 directory dbuild is run from.
 
 This way, it could be run from a root directory, operate on a /src 
 subdirectory, and not clutter up the source with object files.  There is 
 a switch for that, of course, but I can't imagine when you would want 
 object files sitting in the same directory as the source.

Thanks for the Build comments.  One unfortunate thing I find is that one person's defaults are another's exceptions.  That is why you can tailor Build to your 'default' behaviour requirements.  In this case, create a text file called 'build.cfg' in the same directory that Build.exe is installed in, and place in it the line

   CMDLINE=-od./obj

Then the command line switch is applied every time you run the tool.

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
29/09/2006 4:44:52 PM
Sep 28 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
 On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:
 I was a bit underwhelmed by the syntax of char[].

Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot. alias char[] string;

On the other hand, the reason other languages have strings as classes is that they just don't support arrays very well.  C++'s std::string combines the worst of core functionality and libraries, and has the advantages of neither.  An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.
Sep 29 2006
next sibling parent =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Walter Bright wrote:

 An early design goal for D was to upgrade arrays to the point where 
 string classes weren't necessary.

A string alias might still be, just as the bool alias was. --anders
Sep 29 2006
prev sibling parent reply Derek Parnell <derek psyc.ward> writes:
On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:

 Derek Parnell wrote:
 On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:
 I was a bit underwhelmed by the syntax of char[].

Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot. alias char[] string;

On the other hand, the reasons other languages have strings as classes is because they just don't support arrays very well. C++'s std::string combines the worst of core functionality and libraries, and has the advantages of neither. An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.

And is it there yet?  I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item?  I can't think of any, but maybe somebody else can.  And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos?  Will 'alias char[] string' cause anyone any problems?

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Sep 29 2006
next sibling parent Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
 On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:
An early design goal for D was to upgrade arrays to the point where 
string classes weren't necessary.

And is it there yet? I mean, given that a string is just a lump of text

The string you're talking about is not just a lump of text. More specifically it's a lump of text, irregularly interspersed with short non-ascii ubyte sequences. The latter being of course the tails of UTF-8 "characters".
Sep 29 2006
prev sibling next sibling parent David Medlock <noone nowhere.com> writes:
Derek Parnell wrote:
 On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:
 
 
Derek Parnell wrote:

On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:

I was a bit underwhelmed by the syntax of char[].

Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot. alias char[] string;

On the other hand, the reasons other languages have strings as classes is because they just don't support arrays very well. C++'s std::string combines the worst of core functionality and libraries, and has the advantages of neither. An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.

And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can. And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?

string array types. -DavidM
Sep 29 2006
prev sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
 And is it there yet? I mean, given that a string is just a lump of text, is
 there any text processing operation that cannot be simply done to a char[]
 item? I can't think of any but maybe somebody else can.

I believe it's there. I don't think std::string or java.lang.String have anything over it.
 And if a char[] is just as capable as a std::string, then why not have an
 official alias in Phobos? Will 'alias char[] string' cause anyone any
 problems?

I don't think it'll cause problems, it just seems pointless.
Sep 29 2006
next sibling parent Matthias Spycher <matthias coware.com> writes:
Immutability and some guarantees about the validity of the state of an 
immutable string in a concurrent setting are what set Java strings 
apart. Garbage collection without immutable strings in the standard 
library is quite out of the ordinary.

Walter Bright wrote:
 Derek Parnell wrote:
 And is it there yet? I mean, given that a string is just a lump of 
 text, is
 there any text processing operation that cannot be simply done to a 
 char[]
 item? I can't think of any but maybe somebody else can.

I believe it's there. I don't think std::string or java.lang.String have anything over it.
 And if a char[] is just as capable as a std::string, then why not have an
 official alias in Phobos? Will 'alias char[] string' cause anyone any
 problems?

I don't think it'll cause problems, it just seems pointless.

Sep 29 2006
prev sibling next sibling parent reply Derek Parnell <derek psyc.ward> writes:
On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:

 Derek Parnell wrote:
 And is it there yet? I mean, given that a string is just a lump of text, is
 there any text processing operation that cannot be simply done to a char[]
 item? I can't think of any but maybe somebody else can.

I believe it's there. I don't think std::string or java.lang.String have anything over it.

I'm pretty sure that the Phobos routines for search and replace only work for ASCII text.  For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result.  It finds the first occurrence of the byte value for the letter 'a', which may well be inside a Japanese character.  It looks for byte subsets rather than character subsets.
 And if a char[] is just as capable as a std::string, then why not have an
 official alias in Phobos? Will 'alias char[] string' cause anyone any
 problems?

I don't think it'll cause problems, it just seems pointless.

It may very well be pointless for your way of thinking, but your language is also for people who may not necessarily think in the same manner as yourself.  I, for example, think there is a point to having my code read like it's dealing with strings rather than arrays of characters.  I suspect I'm not alone.  We could all write the alias in all our code, but you could also be helpful and do it for us, like you did with bit/bool.

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Sep 29 2006
next sibling parent Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
 I'm pretty sure that the phobos routines for search and replace only work
 for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
 always fail to deliver the correct result. It finds the first occurance of
 the byte value for the letter 'a' which may well be inside a Japanese
 character. It looks for byte-subsets rather than character sub-sets.

I take it that you mean the bit pattern, or byte, 'a' (as in 0x61) may be found within a Japanese multibyte glyph?  Or even a very long Japanese text.

That is not correct.  The designers of UTF-8 knew that this would be dangerous, and created UTF-8 so that such _will_not_happen_.  Ever.  Therefore, something like std.string.find() doesn't even have to know about it.

Basically, std.string.find() and comparable functions only have to receive two octet sequences and see where one of them first occurs in the other.  No need to be aware of UTF or ASCII.  For all we know, the strings may even be in EBCDIC.  Still works.  If the strings themselves are valid (in whichever encoding you have chosen to use), then the result will also be valid.

((For the sake of completeness, here I've restricted the discussion to the version of such functions that accept ubyte[]-compatible input (obviously including char[]).  Those taking 16 or 32 bits, and especially if we deliberately feed input of the wrong width to any of these, then of course the results will be more complicated.))
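A quick hedged check of that property in D (every byte of a UTF-8 multibyte sequence has its high bit set, so an ASCII needle can never match inside one; the sample text is made up):

```d
import std.stdio;
import std.string;

void main()
{
    char[] text = "日本語 abc";  // UTF-8 in D source

    // Every byte of a multibyte UTF-8 sequence is >= 0x80, so the
    // ASCII byte 'a' can never occur inside one of these kanji:
    foreach (char b; "日本語")
        assert(b >= 0x80);

    // A byte-level search is therefore still correct for ASCII needles.
    // The three kanji occupy bytes 0..8 and the space is byte 9, so 'a'
    // is found at byte index 10, never inside a character:
    writefln(std.string.find(text, 'a'));  // 10
}
```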
Sep 29 2006
prev sibling next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
 I'm pretty sure that the phobos routines for search and replace only work
 for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
 always fail to deliver the correct result. It finds the first occurance of
 the byte value for the letter 'a' which may well be inside a Japanese
 character.

That cannot happen, because multibyte sequences *always* have the high bit set, and 'a' does not. That's one of the things that sets UTF-8 apart from other multibyte formats. You might be thinking of the older Shift-JIS multibyte encoding, which did suffer from such problems.
 It looks for byte-subsets rather than character sub-sets.

I don't think it's broken, but if it is, those are bugs, not fundamental problems with char[], and should be filed in bugzilla.
 It may very well be pointless for your way of thinking, but your language
 is also for people who may not necessarily think in the same manner as
 yourself. I, for example, think there is a point to having my code read
 like its dealing with strings rather than arrays of characters. I suspect
 I'm not alone. We could all write the alias in all our code, but you could
 also be helpful and do it for us - like you did with bit/bool.

I'm concerned about just adding more names that don't add real value.  As I wrote in a private email discussion about C++ typedefs, they should only be used when:

1) they provide an abstraction against the presumption that the underlying type may change
2) they provide a self-documentation purpose

(1) certainly doesn't apply to string.  (2) may, but char[] has no use other than that of being a string, as a char[] is always a string and a string is always a char[].  So I don't think string fits (2).

And lastly, there's the inevitable confusion.  People learning the language will see char[] and string, and wonder which should be used when.  I can't think of any consistent, understandable rule for that.  So it just winds up being wishy-washy.  Adding more names into the global space (which is what names in object.d are) should be done extremely conservatively.

If someone wants to use the string alias as their personal or company style, I have no issue with that, as other people *do* think differently than me (which is abundantly clear here!).
Sep 29 2006
parent reply Derek Parnell <derek psyc.ward> writes:
On Fri, 29 Sep 2006 23:11:37 -0700, Walter Bright wrote:

 Derek Parnell wrote:
 I'm pretty sure that the phobos routines for search and replace only work
 for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
 always fail to deliver the correct result. It finds the first occurance of
 the byte value for the letter 'a' which may well be inside a Japanese
 character.

That cannot happen, because multibyte sequences *always* have the high bit set, and 'a' does not. That's one of the things that sets UTF-8 apart from other multibyte formats. You might be thinking of the older Shift-JIS multibyte encoding, which did suffer from such problems.

Thanks.  That has cleared up some misconceptions and preconceptions that I had about utf encoding.  I can reduce some of my home-grown routines now, and reduce the number of times that I (think I) need dchar[] ;-)
 It may very well be pointless for your way of thinking, but your language
 is also for people who may not necessarily think in the same manner as
 yourself. I, for example, think there is a point to having my code read
 like its dealing with strings rather than arrays of characters. I suspect
 I'm not alone. We could all write the alias in all our code, but you could
 also be helpful and do it for us - like you did with bit/bool.

I'm concerned about just adding more names that don't add real value.  As I wrote in a private email discussion about C++ typedefs, they should only be used when:

1) they provide an abstraction against the presumption that the underlying type may change
2) they provide a self-documentation purpose

(1) certainly doesn't apply to string.

No argument there.
  (2) may, but char[] has no use 
 other than that of being a string, as a char[] is always a string and a 
 string is always a char[]. So I don't think string fits (2).

This is a little more debatable, but not worth generating hostility.  A string of text contains characters whose position in the string is significant; there are semantics to be applied to the entire text.  It is quite possible to conceive of an application in which the characters in the char[] array have no importance attached to their relative position within the array *when compared to neighboring characters*.  The order of characters in text is significant, but not necessarily so in an arbitrary character array.  Conceptually a string is different from a char[], even though they are implemented using the same technology.
 And lastly, there's the inevitable confusion. People learning the 
 language will see char[] and string, and wonder which should be used 
 when. I can't think of any consistent understandable rule for that. So 
 it just winds up being wishy-washy. Adding more names into the global 
 space (which is what names in object.d are) should be done extremely 
 conservatively.

And yet we have "toString" and not "toCharArray" or "toUTF"! And we still have the "printf" in object.d too!
 If someone wants to use the string alias as their personal or company 
 style, I have no issue with that, as other people *do* think differently 
 than me (which is abundantly clear here!).

I'll revert Build to string again as it is a lot easier to read.  It started out that way, but I converted it to char[] to appease you (why I thought you needed appeasing is lost though). :-)

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Sep 30 2006
next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
  (2) may, but char[] has no use 
 other than that of being a string, as a char[] is always a string and a 
 string is always a char[]. So I don't think string fits (2).

This is a lttle more debatable, but not worth generating hostility.

I certainly hope this thread doesn't degenerate into that like some of the others.
 A string of text contains characters whose position in the string is
 significant - there are semantics to be applied to the entire text. It is
 quite possible to conceive of an application in which the characters in the
 char[] array have no importance attached to their relative position within
 the array *where compared to neighboring characters*. The order of
 characters in text is significant but not necessarily so in a arbitary
 character array. 
 
 Conceptually a string is different from a char[], even though they are
 implemented using the same technology.

You do have a point there.
 And lastly, there's the inevitable confusion. People learning the 
 language will see char[] and string, and wonder which should be used 
 when. I can't think of any consistent understandable rule for that. So 
 it just winds up being wishy-washy. Adding more names into the global 
 space (which is what names in object.d are) should be done extremely 
 conservatively.

And yet we have "toString" and not "toCharArray" or "toUTF"!

True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful. I suppose that since I grew up with char* meaning string, using char[] seems perfectly natural. I tried typedef'ing char* to string now and then, but always wound up going back to just using char*.
 And we still have the "printf" in object.d too!

I know many feel that printf doesn't belong there. It certainly isn't there for purity or consistency. It's there purely (!) for the convenience of writing short quickie programs. I tend to use it for quick debugging test cases, because it doesn't rely on the rest of D working.
 If someone wants to use the string alias as their personal or company 
 style, I have no issue with that, as other people *do* think differently 
 than me (which is abundantly clear here!).

I'll revert Build to string again as it is a lot easier to read. It started out that way but I converted it to char[] to appease you (why I thought you need appeasing is lost though). :-)

No, you certainly don't need to appease me! I do care about maintaining a reasonably consistent style in Phobos, but I don't believe a language should enforce a particular style beyond the standard library. Viva la difference.

P.S. I did say to not 'enforce', but that doesn't mean I am above recommending a particular style, as in http://www.digitalmars.com/d/dstyle.html
Sep 30 2006
next sibling parent Derek Parnell <derek psyc.ward> writes:
On Sat, 30 Sep 2006 21:18:02 -0700, Walter Bright wrote:

 P.S. I did say to not 'enforce', but that doesn't mean I am above 
 recommending a particular style, as in 
 http://www.digitalmars.com/d/dstyle.html

Oh, I threw that away ages ago ;-)

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Oct 01 2006
prev sibling next sibling parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Walter Bright wrote:

 
 And yet we have "toString" and not "toCharArray" or "toUTF"!

True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.

Nope, it just looks correct.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource & #D: larsivi
Oct 01 2006
parent Lionello Lunesu <lio lunesu.remove.com> writes:
Lars Ivar Igesund wrote:
 Walter Bright wrote:
 
 And yet we have "toString" and not "toCharArray" or "toUTF"!

 True, and some have called for renaming char to utf8. While that would
 be technically more correct (as toUTF would be, too), it just looks awful.

Nope, it just looks correct.

I don't think renaming toString to toUTF gets rid of any confusion. AFAIK, toString is meant for debugging and char[] should be enough, and yet flexible enough for unicode strings. In fact, "string toString()" would be a good solution too.

---

My 4 reasons for the "string" aliases:

* readability: fewer [] pairs;
* safety: char[] is not zero-terminated, so let's not pretend there's a relation with C's char*. In fact: let's hide any relation;
* clarity: a char[] should not be iterated 1 char at a time, which makes it different from an int[];
* consistency: "string toString()".

L.
Oct 02 2006
prev sibling parent reply Georg Wrede <georg.wrede nospam.org> writes:
Walter Bright wrote:
 True, and some have called for renaming char to utf8. While that would 
 be technically more correct (as toUTF would be, too), it just looks awful.

Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.
Oct 01 2006
parent reply Kevin Bealer <kevinbealer gmail.com> writes:
Georg Wrede wrote:
 Walter Bright wrote:
 True, and some have called for renaming char to utf8. While that would 
 be technically more correct (as toUTF would be, too), it just looks 
 awful.

Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.

I would kind of agree with this, but I think it's a two-edged knife.

If we say 'char[]' then users don't know it's a string until they read the 'why D arrays are great' page (which they should read, but...)

If we say 'string' then we hide the fact that [] can be applied and that other array-like operations can work.

For instance, from a Java perspective:

char[] : Users don't know that it's "String"; users see it as low-level.
         Some will try to write things like 'find()' by hand since they
         will figure arrays are low level and not expect this to exist.

string : Users will think it's immutable, special; they will ask "how do
         I get one of the characters out of a string", "how do I convert
         string to char[]?", and other things that would be obvious
         without the alias.

Kevin
Oct 02 2006
next sibling parent Georg Wrede <georg.wrede nospam.org> writes:
Kevin Bealer wrote:
 Georg Wrede wrote:
 
 Walter Bright wrote:

 True, and some have called for renaming char to utf8. While that 
 would be technically more correct (as toUTF would be, too), it just 
 looks awful.

Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.

 I would kind of agree with this, but I think it's a two-edged knife. If
 we say 'char[]' then users don't know it's a string until they read the
 'why D arrays are great' page (which they should read, but...) If we say
 'string' then we hide the fact that [] can be applied and that other
 array-like operations can work. For instance, from a Java perspective:

 char[] : Users don't know that it's "String"; users see it as low-level.
          Some will try to write things like 'find()' by hand since they
          will figure arrays are low level and not expect this to exist.

Yes.
 string : Users will think it's immutable, special; they will ask "how do
          I get one of the characters out of a string", "how do I convert
          string to char[]?", and other things that would be obvious
          without the alias.

Well, with string, folks would at least be inclined to search for the library function to do it.

---

Overall, having string instead of char[] should result in folks learning and doing more with D _before_ they get tangled with UTF issues. (I guess, getting tangled with UTF is unavoidable.) But the later folks stumble on this, the better they can handle it. If it happens too soon, then they will just run away from D.

But substituting string for char[] in D is not enough. More than half the issue is the wording in the docs.

---

Another thing intimately connected with this is whether we should have char[] or utf8[] (string or no string, this is an important thing anyway).

I understand that "char" is one of the words that a seasoned programmer's fingers know by heart. So it would feel simply disgusting to have to learn (and bother) to write "utf8", which I admit is a lot more work to type. (Seriously.) Now, "string" is easy for the fingers, and then you get to skip "[]", which makes it all a little more palatable.

Having string would let us have the underlying type be utf8[], which really emphasizes and calls your attention to the fact that it's not byte-by-byte stuff we have there.
Oct 03 2006
prev sibling parent =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Kevin Bealer wrote:

 If we say 'char[]' then users don't know it's a string until they read 
 the 'why D arrays are great' page (which they should read, but...)
 
 If we say 'string' then we hide the fact that [] can be applied and that 
 other array-like operations can work.

Which could be a *good* thing, since it would stop users from hurting themselves by pretending that the D strings are arrays of characters ? And when they have read up that they are "arrays of Unicode code units", they should be OK with interpreting the "string" alias as char[] arrays.
 For instance, from a Java perspective:
 
 char[] : Users don't know that it's "String"; users see it as low-level.
          Some will try to write things like 'find()' by hand since they
          will figure arrays are low level and not expect this to exist.
 
 string : Users will think it's immutable, special; they will ask "how do
          I get one of the characters out of a string", "how do I convert
          string to char[]?", and other things that would be obvious
          without the alias.

I think the best answer would be: "to get a char[] from the string, use the std.utf.toUTF8 function", since this also works even if you redeclare the "string" alias to be something else - like wchar_t[] ?

Earlier* I suggested adding the alias utf8_t for "char", just like we have int8_t for "byte", but I wouldn't rename the actual D types. Just a little std.stdutf module with some aliases, if ever needed...

string    std.string.toString( )
utf8_t[]  std.utf.toUTF8( )
utf16_t[] std.utf.toUTF16( )
utf32_t[] std.utf.toUTF32( )

--anders

* digitalmars.D/11821, 2004-10-15
Oct 03 2006
prev sibling parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
Derek Parnell wrote:
  (2) may, but char[] has no use 
 other than that of being a string, as a char[] is always a string and a 
 string is always a char[]. So I don't think string fits (2).

 This is a little more debatable, but not worth generating hostility.

 A string of text contains characters whose position in the string is
 significant - there are semantics to be applied to the entire text. It is
 quite possible to conceive of an application in which the characters in the
 char[] array have no importance attached to their relative position within
 the array *where compared to neighboring characters*. The order of
 characters in text is significant but not necessarily so in an arbitrary
 character array.

 Conceptually a string is different from a char[], even though they are
 implemented using the same technology.

Precisely! And even if such conceptual difference didn't exist, or is very rare, 'string' is nonetheless more readable than 'char[]', a fact I am constantly reminded of when I see 'int main(char[][] args)' instead of 'int main(string[] args)', which translates much more quickly into the brain as 'array of strings' than its current counterpart.

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Oct 01 2006
parent Geoff Carlton <gcarlton iinet.net.au> writes:
Bruno Medeiros wrote:
 Precisely! And even if such conceptual difference didn't exist, or is 
 very rare, 'string' is nonetheless more readable than 'char[]', a fact I 
 am constantly reminded of when I see 'int main(char[][] args)' instead 
 of 'int main(string[] args)', which translates much more quickly into 
 the  brain as 'array of strings' than its current counterpart.
 

There are also many cases where char arrays are not strings:

Single array of characters, not strings:

  char[] GAME_10PT_LETTERS = ['x', 'z'];

Two-dimensional array of characters, not string arrays:

  char[][] GAME_LETTERS = [GAME_0PT_LETTERS, GAME_1PT_LETTERS, ..];
  char[20][20] m_scrabbleBoard;
Oct 01 2006
prev sibling parent reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:
Derek Parnell schrieb am 2006-09-30:
 On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:

 Derek Parnell wrote:
 And is it there yet? I mean, given that a string is just a lump of text, is
 there any text processing operation that cannot be simply done to a char[]
 item? I can't think of any but maybe somebody else can.

I believe it's there. I don't think std::string or java.lang.String have anything over it.

 I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurrence of the byte value for the letter 'a', which may well be inside a Japanese character. It looks for byte sub-sets rather than character sub-sets.

~wow~

Have a look at std.string.find's source and try to stop giggling *g*

The correct implementation would be:

# import std.string;
# import std.c.string;
# import std.utf;
#
# int find(char[] s, dchar c)
# {
#     if (c <= 0x7F)
#     {   // Plain old ASCII
#         auto p = cast(char*)memchr(s, c, s.length);
#         if (p)
#             return p - cast(char *)s;
#         else
#             return -1;
#     }
#
#     // c is a universal character
#     return std.string.find(s, toUTF8([c]));
# }

The same applies to ifind and the like.

Thomas
Sep 30 2006
parent reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:
Thomas Kuehne schrieb am 2006-09-30:
 Derek Parnell schrieb am 2006-09-30:
 On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:

 Derek Parnell wrote:
 And is it there yet? I mean, given that a string is just a lump of text, is
 there any text processing operation that cannot be simply done to a char[]
 item? I can't think of any but maybe somebody else can.

I believe it's there. I don't think std::string or java.lang.String have anything over it.

 I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurrence of the byte value for the letter 'a', which may well be inside a Japanese character. It looks for byte sub-sets rather than character sub-sets.

~wow~ Have a look at std.string.find's source and try to stop giggling *g* The correct implementation would be:

As it seems, the original code depends on the undocumented index behavior with regard to silent transcoding in foreach.

Thomas
Sep 30 2006
parent Sean Kelly <sean f4.ca> writes:
Thomas Kuehne wrote:
 
 As it seems, the original code depends on the undocumented index behavior
 with regards to silent transcoding in foreach.

The wording could be more explicit, but I think the current documentation implies the actual behavior: "The index must be of int or uint type, it cannot be inout, and it is set to be the index of the array element."

The docs should probably also be revised to allow for 64-bit indices, where the index would be long or ulong. Something along the lines of: "The index must be an integer type of size equal to size_t.sizeof. . ."

Sean
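The behavior under discussion can be sketched like this (a minimal example in present-day D syntax; the index stays a code-unit offset even while the loop variable is a decoded dchar):

```d
import std.stdio;

void main()
{
    auto s = "aö"; // 'ö' is two UTF-8 code units, so s.length == 3

    // The index i is the array (code unit) index, as the docs say,
    // while c is the decoded code point - foreach transcodes silently.
    foreach (size_t i, dchar c; s)
        writefln("index %s: U+%04X", i, cast(uint) c);
    // Visits index 0 ('a') and index 1 ('ö'); index 2 is never seen,
    // even though it is a valid array index.
}
```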
Sep 30 2006
prev sibling next sibling parent Geoff Carlton <gcarlton iinet.net.au> writes:
Walter Bright wrote:
 Derek Parnell wrote:
 And is it there yet? I mean, given that a string is just a lump of 
 text, is
 there any text processing operation that cannot be simply done to a 
 char[]
 item? I can't think of any but maybe somebody else can.

I believe it's there. I don't think std::string or java.lang.String have anything over it.
 And if a char[] is just as capable as a std::string, then why not have an
 official alias in Phobos? Will 'alias char[] string' cause anyone any
 problems?

I don't think it'll cause problems, it just seems pointless.

Hi,

The main reasons I think are these:

It simplifies the initial examples, particularly main(string[]), and maps such as string[string]. More complex examples are a map of words to text lines, string[][string], rather than char[][][char[]].

It clarifies the actual use of the entity. It is a text string, not just a jumbled array of characters. Arrays of char can be used for other things, such as the set of player letters in a scrabble game. A string has the additional usage that we know it is a text string. The alias reflects that intent.

Given a user wants to use a string, there is no need to expose the implementation detail of how strings are done in D. Perhaps in perl, strings are a linked list of shorts, but it doesn't mean that you'd have list<short> all over the place.

Use of char[] and char[][] looks like low level C. It has also been noted that it encourages char based indexing, which is not a good thing for utf8.

Anyway, hope one of those points grabbed you!

Geoff
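The readability argument is easiest to see side by side (a sketch in the D1-era syntax used throughout this thread; `string` here is nothing but the proposed alias):

```d
// The proposed alias:
alias char[] string;

// A map of words to the lines of text they occur on:
char[][][char[]] linesByWord1;  // without the alias
string[][string] linesByWord2;  // with the alias - the very same type

void main(string[] args) // reads at a glance as "array of strings"
{
}
```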
Sep 29 2006
prev sibling parent David Medlock <noone nowhere.com> writes:
Walter Bright wrote:
 Derek Parnell wrote:
 
 And is it there yet? I mean, given that a string is just a lump of 
 text, is
 there any text processing operation that cannot be simply done to a 
 char[]
 item? I can't think of any but maybe somebody else can.

I believe it's there. I don't think std::string or java.lang.String have anything over it.
 And if a char[] is just as capable as a std::string, then why not have an
 official alias in Phobos? Will 'alias char[] string' cause anyone any
 problems?

I don't think it'll cause problems, it just seems pointless.

The reason *I* want it is _alias_ does not respect the private: visibility modifier. So when I pull out an old piece of code which says

alias char[] string

and import it in my newer module, I get conflicts when I compile. Then I must do this silly hack where I include the newer file from the old or vice versa.

If you don't add this into phobos, at least adopt a method to discriminate between more than one alias with the same name, to resolve the issue.

-DavidM
Sep 29 2006
prev sibling next sibling parent reply =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Geoff Carlton wrote:

 A simple alias of char[] to string would simplify the first glance code.
   string x;    // yep, a string
   main (string[]) // an array of strings
   string[string] m; // map of string to string
 
 I believe single functions get pulled in as member functions?  e.g.
 find(string) can be used as string.find()?  If so, it means that all the
 string functionality can be added and then used naturally as member
 functions on this "string" (which is really just the plain old char[] in
 disguise).

Problem of "char[]" is both that it hides the fact that "char" is UTF-8 while at the same time it exposes the fact that it's stored as an array. You can "improve" upon that readability with aliases, like declaring say utf8_t -> char and string -> utf8_t[], but you still need to understand Unicode and Arrays in order to use it outside of the provided methods...

I think "hides the implementation" was the biggest argument against it ?
http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues
 This is a small thing, but I think it would help in terms of the mindset
 of strings being a first class primitive, and clear up simple "hello
 world" examples at the same time.  Put simply, every modern language has
 a first class string primitive type, except D - at least in terms of
 nomenclature.

I did the big mistake of thinking it would be a good thing to be able to switch between "ANSI" and "UNICODE" builds (of wxD), and so did it like:

version(UNICODE)
    alias char[] string;
else // version(ANSI)
    alias wchar_t[] string; // wchar[] on Windows, dchar[] on Unix

Still trying to sort out all the code problems with that idea, as there is a ton of toUTF8 and other conversions to make strings work together. In retrospect it would have been much easier to have stuck with char[], and do the conversion from UTF-8 to the local encoding on the C++ side. (since there were no guarantees that the "char" and "wchar_t" types in C++ used UTF encodings, even if they did so in Unix/GTK+ for instance)

Any (minor) performance issues of having to do the UTF-8 <-> UTF-32 conversions were not worth the hassle of doing it on the D side, IMHO.

So I agree with the "alias char[] string;" and the string[string] args. It's going to be used as wx.common.string for instance, in wxD library.

--anders
Sep 29 2006
parent =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
 I did the big mistake of thinking it would be a good thing to be able to
 switch between "ANSI" and "UNICODE" builds (of wxD), and so did it like:
 
 version(UNICODE)
     alias char[] string;
 else // version(ANSI)
     alias wchar_t[] string; // wchar[] on Windows, dchar[] on Unix

Except the other way around, of course!

version(UNICODE)
    alias wchar_t[] string;
else // version(ANSI)
    alias char[] string;

Now, to get me some more coffee... :-P

--anders
Sep 29 2006
prev sibling parent reply Lionello Lunesu <lio lunesu.remove.com> writes:
I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish 
they would be included by default in Phobos.

alias char[] string;
alias wchar[] wstring;
alias dchar[] dstring;

Perhaps, using string instead of char[], it's more obvious that it's not 
zero-terminated. I've seen D examples online that just cast a char[] to 
char* for use in MessageBox and the like (which worked since it were 
string constants.)
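The zero-termination point deserves a sketch (present-day D; std.string.toStringz makes the NUL-terminated copy that C APIs expect):

```d
import core.stdc.string : strlen;
import std.stdio;
import std.string : toStringz;

void main()
{
    char[] s = "hello".dup;

    // A D array is (length, pointer); no trailing '\0' is guaranteed.
    // Casting s to char* and handing it to a C function only happens
    // to work for string literals, as noted above.
    auto p = toStringz(s); // NUL-terminated copy, safe for C
    writeln(strlen(p));    // 5
}
```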

L.
Sep 29 2006
next sibling parent reply =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Lionello Lunesu wrote:

 Perhaps, using string instead of char[], it's more obvious that it's not 
 zero-terminated. I've seen D examples online that just cast a char[] to 
 char* for use in MessageBox and the like (which worked since it were 
 string constants.)

And probably only for ASCII string constants, at that... --anders
Sep 29 2006
parent Lionello Lunesu <lio lunesu.remove.com> writes:
Anders F Björklund wrote:
 Lionello Lunesu wrote:
 
 Perhaps, using string instead of char[], it's more obvious that it's 
 not zero-terminated. I've seen D examples online that just cast a 
 char[] to char* for use in MessageBox and the like (which worked since 
 it were string constants.)

And probably only for ASCII string constants, at that...

Right, that too!

char[] somestring = "....";
func( somestring[0] ); // WRONG: somestring[x] is not 1 character!

Using "string" would make it less obvious:

string somestring = ".....";
func( somestring[0] ); // [0] means what?

This goes for iteration as well. DMD will still deduce 'char' as the type, but at least one's less likely to type foreach(char c;str). If you want to iterate the UNICODE characters in a string, you'll specify "dchar" as the type and you won't worry about "how come I can use dchar when it's a char[]":

foreach(dchar c; somestring) func(c); // correct

L.
Sep 29 2006
prev sibling parent reply Georg Wrede <georg.wrede nospam.org> writes:
Lionello Lunesu wrote:
 I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish 
 they would be included by default in Phobos.
 
 alias char[] string;
 alias wchar[] wstring;
 alias dchar[] dstring;
 
 Perhaps, using string instead of char[], it's more obvious that it's not 
 zero-terminated. I've seen D examples online that just cast a char[] to 
 char* for use in MessageBox and the like (which worked since it were 
 string constants.)

Using char[] as long as you don't know about UTF seems to work pretty well in D. But the moment you realise that we're having potential multibyte characters in what essentially is a ubyte[], you get scared to death, and start to wonder how on earth you haven't yet blown up your hard disk.

You start having nightmares about slicing char arrays at the wrong place, extracting single chars that might not be storable in a char, and all of a sudden you decide to stick with your old language "till things calm down".

The only medicine to this is simply to shut your eyes and keep coding on like you never did realise anything. It's a little like when you first realised Daddy isn't holding your bike: you instantly fall hurting yourself, instead of realizing that he's probably let go ages ago, and you still haven't fallen, so simply keep going.

---

This doesn't mean I'm happy with this either, but I don't have the energy to conjure up a significantly better solution _and_ fight for it till it gets accepted. (Some things are just too hard to fix, like "bit=bool" was, and now "auto/auto".)
Sep 29 2006
parent reply Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Georg Wrede wrote:
 Lionello Lunesu wrote:
 
 I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish 
 they would be included by default in Phobos.

 alias char[] string;
 alias wchar[] wstring;
 alias dchar[] dstring;

 Perhaps, using string instead of char[], it's more obvious that it's 
 not zero-terminated. I've seen D examples online that just cast a 
 char[] to char* for use in MessageBox and the like (which worked since 
 it were string constants.)

 Using char[] as long as you don't know about UTF seems to work pretty
 well in D. But the moment you realise that we're having potential
 multibyte characters in what essentially is a ubyte[], you get scared to
 death, and start to wonder how on earth you haven't yet blown up your
 hard disk.

 You start having nightmares about slicing char arrays at the wrong
 place, extracting single chars that might not be storable in a char, and
 all of a sudden you decide to stick with your old language "till things
 calm down".

 The only medicine to this is simply to shut your eyes and keep coding on
 like you never did realise anything. It's a little like when you first
 realised Daddy isn't holding your bike: you instantly fall hurting
 yourself, instead of realizing that he's probably let go ages ago, and
 you still haven't fallen, so simply keep going.

 ---

 This doesn't mean I'm happy with this either, but I don't have the
 energy to conjure up a significantly better solution _and_ fight for it
 till it gets accepted. (Some things are just too hard to fix, like
 "bit=bool" was, and now "auto/auto".)

haha too true. I experienced this too as I read this ng. It hasn't been THAT traumatic for me though, since everything seems to work as long as you stick to English. I don't have the resources to even begin thinking about non-English text (ex: paying people to translate stuff), so I don't lose any sleep about it, at least not yet.

Perhaps there should be a string struct/class that has an undefined underlying type (it could be UTF-8, 16, 32, you dunno really), and you could index it to get the *complete* character at any position in the string. Basically, it is like char[], but it /just works/ in all cases. I'd almost rather have the size of a char be undefined, and just have char[] be the said magic string type. If you want something with a .size of 1, then there is byte/ubyte. There would probably have to be some stuff in the phobos internals to handle such a string in a correct manner.

Going even further... if you could make char[] be such a magic string type, then wchar[] and dchar[] could probably be deprecated - use ushort and uint instead. Then add the following aliases to phobos:

alias ubyte utf8;
alias ushort utf16;
alias uint utf32;

Just a thought. I'm no expert on UTF, but maybe this can start a discussion that will result in the nightmares ending :)
Sep 29 2006
parent reply Johan Granberg <lijat.meREM OVEgmail.com> writes:
Chad J > wrote:
 Perhaps there should be a string struct/class that has an undefined 
 underlying type (it could be UTF-8, 16, 32, you dunno really), and you 
 could index it to get the *complete* character at any position in the 
 string.  Basically, it is like char[], but it /just works/ in all cases. 
  I'd almost rather have the size of a char be undefined, and just have 
 char[] be the said magic string type.  If you want something with a 
 ..size of 1, then there is byte/ubyte.  There would probably have to be 
 some stuff in the phobos internals to handle such a string in a correct 
 manner.

I have thought about this too.
 Going even further... if you could make char[] be such a magic string 
 type, then wchar[] and dchar[] could probably be deprecated - use ushort 
 and uint instead.  Then add the following aliases to phobos:
 alias ubyte utf8;
 alias ushort utf16;
 alias uint utf32;

I completely agree, char should hold a character independently of encoding and NOT a code unit or something else. I think it would be beneficial to D in the long term if chars were done right (meaning that they can store any character). How it is implemented is not important, and I believe performance is not a problem here, so ease of use and correctness would be appreciated.
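For what it's worth, D's dchar already gives this one-element-per-character behavior (a sketch in present-day D syntax):

```d
import std.stdio;

void main()
{
    // A dchar is a full 32-bit code point, so indexing and slicing
    // can never split a character - at the cost of 4 bytes per element.
    dchar[] s = "日本語"d.dup;

    writeln(s.length); // 3 - one element per character
    writeln(s[1]);     // prints 本, a whole character
}
```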
Sep 29 2006
parent reply BCS <BCS pathlink.com> writes:
Johan Granberg wrote:
 
 
 I completely agree, char should hold a character independently of 
 encoding and NOT a code unit or something else. I think it would be
 beneficial to D in the long term if chars were done right (meaning that 
 they can store any character) how it is implemented is not important and 
 i believe performance is not a problem here, so ease of use and 
 correctness would be appreciated.

Why isn't performance a problem?

If you are saying that this won't cause performance hits in run times or memory space, I might be able to buy it, but I'm not yet convinced. If you are saying that causing a performance hit in run times or memory space is not a problem... in that case I think you are dead wrong and you will not convince me otherwise.

In my opinion, any compiled language should allow fairly direct access to the most efficient practical means of doing something*. If I didn't care about speed and memory I would use some sort of scripting language.

A good set of libs should make most of this moot. Leave the char as is and define a typedef struct or whatever that provides the added functionality that you want.

* OTOH a language should not mandate code to be efficient at the expense of ease of coding.
Sep 29 2006
next sibling parent reply Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
BCS wrote:
 Johan Granberg wrote:
 
 I completely agree, char should hold a character independently of 
 encoding and NOT a code unit or something else. I think it would be
 beneficial to D in the long term if chars where done right (meaning 
 that they can store any character) how it is implemented is not 
 important and i believe performance is not a problem here, so ease of 
 use and correctness would be appreciated.

Why isn't performance a problem? If you are saying that this won't cause performance hits in run times or memory space, I might be able to buy it, but I'm not yet convinced. If you are saying that causing a performance hit in run times or memory space is not a problem... in that case I think you are dead wrong and you will not convince me otherwise. In my opinion, any compiled language should allow fairly direct access to the most efficient practical means of doing something*. If I didn't care about speed and memory I would use some sort of scripting language. A good set of libs should make most of this moot. Leave the char as is and define a typedef struct or whatever that provides the added functionality that you want. * OTOH a language should not mandate code to be efficient at the expense of ease of coding.

I will go ahead and say that the current state of char[] is incorrect. That is, if you write a program manipulating char[] strings, then run it in china, you will be dissapointed with the results. It won't matter how fast the program runs, because bad stuff will happen like entire strings becoming unreadable to the user. Technically if you follow UTF and do your char[] manipulations very carefully, it is correct, but realistically few if any people will do such things (I won't). Also, if you do this, your program will probably run as slow as one with the proposed char/string solution, maybe slower (since language/stdlib level support can be heavily optimized). What I'd like then, is a program that is correct and as fast as possible while still being correct. Sure you can get some speed gains by just using ASCII and saying to hell with UTF, but you should probably only do that when profiling has shown that such speed gains are actually useful/needed in your program. Ultimately we have to decide whether we want D to default to UTF code which might run slightly slower but allow better localization and international friendliness, or if we want it to default to ASCII or some such encoding that runs slightly faster but is mostly limited to english. I'd like the default to be UTF. Then we can have a base of code to correctly manipulate UTF strings (in phobos and language supported). Writing correct ASCII manipulation routine without good library/language support is a lot easier than writing good UTF manipulation routines without good library/language support, and UTF will probably be used much more than ASCII. Also, if we move over to full blown UTF, we won't have to give up ASCII. It seems to me like the phobos std.string functions are pretty much ASCII string manipulating functions (no multibyte string support). So just copy those out to a seperate library, call it "ASCII lib", and there's your library support for ASCII. 
That leaves string literals, which is a slight problem, but I suppose easily fixed:

   ubyte[] hi = "hello!"a;

Just add a postfix 'a' for strings, which makes the string an ASCII literal of type ubyte[]. D arrays don't seem powerful enough to do UTF manipulations without special attention, but they are powerful enough to do ASCII manipulations without special attention, so using ubyte[] as an ASCII string would give full language support for these. Given that and ASCII lib you pretty much have the current D string manipulation capabilities afaik, and it will be fast.
Sep 29 2006
next sibling parent reply Anders F Björklund <afb algonet.se> writes:
Chad J > wrote:

 I'd like the default to be UTF. Then we can have a base of code to
 correctly manipulate UTF strings (in phobos and language supported).
 Writing correct ASCII manipulation routine without good library/language
 support is a lot easier than writing good UTF manipulation routines
 without good library/language support, and UTF will probably be used
 much more than ASCII.

But D already uses Unicode for all strings, encoded as UTF ? When you say "ASCII", do you mean 8-bit encodings perhaps ? (since all proper 7-bit ASCII are already valid UTF-8 too)
 Also, if we move over to full blown UTF, we won't have to give up ASCII. 
  It seems to me like the phobos std.string functions are pretty much 
 ASCII string manipulating functions (no multibyte string support).  So 
 just copy those out to a seperate library, call it "ASCII lib", and 
 there's your library support for ASCII.  That leaves string literals, 
 which is a slight problem, but I suppose easily fixed:
 ubyte[] hi = "hello!"a;

I don't understand this, why can't you use UTF-8 for this ?

   char[] hi = "hello!";
 Just add a postfix 'a' for strings which makes the string an ASCII 
 literal, of type ubyte[].  D arrays don't seem powerful enough to do UTF 
 manipulations without special attention, but they are powerful enough to 
 do ASCII manipulations without special attention, so using ubyte[] as an 
 ASCII string should give full language support for these.  Given that 
 and ASCIILIB you pretty much have the current D string manipulation 
 capabilities afaik, and it will be fast.

What is not powerful enough about the foreach(dchar c; str) ? It will step through that UTF-8 array one codepoint at a time. --anders
Sep 29 2006
parent reply Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Anders F Björklund wrote:
 Chad J > wrote:
 
 I'd like the default to be UTF. Then we can have a base of code to
 correctly manipulate UTF strings (in phobos and language supported).
 Writing correct ASCII manipulation routine without good library/language
 support is a lot easier than writing good UTF manipulation routines
 without good library/language support, and UTF will probably be used
 much more than ASCII.

But D already uses Unicode for all strings, encoded as UTF ? When you say "ASCII", do you mean 8-bit encodings perhaps ? (since all proper 7-bit ASCII are already valid UTF-8 too)

Probably 7-bit. Anything where the size of one character is ALWAYS one byte. I am already assuming that ASCII is a subset or at least is mostly a subset of UTF8. However, I talk about it in an exclusive manner because if you handle UTF8 strings properly then the code will probably run at least slightly slower than with ASCII-only strings.
 Also, if we move over to full blown UTF, we won't have to give up 
 ASCII.  It seems to me like the phobos std.string functions are pretty 
 much ASCII string manipulating functions (no multibyte string 
 support).  So just copy those out to a seperate library, call it 
 "ASCII lib", and there's your library support for ASCII.  That leaves 
 string literals, which is a slight problem, but I suppose easily fixed:
 ubyte[] hi = "hello!"a;

I don't understand this, why can't you use UTF-8 for this ? char[] hi = "hello!";

I was talking about IF we made char[] into a datatype that handles all of those odd corner cases correctly (slices into multibyte strings, for instance), then it will no longer be the same fast ASCII-only routines. So for those who want the fast ASCII-only stuff, it would be nice to specify a way to make string literals such that each character in the literal takes only one byte, without ugly casting. To get an ASCII monobyte string from a string literal in D I would have to do the following:

   ubyte[] hi = cast(ubyte[])"hello!";

hmmm, yuck.
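For what it's worth, here is a sketch (mine, not from the thread) of the kind of ASCII-only routine that stays trivially simple once the data is ubyte[], because one code unit is then always one character:

```d
// Sketch: an ASCII-only upper-casing routine ("asciiToUpper" is my own
// name, not a phobos function). With ubyte[], indexing == characters.
ubyte[] asciiToUpper(ubyte[] s)
{
    ubyte[] r = s.dup;
    foreach (inout ubyte b; r)       // D1-era "inout"; later D spells it "ref"
    {
        if (b >= 'a' && b <= 'z')
            b -= 'a' - 'A';
    }
    return r;
}

void main()
{
    // Today this needs the cast; the proposed "hello!"a postfix would avoid it.
    ubyte[] hi = cast(ubyte[])"hello!".dup;
    hi = asciiToUpper(hi);
}
```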
 Just add a postfix 'a' for strings which makes the string an ASCII 
 literal, of type ubyte[].  D arrays don't seem powerful enough to do 
 UTF manipulations without special attention, but they are powerful 
 enough to do ASCII manipulations without special attention, so using 
 ubyte[] as an ASCII string should give full language support for 
 these.  Given that and ASCIILIB you pretty much have the current D 
 string manipulation capabilities afaik, and it will be fast.

What is not powerful enough about the foreach(dchar c; str) ? It will step through that UTF-8 array one codepoint at a time.

I'm assuming 'str' is a char[], which would make that very nice. But it doesn't solve correctly slicing or indexing into a char[]. If nothing was done about this and I absolutely needed UTF support, I'd probably make a class like so:

class String
{
  char[] data;
  ...
  dchar opIndex( int index )
  {
    foreach( int i, dchar c; data )
    {
      if ( i == index )
        return c;

      i++;
    }
  }
  // similar thing for opSlice down here
  ...
}

Which is probably slower than could be done. All in all it is a drag that we should have to learn all of this UTF stuff. I want char[] to just work!
Sep 29 2006
next sibling parent reply Anders F Björklund <afb algonet.se> writes:
Chad J > wrote:

 Probably 7-bit.  Anything where the size of one character is ALWAYS one 
 byte.  I am already assuming that ASCII is a subset or at least is 
 mostly a subset of UTF8.  However, I talk about it in an exclusive 
 manner because if you handle UTF8 strings properly then the code will 
 probably run at least slightly slower than with ASCII-only strings.

It's mostly about looking out for the UTF "control" characters, which is not more than a simple assertion in your ASCII-only functions really... I don't think handling UTF-8 properly is a burden for string functions, when you compare it with the enormous gain that it has over ASCII-only.
 What is not powerful enough about the foreach(dchar c; str) ?
 It will step through that UTF-8 array one codepoint at a time.

I'm assuming 'str' is a char[], which would make that very nice. But it doesn't solve correctly slicing or indexing into a char[].

Well, it's also a lot "trickier" than that... For instance, my last name can be written in Unicode as Björklund or Bj¨orklund, both of which are valid - only that in one of them, the 'ö' occupies two full code points! It's still a single character, which is why Unicode avoids that term... As you know, if you need to access your strings by codepoint (something that the Unicode group explicitly recommends against, in their FAQ) then char[] isn't a very nice format - because of the conversion overhead... But it's still possible to translate, transform, and translate back ?
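In D terms, the two forms just described look like this (my sketch of the encodings; both render as the same character):

```d
char[] pre  = "\u00F6";   // precomposed ö: 1 code point, 2 UTF-8 code units
char[] comb = "o\u0308";  // 'o' + combining diaeresis: 2 code points, 3 code units

// pre.length  == 2  (length counts code units, not characters)
// comb.length == 3
```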
 If nothing was done about this and I absolutely needed UTF support,
 I'd probably make a class like so: [...]

In my own mock String class, I cached the dchar[] codepoints on demand. (viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)
 All in all it is a drag that we should have to learn all of this UTF 
 stuff.  I want char[] to just work!

Using Unicode strings and characters does require a little learning... (where http://www.unicode.org/faq/utf_bom.html is a very good page) And D does force you to think about string implementation, no question. This has both pros and cons, but it is a deliberate language decision. If you're willing to handle the "surrogates", then UTF-16 is a rather good trade-off between the default UTF-8 and wasteful UTF-32 formats ? A downside is that it is not "ascii-compatible" (has embedded NUL chars) and that it is endian-dependent unlike the more universal UTF-8 format. --anders
Sep 29 2006
next sibling parent Georg Wrede <georg.wrede nospam.org> writes:
Anders F Björklund wrote:
 If you're willing to handle the "surrogates", then UTF-16 is a rather
 good trade-off between the default UTF-8 and wasteful UTF-32 formats ?
 A downside is that it is not "ascii-compatible" (has embedded NUL chars)
 and that it is endian-dependant unlike the more universal UTF-8 format.

Problem is, using 16-bit you sort-of get away with _almost_ all of it. But as a pay-back, the day your 16 bits don't suffice, you're in deep crap. And that day _will_ come.
Sep 29 2006
prev sibling parent reply Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Anders F Björklund wrote:
 What is not powerful enough about the foreach(dchar c; str) ?
 It will step through that UTF-8 array one codepoint at a time.

I'm assuming 'str' is a char[], which would make that very nice. But it doesn't solve correctly slicing or indexing into a char[].

Well, it's also a lot "trickier" than that... For instance, my last name can be written in Unicode as Björklund or Bj¨orklund, both of which are valid - only that in one of them, the 'ö' occupies two full code points! It's still a single character, which is why Unicode avoids that term...

So it seems to me the problem is that those 2 bytes are both 2 characters and 1 character at the same time. In this case, I'd prefer being able to index to a safe default (like the ö, instead of the umlauts next to the o), or not being able to index at all.
 As you know, if you need to access your strings by codepoint (something 
 that the Unicode group explicitly recommends against, in their FAQ) then 
 char[] isn't a very nice format - because of the conversion overhead...
 But it's still possible to translate, transform, and translate back ?
 

I read that FAQ at the bottom of this post, and didn't see anything about accessing strings by codepoint. Maybe you mean a different FAQ here, in which case, could I have a link please? I've been to the unicode site before and all I remember was being confused and having a hard time finding the info I wanted :( Also I still am not sure exactly what a code point is. And that FAQ at the bottom used the word "surrogate" a lot; I'm not sure about that one either.

When you say char[] isn't a nice format, I wasn't thinking about having the string class I mentioned earlier store the data ONLY as char[]. It might be wchar[]. Or dchar[]. Then it would be automatically converted between the two either at compile time (when possible) or dynamically at runtime (hopefully only when needed). So if someone throws a Chinese character literal at it, there is a very big clue there to use UTF32 or something that can store all of the characters in a uniform width sort of way, to speed indexing. Algorithms could be used so that a program 'learns' at runtime what kind of strings are dominating the program, and uses algorithms optimized for those. Maybe this is a bit too complex, but I can dream, hehe.
 If nothing was done about this and I absolutely needed UTF support,
 I'd probably make a class like so: [...]

In my own mock String class, I cached the dchar[] codepoints on demand. (viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)
 All in all it is a drag that we should have to learn all of this UTF 
 stuff.  I want char[] to just work!

Using Unicode strings and characters does require a little learning... (where http://www.unicode.org/faq/utf_bom.html is a very good page) And D does force you to think about string implementation, no question. This has both pros and cons, but it is a deliberate language decision. If you're willing to handle the "surrogates", then UTF-16 is a rather good trade-off between the default UTF-8 and wasteful UTF-32 formats ? A downside is that it is not "ascii-compatible" (has embedded NUL chars) and that it is endian-dependant unlike the more universal UTF-8 format. --anders

My impression has gone from being quite scared of UTF to being not so worried, but only for myself. D seems to be good at handling UTF, but only if someone tells you to never handle strings as arrays of characters. Unfortunately, the first thing you see in a lot of D programs is "int main( char[][] args )" and there are some arrays of characters being used as strings. This also means that some array capabilities like indexing and the braggable slicing are more dangerous than useful for string handling. It's a newbie trap. Like I said earlier, I either want to be able to index/slice strings safely, or not at all (or better yet, not by any intuitive means).
Sep 30 2006
parent Anders F Björklund <afb algonet.se> writes:
Chad J > wrote:

 I read that FAQ at the bottom of this post, and didn't see anything 
 about accessing strings by codepoint.  Maybe you mean a different FAQ 
 here, in which case, could I have a link please?  I've been to the 
 unicode site before and all I remember was being confused and having a 
 hard time finding the info I wanted :(

I meant http://www.unicode.org/faq/utf_bom.html#12
 Also I still am not sure exactly what a code point is.  And that FAQ at 
 the bottom used the word "surrogate" a lot; I'm not sure about that one 
 either.

Code point is the closest thing to a "character", although it might take more than one Unicode code point to represent a single Unicode grapheme. Surrogates are used with UTF-16, to represent "too large" code points... i.e. they always occur in "surrogate pairs", which combine to a single code point.
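A sketch of what that looks like in D (my example; U+1D11E is a code point outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair for it):

```d
wchar[] s = "\U0001D11E"w;   // U+1D11E, MUSICAL SYMBOL G CLEF
// s.length == 2 -- one surrogate pair: 0xD834 (high), 0xDD1E (low)

dchar[] d = "\U0001D11E"d;
// d.length == 1 -- UTF-32 holds any code point in a single unit
```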
 When you say char[] isn't a nice format, I wasn't thinking about having 
 the string class I mentioned earlier store the data ONLY as char[].  It 
 might be wchar[].  Or dchar[].  Then it would be automatically converted 
 between the two either at compile time (when possible) or dynamically at 
 runtime (hopefully only when needed).  So if someone throws a Chinese 
 character literal at it, there is a very big clue there to use UTF32 or 
 something that can store all of the characters in a uniform width sort 
 of way, to speed indexing.  Algorithms could be used so that a program 
 'learns' at runtime what kind of strings are dominating the program, and 
 uses algorithms optimized for those.  Maybe this is a bit too complex, 
 but I can dream, hehe.

Actually I said that dchar[] (i.e. UTF-32) wasn't ideal, but anyway... (UTF-8 or UTF-16 is preferable, for the reasons in the UTF FAQ above) We already have char[] as the string default in D, but most models for a String class use wchar[] (i.e. UTF-16), for instance Mango or Java:

* http://mango.dsource.org/classUString.html (uses the ICU lib)
* http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html

All formats do use Unicode, so converting from one UTF to another is mostly a question of memory/performance and not about any data loss. However, it is not converted at compile time (without using templates) so mixing and matching different representations is somewhat of a pain. I think that char[] for string and wchar[] for String are good defaults.
 My impression has gone from being quite scared of UTF to being not so 
 worried, but only for myself.  D seems to be good at handling UTF, but 
 only if someone tells you to never handle strings as arrays of 
 characters.  Unfortunately, the first thing you see in a lot of D 
 programs is "int main( char[][] args )" and there are some arrays of 
 characters being used as strings.  This also means that some array 
 capabilities like indexing and the braggable slicing are more dangerous 
 than useful for string handling.  It's a newbie trap.

It is, since it isn't really "arrays of characters" but "arrays of code units". What muddies the waters further is that sometimes they're equal. That is, with ASCII characters each character fits into a D char unit. Without surrogates, each character (from the BMP) fits into one wchar unit. However, all code that handles the shorter formats should be prepared to handle non-ASCII (for UTF-8) and surrogates (for UTF-16), or use UTF-32:

bool isAscii(char c) { return (c <= 0x7f); }
bool isSurrogate(wchar c) { return (c >= 0xD800 && c <= 0xDFFF); }

But a warning that D uses multi-byte strings might be in order, yes... Another warning that it only supports UTF-8 platforms* might also be in order ? --anders

* "main(char[][] args)" does not work for any non-UTF consoles, as you will get invalid UTF sequences for the non-ASCII chars.
Oct 01 2006
prev sibling parent reply Anders F Björklund <afb algonet.se> writes:
Chad J > wrote:

 char[] data; 

   dchar opIndex( int index )
   {
     foreach( int i, dchar c; data )
     {
       if ( i == index )
         return c;
 
       i++;
     }
   }

This code probably does not work as you think it does... If you loop through a char[] using dchars (with a foreach), then the int will get the codeunit index - *not* codepoint. (the ++ in your code above looks more like a typo though, since it needs to *either* foreach i, or do it "manually")

import std.stdio;

void main()
{
    char[] str = "Björklund";
    foreach(int i, dchar c; str)
    {
        writefln("%4d \\U%08X '%s'", i, c, ""d ~ c);
    }
}

Will print the following sequence:

   0 \U00000042 'B'
   1 \U0000006A 'j'
   2 \U000000F6 'ö'
   4 \U00000072 'r'
   5 \U0000006B 'k'
   6 \U0000006C 'l'
   7 \U00000075 'u'
   8 \U0000006E 'n'
   9 \U00000064 'd'

Notice how the non-ASCII character takes *two* code units ? (if you expect indexing to use characters, that'd be wrong) More at http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs --anders
Sep 29 2006
parent Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Anders F Björklund wrote:
 Chad J > wrote:
 
 char[] data; 

   dchar opIndex( int index )
   {
     foreach( int i, dchar c; data )
     {
       if ( i == index )
         return c;

       i++;
     }
   }

This code probably does not work as you think it does... If you loop through a char[] using dchars (with a foreach), then the int will get the codeunit index - *not* codepoint. (the ++ in your code above looks more like a typo though, since it needs to *either* foreach i, or do it "manually")

import std.stdio;

void main()
{
    char[] str = "Björklund";
    foreach(int i, dchar c; str)
    {
        writefln("%4d \\U%08X '%s'", i, c, ""d ~ c);
    }
}

Will print the following sequence:

   0 \U00000042 'B'
   1 \U0000006A 'j'
   2 \U000000F6 'ö'
   4 \U00000072 'r'
   5 \U0000006B 'k'
   6 \U0000006C 'l'
   7 \U00000075 'u'
   8 \U0000006E 'n'
   9 \U00000064 'd'

Notice how the non-ASCII character takes *two* code units ? (if you expect indexing to use characters, that'd be wrong) More at http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs --anders

ah. And yep the i++ was a typo (oops). So maybe something like:

dchar opIndex( int index )
{
  int i;
  foreach( dchar c; data )
  {
    if ( i == index )
      return c;
    i++;
  }
}

The i is no longer the foreach's index, so the i++ isn't a typo anymore. Thanks for the info. I'll check out that faq a little later, gotta go.
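Another way to sketch the same lookup (mine, not from the thread) is with phobos' std.utf.decode, which advances an explicit code-unit index one code point at a time; keeping that index around between calls would let successive lookups resume instead of rescanning from the start:

```d
import std.utf;

// Hypothetical helper (my own name): fetch the index-th code point of a
// UTF-8 array by walking it with std.utf.decode.
dchar nthCodePoint(char[] data, int index)
{
    size_t pos = 0;                      // code-unit (byte) position
    for (int n = 0; pos < data.length; n++)
    {
        dchar c = decode(data, pos);     // decodes one code point, advances pos
        if (n == index)
            return c;
    }
    throw new Exception("code point index out of range");
}
```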
Sep 29 2006
prev sibling parent reply Georg Wrede <georg.wrede nospam.org> writes:
Chad J > wrote:
 I will go ahead and say that the current state of char[] is incorrect. 
 That is, if you write a program manipulating char[] strings, then run it 
 in china, you will be dissapointed with the results.  It won't matter 
 how fast the program runs, because bad stuff will happen like entire 
 strings becoming unreadable to the user.

Wrong. And that's precisely what I meant about the Daddy holding the bike allegory a few messages back. The current system seems to work "by magic". So, if you do go to China, it'll "just work". At this point you _should_ not believe me. :-) But it still works.

---

The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of bit-twiddling individual octets in these "char" arrays. So things just keep on working.

---

Not convinced yet? Well, a lot of folks here are from Europe, and our languages contain "non-ASCII" characters. Our text manipulating programs still work all right. And, actually D is pretty popular in Japan. Every once in a while some Japanese guys pop on-and-off here, and some of them don't even speak English, so they use a machine translator(!) to talk with us. Just guess if they use ASCII in their programs. And you know what, most of these guys even use their own characters for variable names in D! And not one of them has complained about "disappointing results".

---

That's why I continued with: keep your eyes shut and keep on coding.
Sep 29 2006
next sibling parent reply Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Georg Wrede wrote:
 The secret is, there actually is a delicate balance between UTF-8 and 
 the library string operations. As long as you use library functions to 
 extract substrings, join or manipulate them, everything is OK. And very 
 few of us actually either need to, or see the effort of bit-twiddling 
 individual octets in these "char" arrays.
 

But this is what I'm talking about... you can't slice them or index them. I might actually index a character out of an array from time to time. If I don't know about UTF, and I do just keep on coding, and I do something like this:

char[] str = "some string in nonenglish text";
for ( int i = 0; i < str.length; i++ )
{
    str[i] = doSomething( str[i] );
}

and this will fail right? If it does fail, then everything is not alright. You do have to worry about UTF. Someone has to tell you to use a foreach there.
Sep 29 2006
next sibling parent Georg Wrede <georg.wrede nospam.org> writes:
Chad J > wrote:
 Georg Wrede wrote:
 
 The secret is, there actually is a delicate balance between UTF-8 and 
 the library string operations. As long as you use library functions to 
 extract substrings, join or manipulate them, everything is OK. And 
 very few of us actually either need to, or see the effort of 
 bit-twiddling individual octets in these "char" arrays.

But this is what I'm talking about... you can't slice them or index them. I might actually index a character out of an array from time to time. If I don't know about UTF, and I do just keep on coding, and I do something like this:

char[] str = "some string in nonenglish text";
for ( int i = 0; i < str.length; i++ )
{
    str[i] = doSomething( str[i] );
}

and this will fail right? If it does fail, then everything is not alright. You do have to worry about UTF. Someone has to tell you to use a foreach there.

Yes. That's why I talked about you falling down once you realise Daddy's not holding the bike.

Part of UTF-8's magic lies in that it is amazingly easy to get working smoothly with truly minor tweaks to "formerly ASCII-only" libraries -- so that even the most exotic languages have no problem. Your concerns about the for loop are valid, and expected. Now, IMHO, the standard library should take care of "all" the situations where you would ever need to split, join, examine, or otherwise use strings, "non-ASCII" or not. (And I really have no complaint (Walter!) about this.) Therefore, in no normal circumstances should you have to twiddle them yourself -- unless. And this "unless" is exactly why I'm unhappy with the situation, too.

Problem is, _technology_wise_ the existing setup may actually be the best, both considering ease of writing the library, ease of using it, robustness of both the library and users' code, and the headaches saved from programmers who either haven't heard of the issue (whether they're American or Chinese!), or who simply trust their lives with the machinery.

So, where's the actual problem??? At this point I'm inclined to say: the documentation, and the stage props!

The latter meaning: exposing the fact that our "strings" are just arrays is psychologically wrong, and even more so is the fact that we're shamelessly storing entities of variable length in arrays which have no notion of such -- even worse, while we brag with slices! If this had been a university course assignment, we'd have been thrown out of class, for both half baked work, and for arrogance towards our client, victimizing the coder.

The former meaning: we should not be like "we're bad enough to overtly use plain arrays for variable-length data, now if you have a problem with it, then go home and learn stuff, or else just trust us".

Both "documentation" and "stage props" ultimately meaning that the largest problem here is psychology, pedagogy, and education.
---

A lot would already be won by merely aliasing char[] to string, and discouraging other than guru-level folks from screwing with their internals. This alone would save a lot of Fear, Uncertainty and D-phobia.

The documentation should take pains in explaining up front that if you _really_ want to do Character-by-Character ops _and_ you live outside of America, then the Right way to do it (ehh, actually the Canonical Way) is to first convert the string to dchar[]. Period.

Then, if somebody else knows enough of UTF-8 and knows he can handle bit twiddling more efficiently than using the Canonical Way, with plain char[] and "foreignish", then let him. But let that be undocumented and Un-Discussed in the docs. Precisely like a lot of other things are. (And should be.) And will be. He's on his own, and he ought to know it.

---

In other words, the normal programmer should believe he's working with black-box Strings, and he will be happy with it. That way he'll survive whether he's in Urduland or Boise, Idaho -- without ever needing to have heard about UTF or other crap.

Not until in Appendix Z of the manual should we ever admit that the Emperor's Clothes are just plain arrays, and we apologize for the breach of manners of storing variable-length data in simple naked arrays. And here would be the right place to explain how come this hasn't blown up in our faces already. And, exactly how you'll avoid it too. (This _needs_ to contain an adequate explanation of the actual format of UTF-8.)

---

TO RECAP

The _single_ biggest strings-related disservice to our pilgrims is to lead them to believe that D stores strings in something like utf8[] internally. Now that's an oxymoron, if I ever saw one. (If utf8[] was _actually_ implemented, it would probably have to be an alias of char[][]. Right? Right? What we have instead is ubyte[], which is _not_ the same as utf8[].)
(Oh, and if it ever becomes obvious that not _everybody_ understood this, then that in itself simply proves my point here.) (*1)

And the fault lies in the documentation, not the implementation! This results in braincell-hours wasted, precisely as much as everybody has to waste them, before they realise that the acronym RAII is a filthy lie. Akin only to the former "German _Democratic_ Republic". Only a politician should be capable of this kind of deception.

Ok, nobody is doing it on purpose. Things being too clear to oneself often result in difficulties finding ways to express them to new people. (Happens every day at the Math department! :-( ) And since all in-the-know are unable to see it, and all not-in-the-know are too, then both groups might think it's the thing itself that is "the problem", and not merely the chosen _presentation_ of it.

#################

Sorry for sounding Righteous, arrogant and whatever. But this really is a 5 minute thing for one person to fix for good, while it wastes entire days or months _per_person_, from _every_ non-defoiled victim who approaches the issue. Originally I was one of them: hence the aggression.

-------------------------------------------

(*1) Even I am not simultaneously both literally and theoretically right here. Those who saw it right away probably won't mind, since it's the point that is the issue here. Now, having to write this disclaimer, IMHO simply again underlines the very point attempted here.
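The alias being argued for above really is a one-liner; a minimal sketch (mine; nothing of the sort ships in phobos at the time of this thread):

```d
alias char[] string;          // D1-style alias; one line somewhere central

string greeting = "hello";    // still a plain char[] underneath
string[string] dict;          // reads far better than char[][char[]]
```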
Sep 29 2006
prev sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Chad J > wrote:
 But this is what I'm talking about... you can't slice them or index 
 them.  I might actually index a character out of an array from time to 
 time.  If I don't know about UTF, and I do just keep on coding, and I do 
 something like this:
 
 char[] str = "some string in nonenglish text";
 for ( int i = 0; i < str.length; i++ )
 {
   str[i] = doSomething( str[i] );
 }
 
 and this will fail right?
 
 If it does fail, then everything is not alright.  You do have to worry 
 about UTF.  Someone has to tell you to use a foreach there.

Yes, you do have to be aware of it being UTF, just like in C you have to be aware that strings are 0 terminated. But once aware of it, there is plenty of support for it in the core language and in std.utf. You can also simply use dchar[], which has a one to one mapping between characters and indices, if you prefer. Contrast that with C++, which has no usable or portable support for UTF-8, UTF-16, or any Unicode. All your carefully coded use of std::string needs to be totally scrapped and redone with your own custom classes, should you decide your app needs to support unicode. You can also wrap char[] inside a class that provides a view of the data as if it were dchar's. But I don't think the performance of such a class would be competitive. Interestingly, it turns out that most string operations do not need to be concerned with the number of char's in a character (like "find this substring"), and forcing them to care just makes for inefficiency.
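A sketch of that last point (my example, using phobos' std.string.find): a plain byte-wise substring search over UTF-8 is already correct, because the encoding of one code point can never appear inside the encoding of another, so no decoding is needed:

```d
import std.stdio;
import std.string;

void main()
{
    char[] s = "smörgåsbord";
    int i = find(s, "gås");   // byte-wise search; a match always starts on a code point boundary
    // i is a code-unit (byte) index into s, not a character count; -1 if absent
    writefln("found at code-unit index %d", i);
}
```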
Sep 29 2006
parent reply Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
 
 Contrast that with C++, which has no usable or portable support for 
 UTF-8, UTF-16, or any Unicode. All your carefully coded use of 
 std::string needs to be totally scrapped and redone with your own custom 
 classes, should you decide your app needs to support unicode.

As long as you're aware that you are working in UTF-8 I think std::string could still be used. It just may be strange to use substring searches to find multibyte characters with no built-in support for dchar-type searching.
 You can also wrap char[] inside a class that provides a view of the data 
  as if it were dchar's. But I don't think the performance of such a 
 class would be competitive. Interestingly, it turns out that most string 
 operations do not need to be concerned with the number of char's in a 
 character (like "find this substring"), and forcing them to care just 
 makes for inefficiency.

Yup. I realized this while working on array operations and it came as a surprise--when I began I figured I would have to provide overloads for char strings, but in most cases it simply isn't necessary. Sean
Sep 30 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Sean Kelly wrote:
 Walter Bright wrote:
 Contrast that with C++, which has no usable or portable support for 
 UTF-8, UTF-16, or any Unicode. All your carefully coded use of 
 std::string needs to be totally scrapped and redone with your own 
 custom classes, should you decide your app needs to support unicode.

As long as you're aware that you are working in UTF-8 I think std::string could still be used. It just may be strange to use substring searches to find multibyte characters with no built-in support for dchar-type searching.

It's so broken that there are proposals to reengineer core C++ to add support for UTF types.

1) implementation-defined whether a char is signed or unsigned, so you've got to cast the result of any string[i]

2) none of the iteration, insertion, appending, etc., operations can handle multibyte

3) no UTF conversion or transliteration

4) C++ source text encoding is implementation-defined, so no using UTF characters in source code (have to use \u or \U notation)
Sep 30 2006
parent reply Sean Kelly <sean f4.ca> writes:
Walter Bright wrote:
 Sean Kelly wrote:
 Walter Bright wrote:
 Contrast that with C++, which has no usable or portable support for 
 UTF-8, UTF-16, or any Unicode. All your carefully coded use of 
 std::string needs to be totally scrapped and redone with your own 
 custom classes, should you decide your app needs to support unicode.

As long as you're aware that you are working in UTF-8 I think std::string could still be used. It just may be strange to use substring searches to find multibyte characters with no built-in support for dchar-type searching.

It's so broken that there are proposals to reengineer core C++ to add support for UTF types. 1) implementation-defined whether a char is signed or unsigned, so you've got to cast the result of any string[i]

Oops, forgot about this.
 2) none of the iteration, insertion, appending, etc., operations can 
 handle multibyte

True. And I hinted at this above.
 3) no UTF conversion or transliteration
 
 4) C++ source text encoding is implementation-defined, so no using UTF 
 characters in source code (have to use \u or \U notation)

Personally, I see this as a language deficiency more than a deficiency in std::string. std::string is really just a vector with some search capabilities thrown in. It's not that great for a string class, but it works well enough as a general sequence container. And it will work a tad better once they impose the same data contiguity guarantee that vector has (I believe that's one of the issues set to be resolved for 0x).

Overall, I do agree with you. Though I suppose that's obvious as I'm a former C++ advocate who now uses D quite a bit :-)

Sean
Oct 01 2006
parent Walter Bright <newshound digitalmars.com> writes:
Sean Kelly wrote:
 3) no UTF conversion or transliteration

 4) C++ source text encoding is implementation-defined, so no using UTF 
 characters in source code (have to use \u or \U notation)

Personally, I see this as a language deficiency more than a deficiency in std::string.

That's why the proposals to fix it are rewriting some of the *core* C++ language.
 std::string is really just a vector with some search 
 capabilities thrown in.

Another difficulty with it is it doesn't have a connection with std::vector<char>.
 It's not that great for a string class, but it 
 works well enough as a general sequence container.  And it will work a 
 tad better once they impose the same data contiguity guarantee that 
 vector has (I believe that's one of the issues set to be resolved for 0x).
 
 Overall, I do agree with you.  Though I suppose that's obvious as I'm a 
 former C++ advocate who now uses D quite a bit :-)

:-)
Oct 01 2006
prev sibling next sibling parent reply Johan Granberg <lijat.meREM OVEgmail.com> writes:
Georg Wrede wrote:
 Wrong.
 
 And that's precisely what I meant about the Daddy holding bike allegory 
 a few messages back.
 
 The current system seems to work "by magic". So, if you do go to China, 
 it'll "just work".
 
 At this point you _should_ not believe me. :-) But it still works.
 
 ---

But is this not a needless source of confusion, that could be eliminated by defining char as "big enough to hold a unicode code point" or something else that eliminates the possibility to incorrectly divide utf tokens? I will have to try using char[] with non-ascii characters, though I have been using dchar for that up till now.
Sep 29 2006
parent reply Georg Wrede <georg.wrede nospam.org> writes:
Johan Granberg wrote:
 Georg Wrede wrote:
 
 Wrong.

 And that's precisely what I meant about the Daddy holding bike 
 allegory a few messages back.

 The current system seems to work "by magic". So, if you do go to 
 China, it'll "just work".

 At this point you _should_ not believe me. :-) But it still works.

 ---

But is this not a needless source of confusion, that could be eliminated by defining char as "big enough to hold a unicode code point" or something else that eliminates the possibility to incorrectly divide utf tokens? I will have to try using char[] with non-ascii characters, though I have been using dchar for that up till now.

You might begin with pasting this and compiling it:

   import std.stdio;
   void main()
   {
       int öylätti;
       int ШеФФ;
       öylätti = 37;
       ШеФФ = 19;
       writefln("Köyhyys 1 on %d ja nöyrä 2 on %d, että näin.", öylätti, ШеФФ);
   }

It will compile, and run just fine. (The source file having been read into DMD as a single big string, and then having gone through comment removal, tokenizing, parsing, lexing, compiling, optimizing, and finally the variable names having found their way into the executable. Even though the front end has been written in D itself, with simply char[] all over the place.)

(Then you might see that the Windows "command prompt window" renders the output wrong, but it's only from the fact that Windows itself doesn't handle UTF-8 right in the Command Window.)

The next thing you might do is to write a grep program (that takes as input a file and as output writes the lines found). Write the program as if you had never heard this discussion. Then feed it the Kalevala in Finnish, or Mao's Red Book in Chinese. Should still work.

As long as you don't start tampering with the individual octets in strings, you should be just fine. Don't think about UTF and you'll prosper.
Sep 29 2006
parent reply Derek Parnell <derek psyc.ward> writes:
On Sat, 30 Sep 2006 03:03:02 +0300, Georg Wrede wrote:


 As long as you don't start tampering with the individual octets in 
 strings, you should be just fine. Don't think about UTF and you'll prosper.

The Build program does lots of 'tampering'. I had to rewrite many standard routines and create some new ones to deal with unicode characters because the standard ones just don't work. And Build still fails to do some things correctly (e.g. case insensitive compares) but that's on the TODO list. I have to think about UTF because it doesn't work unless I do that.

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Sep 29 2006
parent Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
 On Sat, 30 Sep 2006 03:03:02 +0300, Georg Wrede wrote:
 
As long as you don't start tampering with the individual octets in 
strings, you should be just fine. Don't think about UTF and you'll prosper.

The Build program does lots of 'tampering'. I had to rewrite many standard routines and create some new ones to deal with unicode characters because the standard ones just don't work.

Do you still remember which they were?
 And Build still fails to do some things
 correctly (e.g. case insensitive compares) but that's on the TODO list.

Yes, case insensitive compares are difficult if you want to cater for non-ASCII strings. While it may not be unreasonably difficult to get American, European and Russian strings right, there will always be languages and character sets where even the Unicode guys aren't sure what is right. Unfortunately.
Sep 29 2006
prev sibling parent reply Geoff Carlton <gcarlton iinet.net.au> writes:
Georg Wrede wrote:

 The secret is, there actually is a delicate balance between UTF-8 and 
 the library string operations. As long as you use library functions to 
 extract substrings, join or manipulate them, everything is OK. And very 
 few of us actually either need to, or see the effort of bit-twiddling 
 individual octets in these "char" arrays.
 
 So things just keep on working.
 

I agree, but I disagree that there is a problem, that utf-8 is a bad choice, or that char[] or string should perhaps be called utf8 instead. As a note here, I actually had a page of text localised into Chinese last week - it came back as a utf8 text file.

The only thing with utf8 is that glyphs aren't represented by a single char. But utf16 is no better! And even utf32 codepoints can be combined into a single rendered glyph. So truncating a string at an arbitrary index is not going to slice on a glyph boundary. However, it doesn't mean utf8 is ASCII mixed with "garbage" bytes. That garbage is a unique series of bytes that represent a codepoint. This is a property not found in any other encoding.

As such, everything works, strstr, strchr, strcat, printf, scanf - for ASCII, normal unicode, and the "Astral planes". It all just works. The only thing that breaks is if you try to index or truncate the data by hand. But even that mostly works: you can iterate through, looking for ASCII sequences, chop out ASCII and string together more stuff; it all works because you can just ignore the higher order bytes. Pretty much the only thing that fails is if you said "I don't know what's in the string, but chop it off at index 12".
Sep 29 2006
parent reply Georg Wrede <georg.wrede nospam.org> writes:
Geoff Carlton wrote:
 Georg Wrede wrote:
 
 The secret is, there actually is a delicate balance between UTF-8 and 
 the library string operations. As long as you use library functions to 
 extract substrings, join or manipulate them, everything is OK. And 
 very few of us actually either need to, or see the effort of 
 bit-twiddling individual octets in these "char" arrays.

 So things just keep on working.

I agree, but I disagree that there is a problem, that utf-8 is a bad choice, or that char[] or string should perhaps be called utf8 instead. As a note here, I actually had a page of text localised into Chinese last week - it came back as a utf8 text file. The only thing with utf8 is that glyphs aren't represented by a single char. But utf16 is no better! And even utf32 codepoints can be combined into a single rendered glyph. So truncating a string at an arbitrary index is not going to slice on a glyph boundary. However, it doesn't mean utf8 is ASCII mixed with "garbage" bytes. That garbage is a unique series of bytes that represent a codepoint. This is a property not found in any other encoding. As such, everything works, strstr, strchr, strcat, printf, scanf - for ASCII, normal unicode, and the "Astral planes". It all just works. The only thing that breaks is if you try to index or truncate the data by hand. But even that mostly works, you can iterate through, looking for ASCII sequences, chop out ASCII and string together more stuff, it all works because you can just ignore the higher order bytes. Pretty much the only thing that fails is if you said "I don't know what's in the string, but chop it off at index 12".

Yes.
Sep 29 2006
parent reply Johan Granberg <lijat.meREM OVEgmail.com> writes:
Georg Wrede wrote:
 Geoff Carlton wrote:
 But even that mostly works, you can iterate through, looking for ASCII 
 sequences, chop out ASCII and string together more stuff, it all works 
 because you can just ignore the higher order bytes.  Pretty much the 
 only thing that fails is if you said "I don't know whats in the 
 string, but chop it off at index 12".

Yes.

How should we chop strings on character boundaries? I have a text rendering function that uses freetype and want to restrict the width of the rendered string by truncating it (I have to use some sort of search here, binary or linear). Right now I use dchar but if char is sufficient it would save me conversions all over the place.
Sep 29 2006
parent Walter Bright <newshound digitalmars.com> writes:
Johan Granberg wrote:
 How should we chop strings on character boundaries?

std.utf.toUTFindex() should do the trick.
Sep 30 2006
prev sibling parent reply Johan Granberg <lijat.meREM OVEgmail.com> writes:
BCS wrote:
 Why isn't performance a problem?
 
 If you are saying that this won't cause performance hits in run times or 
  memory space, I might be able to buy it, but I'm not yet convinced.
 
 If you are saying that causing a performance hit in run times or memory 
 space is not a problem... in that case I think you are dead wrong and 
 you will not convince me otherwise.
 
 In my opinion, any compiled language should allow fairly direct access 
 to the most efficient practical means of doing something*. If I didn't 
 care about speed and memory I wound use some sort of scripting language.
 
 A good set of libs should make most of this moot. Leave the char as is 
 and define a typedef struct or whatever that provides the added 
 functionality that you want.
 
 * OTOH a language should not mandate code to be efficient at the expense 
 of ease of coding.

I don't think any performance hit will be so big that it causes problems (max x4 memory and negligible computation overhead). Hope that made clear what I meant.
Sep 29 2006
parent reply BCS <BCS pathlink.com> writes:
Johan Granberg wrote:
 BCS wrote:
 
 Why isn't performance a problem?


 If you are saying that causing a performance hit in run times or 
 memory space is not a problem... in that case I think you are dead 
 wrong and you will not convince me otherwise.

I don't think any performance hit will be so big that it causes problems (max x4 memory and negligible computation overhead). Hope that made clear what I meant.

If you will note, I said nothing about the size of the hit. While some may disagree, I think that any unneeded hit is a problem. One alternative that I could live with would use 4 character types:

   char     one code unit in whatever encoding the runtime uses
   schar    one 8 bit code unit (ASCII or utf-8)
   wchar    one 16 bit code unit (same as before)
   dchar    one 32 bit code unit (same as before)

(using the same thing for ASCII and UTF-8 may be a problem, but this isn't my field)

The point being that char, wchar and dchar are not representing numbers and should be their own type. This also preserves direct access to 8, 16 and 32 bit types.
Oct 01 2006
parent reply Anders F Björklund <afb algonet.se> writes:
BCS wrote:

 One alternative that I could live with would use 4 character types:
 
 char    one codeunit in whatever encoding the runtime uses
 schar    one 8 bit code unit (ASCII or utf-8)
 wchar    one 16 bit code unit (same as before)
 dchar    one 32 bit code unit (same as before)

We have that already:

   ubyte    one code unit in whatever encoding the runtime uses
   char     one 8 bit code unit (ASCII or utf-8)

There is no support in Phobos for runtime/native encodings, but you can use the "iconv" library to do such conversions ?
 (using the same thing for ASCII and UTF-8 may be a problem, but this isn't my
field)

All ASCII characters are valid UTF-8 code units, so it's OK. --anders
Oct 01 2006
parent reply BCS <BCS pathlink.com> writes:
Anders F Björklund wrote:
 BCS wrote:
 
 One alternative that I could live with would use 4 character types:

 char    one codeunit in whatever encoding the runtime uses
 schar    one 8 bit code unit (ASCII or utf-8)
 wchar    one 16 bit code unit (same as before)
 dchar    one 32 bit code unit (same as before)

We have that already: ubyte one codeunit in whatever encoding the runtime uses char one 8 bit code unit (ASCII or utf-8)

ubyte is an 8 bit unsigned number, not a character encoding.

[after some more reading]

I may be just rambling but... how about having the type of the value denote the encoding? One for ASCII would only ever store ASCII (UTF-8 is invalid), same for UTF-8, 16 and 32. Direct assignment would be illegal (as with, say, int[] -> Object) or implicitly converted (as with int -> real). Casts would be provided. Indexing would be by codepoint. Non-array variables would be big enough to store any codepoint (ASCII -> 8bit, !ASCII -> 32-bit). Some sort of "whatever the system uses" data type (à la C's int) could be used for actual output, maybe even escaping anything that won't get displayed correctly.

This all sort of follows the idea of "call it what it is and don't hide the overhead". 1) Characters are a different type of data than numbers (see the threads on bool) and as such, that should be reflected in the type system. 2) I have no problem with high overhead operations as long as I can avoid using them when I don't want to.
 
 There is no support in Phobos for runtime/native encodings,
 but you can use the "iconv" library to do such conversions ?
 
 (using the same thing for ASCII and UTF-8 may be a problem, but this 
 isn't my field)

All ASCII characters are valid UTF-8 code units, so it's OK.

But UTF-8 is not ASCII.
 --anders

Oct 01 2006
next sibling parent Georg Wrede <georg.wrede nospam.org> writes:
BCS wrote:
 I may be just rambling but...
 
 how about have the type of the value denote the encoding. One for ASCII 
 would only ever store ASCII (UTF-8 is invalid)

Then all Americans would use that instead of UTF-8. This is natural, since first you code for yourself, later maybe for your boss, etc. And, you'd only become aware of any problems when a Latino tries to use his own name José, talk about Motörhead, or Anaïs the fragrance. And the mail and newsreader you wrote in D simply would not work. Guess if anybody would heed the warning "Only use this new ASCII encoding when you are absolutely positive the program never will encounter a single foreign sentence or letter". So, better not.

---

D's current setup and documentation encourage this kind of suggestion, and I don't blame you. Things being like they are, a programmer who wants to write a crossword puzzle generator would of course begin with:

   char[20][20] theGrid;

It's a shame that an otherwise so excellent language (+ the wording in its docs) downright leads you to do this. The guy naturally assumes that D being a "UTF-8" language, this would work even in Chinese. (Hey,

   char[] foo = "José Motörhead from the band Anaïs is on stage!";

works, so why wouldn't theGrid?) Poor guy. I can't blame anyone then wanting to stay within ASCII for the rest of D's life.
Oct 01 2006
prev sibling parent reply Anders F Björklund <afb algonet.se> writes:
BCS wrote:

 ubyte is an 8 bit unsigned number not a character encoding.

Right, I actually meant ubyte[] but void[] might have been more accurate for representing any (even non-UTF) encoding. (I used ubyte[] in my mapping functions, since they only used legacy 8-bit encodings like "cp1252" or "macroman")

Re-reading your post, it seems to me that you were more talking about doing an alias to the UTF type most suitable for the OS ? I guess UTF-8 would be a good choice if the operating system doesn't use Unicode, since then it'll have to do lookups anyway. Otherwise the existing "wchar_t" isn't bad for such an UTF type, it will be UTF-16 on Windows and UTF-32 on Unix (linux, darwin, ...)
 All ASCII characters are valid UTF-8 code units, so it's OK.

But UTF-8 is not ASCII.

So you would like a char "type" that would only take ASCII ? I guess that is *one* way of dealing with it; you could also have a wchar type that wouldn't accept surrogates (BMP only). Then it would be OK to index them by code unit / character... (since each allowed character would fit into one code unit) Sounds a little like signed vs. unsigned integers actually ? Then again, 5 character types is even worse than the 3 now. --anders
Oct 01 2006
parent BCS <BCS pathlink.com> writes:
Anders F Björklund wrote:
[...]
 
 Then again, 5 character types is even worse than the 3 now.
 
 --anders

The more I think about it the worse this gets. What I really would like is a system that allows O(1) operations on strings (slice out char 7 to 27), allows somewhat compact encoding (8bit) and allows safe operations on UTF (if I do something dumb, it complains). All at the same time would be nice, but is not needed.

Come to think about it, a lib that will do good FAST conversion between buffers:

   //note: "in" is intentional, it won't allocate anything
   UTF8to16(in char[], in wchar[]);
   UTF8to32(in char[], in dchar[]);
   UTF16to32(in wchar[], in dchar[]);
   ...

would get most of what I want.

<sarcasm> And while I'm at it, I'd like a million bucks please. </sarcasm>
Oct 02 2006