digitalmars.D - First Impressions

Geoff Carlton <gcarlton iinet.net.au> writes:
Hi,
I'm a C++ user who's just tried D and I wanted to give my first
impressions.  I can't really justify moving any of my codebase over to
D, so I wrote a quick tool to parse a dictionary file and make a
histogram - a bit like the wc demo in the dmd package.

1.)
I was a bit underwhelmed by the syntax of char[].  I've used lua which
also has strings, functions and maps as basic primitives, so going back
to array notation seems a bit low level.  Also, char[][] is not the best
start in the main() declaration.  Is it a 2D array, an array of
arrays?  Then there is the char[][char[]].  What a mouthful for a simple
map!

Well, now I need to find elements.. I'd use std::string's find() here, 
but the wc example has all array operations. Even isalpha is done as 
'a', 'z' comparisons on an indexed array. Back to low level C stuff.

A simple alias of char[] to string would simplify the first glance code.
   string x;    // yep, a string
   main (string[]) // an array of strings
   string[string] m; // map of string to string

I believe single functions get pulled in as member functions?  e.g.
find(string) can be used as string.find()?  If so, it means that all the
string functionality can be added and then used naturally as member
functions on this "string" (which is really just the plain old char[] in
disguise).

This is a small thing, but I think it would help in terms of the mindset
of strings being a first class primitive, and clear up simple "hello
world" examples at the same time.  Put simply, every modern language has
a first class string primitive type, except D - at least in terms of
nomenclature.

2.)
I liked the more powerful for loop.  I'm curious is there any ability to 
use delegates in the same way as lua does?  I was blown away the first 
time I realised how simple it was for custom iteration in lua.  In 
short, you write a function that returns a delegate (a closure?) that 
itself returns arguments, terminating in nil.

   e.g. for r in rooms_in_level(lvl) // custom function

As lua can handle multiple return arguments, it can also do a key,value 
sort of thing that D can do.  What a wonderful way of allowing any sort 
of iteration.

It beats pages of code in C++ to write an iterator that can go forwards, 
or one that can go backwards (wow, the power of C++!).  C++09 still 
isn't much of an improvement here, it only sugars the awful iterator syntax.

3.)
 From the newsgroups, it seems like 'auto' as local raii and 'auto' as
automatic type deduction are still linked to the one keyword.  Well in 
lua, 'local' is pretty intuitive for locally scoped variables.  Also 
'auto' will soon mean automatic type deduction in C++.  So those make 
sense to me personally.  Looks like this has been discussed to death, 
but that's my 2c.

4.)
The D version of Scintilla and d-build was nice, very easy to use.
Personally I would have preferred the default behaviour of dbuild to put 
object files in an /obj subdirectory and the final exe in the original 
directory dbuild is run from.

This way, it could be run from a root directory, operate on a /src 
subdirectory, and not clutter up the source with object files.  There is 
a switch for that, of course, but I can't imagine when you would want 
object files sitting in the same directory as the source.

Well, as first impressions go, I was pleased by D, and am interested to 
see how well it fares as time goes on.  It's just a shame that the
tools/library/IDE are all in C++!

Thanks,
Geoff
Sep 28 2006
"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Geoff Carlton" <gcarlton iinet.net.au> wrote in message 
news:efhp1r$1r9s$1 digitaldaemon.com...
 Hi,
 I'm a C++ user who's just tried D and I wanted to give my first
 impressions.  I can't really justify moving any of my codebase over to
 D, so I wrote a quick tool to parse a dictionary file and make a
 histogram - a bit like the wc demo in the dmd package.

 1.)
 I was a bit underwhelmed by the syntax of char[].  I've used lua which
 also has strings, functions and maps as basic primitives, so going back
 to array notation seems a bit low level.  Also, char[][] is not the best
 start in the main() declaration.  Is it a 2D array, an array of
 arrays?  Then there is the char[][char[]].  What a mouthful for a simple
 map!

 Well, now I need to find elements.. I'd use std::string's find() here, but 
 the wc example has all array operations. Even isalpha is done as 'a', 'z' 
 comparisons on an indexed array. Back to low level C stuff.

 A simple alias of char[] to string would simplify the first glance code.
   string x;    // yep, a string
   main (string[]) // an array of strings
   string[string] m; // map of string to string

 I believe single functions get pulled in as member functions?  e.g.
 find(string) can be used as string.find()?  If so, it means that all the
 string functionality can be added and then used naturally as member
 functions on this "string" (which is really just the plain old char[] in
 disguise).

They're more just syntactic sugar than member functions. You can, in fact, do this with any array type, e.g.

   void foo(int[] arr) { ... }

   int[] x;
   x = [4, 5, 6, 7]; // bug in the new array literals ;)
   x.foo();

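A minimal sketch of the call syntax described above, assuming a current D compiler (the thread itself predates D2). The function name `total` is hypothetical and only there for illustration:

```d
// A free function whose first parameter is an array.
// `total` is a made-up name for this sketch.
int total(int[] arr)
{
    int sum = 0;
    foreach (x; arr)
        sum += x;
    return sum;
}

void main()
{
    int[] x = [4, 5, 6, 7];

    // Ordinary call and "member" call are the same function call;
    // nothing is actually added to the array type.
    assert(total(x) == 22);
    assert(x.total() == 22);
}
```

Both spellings resolve to the same free function, which is why this is sugar rather than a true member.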
 This is a small thing, but I think it would help in terms of the mindset
 of strings being a first class primitive, and clear up simple "hello
 world" examples at the same time.  Put simply, every modern language has
 a first class string primitive type, except D - at least in terms of
 nomenclature.

It does look nicer. I suppose the counterargument would be that having an alias char[] string might not be portable -- what about wchar[] and dchar[]? Would they be wstring and dstring? Or would we choose wchar[] or dchar[] to be string to be more forward-thinking (since languages like Java and C# already use UTF-16 as the default string type)? I've never been too incredibly put off by char[], but of course other people have other opinions.
 2.)
 I liked the more powerful for loop.  I'm curious is there any ability to 
 use delegates in the same way as lua does?  I was blown away the first 
 time I realised how simple it was for custom iteration in lua.  In short, 
 you write a function that returns a delegate (a closure?) that itself 
 returns arguments, terminating in nil.

   e.g. for r in rooms_in_level(lvl) // custom function

 As lua can handle multiple return arguments, it can also do a key,value 
 sort of thing that D can do.  What a wonderful way of allowing any sort of 
 iteration.

Unfortunately the way Lua does "foreach" iteration is exactly the inverse of how D does it. Lua gets an iterator and keeps calling it in the loop; D gives the loop (the entire body!) to the iterator function, which runs the loop. So it's something like a "true" iterator as described in the Lua book:

   level.each(function(r) print("Room: " .. r) end)

D does it this way I guess to make it easier to write iterators. Since you're limited to one return value, it's simpler to make the iterator a callback and pass the indices into the foreach body than it is to make the iterator return multiple parameters through "out" parameters. That, and it's easier to keep track of state with a callback iterator. (I'm going through which to use in a Lua-like language that I'm designing too!)

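As a rough illustration of the "loop body is handed to the iterator" model described above, here is a sketch using `opApply`, assuming a current D compiler. The `Level` type and its `rooms` field are hypothetical:

```d
import std.stdio;

// Hypothetical container for this sketch.
struct Level
{
    string[] rooms;

    // foreach passes the loop body in as the delegate `dg`.
    // The iterator runs the loop and calls the body once per element.
    int opApply(int delegate(ref string) dg)
    {
        foreach (ref r; rooms)
        {
            int result = dg(r);
            if (result != 0)   // non-zero: the body executed break/return
                return result;
        }
        return 0;
    }
}

void main()
{
    auto level = Level(["hall", "kitchen", "cellar"]);

    // The compiler turns this body into a delegate and calls level.opApply.
    foreach (r; level)
        writeln("Room: ", r);
}
```

The non-zero return check is what lets `break` inside the `foreach` body stop the iteration early.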
 It beats pages of code in C++ to write an iterator that can go forwards, 
 or one that can go backwards (wow, the power of C++!).  C++09 still isn't 
 much of an improvement here, it only sugars the awful iterator syntax.

Weeeeeeeee! C++
 3.)
 From the newsgroups, it seems like 'auto' as local raii and 'auto' as
 automatic type deduction are still linked to the one keyword.  Well in 
 lua, 'local' is pretty intuitive for locally scoped variables.  Also 
 'auto' will soon mean automatic type deduction in C++.  So those make 
 sense to me personally.  Looks like this has been discussed to death, but 
 that's my 2c.

I don't even wanna get into it ;)

_Technically_ speaking, auto isn't really "used" in type deduction; instead, the syntax is just <storage class> <identifier>, skipping the type. Since the default storage class is auto, it looks like auto is being used to determine the type, but it also works like e.g.

   static x = 5;

I think a better way to do it would be to have a special "stand-in" type, such as

   var x = 5;
   static var y = 20;
   auto var f = new Foo(); // this will be RAII and automatically type-determined

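The point that the type is inferred whenever a storage class appears without a type can be checked with a small sketch, assuming a current D compiler; the variable names are arbitrary:

```d
void main()
{
    auto   a = 5;     // type omitted: inferred as int
    static b = 20;    // static storage class, type still inferred as int
    const  c = 2.5;   // const storage class, inferred as const(double)

    // Compile-time checks that inference did what the comments claim.
    static assert(is(typeof(a) == int));
    static assert(is(typeof(b) == int));
    static assert(is(typeof(c) == const(double)));
}
```

Here `auto` plays no special role in the deduction; any storage class in that position triggers it.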

Sep 28 2006
Geoff Carlton <gcarlton iinet.net.au> writes:
Jarrett Billingsley wrote:

 It does look nicer.  I suppose the counterargument would be that having an 
 alias char[] string might not be portable -- what about wchar[] and dchar[]? 
 Would they be wstring and dstring?  Or would we choose wchar[] or dchar[] to 
 be string to be more forward-thinking (since languages like Java and C# 
 already use UTF-16 as the default string type)?
 

I'm a fan of utf-8 so it would seem natural to have string, wstring, and dstring. IMO utf-16 is backward thinking, and has the dubious property of being mostly fixed width, except when it's not. And even utf-32 isn't one-to-one in terms of glyphs rendered on screen.

Anyway, as a low level programmer, I appreciate that it's all based on very powerful and flexible arrays. But as a high level programmer, I don't want to be reminded of that fact every time I need to use a string.

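For readers on a current D compiler: the three names suggested here now exist as built-in aliases for immutable arrays of `char`, `wchar` and `dchar`. The sketch below shows how `.length` counts code units, not characters, in each encoding, including the "mostly fixed width, except when it's not" case for UTF-16. The sample strings are arbitrary:

```d
void main()
{
    // "naïve" with a precomposed U+00EF, written as an escape so the
    // result does not depend on the source file's encoding.
    string  s = "na\u00EFve";   // UTF-8
    wstring w = "na\u00EFve"w;  // UTF-16
    dstring d = "na\u00EFve"d;  // UTF-32

    assert(s.length == 6);  // U+00EF takes two UTF-8 code units
    assert(w.length == 5);
    assert(d.length == 5);

    // U+1D11E (musical G clef) lies outside the Basic Multilingual Plane.
    assert("\U0001D11E".length  == 4);  // four UTF-8 code units
    assert("\U0001D11E"w.length == 2);  // a UTF-16 surrogate pair
    assert("\U0001D11E"d.length == 1);  // one UTF-32 code unit
}
```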
 Unfortunately the way Lua does "foreach" iteration is exactly the inverse of 
 how D does it.  Lua gets an iterator and keeps calling it in the loop; D 
 gives the loop (the entire body!) to the iterator function, which runs the 
 loop.  So it's something like a "true" iterator as described in the Lua 
 book:

Ok, although the advantage of the first method is that you write the iterator once, and then it's easy to use for all clients. Wrapping up the loop in a function is just backward, although it is much more palatable in the inline format than a clunky out of line functor or using _1, _2 hackery magic.

As an example, I love the fact that I can do this in lua:

   for r1 in rooms_in_level(lvl) do
     for r2 in rooms_in_level(lvl) do
       for c in connections(r1, r2) do
         print("got connection " .. c)
       end
     end
   end

I wrote Floyd's algorithm in lua in the time it would take me in C++ to not even finish thinking about what structures, classes, vectors I would use. I imagine D would be as easy, although not as nice as the above style.

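For comparison, a sketch of how the call site of the nested Lua loops could look in D, assuming a current D compiler, where `foreach` accepts a delegate with an `opApply`-style signature. The helper `roomsInLevel` and the room names are hypothetical:

```d
import std.stdio;

alias LoopBody = int delegate(ref string);

// Hypothetical helper: returns a delegate that foreach can iterate over.
// The iterator is written once and reused by every caller.
int delegate(LoopBody) roomsInLevel(string[] rooms)
{
    return (LoopBody dg) {
        foreach (ref r; rooms)
        {
            if (auto stop = dg(r))  // non-zero: the loop body asked to stop
                return stop;
        }
        return 0;
    };
}

void main()
{
    auto rooms = ["hall", "kitchen"];

    // Nested custom iteration, written like an ordinary loop.
    foreach (r1; roomsInLevel(rooms))
        foreach (r2; roomsInLevel(rooms))
            writeln(r1, " -> ", r2);
}
```

The iteration is still inverted internally (the body is passed to the delegate), but the client code reads much like the Lua version.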
 
 D does it this way I guess to make it easier to write iterators.  Since 
 you're limited to one return value, it's simpler to make the iterator a 
 callback and pass the indices into the foreach body than it is to make the 
 iterator return multiple parameters through "out" parameters.  That, and 
 it's easier to keep track of state with a callback iterator.  (I'm going 
 through which to use in a Lua-like language that I'm designing too!)

Multiple returns would be tricky. C++ looks like it's getting there with std::tuple and std::tie, but as always the downside is the sheer clunkiness. As heterogeneous arrays aren't in the core language for either C++ or D, it's tricky to come up with a clean solution.

Designing a language would be great fun, and I think lua has done a great many things right. Not sure about the typeless state though, it gets messy with large projects. Still, no templates (or rather, every function is like a template).
Sep 28 2006
Lutger <lutger.blijdestijn gmail.com> writes:
Geoff Carlton wrote:
 Hi,
 I'm a C++ user who's just tried D and I wanted to give my first
 impressions.  I can't really justify moving any of my codebase over to
 D, so I wrote a quick tool to parse a dictionary file and make a
 histogram - a bit like the wc demo in the dmd package.

You'll sure be pleased with D coming from C++.
 1.)
 I was a bit underwhelmed by the syntax of char[]...

Yes, I was too. But although it doesn't look very nice at first sight, D's arrays are nothing like C++ arrays. Strings are first class and array notation is consistent; once I got used to them, together with the concatenation and slicing operators, I found they are quite powerful yet simple to use.
 2.)
 I liked the more powerful for loop.  I'm curious is there any ability to 
 use delegates in the same way as lua does?  I was blown away the first 
 time I realised how simple it was for custom iteration in lua.  In 
 short, you write a function that returns a delegate (a closure?) that 
 itself returns arguments, terminating in nil.

You can enable a class to use the foreach statement. http://www.digitalmars.com/d/statement.html#foreach
 4.)
 The D version of Scintilla and d-build was nice, very easy to use.
 Personally I would have preferred the default behaviour of dbuild to put 
 object files in an /obj subdirectory and the final exe in the original 
 directory dbuild is run from.
 
 This way, it could be run from a root directory, operate on a /src 
 subdirectory, and not clutter up the source with object files.  There is 
 a switch for that, of course, but I can't imagine when you would want 
 object files sitting in the same directory as the source.

Check out build: http://www.dsource.org/projects/build

Sep 28 2006
Derek Parnell <derek nomail.afraid.org> writes:
On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:

 Hi,
 I'm a C++ user who's just tried D

 I was a bit underwhelmed by the syntax of char[].

Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot.

   alias char[] string;

 I believe single functions get pulled in as member functions?  e.g.
 find(string) can be used as string.find()? 

This syntax sugar works for all arrays.

   func(T[] x, a)
   x.func(a)

are equivalent.

 2.)
 I liked the more powerful for loop.  I'm curious is there any ability to 
 use delegates in the same way as lua does?

Yes it can use anonymous delegates. You can also overload it in classes.
 
 3.)
  From the newsgroups, it seems like 'auto' as local raii and 'auto' as
 automatic type deduction are still linked to the one keyword.

There are lots of D users hoping that this wart will be repaired before too long.
 4.)
 The D version of Scintilla and d-build was nice, very easy to use.
 Personally I would have preferred the default behaviour of dbuild to put 
 object files in an /obj subdirectory and the final exe in the original 
 directory dbuild is run from.
 
 This way, it could be run from a root directory, operate on a /src 
 subdirectory, and not clutter up the source with object files.  There is 
 a switch for that, of course, but I can't imagine when you would want 
 object files sitting in the same directory as the source.

Thanks for the Build comments. One unfortunate thing I find is that one person's defaults are another's exceptions. That is why you can tailor Build to your 'default' behaviour requirements. In this case, create a text file in the same directory that Build.exe is installed in, called 'build.cfg' and place in it the line ...

   CMDLINE=-od./obj

Then when you run the tool, the command line switch is applied every time you run it.

-- 
Derek (skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
29/09/2006 4:44:52 PM
Sep 28 2006
Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
 On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:
 I was a bit underwhelmed by the syntax of char[].

Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot. alias char[] string;

On the other hand, the reasons other languages have strings as classes is because they just don't support arrays very well. C++'s std::string combines the worst of core functionality and libraries, and has the advantages of neither. An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.
Sep 29 2006
Anders F Björklund <afb algonet.se> writes:
Walter Bright wrote:

 An early design goal for D was to upgrade arrays to the point where 
 string classes weren't necessary.

A string alias might still be, just as the bool alias was. --anders
Sep 29 2006
Derek Parnell <derek psyc.ward> writes:
On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:

 Derek Parnell wrote:
 On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:
 I was a bit underwhelmed by the syntax of char[].

Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot. alias char[] string;

On the other hand, the reasons other languages have strings as classes is because they just don't support arrays very well. C++'s std::string combines the worst of core functionality and libraries, and has the advantages of neither. An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.

And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can. And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems? -- Derek Parnell Melbourne, Australia "Down with mediocrity!"
Sep 29 2006
Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
 On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:
An early design goal for D was to upgrade arrays to the point where 
string classes weren't necessary.

And is it there yet? I mean, given that a string is just a lump of text

The string you're talking about is not just a lump of text. More specifically it's a lump of text, irregularly interspersed with short non-ascii ubyte sequences. The latter being of course the tails of UTF-8 "characters".
Sep 29 2006
David Medlock <noone nowhere.com> writes:
Derek Parnell wrote:
 On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:
 
 
Derek Parnell wrote:

On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:

I was a bit underwhelmed by the syntax of char[].

Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot. alias char[] string;

On the other hand, the reasons other languages have strings as classes is because they just don't support arrays very well. C++'s std::string combines the worst of core functionality and libraries, and has the advantages of neither. An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.

And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can. And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?

string array types. -DavidM
Sep 29 2006
Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
 And is it there yet? I mean, given that a string is just a lump of text, is
 there any text processing operation that cannot be simply done to a char[]
 item? I can't think of any but maybe somebody else can.

I believe it's there. I don't think std::string or java.lang.String have anything over it.
 And if a char[] is just as capable as a std::string, then why not have an
 official alias in Phobos? Will 'alias char[] string' cause anyone any
 problems?

I don't think it'll cause problems, it just seems pointless.
Sep 29 2006
Matthias Spycher <matthias coware.com> writes:
Immutability and some guarantees about the validity of the state of an 
immutable string in a concurrent setting are what set Java strings 
apart. Garbage collection without immutable strings in the standard 
library is quite out of the ordinary.

Walter Bright wrote:
 Derek Parnell wrote:
 And is it there yet? I mean, given that a string is just a lump of 
 text, is
 there any text processing operation that cannot be simply done to a 
 char[]
 item? I can't think of any but maybe somebody else can.

I believe it's there. I don't think std::string or java.lang.String have anything over it.
 And if a char[] is just as capable as a std::string, then why not have an
 official alias in Phobos? Will 'alias char[] string' cause anyone any
 problems?

I don't think it'll cause problems, it just seems pointless.

Sep 29 2006
Derek Parnell <derek psyc.ward> writes:
On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:

 Derek Parnell wrote:
 And is it there yet? I mean, given that a string is just a lump of text, is
 there any text processing operation that cannot be simply done to a char[]
 item? I can't think of any but maybe somebody else can.

I believe it's there. I don't think std::string or java.lang.String have anything over it.

I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurrence of the byte value for the letter 'a' which may well be inside a Japanese character. It looks for byte-subsets rather than character sub-sets.
 And if a char[] is just as capable as a std::string, then why not have an
 official alias in Phobos? Will 'alias char[] string' cause anyone any
 problems?

I don't think it'll cause problems, it just seems pointless.

It may very well be pointless for your way of thinking, but your language is also for people who may not necessarily think in the same manner as yourself. I, for example, think there is a point to having my code read like it's dealing with strings rather than arrays of characters. I suspect I'm not alone. We could all write the alias in all our code, but you could also be helpful and do it for us - like you did with bit/bool.

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Sep 29 2006
Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
 I'm pretty sure that the phobos routines for search and replace only work
 for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
 always fail to deliver the correct result. It finds the first occurrence of
 the byte value for the letter 'a' which may well be inside a Japanese
 character. It looks for byte-subsets rather than character sub-sets.

I take it that you mean that the bit pattern, or byte, 'a' (as in 0x61) may be found within a Japanese multibyte glyph? Or even a very long Japanese text. That is not correct. The designers of UTF-8 knew that this would be dangerous, and created UTF-8 so that such _will_not_happen_. Ever. Therefore, something like std.string.find() doesn't even have to know about it. Basically, std.string.find() and comparable functions, only have to receive two octet sequences, and see where one of them first occurs in the other. No need to be aware of UTF or ASCII. For all we know, the strings may even be in EBCDIC. Still works. If the strings themselves are valid (in whichever encoding you have chosen to use), then the result will also be valid. ((For the sake of completeness, here I've restricted the discussion to the version of such functions that accept ubyte[] compatible input (obviously including char[]). Those taking 16 or 32 bits, and especially if we deliberately feed input of wrong width to any of these, then of course the results will be more complicated.))
Sep 29 2006
Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
 I'm pretty sure that the phobos routines for search and replace only work
 for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
 always fail to deliver the correct result. It finds the first occurrence of
 the byte value for the letter 'a' which may well be inside a Japanese
 character.

That cannot happen, because multibyte sequences *always* have the high bit set, and 'a' does not. That's one of the things that sets UTF-8 apart from other multibyte formats. You might be thinking of the older Shift-JIS multibyte encoding, which did suffer from such problems.
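The property described above can be checked directly. This is a small sketch assuming a current D compiler and today's Phobos (`std.string.indexOf` and `std.string.representation`; the `std.string.find` named in the thread belongs to the Phobos of the time). The sample text is arbitrary:

```d
import std.string : indexOf, representation;

void main()
{
    // "日本語 abc", written with escapes so the bytes are unambiguous.
    string text = "\u65E5\u672C\u8A9E abc";

    // Each of the three Japanese characters takes three UTF-8 bytes.
    // Every one of those nine bytes has the high bit set, so none of
    // them can be mistaken for ASCII 'a' (0x61).
    foreach (ubyte b; text.representation[0 .. 9])
        assert((b & 0x80) != 0);

    // A plain code-unit search therefore lands on the real 'a':
    // 9 bytes of Japanese text, one space, then 'a' at index 10.
    assert(text.indexOf("a") == 10);
}
```

Note that the index is a byte (code unit) offset, not a character count, which is what slicing a `char[]` expects.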
 It looks for byte-subsets rather than character sub-sets.

I don't think it's broken, but if it is, those are bugs, not fundamental problems with char[], and should be filed in bugzilla.
 It may very well be pointless for your way of thinking, but your language
 is also for people who may not necessarily think in the same manner as
 yourself. I, for example, think there is a point to having my code read
 like it's dealing with strings rather than arrays of characters. I suspect
 I'm not alone. We could all write the alias in all our code, but you could
 also be helpful and do it for us - like you did with bit/bool.

I'm concerned about just adding more names that don't add real value. As I wrote in a private email discussion about C++ typedefs, they should only be used when:

1) they provide an abstraction against the presumption that the underlying type may change
2) they provide a self-documentation purpose

(1) certainly doesn't apply to string. (2) may, but char[] has no use other than that of being a string, as a char[] is always a string and a string is always a char[]. So I don't think string fits (2).

And lastly, there's the inevitable confusion. People learning the language will see char[] and string, and wonder which should be used when. I can't think of any consistent understandable rule for that. So it just winds up being wishy-washy. Adding more names into the global space (which is what names in object.d are) should be done extremely conservatively.

If someone wants to use the string alias as their personal or company style, I have no issue with that, as other people *do* think differently than me (which is abundantly clear here!).
Sep 29 2006
Derek Parnell <derek psyc.ward> writes:
On Fri, 29 Sep 2006 23:11:37 -0700, Walter Bright wrote:

 Derek Parnell wrote:
 I'm pretty sure that the phobos routines for search and replace only work
 for ASCII text. For example, std.string.find(japanesetext, "a") will nearly
 always fail to deliver the correct result. It finds the first occurrence of
 the byte value for the letter 'a' which may well be inside a Japanese
 character.

That cannot happen, because multibyte sequences *always* have the high bit set, and 'a' does not. That's one of the things that sets UTF-8 apart from other multibyte formats. You might be thinking of the older Shift-JIS multibyte encoding, which did suffer from such problems.

Thanks. That has cleared up some misconceptions and preconceptions that I had with utf encoding. I can reduce some of my home-grown routines now and reduce the number of times that I (think I) need dchar[] ;-)
 It may very well be pointless for your way of thinking, but your language
 is also for people who may not necessarily think in the same manner as
 yourself. I, for example, think there is a point to having my code read
 like it's dealing with strings rather than arrays of characters. I suspect
 I'm not alone. We could all write the alias in all our code, but you could
 also be helpful and do it for us - like you did with bit/bool.

I'm concerned about just adding more names that don't add real value. As I wrote in a private email discussion about C++ typedefs, they should only be used when:

1) they provide an abstraction against the presumption that the underlying type may change
2) they provide a self-documentation purpose

(1) certainly doesn't apply to string.

No argument there.
  (2) may, but char[] has no use 
 other than that of being a string, as a char[] is always a string and a 
 string is always a char[]. So I don't think string fits (2).

This is a little more debatable, but not worth generating hostility. A string of text contains characters whose position in the string is significant - there are semantics to be applied to the entire text. It is quite possible to conceive of an application in which the characters in the char[] array have no importance attached to their relative position within the array *when compared to neighboring characters*. The order of characters in text is significant but not necessarily so in an arbitrary character array.

Conceptually a string is different from a char[], even though they are implemented using the same technology.
 And lastly, there's the inevitable confusion. People learning the 
 language will see char[] and string, and wonder which should be used 
 when. I can't think of any consistent understandable rule for that. So 
 it just winds up being wishy-washy. Adding more names into the global 
 space (which is what names in object.d are) should be done extremely 
 conservatively.

And yet we have "toString" and not "toCharArray" or "toUTF"! And we still have the "printf" in object.d too!
 If someone wants to use the string alias as their personal or company 
 style, I have no issue with that, as other people *do* think differently 
 than me (which is abundantly clear here!).

I'll revert Build to string again as it is a lot easier to read. It started out that way but I converted it to char[] to appease you (why I thought you need appeasing is lost though). :-) -- Derek Parnell Melbourne, Australia "Down with mediocrity!"
Sep 30 2006
Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
  (2) may, but char[] has no use 
 other than that of being a string, as a char[] is always a string and a 
 string is always a char[]. So I don't think string fits (2).

This is a little more debatable, but not worth generating hostility.

I certainly hope this thread doesn't degenerate into that like some of the others.
 A string of text contains characters whose position in the string is
 significant - there are semantics to be applied to the entire text. It is
 quite possible to conceive of an application in which the characters in the
 char[] array have no importance attached to their relative position within
 the array *where compared to neighboring characters*. The order of
 characters in text is significant but not necessarily so in an arbitrary
 character array. 
 
 Conceptually a string is different from a char[], even though they are
 implemented using the same technology.

You do have a point there.
 And lastly, there's the inevitable confusion. People learning the 
 language will see char[] and string, and wonder which should be used 
 when. I can't think of any consistent understandable rule for that. So 
 it just winds up being wishy-washy. Adding more names into the global 
 space (which is what names in object.d are) should be done extremely 
 conservatively.

And yet we have "toString" and not "toCharArray" or "toUTF"!

True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful. I suppose that since I grew up with char* meaning string, using char[] seems perfectly natural. I tried typedef'ing char* to string now and then, but always wound up going back to just using char*.
 And we still have the "printf" in object.d too!

I know many feel that printf doesn't belong there. It certainly isn't there for purity or consistency. It's there purely (!) for the convenience of writing short quickie programs. I tend to use it for quick debugging test cases, because it doesn't rely on the rest of D working.
 If someone wants to use the string alias as their personal or company 
 style, I have no issue with that, as other people *do* think differently 
 than me (which is abundantly clear here!).

I'll revert Build to string again as it is a lot easier to read. It started out that way but I converted it to char[] to appease you (why I thought you need appeasing is lost though). :-)

No, you certainly don't need to appease me! I do care about maintaining a reasonably consistent style in Phobos, but I don't believe a language should enforce a particular style beyond the standard library. Viva la difference.

P.S. I did say to not 'enforce', but that doesn't mean I am above recommending a particular style, as in http://www.digitalmars.com/d/dstyle.html
Sep 30 2006
→ Derek Parnell <derek psyc.ward> writes:
On Sat, 30 Sep 2006 21:18:02 -0700, Walter Bright wrote:

 P.S. I did say to not 'enforce', but that doesn't mean I am above 
 recommending a particular style, as in 
 http://www.digitalmars.com/d/dstyle.html

Oh, I threw that away ages ago ;-)

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Oct 01 2006
Lars Ivar Igesund <larsivar igesund.net> writes:
Walter Bright wrote:

 
 And yet we have "toString" and not "toCharArray" or "toUTF"!

True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.

Nope, it just looks correct.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource & #D: larsivi
Oct 01 2006
↑ ↓ → Lionello Lunesu <lio lunesu.remove.com> writes:
Lars Ivar Igesund wrote:
 Walter Bright wrote:
 
 And yet we have "toString" and not "toCharArray" or "toUTF"!

be technically more correct (as toUTF would be, too), it just looks awful.

Nope, it just looks correct.

I don't think renaming toString to toUTF gets rid of any confusion. AFAIK, toString is meant for debugging and char[] should be enough, and yet flexible enough for unicode strings. In fact, "string toString()" would be a good solution too.

---

My 4 reasons for the "string" aliases:

* readability: fewer [] pairs;
* safety: char[] is not zero-terminated, so let's not pretend there's a relation with C's char*. In fact: let's hide any relation;
* clarity: a char[] should not be iterated 1 char at a time, which makes it different from an int[];
* consistency: "string toString()"

L.
Oct 02 2006
Georg Wrede <georg.wrede nospam.org> writes:
Walter Bright wrote:
 True, and some have called for renaming char to utf8. While that would 
 be technically more correct (as toUTF would be, too), it just looks awful.

Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.
Oct 01 2006
↑ ↓ Kevin Bealer <kevinbealer gmail.com> writes:
Georg Wrede wrote:
 Walter Bright wrote:
 True, and some have called for renaming char to utf8. While that would 
 be technically more correct (as toUTF would be, too), it just looks 
 awful.

Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.

I would kind of agree with this, but I think it's a two-edged knife.

If we say 'char[]' then users don't know it's a string until they read the 'why D arrays are great' page (which they should read, but...)

If we say 'string' then we hide the fact that [] can be applied and that other array-like operations can work.

For instance, from a Java perspective:

char[] : Users don't know that it's "String"; users see it as low-level.
         Some will try to write things like 'find()' by hand since they
         will figure arrays are low level and not expect this to exist.

string : Users will think it's immutable, special; they will ask "how do
         I get one of the characters out of a string", "how do I convert
         string to char[]?", and other things that would be obvious
         without the alias.

Kevin
Oct 02 2006
→ Georg Wrede <georg.wrede nospam.org> writes:
Kevin Bealer wrote:
 Georg Wrede wrote:
 
 Walter Bright wrote:

 True, and some have called for renaming char to utf8. While that 
 would be technically more correct (as toUTF would be, too), it just 
 looks awful.

Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.

I would kind of agree with this, but I think it's a two-edged knife. If we say 'char[]' then users don't know it's a string until they read the 'why D arrays are great' page (which they should read, but...) If we say 'string' then we hide the fact that [] can be applied and that other array-like operations can work. For instance, from a Java perspective: char[] : Users don't know that it's "String"; users see it as low-level. Some will try to write things like 'find()' by hand since they will figure arrays are low level and not expect this to exist.

Yes.
 string : Users will think it's immutable, special; they will ask "how do
          I get one of the characters out of a string", "how do I convert
          string to char[]?", and other things that would be obvious
          without the alias.

Well, with string, folks would at least be inclined to search for the library function to do it.

---

Overall, having string instead of char[] should result in folks learning and doing more with D _before_ they get tangled with UTF issues. (I guess, getting tangled with UTF is unavoidable.) But the later folks stumble on this, the better they can handle it. If it happens too soon, then they will just run away from D.

But substituting string for char[] in D is not enough. More than half the issue is the wording in the docs.

---

Another thing intimately connected with this is whether we should have char[] or utf8[] (string or no string, this is an important thing anyway).

I understand that "char" is one of the words that a seasoned programmer's fingers know by heart. So it would feel simply disgusting to have to learn (and bother) to write "utf8" which I admit is a lot more work to type. (Seriously.)

Now, "string" is easy for the fingers, and then you get to skip "[]", which makes it all a little more palatable. Having string would let us have the underlying type be utf8[], which really emphasizes and calls your attention to the fact that it's not byte-by-byte stuff we have there.
Oct 03 2006
→ =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Kevin Bealer wrote:

 If we say 'char[]' then users don't know it's a string until they read 
 the 'why D arrays are great' page (which they should read, but...)
 
 If we say 'string' then we hide the fact that [] can be applied and that 
 other array-like operations can work.

Which could be a *good* thing, since it would stop users from hurting themselves by pretending that the D strings are arrays of characters ? And when they have read up that they are "arrays of Unicode code units", they should be OK with interpreting the "string" alias as char[] arrays.
 For instance, from a Java perspective:
 
 char[] : Users don't know that it's "String"; users see it as low-level.
          Some will try to write things like 'find()' by hand since they
          will figure arrays are low level and not expect this to exist.
 
 string : Users will think it's immutable, special; they will ask "how do
          I get one of the characters out of a string", "how do I convert
          string to char[]?", and other things that would be obvious
          without the alias.

I think the best answer would be: "to get a char[] from the string, use the std.utf.toUTF8 function", since this also works even if you redeclare the "string" alias to be something else - like wchar_t[] ?

Earlier* I suggested adding the alias utf8_t for "char", just like we have int8_t for "byte", but I wouldn't rename the actual D types. Just a little std.stdutf module with some aliases, if ever needed...

string std.string.toString( )
utf8_t[] std.utf.toUTF8( )
utf16_t[] std.utf.toUTF16( )
utf32_t[] std.utf.toUTF32( )

--anders

* digitalmars.D/11821, 2004-10-15
Oct 03 2006
Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
Derek Parnell wrote:
  (2) may, but char[] has no use 
 other than that of being a string, as a char[] is always a string and a 
 string is always a char[]. So I don't think string fits (2).

This is a little more debatable, but not worth generating hostility.

A string of text contains characters whose position in the string is significant - there are semantics to be applied to the entire text. It is quite possible to conceive of an application in which the characters in the char[] array have no importance attached to their relative position within the array *where compared to neighboring characters*. The order of characters in text is significant but not necessarily so in an arbitrary character array.

Conceptually a string is different from a char[], even though they are implemented using the same technology.

Precisely! And even if such conceptual difference didn't exist, or is very rare, 'string' is nonetheless more readable than 'char[]', a fact I am constantly reminded of when I see 'int main(char[][] args)' instead of 'int main(string[] args)', which translates much more quickly into the brain as 'array of strings' than its current counterpart.

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Oct 01 2006
↑ ↓ → Geoff Carlton <gcarlton iinet.net.au> writes:
Bruno Medeiros wrote:
 Precisely! And even if such conceptual difference didn't exist, or is 
 very rare, 'string' is nonetheless more readable than 'char[]', a fact I 
 am constantly reminded of when I see 'int main(char[][] args)' instead 
 of 'int main(string[] args)', which translates much more quickly into 
 the  brain as 'array of strings' than its current counterpart.
 

There are also many cases where char arrays are not strings:

Single array of characters, not strings:
   char GAME_10PT_LETTERS[] = { 'x', 'z' };

Two-dimensional array of characters, not string arrays:
   char GAME_LETTERS[][] = { GAME_0PT_LETTERS, GAME_1PT_LETTERS, .. };
   char m_scrabbleBoard[20][20];
Oct 01 2006
Thomas Kuehne <thomas-dloop kuehne.cn> writes:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Derek Parnell schrieb am 2006-09-30:
 On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:

 Derek Parnell wrote:
 And is it there yet? I mean, given that a string is just a lump of text, is
 there any text processing operation that cannot be simply done to a char[]
 item? I can't think of any but maybe somebody else can.

I believe it's there. I don't think std::string or java.lang.String have anything over it.

I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurrence of the byte value for the letter 'a' which may well be inside a Japanese character. It looks for byte-subsets rather than character sub-sets.

~wow~
Have a look at std.string.find's source and try to stop giggling *g*

The correct implementation would be:

# import std.string;
# import std.c.string;
# import std.utf;
#
# int find(char[] s, dchar c)
# {
#     if (c <= 0x7F)
#     {   // Plain old ASCII
#         auto p = cast(char*)memchr(s, c, s.length);
#         if (p)
#             return p - cast(char *)s;
#         else
#             return -1;
#     }
#
#     // c is a universal character
#     return std.string.find(s, toUTF8([c]));
# }

The same applies to ifind and the like.

Thomas

-----BEGIN PGP SIGNATURE-----

iD8DBQFFHj4fLK5blCcjpWoRAj67AJoDagf5zf7Az7ZqMDfOyZdRJ+aIqQCdGeen
ye80pstE4IJC1WoxgTVVgdc=
=iwT5
-----END PGP SIGNATURE-----
Sep 30 2006
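The approach in Thomas's post (encode the code point as UTF-8, then search for that byte sequence) can be sketched against a current D2 compiler and Phobos, which is an assumption here since the thread is about 2006-era D1. `findChar` is a hypothetical helper name; `std.utf.encode` and `std.string.indexOf` are existing Phobos functions.

```d
import std.string : indexOf;
import std.utf : encode;

/// Hypothetical helper (not a Phobos function): returns the code-unit
/// index of the first occurrence of `c` in `s`, or -1 if it is absent.
ptrdiff_t findChar(const(char)[] s, dchar c)
{
    char[4] buf;                      // a code point needs at most 4 UTF-8 code units
    immutable len = encode(buf, c);   // UTF-8 encode the code point into buf
    return s.indexOf(buf[0 .. len]);  // substring search over the encoded units
}

void main()
{
    assert(findChar("日本語a", 'a') == 9);  // each kanji occupies 3 code units
    assert(findChar("日本語", '本') == 3);
    assert(findChar("abc", 'z') == -1);
}
```

The returned value is a code-unit offset, so it can be used directly for slicing the original array.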
↑ ↓ Thomas Kuehne <thomas-dloop kuehne.cn> writes:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Thomas Kuehne schrieb am 2006-09-30:
 Derek Parnell schrieb am 2006-09-30:
 On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:

 Derek Parnell wrote:
 And is it there yet? I mean, given that a string is just a lump of text, is
 there any text processing operation that cannot be simply done to a char[]
 item? I can't think of any but maybe somebody else can.

I believe it's there. I don't think std::string or java.lang.String have anything over it.

I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurrence of the byte value for the letter 'a' which may well be inside a Japanese character. It looks for byte-subsets rather than character sub-sets.

~wow~ Have a look at std.string.find's source and try to stop giggling *g* The correct implementation would be:

As it seems, the original code depends on the undocumented index behavior with regards to silent transcoding in foreach.

Thomas

-----BEGIN PGP SIGNATURE-----

iD8DBQFFHkOILK5blCcjpWoRAnmjAJ9PKdGDHsghycgxHdr7hkc+IP+XEgCgohH8
LH7OOQgQAZoTMLRQXtWhqbE=
=or0x
-----END PGP SIGNATURE-----
Sep 30 2006
↑ ↓ → Sean Kelly <sean f4.ca> writes:
Thomas Kuehne wrote:
 
 As it seems, the original code depends on the undocumented index behavior
 with regards to silent transcoding in foreach.

The wording could be more explicit, but I think the current documentation implies the actual behavior: "The index must be of int or uint type, it cannot be inout, and it is set to be the index of the array element." The docs should probably also be revised to allow for 64-bit indices, where the index would be long or ulong. Something along the lines of: "The index must be an integer type of size equal to size_t.sizeof. . ." Sean
Sep 30 2006
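The foreach index behavior discussed above can be observed with a small sketch, assuming a current D2 compiler (the 2006 D1 compiler the thread refers to may differ in detail). When the loop variable is declared as dchar, the array is decoded on the fly and the index is the code-unit offset at which each code point starts. `codePointOffsets` is a hypothetical name.

```d
import std.stdio : writeln;

/// Hypothetical helper: collects the code-unit offset at which each
/// decoded code point of `s` begins.
size_t[] codePointOffsets(const(char)[] s)
{
    size_t[] offsets;
    foreach (i, dchar c; s)   // i is an index into the char[] array, not a character count
        offsets ~= i;
    return offsets;
}

void main()
{
    // 'a' is 1 code unit, U+00E9 is 2, U+65E5 is 3, '!' is 1.
    auto offsets = codePointOffsets("a\u00E9\u65E5!");
    assert(offsets == [0, 1, 3, 6]);
    writeln(offsets);
}
```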
→ Geoff Carlton <gcarlton iinet.net.au> writes:
Walter Bright wrote:
 Derek Parnell wrote:
 And is it there yet? I mean, given that a string is just a lump of 
 text, is
 there any text processing operation that cannot be simply done to a 
 char[]
 item? I can't think of any but maybe somebody else can.

I believe it's there. I don't think std::string or java.lang.String have anything over it.
 And if a char[] is just as capable as a std::string, then why not have an
 official alias in Phobos? Will 'alias char[] string' cause anyone any
 problems?

I don't think it'll cause problems, it just seems pointless.

Hi,

The main reasons I think are these:

It simplifies the initial examples, particularly main(string[]), and maps such as string[string]. More complex examples are a map of words to text lines, string[][string], rather than char[][][char[]].

It clarifies the actual use of the entity. It is a text string, not just a jumbled array of characters. Arrays of char can be used for other things, such as the set of player letters in a scrabble game. A string has the additional usage that we know it is a text string. The alias reflects that intent.

Given a user wants to use a string, there is no need to expose the implementation detail of how strings are done in D. Perhaps in perl, strings are a linked list of shorts, but it doesn't mean that you'd have list<short> all over the place.

Use of char[] and char[][] looks like low level C. It has also been noted that it encourages char based indexing, which is not a good thing for utf8.

Anyway, hope one of those points grabbed you!

Geoff
Sep 29 2006
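Geoff's "map of words to text lines" can be written out as a short sketch. It assumes a current D2 compiler, where `string` is predefined as an alias for `immutable(char)[]`; in the D1 of this thread one would first declare `alias char[] string;`. `indexLines` is a hypothetical helper name.

```d
import std.array : split;
import std.stdio : writeln;

/// Hypothetical helper: maps every whitespace-separated word to the
/// lines in which it appears.
string[][string] indexLines(string[] lines)
{
    string[][string] index;            // spelled out: char[][][char[]]
    foreach (line; lines)
        foreach (word; line.split())   // split() with no separator splits on whitespace
            index[word] ~= line;       // appending creates the entry if it is missing
    return index;
}

void main()
{
    auto index = indexLines(["a b", "b c"]);
    assert(index["a"] == ["a b"]);
    assert(index["b"] == ["a b", "b c"]);
    writeln(index);
}
```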
→ David Medlock <noone nowhere.com> writes:
Walter Bright wrote:
 Derek Parnell wrote:
 
 And is it there yet? I mean, given that a string is just a lump of 
 text, is
 there any text processing operation that cannot be simply done to a 
 char[]
 item? I can't think of any but maybe somebody else can.

I believe it's there. I don't think std::string or java.lang.String have anything over it.
 And if a char[] is just as capable as a std::string, then why not have an
 official alias in Phobos? Will 'alias char[] string' cause anyone any
 problems?

I don't think it'll cause problems, it just seems pointless.

The reason *I* want it is that _alias_ does not respect the private: visibility modifier. So when I pull out an old piece of code which says

alias char[] string

and import it in my newer module, I get conflicts when I compile. Then I must do this silly hack where I include the newer file from the old or vice versa.

If you don't add this to Phobos, at least adopt a method to discriminate between more than one alias with the same name to resolve the issue.

-DavidM
Sep 29 2006
=?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Geoff Carlton wrote:

 A simple alias of char[] to string would simplify the first glance code.
   string x;    // yep, a string
   main (string[]) // an array of strings
   string[string] m; // map of string to string
 
 I believe single functions get pulled in as member functions?  e.g.
 find(string) can be used as string.find()?  If so, it means that all the
 string functionality can be added and then used naturally as member
 functions on this "string" (which is really just the plain old char[] in
 disguise).

Problem of "char[]" is both that it hides the fact that "char" is UTF-8 while at the same time it exposes the fact that it's stored as an array.

You can "improve" upon that readability with aliases, like declaring say utf8_t -> char and string -> utf8_t[], but you still need to understand Unicode and Arrays in order to use it outside of the provided methods...

I think "hides the implementation" was the biggest argument against it ?

http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues
 This is a small thing, but I think it would help in terms of the mindset
 of strings being a first class primitive, and clear up simple "hello
 world" examples at the same time.  Put simply, every modern language has
 a first class string primitive type, except D - at least in terms of
 nomenclature.

I did the big mistake of thinking it would be a good thing to be able to switch between "ANSI" and "UNICODE" builds (of wxD), and so did it like:

version(UNICODE)
    alias char[] string;
else // version(ANSI)
    alias wchar_t[] string; // wchar[] on Windows, dchar[] on Unix

Still trying to sort out all the code problems with that idea, as there is a ton of toUTF8 and other conversions to make strings work together.

In retrospect it would have been much easier to have stuck with char[], and do the conversion from UTF-8 to the local encoding on the C++ side. (since there were no guarantees that the "char" and "wchar_t" types in C++ used UTF encodings, even if they did so in Unix/GTK+ for instance)

Any (minor) performance issues of having to do the UTF-8 <-> UTF-32 conversions were not worth the hassle of doing it on the D side, IMHO.

So I agree with the "alias char[] string;" and the string[string] args. It's going to be used as wx.common.string for instance, in wxD library.

--anders
Sep 29 2006
↑ ↓ → =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
 I did the big mistake of thinking it would be a good thing to be able to
 switch between "ANSI" and "UNICODE" builds (of wxD), and so did it like:
 
 version(UNICODE)
     alias char[] string;
 else // version(ANSI)
     alias wchar_t[] string; // wchar[] on Windows, dchar[] on Unix

Except the other way around, of course!

version(UNICODE)
    alias wchar_t[] string;
else // version(ANSI)
    alias char[] string;

Now, to get me some more coffee... :-P

--anders
Sep 29 2006
Lionello Lunesu <lio lunesu.remove.com> writes:
I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish 
they would be included by default in Phobos.

alias char[] string;
alias wchar[] wstring;
alias dchar[] dstring;

Perhaps, using string instead of char[], it's more obvious that it's not 
zero-terminated. I've seen D examples online that just cast a char[] to 
char* for use in MessageBox and the like (which worked since it were 
string constants.)

L.
Sep 29 2006
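The zero-termination point can be made concrete with a sketch, assuming a current D2 compiler and Phobos. A slice of a char[] carries a length but no terminator, so handing its pointer to a C function is only safe after making a terminated copy; `std.string.toStringz` exists for this. `cLength` is a hypothetical wrapper, and `strlen` stands in for any C API such as MessageBox.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

/// Hypothetical wrapper: measures `s` the way a C function would see it.
size_t cLength(const(char)[] s)
{
    // toStringz returns a pointer to zero-terminated data, copying if needed.
    return strlen(s.toStringz);
}

void main()
{
    char[] buffer = "hello world".dup;
    char[] slice = buffer[0 .. 5];   // "hello"; the next byte in memory is ' ', not '\0'
    assert(slice.length == 5);
    assert(cLength(slice) == 5);     // correct, because a terminated copy was passed
}
```

Casting `slice.ptr` to `char*` and passing it straight to C code would instead read on past the end of the slice, which is why the cast only appeared to work for string constants.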
=?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Lionello Lunesu wrote:

 Perhaps, using string instead of char[], it's more obvious that it's not 
 zero-terminated. I've seen D examples online that just cast a char[] to 
 char* for use in MessageBox and the like (which worked since it were 
 string constants.)

And probably only for ASCII string constants, at that... --anders
Sep 29 2006
↑ ↓ → Lionello Lunesu <lio lunesu.remove.com> writes:
Anders F Björklund wrote:
 Lionello Lunesu wrote:
 
 Perhaps, using string instead of char[], it's more obvious that it's 
 not zero-terminated. I've seen D examples online that just cast a 
 char[] to char* for use in MessageBox and the like (which worked since 
 it were string constants.)

And probably only for ASCII string constants, at that...

Right, that too!

char[] somestring = "....";
func( somestring[0] ); // WRONG: somestring[x] is not 1 character!

Using "string" would make it less obvious:

string somestring = ".....";
func( somestring[0] ); // [0] means what?

This goes for iteration as well. DMD will still deduce 'char' as the type, but at least one's less likely to type foreach(char c;str). If you want to iterate the UNICODE characters in a string, you'll specify "dchar" as the type and you won't worry about "how come I can use dchar when it's a char[]":

foreach(dchar c; somestring) func(c); // correct

L.
Sep 29 2006
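The difference between indexing and dchar iteration described above can be checked with a small sketch, assuming a current D2 compiler. `countCodePoints` is a hypothetical helper; the literal uses an escape for U+00E9 so the result does not depend on how the source file is saved.

```d
import std.stdio : writefln;

/// Hypothetical helper: counts decoded code points rather than code units.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)   // asking for dchar makes foreach decode UTF-8
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo";              // U+00E9 takes two UTF-8 code units

    assert(s.length == 6);                // .length counts code units
    assert(cast(ubyte) s[1] == 0xC3);     // s[1] is only the first byte of U+00E9
    assert(countCodePoints(s) == 5);      // five code points

    writefln("%s code units, %s code points", s.length, countCodePoints(s));
}
```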
Georg Wrede <georg.wrede nospam.org> writes:
Lionello Lunesu wrote:
 I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish 
 they would be included by default in Phobos.
 
 alias char[] string;
 alias wchar[] wstring;
 alias dchar[] dstring;
 
 Perhaps, using string instead of char[], it's more obvious that it's not 
 zero-terminated. I've seen D examples online that just cast a char[] to 
 char* for use in MessageBox and the like (which worked since it were 
 string constants.)

Using char[] as long as you don't know about UTF seems to work pretty well in D. But the moment you realise that we're having potential multibyte characters in what essentially is a ubyte[], you get scared to death, and start to wonder how on earth you haven't yet blown up your hard disk.

You start having nightmares about slicing char arrays at the wrong place, extracting single chars that might not be storable in a char, and all of a sudden you decide to stick with your old language "till things calm down".

The only medicine to this is simply to shut your eyes and keep coding on like you never did realise anything. It's a little like when you first realised Daddy isn't holding your bike: you instantly fall hurting yourself, instead of realizing that he's probably let go ages ago, and you still haven't fallen, so simply keep going.

---

This doesn't mean I'm happy with this either, but I don't have the energy to conjure up a significantly better solution _and_ fight for it till it gets accepted. (Some things are just too hard to fix, like "bit=bool" was, and now "auto/auto".)
Sep 29 2006
↑ ↓ Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Georg Wrede wrote:
 Lionello Lunesu wrote:
 
 I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish 
 they would be included by default in Phobos.

 alias char[] string;
 alias wchar[] wstring;
 alias dchar[] dstring;

 Perhaps, using string instead of char[], it's more obvious that it's 
 not zero-terminated. I've seen D examples online that just cast a 
 char[] to char* for use in MessageBox and the like (which worked since 
 it were string constants.)

Using char[] as long as you don't know about UTF seems to work pretty well in D. But the moment you realise that we're having potential multibyte characters in what essentially is a ubyte[], you get scared to death, and start to wonder how on earth you haven't yet blown up your hard disk. You start having nightmares about slicing char arrays at the wrong place, extracting single chars that might not be storable in a char, and all of a sudden you decide to stick with your old language "till things calm down". The only medicine to this is simply to shut your eyes and keep coding on like you never did realise anything. It's a little like when you first realised Daddy isn't holding your bike: you instantly fall hurting yourself, instead of realizing that he's probably let go ages ago, and you still haven't fallen, so simply keep going. --- This doesn't mean I'm happy with this either, but I don't have the energy to conjure up a significantly better solution _and_ fight for it till it gets accepted. (Some things are just too hard to fix, like "bit=bool" was, and now "auto/auto".)

haha too true. I experienced this too as I read this ng. It hasn't been THAT traumatic for me though, since everything seems to work as long as you stick to English. I don't have the resources to even begin thinking about non-English text (ex: paying people to translate stuff), so I don't lose any sleep about it, at least not yet.

Perhaps there should be a string struct/class that has an undefined underlying type (it could be UTF-8, 16, 32, you dunno really), and you could index it to get the *complete* character at any position in the string. Basically, it is like char[], but it /just works/ in all cases. I'd almost rather have the size of a char be undefined, and just have char[] be the said magic string type. If you want something with a .size of 1, then there is byte/ubyte. There would probably have to be some stuff in the phobos internals to handle such a string in a correct manner.

Going even further... if you could make char[] be such a magic string type, then wchar[] and dchar[] could probably be deprecated - use ushort and uint instead. Then add the following aliases to phobos:

alias ubyte utf8;
alias ushort utf16;
alias uint utf32;

Just a thought. I'm no expert on UTF, but maybe this can start a discussion that will result in the nightmares ending :)
Sep 29 2006
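One way to approximate the "index it to get the complete character" idea with types D already has is to decode into a dchar[] first. The sketch below assumes a current D2 compiler and Phobos; `nthCodePoint` is a hypothetical helper, and it indexes code points, which is not always the same as a user-perceived character (combining sequences remain a separate matter).

```d
import std.conv : to;
import std.stdio : writefln;

/// Hypothetical helper: returns the n-th code point of a UTF-8 string
/// by decoding the whole string to UTF-32 first.
dchar nthCodePoint(const(char)[] s, size_t n)
{
    dstring decoded = s.to!dstring;   // one 32-bit unit per code point
    return decoded[n];                // O(1) indexing after an O(length) conversion
}

void main()
{
    string s = "h\u00E9llo";
    assert(nthCodePoint(s, 1) == '\u00E9');

    // The convenience costs memory: 6 bytes as UTF-8, 20 bytes as UTF-32.
    assert(s.length * char.sizeof == 6);
    assert(s.to!dstring.length * dchar.sizeof == 20);
    writefln("UTF-8: %s bytes, UTF-32: %s bytes",
             s.length * char.sizeof, s.to!dstring.length * dchar.sizeof);
}
```

The size difference is the kind of trade-off the replies about run time and memory are weighing.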
↑ ↓ Johan Granberg <lijat.meREM OVEgmail.com> writes:
Chad J > wrote:
 Perhaps there should be a string struct/class that has an undefined 
 underlying type (it could be UTF-8, 16, 32, you dunno really), and you 
 could index it to get the *complete* character at any position in the 
 string.  Basically, it is like char[], but it /just works/ in all cases. 
  I'd almost rather have the size of a char be undefined, and just have 
 char[] be the said magic string type.  If you want something with a 
 ..size of 1, then there is byte/ubyte.  There would probably have to be 
 some stuff in the phobos internals to handle such a string in a correct 
 manner.

I have thought about this too.
 Going even further... if you could make char[] be such a magic string 
 type, then wchar[] and dchar[] could probably be deprecated - use ushort 
 and uint instead.  Then add the following aliases to phobos:
 alias ubyte utf8;
 alias ushort utf16;
 alias uint utf32;

I completely agree, char should hold a character independently of encoding and NOT a code unit or something else. I think it would be beneficial to D in the long term if chars were done right (meaning that they can store any character). How it is implemented is not important, and I believe performance is not a problem here, so ease of use and correctness would be appreciated.
Sep 29 2006
↑ ↓ BCS <BCS pathlink.com> writes:
Johan Granberg wrote:
 
 
 I completely agree, char should hold a character independently of 
 encoding and NOT a code unit or something else. I think it would be
 beneficial to D in the long term if chars were done right (meaning that 
 they can store any character) how it is implemented is not important and 
 i believe performance is not a problem here, so ease of use and 
 correctness would be appreciated.

Why isn't performance a problem?

If you are saying that this won't cause performance hits in run times or memory space, I might be able to buy it, but I'm not yet convinced.

If you are saying that causing a performance hit in run times or memory space is not a problem... in that case I think you are dead wrong and you will not convince me otherwise.

In my opinion, any compiled language should allow fairly direct access to the most efficient practical means of doing something*. If I didn't care about speed and memory I would use some sort of scripting language.

A good set of libs should make most of this moot. Leave the char as is and define a typedef struct or whatever that provides the added functionality that you want.

* OTOH a language should not mandate code to be efficient at the expense of ease of coding.
Sep 29 2006
Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
BCS wrote:
 Johan Granberg wrote:
 
 I completely agree, char should hold a character independently of 
 encoding and NOT a code unit or something else. I think it would be
 beneficial to D in the long term if chars were done right (meaning 
 that they can store any character) how it is implemented is not 
 important and i believe performance is not a problem here, so ease of 
 use and correctness would be appreciated.

Why isn't performance a problem? If you are saying that this won't cause performance hits in run times or memory space, I might be able to buy it, but I'm not yet convinced. If you are saying that causing a performance hit in run times or memory space is not a problem... in that case I think you are dead wrong and you will not convince me otherwise. In my opinion, any compiled language should allow fairly direct access to the most efficient practical means of doing something*. If I didn't care about speed and memory I would use some sort of scripting language. A good set of libs should make most of this moot. Leave the char as is and define a typedef struct or whatever that provides the added functionality that you want. * OTOH a language should not mandate code to be efficient at the expense of ease of coding.

I will go ahead and say that the current state of char[] is incorrect. That is, if you write a program manipulating char[] strings, then run it in China, you will be disappointed with the results. It won't matter how fast the program runs, because bad stuff will happen like entire strings becoming unreadable to the user.

Technically if you follow UTF and do your char[] manipulations very carefully, it is correct, but realistically few if any people will do such things (I won't). Also, if you do this, your program will probably run as slow as one with the proposed char/string solution, maybe slower (since language/stdlib level support can be heavily optimized).

What I'd like then, is a program that is correct and as fast as possible while still being correct. Sure you can get some speed gains by just using ASCII and saying to hell with UTF, but you should probably only do that when profiling has shown that such speed gains are actually useful/needed in your program.

Ultimately we have to decide whether we want D to default to UTF code which might run slightly slower but allow better localization and international friendliness, or if we want it to default to ASCII or some such encoding that runs slightly faster but is mostly limited to English.

I'd like the default to be UTF. Then we can have a base of code to correctly manipulate UTF strings (in phobos and language supported). Writing correct ASCII manipulation routine without good library/language support is a lot easier than writing good UTF manipulation routines without good library/language support, and UTF will probably be used much more than ASCII.

Also, if we move over to full blown UTF, we won't have to give up ASCII. It seems to me like the phobos std.string functions are pretty much ASCII string manipulating functions (no multibyte string support). So just copy those out to a separate library, call it "ASCII lib", and there's your library support for ASCII. That leaves string literals, which is a slight problem, but I suppose easily fixed:

ubyte[] hi = "hello!"a;

Just add a postfix 'a' for strings which makes the string an ASCII literal, of type ubyte[]. D arrays don't seem powerful enough to do UTF manipulations without special attention, but they are powerful enough to do ASCII manipulations without special attention, so using ubyte[] as an ASCII string should give full language support for these. Given that and ASCIILIB you pretty much have the current D string manipulation capabilities afaik, and it will be fast.
Sep 29 2006
Anders F Björklund <afb algonet.se> writes:
Chad J wrote:

 I'd like the default to be UTF. Then we can have a base of code to
 correctly manipulate UTF strings (in phobos and language supported).
 Writing correct ASCII manipulation routine without good library/language
 support is a lot easier than writing good UTF manipulation routines
 without good library/language support, and UTF will probably be used
 much more than ASCII.

But D already uses Unicode for all strings, encoded as UTF ? When you say "ASCII", do you mean 8-bit encodings perhaps ? (since all proper 7-bit ASCII are already valid UTF-8 too)
 Also, if we move over to full blown UTF, we won't have to give up ASCII. 
  It seems to me like the phobos std.string functions are pretty much 
 ASCII string manipulating functions (no multibyte string support).  So 
 just copy those out to a separate library, call it "ASCII lib", and 
 there's your library support for ASCII.  That leaves string literals, 
 which is a slight problem, but I suppose easily fixed:
 ubyte[] hi = "hello!"a;

I don't understand this, why can't you use UTF-8 for this ?
  char[] hi = "hello!";
 Just add a postfix 'a' for strings which makes the string an ASCII 
 literal, of type ubyte[].  D arrays don't seem powerful enough to do UTF 
 manipulations without special attention, but they are powerful enough to 
 do ASCII manipulations without special attention, so using ubyte[] as an 
 ASCII string should give full language support for these.  Given that 
 and ASCIILIB you pretty much have the current D string manipulation 
 capabilities afaik, and it will be fast.

What is not powerful enough about the foreach(dchar c; str) ? It will step through that UTF-8 array one codepoint at a time. --anders
Sep 29 2006
↑ ↓ Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Anders F Björklund wrote:
 Chad J wrote:
 
 I'd like the default to be UTF. Then we can have a base of code to
 correctly manipulate UTF strings (in phobos and language supported).
 Writing correct ASCII manipulation routine without good library/language
 support is a lot easier than writing good UTF manipulation routines
 without good library/language support, and UTF will probably be used
 much more than ASCII.

But D already uses Unicode for all strings, encoded as UTF ? When you say "ASCII", do you mean 8-bit encodings perhaps ? (since all proper 7-bit ASCII are already valid UTF-8 too)

Probably 7-bit. Anything where the size of one character is ALWAYS one byte. I am already assuming that ASCII is a subset or at least is mostly a subset of UTF8. However, I talk about it in an exclusive manner because if you handle UTF8 strings properly then the code will probably run at least slightly slower than with ASCII-only strings.
 Also, if we move over to full blown UTF, we won't have to give up 
 ASCII.  It seems to me like the phobos std.string functions are pretty 
 much ASCII string manipulating functions (no multibyte string 
 support).  So just copy those out to a separate library, call it 
 "ASCII lib", and there's your library support for ASCII.  That leaves 
 string literals, which is a slight problem, but I suppose easily fixed:
 ubyte[] hi = "hello!"a;

I don't understand this, why can't you use UTF-8 for this ?
  char[] hi = "hello!";

I was talking about IF we made char[] into a datatype that handles all of those odd corner cases correctly (slices into multibyte strings, for instance) then it will no longer be the same fast ASCII-only routines. So for those who want the fast ASCII-only stuff, it would be nice to specify a way to make string literals such that each character in the literal takes only one byte, without ugly casting. To get an ASCII monobyte string from a string literal in D I would have to do the following:
  ubyte[] hi = cast(ubyte[])"hello!";
hmmm, yuck.
 Just add a postfix 'a' for strings which makes the string an ASCII 
 literal, of type ubyte[].  D arrays don't seem powerful enough to do 
 UTF manipulations without special attention, but they are powerful 
 enough to do ASCII manipulations without special attention, so using 
 ubyte[] as an ASCII string should give full language support for 
 these.  Given that and ASCIILIB you pretty much have the current D 
 string manipulation capabilities afaik, and it will be fast.

What is not powerful enough about the foreach(dchar c; str) ? It will step through that UTF-8 array one codepoint at a time.

I'm assuming 'str' is a char[], which would make that very nice. But it doesn't solve correctly slicing or indexing into a char[]. If nothing was done about this and I absolutely needed UTF support, I'd probably make a class like so:
  class String
  {
    char[] data;
    ...
    dchar opIndex( int index )
    {
      foreach( int i, dchar c; data )
      {
        if ( i == index )
          return c;
        i++;
      }
    }
    // similar thing for opSlice down here
    ...
  }
Which is probably slower than could be done. All in all it is a drag that we should have to learn all of this UTF stuff. I want char[] to just work!
Sep 29 2006
Anders F Björklund <afb algonet.se> writes:
Chad J wrote:

 Probably 7-bit.  Anything where the size of one character is ALWAYS one 
 byte.  I am already assuming that ASCII is a subset or at least is 
 mostly a subset of UTF8.  However, I talk about it in an exclusive 
 manner because if you handle UTF8 strings properly then the code will 
 probably run at least slightly slower than with ASCII-only strings.

It's mostly about looking out for the UTF "control" characters, which is not more than a simple assertion in your ASCII-only functions really... I don't think handling UTF-8 properly is a burden for string functions, when you compare it with the enormous gain that it has over ASCII-only.
 What is not powerful enough about the foreach(dchar c; str) ?
 It will step through that UTF-8 array one codepoint at a time.

I'm assuming 'str' is a char[], which would make that very nice. But it doesn't solve correctly slicing or indexing into a char[].

Well, it's also a lot "trickier" than that... For instance, my last name can be written in Unicode as Björklund or Bj¨orklund, both of which are valid - only that in one of them, the 'ö' occupies two full code points! It's still a single character, which is why Unicode avoids that term... As you know, if you need to access your strings by codepoint (something that the Unicode group explicitly recommends against, in their FAQ) then char[] isn't a very nice format - because of the conversion overhead... But it's still possible to translate, transform, and translate back ?
 If nothing was done about this and I absolutely needed UTF support,
 I'd probably make a class like so: [...]

In my own mock String class, I cached the dchar[] codepoints on demand. (viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)
 All in all it is a drag that we should have to learn all of this UTF 
 stuff.  I want char[] to just work!

Using Unicode strings and characters does require a little learning... (where http://www.unicode.org/faq/utf_bom.html is a very good page) And D does force you to think about string implementation, no question. This has both pros and cons, but it is a deliberate language decision. If you're willing to handle the "surrogates", then UTF-16 is a rather good trade-off between the default UTF-8 and wasteful UTF-32 formats ? A downside is that it is not "ascii-compatible" (has embedded NUL chars) and that it is endian-dependent unlike the more universal UTF-8 format. --anders
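A minimal sketch of the "one character, two code points" case described above, assuming a current D compiler with today's Phobos (std.uni and std.range postdate this thread). The helper names codePoints and graphemes are made up for illustration; normalize, NFC, byGrapheme and walkLength are existing Phobos symbols.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helpers: count code points and user-perceived characters.
size_t codePoints(string s) { return s.walkLength; }
size_t graphemes(string s)  { return s.byGrapheme.walkLength; }

void main()
{
    string composed   = "Bj\u00F6rklund";   // ö as a single code point
    string decomposed = "Bjo\u0308rklund";  // o followed by a combining diaeresis

    // Same text to a reader, different code point sequences.
    assert(composed != decomposed);
    assert(codePoints(composed) == 9);
    assert(codePoints(decomposed) == 10);

    // Counted as graphemes, both spellings have nine characters.
    assert(graphemes(composed) == 9);
    assert(graphemes(decomposed) == 9);

    // Normalizing to NFC folds the two-code-point form into the composed one.
    assert(normalize!NFC(decomposed) == composed);
}
```

Comparing or indexing such strings by code point only gives stable answers after both sides are normalized to the same form.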
Sep 29 2006
→ Georg Wrede <georg.wrede nospam.org> writes:
Anders F Björklund wrote:
 If you're willing to handle the "surrogates", then UTF-16 is a rather
 good trade-off between the default UTF-8 and wasteful UTF-32 formats ?
 A downside is that it is not "ascii-compatible" (has embedded NUL chars)
 and that it is endian-dependent unlike the more universal UTF-8 format.

Problem is, using 16-bit you sort-of get away with _almost_ all of it. But as a pay-back, the day your 16 bits don't suffice, you're in deep crap. And that day _will_ come.
Sep 29 2006
Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Anders F Björklund wrote:
 What is not powerful enough about the foreach(dchar c; str) ?
 It will step through that UTF-8 array one codepoint at a time.

I'm assuming 'str' is a char[], which would make that very nice. But it doesn't solve correctly slicing or indexing into a char[].

Well, it's also a lot "trickier" than that... For instance, my last name can be written in Unicode as Björklund or Bj¨orklund, both of which are valid - only that in one of them, the 'ö' occupies two full code points! It's still a single character, which is why Unicode avoids that term...

So it seems to me the problem is that those 2 bytes are both 2 characters and 1 character at the same time. In this case, I'd prefer being able to index to a safe default (like the ö, instead of the umlauts next to the o), or not being able to index at all.
 As you know, if you need to access your strings by codepoint (something 
 that the Unicode group explicitly recommends against, in their FAQ) then 
 char[] isn't a very nice format - because of the conversion overhead...
 But it's still possible to translate, transform, and translate back ?
 

I read that FAQ at the bottom of this post, and didn't see anything about accessing strings by codepoint. Maybe you mean a different FAQ here, in which case, could I have a link please? I've been to the unicode site before and all I remember was being confused and having a hard time finding the info I wanted :( Also I still am not sure exactly what a code point is. And that FAQ at the bottom used the word "surrogate" a lot; I'm not sure about that one either. When you say char[] isn't a nice format, I wasn't thinking about having the string class I mentioned earlier store the data ONLY as char[]. It might be wchar[]. Or dchar[]. Then it would be automatically converted between the two either at compile time (when possible) or dynamically at runtime (hopefully only when needed). So if someone throws a Chinese character literal at it, there is a very big clue there to use UTF32 or something that can store all of the characters in a uniform width sort of way, to speed indexing. Algorithms could be used so that a program 'learns' at runtime what kind of strings are dominating the program, and uses algorithms optimized for those. Maybe this is a bit too complex, but I can dream, hehe.
 If nothing was done about this and I absolutely needed UTF support,
 I'd probably make a class like so: [...]

In my own mock String class, I cached the dchar[] codepoints on demand. (viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)
 All in all it is a drag that we should have to learn all of this UTF 
 stuff.  I want char[] to just work!

Using Unicode strings and characters does require a little learning... (where http://www.unicode.org/faq/utf_bom.html is a very good page) And D does force you to think about string implementation, no question. This has both pros and cons, but it is a deliberate language decision. If you're willing to handle the "surrogates", then UTF-16 is a rather good trade-off between the default UTF-8 and wasteful UTF-32 formats ? A downside is that it is not "ascii-compatible" (has embedded NUL chars) and that it is endian-dependent unlike the more universal UTF-8 format. --anders

My impression has gone from being quite scared of UTF to being not so worried, but only for myself. D seems to be good at handling UTF, but only if someone tells you to never handle strings as arrays of characters. Unfortunately, the first thing you see in a lot of D programs is "int main( char[][] args )" and there are some arrays of characters being used as strings. This also means that some array capabilities like indexing and the braggable slicing are more dangerous than useful for string handling. It's a newbie trap. Like I said earlier, I either want to be able to index/slice strings safely, or not at all (or better yet, not by any intuitive means).
Sep 30 2006
↑ ↓ → Anders F Björklund <afb algonet.se> writes:
Chad J wrote:

 I read that FAQ at the bottom of this post, and didn't see anything 
 about accessing strings by codepoint.  Maybe you mean a different FAQ 
 here, in which case, could I have a link please?  I've been to the 
 unicode site before and all I remember was being confused and having a 
 hard time finding the info I wanted :(

I meant http://www.unicode.org/faq/utf_bom.html#12
 Also I still am not sure exactly what a code point is.  And that FAQ at 
 the bottom used the word "surrogate" a lot; I'm not sure about that one 
 either.

Code point is the closest thing to a "character", although it might take more than one Unicode code point to represent a single Unicode grapheme. Surrogates are used with UTF-16, to represent "too large" code points... i.e. they always occur in "surrogate pairs", which combine to a single
 When you say char[] isn't a nice format, I wasn't thinking about having 
 the string class I mentioned earlier store the data ONLY as char[].  It 
 might be wchar[].  Or dchar[].  Then it would be automatically converted 
 between the two either at compile time (when possible) or dynamically at 
 runtime (hopefully only when needed).  So if someone throws a Chinese 
 character literal at it, there is a very big clue there to use UTF32 or 
 something that can store all of the characters in a uniform width sort 
 of way, to speed indexing.  Algorithms could be used so that a program 
 'learns' at runtime what kind of strings are dominating the program, and 
 uses algorithms optimized for those.  Maybe this is a bit too complex, 
 but I can dream, hehe.

Actually I said that dchar[] (i.e. UTF-32) wasn't ideal, but anyway... (UTF-8 or UTF-16 is preferable, for the reasons in the UTF FAQ above) We already have char[] as the string default in D, but most models for a String class use wchar[] (i.e. UTF-16), for instance Mango or Java:
* http://mango.dsource.org/classUString.html (uses the ICU lib)
* http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html
All formats do use Unicode, so converting from one UTF to another is mostly a question of memory/performance and not about any data loss. However, it is not converted at compile time (without using templates) so mixing and matching different representations is somewhat of a pain. I think that char[] for string and wchar[] for String are good defaults.
 My impression has gone from being quite scared of UTF to being not so 
 worried, but only for myself.  D seems to be good at handling UTF, but 
 only if someone tells you to never handle strings as arrays of 
 characters.  Unfortunately, the first thing you see in a lot of D 
 programs is "int main( char[][] args )" and there are some arrays of 
 characters being used as strings.  This also means that some array 
 capabilities like indexing and the braggable slicing are more dangerous 
 than useful for string handling.  It's a newbie trap.

It is, since it isn't really "arrays of characters" but "arrays of code units". What muddies the waters further is that sometimes they're equal. That is, with ASCII characters each character fits into a D char unit. Without surrogates, each character (from BMP) fits into one wchar unit. However, all code that handles the shorter formats should be prepared to handle non-ASCII (for UTF-8) and surrogates (for UTF-16), or use UTF-32:
  bool isAscii(char c) { return (c <= 0x7f); }
  bool isSurrogate(wchar c) { return (c >= 0xD800 && c <= 0xDFFF); }
But a warning that D uses multi-byte strings might be in order, yes... Another warning that it only supports UTF-8 platforms* might also be ? --anders
* "main(char[][] args)" does not work for any non-UTF consoles, as you will get invalid UTF sequences for the non-ASCII chars.
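As a rough illustration of what "being prepared to handle surrogates" involves, here is a sketch of combining a UTF-16 surrogate pair into the code point it encodes. The function names are hypothetical and not part of Phobos; the arithmetic follows the standard UTF-16 definition.

```d
// Hypothetical helpers, named here for illustration only.
bool isHighSurrogate(wchar c) { return c >= 0xD800 && c <= 0xDBFF; }
bool isLowSurrogate(wchar c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Combines a high/low surrogate pair into a single code point above U+FFFF.
dchar combineSurrogates(wchar hi, wchar lo)
{
    assert(isHighSurrogate(hi) && isLowSurrogate(lo));
    // Each surrogate carries 10 bits; the pair is offset by 0x10000.
    return cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}

void main()
{
    // U+1D11E (musical symbol G clef) is D834 DD1E in UTF-16.
    assert(combineSurrogates(0xD834, 0xDD1E) == 0x1D11E);

    // Code units from the BMP are not surrogates and stand alone.
    assert(!isHighSurrogate('A') && !isLowSurrogate('A'));
}
```

In practice the library decoders (or a foreach over dchar) do this for you; the sketch only shows why a single wchar cannot always be treated as a character.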
Oct 01 2006
Anders F Björklund <afb algonet.se> writes:
Chad J wrote:

 char[] data; 

   dchar opIndex( int index )
   {
     foreach( int i, dchar c; data )
     {
       if ( i == index )
         return c;
 
       i++;
     }
   }

This code probably does not work as you think it does... If you loop through a char[] using dchars (with a foreach), then the int will get the codeunit index - *not* codepoint. (the ++ in your code above looks more like a typo though, since it needs to *either* foreach i, or do it "manually")
  import std.stdio;
  void main()
  {
    char[] str = "Björklund";
    foreach(int i, dchar c; str)
    {
      writefln("%4d \\U%08X '%s'", i, c, ""d ~ c);
    }
  }
Will print the following sequence:
     0 \U00000042 'B'
     1 \U0000006A 'j'
     2 \U000000F6 'ö'
     4 \U00000072 'r'
     5 \U0000006B 'k'
     6 \U0000006C 'l'
     7 \U00000075 'u'
     8 \U0000006E 'n'
     9 \U00000064 'd'
Notice how the non-ASCII character takes *two* code units ? (if you expect indexing to use characters, that'd be wrong) More at http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs --anders
Sep 29 2006
↑ ↓ → Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Anders F Björklund wrote:
 Chad J wrote:
 
 char[] data; 

   dchar opIndex( int index )
   {
     foreach( int i, dchar c; data )
     {
       if ( i == index )
         return c;

       i++;
     }
   }

This code probably does not work as you think it does... If you loop through a char[] using dchars (with a foreach), then the int will get the codeunit index - *not* codepoint. (the ++ in your code above looks more like a typo though, since it needs to *either* foreach i, or do it "manually")
  import std.stdio;
  void main()
  {
    char[] str = "Björklund";
    foreach(int i, dchar c; str)
    {
      writefln("%4d \\U%08X '%s'", i, c, ""d ~ c);
    }
  }
Will print the following sequence:
     0 \U00000042 'B'
     1 \U0000006A 'j'
     2 \U000000F6 'ö'
     4 \U00000072 'r'
     5 \U0000006B 'k'
     6 \U0000006C 'l'
     7 \U00000075 'u'
     8 \U0000006E 'n'
     9 \U00000064 'd'
Notice how the non-ASCII character takes *two* code units ? (if you expect indexing to use characters, that'd be wrong) More at http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs --anders

ah. And yep the i++ was a typo (oops). So maybe something like:
  dchar opIndex( int index )
  {
    int i;
    foreach( dchar c; data )
    {
      if ( i == index )
        return c;
      i++;
    }
  }
The i is no longer the foreach's index, so the i++ isn't a typo anymore. Thanks for the info. I'll check out that faq a little later, gotta go.
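For comparison, a self-contained sketch of the same idea against current Phobos, assuming std.utf.decode is available (it advances a code-unit offset past one code point). The free function codePointAt is a hypothetical name, and it throws when the index is past the end, a case the version above leaves open.

```d
import std.utf : decode;

// Hypothetical helper: returns the code point at position `index`,
// counted in code points rather than UTF-8 code units. Runs in O(n).
dchar codePointAt(const(char)[] data, size_t index)
{
    size_t pos = 0;  // code unit offset into data
    size_t n = 0;    // number of code points decoded so far
    while (pos < data.length)
    {
        dchar c = decode(data, pos);  // decodes one code point and advances pos
        if (n == index)
            return c;
        ++n;
    }
    throw new Exception("code point index out of range");
}

void main()
{
    // Index 2 is the ö, index 3 the r, regardless of how many bytes ö takes.
    assert(codePointAt("Bj\u00F6rklund", 2) == '\u00F6');
    assert(codePointAt("Bj\u00F6rklund", 3) == 'r');
}
```

Every lookup walks the string from the start, which is the conversion overhead mentioned earlier in the thread; caching a dchar[] copy trades memory for constant-time indexing.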
Sep 29 2006
Georg Wrede <georg.wrede nospam.org> writes:
Chad J wrote:
 I will go ahead and say that the current state of char[] is incorrect. 
 That is, if you write a program manipulating char[] strings, then run it 
 in China, you will be disappointed with the results.  It won't matter 
 how fast the program runs, because bad stuff will happen like entire 
 strings becoming unreadable to the user.

Wrong. And that's precisely what I meant about the Daddy holding bike allegory a few messages back. The current system seems to work "by magic". So, if you do go to China, it'll "just work". At this point you _should_ not believe me. :-) But it still works. --- The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of bit-twiddling individual octets in these "char" arrays. So things just keep on working. --- Not convinced yet? Well, a lot of folks here are from Europe, and our languages contain "non-ASCII" characters. Our text manipulating programs still work all right. And, actually D is pretty popular in Japan. Every once in a while some Japanese guys pop on-and-off here, and some of them don't even speak English, so they use a machine translator(!) to talk with us. Just guess if they use ASCII in their programs. And you know what, most of these guys even use their own characters for variable names in D! And not one of them has complained about "disappointing results". --- That's why I continued with: keep your eyes shut and keep on coding.
Sep 29 2006
Chad J <""gamerChad\" spamIsBad gmail.com"> writes:
Georg Wrede wrote:
 The secret is, there actually is a delicate balance between UTF-8 and 
 the library string operations. As long as you use library functions to 
 extract substrings, join or manipulate them, everything is OK. And very 
 few of us actually either need to, or see the effort of bit-twiddling 
 individual octets in these "char" arrays.
 

But this is what I'm talking about... you can't slice them or index them. I might actually index a character out of an array from time to time. If I don't know about UTF, and I do just keep on coding, and I do something like this:
  char[] str = "some string in nonenglish text";
  for ( int i = 0; i < str.length; i++ )
  {
    str[i] = doSomething( str[i] );
  }
and this will fail right? If it does fail, then everything is not alright. You do have to worry about UTF. Someone has to tell you to use a foreach there.
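Whether that loop fails depends on what doSomething does to bytes at or above 0x80. The sketch below assumes a current D compiler and Phobos; upperAscii is a hypothetical stand-in for doSomething. It shows one per-code-unit pass that leaves the text valid and one single-byte write that does not.

```d
import std.ascii : toUpper;
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical stand-in for doSomething: upper-cases ASCII letters in place
// and leaves every byte >= 0x80 untouched.
char[] upperAscii(char[] s)
{
    foreach (ref char c; s)
        c = cast(char) toUpper(c);
    return s;
}

void main()
{
    char[] str = "Bj\u00F6rklund".dup;
    assert(str.length == 10);  // 9 code points, but 10 UTF-8 code units

    // An ASCII-only change per code unit keeps multibyte sequences intact.
    upperAscii(str);
    validate(str);  // does not throw: still well-formed UTF-8
    assert(str == "BJ\u00F6RKLUND");

    // Overwriting the lead byte of ö leaves a stray continuation byte behind.
    str[2] = 'o';
    assertThrown!UTFException(validate(str));
}
```

So a byte-wise loop is only safe when the operation never touches or produces non-ASCII bytes; anything else needs to work on decoded code points.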
Sep 29 2006
→ Georg Wrede <georg.wrede nospam.org> writes:
Chad J wrote:
 Georg Wrede wrote:
 
 The secret is, there actually is a delicate balance between UTF-8 and 
 the library string operations. As long as you use library functions to 
 extract substrings, join or manipulate them, everything is OK. And 
 very few of us actually either need to, or see the effort of 
 bit-twiddling individual octets in these "char" arrays.

But this is what I'm talking about... you can't slice them or index them. I might actually index a character out of an array from time to time. If I don't know about UTF, and I do just keep on coding, and I do something like this:
  char[] str = "some string in nonenglish text";
  for ( int i = 0; i < str.length; i++ )
  {
    str[i] = doSomething( str[i] );
  }
and this will fail right? If it does fail, then everything is not alright. You do have to worry about UTF. Someone has to tell you to use a foreach there.

Yes. That's why I talked about you falling down once you realise Daddy's not holding the bike. Part of UTF-8's magic lies in that it is amazingly easy to get working smoothly with truly minor tweaks to "formerly ASCII-only" libraries -- so that even the most exotic languages have no problem. Your concerns about the for loop are valid, and expected. Now, IMHO, the standard library should take care of "all" the situations where you would ever need to split, join, examine, or otherwise use strings, "non-ASCII" or not. (And I really have no complaint (Walter!) about this.) Therefore, in no normal circumstances should you have to twiddle them yourself -- unless. And this "unless" is exactly why I'm unhappy with the situation, too. Problem is, _technology_wise_ the existing setup may actually be the best, both considering ease of writing the library, ease of using it, robustness of both the library and users' code, and the headaches saved from programmers who either haven't heard of the issue (whether they're American or Chinese!), or who simply trust their lives with the machinery. So, where's the actual problem??? At this point I'm inclined to say: the documentation, and the stage props! The latter meaning: exposing the fact that our "strings" are just arrays is psychologically wrong, and even more so is the fact that we're shamelessly storing entities of variable length in arrays which have no notion of such -- even worse, while we brag with slices! If this had been a university course assignment, we'd have been thrown out of class, for both half baked work, and for arrogance towards our client, victimizing the coder. The former meaning: we should not be like "we're bad enough to overtly use plain arrays for variable-length data, now if you have a problem with it, then go home and learn stuff, or then just trust us". Both "documentation" and "stage props" ultimately meaning that the largest problem here is psychology, pedagogy, and education.
--- A lot would already be won by: merely aliasing char[] to string, and discouraging other than guru-level folks from screwing with their internals. This alone would save a lot of Fear, Uncertainty and D-phobia. The documentation should take pains in explaining up front that if you _really_ want to do Character-by-Character ops _and_ you live outside of America, then the Right way to do it (ehh, actually the Canonical Way), is to first convert the string to dchar[]. Period. Then, if somebody else knows enough of UTF-8 and knows he can handle bit twiddling more efficiently than using the Canonical Way, with plain char[] and "foreignish", then let him. But let that be undocumented and Un-Discussed in the docs. Precisely like a lot of other things are. (And should be.) And will be. He's on his own, and he ought to know it. --- In other words, the normal programmer should believe he's working with black-box Strings, and he will be happy with it. That way he'll survive whether he's in Urduland or Boise, Idaho -- without ever needing to have heard about UTF or other crap. Not un