digitalmars.D - char, wchar and dchar should be supported equally

reply James McComb <alan jamesmccomb.id.au> writes:
I like D having char, wchar and dchar. And I like the way that they will 
(soon?) implicitly convert between each other. But I don't like the way 
that D is biased towards char. I think that char, dchar and wchar should 
be supported equally.

For example, modern Windows systems support UTF-16 (via the W 
functions). So you might decide to use wchar, because that is also 
UTF-16. The Windows API expects zero-terminated strings, and you can 
clearly indicate this in your code by calling toStringz. But toStringz 
takes char, so your wchar will be implicitly converted to char and then 
implicitly converted back to wchar. So there is no point using wchar!

But what if every function in std.string had wchar and dchar versions?
Then you could use wchar and call wtoStringz. (At the end of this email, 
there is some working code showing how this could be implemented using 
templates and aliases. There are other ways that std.string could 
support wchar and dchar, such as function overloading or function 
templates.)

Also, in order for char, wchar and dchar to be supported equally, Object 
should have wtoString and dtoString methods. (Because toString cannot be 
overloaded based on its return type.)

Does anyone else out there feel the same? Or should I get over it and 
JUC (Just Use Char) like I already JUB (Just Use Bit)?

James McComb

<code>
import std.stdio;

template TStringFunctions(T) {
     T[] toStringz(T[] str) {
         if (!str)
             return "";

         T[] copy = str.dup;
         return copy ~= '\0';
     }

     // Other string functions...
}

alias TStringFunctions!(char)  stringFunctions;
alias TStringFunctions!(wchar) wstringFunctions;
alias TStringFunctions!(dchar) dstringFunctions;

alias stringFunctions.toStringz  toStringz;
alias wstringFunctions.toStringz wtoStringz;
alias dstringFunctions.toStringz dtoStringz;

// Other string function aliases...

// Example usage
void main() {
     char[]   str = "utf-8 string";
     wchar[] wstr = "utf-16 string";

     str  = toStringz(str);
     wstr = wtoStringz(wstr);
}
</code>
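For comparison, here is a minimal sketch of the "function templates" alternative mentioned above. It assumes present-day D (D2) rather than the 2005 compiler, and `zeroTerminated` is a hypothetical name chosen so that it does not clash with the real std.string.toStringz; it is not part of Phobos.

```d
import std.traits : isSomeChar;

// Hypothetical helper (not in Phobos): returns a zero-terminated copy of
// str for any of the three character types, with no transcoding.
C[] zeroTerminated(C)(const(C)[] str)
if (isSomeChar!C)
{
    auto copy = new C[str.length + 1];  // one extra slot for the terminator
    copy[0 .. str.length] = str[];      // element-wise copy of the input
    copy[$ - 1] = '\0';                 // char.init is not zero, so set it explicitly
    return copy;
}

unittest
{
    wchar[] w = zeroTerminated("utf-16 string"w); // C is deduced as wchar
    assert(w.length == 14 && w[$ - 1] == 0);
}
```

With a single template the wtoStringz/dtoStringz aliases are not needed, because the character type is deduced from the argument.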
Jun 03 2005
next sibling parent reply Trevor Parscal <trevorparscal hotmail.com> writes:
James McComb wrote:
 I like D having char, wchar and dchar. And I like the way that they will 
 (soon?) implicitly convert between each other. But I don't like the way 
 that D is biased towards char. I think that char, dchar and wchar should 
 be supported equally.
 
 For example, modern Windows systems support UTF-16 (via the W 
 functions). So you might decide to use wchar, because that is also 
 UTF-16. The Windows API expects zero-terminated strings, and you can 
 clearly indicate this in your code by calling toStringz. But toStringz 
 takes char, so your wchar will be implicitly converted to char and then 
 implicitly converted back to wchar. So there is no point using wchar!
 
 But what if every function in std.string had wchar and dchar versions?
 Then you could use wchar and call wtoStringz. (At the end of this email, 
 there is some working code showing how this could be implemented using 
 templates and aliases. There are other ways that std.string could 
 support wchar and dchar, such as function overloading or function 
 templates.)
 
 *snip* Object should have wtoString and dtoString methods. 
 

Well, wtoString is a bad naming convention. I think toWString or toDString makes a little more sense, but to be honest, I think it should work like read and write, and return char[], wchar[], or dchar[] based on what you cast.

That's my two cents anyhoo, as an avid dchar[] user.

--
Thanks,
Trevor Parscal
www.trevorparscal.com
trevorparscal hotmail.com
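One way to read "return whichever type the caller asks for" is a function template whose template argument is the desired string type. The sketch below assumes present-day D (D2); `toStringAs` and `Point` are hypothetical names used only for illustration, and nothing like `toStringAs` exists on Object.

```d
import std.conv : to;
import std.traits : isSomeString;

// Hypothetical helper: the caller names the string type it wants back.
S toStringAs(S)(Object o)
if (isSomeString!S)
{
    // Object.toString always yields UTF-8; to!S transcodes it to the
    // requested string type.
    return o.toString().to!S;
}

// Illustrative class with an ordinary toString override.
class Point
{
    int x, y;
    this(int x, int y) { this.x = x; this.y = y; }
    override string toString()
    {
        return "(" ~ x.to!string ~ ", " ~ y.to!string ~ ")";
    }
}

unittest
{
    auto p = new Point(1, 2);
    wstring w = toStringAs!wstring(p);
    dstring d = toStringAs!dstring(p);
    assert(w == "(1, 2)"w && d == "(1, 2)"d);
}
```

This sidesteps overloading on return type, but it still converts from char[] internally, so it does not remove the bias towards char that the thread is about.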
Jun 03 2005
parent reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
Trevor Parscal wrote:
 James McComb wrote:
 
 I like D having char, wchar and dchar. And I like the way that they 
 will (soon?) implicitly convert between each other. But I don't like 
 the way that D is biased towards char. I think that char, dchar and 
 wchar should be supported equally.

 For example, modern Windows systems support UTF-16 (via the W 
 functions). So you might decide to use wchar, because that is also 
 UTF-16. The Windows API expects zero-terminated strings, and you can 
 clearly indicate this in your code by calling toStringz. But toStringz 
 takes char, so your wchar will be implicitly converted to char and 
 then implicitly converted back to wchar. So there is no point using 
 wchar!

 But what if every function in std.string had wchar and dchar versions?
 Then you could use wchar and call wtoStringz. (At the end of this 
 email, there is some working code showing how this could be 
 implemented using templates and aliases. There are other ways that 
 std.string could support wchar and dchar, such as function overloading 
 or function templates.)

 *snip* Object should have wtoString and dtoString methods.

well.. wtoString is a bad naming convention.. I think toWString or toDString makes a little more sense, but to be honest, I think it should work like read and write, and return char[], wchar[], or dchar[] based on what you cast. That's my two cents anyhoo, as an avid dchar[] user.

I think that toString, or any std function that takes a string and processes it, should always take dchar and return dchar. Assuming that dchar is implicitly convertible to char and wchar, there can be no loss of information when doing something like:

<code>
dchar[] someFunction(dchar[])
...
...
wchar[] wtest = ...
wtest = someFunction(wtest); //no loss
...
char[] test = ..
test = someFunction(test); //no loss
</code>

Of course I may be wrong, but I'm assuming that converting a char to wchar is like converting an int to double, where any extra space is just filled with zeros (speaking at the bit level), and you can convert an int to double, process it, convert it back to int, and assume that no information will be lost because of the conversion to double. Of course, information can be lost if "int" is not enough to store the value returned from the function, but this has nothing to do with converting back and forth between int and double.
Jun 03 2005
next sibling parent reply Trevor Parscal <trevorparscal hotmail.com> writes:
Hasan Aljudy wrote:
 
 I think that toString or any std function that takes a string and 
 processes it, should always take dchar and return dchar.
 

The best idea for this I have heard thus far. Especially since, any time you are doing a toString, you aren't going to be worried about the additional overhead of a dchar[] (or so I believe).

--
Thanks,
Trevor Parscal
www.trevorparscal.com
trevorparscal hotmail.com
Jun 03 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Fri, 03 Jun 2005 20:42:25 -0700, Trevor Parscal  
<trevorparscal hotmail.com> wrote:
 Hasan Aljudy wrote:
  I think that toString or any std function that takes a string and  
 processes it, should always take dchar and return dchar.

The best idea for this I have heard thus far. Especially since, any time you are doing a toString, you aren't going to be worried about the additional overhead of a dchar[] (or so I believe)

If you're using char[] then it gets converted to dchar[], processed, then converted back. That's not ideal IMO. Ideally we only want conversion to happen in 1, or at most 2 places.

1. Data is converted on input from <input format> to <internal format>.
2. Data is converted on output from <internal format> to <output format>.

Sometimes applications will do #1, sometimes they will do #2, sometimes they will do both (for one reason or another). Each application will have a different <internal format> chosen for some specific reason, perhaps even a different <internal format> for each group of data.

So, ideally we require 3 variants of every single string function. But of course, we don't want to be repeating ourselves all the time; in fact we want only one 'function' that we re-use for all 3 string types. So, might I suggest using templates, e.g.

<code>
import std.stdio;
import std.ctype;

template toLowerT(Type) {
  Type[] toLowerT(Type[] input) {
    Type[] res = input.dup;
    foreach(inout Type c; res)
        c = tolower(c);
    return res;
  }
}

alias toLowerT!(char) toLower;
alias toLowerT!(wchar) toLower;
alias toLowerT!(dchar) toLower;

void main() {
  char[] a = "REGAN";
  wchar[] b = "WAS";
  dchar[] c = "HERE";

  //we can even use the x.fn() form as opposed to fn(x) if we wish.
  writefln("%s=%s",a,a.toLower());
  writefln("%s=%s",b,b.toLower());
  writefln("%s=%s",c,c.toLower());
}
</code>

NOTE: I realise using ctype's tolower function will only work with ASCII, not the full complement of unicode characters. This is a semi-functional example only.

Regan
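For readers on present-day D (D2), roughly the same idea can be written as a single function template, so the three aliases are not needed. This is only a sketch: `asciiToLower` is a hypothetical name, and like the example above it handles ASCII letters only (it relies on std.ascii.toLower, which leaves every other code unit untouched).

```d
import std.ascii : toLower;
import std.traits : isSomeChar;

// Hypothetical helper: lower-cases ASCII letters in a char[], wchar[] or
// dchar[] string without converting between encodings.
C[] asciiToLower(C)(const(C)[] input)
if (isSomeChar!C)
{
    auto res = input.dup;          // mutable copy in the same encoding
    foreach (ref c; res)           // iterates code units, no decoding
        c = cast(C) toLower(c);    // non-ASCII units pass through unchanged
    return res;
}

unittest
{
    assert(asciiToLower("REGAN")  == "regan");
    assert(asciiToLower("WAS"w)   == "was"w);
    assert(asciiToLower("HERE"d)  == "here"d);
}
```

Because each instantiation works on its own code unit type, no conversion happens at all, which matches the "convert only at input and output" goal.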
Jun 03 2005
parent James McComb <alan jamesmccomb.id.au> writes:
Regan Heath wrote:

 template toLowerT(Type) {
   Type[] toLowerT(Type[] input) {
     Type[] res = input.dup;
     foreach(inout Type c; res)
         c = tolower(c);
     return res;
   }
 }
 
 alias toLowerT!(char) toLower;
 alias toLowerT!(wchar) toLower;
 alias toLowerT!(dchar) toLower;

Thinks: so that's how you do it! :)

This is the kind of thing I had in mind. Is there any chance that std.string actually *will* be implemented like this?

James McComb
Jun 04 2005
prev sibling next sibling parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Fri, 03 Jun 2005 21:37:23 -0600, Hasan Aljudy <hasan.aljudy gmail.com>  
wrote:
 of course I maybe wrong, but I'm assuming that converting a char to  
 wchar is like converting an int to double .. where any extra space is  
 just filled with zeros (speaking in the bit level)

Yes and No. In many cases, yes, especially where ASCII is used. However some UTF-8 'characters'/'glyphs' (not sure what the correct term is exactly) take 2 or more chars (UTF-8 code units) to represent, so when converting them you might go from 3 chars to 1 wchar (1 UTF-16 code unit), which is a decrease in the byte space required, and often a change in the values of the code units.
 , and you can convert an int to double, process it, and convert it back  
 to int, and assume that no information will be lost because of the  
 conversion to double.

Converting to/from char[], wchar[] and dchar[] causes no loss of data, ever. All existing glyphs can be represented in UTF-8 (char[]), UTF-16 (wchar[]) and UTF-32 (dchar[]), thus all existing strings can be represented in all types. Of course that representation uses a different number of bytes and may in fact use different bit patterns (code units) as well.

Regan
Jun 03 2005
parent reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
Regan Heath wrote:
  > Converting to/from char[], wchar[] and dchar[] causes no loss of data,
 ever. All existing glyphs can be represented in UTF-8(char[]),  
 UTF-16(wchar[]) and UTF-32(dchar[]), thus all existing strings can be  
 represented in all types. Of course that representation uses a 
 different  number of bytes and may in fact use different bit 
 patterns(codepoints) as  well.
 
 Regan

What then is the point of having all of these different types? How does UTF-8 work when you only have 256 possible values?
Jun 03 2005
parent "Regan Heath" <regan netwin.co.nz> writes:
On Sat, 04 Jun 2005 00:05:46 -0600, Hasan Aljudy <hasan.aljudy gmail.com>  
wrote:
 Regan Heath wrote:
   > Converting to/from char[], wchar[] and dchar[] causes no loss of  
 data,
 ever. All existing glyphs can be represented in UTF-8(char[]),   
 UTF-16(wchar[]) and UTF-32(dchar[]), thus all existing strings can be   
 represented in all types. Of course that representation uses a  
 different  number of bytes and may in fact use different bit  
 patterns(codepoints) as  well.
  Regan

What then is the point of having all of these different types?

They're each better or worse depending on the data you're operating on.

Terminology: (I think this is correct)
Code unit == one char, wchar, or dchar.
Character == a symbol, made up of 1 or more code units.

UTF-8 is perfect if most/all of your data is ASCII, as ASCII characters have the same values in UTF-8 as they do in ASCII; ASCII is a sub-set of UTF-8 (which can also represent characters that do not exist in ASCII).

UTF-16 is better than UTF-8 in cases where most/all of your data would take 2 or more UTF-8 code units to represent. Essentially UTF-16 can store some characters in less space than UTF-8 can.

UTF-32 is better than UTF-16 in cases where most/all of your data would take 2 or more UTF-16 code units to represent. Some people choose to use UTF-32 as you can guarantee a code unit == a character, meaning the dchar[]'s length property is the 'string' length (this is not always the case with wchar[] or char[], due to some characters taking more than 1 code unit).
 How does UTF-8 work? when you only have 256 possible values?

In essence it uses between 1 and 4 code units to represent a single character. Someone probably has a better reference than this:
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-AppendixA
I just quickly googled that up.

Regan
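A small sketch of the 1-to-4 rule and of the lossless round trip, assuming present-day D (D2) and its std.conv.to for transcoding. `codeUnitCounts` is a hypothetical helper written just for this illustration.

```d
import std.conv : to;

// Hypothetical helper: how many code units the same text needs as
// UTF-8, UTF-16 and UTF-32, in that order.
size_t[3] codeUnitCounts(string s)
{
    return [s.length, s.to!wstring.length, s.to!dstring.length];
}

unittest
{
    // U+0061 'a', U+00E9, U+20AC and U+1F600 need 1, 2, 3 and 4 UTF-8
    // code units respectively.
    string s = "a\u00E9\u20AC\U0001F600";
    auto n = codeUnitCounts(s);
    assert(n[0] == 10);  // UTF-8:  1 + 2 + 3 + 4
    assert(n[1] == 5);   // UTF-16: 1 + 1 + 1 + 2 (one surrogate pair)
    assert(n[2] == 4);   // UTF-32: one code unit per code point

    // Converting to another encoding and back loses nothing.
    assert(s.to!wstring.to!string == s);
    assert(s.to!dstring.to!string == s);
}
```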
Jun 03 2005
prev sibling parent reply Anders F Björklund <afb algonet.se> writes:
Hasan Aljudy wrote:

 I think that toString or any std function that takes a string and 
 processes it, should always take dchar and return dchar.

That's like saying that booleans should always be represented with "int", and I'm afraid it won't fly around here since we're obsessed with the size of variables more than processing time :-)

Conversion is a real problem, but at least you can do:
    char[] str; foreach(dchar c; str) { ... }
Plus some ASCII shortcuts, when the high bit isn't set.

Much more on http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs
(and several other pages on the Wiki4D, like Derek's RFE: "FeatureRequestList/ImplicitConversionBetweenUTF")

--anders

PS. You probably meant to say "dchar[]", and not dchar ?
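Both tricks can be sketched in a few lines, assuming present-day D (D2). `countCodePoints` and `allAscii` are hypothetical helper names, not Phobos functions.

```d
// Hypothetical helper: foreach with a dchar loop variable decodes the
// UTF-8 text on the fly, so this counts code points, not code units.
size_t countCodePoints(const(char)[] str)
{
    size_t n;
    foreach (dchar c; str)
        ++n;
    return n;
}

// Hypothetical helper showing the ASCII shortcut: a code unit whose high
// bit is clear is a complete character on its own.
bool allAscii(const(char)[] str)
{
    foreach (char c; str)
        if (c & 0x80)
            return false;
    return true;
}

unittest
{
    string s = "a\u00E9\u20AC";       // 1 + 2 + 3 UTF-8 code units
    assert(s.length == 6);
    assert(countCodePoints(s) == 3);
    assert(allAscii("hello") && !allAscii(s));
}
```

When `allAscii` holds, indexing and `.length` on the char[] behave exactly as they would on a dchar[], so no conversion is needed for that input.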
Jun 04 2005
parent reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
Anders F Björklund wrote:
 Hasan Aljudy wrote:
 
 I think that toString or any std function that takes a string and 
 processes it, should always take dchar and return dchar.

That's like saying that booleans should always be represented with "int", and I'm afraid it won't fly around here since we're obsessed with the size of variables more than processing time :-)

No, it's not like representing booleans with ints .. it's actually like saying ints should always be represented by doubles. booleans are not numbers, there is no reason to represent them as numbers, and no one should ever store numbers in booleans. But char, wchar, and dchar are all characters, just with different storage space. I don't really think anybody cares about size, most people who care would care most about performance (processing time). imagine if all std functions used short instead of int ;) that could be a serious problem.
 Conversion is a real problem, but at least you can do:
    char[] str; foreach(dchar c; str) { ... }
 Plus some ASCII shortcuts, when the high bit isn't set.
 

I don't like having to read the unicode specs to be able to deal with simple things like char. Your "ASCII shortcuts" would be low-level stuff dealing with how char and dchar are represented in memory. C'mon people, D is a high level language.
Jun 04 2005
next sibling parent "Kris" <fu bar.com> writes:
It would be great to resolve this ongoing concern. However, you might
consider trying the ICU project for all your unicode needs ~ it's what Java
uses under the covers:
http://www-306.ibm.com/software/globalization/icu/index.jsp

There's a D interface available over here, along with a well-rounded String
class: http://dsource.org/forums/viewtopic.php?t=148

- Kris

"Hasan Aljudy" <hasan.aljudy gmail.com> wrote in message
news:d7t8tc$b40$1 digitaldaemon.com...
 Anders F Björklund wrote:
 Hasan Aljudy wrote:

 I think that toString or any std function that takes a string and
 processes it, should always take dchar and return dchar.

That's like saying that booleans should always be represented with "int", and I'm afraid it won't fly around here since we're obsessed with the size of variables more than processing time :-)

No, it's not like representing booleans with ints .. it's actually like saying ints should always be represented by doubles. booleans are not numbers, there is no reason to represent them as numbers, and no one should ever store numbers in booleans. But char, wchar, and dchar are all characters, just with different storage space. I don't really think anybody cares about size, most people who care would care most about performance (processing time). imagine if all std functions used short instead of int ;) that could be a serious problem.
 Conversion is a real problem, but at least you can do:
    char[] str; foreach(dchar c; str) { ... }
 Plus some ASCII shortcuts, when the high bit isn't set.

I don't like having to read the unicode specs to be able to deal with simple things like char. Your "ASCII shortcuts" would be low-level stuff dealing with how char and dchar are represented in memory. C'mon people, D is a high level language.

Jun 04 2005
prev sibling parent reply Vathix <vathix dprogramming.com> writes:
 I don't like having to read the unicode specs to be able to deal with  
 simple things like char. Your "ASCII shortcuts" would be low-level stuff  
 dealing with how char and dchar are represented in memory.

 C'mon people, D is a high level language.

Maybe there should be isascii(char) somewhere :) Would be inlined and self documenting.
Jun 04 2005
parent reply Anders F Björklund <afb algonet.se> writes:
Vathix wrote:

 I don't like having to read the unicode specs to be able to deal with  
 simple things like char. Your "ASCII shortcuts" would be low-level 
 stuff  dealing with how char and dchar are represented in memory.

 C'mon people, D is a high level language.

Maybe there should be isascii(char) somewhere :) Would be inlined and self documenting.

I suggested that enhancement last year, but it wasn't popular... digitalmars.D.bugs/2154 Or maybe it just got lost in this crippled "bug reporting system" ? --anders
Jun 05 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Sun, 05 Jun 2005 09:25:09 +0200, Anders F Björklund wrote:

 Vathix wrote:
 
 I don't like having to read the unicode specs to be able to deal with  
 simple things like char. Your "ASCII shortcuts" would be low-level 
 stuff  dealing with how char and dchar are represented in memory.

 C'mon people, D is a high level language.

Maybe there should be isascii(char) somewhere :) Would be inlined and self documenting.

I suggested that enhancement last year, but it wasn't popular... digitalmars.D.bugs/2154 Or maybe it just got lost in this crippled "bug reporting system" ?

You mean like this ...

<code>
//---------------------------
//  --- isASCII --
// Returns true if the supplied argument is an ASCII character.
//
// Parameters:
//      (1)   -- char -- The character to test.
//   (return) -- bool -- 'true' if the character is ASCII otherwise false.
//---------------------------
bool isASCII(char c)
out(result)
{
    assert(result == (UTF8stride[c] == 1));
}
body{
    return (cast(uint)c <= 127U ? true : false);
}
unittest
{
    assert(isASCII('a') == true);
    assert(isASCII('~') == true);
    assert(isASCII('\xFF') == false);
    assert(isASCII('\x80') == false);
    assert(isASCII('\x00') == true);
    assert(isASCII(cast(char) -1) == false);
}
//---------------------------
</code>

--
Derek Parnell
Melbourne, Australia
5/06/2005 7:13:16 PM
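For readers on present-day D (D2): Phobos now ships a function of this name, std.ascii.isASCII, which takes a dchar. The short sketch below simply re-runs the same checks against it; it says nothing about how the library implements the test.

```d
import std.ascii : isASCII;

unittest
{
    // Same cases as the unittest above, against the library function.
    assert(isASCII('a'));
    assert(isASCII('~'));
    assert(isASCII('\x00'));
    assert(!isASCII('\x80'));   // first value with the high bit set
    assert(!isASCII('\xFF'));
}
```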
Jun 05 2005
parent reply Anders F Björklund <afb algonet.se> writes:
Derek Parnell wrote:

 You mean like this ...
 //---------------------------
 //  --- isASCII --
 // Returns true if the supplied argument is an ASCII character.
 //
 // Parameters:
 //      (1)   -- char -- The character to test.
 //   (return) -- bool -- 'true' if the character is ASCII otherwise false.
 //---------------------------

Is that the "Natural Docs" format ? I think I prefer Doxygen, myself:

/// Is the supplied code unit an ASCII character ?
///  param c    The UTF-8 code unit to test.
///  return     'true' if the character is ASCII
 bool isASCII(char c)
 out(result)
 {
     assert(result == (UTF8stride[c] == 1));
 }
 body{
     return (cast(uint)c <= 127U ? true : false);
 }

But surely this workaround shouldn't be needed ? If a "bool" function can't return a comparison, then there's something severely broken somewhere...

--anders
Jun 05 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Sun, 05 Jun 2005 12:09:47 +0200, Anders F Björklund wrote:

 Derek Parnell wrote:
 
 You mean like this ...
 //---------------------------
 //  --- isASCII --
 // Returns true if the supplied argument is an ASCII character.
 //
 // Parameters:
 //      (1)   -- char -- The character to test.
 //   (return) -- bool -- 'true' if the character is ASCII otherwise false.
 //---------------------------

Is that the "Natural Docs" format ?

 I think I prefer Doxygen, myself:
 /// Is the supplied code unit an ASCII character ?
 ///  param c    The UTF-8 code unit to test.
 ///  return     'true' if the character is ASCII

Good on ya.
 bool isASCII(char c)
 out(result)
 {
     assert(result == (UTF8stride[c] == 1));
 }
 body{
     return (cast(uint)c <= 127U ? true : false);
 }

But surely this workaround shouldn't be needed ? If a "bool" function can't return a comparison, then there's something severely broken somewhere...

I make a distinction between the machine code that is generated by a compiler and the source code that is read by a human.

Yes, the compiler is able to work out that a bool is returned from a comparison, but by writing it out explicitly, we also get a clear and unambiguous statement of intent by the coder. We get the same machine code generated and now it's human readable too. In other words, it is self-documenting and does not rely on the sophistication of the compiler.

--
Derek Parnell
Melbourne, Australia
5/06/2005 8:39:19 PM
Jun 05 2005
parent Anders F Björklund <afb algonet.se> writes:
Derek Parnell wrote:

Is that the "Natural Docs" format ?

Dunno. What's that ? I just made this up on the spot.

http://www.naturaldocs.org/ Whatever style is used, it should be parsable ?
 Yes, the compiler is able to work out that a bool is returned from a
 comparison, but by writing it out explicitly, we also get a clear and
 unambiguous statement of intent by the coder. We get the same machine code
 generated and now it's also human readable too.

Ah, OK, then it wasn't a compiler bug <phew>. Just a matter of opinion on readability... :-) Like: "a < b" versus "(a < b) ? true : false" --anders
Jun 05 2005
prev sibling parent reply Derek Parnell <derek psych.ward> writes:
On Sat, 04 Jun 2005 11:20:47 +1000, James McComb wrote:

 I like D having char, wchar and dchar. And I like the way that they will 
 (soon?) implicitly convert between each other. But I don't like the way 
 that D is biased towards char. I think that char, dchar and wchar should 
 be supported equally.

Yes please. I've had to write dchar[] versions of a lot of things in std.string and others. I tend to use char[] only when reading to and from files/streams, and use dchar[] for internal routines. The application I'm working on now does a lot of text processing and it is too slow to convert char[] -> dchar[], process it, convert dchar[] -> char[].

The simplicity of dchar[] is that the array index always points to the start of a character, whereas with char[] and wchar[] the index can point to somewhere inside a character. (Remembering that each character in a dchar[] string is the same size - a dchar - but characters in wchar[] and char[] have variable sizes.)

The current Phobos routines are heavily biased to char[]. Also, the use of templates is not always the best solution because there are some optimizations available, depending on the UTF encoding format used.

--
Derek Parnell
Melbourne, Australia
4/06/2005 6:08:29 PM
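The indexing point can be illustrated with a short sketch, assuming present-day D (D2). `startsCharacter` is a hypothetical helper, and "character" here means a single code point; combining sequences are outside the scope of the example.

```d
// Hypothetical helper: true if index i of UTF-8 text s is the first code
// unit of a character. Continuation units have the bit pattern 10xxxxxx.
bool startsCharacter(const(char)[] s, size_t i)
{
    return (s[i] & 0xC0) != 0x80;
}

unittest
{
    string  s8  = "\u00E9a";    // U+00E9 takes two UTF-8 code units, 'a' one
    dstring s32 = "\u00E9a"d;
    assert(s8.length == 3 && s32.length == 2);

    assert(startsCharacter(s8, 0));
    assert(!startsCharacter(s8, 1));  // index 1 is inside U+00E9
    assert(startsCharacter(s8, 2));

    assert(s32[1] == 'a');            // every dchar[] index is a whole code point
}
```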
Jun 04 2005
parent Anders F Björklund <afb algonet.se> writes:
Derek Parnell wrote:

 The current Phobos routines are heavily biased to char[]. Also, the use of
 templates is not always the best solution because there are some
 optimizations available, depending on the UTF encoding format used.

Not that anyone cares, but templates also have severe problems on other D platforms such as with the GDC compiler on Mac OS X... It's getting better, but it's like "the early days of C++" or so. --anders
Jun 04 2005