
digitalmars.D - Wide characters support in D

reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
Note: I posted this already on runtime D list, but I think that list was a
wrong one for this question. Sorry for duplication :-)

Hi. I am new to D. It looks like D supports 3 types of characters: char, wchar,
dchar. This is cool, however, I have some questions about it:

1. When we have 2 methods (one with wchar[] and another with char[]), how
will D determine which one to use if I pass the string "hello world"?
2. Many libraries (e.g. tango or phobos) don't provide functions/methods (or
have incomplete support) for wchar/dchar
e.g. writefln probably assumes char[] for strings like "Number %d..."
3. Even if they do support it, it is kind of annoying to provide methods for all 3
types of chars. Especially if we want to use the native mode (e.g. for Windows
wchar is better, for Linux char is better). E.g. Windows has _wopen, _wdirent,
_wreaddir, _wopendir, _wmain(int argc, wchar_t[] argv) and so on, and they
should be native (in the sense that no conversion is necessary when we do, for
instance, _wopen). Linux doesn't have them, as UTF-8 is used widely there.

Since the D language is targeted at system programming, why not try to use
whatever works better on a particular system (e.g. char would be 2 bytes on
Windows and 1 byte on Linux; it could be a compiler switch, and all libraries
could be compiled properly on a particular system)? It's still necessary to have
all 3 types of char for cooperation with C, but in those cases byte, short and
int will do their work. For this kind of situation, it would be nice to have
some built-in functions for transparent conversion from char to byte/short/int
and vice versa (especially if conversion only happens when needed on a
particular platform).

In my opinion, separating the notion of character from byte would be nice, and it
makes sense, as a particular platform natively uses either UTF-8 or UTF-16.
Programmers could then write universal code (like TCHAR on Windows). Unfortunately,
C uses 'char' and 'byte' interchangeably, but why does D have to make this mistake again?

Sorry if my suggestion sounds odd. Anyway, it would be great to hear something
from D gurus :-)

Ruslan.


      
Jun 07 2010
next sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Ruslan Nikolaev <nruslan_devel yahoo.com> wrote:

 1. When we have 2 methods (one with wchar[] and another with char[]),  
 how D will determine which one to use if I pass a string "hello world"?

String literals in D(2) are of type immutable(char)[] (char[] in D1) by default, and thus will be handled by the char[]-version of the function. Should you want a string literal of a different type, append a c, w, or d to specify char[], wchar[] or dchar[]. Or use a cast.
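To make the suffixes concrete, here is a minimal D2 sketch (the variable names are arbitrary):

```d
void main()
{
    auto a = "hello";   // defaults to immutable(char)[], i.e. string
    auto w = "hello"w;  // immutable(wchar)[], i.e. wstring
    auto d = "hello"d;  // immutable(dchar)[], i.e. dstring

    static assert(is(typeof(a) == string));
    static assert(is(typeof(w) == wstring));
    static assert(is(typeof(d) == dstring));
}
```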
 Since D language is targeted on system programming, why not to try to  
 use whatever works better on a particular system (e.g. char will be 2  
 bytes on Windows and 1 byte on Linux; it can be a compiler switch, and  
 all libraries can be compiled properly on a particular system).

Because this leads to unportable code, that fails in unexpected ways when moved from one system to another, thus increasing rather than decreasing the cognitive load on the hapless programmer.
 It's still necessary to have all 3 types of char for cooperation with C.  
 But in those cases byte, short and int will do their work.

Absolutely not. One of the things D tries to do is get strings right. For that purpose, all 3 types are needed.
 In my opinion, to separate notion of character from byte would be nice,  
 and it makes sense as a particular platform uses either UTF-8 or UTF-16  
 natively. Programmers may write universal code (like TCHAR on Windows).  
 Unfortunately, C uses 'char' and 'byte' interchangeably but why D has to  
 make this mistake again?

D has not. A char is a character, a possibly incomplete UTF-8 codepoint, while a byte is a byte, a humble number in the order of -128 to +127. Yes, it is possible to abuse char in D, and byte likewise. D aims to allow programmers to program close to the metal if the programmer so wishes, and thus does not pretend char is an opaque type about which nothing can be known.

-- Simen
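A minimal D2 sketch of that distinction: a char[] carries UTF-8 code units, which foreach can decode into whole code points, while byte has no encoding at all (the sample string is arbitrary):

```d
void main()
{
    string s = "héllo";     // 'é' (U+00E9) occupies two UTF-8 code units
    assert(s.length == 6);  // .length counts code units (bytes), not characters

    size_t codePoints = 0;
    foreach (dchar c; s)    // foreach over dchar decodes UTF-8 on the fly
        ++codePoints;
    assert(codePoints == 5); // five characters, despite six bytes
}
```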
Jun 07 2010
prev sibling next sibling parent Robert Clipsham <robert octarineparrot.com> writes:
On 07/06/10 22:48, Ruslan Nikolaev wrote:
 Note: I posted this already on runtime D list, but I think that list
 was a wrong one for this question. Sorry for duplication :-)

 Hi. I am new to D. It looks like D supports 3 types of characters:
 char, wchar, dchar. This is cool, however, I have some questions
 about it:

 1. When we have 2 methods (one with wchar[] and another with char[]),
 how D will determine which one to use if I pass a string "hello
 world"?

If you pass "Hello World", this is always a string (char[] in D1, immutable(char)[] in D2). If you want to specify a type with a string literal, you can use "Hello World"w or "Hello World"d for wstring and dstring respectively.
 2. Many libraries (e.g. tango or phobos) don't provide
 functions/methods (or have incomplete support) for wchar/dchar e.g.
 writefln probably assumes char[] for strings like "Number %d..."

In Tango most, if not all, string functions are templated, so they work with all string types: char[], wchar[] and dchar[]. I don't know how well Phobos supports other string types; I know Phobos 1 is extremely limited for types other than char[], and I don't know about Phobos 2.
 3.
 Even if they do support, it is kind of annoying to provide methods
 for all 3 types of chars. Especially, if we want to use native mode
 (e.g. for Windows wchar is better, for Linux char is better). E.g.
 Windows has _wopen, _wdirent, _wreaddir, _wopenddir, _wmain(int argc,
 wchar_t[] argv) and so on, and they should be native (in a sense that
 no conversion is necessary when we do, for instance, _wopen). Linux
 doesn't have them as UTF-8 is used widely there.

Enter templates! You can write the function once and have it work with all three string types with little effort involved. All the lower level functions that interact with the operating system are abstracted away nicely for you in both Tango and Phobos, so you'll never have to deal with this for basic functions. For your own it's a simple matter of templating them in most cases.
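As a sketch of that, here is one hypothetical templated function (the name countSpaces and its constraint are illustrative, not from Tango or Phobos) that accepts all three string types:

```d
// One templated implementation covering char[], wchar[] and dchar[].
size_t countSpaces(Char)(const(Char)[] s)
    if (is(Char == char) || is(Char == wchar) || is(Char == dchar))
{
    size_t n = 0;
    foreach (c; s)        // iterates code units; fine for the ASCII space
        if (c == ' ')
            ++n;
    return n;
}

void main()
{
    assert(countSpaces("hello world")  == 1); // string
    assert(countSpaces("hello world"w) == 1); // wstring
    assert(countSpaces("hello world"d) == 1); // dstring
}
```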
 Since D language is targeted on system programming, why not to try to
 use whatever works better on a particular system (e.g. char will be 2
 bytes on Windows and 1 byte on Linux; it can be a compiler switch,
 and all libraries can be compiled properly on a particular system).
 It's still necessary to have all 3 types of char for cooperation with
 C. But in those cases byte, short and int will do their work. For
 this kind of situation, it would be nice to have some built-in
 functions for transparent conversion from char to byte/short/int and
 vice versa (especially, if conversion only happens if needed on a
 particular platform).

This is something C did wrong. If compilers are free to choose their own width for the string type, you end up with the mess C has, where every library introduces its own custom types to make sure they're the expected length, e.g. uint32_t etc. Having things the other way around makes life far easier: int is always 32 bits signed, for example, and the same applies to strings. You can use version blocks if you want to specify a type which changes based on platform; I wouldn't recommend it though, as it just makes life harder in the long run.
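For completeness, a minimal sketch of the version-block approach just described (the alias name tstring is hypothetical, echoing Windows' TCHAR; as noted above, this is not recommended):

```d
// Hypothetical platform-dependent string alias via version blocks.
version (Windows)
    alias tstring = wstring; // UTF-16 is the native Windows encoding
else
    alias tstring = string;  // UTF-8 on Linux and most other systems

void main()
{
    // String literals convert implicitly to any of the three string
    // types, so this line compiles on either branch.
    tstring greeting = "hello";
    assert(greeting.length == 5); // ASCII: same code-unit count either way
}
```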
 In my opinion, to separate notion of character from byte would be
 nice, and it makes sense as a particular platform uses either UTF-8
 or UTF-16 natively. Programmers may write universal code (like TCHAR
 on Windows). Unfortunately, C uses 'char' and 'byte' interchangeably
 but why D has to make this mistake again?

They are different types in D, so I'm not sure what you mean. byte/ubyte have no encoding associated with them; char is always UTF-8, wchar UTF-16, etc.

Robert
Jun 07 2010
prev sibling next sibling parent Ali Çehreli <acehreli yahoo.com> writes:
Ruslan Nikolaev wrote:

 1. When we have 2 methods (one with wchar[] and another with char[]), 

I asked the same question on the D.learn group recently. Literals like that don't have a particular encoding. The programmer must specify explicitly to resolve ambiguities: "hello world"c or "hello world"w.
 3. Even if they do support, it is kind of annoying to provide methods 

I think the solution is to take advantage of templates, and to use template constraints if the template parameter is too flexible. Another approach might be to use dchar within the application and use other encodings at the interfaces.

Ali
Jun 07 2010
prev sibling next sibling parent justin <justin economicmodeling.com> writes:
This doesn't answer all your questions and suggestions, but here goes.

In answer to #1, "Hello world" is a literal of type char[] (or string). If you
want to use UTF-16 or 32, use "Hello world"w and "Hello world"d respectively.

In partial answer to #2 and #3, it's generally pretty easy to adapt a string
function to support string, wstring, and dstring by using templating and the
fact that D can do automatic conversions for you. For instance:

string blah = "hello world";
foreach (dchar c; blah)   // guaranteed to get a full code point
{
    // do something with c
}
Jun 07 2010
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Ruslan Nikolaev wrote:
 Note: I posted this already on runtime D list,

Although D is designed to be fairly agnostic about character types, in practice I recommend the following:

1. Use the string type for strings; it's char[] on D1 and immutable(char)[] on D2.
2. Use dchar to hold individual characters.

The problem with wchar is that everyone forgets about surrogate pairs. Most UTF-16 programs in the wild, including nearly all Java programs, are broken with regard to surrogate pairs. The problem with dchar is that strings of them consume memory at a prodigious rate.
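The surrogate-pair hazard can be shown with a short D2 sketch, using U+1D11E (musical G clef), which needs two UTF-16 code units:

```d
void main()
{
    wstring s = "\U0001D11E"w; // one code point, encoded as a surrogate pair
    assert(s.length == 2);     // .length counts UTF-16 code units

    size_t codePoints = 0;
    foreach (dchar c; s)       // decoding foreach yields whole code points
        ++codePoints;
    assert(codePoints == 1);   // one character, despite two code units
}
```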
Jun 07 2010
next sibling parent Kagamin <spam here.lot> writes:
Walter Bright Wrote:

 The problem with wchar's is that everyone forgets about surrogate pairs. Most 
 UTF-16 programs in the wild, including nearly all Java programs, are broken
with 
 regard to surrogate pairs.

I'm afraid it will be pretty hard to show the bug. I don't know whether Java is particularly nasty here, but for C code it will be hard.
Jun 07 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 The problem with dchar's is strings of them consume 
 memory at a prodigious rate.

Warning: lazy musings ahead.

I hope we'll soon have computers with 200+ GB of RAM, where using strings with less than 32-bit chars is in most cases a premature optimization (like today it is often a silly optimization to use arrays of 16-bit ints instead of 32-bit or 64-bit ints; only special situations found with the profiler can justify the use of arrays of shorts in a low-level language).

Even in PCs with 200 GB of RAM the first levels of CPU cache can be very small (like 32 KB), and cache misses are costly, so even when huge amounts of RAM are present it can be useful to reduce the size of strings to increase performance. A possible solution to this problem could be some kind of real-time hardware compression/decompression between the CPU and the RAM. UTF-8 can be a good enough way to compress 32-bit strings, but then we are back to writing low-level programs that have to deal with UTF-8. To avoid this, the CPU and RAM could compress/decompress the text transparently to the programmer. Unfortunately UTF-8 is a variable-length encoding, so maybe it can't be done transparently enough, in which case a smarter and better compression algorithm could be used to keep all this transparent enough (not fully transparent; some low-level situations can require code that deals with the compression).

Bye,
bearophile
Jun 08 2010
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 The problem with dchar's is strings of them consume memory at a prodigious
 rate.

Warning: lazy musings ahead. I hope we'll soon have computers with 200+ GB of RAM where using strings that use less than 32-bit chars is in most cases a premature optimization (like today is often a silly optimization to use arrays of 16-bit ints instead of 32-bit or 64-bit ints. Only special situations found with the profiler can justify the use of arrays of shorts in a low level language). Even in PCs with 200 GB of RAM the first levels of CPU caches can be very small (like 32 KB), and cache misses are costly, so even if huge amounts of RAMs are present, to increase performance it can be useful to reduce the size of strings. A possible solution to this problem can be some kind of real-time hardware compression/decompression between the CPU and the RAM. UTF-8 can be a good enough way to compress 32-bit strings. So we are back to writing low-level programs that have to deal with UTF-8. To avoid this, CPUs and RAM can compress/decompress the text transparently to the programmer. Unfortunately UTF-8 is a variable-length encoding, so maybe it can't be done transparently enough. So a smarter and better compression algorithm can be used to keep all this transparent enough (not fully transparent, some low-level situations can require code that deals with the compression).

I strongly suspect that the encode/decode time for UTF-8 is more than compensated for by the 4x reduction in memory usage. I did a large app 10 years ago using dchars throughout, and the effects of the memory consumption were murderous. (As the recent article on memory consumption shows, large data structures can have huge negative speed consequences due to virtual and cache memory, and multiple cores trying to access the same memory.)

https://lwn.net/Articles/250967/

Keep in mind that the overwhelming bulk of UTF-8 text is ascii, and requires only one cycle to "decode".
Jun 08 2010
prev sibling parent reply Rainer Deyke <rainerd eldwood.com> writes:
On 6/8/2010 13:57, bearophile wrote:
 I hope we'll soon have computers with 200+ GB of RAM where using
 strings that use less than 32-bit chars is in most cases a premature
 optimization (like today is often a silly optimization to use arrays
 of 16-bit ints instead of 32-bit or 64-bit ints. Only special
 situations found with the profiler can justify the use of arrays of
 shorts in a low level language).

Off-topic, but I don't need a profiler to tell me that my 1024x1024x1024 arrays should use shorts instead of ints. And even when 200GB becomes common, I'd still rather not waste that memory by using twice as much space as I have to just because I can.

-- Rainer Deyke - rainerd eldwood.com
Jun 08 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Rainer Deyke" <rainerd eldwood.com> wrote in message 
news:humes8$s8$1 digitalmars.com...
 On 6/8/2010 13:57, bearophile wrote:
 I hope we'll soon have computers with 200+ GB of RAM where using
 strings that use less than 32-bit chars is in most cases a premature
 optimization (like today is often a silly optimization to use arrays
 of 16-bit ints instead of 32-bit or 64-bit ints. Only special
 situations found with the profiler can justify the use of arrays of
 shorts in a low level language).

Off-topic, but I don't need a profiler to tell me that my 1024x1024x1024 arrays should use shorts instead of ints. And even when 200GB becomes common, I'd still rather not waste that memory by using twice as much space as I have to just because I can.

I think he was just musing that it would be nice to be able to ignore multiple encodings and multiple-code-units, and get back to something much closer to the blissful simplicity of ASCII. On that particular point, I concur ;)
Jun 08 2010
parent "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:humfrk$2gk$1 digitalmars.com...
 "Rainer Deyke" <rainerd eldwood.com> wrote in message 
 news:humes8$s8$1 digitalmars.com...
 On 6/8/2010 13:57, bearophile wrote:
 I hope we'll soon have computers with 200+ GB of RAM where using
 strings that use less than 32-bit chars is in most cases a premature
 optimization (like today is often a silly optimization to use arrays
 of 16-bit ints instead of 32-bit or 64-bit ints. Only special
 situations found with the profiler can justify the use of arrays of
 shorts in a low level language).

Off-topic, but I don't need a profiler to tell me that my 1024x1024x1024 arrays should use shorts instead of ints. And even when 200GB becomes common, I'd still rather not waste that memory by using twice as much space as I have to just because I can.

I think he was just musing that it would be nice to be able to ignore multiple encodings and multiple-code-units, and get back to something much closer to the blissful simplicity of ASCII. On that particular point, I concur ;)

Keep in mind too, that for an English-language app (and there are plenty), even using ASCII still wastes space, since you usually only need the 26 letters, 10 digits, a few whitespace characters, and a handful of punctuation. You could probably fit that in 6 bits per character, less if you're ballsy enough to use Huffman encoding internally. Yes, there are twice as many letters if you count uppercase/lowercase, but random casing is rare, so there are tricks you can use to just stick with 26 plus maybe a few special control characters.

But, of course, nobody actually does any of that, because with the amount of memory we have, and the amount of memory already used by other parts of a program, the savings wouldn't be worth the bother. But I agree with your point too. Just saying.
Jun 08 2010
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 07 Jun 2010 17:48:09 -0400, Ruslan Nikolaev  
<nruslan_devel yahoo.com> wrote:

 Note: I posted this already on runtime D list, but I think that list was  
 a wrong one for this question. Sorry for duplication :-)

 Hi. I am new to D. It looks like D supports 3 types of characters: char,  
 wchar, dchar. This is cool, however, I have some questions about it:

 1. When we have 2 methods (one with wchar[] and another with char[]),  
 how D will determine which one to use if I pass a string "hello world"?
 2. Many libraries (e.g. tango or phobos) don't provide functions/methods  
 (or have incomplete support) for wchar/dchar
 e.g. writefln probably assumes char[] for strings like "Number %d..."
 3. Even if they do support, it is kind of annoying to provide methods  
 for all 3 types of chars. Especially, if we want to use native mode  
 (e.g. for Windows wchar is better, for Linux char is better). E.g.  
 Windows has _wopen, _wdirent, _wreaddir, _wopenddir, _wmain(int argc,  
 wchar_t[] argv) and so on, and they should be native (in a sense that no  
 conversion is necessary when we do, for instance, _wopen). Linux doesn't  
 have them as UTF-8 is used widely there.

 Since D language is targeted on system programming, why not to try to  
 use whatever works better on a particular system (e.g. char will be 2  
 bytes on Windows and 1 byte on Linux; it can be a compiler switch, and  
 all libraries can be compiled properly on a particular system). It's  
 still necessary to have all 3 types of char for cooperation with C. But  
 in those cases byte, short and int will do their work. For this kind of  
 situation, it would be nice to have some built-in functions for  
 transparent conversion from char to byte/short/int and vice versa  
 (especially, if conversion only happens if needed on a particular  
 platform).

 In my opinion, to separate notion of character from byte would be nice,  
 and it makes sense as a particular platform uses either UTF-8 or UTF-16  
 natively. Programmers may write universal code (like TCHAR on Windows).  
 Unfortunately, C uses 'char' and 'byte' interchangeably but why D has to  
 make this mistake again?

One thing that may not be clear from your interpretation of D's docs: all strings representable by one character type are also representable by all the other character types. This means that a function that takes a char[] can also take a dchar[] if it is sent through a converter (i.e. toUtf8 in Tango, I think). So D's char is decidedly not like byte or ubyte, or C's char.

In general, I use char (utf8) because I am used to C and ASCII (which is exactly represented in utf-8). But because char is utf-8, it could potentially accept any unicode string.

-Steve
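In Phobos 2, such a conversion can be sketched with std.conv.to (Tango's equivalents live in its tango.text.convert.Utf module):

```d
import std.conv : to;

void main()
{
    string  s8  = "hello";
    wstring s16 = to!wstring(s8); // transcode UTF-8 -> UTF-16
    dstring s32 = to!dstring(s8); // transcode UTF-8 -> UTF-32

    assert(s16 == "hello"w);
    assert(s32 == "hello"d);
}
```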
Jun 07 2010
parent Ali Çehreli <acehreli yahoo.com> writes:
Steven Schveighoffer wrote:
 a function that takes 
 a char[] can also take a dchar[] if it is sent through a converter (i.e. 
 toUtf8 on Tango I think).

In Phobos, there are text, wtext, and dtext in std.conv:

/**
Convenience functions for converting any number and types of
arguments into _text (the three character widths).

Example:
----
assert(text(42, ' ', 1.5, ": xyz") == "42 1.5: xyz");
assert(wtext(42, ' ', 1.5, ": xyz") == "42 1.5: xyz"w);
assert(dtext(42, ' ', 1.5, ": xyz") == "42 1.5: xyz"d);
----
*/

Ali
Jun 07 2010
prev sibling parent "Jer" <jersey chicago.com> writes:
Ruslan Nikolaev wrote:
 Note: I posted this already on runtime D list, but I think that list
 was a wrong one for this question. Sorry for duplication :-)

 Hi. I am new to D. It looks like D supports 3 types of characters:
 char, wchar, dchar. This is cool,

It's wrong, actually.
Jun 10 2010