
digitalmars.D - Re: Wide characters support in D

Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
Ok, ok... that was just a suggestion... Thanks for the reply about the "Hello world"
representation. Were the postfixes "w" and "d" added initially, or just recently? I did
not know about them. I thought D did automatic conversion for string literals.

Yes, templates may help. However, that unnecessarily makes the code bigger (since we
have to compile it for every char type). The other problem is that it allows the
programmer to choose which one to use. He or she may simply prefer char[] as
UTF-8 (or wchar[] as UTF-16). That will be fine on a platform that supports this
encoding natively (e.g. for file system operations, screen output, etc.),
whereas it will cause conversion overhead on the others. It's not a
big overhead, but it is an unnecessary one. Having said this, I do agree that there must
be some flexibility (e.g. in Java, a char is always 2 bytes); however, I don't
believe that this flexibility should be available to the application programmer.

I don't think there is any problem with having different sizes of char. In fact,
that would make programs better (since application programmers would have to
think in terms of characters as opposed to bytes). Systems programmers (i.e. OS
programmers) could still choose the representation they expect (since a char-width
option could be added to the compiler). TCHAR in Windows is a good example of this.
Whenever you need to determine the size of an element (e.g. for allocation), you can
use 'sizeof'. Again, it does not mean that you're deprived of char/wchar/dchar
capability. They could still be supported (e.g. via ubyte/ushort/uint) for the sake
of interoperability or some special cases. Special string constants (e.g. ""b,
""w, ""d) could be supported, too. My only point is that it would be good to have a
universal char type that depends on the platform. That, in turn, allows a
unified char for all libraries on that platform.

In addition, the commonly used constants '\n', '\r', and '\t' would be the same
regardless of char width.
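
To make the idea concrete, here is a rough sketch of what such a platform-selected character type could look like today (the names tchar and tstring are invented for this example; D does not define them):

---
import std.stdio;

// Hypothetical platform-dependent character type, in the spirit of TCHAR.
version (Windows)
    alias wchar tchar;    // 2-byte UTF-16 code units, like the Win32 "wide" APIs
else
    alias char tchar;     // 1-byte UTF-8 code units, like most POSIX interfaces

alias immutable(tchar)[] tstring;

void main()
{
    tstring s = "hello\n";              // the literal adapts to whichever width tchar has
    writefln("tchar is %s byte(s) wide", tchar.sizeof);

    auto buf = new tchar[](s.length);   // allocates s.length code units,
    buf[] = s[];                        // i.e. s.length * tchar.sizeof bytes
    write(buf);
}
---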

Anyway, that was just a suggestion. You may disagree with this if you wish.

Ruslan.


      
Jun 07 2010
torhu <no spam.invalid> writes:
On 08.06.2010 01:16, Ruslan Nikolaev wrote:
 Ok, ok... that was just a suggestion... Thanks for the reply about the "Hello world"
representation. Were the postfixes "w" and "d" added initially, or just recently? I did
not know about them. I thought D did automatic conversion for string literals.

There is automatic conversion, try this example:

---
import std.stdio;

//void f(char[] s) { writefln("char"); }
void f(wchar[] s) { writefln("wchar"); }

void main()
{
    f("hello");
}
---

As long as there's just one possible match, a string literal with no postfix will be interpreted as char[], wchar[], or dchar[] depending on context. But if you uncomment the first f(), the compiler will complain about there being two matching overloads. Then you'll have to add the 'c' or 'w' postfix to the string literal to disambiguate.

For templates and type inference, string literals default to char[]. This example prints 'char':

---
import std.stdio;

void f(T)(T[] s) { writefln(T.stringof); }

void main()
{
    f("hello");
}
---
Jun 07 2010
"Nick Sabalausky" <a a.a> writes:
"Ruslan Nikolaev" <nruslan_devel yahoo.com> wrote in message 
news:mailman.122.1275952601.24349.digitalmars-d puremagic.com...
 Ok, ok... that was just a suggestion... Thanks for the reply about the "Hello
 world" representation. Were the postfixes "w" and "d" added initially, or just
 recently? I did not know about them. I thought D did automatic conversion
 for string literals.

The postfixes 'c', 'w' and 'd' have been in there a long time. But D does have a little bit of automatic conversion. Let me try to clarify:

"hello"c  // string,  UTF-8
"hello"w  // wstring, UTF-16
"hello"d  // dstring, UTF-32
"hello"   // Depends how you use it

Suppose I have a function that takes a UTF-8 string, and I call it:

void cfoo(string a) {}

cfoo("hello"c); // Works
cfoo("hello"w); // Error, wrong type
cfoo("hello"d); // Error, wrong type
cfoo("hello");  // Works, assumed to be UTF-8 string

If I make a different function that takes a UTF-16 wstring instead:

void wfoo(wstring a) {}

wfoo("hello"c); // Error, wrong type
wfoo("hello"w); // Works
wfoo("hello"d); // Error, wrong type
wfoo("hello");  // Works, assumed to be UTF-16 wstring

And then, a UTF-32 dstring version would be similar:

void dfoo(dstring a) {}

dfoo("hello"c); // Error, wrong type
dfoo("hello"w); // Error, wrong type
dfoo("hello"d); // Works
dfoo("hello");  // Works, assumed to be UTF-32 dstring

As you can see, the literals with postfixes are always the exact type you specify. If you have no postfix, then you get whatever the compiler expects it to be.

But then the question is, what happens if any of those types can be used? Which does the compiler choose?

void Tfoo(T)(T a)
{
    // When compiling, display the type used.
    pragma(msg, T.stringof);
}

Tfoo("hello");

(Normally you'd want to add a constraint that T must be one of the string types, so that no one tries to pass in an int or float or something. I skipped that here.)

Here, Tfoo isn't expecting any particular type of string; it can take any type. And "hello" doesn't have a postfix, so the compiler uses the default: a UTF-8 string.
 Yes, templates may help. However, that unnecessarily makes the code bigger (since
 we have to compile it for every char type).

It only generates code for the types that are actually needed. If, for instance, your program never uses anything except UTF-8, then only one version of the function will be made - the UTF-8 version. If you don't use every char type, then it doesn't generate code for every char type - just for the ones you actually use.
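
A small sketch of that (the function name is invented for the example): compiling and running this generates only the one instantiation that is actually used, and the pragma(msg) line prints exactly once at compile time.

---
import std.stdio;

// Toy templated text function; T is inferred from the argument type.
size_t codeUnits(T)(const(T)[] s)
{
    pragma(msg, "instantiating codeUnits for " ~ T.stringof);
    return s.length;
}

void main()
{
    // Only the UTF-8 version is ever called, so only that version is
    // compiled in; no wchar or dchar variant ends up in the binary.
    writeln(codeUnits("hello"));   // prints 5
}
---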
 The other problem is that it allows the programmer to choose which one to use.
 He or she may simply prefer char[] as UTF-8 (or wchar[] as UTF-16). That will
 be fine on a platform that supports this encoding natively (e.g. for file
 system operations, screen output, etc.), whereas it will cause conversion
 overhead on the others. I don't think there is any problem with having
 different sizes of char. In fact, that would make programs better (since
 application programmers would have to think in terms of characters as
 opposed to bytes). It's not a big overhead, but it is an unnecessary
 one. Having said this, I do agree that there must be some flexibility (e.g.
 in Java, a char is always 2 bytes); however, I don't believe that this
 flexibility should be available to the application programmer.

That's not good. First of all, UTF-16 is a lousy encoding; it combines the worst of both UTF-8 and UTF-32: it's multibyte and non-word-aligned like UTF-8, but it still wastes a lot of space like UTF-32. So even if your OS uses it natively, it's still best to do most internal processing in either UTF-8 or UTF-32. (And with templated string functions, if the programmer actually does want to use the native type in the *rare* cases where he's making enough OS calls that it would actually matter, he can still do so.)

Secondly, the programmer *should* be able to use whatever type he decides is appropriate. If he wants to stick with native, he can do so, but he shouldn't be forced into choosing between "use the native encoding" and "abuse the type system by pretending that an int is a character". For instance, complex low-level text processing *relies* on knowing exactly what encoding is being used and coding specifically to that encoding. As an example, I'm currently working on a generalized parser library ( http://www.dsource.org/projects/goldie ). Something like that is complex enough already that implementing the internal lexer separately for each possible native text encoding is just not worthwhile, especially since the text hardly ever gets passed to or from any OS calls that expect any particular encoding.

Or maybe you're on a fancy OS that can handle any encoding natively. Or maybe the programmer is in a low-memory (or very-large-data) situation and needs the space savings of UTF-8 regardless of OS and doesn't care about speed. Or maybe they're actually *writing* an OS (most modern languages are completely useless for writing an OS; D isn't). A language or a library should *never* assume it knows the programmer's needs better than the programmer does.

Also, C already tried the approach of multi-sized types (e.g. C's "int"), and it ended up being a big PITA disaster that everyone ended up having to make up hacks to work around.
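
As a small illustration of that point about encoding-specific code (the helper function is invented for the example): the same text has a different code-unit length in each encoding, and the foreach below can only decode it correctly because the array's element type tells the compiler which encoding it is dealing with.

---
import std.stdio;

// Invented helper: counts decoded characters (code points) in any string type.
size_t decodedLength(S)(S s)
{
    size_t n = 0;
    foreach (dchar c; s)    // decodes UTF-8/UTF-16 on the fly; passes dchar through
        ++n;
    return n;
}

void main()
{
    string  u8  = "naïve";     // the 'ï' takes 2 UTF-8 code units
    wstring u16 = "naïve"w;    // every character here fits in one UTF-16 code unit
    dstring u32 = "naïve"d;

    writefln("UTF-8 : %s code units, %s characters", u8.length,  decodedLength(u8));
    writefln("UTF-16: %s code units, %s characters", u16.length, decodedLength(u16));
    writefln("UTF-32: %s code units, %s characters", u32.length, decodedLength(u32));
}
---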

 Systems programmers (i.e. OS programmers) could still choose the representation they
 expect (since a char-width option could be added to the compiler).

See, that's the thing: D is intended as a systems language, so a D programmer must be able to easily handle it that way whenever they need to.
 TCHAR in Windows is a good example of this. Whenever you need to determine the
 size of an element (e.g. for allocation), you can use 'sizeof'. Again, it does
 not mean that you're deprived of char/wchar/dchar capability. They could still
 be supported (e.g. via ubyte/ushort/uint) for the sake of interoperability
 or some special cases. Special string constants (e.g. ""b, ""w, ""d) could be
 supported, too. My only point is that it would be good to have a universal
 char type that depends on the platform.

You can have that easily:

version(Windows)
    alias wstring tstring;
else
    alias string tstring;

Besides, just because you *can* get a job done a certain way doesn't mean languages should never try to allow a better way for those who want a better way.
 That, in turn, allows a unified char for all libraries on that platform.

With templated text functions, there is very little benefit to be gained from having a unified char. It just wouldn't serve any real purpose. All it would do is cause problems for anyone who needs to work at the low level.

-------------------------------
Not sent from an iPhone.
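
For concreteness, here is a minimal sketch of the kind of templated text function meant above (the function is invented for the example): one implementation, instantiated only for whichever string types a given program actually passes in.

---
import std.stdio;

// Invented example: one body serves char[], wchar[] and dchar[] alike;
// Char is inferred at each call site.
size_t countWords(Char)(const(Char)[] text)
{
    size_t words = 0;
    bool inWord = false;
    foreach (dchar c; text)   // decoded uniformly, whatever the code-unit width
    {
        bool space = c == ' ' || c == '\t' || c == '\n' || c == '\r';
        if (space)
            inWord = false;
        else if (!inWord)
        {
            inWord = true;
            ++words;
        }
    }
    return words;
}

void main()
{
    writeln(countWords("one two three"));    // UTF-8  instantiation: 3
    writeln(countWords("one two three"w));   // UTF-16 instantiation: 3
    writeln(countWords("one two three"d));   // UTF-32 instantiation: 3
}
---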
Jun 07 2010