
digitalmars.D - Re: Wide characters support in D

Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
Ok, ok... that was just a suggestion... Thanks for the reply about the "Hello world"
representation. Were the postfixes "w" and "d" added initially, or just recently? I did
not know about them. I thought D did automatic conversion for string literals.

Yes, templates may help. However, that unnecessarily makes the code bigger (since we
have to compile it for every char type). The other problem is that it allows the
programmer to choose which one to use. He or she may simply prefer char[] as
UTF-8 (or wchar[] as UTF-16). That will be fine on a platform that supports this
encoding natively (e.g. for file system operations, screen output, etc.),
whereas it will cause conversion overhead on the others. It's not a
big overhead, but it is an unnecessary one. Having said this, I do agree that there must
be some flexibility (e.g. in Java, a char is always 2 bytes); however, I don't
believe that this flexibility should be available to the application programmer.

I don't think there is any problem with having different sizes of char. In fact,
that would make programs better (since application programmers would have to
think in terms of characters as opposed to bytes). Systems programmers (i.e. OS
programmers) could still choose the representation they expect (since a char-width
option could be added to the compiler). TCHAR in Windows is a good example of this.
Whenever you need to determine the size of an element (e.g. for allocation), you can
use 'sizeof'. Again, it does not mean that you're deprived of char/wchar/dchar
capability. They could still be supported (e.g. via ubyte/ushort/uint) for the sake
of interoperability or some special cases. Special string constants (e.g. ""b,
""w, ""d) could be supported, too. My only point is that it would be good to have a
universal char type that depends on the platform. That, in turn, allows a
unified char for all libraries on that platform.

In addition, the commonly used constants '\n', '\r', and '\t' would be the same
regardless of char width.
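
To make the idea concrete, here is a rough sketch of what such a platform-selected character type could look like today (the names tchar and tstring are invented for this example; D does not define them):

---
import std.stdio;

// Hypothetical platform-dependent character type, in the spirit of TCHAR.
version (Windows)
    alias wchar tchar;    // 2-byte UTF-16 code units, like the Win32 "wide" APIs
else
    alias char tchar;     // 1-byte UTF-8 code units, like most POSIX interfaces

alias immutable(tchar)[] tstring;

void main()
{
    tstring s = "hello\n";              // the literal adapts to whichever width tchar has
    writefln("tchar is %s byte(s) wide", tchar.sizeof);

    auto buf = new tchar[](s.length);   // allocates s.length code units,
    buf[] = s[];                        // i.e. s.length * tchar.sizeof bytes
    write(buf);
}
---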

Anyway, that was just a suggestion. You may disagree with this if you wish.

Ruslan.


      
Jun 07 2010
torhu <no spam.invalid> writes:
On 08.06.2010 01:16, Ruslan Nikolaev wrote:
 Ok, ok... that was just a suggestion... Thanks for the reply about the "Hello world"
representation. Were the postfixes "w" and "d" added initially, or just recently? I did
not know about them. I thought D did automatic conversion for string literals.

There is automatic conversion, try this example:

---
import std.stdio;

//void f(char[] s) { writefln("char"); }
void f(wchar[] s) { writefln("wchar"); }

void main()
{
    f("hello");
}
---

As long as there's just one possible match, a string literal with no postfix will be interpreted as char[], wchar[], or dchar[] depending on context. But if you uncomment the first f(), the compiler will complain about there being two matching overloads. Then you'll have to add the 'c' or 'w' postfix to the string literal to disambiguate.

For templates and type inference, string literals default to char[]. This example prints 'char':

---
import std.stdio;

void f(T)(T[] s) { writefln(T.stringof); }

void main()
{
    f("hello");
}
---
Jun 07 2010
"Nick Sabalausky" <a a.a> writes:
"Ruslan Nikolaev" <nruslan_devel yahoo.com> wrote in message 
news:mailman.122.1275952601.24349.digitalmars-d puremagic.com...
 Ok, ok... that was just a suggestion... Thanks for the reply about the "Hello
 world" representation. Were the postfixes "w" and "d" added initially, or just
 recently? I did not know about them. I thought D did automatic conversion
 for string literals.

The postfixes 'c', 'w' and 'd' have been in there a long time. But D does have a little bit of automatic conversion. Let me try to clarify:

"hello"c  // string,  UTF-8
"hello"w  // wstring, UTF-16
"hello"d  // dstring, UTF-32
"hello"   // Depends how you use it

Suppose I have a function that takes a UTF-8 string, and I call it:

void cfoo(string a) {}

cfoo("hello"c); // Works
cfoo("hello"w); // Error, wrong type
cfoo("hello"d); // Error, wrong type
cfoo("hello");  // Works, assumed to be UTF-8 string

If I make a different function that takes a UTF-16 wstring instead:

void wfoo(wstring a) {}

wfoo("hello"c); // Error, wrong type
wfoo("hello"w); // Works
wfoo("hello"d); // Error, wrong type
wfoo("hello");  // Works, assumed to be UTF-16 wstring

And then, a UTF-32 dstring version would be similar:

void dfoo(dstring a) {}

dfoo("hello"c); // Error, wrong type
dfoo("hello"w); // Error, wrong type
dfoo("hello"d); // Works
dfoo("hello");  // Works, assumed to be UTF-32 dstring

As you can see, the literals with postfixes are always the exact type you specify. If you have no postfix, then you get whatever the compiler expects it to be.

But then the question is, what happens if any of those types can be used? Which does the compiler choose?

void Tfoo(T)(T a)
{
    // When compiling, display the type used.
    pragma(msg, T.stringof);
}

Tfoo("hello");

(Normally you'd want to add a constraint that T must be one of the string types, so that no one tries to pass in an int or float or something. I skipped that here.)

Here, Tfoo isn't expecting any particular type of string; it can take any type. And "hello" doesn't have a postfix, so the compiler uses the default: a UTF-8 string.
 Yes, templates may help. However, that unnecessarily makes the code bigger (since
 we have to compile it for every char type).

It only generates code for the types that are actually needed. If, for instance, your program never uses anything except UTF-8, then only one version of the function will be made - the UTF-8 version. If you don't use every char type, then it doesn't generate code for every char type - just for the ones you actually use.
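
A small sketch of that (the function name is invented for the example): compiling and running this generates only the one instantiation that is actually used, and the pragma(msg) line prints exactly once at compile time.

---
import std.stdio;

// Toy templated text function; T is inferred from the argument type.
size_t codeUnits(T)(const(T)[] s)
{
    pragma(msg, "instantiating codeUnits for " ~ T.stringof);
    return s.length;
}

void main()
{
    // Only the UTF-8 version is ever called, so only that version is
    // compiled in; no wchar or dchar variant ends up in the binary.
    writeln(codeUnits("hello"));   // prints 5
}
---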
 The other problem is that it allows the programmer to choose which one to use.
 He or she may simply prefer char[] as UTF-8 (or wchar[] as UTF-16). That will
 be fine on a platform that supports this encoding natively (e.g. for file
 system operations, screen output, etc.), whereas it will cause conversion
 overhead on the others. I don't think there is any problem with having
 different sizes of char. In fact, that would make programs better (since
 application programmers would have to think in terms of characters as
 opposed to bytes). It's not a big overhead, but it is an unnecessary
 one. Having said this, I do agree that there must be some flexibility (e.g.
 in Java, a char is always 2 bytes); however, I don't believe that this
 flexibility should be available to the application programmer.

That's not good. First of all, UTF-16 is a lousy encoding; it combines the worst of both UTF-8 and UTF-32: it's multibyte and non-word-aligned like UTF-8, but it still wastes a lot of space like UTF-32. So even if your OS uses it natively, it's still best to do most internal processing in either UTF-8 or UTF-32. (And with templated string functions, if the programmer actually does want to use the native type in the *rare* cases where he's making enough OS calls that it would actually matter, he can still do so.)

Secondly, the programmer *should* be able to use whatever type he decides is appropriate. If he wants to stick with native, he can do so, but he shouldn't be forced into choosing between "use the native encoding" and "abuse the type system by pretending that an int is a character". For instance, complex low-level text processing *relies* on knowing exactly what encoding is being used and coding specifically to that encoding. As an example, I'm currently working on a generalized parser library ( http://www.dsource.org/projects/goldie ). Something like that is complex enough already that implementing the internal lexer separately for each possible native text encoding is just not worthwhile, especially since the text hardly ever gets passed to or from any OS calls that expect any particular encoding.

Or maybe you're on a fancy OS that can handle any encoding natively. Or maybe the programmer is in a low-memory (or very-large-data) situation and needs the space savings of UTF-8 regardless of OS and doesn't care about speed. Or maybe they're actually *writing* an OS (most modern languages are completely useless for writing an OS; D isn't). A language or a library should *never* assume it knows the programmer's needs better than the programmer does.

Also, C already tried the approach of multi-sized types (e.g. C's "int"), and it ended up being a big PITA disaster that everyone ended up having to make up hacks to work around.
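
As a small illustration of that point about encoding-specific code (the helper function is invented for the example): the same text has a different code-unit length in each encoding, and the foreach below can only decode it correctly because the array's element type tells the compiler which encoding it is dealing with.

---
import std.stdio;

// Invented helper: counts decoded characters (code points) in any string type.
size_t decodedLength(S)(S s)
{
    size_t n = 0;
    foreach (dchar c; s)    // decodes UTF-8/UTF-16 on the fly; passes dchar through
        ++n;
    return n;
}

void main()
{
    string  u8  = "naïve";     // the 'ï' takes 2 UTF-8 code units
    wstring u16 = "naïve"w;    // every character here fits in one UTF-16 code unit
    dstring u32 = "naïve"d;

    writefln("UTF-8 : %s code units, %s characters", u8.length,  decodedLength(u8));
    writefln("UTF-16: %s code units, %s characters", u16.length, decodedLength(u16));
    writefln("UTF-32: %s code units, %s characters", u32.length, decodedLength(u32));
}
---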

 Systems programmers (i.e. OS programmers) could still choose the representation they
 expect (since a char-width option could be added to the compiler).

See, that's the thing: D is intended as a systems language, so a D programmer must be able to easily handle it that way whenever they need to.
 TCHAR in Windows is a good example of this. Whenever you need to determine the
 size of an element (e.g. for allocation), you can use 'sizeof'. Again, it does
 not mean that you're deprived of char/wchar/dchar capability. They could still
 be supported (e.g. via ubyte/ushort/uint) for the sake of interoperability
 or some special cases. Special string constants (e.g. ""b, ""w, ""d) could be
 supported, too. My only point is that it would be good to have a universal
 char type that depends on the platform.

You can have that easily:

version(Windows)
    alias wstring tstring;
else
    alias string tstring;

Besides, just because you *can* get a job done a certain way doesn't mean languages should never try to allow a better way for those who want a better way.
 That, in turn, allows a unified char for all libraries on that platform.

With templated text functions, there is very little benefit to be gained from having a unified char. It just wouldn't serve any real purpose. All it would do is cause problems for anyone who needs to work at the low level.

-------------------------------
Not sent from an iPhone.
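
For concreteness, here is a minimal sketch of the kind of templated text function meant above (the function is invented for the example): one implementation, instantiated only for whichever string types a given program actually passes in.

---
import std.stdio;

// Invented example: one body serves char[], wchar[] and dchar[] alike;
// Char is inferred at each call site.
size_t countWords(Char)(const(Char)[] text)
{
    size_t words = 0;
    bool inWord = false;
    foreach (dchar c; text)   // decoded uniformly, whatever the code-unit width
    {
        bool space = c == ' ' || c == '\t' || c == '\n' || c == '\r';
        if (space)
            inWord = false;
        else if (!inWord)
        {
            inWord = true;
            ++words;
        }
    }
    return words;
}

void main()
{
    writeln(countWords("one two three"));    // UTF-8  instantiation: 3
    writeln(countWords("one two three"w));   // UTF-16 instantiation: 3
    writeln(countWords("one two three"d));   // UTF-32 instantiation: 3
}
---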
Jun 07 2010