
digitalmars.D - Re: Wide characters support in D

reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
> It only generates code for the types that are actually needed. If, for
> instance, your program never uses anything except UTF-8, then only one
> version of the function will be made - the UTF-8 version. If you don't
> use every char type, then it doesn't generate it for every char type -
> just the ones you choose to use.

Not quite right. If we create system dynamic libraries, or dynamic libraries that are commonly used, we will have to compile every instance unless we want to burden the user with this. Otherwise, the same code will be duplicated in users' programs over and over again.

> That's not good. First of all, UTF-16 is a lousy encoding, it combines
> the worst of both UTF-8 and UTF-32: It's multibyte and non-word-aligned
> like UTF-8, but it still wastes a lot of space like UTF-32. So even if
> your OS uses it natively, it's still best to do most internal processing
> in either UTF-8 or UTF-32. (And with templated string functions, if the
> programmer actually does want to use the native type in the *rare* cases
> where he's making enough OS calls that it would actually matter, he can
> still do so.)

First of all, UTF-16 is not a lousy encoding. It requires 2 bytes for most characters (not such big wastage, especially if you consider other languages). Only for REALLY rare characters do you need 4 bytes, whereas UTF-8 will require 1 to 3 bytes for the same common characters, and also 4 bytes for the REALLY rare ones. In UTF-16 the surrogate pair is an exception, whereas in UTF-8 multibyte sequences are the rule (when something is an exception, it won't affect performance in most cases; when something is the rule, it will).

Finally, UTF-16 is used by a variety of systems/tools: Windows, Java, C#, Qt and many others. The developers of these systems chose UTF-16 even though some of them (e.g. Java, C#, Qt) were developed in the era of UTF-8.
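For concreteness, here is a minimal D sketch of the code-unit counts in question (the sample characters are illustrative only):

import std.stdio;

void main()
{
    // "é" (U+00E9) and "\U00010348" (a rare character outside the BMP),
    // stored in the three encodings D supports.
    string  s8  = "é\U00010348";   // UTF-8
    wstring s16 = "é\U00010348";   // UTF-16
    dstring s32 = "é\U00010348";   // UTF-32

    // .length counts code units, not characters.
    writeln(s8.length);   // 6: 'é' takes 2 bytes, the rare character takes 4
    writeln(s16.length);  // 3: 'é' is 1 code unit, the rare one a surrogate pair
    writeln(s32.length);  // 2: always one code unit per character
}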
> Secondly, the programmer *should* be able to use whatever type he
> decides is appropriate. If he wants to stick with native, he can do

Why? He/she can just convert to UTF-32 (dchar) whenever a better understanding of a character is needed. At least, that's what should be done anyway.

> You can have that easily:
>
> version(Windows)
>     alias wstring tstring;
> else
>     alias string tstring;

See, that's my point. Nobody is going to do this unless the above is standardized by the language. Everybody will stick to something particular (either char or wchar).

> With templated text functions, there is very little benefit to be gained
> from having a unified char. Just wouldn't serve any real

See my comment above about templates and dynamic libraries.

Ruslan
Jun 07 2010
next sibling parent Jesse Phillips <jessekphillips+D gmail.com> writes:
On Mon, 07 Jun 2010 19:26:02 -0700, Ruslan Nikolaev wrote:

 It only generates code for the types that are actually needed. If, for
 instance, your program never uses anything except UTF-8, then only one
 version of the function will be made - the UTF-8 version. If you don't
 use every char type, then it doesn't generate it for every char type -
 just the ones you choose to use.

Not quite right. If we create system dynamic libraries or dynamic libraries that are commonly used, we will have to compile every instance unless we want to burden the user with this. Otherwise, the same code will be duplicated in users' programs over and over again.

I think you really need to look more into what templates are and do. There is also going to be very little performance gain from using the "system type" for strings, considering that most of the work is not likely going to be in the system commands you mentioned, but within D itself.
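To make the template point concrete, a minimal sketch (the function is purely illustrative) of how D instantiates a templated text function only for the character types actually used:

import std.stdio;

// One generic implementation; the compiler instantiates it only for the
// character types a program (or a library's clients) actually uses.
size_t countSpaces(Char)(const(Char)[] text)
{
    size_t n;
    foreach (c; text)
        if (c == ' ')
            ++n;
    return n;
}

void main()
{
    writeln(countSpaces("a b c"));   // instantiates countSpaces!char
    writeln(countSpaces("a b c"w));  // instantiates countSpaces!wchar
    // No dchar instance is generated unless some call site asks for one.
}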
Jun 07 2010
prev sibling parent "Nick Sabalausky" <a a.a> writes:
"Ruslan Nikolaev" <nruslan_devel yahoo.com> wrote in message 
news:mailman.124.1275963971.24349.digitalmars-d puremagic.com...

Nick wrote:
 It only generates code for the types that are actually needed. If, for
 instance, your program never uses anything except UTF-8, then only one
 version of the function will be made - the UTF-8 version. If you don't
 use every char type, then it doesn't generate it for every char type -
 just the ones you choose to use.

Not quite right. If we create system dynamic libraries or dynamic libraries 
commonly used, we will have to compile every instance unless we want to 
burden the user with this. Otherwise, the same code will be duplicated in 
users' programs over and over again.<

That's a rather minor issue. I think you're overestimating the amount of bloat that comes from having three string types instead of one. The absolute worst-case scenario would be a library that contains nothing but text-processing functions. That would triple in size, but what's the biggest such lib you've ever seen anyway? And for most libs, only a fraction is going to be taken up by text processing, so the difference won't be particularly large. In fact, the difference would likely be dwarfed by the bloat incurred from all the other templated code (i.e., code that would be largely unaffected by the number of string types), and yes, *that* can get to be a problem, but it's an entirely separate one.
 That's not good. First of all, UTF-16 is a lousy encoding, it combines
 the worst of both UTF-8 and UTF-32: It's multibyte and non-word-aligned
 like UTF-8, but it still wastes a lot of space like UTF-32. So even if
 your OS uses it natively, it's still best to do most internal processing
 in either UTF-8 or UTF-32. (And with templated string functions, if the
 programmer actually does want to use the native type in the *rare* cases
 where he's making enough OS calls that it would actually matter, he can
 still do so.)

First of all, UTF-16 is not a lousy encoding. It requires 2 bytes for most 
characters (not such big wastage, especially if you consider other 
languages). Only for REALLY rare characters do you need 4 bytes, whereas 
UTF-8 will require 1 to 3 bytes for the same common characters, and also 4 
bytes for the REALLY rare ones. In UTF-16 the surrogate pair is an exception, 
whereas in UTF-8 multibyte sequences are the rule (when something is an 
exception, it won't affect performance in most cases; when something is the 
rule, it will).<

Maybe "lousy" is too strong a word, but aside from compatibility with other libs/software that use it (which I'll address separately), UTF-16 is not particularly useful compared to UTF-8 and UTF-32:

Non-latin-alphabet language, UTF-8 vs UTF-16: The real-world difference in sizes is minimal. But UTF-8 has some advantages: the nature of the encoding makes backwards-scanning cheaper and easier. Also, as Walter said, bugs in the handling of multi-code-unit characters become fairly obvious. Advantages of UTF-16: None.

Latin-alphabet language, UTF-8 vs UTF-16: All the same UTF-8 advantages for non-latin-alphabet languages still apply, plus there's a space savings: under UTF-8, *most* characters are going to be 1 byte. Yes, there will be the occasional 2+ byte character, but those are so much less common that the overhead compared to ASCII (I'm only using ASCII as a baseline here, for the sake of comparison) would only be around 0% to 15% depending on the language. UTF-16, however, has a consistent 100% overhead (slightly more when you count surrogate pairs, but I'll just leave it at 100%). So, depending on the language, UTF-16 would be around 70%-100% larger than UTF-8. That's not insignificant.

Any language, UTF-32 vs UTF-16: Using UTF-32 takes up extra space, but when that matters, UTF-8 already has the advantage over UTF-16 anyway, regardless of whether or not UTF-8 is providing a space savings (see above), so the question of UTF-32 vs UTF-16 becomes moot. The rest of the time, UTF-32 has these advantages: guaranteed one code unit per character, and a code-unit size that is faster on typical CPUs (which generally handle 32 bits faster than 8 or 16 bits). Advantages of UTF-16: None.

So compatibility with certain tools/libs is really the only reason ever to choose UTF-16.
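As a hedged sketch of the backwards-scanning point (this is not Phobos code, just an illustration of the bit pattern involved):

// UTF-8 continuation bytes always match the bit pattern 10xxxxxx, so
// stepping back to the start of the previous character is just a matter
// of skipping them.
size_t backUpOneChar(const(char)[] s, size_t i)
{
    assert(i > 0);
    do
    {
        --i;
    } while (i > 0 && (s[i] & 0xC0) == 0x80); // skip continuation bytes
    return i;
}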
Finally, UTF-16 is used by a variety of systems/tools: Windows, Java, C#, 
Qt and many others. Developers of these systems chose to use UTF-16 even 
though some of them (e.g. Java, C#, Qt) were developed in the era of UTF-8<

First of all, it's not exactly unheard of for big projects to make a sub-optimal decision. Secondly, Java and Windows adopted 16-bit encodings back when many people were still under the mistaken impression that this would allow them to hold any character in one code unit. If that had been true, then it would indeed have had at least certain advantages over UTF-8. But by the time the programming world at large knew better, it was too late for Java or Windows to re-evaluate the decision; they'd already jumped in with both feet. C# and .NET use UTF-16 because Windows does. I don't know about Qt, but judging by how long Wikipedia says it's been around, I'd say it's probably the same story.

As for choosing to use UTF-16 because of interfacing with other tools and libs that use it: That's certainly a good reason to use UTF-16. But it's about the only reason. And it's a big mistake to just assume that the overhead of converting to/from UTF-16 when crossing those API borders is always going to outweigh all other concerns: For instance, if you're writing an app that does a large amount of text processing on relatively small amounts of text and only deals a little bit with a UTF-16 API, then the overhead of operating on 16 bits at a time can easily outweigh the overhead from the UTF-16 <-> UTF-32 conversions. Or maybe the app you're writing is more memory-limited than speed-limited.

There are perfectly legitimate reasons to want to use an encoding other than the OS-native one. Why force those people to circumvent the type system to do it, especially in a language that's intended to be usable as a systems language? Just to potentially save a couple of megs on some .dll or .so?
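For example, keeping UTF-8 internally and converting only at the API border looks roughly like this in D (a sketch; the MessageBoxW declaration is written out here only for illustration, real code would take it from the Windows bindings):

version (Windows)
{
    import std.utf : toUTF16z;

    extern (Windows) int MessageBoxW(void* hwnd, const(wchar)* text,
                                     const(wchar)* caption, uint type);

    // Internal processing stays UTF-8; the conversion happens only here,
    // at the API boundary.
    void showMessage(string msg)
    {
        MessageBoxW(null, msg.toUTF16z(), "Example"w.ptr, 0);
    }
}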
 Secondly, the programmer *should* be able to use whatever type he decides
 is appropriate. If he wants to stick with native, he can do
Why? He/she can just convert to UTF-32 (dchar) whenever a better 
understanding of a character is needed. At least, that's what should be done 
anyway.<

Weren't you saying that the main point of having just one string type (the OS-native string) was to avoid unnecessary conversions? But now you're arguing that it's fine to do unnecessary conversions and to have multiple string types?
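(For reference, the on-demand dchar decoding being referred to is built into D's foreach; a minimal sketch:)

import std.stdio;

void main()
{
    string s = "naïve";  // stored as UTF-8 (char[])

    // foreach decodes to dchar on the fly; nothing forces the string
    // itself to be stored as UTF-32.
    foreach (dchar c; s)
        writefln("U+%04X", cast(uint) c);
}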
 You can have that easily:

 version(Windows)
     alias wstring tstring;
 else
     alias string tstring;

See that's my point. Nobody is going to do this unless the above is 
standardized by the language. Everybody will stick to something particular 
(either char or wchar).<

True enough. I don't have anything against having something like that in the std library as long as the others are still available too. Could be useful in a few cases. I do think having it *instead* of the three types is far too presumptuous, though.
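Such a standard-library addition would presumably amount to something like this (a sketch; tstring is just the hypothetical name used above):

// "tstring" is not a real Phobos name, only the hypothetical OS-native
// string type being discussed.
version (Windows)
    alias wstring tstring;   // UTF-16 to match the Windows API
else
    alias string tstring;    // UTF-8 elsewhere

// Code that really wants the native width can opt in...
tstring nativeGreeting = "hello";

// ...while char, wchar and dchar strings all remain available as before.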
Jun 07 2010