
digitalmars.D - Re: Wide characters support in D

reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
> It only generates code for the types that are actually needed. If, for
> instance, your program never uses anything except UTF-8, then only one
> version of the function will be made - the UTF-8 version. If you don't
> use every char type, then it doesn't generate it for every char type -
> just the ones you choose to use.

Not quite right. If we create system dynamic libraries, or dynamic libraries that are commonly used, we will have to compile every instance unless we want to burden the user with this. Otherwise, the same code will be duplicated in users' programs over and over again.

> That's not good. First of all, UTF-16 is a lousy encoding, it combines
> the worst of both UTF-8 and UTF-32: It's multibyte and non-word-aligned
> like UTF-8, but it still wastes a lot of space like UTF-32. So even if
> your OS uses it natively, it's still best to do most internal processing
> in either UTF-8 or UTF-32. (And with templated string functions, if the
> programmer actually does want to use the native type in the *rare* cases
> where he's making enough OS calls that it would actually matter, he can
> still do so.)

First of all, UTF-16 is not a lousy encoding. It requires 2 bytes for most characters (not such big wastage, especially if you consider other languages). Only for REALLY rare characters do you need 4 bytes, whereas UTF-8 will require 1 to 3 bytes for the same common characters, and also 4 bytes for the REALLY rare ones. In UTF-16 the surrogate pair is an exception, whereas in UTF-8 multibyte sequences are the rule (when something is an exception, it won't affect performance in most cases; when something is the rule, it will).

Finally, UTF-16 is used by a variety of systems/tools: Windows, Java, C#, Qt and many others. The developers of these systems chose UTF-16 even though some of them (e.g. Java, C#, Qt) were developed in the era of UTF-8.
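For concreteness, here is a minimal D sketch of the code-unit counts in question (the sample characters are illustrative only):

import std.stdio;

void main()
{
    // "é" (U+00E9) and "\U00010348" (a rare character outside the BMP),
    // stored in the three encodings D supports.
    string  s8  = "é\U00010348";   // UTF-8
    wstring s16 = "é\U00010348";   // UTF-16
    dstring s32 = "é\U00010348";   // UTF-32

    // .length counts code units, not characters.
    writeln(s8.length);   // 6: 'é' takes 2 bytes, the rare character takes 4
    writeln(s16.length);  // 3: 'é' is 1 code unit, the rare one a surrogate pair
    writeln(s32.length);  // 2: always one code unit per character
}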
> Secondly, the programmer *should* be able to use whatever type he
> decides is appropriate. If he wants to stick with native, he can do

Why? He/she can just convert to UTF-32 (dchar) whenever a better understanding of a character is needed. At least, that's what should be done anyway.

> You can have that easily:
>
> version(Windows)
>     alias wstring tstring;
> else
>     alias string tstring;

See, that's my point. Nobody is going to do this unless the above is standardized by the language. Everybody will stick to something particular (either char or wchar).

> With templated text functions, there is very little benefit to be gained
> from having a unified char. Just wouldn't serve any real

See my comment above about templates and dynamic libraries.

Ruslan
Jun 07 2010
next sibling parent Jesse Phillips <jessekphillips+D gmail.com> writes:
On Mon, 07 Jun 2010 19:26:02 -0700, Ruslan Nikolaev wrote:

 It only generates code for the types that are actually needed. If, for
 instance, your program never uses anything except UTF-8, then only one
 version of the function will be made - the UTF-8 version. If you don't
 use every char type, then it doesn't generate it for every char type -
 just the ones you choose to use.

Not quite right. If we create system dynamic libraries or dynamic libraries that are commonly used, we will have to compile every instance unless we want to burden the user with this. Otherwise, the same code will be duplicated in users' programs over and over again.

I think you really need to look more into what templates are and do. There is also going to be very little performance gain from using the "system type" for strings, considering that most of the work is not likely going to be in the system commands you mentioned, but within D itself.
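To make the template point concrete, a minimal sketch (the function is purely illustrative) of how D instantiates a templated text function only for the character types actually used:

import std.stdio;

// One generic implementation; the compiler instantiates it only for the
// character types a program (or a library's clients) actually uses.
size_t countSpaces(Char)(const(Char)[] text)
{
    size_t n;
    foreach (c; text)
        if (c == ' ')
            ++n;
    return n;
}

void main()
{
    writeln(countSpaces("a b c"));   // instantiates countSpaces!char
    writeln(countSpaces("a b c"w));  // instantiates countSpaces!wchar
    // No dchar instance is generated unless some call site asks for one.
}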
Jun 07 2010
prev sibling parent "Nick Sabalausky" <a a.a> writes:
"Ruslan Nikolaev" <nruslan_devel yahoo.com> wrote in message 
news:mailman.124.1275963971.24349.digitalmars-d puremagic.com...

Nick wrote:
 It only generates code for the types that are actually needed. If, for
 instance, your program never uses anything except UTF-8, then only one
 version of the function will be made - the UTF-8 version. If you don't
 use every char type, then it doesn't generate it for every char type -
 just the ones you choose to use.

Not quite right. If we create system dynamic libraries or dynamic libraries 
commonly used, we will have to compile every instance unless we want to 
burden the user with this. Otherwise, the same code will be duplicated in 
users' programs over and over again.<

That's a rather minor issue. I think you're overestimating the amount of bloat that comes from having three string types instead of one. The absolute worst-case scenario would be a library that contains nothing but text-processing functions. That would triple in size, but what's the biggest such lib you've ever seen anyway? And for most libs, only a fraction is going to be taken up by text processing, so the difference won't be particularly large. In fact, the difference would likely be dwarfed by the bloat incurred from all the other templated code (i.e., code that would be largely unaffected by the number of string types), and yes, *that* can get to be a problem, but it's an entirely separate one.
 That's not good. First of all, UTF-16 is a lousy encoding, it combines
 the worst of both UTF-8 and UTF-32: It's multibyte and non-word-aligned
 like UTF-8, but it still wastes a lot of space like UTF-32. So even if
 your OS uses it natively, it's still best to do most internal processing
 in either UTF-8 or UTF-32. (And with templated string functions, if the
 programmer actually does want to use the native type in the *rare* cases
 where he's making enough OS calls that it would actually matter, he can
 still do so.)

First of all, UTF-16 is not a lousy encoding. It requires 2 bytes for most 
characters (not such big wastage, especially if you consider other 
languages). Only for REALLY rare characters do you need 4 bytes, whereas 
UTF-8 will require 1 to 3 bytes for the same common characters, and also 4 
bytes for the REALLY rare ones. In UTF-16 the surrogate pair is an exception, 
whereas in UTF-8 multibyte sequences are the rule (when something is an 
exception, it won't affect performance in most cases; when something is the 
rule, it will).<

Maybe "lousy" is too strong a word, but aside from compatibility with other libs/software that use it (which I'll address separately), UTF-16 is not particularly useful compared to UTF-8 and UTF-32:

Non-latin-alphabet language, UTF-8 vs UTF-16: The real-world difference in sizes is minimal. But UTF-8 has some advantages: the nature of the encoding makes backwards-scanning cheaper and easier. Also, as Walter said, bugs in the handling of multi-code-unit characters become fairly obvious. Advantages of UTF-16: None.

Latin-alphabet language, UTF-8 vs UTF-16: All the same UTF-8 advantages for non-latin-alphabet languages still apply, plus there's a space savings: under UTF-8, *most* characters are going to be 1 byte. Yes, there will be the occasional 2+ byte character, but those are so much less common that the overhead compared to ASCII (I'm only using ASCII as a baseline here, for the sake of comparison) would only be around 0% to 15% depending on the language. UTF-16, however, has a consistent 100% overhead (slightly more when you count surrogate pairs, but I'll just leave it at 100%). So, depending on the language, UTF-16 would be around 70%-100% larger than UTF-8. That's not insignificant.

Any language, UTF-32 vs UTF-16: Using UTF-32 takes up extra space, but when that matters, UTF-8 already has the advantage over UTF-16 anyway, regardless of whether or not UTF-8 is providing a space savings (see above), so the question of UTF-32 vs UTF-16 becomes moot. The rest of the time, UTF-32 has these advantages: guaranteed one code unit per character, and a code-unit size that is faster on typical CPUs (which generally handle 32 bits faster than 8 or 16 bits). Advantages of UTF-16: None.

So compatibility with certain tools/libs is really the only reason ever to choose UTF-16.
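As a hedged sketch of the backwards-scanning point (this is not Phobos code, just an illustration of the bit pattern involved):

// UTF-8 continuation bytes always match the bit pattern 10xxxxxx, so
// stepping back to the start of the previous character is just a matter
// of skipping them.
size_t backUpOneChar(const(char)[] s, size_t i)
{
    assert(i > 0);
    do
    {
        --i;
    } while (i > 0 && (s[i] & 0xC0) == 0x80); // skip continuation bytes
    return i;
}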
Finally, UTF-16 is used by a variety of systems/tools: Windows, Java, C#, 
Qt and many others. Developers of these systems chose to use UTF-16 even 
though some of them (e.g. Java, C#, Qt) were developed in the era of UTF-8<

First of all, it's not exactly unheard of for big projects to make a sub-optimal decision. Secondly, Java and Windows adopted 16-bit encodings back when many people were still under the mistaken impression that this would allow them to hold any character in one code unit. If that had been true, then it would indeed have had at least certain advantages over UTF-8. But by the time the programming world at large knew better, it was too late for Java or Windows to re-evaluate the decision; they'd already jumped in with both feet. C# and .NET use UTF-16 because Windows does. I don't know about Qt, but judging by how long Wikipedia says it's been around, I'd say it's probably the same story.

As for choosing to use UTF-16 because of interfacing with other tools and libs that use it: That's certainly a good reason to use UTF-16. But it's about the only reason. And it's a big mistake to just assume that the overhead of converting to/from UTF-16 when crossing those API borders is always going to outweigh all other concerns: For instance, if you're writing an app that does a large amount of text processing on relatively small amounts of text and only deals a little bit with a UTF-16 API, then the overhead of operating on 16 bits at a time can easily outweigh the overhead from the UTF-16 <-> UTF-32 conversions. Or maybe the app you're writing is more memory-limited than speed-limited.

There are perfectly legitimate reasons to want to use an encoding other than the OS-native one. Why force those people to circumvent the type system to do it, especially in a language that's intended to be usable as a systems language? Just to potentially save a couple of megs on some .dll or .so?
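For example, keeping UTF-8 internally and converting only at the API border looks roughly like this in D (a sketch; the MessageBoxW declaration is written out here only for illustration, real code would take it from the Windows bindings):

version (Windows)
{
    import std.utf : toUTF16z;

    extern (Windows) int MessageBoxW(void* hwnd, const(wchar)* text,
                                     const(wchar)* caption, uint type);

    // Internal processing stays UTF-8; the conversion happens only here,
    // at the API boundary.
    void showMessage(string msg)
    {
        MessageBoxW(null, msg.toUTF16z(), "Example"w.ptr, 0);
    }
}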
 Secondly, the programmer *should* be able to use whatever type he decides
 is appropriate. If he wants to stick with native, he can do
Why? He/she can just convert to UTF-32 (dchar) whenever a better 
understanding of a character is needed. At least, that's what should be done 
anyway.<

Weren't you saying that the main point of having just one string type (the OS-native string) was to avoid unnecessary conversions? But now you're arguing that it's fine to do unnecessary conversions and to have multiple string types?
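(For reference, the on-demand dchar decoding being referred to is built into D's foreach; a minimal sketch:)

import std.stdio;

void main()
{
    string s = "naïve";  // stored as UTF-8 (char[])

    // foreach decodes to dchar on the fly; nothing forces the string
    // itself to be stored as UTF-32.
    foreach (dchar c; s)
        writefln("U+%04X", cast(uint) c);
}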
 You can have that easily:

 version(Windows)
     alias wstring tstring;
 else
     alias string tstring;

See that's my point. Nobody is going to do this unless the above is 
standardized by the language. Everybody will stick to something particular 
(either char or wchar).<

True enough. I don't have anything against having something like that in the std library as long as the others are still available too. Could be useful in a few cases. I do think having it *instead* of the three types is far too presumptuous, though.
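Such a standard-library addition would presumably amount to something like this (a sketch; tstring is just the hypothetical name used above):

// "tstring" is not a real Phobos name, only the hypothetical OS-native
// string type being discussed.
version (Windows)
    alias wstring tstring;   // UTF-16 to match the Windows API
else
    alias string tstring;    // UTF-8 elsewhere

// Code that really wants the native width can opt in...
tstring nativeGreeting = "hello";

// ...while char, wchar and dchar strings all remain available as before.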
Jun 07 2010