digitalmars.D - Why is utf8 the default in D?


Frank Benoit <keinfarbton googlemail.com> writes:

M$ and Java have chosen to use utf16 as their default Unicode character
encoding. I am sure, the decision was not made without good reasoning.

What are their arguments?
Why does D propagate utf8 as the default?

E.g.
Exception.msg
Object.toString()
new std.stream.File( char[] )

Apr 27 2009

Christopher Wright <dhasenan gmail.com> writes:

Frank Benoit wrote:
 M$ and Java have chosen to use utf16 as their default Unicode character
 encoding. I am sure, the decision was not made without good reasoning.
 
 What are their arguments?
 Why does D propagate utf8 as the default?

Unicode did not match ISO10646 when Java and Windows standardized on 
UTF-16. At the time, the choices were UTF-8 and UTF-16. UTF-16 made 
internationalization easier than UTF-8 with a relatively small overhead.

As for UTF-8 being the default in D, that's a library issue. I think 
Tango uses templates and overloads to allow you to use char[], wchar[], 
and dchar[] most of the time.
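
To illustrate the template approach, here is a minimal sketch in current D (D2 with Phobos, which is an assumption; the thread predates it). It is not Tango's actual code, and `containsChar` is a hypothetical name. One template body accepts all three string types because `foreach` with a `dchar` loop variable decodes code points from any of them.

```d
import std.stdio : writeln;
import std.traits : isSomeString;

// Hypothetical helper: one template body serves char[], wchar[] and dchar[].
bool containsChar(S)(S text, dchar needle)
    if (isSomeString!S)
{
    // foreach with a dchar loop variable decodes code points,
    // whatever the encoding of the underlying array is.
    foreach (dchar c; text)
    {
        if (c == needle)
            return true;
    }
    return false;
}

void main()
{
    writeln(containsChar("h\u00E9llo",  '\u00E9')); // UTF-8 input
    writeln(containsChar("h\u00E9llo"w, '\u00E9')); // UTF-16 input
    writeln(containsChar("h\u00E9llo"d, '\u00E9')); // UTF-32 input
}
```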

 E.g.
 Exception.msg
 Object.toString()

Object.toString and Exception.msg are meant to be used for debug output.

 new std.stream.File( char[] )

It should not be too hard to fix this; you can write a bug report asking 
for overloads.
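
Such overloads could look roughly like the sketch below, again in current D2 with Phobos. `openLabel` is a hypothetical stand-in for a function that only understands UTF-8; it is not part of std.stream. The wchar and dchar overloads simply transcode with `std.utf.toUTF8` and forward.

```d
import std.stdio : writeln;
import std.utf : toUTF8;

// Hypothetical stand-in for an API that only accepts UTF-8.
string openLabel(const(char)[] path)
{
    return "opening " ~ path.idup;
}

// Overloads that transcode to UTF-8 and forward to the version above.
string openLabel(const(wchar)[] path) { return openLabel(toUTF8(path)); }
string openLabel(const(dchar)[] path) { return openLabel(toUTF8(path)); }

void main()
{
    string  s = "caf\u00E9.txt";
    wstring w = "caf\u00E9.txt"w;
    dstring d = "caf\u00E9.txt"d;

    // All three calls end up in the UTF-8 implementation.
    writeln(openLabel(s));
    writeln(openLabel(w) == openLabel(s));
    writeln(openLabel(d) == openLabel(s));
}
```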

Apr 27 2009

Michel Fortin <michel.fortin michelf.com> writes:

On 2009-04-27 07:04:06 -0400, Frank Benoit <keinfarbton googlemail.com> said:

 M$ and Java have chosen to use utf16 as their default Unicode character
 encoding. I am sure, the decision was not made without good reasoning.
 
 What are their arguments?
 Why does D propagate utf8 as the default?
 
 E.g.
 Exception.msg
 Object.toString()
 new std.stream.File( char[] )

The argument at the time was that they were going to work directly with
Unicode code points, thus simplifying things. Then Unicode was extended
to cover even more characters, and at some point 16 bits became
insufficient. Surrogate pairs were added so that 16-bit strings could
represent the higher code points, turning 16-bit Unicode into the
variable-size character encoding now known as UTF-16.

So it turns out that those who made that choice in the early years of
Unicode made it for reasons that no longer exist. Today, variable-size
UTF-16 makes it as hard to calculate a string's length and do random
access in a string as UTF-8 does. In practice, many frameworks just
ignore the problem and happily count a surrogate pair as two
characters, as they always did, but that behaviour isn't exactly
correct.
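
A small sketch of the length problem, assuming current D2 (`codePoints` is a hypothetical helper). The `.length` property counts code units, so a character outside the 16-bit range reports a different length in each encoding, and only decoding yields the code point count.

```d
import std.stdio : writeln;

// Counts code points by letting foreach decode the string.
size_t codePoints(S)(S s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // U+1D11E (musical G clef) lies outside the 16-bit range.
    string  s8  = "\U0001D11E";
    wstring s16 = "\U0001D11E"w;
    dstring s32 = "\U0001D11E"d;

    writeln(s8.length);  // 4 UTF-8 code units
    writeln(s16.length); // 2 UTF-16 code units (a surrogate pair)
    writeln(s32.length); // 1 UTF-32 code unit

    // Decoding gives one code point in every case.
    writeln(codePoints(s8), " ", codePoints(s16), " ", codePoints(s32));
}
```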

To get the benefit those framework and language designers thought they'd
get at the time, we'd have to go with UTF-32, but then storing strings
becomes immensely wasteful. And that's not counting that most data
exchange these days has standardized on UTF-8; you'll rarely encounter
UTF-16 in the wild (and when you do, you have to take care of the
UTF-16 LE and BE variants), and UTF-32 is rarer still. Nor is it
counting that perhaps, one day, Unicode will grow again and fall outside
of its 32-bit range... although that may have to wait until we learn a
few extraterrestrial languages. :-)

So the D solution, which is to use UTF-8 everywhere while still
supporting string operations using UTF-16 and UTF-32, looks very good
to me. What I actually do is use UTF-8 everywhere, and sometimes, when I
need to easily manipulate characters, I use UTF-32. And I use UTF-16 for
dealing with APIs expecting it, but not for much else.
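
That workflow might look like the sketch below in current D2 with Phobos (`firstCodePoints` is a hypothetical helper, not a library function). The string is stored as UTF-8, converted to UTF-32 for per-character slicing, and converted to UTF-16 only where an API would want it.

```d
import std.stdio : writeln;
import std.utf : toUTF8, toUTF16, toUTF32;

// Hypothetical helper: keep the first n code points of a UTF-8 string.
string firstCodePoints(string s, size_t n)
{
    dstring wide = toUTF32(s);   // one array element per code point
    if (n > wide.length)
        n = wide.length;
    return toUTF8(wide[0 .. n]); // back to UTF-8 for storage
}

void main()
{
    string text = "a\u00F1\U0001D11Ez"; // 4 code points, 8 UTF-8 code units

    writeln(firstCodePoints(text, 3) == "a\u00F1\U0001D11E"); // true

    // UTF-16 only at the boundary with an API that expects it.
    wstring forApi = toUTF16(text);
    writeln(forApi.length); // 5 code units: the clef needs a surrogate pair
}
```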

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Apr 27 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Michel Fortin wrote:
 On 2009-04-27 07:04:06 -0400, Frank Benoit <keinfarbton googlemail.com> 
 said:
 
 M$ and Java have chosen to use utf16 as their default Unicode character
 encoding. I am sure, the decision was not made without good reasoning.

 What are their arguments?
 Why does D propagate utf8 as the default?

 E.g.
 Exception.msg
 Object.toString()
 new std.stream.File( char[] )

 
 The argument at the time was that they were going to work directly with
 Unicode code points, thus simplifying things. Then Unicode was extended
 to cover even more characters, and at some point 16 bits became
 insufficient. Surrogate pairs were added so that 16-bit strings could
 represent the higher code points, turning 16-bit Unicode into the
 variable-size character encoding now known as UTF-16.
 
 So it turns out that those who made that choice in the early years of
 Unicode made it for reasons that no longer exist. Today, variable-size
 UTF-16 makes it as hard to calculate a string's length and do random
 access in a string as UTF-8 does. In practice, many frameworks just
 ignore the problem and happily count a surrogate pair as two
 characters, as they always did, but that behaviour isn't exactly correct.
 
 To get the benefit those framework and language designers thought they'd
 get at the time, we'd have to go with UTF-32, but then storing strings
 becomes immensely wasteful. And that's not counting that most data
 exchange these days has standardized on UTF-8; you'll rarely encounter
 UTF-16 in the wild (and when you do, you have to take care of the
 UTF-16 LE and BE variants), and UTF-32 is rarer still. Nor is it
 counting that perhaps, one day, Unicode will grow again and fall outside
 of its 32-bit range... although that may have to wait until we learn a
 few extraterrestrial languages. :-)
 
 So the D solution, which is to use UTF-8 everywhere while still
 supporting string operations using UTF-16 and UTF-32, looks very good to
 me. What I actually do is use UTF-8 everywhere, and sometimes, when I
 need to easily manipulate characters, I use UTF-32. And I use UTF-16 for
 dealing with APIs expecting it, but not for much else.

Well put. I distinctly remember the hubbub around Java's UTF16 support 
that was solving all of strings' problems, followed by the embarrassed 
silence upon the introduction of UTF32.

Andrei

Apr 27 2009

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Mon, 27 Apr 2009 07:04:06 -0400, Frank Benoit  
<keinfarbton googlemail.com> wrote:

 M$ and Java have chosen to use utf16 as their default Unicode character
 encoding. I am sure, the decision was not made without good reasoning.

 What are their arguments?
 Why does D propagate utf8 as the default?

 E.g.
 Exception.msg
 Object.toString()
 new std.stream.File( char[] )

I would expect the answer was compatibility with C: UTF-8 is a seamless
superset of ASCII.
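
A rough sketch of that compatibility, assuming current D2 with Phobos and druntime's C bindings (`cLength` is a hypothetical helper). ASCII text keeps the same bytes in UTF-8, so a NUL-terminated D string can be handed straight to a C function such as `strlen`. Non-ASCII characters still pass through, although C then counts bytes rather than characters.

```d
import core.stdc.string : strlen;
import std.stdio : writeln;
import std.string : toStringz;

// Length in bytes as seen by C's strlen.
size_t cLength(string s)
{
    return strlen(toStringz(s)); // toStringz guarantees a terminating NUL
}

void main()
{
    writeln(cLength("hello"));      // 5: pure ASCII, one byte per character
    writeln(cLength("na\u00EFve")); // 6: U+00EF takes two bytes in UTF-8
    writeln(cast(ubyte) "A"[0]);    // 65: same byte value as in ASCII
}
```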

-Steve

Apr 27 2009
