
digitalmars.D - Re: Wide characters support in D

reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
 
 Maybe "lousy" is too strong a word, but aside from
 compatibility with other 
 libs/software that use it (which I'll address separately),
 UTF-16 is not 
 particularly useful compared to UTF-8 and UTF-32:

 

I tried to avoid commenting on this because I am afraid we'll stray away from the main point (which is not a discussion about which Unicode encoding is better). But in short I would say: "Not quite right". UTF-16, as already mentioned, is generally faster for non-Latin scripts (reading 2 bytes of aligned data takes the same time as reading 1 byte). Although I am not familiar with Asian languages, I believe that UTF-16 requires just 2 bytes instead of 3 for most of their symbols; that is one of the reasons they don't like UTF-8. UTF-32 doesn't have any advantage except for being fixed-length. It has a lot of unnecessary memory, cache, etc. overhead (the worst-case scenario for both UTF-8 and UTF-16), which is not justified for any language.
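For illustration, here is a minimal D sketch (nothing beyond the built-in string/wstring/dstring types; the byte counts assume the usual 1/2/4-byte code units):

import std.stdio;

void main()
{
    // The same three ideographs in D's three string types.
    string  s8  = "日本語";   // UTF-8  : immutable(char)[]
    wstring s16 = "日本語"w;  // UTF-16 : immutable(wchar)[]
    dstring s32 = "日本語"d;  // UTF-32 : immutable(dchar)[]

    // .length counts code units, not characters.
    writeln(s8.length);   // 9 code units ->  9 bytes in UTF-8
    writeln(s16.length);  // 3 code units ->  6 bytes in UTF-16
    writeln(s32.length);  // 3 code units -> 12 bytes in UTF-32
}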
 
 First of all, it's not exactly unheard of for big projects to make a sub-optimal decision.

I would say the decision was quite optimal for many reasons, including that "lousy programming" will not cause as many problems as it does in the case of UTF-8.
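To make the "lousy programming" point concrete, a tiny sketch of naive code-unit indexing, again using only the built-in string types: with UTF-8 it breaks on the very first non-ASCII letter, while with UTF-16 it keeps appearing to work for any BMP text.

import std.stdio;

void main()
{
    string  s8  = "héllo";   // UTF-8
    wstring s16 = "héllo"w;  // UTF-16

    // Naive indexing by code unit:
    writefln("%02X", cast(uint) s8[1]);  // C3 -- half of the two-byte 'é'
    writeln(s16[1]);                     // é  -- still a single code unit
}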
 
 Secondly, Java and Windows adopted 16-bit encodings back when many people were still under the mistaken impression that it would allow them to hold any character in one code-unit. If that had been true, then it

I doubt that it was the only reason. UTF-8 was already available before Windows NT was released, and it would have been much easier to use UTF-8 instead of ANSI than to create a parallel API. Nonetheless, UTF-16 was chosen. In addition, C# was released after UTF-16 had already become variable-length. I doubt that conversion overhead (which is small compared to the VM) was the main reason to preserve UTF-16. Concerning why I say that it's good to have conversion to UTF-32 (you asked somewhere): I think you misunderstood what I meant. It is a very common practice, and in fact required, to convert from both UTF-8 and UTF-16 to UTF-32 when you need to do character analysis (e.g. mbtowc() in C). In fact, that is the only place where UTF-32 is commonly used and useful.
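As a rough D analogue of that mbtowc() pattern, a sketch that decodes to UTF-32 on the fly for per-character analysis (std.uni.isAlpha here just stands in for whatever per-character test you need):

import std.stdio;
import std.uni : isAlpha;

void main()
{
    string s = "naïve 日本語";  // stored as UTF-8

    // foreach with a dchar loop variable decodes each code point to
    // UTF-32 on the fly -- roughly what mbtowc() does in C.
    foreach (dchar c; s)
        writefln("U+%04X  alpha=%s", cast(uint) c, isAlpha(c));
}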
Jun 07 2010
next sibling parent dennis luehring <dl.soluz gmx.net> writes:
Please use the "Reply" button.


Jun 08 2010
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Ruslan Nikolaev" <nruslan_devel yahoo.com> wrote in message 
news:mailman.128.1275979841.24349.digitalmars-d puremagic.com...
 Secondly, Java and Windows adopted 16-bit encodings back when many people were still under the mistaken impression that it would allow them to hold any character in one code-unit. If that had been true, then it

I doubt that it was the only reason. UTF-8 was already available before Windows NT was released, and it would have been much easier to use UTF-8 instead of ANSI than to create a parallel API. Nonetheless, UTF-16 was chosen.

I didn't say that was the only reason. Also, you've misunderstood my point:

Their reasoning at the time:
  8-bit: Multiple code-units for some characters
  16-bit: One code-unit per character
  Therefore, use 16-bit.

Reality:
  8-bit: Multiple code-units for some characters
  16-bit: Multiple code-units for some characters
  Therefore, old reasoning not necessarily still applicable.
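To illustrate that reality, a quick sketch with D's built-in array types (U+1D11E is just an arbitrary character outside the BMP):

import std.stdio;

void main()
{
    // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
    wstring w = "\U0001D11E"w;
    dstring d = "\U0001D11E"d;

    writeln(w.length);  // 2 -- a surrogate pair: two UTF-16 code units
    writeln(d.length);  // 1 -- one UTF-32 code unit
}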
 In addition, C# was released after UTF-16 had already become variable-length.

Right, like I said, C#/.NET use UTF-16 because that's what MS had already standardized on.
I doubt that conversion overhead (which is small compared to the VM) was the main reason to preserve UTF-16.

I never said anything about conversion overhead being a reason to preserve UTF-16.
 Concerning why I say that it's good to have conversion to UTF-32 (you asked somewhere):

 I think you misunderstood what I meant. It is a very common practice, and in fact required, to convert from both UTF-8 and UTF-16 to UTF-32 when you need to do character analysis (e.g. mbtowc() in C). In fact, that is the only place where UTF-32 is commonly used and useful.

I'm well aware why UTF-32 is useful. Earlier, you had started out saying that there should only be one string type, the OS-native type. Now you're changing your tune and saying that we do need multiple types.
Jun 08 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:huktq1$8tr$1 digitalmars.com...
 "Ruslan Nikolaev" <nruslan_devel yahoo.com> wrote in message 
 news:mailman.128.1275979841.24349.digitalmars-d puremagic.com...
 In addition, C# has been released already when UTF-16 became variable 
 length.

Right, like I said, C#/.NET use UTF-16 because that's what MS had already standardized on.

s/UTF-16/16-bit/ It's getting late and I'm starting to mix terminology...
Jun 08 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/08/2010 03:12 AM, Nick Sabalausky wrote:
 "Nick Sabalausky"<a a.a>  wrote in message
 news:huktq1$8tr$1 digitalmars.com...
 "Ruslan Nikolaev"<nruslan_devel yahoo.com>  wrote in message
 news:mailman.128.1275979841.24349.digitalmars-d puremagic.com...
 In addition, C# has been released already when UTF-16 became variable
 length.

Right, like I said, C#/.NET use UTF-16 because that's what MS had already standardized on.

s/UTF-16/16-bit/ It's getting late and I'm starting to mix terminology...

s/16-bit/UCS-2/ The story is that Windows standardized on UCS-2, which is the uniform 16-bit-per-character encoding that predates UTF-16. When UCS-2 turned out to be insufficient, it was extended to the variable-length UTF-16. As has been discussed, that has been quite unpleasant because a lot of code out there handles strings as if they were UCS-2.

Andrei
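P.S. A minimal sketch of that hangover, using only the built-in wstring type and std.range.walkLength: counting 16-bit units and actually decoding UTF-16 give different answers as soon as a surrogate pair shows up.

import std.stdio;
import std.range : walkLength;

void main()
{
    // One BMP character plus one character that needs a surrogate pair.
    wstring w = "a\U0001D11E"w;

    writeln(w.length);      // 3 -- what UCS-2-minded code sees (code units)
    writeln(w.walkLength);  // 2 -- actual characters (decoded code points)
}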
Jun 08 2010
parent "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:hul65q$o98$1 digitalmars.com...

s/16-bit/UCS-2/ The story is that Windows standardized on UCS-2, which is the uniform 16-bit-per-character encoding that predates UTF-16. When UCS-2 turned out to be insufficient, it was extended to the variable-length UTF-16. As has been discussed, that has been quite unpleasant because a lot of code out there handles strings as if they were UCS-2.

Ok, that's what I had thought, but then I started second-guessing, so I figured "s/UTF-16/16-bit/" was a safer claim than "s/UTF-16/UCS-2/".
Jun 08 2010