
digitalmars.D - Unicode and UTF similarities/differences?

reply Kramer <Kramer_member pathlink.com> writes:
I know Unicode and UTF are talked about a lot here, and with good reason, because
it sounds like a feature that makes D attractive in the portability realm of
I18N: being able to write source code once and compile it anywhere.  But as a newbie
to D and to Unicode and UTF, could anyone please explain the differences (or
similarities) between the two?

Thanks,
Kramer
Oct 01 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjk0hf$oeu$1 digitaldaemon.com>, Kramer says...
I know Unicode and UTF are talked about a lot here, and with good reason, because
it sounds like a feature that makes D attractive in the portability realm of
I18N: being able to write source code once and compile it anywhere.  But as a newbie
to D and to Unicode and UTF, could anyone please explain the differences (or
similarities) between the two?

Well, they are different kinds of objects. Unicode is a character set; UTF-16 is an encoding. Bear with me - I'll try to make that clearer.

A character set is a set of characters in which each character has a /number/ associated with it, called its "codepoint". For example, in the ASCII character set, the character 'A' has a codepoint of 65 (more usually written in hex, as 0x41). In the Unicode character set, 'A' also has a codepoint of 65, and the character '€' (not present in ASCII) has a codepoint of 8,364 (more usually written in hex as 0x20AC). Unicode characters are often written as U+ followed by their codepoint in hexadecimal. That is, U+20AC means the same thing as '€'.

Once upon a time, Unicode was going to be a sixteen-bit-wide character set. That is, there were going to be (at most) 65,536 characters in it. Thus, every Unicode string would fit comfortably into an array of 16-bit-wide words. Then things changed. Unicode grew too big. Suddenly, 65,536 characters wasn't going to be enough. But too many important real-life applications had come to rely on characters being 16 bits wide (for example: Java and Windows, to name a couple of biggies). Something had to be done.

That something was UTF-16. UTF-16 is a sneaky way of squeezing >65,535 characters into an array originally designed for 16-bit words. Unicode characters with codepoints <0x10000 still occupy only one word; Unicode characters with codepoints >=0x10000 now occupy two words. (A special range of otherwise unused codepoints makes this possible.)

In general, an "encoding" is a bidirectional mapping which maps each codepoint to an array of fixed-width objects called "code units". How wide is a code unit? Well, it depends on the encoding. UTF-8 code units are 8 bits wide; UTF-16 code units are 16 bits wide; and UTF-32 code units are 32 bits wide. So UTF-16 is a mapping from Unicode codepoints to arrays of 16-bit-wide code units.
For example, the codepoint 0x10000 maps (in UTF-16) to the array [ 0xD800, 0xDC00 ].

You can learn all about this in much more detail here: http://www.unicode.org/faq/utf_bom.html

Hope that helps
Arcane Jill
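The thread is about D, but any Unicode-aware language shows the same mapping. Here's a minimal Python sketch of the codepoint/code-unit distinction Jill describes, chosen purely as an illustration because Python's string type makes the encodings explicit:

```python
# Codepoints are numbers; an encoding maps each one to code units.
euro = "\u20ac"                      # U+20AC, the euro sign
assert ord(euro) == 0x20AC           # its codepoint, 8,364

# UTF-8 uses 8-bit code units: U+20AC takes three of them.
assert euro.encode("utf-8") == b"\xe2\x82\xac"

# UTF-16 uses 16-bit code units: codepoints < 0x10000 take one...
assert euro.encode("utf-16-be") == b"\x20\xac"

# ...and codepoints >= 0x10000 take two (a surrogate pair) -
# exactly the [ 0xD800, 0xDC00 ] example above.
assert "\U00010000".encode("utf-16-be") == b"\xd8\x00\xdc\x00"
```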
Oct 01 2004
next sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

<snip>
 In general, an "encoding" is a bidirectional mapping

Bidirectional? Shouldn't the other way be a "decoding"?
 which maps each codepoint to an array of fixed-width objects called 
 "code units".

What character set are ISO-8859-x et al encodings of, for that matter? Or do certain web browsers/news clients say "encoding" when half the time they really mean "character set"? (The two names could come together, if each is its own character set, encoded by the identity function.)

Stewart.
Oct 04 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjrg2e$2eg5$1 digitaldaemon.com>, Stewart Gordon says...
Bidirectional?  Shouldn't the other way be a "decoding"?

Hehe. Then maybe I should have said "reversible"? Or maybe I should have been defining "encoding scheme" rather than "encoding"? Anyway, I don't think it matters too much; it was just a rough explanation. Anyone who wants the details should head over to the glossary on the Unicode web site.
What character set are ISO-8859-x et al encodings of, for that matter?

Themselves. All 8-bit-wide character sets are encoded trivially: each codepoint is its own single code unit.
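A quick Python sketch of that triviality, taking ISO-8859-1 (Latin-1) as the example (Python just for illustration):

```python
# Latin-1 encodes trivially: each of its 256 codepoints maps to the
# single 8-bit code unit with the same numeric value.
data = bytes(range(256))
text = data.decode("latin-1")
assert [ord(c) for c in text] == list(range(256))   # identity mapping
assert text.encode("latin-1") == data               # and thus reversible
```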
Or do certain web browsers/news clients say "encoding" when half the 
time they really mean "character set"?

Yeah, it's one of those historical accidents - like when you see in an HTTP header:

#    Content-Type: text/html; charset=UTF-8

it should probably really say "encoding", but that's just how it ended up.
(The two names could come together, if each is its own character set, 
encoded by the identity function.)

Yeah. As you quite rightly point out, with 8-bit character sets, there's basically no difference. Jill
Oct 04 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjrg2e$2eg5$1 digitaldaemon.com>, Stewart Gordon says...

What character set are ISO-8859-x et al encodings of, for that matter?
Or do certain web browsers/news clients say "encoding" when half the 
time they really mean "character set"?

Another way of looking at it (which certainly works from the point of view of web browsers and web pages) is that the character set is Unicode, and that the 8-bit character sets may be viewed as encodings of Unicode, so that (for example), in WINDOWS-1252, the Unicode character U+20AC is encoded as 0x80.

Of course, this flies in the face of my earlier statement that encodings are reversible, but I think I was wrong on that point. (An encoding can only be reversible if it can losslessly encode the entirety of its character set. The UTFs are all reversible.)

Fortunately, HTML (and XML) gives us a workaround, because of course even in Latin-1, you can still use:

#    &#x0410;

to get Cyrillic uppercase A.

Jill

PS. This is probably all too vague to go in a WIKI. Maybe it would be better if the WIKI just directed people to the glossary on the Unicode web site. It's a bit wordier, but a lot more accurate.
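For the concrete-minded, the WINDOWS-1252 example above can be checked in a couple of lines of Python (illustrative only; `cp1252` is Python's name for that codec):

```python
# Viewed as an encoding of Unicode, WINDOWS-1252 is not the identity
# mapping: the euro sign U+20AC comes out as the single byte 0x80.
assert "\u20ac".encode("cp1252") == b"\x80"
assert b"\x80".decode("cp1252") == "\u20ac"
```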
Oct 04 2004
next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cjtfkv$2g3b$1 digitaldaemon.com...
 PS. This is probably all too vague to go in a WIKI. Maybe it would be

 the WIKI just directed people to the glossary on the Unicode web site.

 bit wordier, but a lot more accurate.

I think it should go into the Wiki because:

1) It repeatedly comes up with regards to D, because a language that supports UTF so thoroughly is new and people are not familiar with the UTF issues yet.

2) You write good explanations. They'd be easier to find in the Wiki than on the n.g.

3) I think people would be more comfortable to start with the nice overviews you write. The wordier, pedantic explanations can come later.

4) The Wiki isn't like a book where one has to be ruthless in maintaining tight focus. The D Wiki is free to explore related topics to whatever depth is interesting to D programmers.
Oct 05 2004
parent J C Calvarese <jcc7 cox.net> writes:
In article <cjuilc$cju$1 digitaldaemon.com>, Walter says...
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cjtfkv$2g3b$1 digitaldaemon.com...
 PS. This is probably all too vague to go in a WIKI. Maybe it would be

 the WIKI just directed people to the glossary on the Unicode web site.

 bit wordier, but a lot more accurate.

I think it should go into the Wiki because:

1) It repeatedly comes up with regards to D, because a language that supports UTF so thoroughly is new and people are not familiar with the UTF issues yet.

2) You write good explanations. They'd be easier to find in the Wiki than on the n.g.

3) I think people would be more comfortable to start with the nice overviews you write. The wordier, pedantic explanations can come later.

4) The Wiki isn't like a book where one has to be ruthless in maintaining tight focus. The D Wiki is free to explore related topics to whatever depth is interesting to D programmers.

I agree. I added it to http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues (with a link back to AJ's post). jcc7
Oct 05 2004
prev sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
<snip>
 (An encoding can only be reversible if it can losslessly encode the entirety
of its character set. The
 UTFs are all reversible).

So it would be able to map an arbitrary Unicode string to a string in the destination code, as long as there are substitution rules for Unicode characters that become unavailable?
 Fortunately, HTML (and XML) gives us a workaround,
 because of course even in Latin-1, you can still use:
 
 #    &#x0410;
 
 to get Cyrillic uppercase A.

AIUI, &# codes are supposed to be Unicode, regardless of the character set in which the HTML file is actually encoded. Just like character escapes in D are independent of the source encoding. Stewart.
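A quick Python check of that point, using the stdlib html module to resolve the reference (a sketch, not anything D- or browser-specific):

```python
import html

# A numeric character reference denotes a Unicode codepoint, no matter
# what encoding the surrounding page happens to use.
page_bytes = b"&#x0410;"                 # as found in, say, a Latin-1 page
ref = page_bytes.decode("latin-1")
assert html.unescape(ref) == "\u0410"    # Cyrillic uppercase A
```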
Oct 05 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjujiv$dmb$1 digitaldaemon.com>, Stewart Gordon says...
Arcane Jill wrote:
<snip>
 (An encoding can only be reversible if it can losslessly encode the entirety
of its character set. The
 UTFs are all reversible).

So it would be able to map an arbitrary Unicode string to a string in the destination code, as long as there are substitution rules for Unicode characters that become unavailable?

Well, usually there's some sort of default replacement character - like if you're converting from a Russian character set to a Latin one, you sometimes end up replacing some of the characters with '?'. That would be non-reversible. But on the other hand, if you instead replaced missing characters with "&#x----;" (with Unicode codepoint inserted) then it suddenly becomes reversible. I'm not sure if that's cheating, but it works well enough for the internet.
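Python's codec machinery happens to offer both behaviours, which makes the difference easy to demonstrate (Python purely as illustration; the Russian word below is "Привет"):

```python
import html

russian = "\u041f\u0440\u0438\u0432\u0435\u0442"     # "Привет"

# Substituting '?' for unmappable characters is non-reversible...
assert russian.encode("latin-1", errors="replace") == b"??????"

# ...but substituting numeric character references keeps the data:
escaped = russian.encode("latin-1", errors="xmlcharrefreplace")
assert escaped == b"&#1055;&#1088;&#1080;&#1074;&#1077;&#1090;"
assert html.unescape(escaped.decode("latin-1")) == russian
```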
AIUI, &# codes are supposed to be Unicode, regardless of the character 
set in which the HTML file is actually encoded.

Yes, that's spot on. (Although try telling that to Microsoft, who insist on displaying &#128; as '€' in Internet Explorer! But yes - you're right; Microsoft is wrong.)

Jill
Oct 05 2004
parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

 In article <cjujiv$dmb$1 digitaldaemon.com>, Stewart Gordon says...
 
 Arcane Jill wrote: <snip>
 
 (An encoding can only be reversible if it can losslessly encode 
 the entirety of its character set. The UTFs are all reversible).

So it would be able to map an arbitrary Unicode string to a string in the destination code, as long as there are substitution rules for Unicode characters that become unavailable?

Well, usually there's some sort of default replacement character - like if you're converting from a Russian character set to a Latin one, you sometimes end up replacing some of the characters with '?'. That would be non-reversible.

Of course, from an implementation point of view, there's the option of throwing an exception for untranslatable characters....
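That's the default in Python's codecs, for what it's worth - a two-line illustration of the exception-throwing option:

```python
# With the default errors="strict", untranslatable characters raise
# an exception instead of being silently substituted.
raised = False
try:
    "\u0410".encode("latin-1")       # Cyrillic has no Latin-1 byte
except UnicodeEncodeError:
    raised = True
assert raised
```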
 But on the other hand, if you instead replaced missing characters 
 with "&#x----;" (with Unicode codepoint inserted) then it suddenly 
 becomes reversible. I'm not sure if that's cheating, but it works 
 well enough for the internet.

As long as at least one of the characters '&', '#', 'x', ';' is also encoded.
 AIUI, &# codes are supposed to be Unicode, regardless of the 
 character set in which the HTML file is actually encoded.

Yes, that's spot on. (Although try telling that to Microsoft, who insist on displaying &#128; as '€' in Internet Explorer! But yes - you're right; Microsoft is wrong.)

Quite a few browsers copy M$ in this respect. Even on this Mac, IE, Safari and Mozilla (which is meant to be standards compliant!) all do it. Stewart.
Oct 06 2004
prev sibling parent "Walter" <newshound digitalmars.com> writes:
I think you should add this to the D wiki!
Oct 04 2004