
digitalmars.D - Unicode and UTF similarities/differences?

reply Kramer <Kramer_member pathlink.com> writes:
I know Unicode and UTF are talked about a lot here, and with good reason, because
it sounds like a feature that makes D attractive in the portability realm of
I18N: being able to write source code once and compile it anywhere.  But as a newbie
to D and to Unicode and UTF, could anyone please explain the differences (or
similarities) between the two?

Thanks,
Kramer
Oct 01 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjk0hf$oeu$1 digitaldaemon.com>, Kramer says...
I know Unicode and UTF are talked about a lot here, and with good reason, because
it sounds like a feature that makes D attractive in the portability realm of
I18N: being able to write source code once and compile it anywhere.  But as a newbie
to D and to Unicode and UTF, could anyone please explain the differences (or
similarities) between the two?

Well, they are different kinds of objects. Unicode is a character set; UTF-16 is an encoding. Bear with me - I'll try to make that clearer.

A character set is a set of characters in which each character has a /number/ associated with it, called its "codepoint". For example, in the ASCII character set, the character 'A' has a codepoint of 65 (more usually written in hex, as 0x41). In the Unicode character set, 'A' also has a codepoint of 65, and the character '€' (not present in ASCII) has a codepoint of 8,364 (more usually written in hex as 0x20AC). Unicode characters are often written as U+ followed by their codepoint in hexadecimal. That is, U+20AC means the same thing as '€'.

Once upon a time, Unicode was going to be a sixteen-bit-wide character set. That is, there were going to be (at most) 65,536 characters in it. Thus, every Unicode string would fit comfortably into an array of 16-bit-wide words. Then things changed. Unicode grew too big. Suddenly, 65,536 characters wasn't going to be enough. But too many important real-life applications had come to rely on characters being 16 bits wide (for example: Java and Windows, to name a couple of biggies). Something had to be done.

That something was UTF-16. UTF-16 is a sneaky way of squeezing >65,535 characters into an array originally designed for 16-bit words. Unicode characters with codepoints <0x10000 still occupy only one word; Unicode characters with codepoints >=0x10000 now occupy two words. (A special range of otherwise unused codepoints makes this possible.)

In general, an "encoding" is a bidirectional mapping which maps each codepoint to an array of fixed-width objects called "code units". How wide is a code unit? Well, it depends on the encoding. UTF-8 code units are 8 bits wide; UTF-16 code units are 16 bits wide; and UTF-32 code units are 32 bits wide. So UTF-16 is a mapping from Unicode codepoints to arrays of 16-bit-wide code units.
For example, the codepoint 0x10000 maps (in UTF-16) to the array [ 0xD800, 0xDC00 ].

You can learn all about this in much more detail here: http://www.unicode.org/faq/utf_bom.html

Hope that helps
Arcane Jill
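The thread is about D, but any Unicode-aware language shows the same mapping. Here's a minimal Python sketch of the codepoint/code-unit distinction Jill describes, chosen purely as an illustration because Python's string type makes the encodings explicit:

```python
# Codepoints are numbers; an encoding maps each one to code units.
euro = "\u20ac"                      # U+20AC, the euro sign
assert ord(euro) == 0x20AC           # its codepoint, 8,364

# UTF-8 uses 8-bit code units: U+20AC takes three of them.
assert euro.encode("utf-8") == b"\xe2\x82\xac"

# UTF-16 uses 16-bit code units: codepoints < 0x10000 take one...
assert euro.encode("utf-16-be") == b"\x20\xac"

# ...and codepoints >= 0x10000 take two (a surrogate pair) -
# exactly the [ 0xD800, 0xDC00 ] example above.
assert "\U00010000".encode("utf-16-be") == b"\xd8\x00\xdc\x00"
```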
Oct 01 2004
next sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

<snip>
 In general, an "encoding" is a bidirectional mapping

Bidirectional? Shouldn't the other way be a "decoding"?
 which maps each codepoint to an array of fixed-width objects called 
 "code units".

What character set are ISO-8859-x et al encodings of, for that matter? Or do certain web browsers/news clients say "encoding" when half the time they really mean "character set"? (The two names could come together, if each is its own character set, encoded by the identity function.)

Stewart.
Oct 04 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjrg2e$2eg5$1 digitaldaemon.com>, Stewart Gordon says...
Bidirectional?  Shouldn't the other way be a "decoding"?

Hehe. Then maybe I should have said "reversible"? Or maybe I should have been defining "encoding scheme" rather than "encoding"? Anyway, I don't think it matters too much; it was just a rough explanation. Anyone who wants the details should head over to the glossary on the Unicode web site.
What character set are ISO-8859-x et al encodings of, for that matter?

Themselves. All 8-bit-wide character sets are encoded trivially: each codepoint is its own single code unit.
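A quick Python sketch of that triviality, taking ISO-8859-1 (Latin-1) as the example (Python just for illustration):

```python
# Latin-1 encodes trivially: each of its 256 codepoints maps to the
# single 8-bit code unit with the same numeric value.
data = bytes(range(256))
text = data.decode("latin-1")
assert [ord(c) for c in text] == list(range(256))   # identity mapping
assert text.encode("latin-1") == data               # and thus reversible
```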
Or do certain web browsers/news clients say "encoding" when half the 
time they really mean "character set"?

Yeah, it's one of those historical accidents - like when you see in an HTTP header:

#    Content-Type: text/html; charset=UTF-8

it should probably really say "encoding", but that's just how it ended up.
(The two names could come together, if each is its own character set, 
encoded by the identity function.)

Yeah. As you quite rightly point out, with 8-bit character sets, there's basically no difference. Jill
Oct 04 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjrg2e$2eg5$1 digitaldaemon.com>, Stewart Gordon says...

What character set are ISO-8859-x et al encodings of, for that matter?
Or do certain web browsers/news clients say "encoding" when half the 
time they really mean "character set"?

Another way of looking at it (which certainly works from the point of view of web browsers and web pages) is that the character set is Unicode, and that the 8-bit character sets may be viewed as encodings of Unicode, so that (for example), in WINDOWS-1252, the Unicode character U+20AC is encoded as 0x80.

Of course, this flies in the face of my earlier statement that encodings are reversible, but I think I was wrong on that point. (An encoding can only be reversible if it can losslessly encode the entirety of its character set. The UTFs are all reversible.)

Fortunately, HTML (and XML) gives us a workaround, because of course even in Latin-1, you can still use:

#    &#x0410;

to get Cyrillic uppercase A.

Jill

PS. This is probably all too vague to go in a WIKI. Maybe it would be better if the WIKI just directed people to the glossary on the Unicode web site. It's a bit wordier, but a lot more accurate.
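For the concrete-minded, the WINDOWS-1252 example above can be checked in a couple of lines of Python (illustrative only; `cp1252` is Python's name for that codec):

```python
# Viewed as an encoding of Unicode, WINDOWS-1252 is not the identity
# mapping: the euro sign U+20AC comes out as the single byte 0x80.
assert "\u20ac".encode("cp1252") == b"\x80"
assert b"\x80".decode("cp1252") == "\u20ac"
```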
Oct 04 2004
next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cjtfkv$2g3b$1 digitaldaemon.com...
 PS. This is probably all too vague to go in a WIKI. Maybe it would be

 the WIKI just directed people to the glossary on the Unicode web site.

 bit wordier, but a lot more accurate.

I think it should go into the Wiki because:

1) It repeatedly comes up with regards to D, because a language that supports UTF so thoroughly is new and people are not familiar with the UTF issues yet.

2) You write good explanations. They'd be easier to find in the Wiki than on the n.g.

3) I think people would be more comfortable to start with the nice overviews you write. The wordier, pedantic explanations can come later.

4) The Wiki isn't like a book where one has to be ruthless in maintaining tight focus. The D Wiki is free to explore related topics to whatever depth is interesting to D programmers.
Oct 05 2004
parent J C Calvarese <jcc7 cox.net> writes:
In article <cjuilc$cju$1 digitaldaemon.com>, Walter says...
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cjtfkv$2g3b$1 digitaldaemon.com...
 PS. This is probably all too vague to go in a WIKI. Maybe it would be

 the WIKI just directed people to the glossary on the Unicode web site.

 bit wordier, but a lot more accurate.

I think it should go into the Wiki because:

1) It repeatedly comes up with regards to D, because a language that supports UTF so thoroughly is new and people are not familiar with the UTF issues yet.

2) You write good explanations. They'd be easier to find in the Wiki than on the n.g.

3) I think people would be more comfortable to start with the nice overviews you write. The wordier, pedantic explanations can come later.

4) The Wiki isn't like a book where one has to be ruthless in maintaining tight focus. The D Wiki is free to explore related topics to whatever depth is interesting to D programmers.

I agree. I added it to http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues (with a link back to AJ's post). jcc7
Oct 05 2004
prev sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
<snip>
 (An encoding can only be reversible if it can losslessly encode the entirety
of its character set. The
 UTFs are all reversible).

So it would be able to map an arbitrary Unicode string to a string in the destination code, as long as there are substitution rules for Unicode characters that become unavailable?
 Fortunately, HTML (and XML) gives us a workaround,
 because of course even in Latin-1, you can still use:
 
 #    &#x0410;
 
 to get Cyrillic uppercase A.

AIUI, &# codes are supposed to be Unicode, regardless of the character set in which the HTML file is actually encoded. Just like character escapes in D are independent of the source encoding. Stewart.
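A quick Python check of that point, using the stdlib html module to resolve the reference (a sketch, not anything D- or browser-specific):

```python
import html

# A numeric character reference denotes a Unicode codepoint, no matter
# what encoding the surrounding page happens to use.
page_bytes = b"&#x0410;"                 # as found in, say, a Latin-1 page
ref = page_bytes.decode("latin-1")
assert html.unescape(ref) == "\u0410"    # Cyrillic uppercase A
```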
Oct 05 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cjujiv$dmb$1 digitaldaemon.com>, Stewart Gordon says...
Arcane Jill wrote:
<snip>
 (An encoding can only be reversible if it can losslessly encode the entirety
of its character set. The
 UTFs are all reversible).

So it would be able to map an arbitrary Unicode string to a string in the destination code, as long as there are substitution rules for Unicode characters that become unavailable?

Well, usually there's some sort of default replacement character - like if you're converting from a Russian character set to a Latin one, you sometimes end up replacing some of the characters with '?'. That would be non-reversible. But on the other hand, if you instead replaced missing characters with "&#x----;" (with Unicode codepoint inserted) then it suddenly becomes reversible. I'm not sure if that's cheating, but it works well enough for the internet.
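Python's codec machinery happens to offer both behaviours, which makes the difference easy to demonstrate (Python purely as illustration; the Russian word below is "Привет"):

```python
import html

russian = "\u041f\u0440\u0438\u0432\u0435\u0442"     # "Привет"

# Substituting '?' for unmappable characters is non-reversible...
assert russian.encode("latin-1", errors="replace") == b"??????"

# ...but substituting numeric character references keeps the data:
escaped = russian.encode("latin-1", errors="xmlcharrefreplace")
assert escaped == b"&#1055;&#1088;&#1080;&#1074;&#1077;&#1090;"
assert html.unescape(escaped.decode("latin-1")) == russian
```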
AIUI, &# codes are supposed to be Unicode, regardless of the character 
set in which the HTML file is actually encoded.

Yes, that's spot on. (Although try telling that to Microsoft, who insist on displaying &#128; as '€' in Internet Explorer! But yes - you're right; Microsoft is wrong.)

Jill
Oct 05 2004
parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

 In article <cjujiv$dmb$1 digitaldaemon.com>, Stewart Gordon says...
 
 Arcane Jill wrote: <snip>
 
 (An encoding can only be reversible if it can losslessly encode 
 the entirety of its character set. The UTFs are all reversible).

So it would be able to map an arbitrary Unicode string to a string in the destination code, as long as there are substitution rules for Unicode characters that become unavailable?

Well, usually there's some sort of default replacement character - like if you're converting from a Russian character set to a Latin one, you sometimes end up replacing some of the characters with '?'. That would be non-reversible.

Of course, from an implementation point of view, there's the option of throwing an exception for untranslatable characters....
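That's the default in Python's codecs, for what it's worth - a two-line illustration of the exception-throwing option:

```python
# With the default errors="strict", untranslatable characters raise
# an exception instead of being silently substituted.
raised = False
try:
    "\u0410".encode("latin-1")       # Cyrillic has no Latin-1 byte
except UnicodeEncodeError:
    raised = True
assert raised
```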
 But on the other hand, if you instead replaced missing characters 
 with "&#x----;" (with Unicode codepoint inserted) then it suddenly 
 becomes reversible. I'm not sure if that's cheating, but it works 
 well enough for the internet.

As long as at least one of the characters '&', '#', 'x', ';' is also encoded.
 AIUI, &# codes are supposed to be Unicode, regardless of the 
 character set in which the HTML file is actually encoded.

Yes, that's spot on. (Although try telling that to Microsoft, who insist on displaying &#128; as '€' in Internet Explorer! But yes - you're right; Microsoft is wrong.)

Quite a few browsers copy M$ in this respect. Even on this Mac, IE, Safari and Mozilla (which is meant to be standards compliant!) all do it. Stewart.
Oct 06 2004
prev sibling parent "Walter" <newshound digitalmars.com> writes:
I think you should add this to the D wiki!
Oct 04 2004