
digitalmars.D.learn - Why can't D store all UTF-8 code units in char type? (not really

reply thebluepandabear <therealbluepandabear protonmail.com> writes:
Hello (noob question),

I am reading a book about D by Ali, and he talks about the 
different char types: char, wchar, and dchar. He says that char 
stores a UTF-8 code unit, wchar stores a UTF-16 code unit, and 
dchar stores a UTF-32 code unit, this makes sense.

He then goes on to say that:

"Contrary to some other programming languages, characters in D 
may consist of
different numbers of bytes. For example, because 'Ğ' must be 
represented by at
least 2 bytes in Unicode, it doesn't fit in a variable of type 
char. On the other
hand, because dchar consists of 4 bytes, it can hold any Unicode 
character."

It's his explanation as to why this code doesn't compile even 
though Ğ is a UTF-8 code unit:

```D
char utf8 = 'Ğ';
```

But I don't really understand this? What does it mean that it 
'must be represented by at least 2 bytes'? If I do `char.sizeof` 
it's 2 bytes so I am confused why it doesn't fit, I don't think 
it was explained well in the book.

Any help would be appreciated.
Dec 02 2022
next sibling parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Friday, 2 December 2022 at 21:18:44 UTC, thebluepandabear 
wrote:
 It's his explanation as to why this code doesn't compile even 
 though Ğ is a UTF-8 code unit:
That's not a utf-8 code unit. A utf-8 code unit is just a single byte with a particular interpretation.
 If I do `char.sizeof` it's 2 bytes
Are you sure about that? `char.sizeof` is 1. A char is just a single byte. The Ğ code point needs more than one of those bytes to encode (note code units and code points are two different things: a code point is an abstract idea, like a number, and a code unit is one of the bytes that, when combined, recreate that number).
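(A minimal check of those sizes, using nothing beyond the built-in `.sizeof` properties:)

```D
void main()
{
    // Each size is that of one *code unit* of the corresponding
    // encoding, not of one "character".
    static assert(char.sizeof == 1);  // UTF-8 code unit: 1 byte
    static assert(wchar.sizeof == 2); // UTF-16 code unit: 2 bytes
    static assert(dchar.sizeof == 4); // UTF-32 code unit: 4 bytes
}
```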
Dec 02 2022
parent reply thebluepandabear <therealbluepandabear protonmail.com> writes:
 That's not a utf-8 code unit.
Hm, that specifically might not be. The thing is, I thought a UTF-8 code unit can store 1-4 bytes for each character, so how is it right to say that `char` is a UTF-8 code unit? It seems like it's just an ASCII code unit.
Dec 02 2022
parent ag0aep6g <anonymous example.com> writes:
On 02.12.22 22:39, thebluepandabear wrote:
 Hm, that specifically might not be. The thing is, I thought a UTF-8 code 
 unit can store 1-4 bytes for each character, so how is it right to say 
 that `char` is a utf-8 code unit, it seems like it's just an ASCII code 
 unit.
You're simply not using the term "code unit" correctly. A UTF-8 code unit is just one of those 1-4 bytes. Together they form a "sequence" which encodes a "code point".

And all (true) ASCII code units are indeed also valid UTF-8 code units, because UTF-8 is a superset of ASCII. If you save a file as ASCII and open it as UTF-8, that works. But it doesn't work the other way around.
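The superset relation can be seen from D with `std.utf.validate`; a small sketch (the bogus byte here is just an arbitrary lone UTF-8 lead byte, chosen for illustration):

```D
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // Pure ASCII is already valid UTF-8: validate() accepts it as-is.
    validate("hello");

    // The reverse fails: a lone lead byte (0xC4) with no continuation
    // byte after it is not a valid UTF-8 sequence.
    ubyte[] raw = [0xC4];
    auto bogus = cast(string) raw;
    assertThrown!UTFException(validate(bogus));
}
```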
Dec 03 2022
prev sibling next sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
char is always UTF-8 codepoint and therefore exactly 1 byte.

wchar is always UTF-16 codepoint and therefore exactly 2 bytes.

dchar is always UTF-32 codepoint and therefore exactly 4 bytes.

'Ğ' has the value U+011E which is a lot larger than what 1 byte can 
hold. You need 2 chars or 1 wchar/dchar.

https://unicode-table.com/en/011E/
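The "2 chars or 1 wchar/dchar" point can be checked directly from the literal types; a small sketch:

```D
void main()
{
    // One code point, three encodings:
    assert("Ğ".length == 2);  // string  (UTF-8):  two 1-byte code units
    assert("Ğ"w.length == 1); // wstring (UTF-16): one 2-byte code unit
    assert("Ğ"d.length == 1); // dstring (UTF-32): one 4-byte code unit

    // 'Ğ' (U+011E) fits in a wchar or dchar, but not in a char:
    wchar w = 'Ğ'; // OK
    dchar d = 'Ğ'; // OK
    // char c = 'Ğ'; // compile error: 0x11E does not fit in 8 bits
}
```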
Dec 02 2022
parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Friday, 2 December 2022 at 21:26:40 UTC, rikki cattermole 
wrote:
 char is always UTF-8 codepoint and therefore exactly 1 byte.
 wchar is always UTF-16 codepoint and therefore exactly 2 bytes.
 dchar is always UTF-32 codepoint and therefore exactly 4 bytes;
You mean "code unit". There's no such thing as a utf-8/16/32 codepoint. A codepoint is a more abstract concept that is encoded in one of the utf formats.
Dec 02 2022
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 03/12/2022 10:35 AM, Adam D Ruppe wrote:
 On Friday, 2 December 2022 at 21:26:40 UTC, rikki cattermole wrote:
 char is always UTF-8 codepoint and therefore exactly 1 byte.
 wchar is always UTF-16 codepoint and therefore exactly 2 bytes.
 dchar is always UTF-32 codepoint and therefore exactly 4 bytes;
You mean "code unit". There's no such thing as a utf-8/16/32 codepoint. A codepoint is a more abstract concept that is encoded in one of the utf formats.
Yeah you're right, it's code unit, not code point.
Dec 02 2022
parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 12/2/22 13:44, rikki cattermole wrote:

 Yeah you're right, its code unit not code point.
This proves yet again how badly chosen those names are. I must look it up every time before using one or the other.

So they are both "code"? One is a "unit" and the other is a "point"? Sheesh!

Ali
Dec 02 2022
next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 03/12/2022 11:32 AM, Ali Çehreli wrote:
 On 12/2/22 13:44, rikki cattermole wrote:
 
  > Yeah you're right, its code unit not code point.
 
 This proves yet again how badly chosen those names are. I must look it 
 up every time before using one or the other.
 
 So they are both "code"? One is a "unit" and the other is a "point"? 
 Sheesh!
 
 Ali
Yeah, and I even have a physical copy beside me!

P.S. Oh btw, Unicode 15 should be coming soon to Phobos :) Once that is in, expect Turkic support for case-insensitive matching!
Dec 02 2022
prev sibling parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Fri, Dec 02, 2022 at 02:32:47PM -0800, Ali Çehreli via Digitalmars-d-learn
wrote:
 On 12/2/22 13:44, rikki cattermole wrote:
 
 Yeah you're right, its code unit not code point.
This proves yet again how badly chosen those names are. I must look it up every time before using one or the other. So they are both "code"? One is a "unit" and the other is a "point"? Sheesh!
[...]

Think of Unicode as a vector space. A code point is a point in this space, and a code unit is one of the unit vectors; although some points can be reached with a single unit vector, to get to a general point you need to combine one or more unit vectors. Furthermore, the set of unit vectors you have depends on which coordinate system (i.e., encoding) you're using. Reencoding a Unicode string is essentially changing your coordinate system. ;-)

(Exercise for the reader: compute the transformation matrix for reencoding. :-P)

Also, a grapheme is a curve through this space (you *graph* the curve, you see), and as we all know, a curve may consist of more than one point. :-D

(Exercise for the reader: what's the Hausdorff dimension of the set of strings over Unicode space? :-P)

T

--
First Rule of History: History doesn't repeat itself -- historians merely repeat each other.
Dec 02 2022
parent reply thebluepandabear <therealbluepandabear protonmail.com> writes:
 :-D

 (Exercise for the reader: what's the Hausdorff dimension of the 
 set of strings over Unicode space? :-P)


 T
Your explanation was great and cleared things up... not sure about the linear algebra one though ;)
Dec 02 2022
parent reply thebluepandabear <therealbluepandabear protonmail.com> writes:
On Friday, 2 December 2022 at 23:44:28 UTC, thebluepandabear 
wrote:
 :-D

 (Exercise for the reader: what's the Hausdorff dimension of 
 the set of strings over Unicode space? :-P)


 T
Your explanation was great and cleared things up... not sure about the linear algebra one though ;)
Actually now when I think about it, it is quite a creative way of explaining things. I take back what I said.
Dec 02 2022
parent "H. S. Teoh" <hsteoh qfbox.info> writes:
On Fri, Dec 02, 2022 at 11:47:30PM +0000, thebluepandabear via
Digitalmars-d-learn wrote:
 On Friday, 2 December 2022 at 23:44:28 UTC, thebluepandabear wrote:
 :-D
 
 (Exercise for the reader: what's the Hausdorff dimension of the
 set of strings over Unicode space? :-P)
 
 
 T
Your explanation was great and cleared things up... not sure about the linear algebra one though ;)
Actually now when I think about it, it is quite a creative way of explaining things. I take back what I said.
It was a math joke. :-P It was half-serious, though, and I think the analogy surprisingly holds up well enough in many cases. In any case, silly analogies are often a good mnemonic for remembering things like Unicode terminology. :-D

T

--
Freedom: (n.) Man's self-given right to be enslaved by his own depravity.
Dec 02 2022
prev sibling next sibling parent "H. S. Teoh" <hsteoh qfbox.info> writes:
On Fri, Dec 02, 2022 at 09:18:44PM +0000, thebluepandabear via
Digitalmars-d-learn wrote:
 Hello (noob question),
 
 I am reading a book about D by Ali, and he talks about the different
 char types: char, wchar, and dchar. He says that char stores a UTF-8
 code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32
 code unit, this makes sense.
 
 He then goes on to say that:
 
 "Contrary to some other programming languages, characters in D may
 consist of different numbers of bytes. For example, because 'Ğ' must
 be represented by at least 2 bytes in Unicode, it doesn't fit in a
 variable of type char. On the other hand, because dchar consists of 4
 bytes, it can hold any Unicode character."
 
 It's his explanation as to why this code doesn't compile even though Ğ
 is a UTF-8 code unit:
 
 ```D
 char utf8 = 'Ğ';
 ```
 
 But I don't really understand this? What does it mean that it 'must be
 represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes
 so I am confused why it doesn't fit, I don't think it was explained
 well in the book.
That's wrong, char.sizeof should be exactly 1 byte, no more, no less.

First, before we talk about Unicode, we need to get the terminology straight:

Code unit = unit of storage in a particular representation (encoding) of Unicode. E.g., a UTF-8 string consists of a stream of 1-byte code units, a UTF-16 string consists of a stream of 2-byte code units, etc. Do NOT confuse this with "code point", or worse, "character".

Code point = the abstract Unicode entity that occupies a single slot in the Unicode tables. Usually written as U+xxx where xxx is some hexadecimal number. IMPORTANT NOTE: do NOT confuse a code point with what a normal human being thinks of as a "character". Even though in many cases a code point happens to represent a single "character", this isn't always true. It's safer to understand a code point as a single slot in one of the Unicode tables.

NOTE: a code point may be represented by multiple code units, depending on the encoding. For example, in UTF-8, some code points require multiple code units (multiple bytes) to represent. This varies depending on the character: the code point `A` needs only a single code unit, the code point `Ш` needs 2 bytes, and the code point `😀` requires 4 bytes. In UTF-16, `A` and `Ш` occupy only 1 code unit (2 bytes, because in UTF-16, one code unit == 2 bytes), but `😀` needs 2 code units (4 bytes).

Note that neither code unit nor code point corresponds directly with what we normally think of as a "character". The Unicode terminology for that is:

Grapheme = one or more code points that combine together to produce a single visual representation. For example, the 2-code-point sequence U+006D U+030A produces the *single* grapheme `m̊`, and the 3-code-point sequence U+03C0 U+0306 U+032F produces the grapheme `π̯̆`. Note that each code point in these sequences may require multiple code units, depending on which encoding you're using. This email is encoded in UTF-8, so the first sequence occupies 3 bytes (1 byte for the 1st code point, 2 bytes for the second), and the second sequence occupies 6 bytes (2 bytes per code point).

//

OK, now let's talk about D. In D, we have 3 "character" types (I'm putting "character" in quotes because they are actually code units; do NOT confuse them with visual characters): char, wchar, dchar, which are 1, 2, and 4 bytes, respectively.

To find out whether something fits into a char, first you have to find out how many code points it occupies, and second, how many code units are required to represent those code points. For example, the character `À` can be represented by the single code point U+00C0. However, it requires *two* UTF-8 code units to represent (this is a consequence of how UTF-8 represents code points), in spite of being a value that's less than 256. So U+00C0 would not fit into a single char; you need (at least) 2 chars to hold it.

If we were to use UTF-16 instead, U+00C0 would easily fit into a single code unit. Each code unit in UTF-16, however, is 2 bytes, so for some code points (such as 'a', U+0061), the UTF-8 encoding would be smaller.

A dchar always fits any Unicode code point, because code points can only go up to 0x10FFFF (a value that fits in 3 bytes). HOWEVER, using dchar does NOT guarantee that it will hold a complete visual character, because Unicode graphemes can be arbitrarily long. For example, the `π̯̆` grapheme above requires at least 3 code points to represent, which means it requires at least 3 dchars (== 12 bytes) to represent. In UTF-8 encoding, however, it occupies only 6 bytes (still the same 3 code points, just encoded differently).

//

I hope this is clear (as mud :P -- Unicode is a complex beast). Or at least clear*er*, anyway.

T

--
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
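The three levels above (code units, code points, graphemes) can each be counted in D; a short sketch using the `m` + U+030A example, with `std.uni.byGrapheme` and Phobos range auto-decoding:

```D
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // U+006D 'm' followed by combining ring above U+030A: one grapheme.
    string s = "m\u030A";

    assert(s.length == 3);                // UTF-8 code units (1 + 2 bytes)
    assert(s.walkLength == 2);            // code points (auto-decoded dchars)
    assert(s.byGrapheme.walkLength == 1); // graphemes (visual characters)
}
```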
Dec 02 2022
prev sibling next sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 12/2/22 4:18 PM, thebluepandabear wrote:
 Hello (noob question),
 
 I am reading a book about D by Ali, and he talks about the different 
 char types: char, wchar, and dchar. He says that char stores a UTF-8 
 code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32 
 code unit, this makes sense.
 
 He then goes on to say that:
 
 "Contrary to some other programming languages, characters in D may 
 consist of
 different numbers of bytes. For example, because 'Ğ' must be represented 
 by at
 least 2 bytes in Unicode, it doesn't fit in a variable of type char. On 
 the other
 hand, because dchar consists of 4 bytes, it can hold any Unicode 
 character."
 
 It's his explanation as to why this code doesn't compile even though Ğ 
 is a UTF-8 code unit:
 
 ```D
 char utf8 = 'Ğ';
 ```
 
 But I don't really understand this? What does it mean that it 'must be 
 represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes so 
 I am confused why it doesn't fit, I don't think it was explained well in 
 the book.
 
 Any help would be appreciated.
 
A *code point* is a value out of the Unicode standard. [Code points](https://en.wikipedia.org/wiki/Code_point) represent glyphs, combining marks, or other things (not sure of the full list) that reside in the standard. When you want to figure out, "hmm... what value does the emoji 👍 have?", that value is a *code point*. This is a number from 0 to 0x10FFFF for Unicode. (BTW, it's 0x1F44D.)

UTF-X are various *encodings* of Unicode. UTF-8 is an encoding of Unicode where 1 to 4 bytes (called *code units*) encode a single Unicode *code point*. There are various encodings, and all can be decoded to the same list of *code points*. The most direct form is UTF-32, where each *code point* is also a *code unit*.

`char` is a UTF-8 code unit, `wchar` is a UTF-16 code unit, and `dchar` is a UTF-32 code unit.

The reason why you can't encode a Ğ into a single `char` is because its code point is 0x11E, which does not fit into a single `char`. Therefore, an encoding scheme is used to put it into 2 `char`s.

Hope this helps.

-Steve
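That last encoding step can be performed explicitly with `std.utf.encode`; a sketch (the 0xC4 0x9E bytes follow from the UTF-8 two-byte bit layout):

```D
import std.utf : codeLength, encode;

void main()
{
    // Encode code point U+011E ('Ğ') into UTF-8 code units:
    char[4] buf;
    size_t n = encode(buf, 'Ğ');
    assert(n == 2);                           // two code units needed
    assert(buf[0] == 0xC4 && buf[1] == 0x9E); // the encoded bytes

    // In UTF-16 the same code point is a single code unit:
    assert(codeLength!wchar('Ğ') == 1);
}
```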
Dec 02 2022
prev sibling parent =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 12/2/22 13:18, thebluepandabear wrote:

 But I don't really understand this? What does it mean that it 'must be
 represented by at least 2 bytes'?
The integral value of Ğ in Unicode is 286.

https://unicodeplus.com/U+011E

Since 'char' is 8 bits, it cannot store 286. At first, that sounds like a hopeless situation, making one think that Ğ cannot be represented in a string. The concept of encoding to the rescue: Ğ can be encoded by 2 chars:

```D
import std.stdio;

void main() {
    foreach (c; "Ğ") {
        writefln!"%b"(c);
    }
}
```

That program prints

```
11000100
10011110
```

Articles like the following explain well how that second byte is a continuation byte:

https://en.wikipedia.org/wiki/UTF-8#Encoding

(It's a continuation byte because it starts with the bits 10.)
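The decode direction works the same way in reverse; a hand-rolled sketch (for illustration only, real code would use std.utf) recombining the payload bits of those two bytes:

```D
void main()
{
    immutable ubyte b1 = 0b1100_0100; // lead byte: 110xxxxx carries 5 bits
    immutable ubyte b2 = 0b1001_1110; // continuation: 10xxxxxx carries 6 bits

    // Strip the markers and concatenate the payloads: 5 + 6 = 11 bits.
    uint codePoint = ((b1 & 0b0001_1111) << 6) | (b2 & 0b0011_1111);
    assert(codePoint == 286); // U+011E, i.e. 'Ğ'
}
```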
 I don't think it was explained well in
 the book.
Coincidentally, according to other recent feedback I received, unicode and UTF are introduced way too early for such a book. I agree.

I hadn't understood a single thing the first time smart people tried to explain unicode and UTF encodings at the company where I worked, and I had years of programming experience back then. (Although, I now think the instructors were not really good; and the company was pretty bad as well. :) )
 Any help would be appreciated.
I recommend the Wikipedia page I linked above. It is enlightening to understand how about 150K Unicode characters can be encoded with units of 8 bits.

You can safely ignore wchar, dchar, wstring, and dstring for daily coding. Only special programs may need to deal with those types. 'char' and string are what we need and do use predominantly in D.

Ali
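One reason plain char/string is enough for daily coding: foreach can decode on the fly when you ask for dchar. A minimal sketch:

```D
import std.stdio;

void main()
{
    string s = "Ğ"; // two UTF-8 code units, one code point

    foreach (char c; s)  // iterates code units: body runs twice
        writefln!"unit:  %02X"(c);

    foreach (dchar d; s) // decodes on the fly: body runs once
        writefln!"point: U+%04X"(d);
}
```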
Dec 02 2022