digitalmars.D.learn - Why can't D store all UTF-8 code units in char type? (not really
- thebluepandabear (24/24) Dec 02 2022 Hello (noob question),
- Adam D Ruppe (11/14) Dec 02 2022 That's not a utf-8 code unit.
- thebluepandabear (4/5) Dec 02 2022 Hm, that specifically might not be. The thing is, I thought a
- ag0aep6g (7/11) Dec 03 2022 You're simply not using the term "code unit" correctly. A UTF-8 code
- rikki cattermole (6/6) Dec 02 2022 char is always UTF-8 codepoint and therefore exactly 1 byte.
- Adam D Ruppe (5/8) Dec 02 2022 You mean "code unit". There's no such thing as a utf-8/16/32
- rikki cattermole (2/10) Dec 02 2022 Yeah you're right, its code unit not code point.
- Ali Çehreli (5/6) Dec 02 2022 This proves yet again how badly chosen those names are. I must look it
- rikki cattermole (5/16) Dec 02 2022 Yeah, and I even have a physical copy beside me!
- H. S. Teoh (19/28) Dec 02 2022 [...]
- thebluepandabear (2/6) Dec 02 2022 Your explanation was great and cleared things up... not sure
- thebluepandabear (4/13) Dec 02 2022 Actually now when I think about it, it is quite a creative way of
- H. S. Teoh (8/22) Dec 02 2022 It was a math joke. :-P It was half-serious, though, and I think the
- H. S. Teoh (67/93) Dec 02 2022 That's wrong, char.sizeof should be exactly 1 byte, no more, no less.
- Steven Schveighoffer (20/52) Dec 02 2022 a *code point* is a value out of the unicode standard. [Code
- Ali Çehreli (34/39) Dec 02 2022 The integral value of Ğ in unicode is 286.
Hello (noob question), I am reading a book about D by Ali, and he talks about the different char types: char, wchar, and dchar. He says that char stores a UTF-8 code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32 code unit, this makes sense. He then goes on to say that: "Contrary to some other programming languages, characters in D may consist of different numbers of bytes. For example, because 'Ğ' must be represented by at least 2 bytes in Unicode, it doesn't fit in a variable of type char. On the other hand, because dchar consists of 4 bytes, it can hold any Unicode character." It's his explanation as to why this code doesn't compile even though Ğ is a UTF-8 code unit: ```D char utf8 = 'Ğ'; ``` But I don't really understand this? What does it mean that it 'must be represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes so I am confused why it doesn't fit, I don't think it was explained well in the book. Any help would be appreciated.
Dec 02 2022
On Friday, 2 December 2022 at 21:18:44 UTC, thebluepandabear wrote:

> It's his explanation as to why this code doesn't compile even though Ğ is a UTF-8 code unit:

That's not a UTF-8 code unit. A UTF-8 code unit is just a single byte with a particular interpretation.

> If I do `char.sizeof` it's 2 bytes

Are you sure about that? `char.sizeof` is 1. A char is just a single byte. Ğ is a code point (note that code units and code points are two different things: a code point is an abstract idea, like a number, while a code unit is one byte that, combined with others, can encode that number).
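This is easy to verify directly. A minimal D sketch (assumes the source file is saved as UTF-8):

```D
import std.stdio;

void main() {
    // A char is one UTF-8 code unit: always exactly 1 byte.
    writeln(char.sizeof);  // 1
    // 'Ğ' (U+011E) needs two UTF-8 code units, so a string
    // holding just that one code point has length 2.
    writeln("Ğ".length);   // 2
    // As a UTF-32 string it is a single code unit (= code point).
    writeln("Ğ"d.length);  // 1
}
```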
Dec 02 2022
> That's not a utf-8 code unit.

Hm, that specifically might not be. The thing is, I thought a UTF-8 code unit can store 1-4 bytes for each character, so how is it right to say that `char` is a UTF-8 code unit? It seems like it's just an ASCII code unit.
Dec 02 2022
On 02.12.22 22:39, thebluepandabear wrote:

> Hm, that specifically might not be. The thing is, I thought a UTF-8 code unit can store 1-4 bytes for each character, so how is it right to say that `char` is a utf-8 code unit, it seems like it's just an ASCII code unit.

You're simply not using the term "code unit" correctly. A UTF-8 code unit is just one of those 1-4 bytes. Together they form a "sequence" which encodes a "code point". And all (true) ASCII code units are indeed also valid UTF-8 code units, because UTF-8 is a superset of ASCII. If you save a file as ASCII and open it as UTF-8, that works, but it doesn't work the other way around.
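The superset relationship is easy to check in D: every ASCII byte is already a complete UTF-8 code unit, while a stray UTF-8 continuation byte is not valid on its own. A small sketch:

```D
import std.utf : validate, UTFException;
import std.exception : assertThrown;

void main() {
    // A pure-ASCII string is already valid UTF-8:
    // every byte below 0x80 is a complete code unit.
    string ascii = "hello";
    validate(ascii);           // would throw UTFException if invalid
    assert(ascii.length == 5); // 5 code units and 5 code points

    // The reverse fails: a lone continuation byte (the second
    // byte of Ğ's two-byte encoding) is not valid UTF-8.
    assertThrown!UTFException(validate("Ğ"[1 .. 2]));
}
```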
Dec 03 2022
char is always UTF-8 codepoint and therefore exactly 1 byte. wchar is always UTF-16 codepoint and therefore exactly 2 bytes. dchar is always UTF-32 codepoint and therefore exactly 4 bytes; 'Ğ' has the value U+011E which is a lot larger than what 1 byte can hold. You need 2 chars or 1 wchar/dchar. https://unicode-table.com/en/011E/
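Those sizes, and the value of 'Ğ', can be confirmed directly in D (a minimal sketch, assuming a UTF-8 source file):

```D
void main() {
    static assert(char.sizeof  == 1);  // one UTF-8 code unit
    static assert(wchar.sizeof == 2);  // one UTF-16 code unit
    static assert(dchar.sizeof == 4);  // one UTF-32 code unit

    dchar g = 'Ğ';            // a dchar holds any single code point
    assert(g == 0x011E);      // U+011E: too big for one byte
    assert("Ğ".length == 2);  // needs 2 chars in UTF-8
    assert("Ğ"w.length == 1); // fits in 1 wchar in UTF-16
}
```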
Dec 02 2022
On Friday, 2 December 2022 at 21:26:40 UTC, rikki cattermole wrote:

> char is always UTF-8 codepoint and therefore exactly 1 byte. wchar is always UTF-16 codepoint and therefore exactly 2 bytes. dchar is always UTF-32 codepoint and therefore exactly 4 bytes;

You mean "code unit". There's no such thing as a UTF-8/16/32 code point. A code point is a more abstract concept that is encoded in one of the UTF formats.
Dec 02 2022
On 03/12/2022 10:35 AM, Adam D Ruppe wrote:

> You mean "code unit". There's no such thing as a utf-8/16/32 codepoint. A codepoint is a more abstract concept that is encoded in one of the utf formats.

Yeah, you're right, it's "code unit", not "code point".
Dec 02 2022
On 12/2/22 13:44, rikki cattermole wrote:

> Yeah you're right, its code unit not code point.

This proves yet again how badly chosen those names are. I must look it up every time before using one or the other. So they are both "code"? One is a "unit" and the other is a "point"? Sheesh!

Ali
Dec 02 2022
On 03/12/2022 11:32 AM, Ali Çehreli wrote:

> This proves yet again how badly chosen those names are. I must look it up every time before using one or the other. So they are both "code"? One is a "unit" and the other is a "point"? Sheesh!

Yeah, and I even have a physical copy beside me!

P.S. Oh, btw, Unicode 15 should be coming soon to Phobos :) Once that is in, expect Turkic support for case-insensitive matching!
Dec 02 2022
On Fri, Dec 02, 2022 at 02:32:47PM -0800, Ali Çehreli via Digitalmars-d-learn wrote:

> This proves yet again how badly chosen those names are. I must look it up every time before using one or the other. So they are both "code"? One is a "unit" and the other is a "point"? Sheesh!

[...]

Think of Unicode as a vector space. A code point is a point in this space, and a code unit is one of the unit vectors; although some points can be reached with a single unit vector, to get to a general point you need to combine one or more unit vectors. Furthermore, the set of unit vectors you have depends on which coordinate system (i.e., encoding) you're using. Reencoding a Unicode string is essentially changing your coordinate system. ;-) (Exercise for the reader: compute the transformation matrix for reencoding. :-P)

Also, a grapheme is a curve through this space (you *graph* the curve, you see), and as we all know, a curve may consist of more than one point. :-D (Exercise for the reader: what's the Hausdorff dimension of the set of strings over Unicode space? :-P)

T

-- 
First Rule of History: History doesn't repeat itself -- historians merely repeat each other.
Dec 02 2022
> :-D (Exercise for the reader: what's the Hausdorff dimension of the set of strings over Unicode space? :-P)
>
> T

Your explanation was great and cleared things up... not sure about the linear algebra one though ;)
Dec 02 2022
On Friday, 2 December 2022 at 23:44:28 UTC, thebluepandabear wrote:

> Your explanation was great and cleared things up... not sure about the linear algebra one though ;)

Actually, now when I think about it, it is quite a creative way of explaining things. I take back what I said.
Dec 02 2022
On Fri, Dec 02, 2022 at 11:47:30PM +0000, thebluepandabear via Digitalmars-d-learn wrote:

> Actually now when I think about it, it is quite a creative way of explaining things. I take back what I said.

It was a math joke. :-P It was half-serious, though, and I think the analogy surprisingly holds up well enough in many cases. In any case, silly analogies are often a good mnemonic for remembering things like Unicode terminology. :-D

T

-- 
Freedom: (n.) Man's self-given right to be enslaved by his own depravity.
Dec 02 2022
On Fri, Dec 02, 2022 at 09:18:44PM +0000, thebluepandabear via Digitalmars-d-learn wrote:

> Hello (noob question),
>
> I am reading a book about D by Ali, and he talks about the different char types: char, wchar, and dchar. He says that char stores a UTF-8 code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32 code unit, this makes sense. He then goes on to say that:
>
> "Contrary to some other programming languages, characters in D may consist of different numbers of bytes. For example, because 'Ğ' must be represented by at least 2 bytes in Unicode, it doesn't fit in a variable of type char. On the other hand, because dchar consists of 4 bytes, it can hold any Unicode character."
>
> It's his explanation as to why this code doesn't compile even though Ğ is a UTF-8 code unit:
>
> ```D
> char utf8 = 'Ğ';
> ```
>
> But I don't really understand this? What does it mean that it 'must be represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes so I am confused why it doesn't fit, I don't think it was explained well in the book.

That's wrong; char.sizeof should be exactly 1 byte, no more, no less.

First, before we talk about Unicode, we need to get the terminology straight:

Code unit = unit of storage in a particular representation (encoding) of Unicode. E.g., a UTF-8 string consists of a stream of 1-byte code units, a UTF-16 string consists of a stream of 2-byte code units, etc. Do NOT confuse this with "code point", or worse, "character".

Code point = the abstract Unicode entity that occupies a single slot in the Unicode tables. Usually written as U+xxx where xxx is some hexadecimal number. IMPORTANT NOTE: do NOT confuse a code point with what a normal human being thinks of as a "character". Even though in many cases a code point happens to represent a single "character", this isn't always true. It's safer to understand a code point as a single slot in one of the Unicode tables.

NOTE: a code point may be represented by multiple code units, depending on the encoding.
For example, in UTF-8, some code points require multiple code units (multiple bytes) to represent. This varies depending on the character; the code point `A` needs only a single code unit, the code point `Ш` needs 2 bytes, and the code point `😀` requires 4 bytes. In UTF-16, `A` and `Ш` each occupy only 1 code unit (2 bytes, because in UTF-16 one code unit == 2 bytes), but `😀` needs 2 code units (4 bytes).

Note that neither code unit nor code point corresponds directly with what we normally think of as a "character". The Unicode terminology for that is:

Grapheme = one or more code points that combine together to produce a single visual representation. For example, the 2-code-point sequence U+006D U+030A produces the *single* grapheme `m̊`, and the 3-code-point sequence U+03C0 U+0306 U+032F produces the grapheme `π̯̆`. Note that each code point in these sequences may require multiple code units, depending on which encoding you're using. This email is encoded in UTF-8, so the first sequence occupies 3 bytes (1 byte for the 1st code point, 2 bytes for the second), and the second sequence occupies 6 bytes (2 bytes per code point).

//

OK, now let's talk about D. In D, we have 3 "character" types (I'm putting "character" in quotes because they are actually code units; do NOT confuse them with visual characters): char, wchar, and dchar, which are 1, 2, and 4 bytes, respectively.

To find out whether something fits into a char, first you have to find out how many code points it occupies, and second, how many code units are required to represent those code points. For example, the character `À` can be represented by the single code point U+00C0. However, it requires *two* UTF-8 code units to represent (this is a consequence of how UTF-8 represents code points), in spite of being a value that's less than 256. So U+00C0 would not fit into a single char; you need (at least) 2 chars to hold it. If we were to use UTF-16 instead, U+00C0 would easily fit into a single code unit.
Each code unit in UTF-16, however, is 2 bytes, so for some code points (such as 'a', U+0061), the UTF-8 encoding would be smaller.

A dchar always fits any Unicode code point, because code points only go up to U+10FFFF (a value that fits in 3 bytes). HOWEVER, using dchar does NOT guarantee that it will hold a complete visual character, because Unicode graphemes can be arbitrarily long. For example, the `π̯̆` grapheme above requires at least 3 code points to represent, which means it requires at least 3 dchars (== 12 bytes) to represent. In UTF-8 encoding, however, it occupies only 6 bytes (still the same 3 code points, just encoded differently).

//

I hope this is clear (as mud :P -- Unicode is a complex beast). Or at least clear*er*, anyway.

T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
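All three levels (code units, code points, graphemes) can be counted with Phobos. A sketch using `std.uni.byGrapheme`, with the `m̊` example from above (assumes a UTF-8 source file):

```D
import std.range : walkLength;
import std.stdio;
import std.uni : byGrapheme;

void main() {
    // U+006D 'm' followed by U+030A (combining ring above)
    string s = "m\u030A";
    writeln(s.length);                // 3 UTF-8 code units
    writeln(s.walkLength);            // 2 code points (auto-decoded)
    writeln(s.byGrapheme.walkLength); // 1 grapheme
}
```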
Dec 02 2022
On 12/2/22 4:18 PM, thebluepandabear wrote:

> Hello (noob question),
>
> I am reading a book about D by Ali, and he talks about the different char types: char, wchar, and dchar. He says that char stores a UTF-8 code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32 code unit, this makes sense. He then goes on to say that:
>
> "Contrary to some other programming languages, characters in D may consist of different numbers of bytes. For example, because 'Ğ' must be represented by at least 2 bytes in Unicode, it doesn't fit in a variable of type char. On the other hand, because dchar consists of 4 bytes, it can hold any Unicode character."
>
> It's his explanation as to why this code doesn't compile even though Ğ is a UTF-8 code unit:
>
> ```D
> char utf8 = 'Ğ';
> ```
>
> But I don't really understand this? What does it mean that it 'must be represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes so I am confused why it doesn't fit, I don't think it was explained well in the book.
>
> Any help would be appreciated.

a *code point* is a value out of the Unicode standard. [Code points](https://en.wikipedia.org/wiki/Code_point) represent glyphs, combining marks, or other things (not sure of the full list) that reside in the standard. When you want to figure out, "hmm... what value does the emoji 👍 have?", it's a *code point*. This is a number from 0 to 0x10FFFF for Unicode. (BTW, it's 0x1F44D.)

UTF-X are various *encodings* of Unicode. UTF-8 is an encoding of Unicode where 1 to 4 bytes (called *code units*) encode a single Unicode *code point*. There are various encodings, and all can be decoded to the same list of *code points*. The most direct form is UTF-32, where each *code point* is also a *code unit*.

`char` is a UTF-8 code unit, `wchar` is a UTF-16 code unit, and `dchar` is a UTF-32 code unit.

The reason why you can't encode a Ğ into a single `char` is that its code point is 0x11E, which does not fit into a single `char`. Therefore, an encoding scheme is used to put it into 2 `char`s.

Hope this helps.

-Steve
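That 2-`char` encoding can be produced explicitly with `std.utf.encode`. A minimal sketch:

```D
import std.stdio;
import std.utf : encode;

void main() {
    char[4] buf;                 // UTF-8 needs at most 4 code units
    size_t n = encode(buf, 'Ğ'); // encode code point U+011E
    writefln("%d code units: %(%02x %)", n, cast(ubyte[]) buf[0 .. n]);
    // prints: 2 code units: c4 9e
}
```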
Dec 02 2022
On 12/2/22 13:18, thebluepandabear wrote:

> But I don't really understand this? What does it mean that it 'must be represented by at least 2 bytes'?

The integral value of Ğ in unicode is 286.

https://unicodeplus.com/U+011E

Since 'char' is 8 bits, it cannot store 286. At first, that sounds like a hopeless situation, making one think that Ğ cannot be represented in a string. The concept of encoding to the rescue: Ğ can be encoded by 2 chars:

```D
import std.stdio;

void main() {
    foreach (c; "Ğ") {
        writefln!"%b"(c);
    }
}
```

That program prints

11000100
10011110

Articles like the following explain well how that second byte is a continuation byte: https://en.wikipedia.org/wiki/UTF-8#Encoding (It's a continuation byte because it starts with the bits 10.)

> I don't think it was explained well in the book.

Coincidentally, according to another recent feedback I received, unicode and UTF are introduced way too early for such a book. I agree. I hadn't understood a single thing the first time smart people tried to explain unicode and UTF encodings at the company where I worked. I had years of programming experience back then. (Although, I now think the instructors were not really good; and the company was pretty bad as well. :) )

> Any help would be appreciated.

I recommend the Wikipedia page I linked above. It is enlightening to understand how about 150K unicode characters can be encoded with units of 8 bits.

You can safely ignore wchar, dchar, wstring, and dstring for daily coding. Only special programs may need to deal with those types. 'char' and string are what we need and do use predominantly in D.

Ali
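Those payload bits can also be recombined by hand to recover the code point, which makes the continuation-byte scheme concrete. A sketch:

```D
void main() {
    // The two UTF-8 bytes of Ğ
    ubyte b1 = 0b1100_0100;  // leading byte: 110xxxxx (5 payload bits)
    ubyte b2 = 0b1001_1110;  // continuation byte: 10xxxxxx (6 payload bits)

    // Splice the payloads together: 5 bits from b1, then 6 from b2.
    uint cp = ((b1 & 0b0001_1111) << 6) | (b2 & 0b0011_1111);
    assert(cp == 286);       // U+011E, i.e. Ğ
}
```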
Dec 02 2022