
digitalmars.D.learn - Why can't D store all UTF-8 code units in char type? (not really

reply thebluepandabear <therealbluepandabear protonmail.com> writes:
Hello (noob question),

I am reading a book about D by Ali, and he talks about the 
different char types: char, wchar, and dchar. He says that char 
stores a UTF-8 code unit, wchar stores a UTF-16 code unit, and 
dchar stores a UTF-32 code unit, this makes sense.

He then goes on to say that:

"Contrary to some other programming languages, characters in D 
may consist of
different numbers of bytes. For example, because 'Ğ' must be 
represented by at
least 2 bytes in Unicode, it doesn't fit in a variable of type 
char. On the other
hand, because dchar consists of 4 bytes, it can hold any Unicode 
character."

It's his explanation as to why this code doesn't compile even 
though Ğ is a UTF-8 code unit:

```D
char utf8 = 'Ğ';
```

But I don't really understand this? What does it mean that it 
'must be represented by at least 2 bytes'? If I do `char.sizeof` 
it's 2 bytes so I am confused why it doesn't fit, I don't think 
it was explained well in the book.

Any help would be appreciated.
Dec 02 2022
next sibling parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Friday, 2 December 2022 at 21:18:44 UTC, thebluepandabear 
wrote:
 It's his explanation as to why this code doesn't compile even 
 though Ğ is a UTF-8 code unit:
That's not a utf-8 code unit. A utf-8 code unit is just a single byte with a particular interpretation.
 If I do `char.sizeof` it's 2 bytes
Are you sure about that? `char.sizeof` is 1. A char is just a single byte. The Ğ code point needs more than one of those bytes to encode (note code units and code points are two different things: a code point is an abstract idea, like a number, and a code unit is one of the bytes that, when combined, recreate that number).
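(A minimal check of those sizes, using nothing beyond the built-in `.sizeof` properties:)

```D
void main()
{
    // Each size is that of one *code unit* of the corresponding
    // encoding, not of one "character".
    static assert(char.sizeof == 1);  // UTF-8 code unit: 1 byte
    static assert(wchar.sizeof == 2); // UTF-16 code unit: 2 bytes
    static assert(dchar.sizeof == 4); // UTF-32 code unit: 4 bytes
}
```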
Dec 02 2022
parent reply thebluepandabear <therealbluepandabear protonmail.com> writes:
 That's not a utf-8 code unit.
Hm, that specifically might not be. The thing is, I thought a UTF-8 code unit can store 1-4 bytes for each character, so how is it right to say that `char` is a UTF-8 code unit? It seems like it's just an ASCII code unit.
Dec 02 2022
parent ag0aep6g <anonymous example.com> writes:
On 02.12.22 22:39, thebluepandabear wrote:
 Hm, that specifically might not be. The thing is, I thought a UTF-8 code 
 unit can store 1-4 bytes for each character, so how is it right to say 
 that `char` is a utf-8 code unit, it seems like it's just an ASCII code 
 unit.
You're simply not using the term "code unit" correctly. A UTF-8 code unit is just one of those 1-4 bytes. Together they form a "sequence" which encodes a "code point".

And all (true) ASCII code units are indeed also valid UTF-8 code units, because UTF-8 is a superset of ASCII. If you save a file as ASCII and open it as UTF-8, that works. But it doesn't work the other way around.
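The superset relation can be seen from D with `std.utf.validate`; a small sketch (the bogus byte here is just an arbitrary lone UTF-8 lead byte, chosen for illustration):

```D
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // Pure ASCII is already valid UTF-8: validate() accepts it as-is.
    validate("hello");

    // The reverse fails: a lone lead byte (0xC4) with no continuation
    // byte after it is not a valid UTF-8 sequence.
    ubyte[] raw = [0xC4];
    auto bogus = cast(string) raw;
    assertThrown!UTFException(validate(bogus));
}
```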
Dec 03 2022
prev sibling next sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
char is always UTF-8 codepoint and therefore exactly 1 byte.

wchar is always UTF-16 codepoint and therefore exactly 2 bytes.

dchar is always UTF-32 codepoint and therefore exactly 4 bytes.

'Ğ' has the value U+011E which is a lot larger than what 1 byte can 
hold. You need 2 chars or 1 wchar/dchar.

https://unicode-table.com/en/011E/
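The "2 chars or 1 wchar/dchar" point can be checked directly from the literal types; a small sketch:

```D
void main()
{
    // One code point, three encodings:
    assert("Ğ".length == 2);  // string  (UTF-8):  two 1-byte code units
    assert("Ğ"w.length == 1); // wstring (UTF-16): one 2-byte code unit
    assert("Ğ"d.length == 1); // dstring (UTF-32): one 4-byte code unit

    // 'Ğ' (U+011E) fits in a wchar or dchar, but not in a char:
    wchar w = 'Ğ'; // OK
    dchar d = 'Ğ'; // OK
    // char c = 'Ğ'; // compile error: 0x11E does not fit in 8 bits
}
```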
Dec 02 2022
parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Friday, 2 December 2022 at 21:26:40 UTC, rikki cattermole 
wrote:
 char is always UTF-8 codepoint and therefore exactly 1 byte.
 wchar is always UTF-16 codepoint and therefore exactly 2 bytes.
 dchar is always UTF-32 codepoint and therefore exactly 4 bytes;
You mean "code unit". There's no such thing as a utf-8/16/32 codepoint. A codepoint is a more abstract concept that is encoded in one of the utf formats.
Dec 02 2022
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 03/12/2022 10:35 AM, Adam D Ruppe wrote:
 On Friday, 2 December 2022 at 21:26:40 UTC, rikki cattermole wrote:
 char is always UTF-8 codepoint and therefore exactly 1 byte.
 wchar is always UTF-16 codepoint and therefore exactly 2 bytes.
 dchar is always UTF-32 codepoint and therefore exactly 4 bytes;
You mean "code unit". There's no such thing as a utf-8/16/32 codepoint. A codepoint is a more abstract concept that is encoded in one of the utf formats.
Yeah you're right, it's code unit, not code point.
Dec 02 2022
parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 12/2/22 13:44, rikki cattermole wrote:

 Yeah you're right, its code unit not code point.
This proves yet again how badly chosen those names are. I must look it up every time before using one or the other.

So they are both "code"? One is a "unit" and the other is a "point"? Sheesh!

Ali
Dec 02 2022
next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 03/12/2022 11:32 AM, Ali Çehreli wrote:
 On 12/2/22 13:44, rikki cattermole wrote:
 
  > Yeah you're right, its code unit not code point.
 
 This proves yet again how badly chosen those names are. I must look it 
 up every time before using one or the other.
 
 So they are both "code"? One is a "unit" and the other is a "point"? 
 Sheesh!
 
 Ali
Yeah, and I even have a physical copy beside me!

P.S. Oh btw, Unicode 15 should be coming soon to Phobos :) Once that is in, expect Turkic support for case-insensitive matching!
Dec 02 2022
prev sibling parent reply "H. S. Teoh" <hsteoh qfbox.info> writes:
On Fri, Dec 02, 2022 at 02:32:47PM -0800, Ali Çehreli via Digitalmars-d-learn
wrote:
 On 12/2/22 13:44, rikki cattermole wrote:
 
 Yeah you're right, its code unit not code point.
This proves yet again how badly chosen those names are. I must look it up every time before using one or the other. So they are both "code"? One is a "unit" and the other is a "point"? Sheesh!
[...]

Think of Unicode as a vector space. A code point is a point in this space, and a code unit is one of the unit vectors; although some points can be reached with a single unit vector, to get to a general point you need to combine one or more unit vectors. Furthermore, the set of unit vectors you have depends on which coordinate system (i.e., encoding) you're using. Reencoding a Unicode string is essentially changing your coordinate system. ;-)

(Exercise for the reader: compute the transformation matrix for reencoding. :-P)

Also, a grapheme is a curve through this space (you *graph* the curve, you see), and as we all know, a curve may consist of more than one point. :-D

(Exercise for the reader: what's the Hausdorff dimension of the set of strings over Unicode space? :-P)

T

--
First Rule of History: History doesn't repeat itself -- historians merely repeat each other.
Dec 02 2022
parent reply thebluepandabear <therealbluepandabear protonmail.com> writes:
 :-D

 (Exercise for the reader: what's the Hausdorff dimension of the 
 set of strings over Unicode space? :-P)


 T
Your explanation was great and cleared things up... not sure about the linear algebra one though ;)
Dec 02 2022
parent reply thebluepandabear <therealbluepandabear protonmail.com> writes:
On Friday, 2 December 2022 at 23:44:28 UTC, thebluepandabear 
wrote:
 :-D

 (Exercise for the reader: what's the Hausdorff dimension of 
 the set of strings over Unicode space? :-P)


 T
Your explanation was great and cleared things up... not sure about the linear algebra one though ;)
Actually now when I think about it, it is quite a creative way of explaining things. I take back what I said.
Dec 02 2022
parent "H. S. Teoh" <hsteoh qfbox.info> writes:
On Fri, Dec 02, 2022 at 11:47:30PM +0000, thebluepandabear via
Digitalmars-d-learn wrote:
 On Friday, 2 December 2022 at 23:44:28 UTC, thebluepandabear wrote:
 :-D
 
 (Exercise for the reader: what's the Hausdorff dimension of the
 set of strings over Unicode space? :-P)
 
 
 T
Your explanation was great and cleared things up... not sure about the linear algebra one though ;)
Actually now when I think about it, it is quite a creative way of explaining things. I take back what I said.
It was a math joke. :-P It was half-serious, though, and I think the analogy surprisingly holds up well enough in many cases. In any case, silly analogies are often a good mnemonic for remembering things like Unicode terminology. :-D

T

--
Freedom: (n.) Man's self-given right to be enslaved by his own depravity.
Dec 02 2022
prev sibling next sibling parent "H. S. Teoh" <hsteoh qfbox.info> writes:
On Fri, Dec 02, 2022 at 09:18:44PM +0000, thebluepandabear via
Digitalmars-d-learn wrote:
 Hello (noob question),
 
 I am reading a book about D by Ali, and he talks about the different
 char types: char, wchar, and dchar. He says that char stores a UTF-8
 code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32
 code unit, this makes sense.
 
 He then goes on to say that:
 
 "Contrary to some other programming languages, characters in D may
 consist of different numbers of bytes. For example, because 'Ğ' must
 be represented by at least 2 bytes in Unicode, it doesn't fit in a
 variable of type char. On the other hand, because dchar consists of 4
 bytes, it can hold any Unicode character."
 
 It's his explanation as to why this code doesn't compile even though Ğ
 is a UTF-8 code unit:
 
 ```D
 char utf8 = 'Ğ';
 ```
 
 But I don't really understand this? What does it mean that it 'must be
 represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes
 so I am confused why it doesn't fit, I don't think it was explained
 well in the book.
That's wrong, char.sizeof should be exactly 1 byte, no more, no less.

First, before we talk about Unicode, we need to get the terminology straight:

Code unit = unit of storage in a particular representation (encoding) of Unicode. E.g., a UTF-8 string consists of a stream of 1-byte code units, a UTF-16 string consists of a stream of 2-byte code units, etc. Do NOT confuse this with "code point", or worse, "character".

Code point = the abstract Unicode entity that occupies a single slot in the Unicode tables. Usually written as U+xxx where xxx is some hexadecimal number. IMPORTANT NOTE: do NOT confuse a code point with what a normal human being thinks of as a "character". Even though in many cases a code point happens to represent a single "character", this isn't always true. It's safer to understand a code point as a single slot in one of the Unicode tables.

NOTE: a code point may be represented by multiple code units, depending on the encoding. For example, in UTF-8, some code points require multiple code units (multiple bytes) to represent. This varies depending on the character: the code point `A` needs only a single code unit, the code point `Ш` needs 2 bytes, and the code point `😀` requires 4 bytes. In UTF-16, `A` and `Ш` occupy only 1 code unit (2 bytes, because in UTF-16, one code unit == 2 bytes), but `😀` needs 2 code units (4 bytes).

Note that neither code unit nor code point corresponds directly with what we normally think of as a "character". The Unicode terminology for that is:

Grapheme = one or more code points that combine together to produce a single visual representation. For example, the 2-code-point sequence U+006D U+030A produces the *single* grapheme `m̊`, and the 3-code-point sequence U+03C0 U+0306 U+032F produces the grapheme `π̯̆`. Note that each code point in these sequences may require multiple code units, depending on which encoding you're using. This email is encoded in UTF-8, so the first sequence occupies 3 bytes (1 byte for the 1st code point, 2 bytes for the second), and the second sequence occupies 6 bytes (2 bytes per code point).

//

OK, now let's talk about D. In D, we have 3 "character" types (I'm putting "character" in quotes because they are actually code units; do NOT confuse them with visual characters): char, wchar, dchar, which are 1, 2, and 4 bytes, respectively.

To find out whether something fits into a char, first you have to find out how many code points it occupies, and second, how many code units are required to represent those code points. For example, the character `À` can be represented by the single code point U+00C0. However, it requires *two* UTF-8 code units to represent (this is a consequence of how UTF-8 represents code points), in spite of being a value that's less than 256. So U+00C0 would not fit into a single char; you need (at least) 2 chars to hold it.

If we were to use UTF-16 instead, U+00C0 would easily fit into a single code unit. Each code unit in UTF-16, however, is 2 bytes, so for some code points (such as 'a', U+0061), the UTF-8 encoding would be smaller.

A dchar always fits any Unicode code point, because code points can only go up to 0x10FFFF (a value that fits in 3 bytes). HOWEVER, using dchar does NOT guarantee that it will hold a complete visual character, because Unicode graphemes can be arbitrarily long. For example, the `π̯̆` grapheme above requires at least 3 code points to represent, which means it requires at least 3 dchars (== 12 bytes) to represent. In UTF-8 encoding, however, it occupies only 6 bytes (still the same 3 code points, just encoded differently).

//

I hope this is clear (as mud :P -- Unicode is a complex beast). Or at least clear*er*, anyway.

T

--
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
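The three levels above (code units, code points, graphemes) can each be counted in D; a short sketch using the `m` + U+030A example, with `std.uni.byGrapheme` and Phobos range auto-decoding:

```D
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // U+006D 'm' followed by combining ring above U+030A: one grapheme.
    string s = "m\u030A";

    assert(s.length == 3);                // UTF-8 code units (1 + 2 bytes)
    assert(s.walkLength == 2);            // code points (auto-decoded dchars)
    assert(s.byGrapheme.walkLength == 1); // graphemes (visual characters)
}
```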
Dec 02 2022
prev sibling next sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 12/2/22 4:18 PM, thebluepandabear wrote:
 Hello (noob question),
 
 I am reading a book about D by Ali, and he talks about the different 
 char types: char, wchar, and dchar. He says that char stores a UTF-8 
 code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32 
 code unit, this makes sense.
 
 He then goes on to say that:
 
 "Contrary to some other programming languages, characters in D may 
 consist of
 different numbers of bytes. For example, because 'Ğ' must be represented 
 by at
 least 2 bytes in Unicode, it doesn't fit in a variable of type char. On 
 the other
 hand, because dchar consists of 4 bytes, it can hold any Unicode 
 character."
 
 It's his explanation as to why this code doesn't compile even though Ğ 
 is a UTF-8 code unit:
 
 ```D
 char utf8 = 'Ğ';
 ```
 
 But I don't really understand this? What does it mean that it 'must be 
 represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes so 
 I am confused why it doesn't fit, I don't think it was explained well in 
 the book.
 
 Any help would be appreciated.
 
A *code point* is a value out of the Unicode standard. [Code points](https://en.wikipedia.org/wiki/Code_point) represent glyphs, combining marks, or other things (not sure of the full list) that reside in the standard. When you want to figure out, "hmm... what value does the emoji 👍 have?", that value is a *code point*. This is a number from 0 to 0x10FFFF for Unicode. (BTW, it's 0x1F44D.)

UTF-X are various *encodings* of Unicode. UTF-8 is an encoding of Unicode where 1 to 4 bytes (called *code units*) encode a single Unicode *code point*. There are various encodings, and all can be decoded to the same list of *code points*. The most direct form is UTF-32, where each *code point* is also a *code unit*.

`char` is a UTF-8 code unit, `wchar` is a UTF-16 code unit, and `dchar` is a UTF-32 code unit.

The reason why you can't encode a Ğ into a single `char` is because its code point is 0x11E, which does not fit into a single `char`. Therefore, an encoding scheme is used to put it into 2 `char`s.

Hope this helps.

-Steve
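That last encoding step can be performed explicitly with `std.utf.encode`; a sketch (the 0xC4 0x9E bytes follow from the UTF-8 two-byte bit layout):

```D
import std.utf : codeLength, encode;

void main()
{
    // Encode code point U+011E ('Ğ') into UTF-8 code units:
    char[4] buf;
    size_t n = encode(buf, 'Ğ');
    assert(n == 2);                           // two code units needed
    assert(buf[0] == 0xC4 && buf[1] == 0x9E); // the encoded bytes

    // In UTF-16 the same code point is a single code unit:
    assert(codeLength!wchar('Ğ') == 1);
}
```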
Dec 02 2022
prev sibling parent =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 12/2/22 13:18, thebluepandabear wrote:

 But I don't really understand this? What does it mean that it 'must be
 represented by at least 2 bytes'?
The integral value of Ğ in Unicode is 286.

https://unicodeplus.com/U+011E

Since 'char' is 8 bits, it cannot store 286. At first, that sounds like a hopeless situation, making one think that Ğ cannot be represented in a string. The concept of encoding to the rescue: Ğ can be encoded by 2 chars:

```D
import std.stdio;

void main() {
    foreach (c; "Ğ") {
        writefln!"%b"(c);
    }
}
```

That program prints

```
11000100
10011110
```

Articles like the following explain well how that second byte is a continuation byte:

https://en.wikipedia.org/wiki/UTF-8#Encoding

(It's a continuation byte because it starts with the bits 10.)
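The decode direction works the same way in reverse; a hand-rolled sketch (for illustration only, real code would use std.utf) recombining the payload bits of those two bytes:

```D
void main()
{
    immutable ubyte b1 = 0b1100_0100; // lead byte: 110xxxxx carries 5 bits
    immutable ubyte b2 = 0b1001_1110; // continuation: 10xxxxxx carries 6 bits

    // Strip the markers and concatenate the payloads: 5 + 6 = 11 bits.
    uint codePoint = ((b1 & 0b0001_1111) << 6) | (b2 & 0b0011_1111);
    assert(codePoint == 286); // U+011E, i.e. 'Ğ'
}
```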
 I don't think it was explained well in
 the book.
Coincidentally, according to other recent feedback I received, unicode and UTF are introduced way too early for such a book. I agree.

I hadn't understood a single thing the first time smart people tried to explain unicode and UTF encodings at the company where I worked, and I had years of programming experience back then. (Although, I now think the instructors were not really good; and the company was pretty bad as well. :) )
 Any help would be appreciated.
I recommend the Wikipedia page I linked above. It is enlightening to understand how about 150K Unicode characters can be encoded with units of 8 bits.

You can safely ignore wchar, dchar, wstring, and dstring for daily coding. Only special programs may need to deal with those types. 'char' and string are what we need and do use predominantly in D.

Ali
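One reason plain char/string is enough for daily coding: foreach can decode on the fly when you ask for dchar. A minimal sketch:

```D
import std.stdio;

void main()
{
    string s = "Ğ"; // two UTF-8 code units, one code point

    foreach (char c; s)  // iterates code units: body runs twice
        writefln!"unit:  %02X"(c);

    foreach (dchar d; s) // decodes on the fly: body runs once
        writefln!"point: U+%04X"(d);
}
```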
Dec 02 2022