
digitalmars.D.learn - How to print unicode characters (no library)?

reply rempas <rempas tutanota.com> writes:
Hi! I'm trying to print some Unicode characters using UTF-8 
(char), UTF-16 (wchar) and UTF-32 (dchar). I want to do this 
without using any library by using the "write" system call 
directly with 64-bit Linux. Only the UTF-8 solution seems to be 
working as expected. The other solutions will not print the 
unicode characters (I'm using an emoji in my case for example). 
Another thing I noticed is the size of the strings. From what I 
know (and tell me if I'm mistaken), UTF-16 and UTF-32 use a fixed 
size for their characters: UTF-16 uses 2 bytes (16 bits) and 
UTF-32 uses 4 bytes (32 bits), without treating any character 
specially. This doesn't seem to be the case for me, however. 
Consider my code:

```
import core.stdc.stdio;

// 64-bit Linux system calls via GCC-style extended inline asm (GDC/LDC syntax).
void exit(ulong code) {
   asm {
     "syscall"
     : : "a" (60), "D" (code);  // 60 = sys_exit, rdi = exit code
   }
}

void write(T)(int fd, const T buf, ulong len) {
   asm {
     "syscall"
     : : "a" (1), "D" (fd), "S" (buf), "d" (len)  // 1 = sys_write; use the fd argument
     : "memory", "rcx", "r11";                    // syscall clobbers rcx and r11
   }
}

extern (C) void main() {
   string  utf8s  = "Hello 😂\n";
   write(1, utf8s.ptr, utf8s.length);

   wstring utf16s = "Hello 😂\n"w;
   write(1, utf16s.ptr, utf16s.length * 2);

   dstring utf32s = "Hello 😂\n"d;
   write(1, utf32s.ptr, utf32s.length * 4);

   printf("\nutf8s.length = %lu\nutf16s.length = 
%lu\nutf32s.length = %lu\n",
       utf8s.length, utf16s.length, utf32s.length);

   exit(0);
}
```

And its output:

```
Hello 😂
Hello =��
Hello �

utf8s.length = 11
utf16s.length = 9
utf32s.length = 8
```

Now the UTF-8 string will report a length of 11 and print 
normally. So it treats every character that is 127 or less as if 
it were an ASCII character and uses 1 byte for it. Characters 
above that range are encoded as multi-byte (2 to 4 byte) 
sequences. So it works as I expected based on what I've read/seen 
for UTF-8 (now I understand why everyone loves it, lol :P)!

Now what about the other two? I was expecting UTF-16 to report 16 
characters and UTF-32 to report 32 characters. Also, why are the 
characters not shown as expected? Isn't the "write" system call 
just writing a sequence of characters without caring what they 
are? So if I just give it the right length, shouldn't it just 
work? I'm pretty sure it doesn't work the way I expect, but I 
don't see why. Does anyone have an idea?
Dec 26 2021
parent reply Adam Ruppe <destructionator gmail.com> writes:
On Sunday, 26 December 2021 at 20:50:39 UTC, rempas wrote:
 I want to do this without using any library by using the 
 "write" system call directly with 64-bit Linux.
write just transfers a sequence of bytes. It doesn't know nor care what they represent - that's for the receiving end to figure out.
 know (and tell me if I'm mistaken), UTF-16 and UTF-32 have 
 fixed size lengths for their characters.
You are mistaken. There are several exceptions: utf-16 code units can come in pairs, and even utf-32 has multiple "characters" that combine into one thing on screen. I prefer to think of a string as a little virtual machine that can be run to produce output rather than actually being "characters". Even with plain ascii, consider the backspace "character" - it is more an instruction to go back than it is a thing that is displayed on its own.
 Now the UTF-8 string will report 11 characters and print them 
 normally.
This is because the *receiving program* treats them as utf-8 and runs it accordingly. Not all terminals will necessarily do this, and programs you pipe to can do it very differently.
 Now what about the other two? I was expecting UTF-16 to report 
 16 characters and UTF-32 to report 32 characters.
The [w|d|]string.length function returns the number of elements in there, which is bytes for string, 16 bit elements for wstring (so bytes / 2), or 32 bit elements for dstring (so bytes / 4). This is not necessarily related to the number of characters displayed.
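As a quick sketch (any D compiler; the literal is the same one from the original post), you can see the code-unit counts vs. the byte counts by reinterpreting the arrays:

```
void main() {
    import std.stdio : writeln;

    string  s8  = "Hello 😂\n";   // UTF-8:  .length counts 8-bit code units (bytes)
    wstring s16 = "Hello 😂\n"w;  // UTF-16: .length counts 16-bit code units
    dstring s32 = "Hello 😂\n"d;  // UTF-32: .length counts 32-bit code units

    writeln(s8.length,  " units, ", (cast(const(ubyte)[]) s8).length,  " bytes");  // 11 units, 11 bytes
    writeln(s16.length, " units, ", (cast(const(ubyte)[]) s16).length, " bytes");  // 9 units, 18 bytes
    writeln(s32.length, " units, ", (cast(const(ubyte)[]) s32).length, " bytes");  // 8 units, 32 bytes
}
```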
 Isn't the "write" system call just writing a sequence of 
 characters without caring which they are?
yes, it just passes bytes through. It doesn't know they are supposed to be characters...
Dec 26 2021
next sibling parent reply max haughton <maxhaton gmail.com> writes:
On Sunday, 26 December 2021 at 21:22:42 UTC, Adam Ruppe wrote:
 On Sunday, 26 December 2021 at 20:50:39 UTC, rempas wrote:
 [...]
[...]
I think that mental model is pretty good actually. Maybe a more specific idea exists, but this virtual machine concept does actually explain to the new programmer to expect dragons - or at least that the days of plain ASCII are long gone (and never happened, e.g. backspace as you say)
Dec 26 2021
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, Dec 26, 2021 at 11:45:25PM +0000, max haughton via Digitalmars-d-learn
wrote:
[...]
 I think that mental model is pretty good actually. Maybe a more
 specific idea exists, but this virtual machine concept does actually
 explain to the new programmer to expect dragons - or at least that the
 days of plain ASCII are long gone (and never happened, e.g. backspace
 as you say)
In some Unix terminals, backspace + '_' causes a character to be underlined. So it's really a mini VM, not just pure data. So yeah, the good ole ASCII days never happened. :-D

T

--
This sentence is false.
Dec 26 2021
parent reply rempas <rempas tutanota.com> writes:
On Sunday, 26 December 2021 at 23:57:47 UTC, H. S. Teoh wrote:
 In some Unix terminals, backspace + '_' causes a character to 
 be underlined. So it's really a mini VM, not just pure data. So 
 yeah, the good ole ASCII days never happened. :-D


 T
How can you do that? I'm trying to print the codes for them but it doesn't work. Or can you not choose to have this behavior, and only some terminals support it?
Dec 26 2021
parent Kagamin <spam here.lot> writes:
On Monday, 27 December 2021 at 07:29:05 UTC, rempas wrote:
 How can you do that? I'm trying to print the codes for them but 
 it doesn't work. Or you cannot choose to have this behavior and 
 there are only some terminals that support this?
Try it on https://en.wikipedia.org/wiki/Teletype_Model_33
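(A small aside: the overstrike convention survives in nroff/man-page output, so you can still see it if the bytes go through something that interprets them, e.g. the `ul` utility or a pager like `less`; most terminal emulators on their own just ignore the backspaces. A minimal sketch:)

```
void main() {
    import std.stdio : write;

    // nroff-style overstrike: "_<backspace>c" underlines c, "c<backspace>c" makes it bold.
    // Pipe the output through `ul` (or view it in `less`) to see the effect.
    write("_\bH_\be_\bl_\bl_\bo\n");
}
```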
Dec 27 2021
prev sibling parent reply rempas <rempas tutanota.com> writes:
On Sunday, 26 December 2021 at 21:22:42 UTC, Adam Ruppe wrote:
 write just transfers a sequence of bytes. It doesn't know nor 
 care what they represent - that's for the receiving end to 
 figure out.
Oh, so it was as I expected :P
 You are mistaken. There's several exceptions, utf-16 can come 
 in pairs, and even utf-32 has multiple "characters" that 
 combine onto one thing on screen.
Oh yeah. About that, I wasn't given a demonstration of how it works, so I forgot about it. I saw that in Unicode you can combine some code points to get different results, but I never saw how that happens in practice. If you combine two code points, you get a different glyph. So yeah, that's one thing I don't understand...
 I prefer to think of a string as a little virtual machine that 
 can be run to produce output rather than actually being 
 "characters". Even with plain ascii, consider the backspace 
 "character" - it is more an instruction to go back than it is a 
 thing that is displayed on its own.
Yes, that's a great way of seeing it. I suppose that this all happens under the hood and is OS-specific, so we have to know how the OS we are working with works under the hood to fully understand how this happens. Also, the idea of some "characters" being "instructions" is very interesting. Now from what I've seen, non-printable characters are always instructions (except for the "space" character), so another way to think about this is that every character carries one instruction: either to get written (displayed) to the file, or to make some other modification to the text without getting displayed itself as a character. Of course, I don't suppose that's what's happening under the hood, but it's an interesting way to describe it.
 This is because the *receiving program* treats them as utf-8 
 and runs it accordingly. Not all terminals will necessarily do 
 this, and programs you pipe to can do it very differently.
That's pretty interesting actually. Terminals (and don't forget shells) are programs themselves, so they choose the encoding themselves. However, do you know what we do for cross-compatibility then? Because this sounds like a HUGE mess for real-world applications.
 The [w|d|]string.length function returns the number of elements 
 in there, which is bytes for string, 16 bit elements for 
 wstring (so bytes / 2), or 32 bit elements for dstring (so 
 bytes / 4).

 This is not necessarily related to the number of characters 
 displayed.
I don't understand that. Based on your calculations, the results should have been different. Also, how are the numbers fixed? Like you said, the number of bytes in each encoding is not always the same for every character. Even if they were fixed, that would mean 2 bytes for each UTF-16 character and 4 bytes for each UTF-32 character, so the numbers still don't make sense to me. So the "length" property should have been the same for every encoding, or at least for UTF-16 and UTF-32. So are the sizes of every character fixed or not? Damn, you guys should get paid for the help you are giving in this forum
Dec 26 2021
next sibling parent reply Kagamin <spam here.lot> writes:
D strings are plain arrays without any text-specific logic; the 
element is called a code unit, which has a fixed size, and the 
array length specifies how many elements are in the array. This 
model is most adequate for memory correctness, i.e. it shows what 
takes how much memory and where it will fit. D doesn't impose a 
fixed interpretation like characters or code points, because 
there are many of them and no single one is the correct one; you 
need one or another in different situations. The Linux console is 
one example of such a situation: it doesn't accept characters or 
code points, it accepts utf8 code units, and using anything else 
is an error.
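(For illustration only, since the thread's goal is to avoid libraries: Phobos exposes all of these views over the same array, and a sketch like this shows how the counts can differ:)

```
void main() {
    import std.stdio : writeln;
    import std.range : walkLength;
    import std.uni : byGrapheme;

    string s = "Hello 😂\n";

    writeln(s.length);                 // 11 code units (UTF-8 bytes)
    writeln(s.walkLength);             // 8 code points (the string is decoded while walking)
    writeln(s.byGrapheme.walkLength);  // 8 graphemes here; this count drops below the code
                                       // point count when marks or modifiers combine
}
```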
Dec 27 2021
parent reply rempas <rempas tutanota.com> writes:
On Monday, 27 December 2021 at 09:29:38 UTC, Kagamin wrote:
 D strings are plain arrays without any text-specific logic, the 
 element is called code unit, which has a fixed size, and the 
 array length specifies how many elements are in the array. This 
 model is most adequate for memory correctness, i.e. it shows 
 what takes how much memory and where it will fit. D doesn't 
 impose fixed interpretations like characters or code points, 
 because there are many of them and neither is the correct one, 
 you need one or another in different situations. Linux console 
 one example of such situation: it doesn't accept characters or 
 code points, it accepts utf8 code units, using anything else is 
 an error.
So should I just use UTF-8 only for Linux? What about other operating systems? I suppose Unix-based OSs (maybe MacOS as well, if I'm lucky) work the same way. But what about Windows? Unfortunately, I have to support this OS too with my library, so I should know. If you know and can tell me, of course...
Dec 27 2021
next sibling parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Monday, 27 December 2021 at 11:21:54 UTC, rempas wrote:
 So should I just use UTF-8 only for Linux?
Most unix things do utf-8 more often than not, but technically you are supposed to check the locale and change the terminal settings to do it right.
 But what about Windows?
You should ALWAYS use the -W suffix functions on Windows when available, and pass them utf-16 encoded strings.

There's a bunch of windows things taking utf-8 nowadays too, but utf-16 is what they standardized on back in the 1990's, so it gives you a lot of compatibility. The Windows OS will convert to other things for you if you use utf-16 consistently.
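(A minimal sketch of what that looks like from D, using druntime's Windows bindings; note that WriteConsoleW takes a length in UTF-16 code units and only works when stdout really is a console, not a pipe or file, and whether the emoji actually renders depends on the console font:)

```
version (Windows) void main()
{
    import core.sys.windows.windows;

    wstring msg = "Hello 😂\r\n"w;   // UTF-16 code units, which is what the -W APIs expect

    HANDLE con = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written;
    // The length is in WCHARs (UTF-16 code units), not bytes and not "characters".
    WriteConsoleW(con, msg.ptr, cast(DWORD) msg.length, &written, null);
}
```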
 Unfortunately I have to support this OS too with my library so 
 I should know.
The Windows API is an absolute pleasure to work with next to much of the trash you're forced to deal with on Linux.
Dec 27 2021
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Dec 27, 2021 at 02:30:55PM +0000, Adam D Ruppe via Digitalmars-d-learn
wrote:
 On Monday, 27 December 2021 at 11:21:54 UTC, rempas wrote:
 So should I just use UTF-8 only for Linux?
Most unix things do utf-8 more often than not, but technically you are supposed to check the locale and change the terminal settings to do it right.
Technically, yes. But practically all modern Linux distros have standardized on UTF-8, and you're quite unlikely to run into non-UTF-8 environments except on legacy systems or extremely specialized applications. I don't know what the situation is on BSD, but I'd imagine it's pretty similar. A lot of modern Linux applications don't even work properly under anything non-UTF-8, so for practical purposes I'd say don't even worry about it, unless you're specifically targeting a non-UTF-8 environment for a specific reason.
 But what about Windows?
You should ALWAYS use the -W suffix functions on Windows when available, and pass them utf-16 encoded strings.
[...]

I'm not a regular Windows user, but I did remember running into problems where sometimes command.exe doesn't handle Unicode properly, and needs an API call to switch it to UTF mode or something.

T

--
First Rule of History: History doesn't repeat itself -- historians merely repeat each other.
Dec 27 2021
parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Monday, 27 December 2021 at 15:26:16 UTC, H. S. Teoh wrote:
 A lot of modern Linux applications don't even work properly 
 under anything non-UTF-8
yeah, you're supposed to check the locale, but since so many people just assume UTF-8, that's becoming the new de facto reality - just like how people blindly shoot out vt100 codes without checking TERM, and that usually works too.
 I'm not a regular Windows user, but I did remember running into 
 problems where sometimes command.exe doesn't handle Unicode 
 properly, and needs an API call to switch it to UTF mode or 
 something.
That'd be because someone called the -A function instead of the -W ones. The -W ones just work if you use them. The -A ones are there for compatibility with Windows 95 and have quirks. This is the point behind my blog post I linked before: people saying to make that api call don't understand the problem and are patching over one bug with another bug instead of actually fixing it with the correct function call.
Dec 27 2021
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Dec 27, 2021 at 04:40:19PM +0000, Adam D Ruppe via Digitalmars-d-learn
wrote:
 On Monday, 27 December 2021 at 15:26:16 UTC, H. S. Teoh wrote:
 A lot of modern Linux applications don't even work properly under
 anything non-UTF-8
yeah, you're supposed to check the locale but since so many people just assume that's becoming the new de facto reality
Yep, sad reality.
 just like how people blindly shoot out vt100 codes without checking
 TERM and that usually works too.
Haha, doesn't terminal.d do that in a few places too? ;-) To be fair, though, most of the popular terminal apps are based off of extensions of vt100 codes anyway, so the basic escape sequences more-or-less work across the board. AFAIK non-vt100 codes are getting rarer and can practically be treated as legacy these days. (At least on Linux, that is. Can't say for the other *nixen.)
 I'm not a regular Windows user, but I did remember running into problems
 where sometimes command.exe doesn't handle Unicode properly, and needs
 an API call to switch it to UTF mode or something.
[...]
Point.

T

--
Just because you survived after you did it, doesn't mean it wasn't stupid!
Dec 27 2021
prev sibling parent reply rempas <rempas tutanota.com> writes:
On Monday, 27 December 2021 at 14:30:55 UTC, Adam D Ruppe wrote:
 Most unix things do utf-8 more often than not, but technically 
 you are supposed to check the locale and change the terminal 
 settings to do it right.
Cool! I mean, I don't plan on supporting legacy systems so I think we're fine if the up-to-date systems fully support UTF-8 as the default.
 You should ALWAYS use the -W suffix functions on Windows when 
 available, and pass them utf-16 encoded strings.

 There's a bunch of windows things taking utf-8 nowdays too, but 
 utf-16 is what they standardized on back in the 1990's so it 
 gives you a lot of compatibility. The Windows OS will convert 
 to other things for you it for you do this utf-16 consistently.
That's pretty nice. In this case it's even better because, at least for now, I will not work on Windows myself; making the library work on Linux is a bit of a challenge in itself. So I will wait for contributors to work on that; they will probably know how Windows converts UTF-8 to UTF-16 and will be able to do tests. Also, I plan to officially support only Windows 10/11 64-bit, so just like with Unix, I don't mind if legacy systems don't work.
 The Windows API is an absolute pleasure to work with next to 
 much of the trash you're forced to deal with on Linux.
Whaaaat??? Don't crash my dreams sempai!!! I mean, this may sound stupid, but which kind of API are you referring to? Do you mean system library stuff (like "unistd.h" for Linux and "windows.h" for Windows) or low-level system calls?
Dec 27 2021
parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Tuesday, 28 December 2021 at 06:51:52 UTC, rempas wrote:
 That's pretty nice. In this case is even better because at 
 least for now, I will not work on Windows by myself because 
 making the library work on Linux is a bit of a challenge itself.
What is your library? You might be able to just use my terminal.d too....
 The Windows API is an absolute pleasure to work with next to 
 much of the trash you're forced to deal with on Linux.
Whaaaat??? Don't crash my dreams sempai!!! I mean, this may sound stupid but which kind of API you are referring to? Do you mean system library stuff (like "unistd.h" for linux and "windows.h" for Windows) or low level system calls?
Virtually all of it; Windows is just way easier to develop for. You'll see if you get deeper in this terminal stuff.... reading mouse data from Windows is a simple read of input event structs. Doing it from a Linux system is....... not simple.
Dec 28 2021
parent rempas <rempas tutanota.com> writes:
On Tuesday, 28 December 2021 at 13:04:26 UTC, Adam D Ruppe wrote:
 What is your library? You might be able to just use my 
 terminal.d too....
My library will be "libd" it will be like "libc" but better and cooler! And it will be native to D! And of course it will not depend on "libc" and it will not require and special runtime support as it will be "betterC". I don't plan to replace the other "default libs" like "libm", librt", "libpthread" etc. tho. At least not for now...
 Virtually all of it; Windows is just way easier to develop for. 
 You'll see if you get deeper in this terminal stuff.... reading 
 mouse data from Windows is a simple read of input event 
 structs. Doing it from a Linux system is....... not simple.
Well sucks to be me....
Dec 28 2021
prev sibling parent reply Kagamin <spam here.lot> writes:
On Monday, 27 December 2021 at 11:21:54 UTC, rempas wrote:
 So should I just use UTF-8 only for Linux? What about other 
 operating systems? I suppose Unix-based OSs (maybe MacOS as 
 well if I'm lucky) work the same as well. But what about 
 Windows? Unfortunately I have to support this OS too with my 
 library so I should know. If you know and you can tell me of 
 course...
https://utf8everywhere.org/ - this is advice from a Windows programmer; I use it too. Windows allocates a per-thread buffer, and when you call, say, WriteConsoleA, it first transcodes the string to UTF-16 in that buffer and then calls WriteConsoleW; you would do something like that.
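(A sketch of that "UTF-8 inside, transcode at the boundary" approach; printUtf8 is just a hypothetical helper name, and error handling is omitted:)

```
version (Windows)
{
    import core.sys.windows.windows;

    // Keep UTF-8 in the program; convert to UTF-16 only when calling the -W API.
    void printUtf8(const(char)[] s)
    {
        // First call computes how many UTF-16 code units are needed.
        int n = MultiByteToWideChar(CP_UTF8, 0, s.ptr, cast(int) s.length, null, 0);
        auto buf = new wchar[](n);
        MultiByteToWideChar(CP_UTF8, 0, s.ptr, cast(int) s.length, buf.ptr, n);

        DWORD written;
        WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), buf.ptr, n, &written, null);
    }

    void main() { printUtf8("Hello 😂\r\n"); }
}
```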
Dec 27 2021
parent rempas <rempas tutanota.com> writes:
On Monday, 27 December 2021 at 14:47:51 UTC, Kagamin wrote:
 https://utf8everywhere.org/ - this is an advise from a windows 
 programmer, I use it too. Windows allocates a per thread buffer 
 and when you call, say, WriteConsoleA, it first transcodes the 
 string to UTF-16 in the buffer and calls WriteConsoleW, you 
 would do something like that.
That's awesome! Like I said to Adam, I will not officially write code for Windows myself (at least for now), so it will probably be up to the contributors to decide anyway. Tho it's nice to know that there will not be compatibility problems with the latest versions of Windows. Thanks a lot for the info man!
Dec 27 2021
prev sibling next sibling parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Monday, 27 December 2021 at 07:12:24 UTC, rempas wrote:
 Oh yeah. About that, I wasn't given a demonstration of how it 
 works so I forgot about it. I saw that in Unicode you can 
 combine some code points to get different results but I never 
 saw how that happens in practice.
The emoji is one example; the one you posted is two code points. Some other common ones: accented letters will SOMETIMES - there's exceptions - be created by the letter followed by an accent mark.

Some of those complicated emojis are several points with optional changes. Like it might be "woman" followed by "skin tone 2" 👩🏽. Some of them are "dancing" followed by "skin tone 0" followed by "male" and such.

So it displays as one thing, but it is composed of 2 or more code points, and each code point might be composed from several code units, depending on the encoding.

Again, think of it more as a little virtual machine building up a thing. A lot of these are actually based on combinations of old typewriters and old state machine terminal hardware. Like the reason "a" followed by "backspace" followed by "_" might sometimes be an underlined a - SOMETIMES, it depends on the receiving program, this isn't a unicode thing - is because of what typing that did on a typewriter with a piece of paper. The "a" gets stamped on the paper. Backspace just moves back, but since the "a" is already on the paper, it isn't going to be erased. So when you type the _, it gets stamped on the paper along with the a. So some programs emulate that concept.

The emoji thing is the same basic idea (though it doesn't use backspace): start by drawing a woman, then modify it with a skin color. Or start by drawing a person, then draw another person, then add a skin color, then make them female, and you have a family emoji. Impossible to do by stamping paper, but a little computer VM can understand this and build up the glyph.
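(A quick sketch that makes the "several code points, one glyph" point concrete, using a skin-tone-modified emoji like the one described above:)

```
void main() {
    import std.stdio : writefln;

    // A base emoji followed by a skin tone modifier: rendered as one glyph,
    // but stored as two code points.
    dstring s = "👩🏽"d;
    writefln("%s code points", s.length);   // 2
    foreach (dchar c; s)
        writefln("U+%05X", cast(uint) c);   // the base "woman" emoji, then the modifier
}
```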
 Yes, that's a great way of seeing it. I suppose that this all 
 happens under the hood and it is OS specific so why have to 
 know how the OS we are working with works under the hood to 
 fully understand how this happens.
Well, it isn't necessarily the OS, any program can do its own thing. Of course, the OS can define something: Windows, for example, defines that its things are UTF-16, or you can use a translation layer which does its own things for a great many functions. But still, applications might treat it differently. For example, the xterm terminal emulator can be configured to use utf-8 or something else. It can be configured to interpret them in a way that emulates certain old terminals, including ones that work like a printer or the state machine things.
 However, do you know what we do from cross compatibility then? 
 Because this sounds like a HUGE mess real world applications
Yeah, it is a complete mess, especially on Linux. But even on Windows, where Microsoft standardized on utf-16 for text functions, there's still weird exceptions. Like writing to the console vs piping to an application can be different. If you've ever written a single character to a windows pipe and seen different results than if you wrote two, now you get an idea why.... it is trying to auto-detect if it is two-byte characters or one-byte streams.

I wrote a little bit about this on my public blog: http://dpldocs.info/this-week-in-d/Blog.Posted_2019_11_25.html

Or view the source of my terminal.d to see some of the "fun" in decoding all this nonsense: http://arsd-official.dpldocs.info/arsd.terminal.html

The module there does a lot more than just the basics, but still, most of the top half of the file is all about this stuff. Mouse input might be encoded as utf characters, then you gotta change the mode and check various detection tricks. Ugh.
 I don't understand that. Based on your calculations, the 
 results should have been different. Also how are the numbers 
 fixed? Like you said the amount of bytes of each encoding is 
 not always standard for every character. Even if they were 
 fixed this means 2-bytes for each UTF-16 character and 4-bytes 
 for each UTF-32 character so still the numbers doesn't make 
 sense to me.
They're not characters, they're code points. Remember, multiple code points can be combined to form one character on screen.

Let's look at:

"Hello 😂\n";

This is actually a series of 8 code points: H, e, l, l, o, <space>, <crying face>, <new line>

Those code points can themselves be encoded in three different ways:

dstring: encodes each code point as a single element. That's why the dstring length there is 8. Each *element* of this, though, is 32 bits, which you see if you cast it to ubyte[]: the length in bytes is 4x the length of the dstring, but dstring.length returns the number of units, not the number of bytes. So here one unit = one point, but remember each *point* is NOT necessarily anything you see on screen. It represents just one complete instruction to the VM.

wstring: encodes each code point as one or two elements. If its value is in the lower half of the space (< 64k about), it gets one element. If it is in the upper half (> 64k), it gets two elements, one just saying "the next element should be combined with this one". That's why its length is 9. It kinda looks like:

H, e, l, l, o, <space>, <next element is a point in the upper half of the space>, <crying face>, <new line>

That "next element" unit is an additional element that is processed to figure out which points we get (which, again, are then fed into the VM thingy to be executed to actually produce something on screen). So when you see that "next element is a point..." thing, it puts that in a buffer and pulls another element off the stream to produce the next VM instruction. After it comes in, that instruction gets executed and added to the next buffer. Each element in this array is 16 bits, meaning if you cast it to ubyte[], you'll see the length double.

Finally, there's "string", which is utf-8, meaning each element is 8 bits, but again, there is a buffer you need to build up to get the code points you feed into that VM. Like we saw with 16 bits, there's now additional elements that tell you when a thing goes over. Any value < 128 gets a single element, then the next set gets two elements you do some bit shifts and bitwise-or to recombine, then another set with three elements and even a set with four elements. The first element tells you how many more elements you need to build up the point buffer.

H, e, l, l, o, <space>, <next point is combined by these bits PLUS THREE MORE elements>, <this is a work-in-progress element and needs two more>, <this is a work-in-progress element and needs one more>, <this is the final work-in-progress element>, <new line>

And now you see why it came to length == 11 - that emoji needed enough bits to build up the code point that it had to be spread across 4 bytes.

Notice how each element here told you how many elements are left. This is encoded into the bit pattern and is part of why it took 4 elements instead of just three; there's some error-checking redundancy in there. This is a nice part of the design allowing you to validate a utf-8 stream more reliably and even recover if you jumped somewhere in the middle of a multi-byte sequence. But anyway, that's kinda an implementation detail - the big point here is just that each element of the string array has pieces it needs to recombine to make the unicode code points.

Then, the unicode code points are instructions that are fed into a VM kind of thing to actually produce output, and this will sometimes vary depending on what the target program doing the interpreting is.
So the layers are:

1) bytes build up into string/wstring/dstring array elements (aka "code units")
2) those code unit element arrays are decoded into code point instructions
3) those code point instructions are run to produce output.

(or of course when you get to a human reader, they can interpret it differently too but obviously human language is a whole other mess lol)
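(To make layer 2 concrete, here is a minimal, non-validating sketch of how the UTF-8 code units of the original literal recombine into code points; a real decoder would also reject overlong sequences, orphaned continuation bytes and other malformed input:)

```
void main() {
    import std.stdio : writefln;

    string s = "Hello 😂\n";            // 11 UTF-8 code units

    for (size_t i = 0; i < s.length; ) {
        immutable ubyte b = cast(ubyte) s[i];
        uint cp;
        size_t extra;                   // how many continuation bytes follow

        if (b < 0x80)      { cp = b;        extra = 0; }  // 0xxxxxxx - ASCII range
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }  // 110xxxxx - 2-byte lead
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }  // 1110xxxx - 3-byte lead
        else               { cp = b & 0x07; extra = 3; }  // 11110xxx - 4-byte lead

        foreach (j; 1 .. extra + 1)
            cp = (cp << 6) | (s[i + j] & 0x3F);           // 10xxxxxx continuation bytes

        writefln("U+%05X from %s code unit(s)", cp, extra + 1);
        i += extra + 1;
    }
}
```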
Dec 27 2021
next sibling parent rempas <rempas tutanota.com> writes:
On Monday, 27 December 2021 at 14:23:37 UTC, Adam D Ruppe wrote:
 [...]
After reading the whole thing, I said it and I'll say it again! You guys must get paid for your support!!!! I also helped a guy in another forum yesterday by writing a very big reply and tbh it felt great :P
 (or of course when you get to a human reader, they can 
 interpret it differently too but obviously human language is a 
 whole other mess lol)
Yep! If machines are complicated, humans are even more complicated. Tho machines are also made by humans so... hmmmm!
Dec 27 2021
prev sibling parent reply ag0aep6g <anonymous example.com> writes:
On 27.12.21 15:23, Adam D Ruppe wrote:
 Let's look at:
 
 "Hello 😂\n";
[...]
 Finally, there's "string", which is utf-8, meaning each element is 8 
 bits, but again, there is a buffer you need to build up to get the code 
 points you feed into that VM.
[...]
 H, e, l, l, o, <space>, <next point is combined by these bits PLUS THREE 
 MORE elements>, <this is a work-in-progress element and needs two more>, 
 <this is a work-in-progress element and needs one more>, <this is the 
 final work-in-progress element>, <new line>
[...]
 Notice how each element here told you how many elements are left. This 
 is encoded into the bit pattern and is part of why it took 4 elements 
 instead of just three; there's some error-checking redundancy in there. 
 This is a nice part of the design allowing you to validate a utf-8 
 stream more reliably and even recover if you jumped somewhere in the 
 middle of a multi-byte sequence.
It's actually just the first byte that tells you how many are in the sequence. The continuation bytes don't have redundancies for that. To recover from the middle of a sequence, you just skip the orphaned continuation bytes one at a time.
Dec 27 2021
parent Adam D Ruppe <destructionator gmail.com> writes:
On Tuesday, 28 December 2021 at 06:46:57 UTC, ag0aep6g wrote:
 It's actually just the first byte that tells you how many are 
 in the sequence. The continuation bytes don't have redundancies 
 for that.
Right, but they do have that high bit set and next bit clear so you can tell you're in the middle and thus either go backward to the count byte to recover this character or go forward to the next count byte and drop this char while recovering the stream. My brain mixed this up with the rest of it and wrote it poorly lol.
Dec 28 2021
prev sibling next sibling parent reply Era Scarecrow <rtcvb32 yahoo.com> writes:
On Monday, 27 December 2021 at 07:12:24 UTC, rempas wrote:
 On Sunday, 26 December 2021 at 21:22:42 UTC, Adam Ruppe wrote:
 write just transfers a sequence of bytes. It doesn't know nor 
 care what they represent - that's for the receiving end to 
 figure out.
Oh, so it was as I expected :P
Well, to add functionality with, say, ANSI, you enter an escape code and then stuff like offset, color, effect, etc. UTF-8 effectively has its own "escape codes" in that anything 128 or over starts a multi-byte sequence, so as long as the terminal understands it, the terminal is what handles it. https://www.robvanderwoude.com/ansi.php In the end it's all just a binary string of 1's and 0's.
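(A tiny sketch of the ANSI/SGR escape idea; these particular sequences work on most modern terminal emulators, though strictly speaking you'd check TERM first, as mentioned elsewhere in the thread:)

```
void main() {
    import std.stdio : write, writeln;

    // ESC [ <params> m  is "select graphic rendition": 1 = bold, 4 = underline,
    // 31 = red foreground, 0 = reset back to normal.
    write("\x1b[1;31mHello\x1b[0m ");
    write("\x1b[4m😂\x1b[0m");
    writeln();
}
```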
Dec 27 2021
parent reply rempas <rempas tutanota.com> writes:
On Monday, 27 December 2021 at 21:38:03 UTC, Era Scarecrow wrote:
  Well to add functionality with say ANSI you entered an escape 
 code and then stuff like offset, color, effect, etc. UTF-8 
 automatically has escape codes being anything 128 or over, so 
 as long as the terminal understand it, it should be what's 
 handling it.

  https://www.robvanderwoude.com/ansi.php

  In the end it's all just a binary string of 1's and 0's.
Thanks for that post!! I already knew about some of these "escape codes", but a full list of them will come in handy ;)
Dec 27 2021
parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Tuesday, 28 December 2021 at 07:03:25 UTC, rempas wrote:
 I already knew about some of this "escape codes" but I full 
 list of them will come in handy ;)
https://invisible-island.net/xterm/ctlseqs/ctlseqs.html and that's not quite full either..... it really is a mess from hell
Dec 28 2021
parent reply rempas <rempas tutanota.com> writes:
On Tuesday, 28 December 2021 at 12:56:11 UTC, Adam D Ruppe wrote:
 https://invisible-island.net/xterm/ctlseqs/ctlseqs.html

 and that's not quite full either..... it really is a mess from 
 hell
Still less complicated and organized than my life...
Dec 28 2021
parent rempas <rempas tutanota.com> writes:
On Tuesday, 28 December 2021 at 14:53:57 UTC, rempas wrote:
 On Tuesday, 28 December 2021 at 12:56:11 UTC, Adam D Ruppe 
 wrote:
 https://invisible-island.net/xterm/ctlseqs/ctlseqs.html

 and that's not quite full either..... it really is a mess from 
 hell
Still less complicated and organized than my life...
"Less complicated and more organized" is what I wanted to say. Damn I can't even make a joke right...
Dec 28 2021
prev sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Monday, 27 December 2021 at 07:12:24 UTC, rempas wrote:
 I don't understand that. Based on your calculations, the 
 results should have been different. Also how are the numbers 
 fixed? Like you said the amount of bytes of each encoding is 
 not always standard for every character. Even if they were 
 fixed this means 2-bytes for each UTF-16 character and 4-bytes 
 for each UTF-32 character so still the numbers doesn't make 
 sense to me. So still the number of the "length" property 
 should have been the same for every encoding or at least for 
 UTF-16 and UTF-32. So are the sizes of every character fixed or 
 not?
Your string is represented by 8 codepoints. The number of code units needed to represent them in memory depends on the encoding. D supports working with 3 different encodings (the Unicode standard defines more than these 3):

string  utf8s  = "Hello 😂\n";
wstring utf16s = "Hello 😂\n"w;
dstring utf32s = "Hello 😂\n"d;

Here is the canonical Unicode representation of your string:

H      e      l      l      o      (space) 😂      \n
U+0048 U+0065 U+006C U+006C U+006F U+0020  U+1F602 U+000A

Let's see how these 3 variables are represented in memory:

utf8s : 48 65 6C 6C 6F 20 F0 9F 98 82 0A
        11 char in memory, using 11 bytes

utf16s: 0048 0065 006C 006C 006F 0020 D83D DE02 000A
        9 wchar in memory, using 18 bytes

utf32s: 00000048 00000065 0000006C 0000006C 0000006F 00000020 0001F602 0000000A
        8 dchar in memory, using 32 bytes

As you can see, the most compact form is generally UTF-8, which is why it is the preferred encoding for Unicode. UTF-16 is supported for legacy reasons: it is used in the Windows API and also internally in Java. UTF-32 has one advantage, in that it has a 1-to-1 mapping between codepoint and array index. In practice it is not that much of an advantage, as codepoints and characters are disjoint concepts. UTF-32 uses a lot of memory for practically no benefit (when you read in the forum about the big auto-decode error of D, it is linked to this).
Dec 28 2021