digitalmars.D.learn - UTF-16 endianess

Marek Janukowicz <marek janukowicz.net> writes:

I have trouble understanding how endianness works for UTF-16.

For example, the UTF-16 code for the 'ł' character is 0x0142. But this program
shows otherwise:

import std.stdio;

public void main () {
    ubyte[] properOrder = [0x01, 0x42];
    ubyte[] reverseOrder = [0x42, 0x01];
    writefln( "proper: %s, reverse: %s",
        cast(wchar[])properOrder,
        cast(wchar[])reverseOrder );
}

output:

proper: 䈁, reverse: ł

Is there anything I should know about UTF endianness?

-- 
Marek Janukowicz

Jan 29 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 1/29/16 5:36 PM, Marek Janukowicz wrote:
 I have trouble understanding how endianness works for UTF-16.

 For example UTF-16 code for 'ł' character is 0x0142. But this program shows
 otherwise:

 import std.stdio;

 public void main () {
    ubyte[] properOrder = [0x01, 0x42];
 	ubyte[] reverseOrder = [0x42, 0x01];
 	writefln( "proper: %s, reverse: %s",
 		cast(wchar[])properOrder,
 		cast(wchar[])reverseOrder );
 }

 output:

 proper: 䈁, reverse: ł

 Is there anything I should know about UTF endianness?

It's not any different from other endianness.

In other words, a UTF-16 code unit is expected to be in the endianness of 
the platform you are running on.

If you are on x86 or x86_64 (very likely), then it should be little endian.

If your source of data is big-endian (or opposite from your native 
endianness), it will have to be converted before treating as a wchar[].

Note the version identifiers BigEndian and LittleEndian can be used to 
compile the correct code.
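
A minimal sketch of such a conversion, assuming the input is big-endian UTF-16 with an even number of bytes. The helper name fromBigEndianBytes is made up for illustration and is not part of Phobos. It assembles each code unit arithmetically, so the result does not depend on the endianness of the host:

```d
import std.stdio;

// Hypothetical helper: builds UTF-16 code units from big-endian bytes.
// The shift-and-or works on values, not memory layout, so it behaves the
// same on little-endian and big-endian machines.
wstring fromBigEndianBytes(const(ubyte)[] bytes)
{
    assert(bytes.length % 2 == 0, "UTF-16 data must have an even byte count");
    wchar[] units;
    for (size_t i = 0; i < bytes.length; i += 2)
        units ~= cast(wchar)((bytes[i] << 8) | bytes[i + 1]);
    return units.idup;
}

void main()
{
    ubyte[] data = [0x01, 0x42];        // U+0142 in big-endian byte order
    writeln(fromBigEndianBytes(data));  // code unit 0x0142, i.e. 'ł'
}
```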

-Steve

Jan 29 2016

Marek Janukowicz <marek janukowicz.net> writes:

On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
 Is there anything I should know about UTF endianness?

 It's not any different from other endianness.

 In other words, a UTF16 code unit is expected to be in the endianness of 
 the platform you are running on.

 If you are on x86 or x86_64 (very likely), then it should be little endian.

 If your source of data is big-endian (or opposite from your native 
 endianness), 

To be precise - my case is IMAP UTF-7 folder name encoding and I finally found
out it's indeed big endian, which explains my problem (as I'm indeed on x86_64).

 it will have to be converted before treating as a wchar[].

Is there any clever way to do the conversion? Or do I need to swap the bytes
manually?

 Note the version identifiers BigEndian and LittleEndian can be used to 
 compile the correct code.

This solution is of no use to me as I don't want to change the endianness in
general.

-- 
Marek Janukowicz

Jan 29 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 1/29/16 6:03 PM, Marek Janukowicz wrote:
 On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
 Is there anything I should know about UTF endianness?

 It's not any different from other endianness.

 In other words, a UTF16 code unit is expected to be in the endianness of
 the platform you are running on.

 If you are on x86 or x86_64 (very likely), then it should be little endian.

 If your source of data is big-endian (or opposite from your native
 endianness),

 To be precise - my case is IMAP UTF7 folder name encoding and I finally found
 out it's indeed big endian, which explains my problem (as I'm indeed on x86_64).

 it will have to be converted before treating as a wchar[].

 Is there any clever way to do the conversion? Or do I need to swap the bytes
 manually?

No clever way, just the straightforward way ;)

Swapping the endianness of a 32-bit value can be done with core.bitop.bswap.
For 16 bits, I believe you have to do bit shifting. Something like:

foreach(ref elem; wcharArr) elem = ((elem << 8) & 0xff00) | ((elem >> 8) & 0x00ff);

Or you can do it with the bytes directly before casting
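
A rough sketch of the byte-level variant, assuming an even-length buffer and a little-endian host (on a big-endian host the swap would be wrong). The name swapBytePairs is hypothetical:

```d
import std.algorithm.mutation : swap;
import std.stdio;

// Hypothetical helper: exchanges the two bytes of every 16-bit pair in place.
void swapBytePairs(ubyte[] bytes)
{
    assert(bytes.length % 2 == 0, "expected an even number of bytes");
    for (size_t i = 0; i + 1 < bytes.length; i += 2)
        swap(bytes[i], bytes[i + 1]);
}

void main()
{
    ubyte[] data = [0x01, 0x42];  // big-endian bytes for U+0142
    swapBytePairs(data);          // now [0x42, 0x01]
    // Reinterpreting the bytes as wchar[] uses the host byte order,
    // so this cast is only meaningful on a little-endian machine.
    writeln(cast(wchar[]) data);
}
```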

 Note the version identifiers BigEndian and LittleEndian can be used to
 compile the correct code.

 This solution is of no use to me as I don't want to change the endianness in
 general.

What I mean is that you can annotate your code with version statements like:

version(LittleEndian)
{
    // perform the byteswap
    ...
}

so your code is portable to BigEndian systems (where you would not want 
to byte swap).
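
Filled in, that could look roughly like the following. It is a sketch that assumes the incoming code units are big-endian; the function name bigEndianToHostInPlace is invented for the example:

```d
import std.stdio;

// Hypothetical helper: converts big-endian UTF-16 code units to the host
// byte order, in place.
void bigEndianToHostInPlace(wchar[] units)
{
    version (LittleEndian)
    {
        // Host is little-endian: swap the two bytes of every code unit.
        foreach (ref u; units)
            u = cast(wchar)(((u << 8) & 0xff00) | ((u >> 8) & 0x00ff));
    }
    // On a BigEndian host the data is already in native order,
    // so nothing is compiled in here.
}

void main()
{
    ubyte[] raw = [0x01, 0x42];       // big-endian bytes for U+0142
    auto units = cast(wchar[]) raw;   // reinterpreted in host byte order
    bigEndianToHostInPlace(units);
    writeln(units);
}
```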

-Steve

Jan 29 2016

Johannes Pfau <nospam example.com> writes:

On Fri, 29 Jan 2016 18:58:17 -0500, Steven Schveighoffer
<schveiguy yahoo.com> wrote:

 On 1/29/16 6:03 PM, Marek Janukowicz wrote:
 On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:  
 Is there anything I should know about UTF endianness?  

 It's not any different from other endianness.

 In other words, a UTF16 code unit is expected to be in the
 endianness of the platform you are running on.

 If you are on x86 or x86_64 (very likely), then it should be
 little endian.

 If your source of data is big-endian (or opposite from your native
 endianness),  

 To be precise - my case is IMAP UTF7 folder name encoding and I
 finally found out it's indeed big endian, which explains my problem
 (as I'm indeed on x86_64). 
 it will have to be converted before treating as a wchar[].  

 Is there any clever way to do the conversion? Or do I need to swap
 the bytes manually?  

 
 No clever way, just the straightforward way ;)
 
 Swapping endianness of 32-bits can be done with core.bitop.bswap.
 Doing it with 16 bits I believe you have to do bit shifting.
 Something like:
 
 foreach(ref elem; wcharArr) elem = ((elem << 8) & 0xff00) | ((elem >> 8) & 0x00ff);
 
 Or you can do it with the bytes directly before casting


There's also a Phobos solution: bigEndianToNative in std.bitmanip.
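
A small sketch of how that might be used for UTF-16, assuming the input is an even-length big-endian byte buffer. The wrapper decodeUtf16BE is a made-up name; only bigEndianToNative comes from std.bitmanip, and it takes a fixed-size ubyte[2] for a 16-bit target type:

```d
import std.bitmanip : bigEndianToNative;
import std.stdio;

// Hypothetical wrapper: decodes big-endian UTF-16 bytes into a wstring
// by converting one 2-byte pair at a time.
wstring decodeUtf16BE(const(ubyte)[] bytes)
{
    assert(bytes.length % 2 == 0, "expected an even number of bytes");
    wchar[] units;
    for (size_t i = 0; i < bytes.length; i += 2)
    {
        ubyte[2] pair;               // bigEndianToNative wants a static array
        pair[0] = bytes[i];
        pair[1] = bytes[i + 1];
        units ~= cast(wchar) bigEndianToNative!ushort(pair);
    }
    return units.idup;
}

void main()
{
    ubyte[] data = [0x01, 0x42];     // big-endian bytes for U+0142
    writeln(decodeUtf16BE(data));
}
```

Because the library function handles the host byte order itself, no version(LittleEndian) block is needed with this approach.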

Jan 29 2016

Marek Janukowicz <marek janukowicz.net> writes:

On Fri, 29 Jan 2016 18:58:17 -0500, Steven Schveighoffer wrote:
 Note the version identifiers BigEndian and LittleEndian can be used to
 compile the correct code.

 This solution is of no use to me as I don't want to change the endianness in
 general.

 What I mean is that you can annotate your code with version statements like:

 version(LittleEndian)
 {
     // perform the byteswap
     ...
 }

 so your code is portable to BigEndian systems (where you would not want 
 to byte swap).

That's a good point, thanks.

-- 
Marek Janukowicz

Jan 30 2016

Adam D. Ruppe <destructionator gmail.com> writes:

On Friday, 29 January 2016 at 22:36:37 UTC, Marek Janukowicz 
wrote:
 I have trouble understanding how endianness works for UTF-16.

UTF-16 (as well as UTF-32) comes in both little-endian and 
big-endian variants. A byte-order marker in the file can help you 
detect which one it is in.

See this table:

http://www.unicode.org/faq/utf_bom.html#gen6
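
As an illustrative sketch of BOM detection for UTF-16 only: the bytes FE FF indicate big-endian and FF FE indicate little-endian. The enum and function names are hypothetical, and the check deliberately ignores UTF-32, whose little-endian BOM also starts with FF FE. Data without a BOM is reported as unknown:

```d
import std.stdio;

// Hypothetical result type for the example.
enum Utf16Order { unknown, bigEndian, littleEndian }

// Hypothetical helper: inspects the first two bytes for a UTF-16 BOM.
// Assumes the data is UTF-16 if a BOM is present; UTF-32 is not handled.
Utf16Order detectUtf16Bom(const(ubyte)[] bytes)
{
    if (bytes.length >= 2)
    {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Utf16Order.bigEndian;
        if (bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Utf16Order.littleEndian;
    }
    return Utf16Order.unknown;  // no BOM: the order must be known some other way
}

void main()
{
    ubyte[] data = [0xFE, 0xFF, 0x01, 0x42];
    writeln(detectUtf16Bom(data));
}
```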

Jan 29 2016
