digitalmars.D.learn - UTF-16 endianness

reply Marek Janukowicz <marek janukowicz.net> writes:
I have trouble understanding how endianness works for UTF-16.

For example, the UTF-16 code unit for the character 'ł' is 0x0142, but this
program shows otherwise:

import std.stdio;

public void main () {
    ubyte[] properOrder = [0x01, 0x42];
    ubyte[] reverseOrder = [0x42, 0x01];
    writefln( "proper: %s, reverse: %s",
        cast(wchar[])properOrder,
        cast(wchar[])reverseOrder );
}

output:

proper: 䈁, reverse: ł

Is there anything I should know about UTF endianness?

-- 
Marek Janukowicz
Jan 29 2016
next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/29/16 5:36 PM, Marek Janukowicz wrote:
 I have trouble understanding how endianness works for UTF-16.

 For example, the UTF-16 code unit for the character 'ł' is 0x0142, but this
 program shows otherwise:

 import std.stdio;

 public void main () {
     ubyte[] properOrder = [0x01, 0x42];
     ubyte[] reverseOrder = [0x42, 0x01];
     writefln( "proper: %s, reverse: %s",
         cast(wchar[])properOrder,
         cast(wchar[])reverseOrder );
 }

 output:

 proper: 䈁, reverse: ł

 Is there anything I should know about UTF endianness?
It's not any different from other endianness. In other words, a UTF-16 code unit is expected to be in the endianness of the platform you are running on. If you are on x86 or x86_64 (very likely), then it should be little-endian. If your source of data is big-endian (or otherwise opposite to your native endianness), it will have to be converted before you treat it as a wchar[].

Note that the version identifiers BigEndian and LittleEndian can be used to compile the correct code.

-Steve
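To make the point above concrete, here is a minimal sketch that assembles the code unit arithmetically from the two bytes in the question instead of reinterpreting memory, so it gives the same answer on any host. The helper names `fromBigEndian` and `fromLittleEndian` are made up for this illustration.

```d
import std.stdio;

/// Hypothetical helper: treats `first` as the high byte (big-endian order).
wchar fromBigEndian(ubyte first, ubyte second)
{
    return cast(wchar)((first << 8) | second);
}

/// Hypothetical helper: treats `first` as the low byte (little-endian order).
wchar fromLittleEndian(ubyte first, ubyte second)
{
    return cast(wchar)((second << 8) | first);
}

void main()
{
    // The bytes 0x01, 0x42 as they appear in the original program.
    writefln("as big-endian: U+%04X, as little-endian: U+%04X",
             cast(uint) fromBigEndian(0x01, 0x42),
             cast(uint) fromLittleEndian(0x01, 0x42));
    // prints: as big-endian: U+0142, as little-endian: U+4201
}
```

U+4201 is the 䈁 from the original output, which is what a little-endian machine sees when it reads those two bytes as one wchar.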
Jan 29 2016
parent reply Marek Janukowicz <marek janukowicz.net> writes:
On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
 Is there anything I should know about UTF endianness?
It's not any different from other endianness. In other words, a UTF-16 code unit is expected to be in the endianness of the platform you are running on. If you are on x86 or x86_64 (very likely), then it should be little-endian. If your source of data is big-endian (or otherwise opposite to your native endianness),
To be precise: my case is the IMAP UTF-7 folder name encoding, and I finally found out that it is indeed big-endian, which explains my problem (as I am indeed on x86_64).
 it will have to be converted before treating as a wchar[].
Is there any clever way to do the conversion? Or do I need to swap the bytes manually?
 Note the version identifiers BigEndian and LittleEndian can be used to
 compile the correct code.
This solution is of no use to me as I don't want to change the endianness in general.

-- Marek Janukowicz
Jan 29 2016
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/29/16 6:03 PM, Marek Janukowicz wrote:
 On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
 Is there anything I should know about UTF endianness?
It's not any different from other endianness. In other words, a UTF-16 code unit is expected to be in the endianness of the platform you are running on. If you are on x86 or x86_64 (very likely), then it should be little-endian. If your source of data is big-endian (or otherwise opposite to your native endianness),
To be precise: my case is the IMAP UTF-7 folder name encoding, and I finally found out that it is indeed big-endian, which explains my problem (as I am indeed on x86_64).
 it will have to be converted before treating as a wchar[].
Is there any clever way to do the conversion? Or do I need to swap the bytes manually?
No clever way, just the straightforward way ;) Swapping the endianness of a 32-bit value can be done with core.bitop.bswap. For 16 bits, I believe you have to do bit shifting. Something like:

foreach(ref elem; wcharArr)
    elem = ((elem << 8) & 0xff00) | ((elem >> 8) & 0x00ff);

Or you can swap the bytes directly before casting.
 Note the version identifiers BigEndian and LittleEndian can be used to
 compile the correct code.
This solution is of no use to me as I don't want to change the endianness in general.
What I mean is that you can annotate your code with version statements like:

version(LittleEndian) {
    // perform the byteswap
    ...
}

so your code is portable to BigEndian systems (where you would not want to byte swap).

-Steve
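Putting the shift-based swap and the version block together, one possible shape is sketched below. The helper names `swapBytes` and `bigEndianToHost` are invented for this example, and the input is assumed to hold an even number of bytes of big-endian UTF-16.

```d
import std.stdio;

/// Hypothetical helper: swaps the two bytes of one UTF-16 code unit.
wchar swapBytes(wchar c) pure nothrow @nogc @safe
{
    return cast(wchar)(((c << 8) & 0xff00) | ((c >> 8) & 0x00ff));
}

/// Hypothetical helper: converts big-endian code units to host order in place.
void bigEndianToHost(wchar[] units) pure nothrow @nogc @safe
{
    version (LittleEndian)
    {
        // Only a little-endian host needs the swap.
        foreach (ref u; units)
            u = swapBytes(u);
    }
    // On a BigEndian host the data is already in native order.
}

void main()
{
    // Big-endian bytes for U+0142, as in the original question.
    ubyte[] raw = [0x01, 0x42];

    // Reinterpret a copy as code units; the byte count must be even.
    wchar[] units = cast(wchar[]) raw.dup;
    bigEndianToHost(units);

    writeln(units);
}
```

The swap is compiled in only where it is needed, so the same source should behave identically on both kinds of host.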
Jan 29 2016
next sibling parent Johannes Pfau <nospam example.com> writes:
Am Fri, 29 Jan 2016 18:58:17 -0500
schrieb Steven Schveighoffer <schveiguy yahoo.com>:

 On 1/29/16 6:03 PM, Marek Janukowicz wrote:
 On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:  
 Is there anything I should know about UTF endianness?
It's not any different from other endianness. In other words, a UTF-16 code unit is expected to be in the endianness of the platform you are running on. If you are on x86 or x86_64 (very likely), then it should be little-endian. If your source of data is big-endian (or otherwise opposite to your native endianness),
To be precise: my case is the IMAP UTF-7 folder name encoding, and I finally found out that it is indeed big-endian, which explains my problem (as I am indeed on x86_64).
 it will have to be converted before treating as a wchar[].
Is there any clever way to do the conversion? Or do I need to swap the bytes manually?
No clever way, just the straightforward way ;) Swapping the endianness of a 32-bit value can be done with core.bitop.bswap. For 16 bits, I believe you have to do bit shifting. Something like:

foreach(ref elem; wcharArr)
    elem = ((elem << 8) & 0xff00) | ((elem >> 8) & 0x00ff);

Or you can swap the bytes directly before casting.
There's also a Phobos solution: bigEndianToNative in std.bitmanip.
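As a rough sketch of how bigEndianToNative might be applied here: it takes a static array of exactly T.sizeof bytes, so the input is processed two bytes at a time. The wrapper `decodeUtf16BE` is a made-up name for this example, and it assumes the input has an even number of bytes.

```d
import std.bitmanip : bigEndianToNative;
import std.stdio;

/// Hypothetical helper: decodes big-endian UTF-16 bytes into native code units.
wchar[] decodeUtf16BE(const(ubyte)[] raw)
{
    assert(raw.length % 2 == 0, "UTF-16 data must have an even byte count");

    auto result = new wchar[raw.length / 2];
    foreach (i, ref unit; result)
    {
        // Copy the next two bytes into the static array the function expects.
        ubyte[2] pair = raw[2 * i .. 2 * i + 2];
        unit = bigEndianToNative!wchar(pair);
    }
    return result;
}

void main()
{
    // Big-endian bytes for U+0142 followed by U+0041 ('A').
    ubyte[] raw = [0x01, 0x42, 0x00, 0x41];
    writeln(decodeUtf16BE(raw));
}
```

No version block is needed with this approach, since bigEndianToNative already does nothing on a big-endian host.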
Jan 29 2016
prev sibling parent Marek Janukowicz <marek janukowicz.net> writes:
On Fri, 29 Jan 2016 18:58:17 -0500, Steven Schveighoffer wrote:
 Note the version identifiers BigEndian and LittleEndian can be used to
 compile the correct code.
This solution is of no use to me as I don't want to change the endianness in general.
What I mean is that you can annotate your code with version statements like:

version(LittleEndian) {
    // perform the byteswap
    ...
}

so your code is portable to BigEndian systems (where you would not want to byte swap).
That's a good point, thanks.

-- Marek Janukowicz
Jan 30 2016
prev sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Friday, 29 January 2016 at 22:36:37 UTC, Marek Janukowicz 
wrote:
 I have trouble understanding how endianness works for UTF-16.
UTF-16 (as well as UTF-32) comes in both little-endian and big-endian variants. A byte-order mark (BOM) at the start of the file can help you detect which one it is. See this table: http://www.unicode.org/faq/utf_bom.html#gen6
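For illustration, a minimal BOM check for the two UTF-16 variants could look like the sketch below. The enum `ByteOrder` and the function `detectUtf16Bom` are hypothetical names, and the sketch deliberately ignores the UTF-8 and UTF-32 marks listed in that table.

```d
import std.stdio;

/// Hypothetical result type for this sketch.
enum ByteOrder { unknown, utf16LE, utf16BE }

/// Hypothetical helper: inspects the first two bytes for a UTF-16 BOM.
/// Caveat: FF FE 00 00 is the UTF-32LE mark, which this sketch would
/// report as utf16LE because it only looks at two bytes.
ByteOrder detectUtf16Bom(const(ubyte)[] data)
{
    if (data.length >= 2)
    {
        if (data[0] == 0xFE && data[1] == 0xFF)
            return ByteOrder.utf16BE;
        if (data[0] == 0xFF && data[1] == 0xFE)
            return ByteOrder.utf16LE;
    }
    return ByteOrder.unknown;
}

void main()
{
    ubyte[] bigEndianData = [0xFE, 0xFF, 0x01, 0x42];
    ubyte[] noBom = [0x01, 0x42];

    writeln(detectUtf16Bom(bigEndianData)); // prints: utf16BE
    writeln(detectUtf16Bom(noBom));         // prints: unknown
}
```

Data without a BOM, such as the IMAP folder names discussed in this thread, has to rely on the byte order fixed by the protocol instead.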
Jan 29 2016