digitalmars.D.learn - Why ElementType!(char[3]) == dchar instead of char?

reply drug <drug2004 bk.ru> writes:
http://dpaste.dzfl.pl/4535c5c03126
Sep 01 2015
next sibling parent reply drug <drug2004 bk.ru> writes:
On 01.09.2015 19:18, drug wrote:
 http://dpaste.dzfl.pl/4535c5c03126
Should I use ForeachType!(char[3]) instead of ElementType?
Sep 01 2015
parent Justin Whear <justin economicmodeling.com> writes:
On Tue, 01 Sep 2015 19:21:44 +0300, drug wrote:

 Should I use ForeachType!(char[3]) instead of ElementType?
Try std.range.ElementEncodingType
Sep 01 2015
prev sibling parent reply Justin Whear <justin economicmodeling.com> writes:
On Tue, 01 Sep 2015 19:18:42 +0300, drug wrote:

 http://dpaste.dzfl.pl/4535c5c03126
Arrays of char are assumed to be UTF-8 encoded text and a single char is not necessarily sufficient to represent a character. ElementType identifies the type that you will receive when (for instance) foreaching over the array and D autodecodes the UTF-8 for you. If you'd like to represent raw bytes use byte[3] or ubyte[3]. If you'd like other encodings, check out std.encoding.
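
The decoding behaviour described above can be checked with a small sketch. It assumes a Phobos version with autodecoding, where the range primitive front decodes narrow strings; the sample text is made up for illustration.

```d
import std.range : front, walkLength;

void main()
{
    // U+00E9 occupies two UTF-8 code units, so the array holds 4 chars.
    char[] text = "h\u00E9y".dup;
    assert(text.length == 4);

    // Indexing yields raw code units...
    static assert(is(typeof(text[0]) == char));

    // ...while front decodes a whole code point.
    static assert(is(typeof(text.front) == dchar));
    assert(text.walkLength == 3);

    // ubyte[] is never decoded; front is just the first byte.
    ubyte[] raw = [0x68, 0xC3, 0xA9, 0x79];
    static assert(is(typeof(raw.front) == ubyte));
    assert(raw.walkLength == 4);
}
```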
Sep 01 2015
parent reply Justin Whear <justin economicmodeling.com> writes:
I should correct this:
* ForeachType is the element type that will be inferred by a foreach loop.
* ElementType is usually the same as ForeachType, but is the type of the value returned by .front.

One major distinction is that ElementType is only for ranges, while ForeachType will work for iterable non-ranges.
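
The difference between the traits mentioned in this thread can be spelled out with compile-time checks. This is a sketch, assuming a Phobos version with autodecoding; slices are used for the range traits because static arrays are not ranges.

```d
import std.range : ElementType, ElementEncodingType;
import std.traits : ForeachType;

// .front on a narrow string decodes, so ElementType reports dchar.
static assert(is(ElementType!(char[]) == dchar));

// ElementEncodingType reports the code unit actually stored in memory.
static assert(is(ElementEncodingType!(char[]) == char));

// ForeachType is what foreach infers when no element type is written.
static assert(is(ForeachType!(char[]) == char));

// ForeachType also accepts static arrays, which are iterable non-ranges.
static assert(is(ForeachType!(char[3]) == char));

// For non-character arrays all three traits agree.
static assert(is(ElementType!(ubyte[]) == ubyte));
static assert(is(ElementEncodingType!(ubyte[]) == ubyte));
static assert(is(ForeachType!(ubyte[]) == ubyte));

void main() {}
```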
Sep 01 2015
parent reply drug <drug2004 bk.ru> writes:
I'm just trying to automatically convert D types to hdf5 types, so I guess char[..] isn't necessarily some form of UTF-8 encoded text. Or should I treat it as such?
Sep 01 2015
next sibling parent "H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
On Tue, Sep 01, 2015 at 07:40:24PM +0300, drug via Digitalmars-d-learn wrote:
[...]
 I'm just trying to automatically convert D types to hdf5 types, so I
 guess char[..] isn't necessarily some form of UTF-8 encoded text. Or
 should I treat it as such?
In D, char[]/wchar[]/dchar[] are intended to be UTF. If you're dealing with strings encoded with other character sets, you should use ubyte[] (or ushort[], etc.) instead.

T

--
EMACS = Extremely Massive And Cumbersome System
Sep 01 2015
prev sibling parent reply Justin Whear <justin economicmodeling.com> writes:
On Tue, 01 Sep 2015 19:40:24 +0300, drug wrote:

 I'm just trying to automatically convert D types to hdf5 types, so I
 guess char[..] isn't necessarily some form of UTF-8 encoded text. Or
 should I treat it as such?
Because of D's autodecoding it can be problematic to assume UTF-8 if other encodings are actually in use. If, for instance, you try printing a string stored as char[] that is actually Latin-1 encoded and contains characters from the high range, you'll get a runtime UTF-8 decoding exception. If you don't know ahead of time what the encoding will be, using ubyte[] will be safer. The other option is to dynamically reencode strings to UTF-8 as you read them.
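
A minimal sketch of that failure mode, assuming the input is the word "café" encoded as Latin-1. The helper isValidUtf8 is a made-up name for illustration; std.utf.validate and UTFException are the Phobos pieces doing the work.

```d
import std.utf : UTFException, validate;

// Hypothetical helper: reports whether a byte buffer is well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // The cast only relabels the bytes; validate does the actual checking.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "café" in Latin-1: a lone 0xE9 is not a valid UTF-8 sequence.
    ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9];
    assert(!isValidUtf8(latin1));

    // The same word in UTF-8 encodes U+00E9 as the two bytes 0xC3 0xA9.
    ubyte[] utf8 = [0x63, 0x61, 0x66, 0xC3, 0xA9];
    assert(isValidUtf8(utf8));
}
```

Keeping the data as ubyte[] until it has been validated or re-encoded avoids handing mislabeled text to code that decodes it.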
Sep 01 2015
parent reply drug <drug2004 bk.ru> writes:
In my case I don't know what type the user will be using, because I'm writing a
library. What's the best way to process char[..] in this case?
Sep 01 2015
parent reply Jonathan M Davis via Digitalmars-d-learn writes:
On Tuesday, September 01, 2015 20:05:18 drug via Digitalmars-d-learn wrote:
 In my case I don't know what type the user will be using, because I'm writing a
 library. What's the best way to process char[..] in this case?
char[] should never be anything other than UTF-8. Similarly, wchar[] is UTF-16, and dchar[] is UTF-32. So, if you're getting something other than UTF-8, it should not be char[]. It should be something more like ubyte[]. If you want to operate on it as char[], you should convert it to UTF-8. std.encoding may or may not help with that.

But pretty much everything in D - certainly in the standard library - assumes that char, wchar, and dchar are UTF-encoded, and the language spec basically defines them that way. Technically, you _can_ put other encodings in them, but it's just asking for trouble.

- Jonathan M Davis
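
One way the conversion could look, as a sketch only: it assumes the incoming bytes are known to be Latin-1, and latin1ToUtf8 is a made-up helper name. Other source encodings would need a different std.encoding string type or another library.

```d
import std.encoding : Latin1String, transcode;

// Hypothetical helper: re-encodes Latin-1 bytes as a UTF-8 string.
string latin1ToUtf8(const(ubyte)[] bytes)
{
    // Latin1String is an array of one-byte Latin1Char values,
    // so an immutable copy of the bytes can be relabeled with a cast.
    auto source = cast(Latin1String) bytes.idup;

    string result;
    transcode(source, result);
    return result;
}

void main()
{
    // "café" in Latin-1 is four bytes.
    ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9];
    string text = latin1ToUtf8(latin1);

    // In UTF-8 the same word takes five code units.
    assert(text == "caf\u00E9");
    assert(text.length == 5);
}
```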
Sep 01 2015
parent reply drug <drug2004 bk.ru> writes:
I see, thanks. So I should always treat char[] as UTF within D itself, but because I need to pass char[], wchar[] or dchar[] to a C library, I should treat them not as UTF but as sequences of ubyte, ushort or uint, just to pass them correctly, right?
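
A sketch of that idea using std.string.representation, which returns the code units of a string as ubyte, ushort or uint without decoding. Both storeRaw (standing in for a C routine that stores a buffer unprocessed) and store are made-up names; nothing here is the hdf5 API.

```d
import std.string : representation;
import std.traits : isSomeChar;

// Hypothetical stand-in for a C function that stores bytes as-is.
// It returns the number of bytes it was given.
extern (C) size_t storeRaw(const(void)* data, size_t byteCount)
{
    return byteCount;
}

// Hypothetical wrapper: hands any D string type to storeRaw as raw code units.
size_t store(T)(const(T)[] units)
    if (isSomeChar!T)
{
    // const(ubyte)[], const(ushort)[] or const(uint)[], depending on T.
    auto raw = units.representation;
    return storeRaw(raw.ptr, raw.length * T.sizeof);
}

// representation maps each character type to an unsigned integer type.
static assert(is(typeof("abc".representation) == immutable(ubyte)[]));
static assert(is(typeof("abc"w.representation) == immutable(ushort)[]));
static assert(is(typeof("abc"d.representation) == immutable(uint)[]));

void main()
{
    assert(store("h\u00E9y") == 4);   // 4 UTF-8 code units of 1 byte
    assert(store("h\u00E9y"w) == 6);  // 3 UTF-16 code units of 2 bytes
    assert(store("h\u00E9y"d) == 12); // 3 UTF-32 code units of 4 bytes
}
```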
Sep 01 2015
parent reply "FreeSlave" <freeslave93 gmail.com> writes:
You should just keep in mind that strings returned by Phobos are UTF encoded. Does your C library have UTF support? Is it relevant at all? Maybe it just treats a char array as binary data. But if it does some non-trivial string and character manipulation or talks to the file system, then it surely expects strings in some specific encoding, and if that's not UTF, you should re-encode the data before passing it from D to this library. Also, C does not have wchar and dchar, but it has wchar_t, whose size is not fixed and depends on the platform.
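
The wchar_t point can be illustrated with a short sketch. It assumes druntime's core.stdc.stddef.wchar_t is an alias for one of D's character types on the target platform, so that std.conv.to can convert to it; toWide is a made-up helper name.

```d
import core.stdc.stddef : wchar_t;
import std.conv : to;

// Hypothetical helper: converts UTF-8 text to the platform's wchar_t encoding.
immutable(wchar_t)[] toWide(string s)
{
    return s.to!(immutable(wchar_t)[]);
}

void main()
{
    // ASCII text needs one unit per character at either width.
    assert(toWide("abc").length == 3);

    // U+1F600 lies outside the BMP: two units if wchar_t is 16 bits wide
    // (a surrogate pair), one unit if it is 32 bits wide.
    auto wide = toWide("\U0001F600");
    assert(wide.length == (wchar_t.sizeof == 2 ? 2 : 1));
}
```

A C function expecting a zero-terminated wchar_t string would additionally need a terminator appended before the pointer is passed.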
Sep 02 2015
parent reply drug <drug2004 bk.ru> writes:
Well, I think it's not a simple question. The C library I use is the hdf5 lib, and in general it stores data without processing. In each particular case I'll need to evaluate the situation concretely, I guess. Thanks all for the answers.
Sep 02 2015
parent Jonathan M Davis via Digitalmars-d-learn writes:
Yeah. char in C is often used for what D uses ubyte for, so just because C uses a char doesn't mean that it even has anything to do with strings, let alone UTF. The correct way to deal with a C function depends on the C function, and that requires that you understand enough about what it's doing to know whether you're really dealing with a string or just bytes.

Fortunately, most of the time - in *nix-land anyway - when char* is treated as string data, it's either ASCII or UTF-8. However, in Windows, it's not, and the situation gets far less pleasant (though if you're dealing with strings in a Windows API, you should almost always be using UTF-16 and avoid that whole issue altogether). In any case, you have to be familiar with what the C function is doing and whether it's operating on string data or not, rather than just blindly seeing char* and thinking that it's a zero-terminated string.

- Jonathan M Davis
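
The two cases can be contrasted in a sketch that uses only the C standard library bindings from druntime: strlen stands in for a C function that expects a zero-terminated string, and memcmp for one that takes a pointer plus a byte count.

```d
import core.stdc.string : memcmp, strlen;
import std.string : fromStringz, toStringz;

void main()
{
    string s = "h\u00E9y";

    // String-style C API: D slices are not zero-terminated,
    // so toStringz supplies a terminated copy when needed.
    const(char)* cstr = s.toStringz;

    // strlen counts bytes up to the terminator, not characters.
    assert(strlen(cstr) == 4);

    // Buffer-style C API: pointer and length, no terminator involved.
    assert(memcmp(s.ptr, cstr, s.length) == 0);

    // Going back to D: fromStringz slices up to the terminator.
    assert(fromStringz(cstr) == s);
}
```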
Sep 02 2015