digitalmars.D.learn - Why ElementType!(char[3]) == dchar instead of char?
- drug (1/1) Sep 01 2015 http://dpaste.dzfl.pl/4535c5c03126
- drug (2/3) Sep 01 2015 Should I use ForeachType!(char[3]) instead of ElementType?
- Justin Whear (2/6) Sep 01 2015 Try std.range.ElementEncodingType
- Justin Whear (7/8) Sep 01 2015 Arrays of char are assumed to be UTF-8 encoded text and a single char is...
- Justin Whear (7/17) Sep 01 2015 I should correct this:
- drug (4/21) Sep 01 2015 I'm just trying to automatically convert D types to hdf5 types so I
- H. S. Teoh via Digitalmars-d-learn (8/11) Sep 01 2015 In D, char[]/wchar[]/dchar[] are intended to be UTF. If you're dealing
- Justin Whear (8/11) Sep 01 2015 Because of D's autodecoding it can be problematic to assume UTF-8 if
- drug (2/2) Sep 01 2015 My case is I don't know what type user will be using, because I write a
- Jonathan M Davis via Digitalmars-d-learn (11/13) Sep 01 2015 char[] should never be anything other than UTF-8. Similarly, wchar[] is
- drug (5/18) Sep 01 2015 I see, thanks. So I should always treat char[] as UTF in D itself, but
- FreeSlave (10/42) Sep 02 2015 You should just keep in mind that strings returned by Phobos are
- drug (5/18) Sep 02 2015 Well, I think it's not simple question. The C library I used is hdf5 lib...
- Jonathan M Davis via Digitalmars-d-learn (15/35) Sep 02 2015 Yeah. char in C is often used for what D uses ubyte, so just because C u...
On 01.09.2015 19:18, drug wrote:
> http://dpaste.dzfl.pl/4535c5c03126

Should I use ForeachType!(char[3]) instead of ElementType?
Sep 01 2015
On Tue, 01 Sep 2015 19:21:44 +0300, drug wrote:
> On 01.09.2015 19:18, drug wrote:
>> http://dpaste.dzfl.pl/4535c5c03126
>
> Should I use ForeachType!(char[3]) instead of ElementType?

Try std.range.ElementEncodingType
Sep 01 2015
On Tue, 01 Sep 2015 19:18:42 +0300, drug wrote:
> http://dpaste.dzfl.pl/4535c5c03126

Arrays of char are assumed to be UTF-8 encoded text and a single char is not necessarily sufficient to represent a character. ElementType identifies the type that you will receive when (for instance) foreaching over the array and D autodecodes the UTF-8 for you. If you'd like to represent raw bytes use byte[3] or ubyte[3]. If you'd like other encodings, check out std.encoding.
Sep 01 2015
On Tue, 01 Sep 2015 16:25:53 +0000, Justin Whear wrote:
> On Tue, 01 Sep 2015 19:18:42 +0300, drug wrote:
>> http://dpaste.dzfl.pl/4535c5c03126
>
> Arrays of char are assumed to be UTF-8 encoded text and a single char is not necessarily sufficient to represent a character. ElementType identifies the type that you will receive when (for instance) foreaching over the array and D autodecodes the UTF-8 for you. If you'd like to represent raw bytes use byte[3] or ubyte[3]. If you'd like other encodings, check out std.encoding.

I should correct this:
* ForeachType is the element type that will be inferred by a foreach loop
* ElementType is usually the same as ForeachType but is the type of the value returned by .front

One major distinction is that ElementType is only for ranges, while ForeachType will work for iterable non-ranges.
Sep 01 2015
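The distinction drawn in the correction above can be checked at compile time. This is a minimal sketch, assuming a Phobos version in which autodecoding of narrow strings is still in effect; it uses char[] rather than char[3] to keep the checks as simple as possible.

```d
import std.range.primitives : ElementType, ElementEncodingType;
import std.traits : ForeachType;

// Type returned by .front: narrow strings are autodecoded to dchar.
static assert(is(ElementType!(char[]) == dchar));

// Type inferred by a plain foreach: the UTF-8 code unit itself.
static assert(is(ForeachType!(char[]) == char));

// Code-unit type of a string-like range, ignoring autodecoding.
static assert(is(ElementEncodingType!(char[]) == char));

// Arrays of non-character types are unaffected by autodecoding.
static assert(is(ElementType!(ubyte[]) == ubyte));

void main() {}
```

For a library that maps D element types to storage types, ElementEncodingType (or ForeachType) is likely the one that reflects what is actually stored in memory.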
On 01.09.2015 19:32, Justin Whear wrote:
> On Tue, 01 Sep 2015 16:25:53 +0000, Justin Whear wrote:
>> On Tue, 01 Sep 2015 19:18:42 +0300, drug wrote:
>>> http://dpaste.dzfl.pl/4535c5c03126
>>
>> Arrays of char are assumed to be UTF-8 encoded text and a single char is not necessarily sufficient to represent a character. ElementType identifies the type that you will receive when (for instance) foreaching over the array and D autodecodes the UTF-8 for you. If you'd like to represent raw bytes use byte[3] or ubyte[3]. If you'd like other encodings, check out std.encoding.
>
> I should correct this:
> * ForeachType is the element type that will be inferred by a foreach loop
> * ElementType is usually the same as ForeachType but is the type of the value returned by .front
>
> One major distinction is that ElementType is only for ranges, while ForeachType will work for iterable non-ranges.

I'm just trying to automatically convert D types to hdf5 types, so I guess char[..] isn't necessarily some form of UTF-8 encoded text. Or should I treat it as such?
Sep 01 2015
On Tue, Sep 01, 2015 at 07:40:24PM +0300, drug via Digitalmars-d-learn wrote:
[...]
> I'm just trying to automatically convert D types to hdf5 types, so I guess char[..] isn't necessarily some form of UTF-8 encoded text. Or should I treat it as such?

In D, char[]/wchar[]/dchar[] are intended to be UTF. If you're dealing with strings encoded with other character sets, you should use ubyte[] (or ushort[], etc.) instead.

T

--
EMACS = Extremely Massive And Cumbersome System
Sep 01 2015
On Tue, 01 Sep 2015 19:40:24 +0300, drug wrote:
> I'm just trying to automatically convert D types to hdf5 types, so I guess char[..] isn't necessarily some form of UTF-8 encoded text. Or should I treat it as such?

Because of D's autodecoding it can be problematic to assume UTF-8 if other encodings are actually in use. If, for instance, you try printing a string stored as char[] that is actually Latin-1 encoded and contains characters from the high range, you'll get a runtime UTF-8 decoding exception. If you don't know ahead of time what the encoding will be, using ubyte[] will be safer. The other option is to dynamically reencode strings to UTF-8 as you read them.
Sep 01 2015
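A minimal sketch of the two options described above: hold non-UTF data as ubyte[], and re-encode it to UTF-8 before treating it as a D string. The helper name latin1ToUtf8 is hypothetical and not part of Phobos. It relies on the fact that every Latin-1 byte value equals the Unicode code point of the same character; std.encoding offers more general transcoding.

```d
import std.exception : assertThrown;
import std.utf : validate, UTFException;

// Hypothetical helper: re-encode Latin-1 bytes as a UTF-8 string.
string latin1ToUtf8(const(ubyte)[] bytes)
{
    string result;
    foreach (b; bytes)
        result ~= cast(dchar) b; // appending a dchar encodes it as UTF-8
    return result;
}

void main()
{
    // "cafe" with an e-acute, encoded as Latin-1.
    // A lone 0xE9 byte is not valid UTF-8.
    immutable ubyte[] raw = [0x63, 0x61, 0x66, 0xE9];

    // Reinterpreting the bytes as char[] yields invalid UTF-8.
    assertThrown!UTFException(validate(cast(const(char)[]) raw));

    // After re-encoding, the data is a well-formed D string.
    assert(latin1ToUtf8(raw) == "caf\u00E9");
}
```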
In my case I don't know what type the user will be using, because I'm writing a library. What's the best way to process char[..] in this case?
Sep 01 2015
On Tuesday, September 01, 2015 20:05:18 drug via Digitalmars-d-learn wrote:
> In my case I don't know what type the user will be using, because I'm writing a library. What's the best way to process char[..] in this case?

char[] should never be anything other than UTF-8. Similarly, wchar[] is UTF-16, and dchar[] is UTF-32. So, if you're getting something other than UTF-8, it should not be char[]. It should be something more like ubyte[]. If you want to operate on it as char[], you should convert it to UTF-8. std.encoding may or may not help with that. But pretty much everything in D - certainly in the standard library - assumes that char, wchar, and dchar are UTF-encoded, and the language spec basically defines them that way. Technically, you _can_ put other encodings in them, but it's just asking for trouble.

- Jonathan M Davis
Sep 01 2015
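To illustrate the point that char, wchar, and dchar hold UTF-8, UTF-16, and UTF-32 code units respectively, here is a small sketch. The chosen code point (U+1F600) is just an example of a character outside the Basic Multilingual Plane.

```d
import std.range : walkLength;

void main()
{
    // The same single code point in the three string types.
    string  s8  = "\U0001F600";
    wstring s16 = "\U0001F600"w;
    dstring s32 = "\U0001F600"d;

    // .length counts code units, which differ per encoding.
    assert(s8.length == 4);  // four UTF-8 code units
    assert(s16.length == 2); // a UTF-16 surrogate pair
    assert(s32.length == 1); // one UTF-32 code unit

    // Range iteration autodecodes, so each holds one code point.
    assert(s8.walkLength == 1);
    assert(s16.walkLength == 1);
    assert(s32.walkLength == 1);
}
```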
02.09.2015 00:08, Jonathan M Davis via Digitalmars-d-learn wrote:
> On Tuesday, September 01, 2015 20:05:18 drug via Digitalmars-d-learn wrote:
>> In my case I don't know what type the user will be using, because I'm writing a library. What's the best way to process char[..] in this case?
>
> char[] should never be anything other than UTF-8. Similarly, wchar[] is UTF-16, and dchar[] is UTF-32. So, if you're getting something other than UTF-8, it should not be char[]. It should be something more like ubyte[]. If you want to operate on it as char[], you should convert it to UTF-8. std.encoding may or may not help with that. But pretty much everything in D - certainly in the standard library - assumes that char, wchar, and dchar are UTF-encoded, and the language spec basically defines them that way. Technically, you _can_ put other encodings in them, but it's just asking for trouble.
>
> - Jonathan M Davis

I see, thanks. So I should always treat char[] as UTF in D itself, but because I need to pass char[], wchar[] or dchar[] to a C library, I should treat it not as UTF but as a sequence of ubyte, ushort or uint, just to pass it correctly, right?
Sep 01 2015
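One way to view character arrays as sequences of ubyte, ushort, or uint, as asked above, is std.string.representation. The sketch below also uses toStringz and the C strlen function purely as a stand-in for "some C function"; it is not a statement about how the hdf5 API should be called.

```d
import core.stdc.string : strlen;
import std.string : representation, toStringz;

// representation swaps the character type for the matching
// unsigned integer type without copying or decoding.
static assert(is(typeof("abc".representation)  == immutable(ubyte)[]));
static assert(is(typeof("abc"w.representation) == immutable(ushort)[]));
static assert(is(typeof("abc"d.representation) == immutable(uint)[]));

void main()
{
    string s = "h\u00E9"; // two characters, three UTF-8 code units

    // Raw code units, suitable for handing over as plain bytes.
    auto bytes = s.representation;
    assert(bytes.length == 3);
    assert(bytes[0] == 0x68);

    // For C functions expecting a zero-terminated char*,
    // toStringz appends the terminator; C then sees bytes.
    assert(strlen(s.toStringz) == 3);
}
```

Whether the C side should receive raw code units or re-encoded text still depends on what that particular function does with the data.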
On Wednesday, 2 September 2015 at 05:00:42 UTC, drug wrote:
> 02.09.2015 00:08, Jonathan M Davis via Digitalmars-d-learn wrote:
>> On Tuesday, September 01, 2015 20:05:18 drug via Digitalmars-d-learn wrote:
>>> In my case I don't know what type the user will be using, because I'm writing a library. What's the best way to process char[..] in this case?
>>
>> char[] should never be anything other than UTF-8. Similarly, wchar[] is UTF-16, and dchar[] is UTF-32. So, if you're getting something other than UTF-8, it should not be char[]. It should be something more like ubyte[]. If you want to operate on it as char[], you should convert it to UTF-8. std.encoding may or may not help with that. But pretty much everything in D - certainly in the standard library - assumes that char, wchar, and dchar are UTF-encoded, and the language spec basically defines them that way. Technically, you _can_ put other encodings in them, but it's just asking for trouble.
>>
>> - Jonathan M Davis
>
> I see, thanks. So I should always treat char[] as UTF in D itself, but because I need to pass char[], wchar[] or dchar[] to a C library, I should treat it not as UTF but as a sequence of ubyte, ushort or uint, just to pass it correctly, right?

You should just keep in mind that strings returned by Phobos are UTF encoded. Does your C library have UTF support? Is it relevant at all? Maybe it just treats a char array as binary data. But if it does some non-trivial string and character manipulation or talks to the file system, then it surely expects strings in some specific encoding, and if that's not UTF, you should re-encode the data before passing it from D to this library. Also, C does not have wchar and dchar, but has wchar_t, whose size is not fixed and depends on the particular platform.
Sep 02 2015
On 02.09.2015 11:30, FreeSlave wrote:
>> I see, thanks. So I should always treat char[] as UTF in D itself, but because I need to pass char[], wchar[] or dchar[] to a C library, I should treat it not as UTF but as a sequence of ubyte, ushort or uint, just to pass it correctly, right?
>
> You should just keep in mind that strings returned by Phobos are UTF encoded. Does your C library have UTF support? Is it relevant at all? Maybe it just treats a char array as binary data. But if it does some non-trivial string and character manipulation or talks to the file system, then it surely expects strings in some specific encoding, and if that's not UTF, you should re-encode the data before passing it from D to this library. Also, C does not have wchar and dchar, but has wchar_t, whose size is not fixed and depends on the particular platform.

Well, I think it's not a simple question. The C library I use is the hdf5 lib, and in general it stores data without processing it. For each particular case I'll need to evaluate the situation concretely, I guess. Thanks all for the answers.
Sep 02 2015
On Wednesday, September 02, 2015 11:47:11 drug via Digitalmars-d-learn wrote:
> On 02.09.2015 11:30, FreeSlave wrote:
>>> I see, thanks. So I should always treat char[] as UTF in D itself, but because I need to pass char[], wchar[] or dchar[] to a C library, I should treat it not as UTF but as a sequence of ubyte, ushort or uint, just to pass it correctly, right?
>>
>> You should just keep in mind that strings returned by Phobos are UTF encoded. Does your C library have UTF support? Is it relevant at all? Maybe it just treats a char array as binary data. But if it does some non-trivial string and character manipulation or talks to the file system, then it surely expects strings in some specific encoding, and if that's not UTF, you should re-encode the data before passing it from D to this library. Also, C does not have wchar and dchar, but has wchar_t, whose size is not fixed and depends on the particular platform.
>
> Well, I think it's not a simple question. The C library I use is the hdf5 lib, and in general it stores data without processing it. For each particular case I'll need to evaluate the situation concretely, I guess. Thanks all for the answers.

Yeah. char in C is often used for what D uses ubyte, so just because C uses a char doesn't mean that it even has anything to do with strings, let alone UTF. The correct way to deal with a C function depends on the C function, and that requires that you understand enough about what it's doing to know whether you're really dealing with a string or just bytes. Fortunately, most of the time - in *nix-land anyway - when char* is treated as string data, it's either ASCII or UTF-8. However, in Windows, it's not, and the situation gets far less pleasant (though if you're dealing with strings in a Windows API, you should almost always be using UTF-16 and avoid that whole issue altogether). In any case, you have to be familiar with what the C function is doing and whether it's operating on string data or not rather than just blindly seeing char* and thinking that it's a zero-terminated string.

- Jonathan M Davis
Sep 02 2015