digitalmars.D.learn - Unicode BOM and endianness

reply Tim Locke <root vic-20.net> writes:
How do I acquire and determine the BOM and endianness of a file I am
reading?

Thanks
Aug 03 2006
parent reply Derek Parnell <derek nomail.afraid.org> writes:
On Fri, 04 Aug 2006 00:36:21 -0300, Tim Locke wrote:

 How do I acquire and determine the BOM and endianness of a file I am
 reading?
 
 Thanks

You might check out http://en.wikipedia.org/wiki/Byte_Order_Mark

-- 
Derek (skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
4/08/2006 2:14:46 PM
Aug 03 2006
next sibling parent reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
Derek Parnell wrote:
 On Fri, 04 Aug 2006 00:36:21 -0300, Tim Locke wrote:
 
 
How do I acquire and determine the BOM and endianness of a file I am
reading?

Thanks

You might check out http://en.wikipedia.org/wiki/Byte_Order_Mark

Are GNU tools really as ignorant of Unicode as that page implies?

[quote]
While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may be used to mark text as UTF-8. Quite a lot of Windows software (including Windows Notepad) adds one to UTF-8 files. However in Unix-like systems (which make heavy use of text files for configuration) this practice is not recommended, as it will interfere with correct processing of important codes such as the hash-bang at the start of an interpreted script. It may also interfere with source for programming languages that don't recognise it. For example, gcc reports stray characters at the beginning of a source file, and in PHP, if output buffering is disabled, it has the subtle effect of causing the page to start being sent to the browser, preventing custom headers from being specified by the PHP script
[/quote]
Aug 03 2006
parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:
Hasan Aljudy wrote on 2006-08-04:
 Derek Parnell wrote:
 On Fri, 04 Aug 2006 00:36:21 -0300, Tim Locke wrote:
 
 
How do I acquire and determine the BOM and endianness of a file I am
reading?

Thanks

You might check out http://en.wikipedia.org/wiki/Byte_Order_Mark

Are GNU tools really as ignorant of Unicode as that page implies? [quote] While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may be used to mark text as UTF-8. Quite a lot of Windows software (including Windows Notepad) adds one to UTF-8 files. However in Unix-like systems (which make heavy use of text files for configuration) this practice is not recommended, as it will interfere with correct processing of important codes such as the hash-bang at the start of an interpreted script.

Take two UTF-8 files, A and B, each starting with a BOM:

    cat A B > C

A's BOM remains a BOM at the start of C, but B's BOM now sits in the middle of C, where it is interpreted as a "zero-width no-break space". Thus using BOMs in combination with streaming, concatenation, etc. will always cause problems.

In contrast to Windows, Linux - home to the GNU tools - treats both "text" and "binary" files as "binary" files.

Thomas
Aug 04 2006
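
The concatenation problem described above is easy to reproduce. What follows is a minimal sketch in Python rather than D, chosen only because the behaviour is language-independent; the two payloads are made-up stand-ins for the files A and B, and the byte-wise join stands in for `cat A B > C`.

```python
import codecs

# Two hypothetical UTF-8 payloads, each prefixed with a BOM (EF BB BF).
file_a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
file_b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Byte-wise concatenation, which is what `cat A B > C` does.
file_c = file_a + file_b

# The "utf-8-sig" codec strips a BOM only at the very start of the data.
text = file_c.decode("utf-8-sig")

# B's BOM survives as the character U+FEFF in the middle of the text.
stray_index = text.find("\ufeff")
```

Here `stray_index` is non-negative, showing that a decoder which handles the leading BOM correctly still passes the second one through as ordinary text.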
prev sibling parent reply Tim Locke <root vic-20.net> writes:
On Fri, 4 Aug 2006 14:15:00 +1000, Derek Parnell
<derek nomail.afraid.org> wrote:

On Fri, 04 Aug 2006 00:36:21 -0300, Tim Locke wrote:

 How do I acquire and determine the BOM and endianness of a file I am
 reading?
 
 Thanks

You might check out http://en.wikipedia.org/wiki/Byte_Order_Mark

I'm sorry, but I wasn't clear about what I am looking for. I want to open a file and have D automatically tell me which format it is (e.g. UTF-8, UTF-16LE, UTF-16BE) without my having to code it. Ideally I would like to read any Unicode or ASCII file, have D automatically detect its type, and read it into whatever format I want, such as char, wchar, or dchar.
Aug 04 2006
parent Derek <derek psyc.ward> writes:
On Fri, 04 Aug 2006 08:44:21 -0300, Tim Locke wrote:

 On Fri, 4 Aug 2006 14:15:00 +1000, Derek Parnell
 <derek nomail.afraid.org> wrote:
 
On Fri, 04 Aug 2006 00:36:21 -0300, Tim Locke wrote:

 How do I acquire and determine the BOM and endianness of a file I am
 reading?
 
 Thanks

You might check out http://en.wikipedia.org/wiki/Byte_Order_Mark

I'm sorry, but I wasn't clear about what I am looking for. I want to open a file and have D automatically tell me which format it is (e.g. UTF-8, UTF-16LE, UTF-16BE) without my having to code it. Ideally I would like to read any Unicode or ASCII file, have D automatically detect its type, and read it into whatever format I want, such as char, wchar, or dchar.

The Phobos library supplied by Walter does not have this functionality. The Mango library, and maybe others, do. I had to code this myself when I needed it.

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Aug 04 2006
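
For anyone who ends up coding the detection by hand, the logic is short. The sketch below is in Python rather than D and is not the code mentioned in the thread, nor the Mango implementation; the names `detect_bom` and `decode_with_bom` are hypothetical. It assumes a file without a BOM is UTF-8 (which also covers plain ASCII), and it checks the UTF-32 BOMs first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM. One known ambiguity remains: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE.

```python
import codecs

# Order matters: the UTF-32LE BOM (FF FE 00 00) starts with the
# UTF-16LE BOM (FF FE), so the longer signatures are tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data, default="utf-8"):
    """Return (encoding, bom_length) for the BOM at the start of data.

    If no BOM is present, return (default, 0); the default is an
    assumption, not something the bytes can prove.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return default, 0


def decode_with_bom(data, default="utf-8"):
    """Decode data to text, skipping the BOM if one is present."""
    encoding, skip = detect_bom(data, default)
    return data[skip:].decode(encoding)
```

A caller would read the file in binary mode and pass the bytes to `decode_with_bom`; the encoding name returned by `detect_bom` also gives the endianness asked about at the start of the thread.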