digitalmars.D.learn - std.stream, BOM, and deprecation

Charles Hixson <charleshixsn earthlink.net> writes:
If std.stream is being deprecated, what is the correct way to deal with 
file BOMs?  This particularly concerns UTF-8 files, which I 
understand to be a bit problematic, as there isn't actually a UTF-8 
BOM, merely a convention which isn't a part of a standard.  But the 
std.stdio documentation doesn't so much as mention byte order marks (BOMs).

If this should wait until std.io is released, then I could use 
std.stream until then, but the documentation is already warning to avoid 
using it.
Oct 13 2012
Jonathan M Davis <jmdavisProg gmx.com> writes:
std.stream will be around until after std.io has been introduced, because std.io will be its replacement.

As for dealing with BOMs, I don't really know anything about that, so I don't really have any suggestions. I know that it's come up before, and you can probably find some discussion on it in the archives, but for the most part, Phobos' I/O assumes UTF-8 or compatible, and if you want something else, you have to deal with it yourself. It's an area where Phobos needs improvement.

You can use std.stream, but just be aware that in the long term, you'll either have to refactor your code so that it uses another solution (presumably std.io) or copy std.stream to your own stuff, because it's going to be removed from Phobos eventually.

- Jonathan M Davis
Oct 13 2012
Ali Çehreli <acehreli yahoo.com> writes:
On 10/13/2012 06:53 PM, Charles Hixson wrote:
> If std.stream is being deprecated, what is the correct way to deal with
> file BOMs. This is particularly concerning utf8 files, which I
> understand to be a bit problematic, as there isn't, actually, a utf8
> BOM,

That's correct. There is just one byte order for UTF-8.

> merely a convention which isn't a part of a standard.

I am not sure about that. The Unicode standard describes UTF-8 as code units following each other in the file. There can't be any confusion about their order. According to Wikipedia, the only use of a BOM for UTF-8 is to identify the file as having been encoded in UTF-8:

http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

But that can't have any meaning. The file could have been encoded in any one of the multitude of code pages as well. Treating the first three bytes as a BOM would be taking a chance in that case and dropping those three characters.

> But the
> std.stdio documentation doesn't so much as mention byte order marks
> (BOMs).
> If this should wait until std.io is released, then I could use
> std.stream until them, but the documentation is already warning to avoid
> using it.

As I understand it, it is all down to convention anyway. What is the meaning of the non-ASCII code 166? Only the generator of the file knows. :/

Ali
Oct 13 2012
Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:
Personally, I think it's kind of cumbersome to deal with in Phobos, so I wrote this wrapper that I use instead, which handles everything:

https://bitbucket.org/Abscissa/semitwistdtools/src/977820d5dcb0/src/semitwist/util/io.d?at=master#cl-24

And then there's the utfConvert below it if you already have the data in memory instead of on disk.

(Maybe I should add some range capability and make a Phobos pull request. I don't know if it'd fly though. It uses a lot of custom endian- and BOM-related code since I found the existing endian/BOM stuff in Phobos inadequate. So that stuff would have to be accepted, and then this too, and it's usually a bit of a pain to get things approved.)
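
[Editor's illustration, not taken from the wrapper linked above.] The kind of BOM sniffing such a wrapper has to do can be sketched in a few lines of D. This is a minimal sketch under stated assumptions: the names `Bom`, `hasPrefix`, and `detectBom` are hypothetical, only the five common Unicode BOMs are covered, and no transcoding is attempted.

```d
import std.stdio : writeln;

// Encodings that a leading BOM can identify (illustrative subset).
enum Bom { none, utf8, utf16le, utf16be, utf32le, utf32be }

// True if `data` begins with the bytes in `prefix`.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

// Hypothetical helper: guesses the encoding from the first bytes of a file.
Bom detectBom(const(ubyte)[] data)
{
    static immutable ubyte[] bomUtf32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] bomUtf32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] bomUtf8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] bomUtf16le = [0xFF, 0xFE];
    static immutable ubyte[] bomUtf16be = [0xFE, 0xFF];

    // UTF-32 LE is tested before UTF-16 LE because its BOM starts with FF FE.
    // (FF FE 00 00 could also be UTF-16 LE text starting with U+0000; this
    // sketch simply prefers the longer match.)
    if (data.hasPrefix(bomUtf32le)) return Bom.utf32le;
    if (data.hasPrefix(bomUtf32be)) return Bom.utf32be;
    if (data.hasPrefix(bomUtf8))    return Bom.utf8;
    if (data.hasPrefix(bomUtf16le)) return Bom.utf16le;
    if (data.hasPrefix(bomUtf16be)) return Bom.utf16be;
    return Bom.none;
}

void main()
{
    // EF BB BF followed by the ASCII bytes for "hi".
    static immutable ubyte[] sample = [0xEF, 0xBB, 0xBF, 0x68, 0x69];
    assert(detectBom(sample) == Bom.utf8);
    assert(detectBom(sample[3 .. $]) == Bom.none);
    writeln(detectBom(sample));
}
```

As noted elsewhere in the thread, a match is only a hint: a file in some legacy code page could begin with the same bytes by coincidence.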
Oct 14 2012
Charles Hixson <charleshixsn earthlink.net> writes:
That wrapper looks very nice, but it's a lot more than what I need. I want to deal only with UTF-8 files, many of which have BOMs. I *can* handle that by detecting the BOM and dropping it. I don't need anything else.

I was merely wondering what the appropriate way to approach this is now that std.stream is documented as deprecated but no replacement is specified. It sounds like the appropriate response is to use std.stdio and handle the BOM myself.
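
[Editor's illustration.] For the UTF-8-only case described here, detecting and dropping the BOM by hand is short. The following is a minimal sketch, assuming the whole file fits in memory and is read with `std.file.readText`; the helper name `stripUtf8Bom` is hypothetical and not part of Phobos.

```d
import std.algorithm : startsWith;
import std.file : readText;
import std.stdio : writeln;

// U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
enum utf8Bom = "\uFEFF";

// Hypothetical helper: returns `text` without a leading UTF-8 BOM, if any.
string stripUtf8Bom(string text)
{
    return text.startsWith(utf8Bom) ? text[utf8Bom.length .. $] : text;
}

void main(string[] args)
{
    static assert(utf8Bom.length == 3);
    assert(stripUtf8Bom("\uFEFFhello") == "hello");
    assert(stripUtf8Bom("hello") == "hello");

    // readText loads each file named on the command line and validates
    // it as UTF-8 before the BOM, if present, is dropped.
    foreach (path; args[1 .. $])
        writeln(stripUtf8Bom(readText(path)));
}
```

When reading line by line with std.stdio instead, the same check would only need to be applied to the first line of the file.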
Oct 16 2012
"Steven Schveighoffer" <schveiguy yahoo.com> writes:
When std.io is released, it will be fully BOM-aware by default (as long as you use the purely D versions). The plan from my point of view is for std.io to be a replacement backend for std.stdio, with the C version being the default (as it must be for compatibility purposes).

-Steve
Oct 15 2012
Charles Hixson <charleshixsn earthlink.net> writes:
That sounds good. All of the files I'm interested in should have been converted to UTF-8 (if they weren't already), but many of them have the UTF-8 BOM (so they won't be confused with other non-Unicode files). It sounds like std.io will handle this in a transparent fashion.
Oct 16 2012