digitalmars.D - BOMs and std.stream

"Carlos Santander B." <csantander619 gmail.com> writes:
Currently std.stream doesn't recognize BOMs (byte order marks). It might not be a big deal, but there are times when it matters.
I saved an XML file from .NET with UTF-8 encoding, so it wrote a BOM at the start of the file. When I tried to read it using Miguel Ferreira Simões' XML library, it complained that the file was not well-formed. Further testing showed that removing the BOM solved the problem, so at the moment it is a real problem.
I think std.stream should change somehow, but I just don't know how.
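
To make the failure concrete, here is a small sketch (in Python rather than D, purely for illustration) of why a parser that expects the document to begin with `<` can trip over a UTF-8 BOM. The sample XML is made up, and the exact way the XML library mentioned above fails is an assumption; the sketch only shows that a generic UTF-8 decode leaves the BOM in the text as U+FEFF.

```python
import codecs

# Hypothetical file contents: UTF-8 signature followed by a tiny XML document.
data = codecs.BOM_UTF8 + b'<?xml version="1.0"?><root/>'

# A generic UTF-8 decode keeps the BOM as the character U+FEFF.
plain = data.decode("utf-8")

# Python's signature-aware codec drops a leading BOM if one is present.
stripped = data.decode("utf-8-sig")

print(repr(plain[0]))                # the first character is the BOM, not '<'
print(plain.startswith("<?xml"))     # a naive well-formedness check fails here
print(stripped.startswith("<?xml"))  # and passes once the BOM is removed
```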

-----------------------
Carlos Santander Bernal 
Nov 20 2004
J C Calvarese <jcc7 cox.net> writes:
I think there is a need for something like this in std.stream. I ran into this challenge a while back, and I didn't really think of a good solution at the time. But I just came up with an idea for a fix (it's not complicated, but I think it'd work).

We could add a function called something like getBOM. If a BOM is present, it will return a string with the BOM and move the current location past the BOM. If there isn't a BOM, an empty string is returned and the current location doesn't change.

That's just one idea for a design. A similar idea is that an enum could be returned instead of a string.

--
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
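
The getBOM idea above might look roughly like the following. This is a minimal sketch in Python, not a std.stream implementation: the names `BOM` and `get_bom` are hypothetical, and it assumes a seekable binary stream. It covers both variants of the proposal, since the function returns an enum member whose `value` holds the BOM bytes (empty when there is no BOM).

```python
import io
from enum import Enum

class BOM(Enum):
    """Hypothetical enum of recognized signatures; value is the raw BOM."""
    NONE = b""
    UTF8 = b"\xef\xbb\xbf"
    UTF16_LE = b"\xff\xfe"
    UTF16_BE = b"\xfe\xff"
    UTF32_LE = b"\xff\xfe\x00\x00"
    UTF32_BE = b"\x00\x00\xfe\xff"

# Check longer signatures first so UTF-32 LE is not mistaken for UTF-16 LE.
_CANDIDATES = sorted(
    (b for b in BOM if b.value), key=lambda b: len(b.value), reverse=True
)

def get_bom(stream):
    """Return the BOM at the current position and skip past it.

    If no BOM is found, return BOM.NONE and leave the position unchanged.
    """
    start = stream.tell()
    head = stream.read(4)
    for bom in _CANDIDATES:
        if head.startswith(bom.value):
            stream.seek(start + len(bom.value))
            return bom
    stream.seek(start)
    return BOM.NONE

# Usage: the position moves only when a BOM was actually present.
with_bom = io.BytesIO(b"\xef\xbb\xbf<a/>")
print(get_bom(with_bom), with_bom.tell())

without_bom = io.BytesIO(b"<a/>")
print(get_bom(without_bom), without_bom.tell())
```

One caveat with any such function: the bytes FF FE 00 00 are ambiguous between a UTF-32 LE signature and a UTF-16 LE signature followed by U+0000. The sketch resolves this in favor of UTF-32, which is a design choice rather than a rule.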
Nov 20 2004
Ben Hinkle <Ben_member pathlink.com> writes:
In article <cnp7hv$2tls$1 digitaldaemon.com>, J C Calvarese says...
> [...] A similar idea is that an enum could be returned instead of a string.
I like the enum idea. It would be nice if the stream remembered the BOM in the UTF-16 case so that the code that reads strings can swap byte orders if needed. Otherwise the user is hosed if the stream is in the wrong byte-ordering.

I sense another std.stream project in the next few days...

-Ben
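
As a rough illustration of a stream that remembers the BOM and swaps byte orders when reading, here is a minimal Python sketch. The class `Utf16Reader` and its methods are hypothetical and are not part of std.stream. It assumes a seekable binary stream and, when no BOM is present, falls back to the machine's native byte order, which is itself an assumption.

```python
import io
import sys

class Utf16Reader:
    """Hypothetical wrapper that remembers the byte order given by the BOM."""

    def __init__(self, stream):
        self.stream = stream
        start = stream.tell()
        head = stream.read(2)
        if head == b"\xff\xfe":
            self.byteorder = "little"
        elif head == b"\xfe\xff":
            self.byteorder = "big"
        else:
            # No BOM: assume native order and rewind to where we started.
            self.byteorder = sys.byteorder
            stream.seek(start)

    def _read_raw(self, count):
        # Read up to `count` 16-bit code units, dropping a trailing odd byte.
        raw = self.stream.read(2 * count)
        return raw[: len(raw) - len(raw) % 2]

    def read_units(self, count):
        """Return code units as integers, byte-swapped if necessary."""
        raw = self._read_raw(count)
        return [
            int.from_bytes(raw[i : i + 2], self.byteorder)
            for i in range(0, len(raw), 2)
        ]

    def read_string(self, count):
        """Decode `count` code units using the remembered byte order.

        A surrogate pair split across two calls would fail to decode;
        a fuller implementation would buffer the dangling unit.
        """
        codec = "utf-16-le" if self.byteorder == "little" else "utf-16-be"
        return self._read_raw(count).decode(codec)

# Usage: the same text in either byte order reads back identically.
big = Utf16Reader(io.BytesIO(b"\xfe\xff\x00h\x00i"))
little = Utf16Reader(io.BytesIO(b"\xff\xfeh\x00i\x00"))
print(big.read_string(2), little.read_string(2))
```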
Nov 21 2004
"Kris" <fu bar.com> writes:
The ICU project provides this kind of thing: (from the documentation)

        static final char[] detectSignature (void[] input)

                Detects Unicode signature byte sequences at the start
                of the byte stream and returns the charset name of the
                indicated Unicode charset. A null is returned where no
                Unicode signature is recognized.

                A caller can create a UConverter using the charset name.
                The first code unit (wchar) from the start of the stream
                will be U+FEFF (the Unicode BOM/signature character)
                and can usually be ignored.
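
The behaviour described in that documentation excerpt can be approximated as follows. This is a Python sketch and not the ICU API: the function name `detect_signature` only mirrors the quoted one, and the returned names are Python codec names, which may differ from the charset names ICU reports. It also shows the point about the first character: after decoding with an explicit-endian codec, the BOM is still there as U+FEFF and can be skipped.

```python
import codecs

# Longer signatures come first so UTF-32 LE is not read as UTF-16 LE.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_signature(data):
    """Return a codec name for the signature at the start of `data`.

    Returns None when no Unicode signature is recognized.
    """
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    return None

# Usage: decode with the detected codec, then ignore the leading U+FEFF.
data = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
name = detect_signature(data)
text = data.decode(name)
if text.startswith("\ufeff"):
    text = text[1:]
print(name, text)
```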

You might take a look at the breadth of that project; you'll find it covers
pretty much anything you'll need for regular Unicode processing, and then
some ...

http://www.dsource.org/forums/viewtopic.php?t=420
Nov 21 2004
Stewart Gordon <smjg_1998 yahoo.com> writes:
Carlos Santander B. wrote:
> Currently std.stream doesn't recognize BOMs, and while it might not be a big
> thing, there're times where it could be important.
> I saved an XML file with .NET with encoding UTF-8, so it set a BOM. When I tried
> to read it using Miguel Ferreira Simões' XML library, it complained about the
> file not being well-formed. Further testing made me discover that removing the
> BOM solved the problem. So it's a problem, ATM.

The problem is that std.stream seems to be designed to work with binary files, with a few text capabilities thrown in, but not to this level.

> I think std.stream should change somehow, but I just don't know how.

My thought is to develop a new set of classes for working with text files. I posted something on this a while back:
http://www.digitalmars.com/drn-bin?wwwnews?digitalmars.D/6089

Stewart.
Nov 22 2004