digitalmars.D.learn - read files ... continued

"Jan Hanselaer" <jan.hanselaer gmail.com> writes:
Whoops ... sent before I was done writing ... sorry

Hi

I'm writing an application that reads all kinds of text files.
I'm not really familiar with the file types (encodings).
For the moment I read them with a BufferedFile.
I read the lines with readLine():

Stream br = new BufferedFile(fileName);
char[] line = br.readLine();

But that causes a lot of trouble. I managed to figure out how to read a
file's BOM, so I presume it will also be possible to convert the files to an
encoding (UTF-8, for example) that I always use. (I'll check that later.)
But for a lot of files, when I check the BOM I get the result -1 (meaning the
type is not known).
http://www.digitalmars.com/d/phobos/std_stream.html
The only known BOM types are listed there (UTF-8, and UTF-16 or UTF-32 in LE or BE).
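For reference, the BOM check described here can also be sketched by hand on the first bytes of a file. This is only a minimal sketch: the `Bom` enum, `hasPrefix` and `detectBom` are hypothetical names made up for illustration and are not the std.stream API; `Bom.none` plays the role of the -1 result mentioned above.

```d
// Hypothetical helper, not part of Phobos: all names here are illustrative.
enum Bom { none, utf8, utf16le, utf16be, utf32le, utf32be }

immutable ubyte[] utf8Mark    = [0xEF, 0xBB, 0xBF];
immutable ubyte[] utf32leMark = [0xFF, 0xFE, 0x00, 0x00];
immutable ubyte[] utf32beMark = [0x00, 0x00, 0xFE, 0xFF];
immutable ubyte[] utf16leMark = [0xFF, 0xFE];
immutable ubyte[] utf16beMark = [0xFE, 0xFF];

/// True if `data` begins with the bytes in `mark`.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] mark)
{
    return data.length >= mark.length && data[0 .. mark.length] == mark;
}

/// Returns which byte-order mark, if any, starts `data`.
Bom detectBom(const(ubyte)[] data)
{
    if (hasPrefix(data, utf8Mark))    return Bom.utf8;
    // UTF-32 LE is tested before UTF-16 LE because both begin with FF FE.
    // (A UTF-16 LE file whose first character is NUL would be misreported.)
    if (hasPrefix(data, utf32leMark)) return Bom.utf32le;
    if (hasPrefix(data, utf32beMark)) return Bom.utf32be;
    if (hasPrefix(data, utf16leMark)) return Bom.utf16le;
    if (hasPrefix(data, utf16beMark)) return Bom.utf16be;
    return Bom.none;
}

unittest
{
    // A plain "ANSI" file has no BOM at all, so nothing is detected.
    immutable ubyte[] plainAnsi = [0x63, 0x61, 0x66, 0xE9];
    assert(detectBom(plainAnsi) == Bom.none);
}
```

A missing BOM says nothing about the encoding: both plain ASCII and "ANSI" files end up as `Bom.none`.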

For a lot of text files on my system (Windows) the type is ANSI, and there's
no problem reading them with BufferedFile as long as they contain no special
characters.
But if there's an accent or something (for example 'é'), then it's an
invalid UTF sequence. I cannot convert the text because the BOM for these
files is also unknown.
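To illustrate why the accent breaks things: in a Windows-1252 ("ANSI") file, 'é' is stored as the single byte 0xE9, while UTF-8 stores it as the two bytes 0xC3 0xA9. A lone 0xE9 is the start of a three-byte UTF-8 sequence, so a UTF-8 decoder rejects it. The sketch below is written against current Phobos (`std.utf.validate`), not the 2007-era library discussed in this thread, and `isValidUtf8` is a hypothetical helper name.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper: true if `data` is well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] data)
{
    try
    {
        // validate() throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    // "café" saved as Windows-1252: 'é' is the single byte 0xE9.
    immutable ubyte[] ansi = [0x63, 0x61, 0x66, 0xE9];
    // The same word saved as UTF-8: 'é' is the two bytes 0xC3 0xA9.
    immutable ubyte[] utf8 = [0x63, 0x61, 0x66, 0xC3, 0xA9];

    assert(!isValidUtf8(ansi));
    assert(isValidUtf8(utf8));
}
```

Pure ASCII passes this check too, which matches the observation that files without special characters read fine.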

Does anyone have an idea how to catch this sort of file (and convert it)? Or
is there a stream that takes the file type into account by itself? That would
be very handy ...

It's an application I wrote in Java that I'm now trying in D. In Java I used a
BufferedReader on a FileReader and there all goes well. Sometimes files are
not read well, but there are no faults like this invalid UTF sequence in D.

If someone understands my problem out of all this confusing talk (that's
because I'm rather confused myself) ... I'd be glad :p

Thanks!
May 13 2007
Daniel Keep <daniel.keep.lists gmail.com> writes:
Jan Hanselaer wrote:
 For a lot of text files on my system (windows) the type is ANSI, and there's 
 no problem reading them with BufferedFile if there are no special signs in 
 it.
 But if there's an accent or something (for example '�'), than it's an 
 invalid UTF sequence. I cannot convert the text because the BOM for this 
 files is also unknown.
 

Basically, the problem is that if the file is in something other than ASCII, UTF-8, UTF-16 or UTF-32, then there's no way for D to work out what it's supposed to be. There are various methods for autodetecting the codepage of a piece of text, but none of them are foolproof. Hence why there isn't a stream to do this for you; it's a nasty, horrible problem that no one wants to solve... at least, I know I don't. :)

Incidentally, notice that the example accent you provided didn't show up; presumably because your mail reader doesn't know how to use codepages properly. :3

(Checks headers) Outlook Express... why am I not surprised :P

Anyway, if you need to open files that aren't in a usable encoding, there's a few things you can do:

1. Read the text as ASCII, and discard all characters that lie outside of the 7-bit range.

2. Add an option somewhere, or perhaps a tag to the file, to indicate what the code page is.

3. Find and use one of those auto-detection algorithms.

In any case, you'll need a library for converting between codepages. I *think* that either Tango or Mango has one, but I'm not sure.

<shameless-plug>
Also, if you need more clarification on how text in D works, you can give this a read: http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
</shameless-plug>

Hope this has been of at least some help.

	-- Daniel

-- 
int getRandomNumber()
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}

http://xkcd.com/

v2sw5+8Yhw5ln4+5pr6OFPma8u6+7Lw4Tm6+7l6+7D
i28a2Xs3MSr2e4/6+7t4TNSMb6HTOp5en5g6RAHCP  http://hackerkey.com/
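Options 1 and 2 from the list in this post can be sketched without any library, provided the code page is known to be ISO-8859-1 (Latin-1), where every byte value equals its Unicode code point. That is an assumption: Windows "ANSI" files are usually Windows-1252, which differs from Latin-1 in the range 0x80 to 0x9F and would need a lookup table for those bytes. `stripNonAscii` and `latin1ToUtf8` are hypothetical names, not functions from Phobos, Tango or Mango.

```d
/// Option 1: keep only 7-bit ASCII bytes and drop everything else.
string stripNonAscii(const(ubyte)[] data)
{
    char[] result;
    result.reserve(data.length);
    foreach (b; data)
    {
        if (b < 0x80)
            result ~= cast(char) b;
    }
    return result.idup;
}

/// Option 2, assuming the caller has stated that the file is ISO-8859-1:
/// bytes below 0x80 are copied, all others become a two-byte UTF-8 sequence.
string latin1ToUtf8(const(ubyte)[] data)
{
    char[] result;
    result.reserve(data.length);
    foreach (b; data)
    {
        if (b < 0x80)
        {
            result ~= cast(char) b;
        }
        else
        {
            result ~= cast(char)(0xC0 | (b >> 6));   // lead byte: top 2 bits
            result ~= cast(char)(0x80 | (b & 0x3F)); // continuation: low 6 bits
        }
    }
    return result.idup;
}

unittest
{
    // "café" with 'é' stored as the single Latin-1 byte 0xE9.
    immutable ubyte[] ansi = [0x63, 0x61, 0x66, 0xE9];
    assert(stripNonAscii(ansi) == "caf");
    assert(latin1ToUtf8(ansi) == "caf\u00E9");
}
```

Option 1 loses data silently, so it probably only makes sense when the accented characters do not matter to the application.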
May 13 2007
"Jan Hanselaer" <jan.hanselaer gmail.com> writes:
Yes, thanks a lot. At least now I understand it more. But it's a pity there isn't a stream doing all the work. It's going to be very difficult to read the different files in a proper way.
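A small fallback chain can cover the common cases even without such a stream. The following is only a rough sketch under stated assumptions, not a general solution: it handles UTF-8 (with or without BOM) and otherwise guesses ISO-8859-1, it ignores UTF-16 and UTF-32 entirely, and the guess can be wrong for other code pages. `decodeGuess` is a hypothetical name, and the code targets current Phobos (`std.utf.validate`).

```d
import std.utf : UTFException, validate;

/// Hypothetical helper: returns the file contents as UTF-8, guessing the
/// source encoding. Well-formed UTF-8 is kept as is; anything else is
/// treated as ISO-8859-1 (an assumption, see the note above).
string decodeGuess(const(ubyte)[] data)
{
    // Skip a UTF-8 BOM if present.
    if (data.length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        data = data[3 .. $];

    try
    {
        auto text = cast(const(char)[]) data;
        validate(text); // throws UTFException if not well-formed UTF-8
        return text.idup;
    }
    catch (UTFException)
    {
        // Not UTF-8: fall through to the Latin-1 conversion below.
    }

    char[] result;
    result.reserve(data.length);
    foreach (b; data)
    {
        if (b < 0x80)
        {
            result ~= cast(char) b;
        }
        else
        {
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result.idup;
}

unittest
{
    immutable ubyte[] ansi = [0x63, 0x61, 0x66, 0xE9];
    immutable ubyte[] utf8WithBom = [0xEF, 0xBB, 0xBF, 0x63, 0x61, 0x66, 0xC3, 0xA9];
    assert(decodeGuess(ansi) == "caf\u00E9");
    assert(decodeGuess(utf8WithBom) == "caf\u00E9");
}
```

The raw bytes could come from `std.file.read(fileName)`, cast to `const(ubyte)[]`, after which the result can be split into lines as usual.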
May 13 2007
Lars Ivar Igesund <larsivar igesund.net> writes:
Currently Mango has bindings for IBM's ICU library, which may be the most comprehensive solution for this type of text handling.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
May 13 2007
Carlos Santander <csantander619 gmail.com> writes:
Jan Hanselaer wrote:
 
 Yes, thanks a lot. At least now I understand it more. But it's a pity there 
 isn't a stream doing all the work.
 It's going to be very difficult to read the different files in a proper way.
 

Any particular reason why you can't use EndianStream?

-- 
Carlos Santander Bernal
May 13 2007