digitalmars.D.learn - read files ... continued

"Jan Hanselaer" <jan.hanselaer gmail.com> writes:

Whoops ... sent before I was done writing ... sorry

Hi

I'm writing an application that reads all kinds of text files.
I'm not really familiar with the file types.
For the moment I read them with a BufferedFile.
I read the lines with readLine():

import std.stream; // provides Stream and BufferedFile

Stream br = new BufferedFile(fileName);
char[] line = br.readLine();

But that causes a lot of trouble. I managed to figure out how to read a
file's BOM, so I presume it will also be possible to convert the files to one
encoding (UTF-8, for example) that I always use. (I'll check that later.)
But for a lot of files, when I check the BOM I get the result -1 (meaning the
type is not known).
http://www.digitalmars.com/d/phobos/std_stream.html
The only known BOM types are listed there (UTF-8, UTF-16 and UTF-32, LE or BE).

For a lot of text files on my system (Windows) the type is ANSI, and there's
no problem reading them with BufferedFile as long as they contain no special
characters.
But if there's an accented letter or something similar (for example '�'),
then it's an invalid UTF sequence. I cannot convert the text because the BOM
for these files is also unknown.

Does anyone have an idea of how to catch these sorts of files (and convert
them)? Or is there a stream that takes the file type into account by itself?
That would be very handy ...

It's an application I wrote in Java that I'm now trying in D. In Java I used a
BufferedReader on a FileReader and there all goes well. Sometimes files are
not read correctly, but there are no errors like this invalid UTF sequence in D.

If someone understands my problem out of all this confusing talk (that's
because I'm rather confused myself) ... I'd be glad :p

Thanks!

May 13 2007

Daniel Keep <daniel.keep.lists gmail.com> writes:

Basically, the problem is that if the file is in something other than
ASCII, UTF-8, UTF-16 or UTF-32, then there's no way for D to work out
what it's supposed to be.
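
For files that do carry a byte order mark, the check can be sketched by hand.
This is a minimal sketch in current D (D2), not the std.stream code discussed
in this thread; the names Bom, hasPrefix and detectBom are made up for
illustration. The byte sequences are the standard Unicode BOMs. A result of
Bom.none plays the same role as the -1 in the original question: the bytes
alone do not say what the encoding is.

```d
/// Encodings that can be recognised from a byte order mark (hypothetical enum).
enum Bom { none, utf8, utf16le, utf16be, utf32le, utf32be }

// Standard Unicode byte order marks.
immutable ubyte[] bomUtf8    = [0xEF, 0xBB, 0xBF];
immutable ubyte[] bomUtf16le = [0xFF, 0xFE];
immutable ubyte[] bomUtf16be = [0xFE, 0xFF];
immutable ubyte[] bomUtf32le = [0xFF, 0xFE, 0x00, 0x00];
immutable ubyte[] bomUtf32be = [0x00, 0x00, 0xFE, 0xFF];

/// True if `data` begins with the bytes in `prefix`.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Report which BOM, if any, starts `data`.
Bom detectBom(const(ubyte)[] data)
{
    // UTF-32 LE is tested before UTF-16 LE because both begin with FF FE.
    if (hasPrefix(data, bomUtf32le)) return Bom.utf32le;
    if (hasPrefix(data, bomUtf32be)) return Bom.utf32be;
    if (hasPrefix(data, bomUtf8))    return Bom.utf8;
    if (hasPrefix(data, bomUtf16le)) return Bom.utf16le;
    if (hasPrefix(data, bomUtf16be)) return Bom.utf16be;
    return Bom.none; // plain ASCII, "ANSI" and BOM-less UTF-8 all end up here
}

unittest
{
    assert(detectBom([0xEF, 0xBB, 0xBF, 0x48, 0x69]) == Bom.utf8);
    assert(detectBom([0xFF, 0xFE, 0x48, 0x00]) == Bom.utf16le);
    assert(detectBom([0x48, 0x69]) == Bom.none);
}
```

An "ANSI" file has no BOM, so it lands in the Bom.none branch together with
plain ASCII, which is exactly why the BOM check cannot identify it.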

There are various methods for autodetecting the codepage of a piece of
text, but none of them are foolproof.  Hence why there isn't a stream to
do this for you; it's a nasty, horrible problem that no one wants to
solve... at least, I know I don't. :)

Incidentally, notice that the example accent you provided didn't show
up; presumably because your mail reader doesn't know how to use
codepages properly. :3

(Checks headers) Outlook Express—why am I not surprised :P

Anyway, if you need to open files that aren't in a usable encoding,
there's a few things you can do:

1. Read the text as ASCII, and discard all characters that lie outside
of the 7-bit range.
2. Add an option somewhere, or perhaps a tag to the file, to indicate
what the code page is.
3. Find and use one of those auto-detection algorithms.
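
Option 1 is the easiest of the three to sketch. The following is a minimal
version in current D (D2), with stripNonAscii as a made-up name. It drops
every byte above 0x7F instead of converting it, so accented characters are
lost; that is the trade-off the option implies.

```d
/// Keep only the bytes in the 7-bit ASCII range (0x00 to 0x7F).
/// Anything else is silently discarded (hypothetical helper).
string stripNonAscii(const(ubyte)[] raw)
{
    char[] result;
    result.reserve(raw.length);
    foreach (b; raw)
    {
        if (b < 0x80)
            result ~= cast(char) b; // a 7-bit byte is also valid UTF-8
    }
    return result.idup;
}

unittest
{
    // 0xE9 is an accented 'e' in Latin-1 and Windows-1252; it is dropped here.
    const(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];
    assert(stripNonAscii(raw) == "caf");
}
```

The result is always valid UTF-8, so the "invalid UTF sequence" error can no
longer occur, at the cost of losing text.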

In any case, you'll need a library for converting between codepages.  I
*think* that either Tango or Mango has one, but I'm not sure.
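
For Western European "ANSI" files, a conversion can be sketched without an
external library, under a loud assumption: the bytes are treated as
ISO-8859-1 (Latin-1), where every byte value equals its Unicode code point.
Windows "ANSI" is usually Windows-1252, which differs from Latin-1 in the
range 0x80 to 0x9F, so this is an approximation and not a general code page
converter. The function names are hypothetical; std.utf.validate is the
Phobos (D2) routine that throws UTFException on an invalid sequence.

```d
import std.utf : validate, UTFException;

/// Decode bytes as ISO-8859-1: each byte maps to the code point of the same value.
/// ASSUMPTION: the input really is Latin-1; this is not checked.
string latin1ToUtf8(const(ubyte)[] raw)
{
    char[] result;
    result.reserve(raw.length);
    foreach (b; raw)
        result ~= cast(dchar) b; // appending a dchar to char[] encodes it as UTF-8
    return result.idup;
}

/// Return `raw` unchanged if it is already valid UTF-8,
/// otherwise fall back to the Latin-1 assumption above.
string toUtf8OrLatin1(const(ubyte)[] raw)
{
    auto text = cast(const(char)[]) raw;
    try
    {
        validate(text); // throws UTFException on an invalid UTF-8 sequence
        return text.idup;
    }
    catch (UTFException)
    {
        return latin1ToUtf8(raw);
    }
}

unittest
{
    // "caf" followed by a lone 0xE9 is not valid UTF-8, so the fallback is used.
    assert(toUtf8OrLatin1([0x63, 0x61, 0x66, 0xE9]) == "caf\u00E9");
    // The same word already encoded as UTF-8 passes through untouched.
    assert(toUtf8OrLatin1([0x63, 0x61, 0x66, 0xC3, 0xA9]) == "caf\u00E9");
}
```

Trying UTF-8 first and falling back is a guess, not detection: a file in
another code page will still decode without error and simply show the wrong
characters. That is the reason an explicit option or tag (item 2 above) is
the more reliable route when the encoding is known in advance.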

<shameless-plug>
Also, if you need more clarification on how text in D works, you can
give this a read: http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
</shameless-plug>

Hope this has been of at least some help.

	-- Daniel

-- 
int getRandomNumber()
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}

http://xkcd.com/

v2sw5+8Yhw5ln4+5pr6OFPma8u6+7Lw4Tm6+7l6+7D
i28a2Xs3MSr2e4/6+7t4TNSMb6HTOp5en5g6RAHCP  http://hackerkey.com/

May 13 2007

"Jan Hanselaer" <jan.hanselaer gmail.com> writes:

"Daniel Keep" <daniel.keep.lists gmail.com> wrote in message
news:f26qq0$1d16$1 digitalmars.com...
 Hope this has been of at least some help.

Yes, thanks a lot. At least now I understand it more. But it's a pity there 
isn't a stream doing all the work.
It's going to be very difficult to read the different files in a proper way.


May 13 2007

Lars Ivar Igesund <larsivar igesund.net> writes:

Jan Hanselaer wrote:
 Yes, thanks a lot. At least now I understand it more. But it's a pity
 there isn't a stream doing all the work.
 It's going to be very difficult to read the different files in a proper
 way.

Currently Mango has bindings for IBM's ICU library, which may be the most
comprehensive solution for this type of text handling.

--- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango

May 13 2007

Carlos Santander <csantander619 gmail.com> writes:

Jan Hanselaer wrote:
 
 Yes, thanks a lot. At least now I understand it more. But it's a pity there 
 isn't a stream doing all the work.
 It's going to be very difficult to read the different files in a proper way.
 

Any particular reason why you can't use EndianStream?

-- 
Carlos Santander Bernal

May 13 2007
