
digitalmars.D.learn - Reading ASCII file with some codes above 127 (exten ascii)

reply "Paul" <phshaffer gmail.com> writes:
I am reading a file that has a few extended ASCII codes (e.g. 
degree symbol). Depending on how I read the file in and what I do 
with it, the error shows up at different points.  I'm pretty sure 
it all boils down to these extended ASCII codes.

Can I just tell dmd that I'm reading a Latin1 or ISO 8859-1 file? 
  I've messed with the std.encoding module but really can't figure 
out what I need to do.

There must be a simple solution to this.
May 13 2012
next sibling parent "Era Scarecrow" <rtcvb32 yahoo.com> writes:
Same here. I've ended up writing a custom array converter that if there's any 128+ codes it converts it and returns a new array. Maybe this is wrong, but for me it works.

    import std.utf;
    import std.ascii;

    //conversion table of ascii (latin-1?) to unicode for text compares.
    //only 128-255
    private immutable wchar[] extAscii = [
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
        0x00A0, 0x00A1, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7,
        0x00A8, 0x00A9, 0x00AA, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
        0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7,
        0x00B8, 0x00B9, 0x00BA, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x00BF,
        0x00C0, 0x00C1, 0x00C2, 0x00C3, 0x00C4, 0x00C5, 0x00C6, 0x00C7,
        0x00C8, 0x00C9, 0x00CA, 0x00CB, 0x00CC, 0x00CD, 0x00CE, 0x00CF,
        0x00D0, 0x00D1, 0x00D2, 0x00D3, 0x00D4, 0x00D5, 0x00D6, 0x00D7,
        0x00D8, 0x00D9, 0x00DA, 0x00DB, 0x00DC, 0x00DD, 0x00DE, 0x00DF,
        0x00E0, 0x00E1, 0x00E2, 0x00E3, 0x00E4, 0x00E5, 0x00E6, 0x00E7,
        0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x00EC, 0x00ED, 0x00EE, 0x00EF,
        0x00F0, 0x00F1, 0x00F2, 0x00F3, 0x00F4, 0x00F5, 0x00F6, 0x00F7,
        0x00F8, 0x00F9, 0x00FA, 0x00FB, 0x00FC, 0x00FD, 0x00FE, 0x00FF];

    /**since I can't find a good explanation of conversion, this is
       custom made. if it doesn't need to be converted, it returns the
       original buffer*/
    char[] ascii2char(ubyte[] input) {
        char[] o;

        foreach(i, b; input) {
            if (b & 0x80) {
                if (!o.length)
                    o = cast(char[]) input[0 .. i];
                encode(o, extAscii[b - 0x80]);
            } else if (o.length)
                o ~= b;
        }

        return o.length ? o : cast(char[]) input;
    }
May 13 2012
prev sibling parent reply "Graham Fawcett" <fawcett uwindsor.ca> writes:
This seems to work:

    import std.stdio, std.file, std.encoding;

    void main() {
        auto latin = cast(Latin1String) read("/tmp/hi.8859");
        string s;
        transcode(latin, s);
        writeln(s);
    }

Graham
May 14 2012
next sibling parent "Paul" <phshaffer gmail.com> writes:
Awesome! Thanks a million!
May 17 2012
prev sibling parent reply "Paul" <phshaffer gmail.com> writes:
I thought I was in good shape with your above suggestion. It does help me read and process text. But when I go to print it out I have problems.

Here is my input file:

    °F

Here is my code:

    import std.stdio;
    import std.string;
    import std.file;
    import std.encoding;

    // Main function
    void main(){
        auto fout = File("out.txt","w");
        auto latinS = cast(Latin1String) read("in.txt");
        string uniS;
        transcode(latinS, uniS);
        foreach(line; uniS.splitLines()){
            transcode(line, latinS);
            fout.writeln(line);
            fout.writeln(latinS);
        }
    }

Here is the output:

    °F
    [cast(immutable(Latin1Char))176, cast(immutable(Latin1Char))70]

If I print the Unicode string I get an extra weird character. If I print the Unicode string retranslated to Latin1, I get weird pseudo-code. Can you help?
May 23 2012
parent reply "Graham Fawcett" <fawcett uwindsor.ca> writes:
I tried the program and it seemed to work for me.

What program are you using to read "out.txt"? Are you sure it supports UTF-8, and knows to open the file as UTF-8? (This looks suspiciously like a tool's attempt to misinterpret a UTF-8 string as Latin-1.)

If you're on a Unix system, what does "file in.txt out.txt" report?

Graham
May 23 2012
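For anyone following along, here is a minimal sketch of the misreading Graham describes. It assumes the degree sign is U+00B0 and that the viewer treats each UTF-8 byte as one Latin-1 character; `misreadAsLatin1` is a made-up helper name, not something from Phobos or from the thread.

```d
import std.encoding : Latin1String, transcode;
import std.stdio : writefln;

/// Hypothetical helper: reinterpret the UTF-8 bytes of `utf8` as Latin-1,
/// the way a viewer set to the wrong encoding would display them.
string misreadAsLatin1(string utf8)
{
    string shown;
    transcode(cast(Latin1String) utf8, shown);
    return shown;
}

void main()
{
    string degreesF = "\u00B0F"; // degree sign followed by 'F'

    // UTF-8 needs two bytes for U+00B0, so the string is three bytes long.
    auto bytes = cast(immutable(ubyte)[]) degreesF;
    assert(bytes == [0xC2, 0xB0, 0x46]);

    // Read byte-by-byte as Latin-1, 0xC2 shows up as an extra character
    // (U+00C2) in front of the degree sign.
    assert(misreadAsLatin1(degreesF) == "\u00C2\u00B0F");

    writefln("%(%02X %)", bytes); // hex dump of the UTF-8 bytes
}
```

So an extra character in front of the degree sign would be consistent with UTF-8 output being viewed as Latin-1/ANSI, rather than with the conversion itself going wrong.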
parent reply "Paul" <phshaffer gmail.com> writes:
Hmmm. I'm not communicating well. I want to read and write ASCII. The only reason I'm converting to Unicode is because D needs it (as I understand). Yes if I open °F in notepad++ and tell notepad++ that it is UTF-8, it shows °F.

I want to:

1) Read an ascii file that may have codes above 127.
2) Convert to unicode so D funcs like .splitLines() can work with it.
3) Convert back to ascii so that stuff like °F writes out as it was read in.

If I open in.txt and out.txt in an ascii editor, °F should look the same in both files with the editor encoding the files as ANSI/ASCII.

I thought my program was doing just that. Thanks for your assistance.
May 23 2012
parent reply "Graham Fawcett" <fawcett uwindsor.ca> writes:
To make sure we're on the same page -- ASCII is a 7-bit encoding, and any character above 127 is by definition not an ASCII character. At that point we're talking about an encoding other than ASCII, such as UTF-8 or Latin-1.

If you're reading a file that has bytes > 127, you really have no choice but to specify (assume?) an encoding, Latin-1 for example. There's no guarantee your input file is Latin-1, though, and garbage-in will result in garbage-out.

So I think what you're trying to do is

1. read a Latin-1 file, into unicode (internally in D)
2. do splitLines(), etc., generating some result
3. Convert the result back to latin-1, and output it.

Is that right?
Graham
May 23 2012
parent reply "Paul" <phshaffer gmail.com> writes:
Exactly.
May 23 2012
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, May 23, 2012 at 09:09:27PM +0200, Paul wrote:
 On Wednesday, 23 May 2012 at 19:01:53 UTC, Graham Fawcett wrote:
[...]
So I think what you're trying to do is

1. read a Latin-1 file, into unicode (internally in D)
2. do splitLines(), etc., generating some result
3. Convert the result back to latin-1, and output it.

Is that right?
Graham
Exactly.
The safest way is probably to read it as binary data (i.e. byte[]), then do the conversion into UTF8, then process it, and finally convert it back to latin-1 (in binary form) and output it.

D assumes Unicode internally; if you try to read a Latin-1 file as char[], you may be running into some implicit UTF conversions that are corrupting the data. Best use byte[] for reading/writing, and do conversions to/from UTF-8 internally for processing.

T

-- 
Doubt is a self-fulfilling prophecy.
May 23 2012
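The bytes-in, UTF-8-inside, bytes-out approach described above could be sketched roughly like this. It is only an illustration under some assumptions: the input really is Latin-1, the whole file fits in memory, and every output line ends in "\n" regardless of what the input used. The helper names (`latin1ToUtf8`, `utf8ToLatin1`, `copyLines`) are invented for the example.

```d
import std.encoding : Latin1String, transcode;
import std.file : read, write;
import std.string : splitLines;

/// Decode Latin-1 bytes into a UTF-8 string.
string latin1ToUtf8(const(ubyte)[] raw)
{
    string utf8;
    transcode(cast(Latin1String) raw.idup, utf8);
    return utf8;
}

/// Encode a UTF-8 string back into Latin-1 bytes.
immutable(ubyte)[] utf8ToLatin1(string text)
{
    Latin1String latin;
    transcode(text, latin);
    return cast(immutable(ubyte)[]) latin;
}

/// Hypothetical helper: read inPath as Latin-1, process it line by line
/// as UTF-8, and write the result to outPath as Latin-1 bytes.
void copyLines(string inPath, string outPath)
{
    auto text = latin1ToUtf8(cast(const(ubyte)[]) read(inPath));
    ubyte[] result;
    foreach (line; text.splitLines())
    {
        // Any per-line processing would go here, on the UTF-8 `line`.
        result ~= utf8ToLatin1(line);
        result ~= cast(ubyte) '\n'; // splitLines drops the terminator
    }
    write(outPath, result); // raw bytes, no text-mode interpretation
}

void main()
{
    // Sample input: the degree sign (0xB0), 'F', and a newline in Latin-1.
    ubyte[] sample = [0xB0, 0x46, 0x0A];
    write("in.txt", sample);

    copyLines("in.txt", "out.txt");
    assert(cast(ubyte[]) read("out.txt") == sample);
}
```

The file is never treated as char[] on the way in or out, so nothing gets a chance to interpret the Latin-1 bytes as UTF-8.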
parent "Paul" <phshaffer gmail.com> writes:
You mean something like Era has done in the first reply? If that is so I have to say I'm really surprised. To write D so it natively expects and outputs unicode is one thing, but not making a clean simple way to read extended ASCII chars (i.e. Latin1) and write them back out seems like an oversight. I think I am (actually Graham is) close. Thanks for your feedback HS.
May 23 2012
prev sibling parent reply "Graham Fawcett" <fawcett uwindsor.ca> writes:
This works, though it's ugly:

     foreach(line; uniS.splitLines()) {
        transcode(line, latinS);
        fout.writeln((cast(char[]) latinS));
     }

The Latin1String type, at the storage level, is a ubyte[]. By casting to char[], you can get a similar-to-string thing that writeln() can handle.

Graham
May 23 2012
parent reply "Paul" <phshaffer gmail.com> writes:
Awesome! What a lesson! Thank you!

So if anyone is following this thread, here's my code now. This reads a text file (encoded in Latin1, which is basic ascii with extended ascii codes), allows D to work with it in unicode, and then spits it back out as Latin1.

I wonder about the speed between this method and Era's home-spun solution?

    import std.stdio;
    import std.string;
    import std.file;
    import std.encoding;

    // Main function
    void main(){
        auto fout = File("out.txt","w");
        auto latinS = cast(Latin1String) read("in.txt");
        string uniS;
        transcode(latinS, uniS);
        foreach(line; uniS.splitLines()){
            transcode(line, latinS);
            fout.writeln((cast(char[]) latinS));
        }
    }
May 23 2012
next sibling parent reply "era scarecrow" <rtcvb32 yahoo.com> writes:
On Wednesday, 23 May 2012 at 21:02:27 UTC, Paul wrote:
 I wonder about the speed between this method and Era's 
 home-spun solution?
My solution may have a flaw in its lookup table; namely if I got one of the codes wrong. I used regex and a site to reference them all so I hope it's right. I can't remember but I think it was from http://www.alanwood.net/demos/ansi.html

The main reason I wrote it was that there were no good explanations, in the documentation or anywhere else, of how to use std.encoding and transcode. This meant I was stuck and needed some simple solution.

I'm not sure if my solution is going to be faster, but it does do minimal object allocation/resizing/abstraction, and tries not to make a new string if it doesn't have to.

Who knows? Perhaps it will be added to phobos once the table is verified.
May 24 2012
parent Era Scarecrow <rtcvb32 yahoo.com> writes:
On Thursday, 24 May 2012 at 19:47:06 UTC, era scarecrow wrote:
 On Wednesday, 23 May 2012 at 21:02:27 UTC, Paul wrote:
 I wonder about the speed between this method and Era's 
 home-spun solution?
Who knows? Perhaps it will be added to phobos once the table is verified.
Well, after taking to heart the idea of a gc-less solution and doing an inputRange, I re-wrote the entire thing. Of course, to make it even faster/simpler, a full lookup table conversion is used instead. Further reduction has made a very tiny, simple filter. Curiously, looking at it again, there are actually very few codes that really require special attention. If there's still any interest in this I can release it.
May 15 2016
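Era's rewritten version was not posted in the thread, so the following is only a guess at the general shape: a lazy range that maps each byte to a code point, with a small table for the few bytes that need special attention. The 32 table entries are copied from the 0x80..0x9F portion of the table in Era's first reply; the names `highCodes`, `decodeByte` and `byCodePoint` are invented here.

```d
import std.algorithm : map;
import std.array : array;
import std.utf : toUTF8;

// Code points for bytes 0x80..0x9F, taken from the table earlier in the thread.
private immutable wchar[32] highCodes = [
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178];

/// Map one input byte to a Unicode code point.
dchar decodeByte(ubyte b)
{
    // Only 0x80..0x9F differ from their byte value in that table;
    // every other byte maps to the code point with the same number.
    if (b >= 0x80 && b <= 0x9F)
        return highCodes[b - 0x80];
    return cast(dchar) b;
}

/// Lazy view of the bytes as code points; the range itself allocates nothing.
auto byCodePoint(const(ubyte)[] raw)
{
    return raw.map!decodeByte;
}

void main()
{
    ubyte[] raw = [0xB0, 0x46, 0x80];

    // Materialising a string does allocate; the lazy range above does not.
    string s = raw.byCodePoint.array.toUTF8;
    assert(s == "\u00B0F\u20AC"); // degree sign, 'F', euro sign
}
```

Whether this matches what Era actually wrote is unknown; it just shows why only a handful of codes need a table at all.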
prev sibling parent "Regan Heath" <regan netmail.co.nz> writes:
On Wed, 23 May 2012 22:02:25 +0100, Paul <phshaffer gmail.com> wrote:
[...]
     foreach(line; uniS.splitLines()){
        transcode(line, latinS);
        fout.writeln((cast(char[]) latinS));
     }
The only thing which would worry me about this code is the cast(char[]) in the final writeln. I know some parts of phobos verify that char data is correct UTF-8, and this line casts latin-1 to char[], which can potentially create invalid UTF-8 data.

That said, I had a really quick look at the phobos code for File.writeln and I'm not sure whether this function does any UTF-8 validation.

I would be happier if the latin-1 was written as a stream of bytes with no assumed interpretation, IMO.

R
May 25 2012
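A small sketch of the concern above and of the byte-stream alternative, assuming File.rawWrite is acceptable for the output and that a plain "\n" line ending is wanted. `toLatin1Bytes` is a made-up helper name.

```d
import std.encoding : Latin1String, transcode;
import std.exception : assertThrown;
import std.stdio : File;
import std.utf : UTFException, validate;

/// Hypothetical helper: transcode a UTF-8 line to Latin-1 and expose the
/// result as plain bytes rather than as char data.
immutable(ubyte)[] toLatin1Bytes(string line)
{
    Latin1String latin;
    transcode(line, latin);
    return cast(immutable(ubyte)[]) latin;
}

void main()
{
    auto bytes = toLatin1Bytes("\u00B0F"); // degree sign + 'F' -> 0xB0 0x46

    // Relabelling the bytes as char[] does not make them UTF-8: a lone 0xB0
    // is a continuation byte, so explicit validation rejects it.
    assertThrown!UTFException(validate(cast(const(char)[]) bytes));

    // Writing the bytes directly avoids claiming they are UTF-8 at all.
    auto fout = File("out.txt", "wb");
    fout.rawWrite(bytes);
    fout.rawWrite("\n");
}
```

This does not show that writeln validates its input; it only shows that the cast produces data that fails validation wherever validation does happen.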