digitalmars.D.learn - Parsing a UTF-16LE file line by line?

Nestor <nestorperez2016 yopmail.com> writes:
Hi,

I was just trying to parse a UTF-16LE file using byLine, but 
apparently this function doesn't work with anything other than 
UTF-8, because I get this error:

"Invalid UTF-8 sequence (at index 1)"

How can I achieve what I want, without loading the entire file 
into memory?

Thanks in advance.
Jan 04
Daniel Kozák via Digitalmars-d-learn writes:
Can you show your code? byLine should work fine. Please also post an example of a UTF-16LE file that does not work.
Jan 04
Daniel Kozák via Digitalmars-d-learn writes:
OK, I've done some testing and you are right: byLine is broken, so please file a bug.
Jan 04
Nestor <nestorperez2016 yopmail.com> writes:
On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
 Ok, I've done some testing and you are right byLine is broken, 
 so please file a bug
A bug? I was under the impression that this function was *intended* to work only with UTF-8 encoded files.
Jan 04
pineapple <meapineapple gmail.com> writes:
I'm not sure if this works quite as intended, but I was at least able to produce a UTF-16 decode error rather than a UTF-8 decode error by setting the file orientation before reading it.

    import std.stdio;
    import core.stdc.wchar_ : fwide;
    void main(){
        auto file = File("UTF-16LE encoded file.txt");
        fwide(file.getFP(), 1);
        foreach(line; file.byLine){
            writeln(file.readln);
        }
    }
Jan 04
rumbu <rumbu rumbu.ro> writes:
fwide is not implemented in Windows: https://msdn.microsoft.com/en-us/library/aa985619.aspx
Jan 05
pineapple <meapineapple gmail.com> writes:
That's odd. It was on Windows 7 64-bit that I put together and tested that example, and calling fwide definitely had an effect on program behavior.
Jan 06
Mike Wey <mike-wey example.com> writes:
Are you compiling a 32-bit binary? In that case you would be using the Digital Mars C runtime, which might have an implementation of fwide.

--
Mike Wey
Jan 06
Nestor <nestorperez2016 yopmail.com> writes:
After some testing I realized that byLine was not the one failing, but any string manipulation done to the obtained line. Compile the following example with and without -debug and run to see what I mean:

    import std.stdio, std.string;

    enum
      EXIT_SUCCESS = 0,
      EXIT_FAILURE = 1;

    int main() {
      version(Windows) {
        import core.sys.windows.wincon;
        SetConsoleOutputCP(65001);
      }
      auto f = File("utf16le.txt", "r");
      foreach (line; f.byLine()) try {
        string s;
        debug s = cast(string)strip(line); // this is the one causing problems
        if (1 > s.length) continue;
        writeln(s);
      } catch(Exception e) {
        writefln("Error. %s\nFile \"%s\", line %s.", e.msg, e.file, e.line);
        return EXIT_FAILURE;
      }
      return EXIT_SUCCESS;
    }
Jan 15
Nestor <nestorperez2016 yopmail.com> writes:
By the way, when caught, the exception says it's in file src/phobos/std/utf.d line 1217, but that file only has 784 lines. That's quite odd. (I am compiling with dmd 2.072.2)
Jan 15
Daniel Kozák via Digitalmars-d-learn writes:
This is because byLine returns a range, so until you do something with it, it does not cause any harm :)
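
A minimal sketch of why the error only shows up once the line is processed. It rests on an assumption that is not confirmed in this thread: that byLine hands back the raw bytes as char[] without validating them, so the failure comes from whatever later decodes or validates the data (strip in Nestor's example, std.utf.validate here). The byte values are illustrative, not taken from Nestor's file.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // Illustrative bytes: a UTF-16LE BOM (FF FE) followed by "hi".
    immutable(ubyte)[] raw = [0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00];

    // Reinterpreting the bytes as char data does no decoding, so it cannot fail.
    auto text = cast(const(char)[]) raw;
    assert(text.length == 6);

    // Validation (like anything else that decodes) rejects 0xFF,
    // which can never start a UTF-8 sequence.
    assertThrown!UTFException(validate(text));
}
```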
Jan 15
Nestor <nestorperez2016 yopmail.com> writes:
On Sunday, 15 January 2017 at 16:29:23 UTC, Daniel Kozák wrote:
 This is because byLine returns a range, so until you do 
 something with it, it does not cause any harm :)
I see. So, correcting my original question:

How could I parse a UTF-16LE file line by line (producing a proper string in each iteration) without loading the entire file into memory?
Jan 15
Era Scarecrow <rtcvb32 yahoo.com> writes:
Could... roll your own? Although if you wanted UTF-8 output instead, it would require a second pass or, better yet, changing how i is iterated.

    import std.stdio;

    char[] getLine16LE(File inp = stdin) {
        static char[1024*4] buffer;  //4k reusable buffer, NOT thread safe
        int i;
        // read one UTF-16 code unit (two bytes) at a time
        while(inp.rawRead(buffer[i .. i+2]) != null) {
            if (buffer[i] == '\n')
                break;
            i+=2;
        }
        return buffer[0 .. i];
    }
Jan 16
Nestor <nestorperez2016 yopmail.com> writes:
Thanks, but unfortunately this function does not produce proper UTF-8 strings; as a matter of fact, the output even starts with the BOM. Also, it doesn't handle CRLF, and even for LF-terminated lines it doesn't seem to work for lines other than the first.

I guess I have to code encoding detection, buffered reads, and transcoding by hand. The only problem is that the result could be sub-optimal, which is why I was looking for a built-in solution.
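
A hand-rolled version of the buffered read plus transcoding described above might look roughly like the sketch below. It is not a built-in Phobos facility and has not been tuned: the helper name eachLineUtf16LE, the sink callback and the 4096-byte chunk size are all made up for illustration. It assumes the input really is UTF-16LE, skips a leading BOM, treats CRLF like LF, and hands each line to the callback as a UTF-8 string while holding only one chunk and one line in memory.

```d
import std.array : appender;
import std.stdio : File, writeln;
import std.utf : toUTF8;

/// Hypothetical helper: calls `sink` once per line with a UTF-8 copy of it.
void eachLineUtf16LE(File input, scope void delegate(string) sink)
{
    auto line = appender!(wchar[])();   // code units of the current line
    ubyte lowByte;                      // first byte of a unit split across chunks
    bool haveLowByte = false;
    bool atStart = true;                // true until the first code unit is seen

    void emit()
    {
        auto units = line.data;
        if (units.length && units[$ - 1] == '\r')
            units = units[0 .. $ - 1];  // treat CRLF like LF
        sink(toUTF8(units));            // transcode UTF-16 to a fresh UTF-8 string
        line.clear();
    }

    foreach (ubyte[] chunk; input.byChunk(4096))
    {
        foreach (b; chunk)
        {
            if (!haveLowByte)
            {
                lowByte = b;
                haveLowByte = true;
                continue;
            }
            haveLowByte = false;

            // Little-endian: the low byte comes first in the file.
            immutable wchar unit = cast(wchar)(lowByte | (b << 8));
            immutable isFirstUnit = atStart;
            atStart = false;

            if (isFirstUnit && unit == 0xFEFF)
                continue;               // skip the BOM
            if (unit == '\n')
                emit();
            else
                line.put(unit);
        }
    }
    if (line.data.length)
        emit();                         // last line without a trailing newline
}

void main(string[] args)
{
    // Usage: pass the path of a UTF-16LE file as the first argument.
    if (args.length < 2)
        return;
    eachLineUtf16LE(File(args[1], "rb"), (string s) { writeln(s); });
}
```

Surrogate pairs need no special handling here because the code units are only collected and then transcoded as a whole line; a trailing odd byte at end of file is silently dropped.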
Jan 17
Era Scarecrow <rtcvb32 yahoo.com> writes:
On Tuesday, 17 January 2017 at 11:40:15 UTC, Nestor wrote:
 Thanks, but unfortunately this function does not produce proper 
 UTF8 strings, as a matter of fact the output even starts with 
 the BOM. Also it doesn't handle CRLF, and even for LF 
 terminated lines it doesn't seem to work for lines other than 
 the first.
I thought you wanted to get the contents line by line, which would then remain UTF-16. Translating between the two encodings shouldn't be hard; to!string, or a foreach over chars that appends each code unit, would probably convert to UTF-8. Skipping the BOM is just a matter of skipping the first two bytes identifying it...
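
A small sketch of the two conversions mentioned above, assuming the line is already available as UTF-16 code units; the sample text is illustrative. std.conv.to transcodes between D's string types, and a foreach declared with a char loop variable transcodes a wstring to UTF-8 code units on the fly.

```d
import std.conv : to;

void main()
{
    // Illustrative UTF-16 text with a leading BOM, as a first line might look.
    wstring utf16 = "\uFEFFna\u00EFve"w;

    // Skipping the UTF-16 BOM means dropping one code unit (two bytes).
    if (utf16.length && utf16[0] == 0xFEFF)
        utf16 = utf16[1 .. $];

    // Option 1: let std.conv.to transcode to UTF-8.
    string utf8 = utf16.to!string;
    assert(utf8 == "na\u00EFve");
    assert(utf8.length == 6);   // U+00EF takes two UTF-8 code units

    // Option 2: foreach with a char variable yields UTF-8 code units.
    char[] manual;
    foreach (char c; utf16)
        manual ~= c;
    assert(manual == utf8);
}
```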
 I guess I have to code encoding detection, buffered read, and 
 transcoding by hand, the only problem is that the result could 
 be sub-optimal, which is why I was looking for a built-in 
 solution.
Maybe. Honestly I'm not nearly as familiar with the library or functions as I would love to be, so often home-made solutions seem more prevalent until I learn the lingo. A disadvantage of being self taught.
Jan 26
Nestor <nestorperez2016 yopmail.com> writes:
On Friday, 27 January 2017 at 04:26:31 UTC, Era Scarecrow wrote:
  Skipping the BOM is just a matter of skipping the first two 
 bytes identifying it...
AFAIK in some cases the BOM takes up to 4 bytes (for UTF-32), so when the input encoding is unknown one must perform some kind of detection in order to apply the correct transcoding later. I thought by now dmd had this functionality built in and exposed, since the compiler itself seems to do it for source code units.
Jan 28
Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On UTF-8 files the BOM is 3 bytes long.
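
The BOM lengths mentioned in this exchange (2 bytes for UTF-16, 3 for UTF-8, 4 for UTF-32) can be checked by hand. The following is only a sketch: the names Bom, hasPrefix and detectBom are hypothetical, and it covers just these five encodings. Note that FF FE 00 00 is ambiguous in principle (it could also be a UTF-16LE BOM followed by U+0000); the sketch simply prefers UTF-32LE.

```d
enum Bom { none, utf8, utf16le, utf16be, utf32le, utf32be }

/// True if `data` starts with the bytes in `prefix`.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Hypothetical helper: identifies a BOM at the start of `head`
/// and reports how many bytes it occupies.
Bom detectBom(const(ubyte)[] head, out size_t bomLength)
{
    // UTF-32LE must be tested before UTF-16LE: both begin with FF FE.
    if (hasPrefix(head, [0xFF, 0xFE, 0x00, 0x00])) { bomLength = 4; return Bom.utf32le; }
    if (hasPrefix(head, [0x00, 0x00, 0xFE, 0xFF])) { bomLength = 4; return Bom.utf32be; }
    if (hasPrefix(head, [0xEF, 0xBB, 0xBF]))       { bomLength = 3; return Bom.utf8; }
    if (hasPrefix(head, [0xFF, 0xFE]))             { bomLength = 2; return Bom.utf16le; }
    if (hasPrefix(head, [0xFE, 0xFF]))             { bomLength = 2; return Bom.utf16be; }
    return Bom.none;   // bomLength stays 0
}

void main()
{
    size_t n;
    assert(detectBom([0xFF, 0xFE, 0x68, 0x00], n) == Bom.utf16le && n == 2);
    assert(detectBom([0xEF, 0xBB, 0xBF, 0x68], n) == Bom.utf8 && n == 3);
    assert(detectBom([0xFF, 0xFE, 0x00, 0x00], n) == Bom.utf32le && n == 4);
    assert(detectBom([0x68, 0x69], n) == Bom.none && n == 0);
}
```

A file without a BOM still needs some other way to determine its encoding; this only handles the case where a BOM is present.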
Jan 29
Jack Applegame <japplegame gmail.com> writes:
On Monday, 16 January 2017 at 14:47:23 UTC, Era Scarecrow wrote:
     static char[1024*4] buffer;  //4k reusable buffer, NOT 
 thread safe
Maybe I'm wrong, but I think it's thread safe, because static mutable non-shared variables are stored in TLS.
Jan 26
Era Scarecrow <rtcvb32 yahoo.com> writes:
Perhaps, but fibers or other instances of sharing the buffer wouldn't be safe/reliable, at least not for long.
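
Both points can be illustrated with a short sketch. The function scratch below is a hypothetical stand-in for the static buffer in getLine16LE, and the __gshared pointer exists only so the two threads can compare addresses. Each thread gets its own copy of the static buffer, but every caller within one thread (which would include fibers running on that thread) sees the same memory.

```d
import core.thread : Thread;

/// Hypothetical stand-in for the function-local static buffer.
char[] scratch()
{
    static char[16] buffer;   // static and not shared: thread-local by default in D
    return buffer[];
}

__gshared char* otherThreadPtr;   // process-global, used only to compare addresses

void worker()
{
    otherThreadPtr = scratch().ptr;
}

void main()
{
    auto t = new Thread(&worker);
    t.start();
    t.join();

    // The worker thread had its own copy of the buffer.
    assert(otherThreadPtr !is scratch().ptr);

    // Within one thread, every caller gets the same memory,
    // so a later call overwrites what an earlier caller still holds.
    auto first = scratch();
    first[0] = 'A';
    auto second = scratch();
    second[0] = 'B';
    assert(first[0] == 'B');
}
```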
Jan 27
Daniel Kozák via Digitalmars-d-learn writes:
Nestor via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote on Wed, Jan 4, 2017 at 8:20:
 On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
 Ok, I've done some testing and you are right byLine is broken, so 
 please file a bug
 A bug? I was under the impression that this function was *intended* to work only with UTF-8 encoded files.
The impression is nice, but the documentation says nothing about it, so anyone who reads the docs will expect it to work with any encoding.

And from the docs I see there is a way to select the encoding and even the Terminator and its type, and this does not work, so I expect it is a bug.

Another weird behaviour: when you read a file as wstring, it will try to decode it as UTF-8 and then encode it to UTF-16. But even if that works (for UTF-8 files) and you end up with wstring lines (wstring[]), when you try to save them they are automatically saved as UTF-8. WTF, this is really wrong, and if it is intended it should be documented better. Right now it is really hard to work with dlang stdio.

But I hope it will be deprecated someday and replaced with something that supports ranges and async IO.
Jan 04
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 1/4/17 6:03 AM, Nestor wrote:
 Hi,

 I was just trying to parse a UTF-16LE file using byLine, but apparently
 this function doesn't work with anything other than UTF-8, because I get
 this error:

 "Invalid UTF-8 sequence (at index 1)"

 How can I achieve what I want, without loading the entire file into memory?

 Thanks in advance.
I have not tested much with UTF-16 and std.stdio, but I don't believe the underlying FILE * being used by Phobos has good support for it. In my testing, for instance, byLine with a non-ASCII delimiter didn't work at all. On Windows 64-bit, MSVC simply ignores any attempts to change the width of the stream.

I wouldn't hold out much hope for this to be fixed.

-Steve
Jan 05