digitalmars.D.learn - Parsing a UTF-16LE file line by line?

Nestor <nestorperez2016 yopmail.com> writes:

Hi,

I was just trying to parse a UTF-16LE file using byLine, but 
apparently this function doesn't work with anything other than 
UTF-8, because I get this error:

"Invalid UTF-8 sequence (at index 1)"

How can I achieve what I want, without loading the entire file 
into memory?

Thanks in advance.

Jan 04 2017

Daniel Kozák via Digitalmars-d-learn writes:

Nestor via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
wrote on Wed, Jan 4, 2017 at 12:03:
 Hi,

 I was just trying to parse a UTF-16LE file using byLine, but
 apparently this function doesn't work with anything other than UTF-8,
 because I get this error:

 "Invalid UTF-8 sequence (at index 1)"

 How can I achieve what I want, without loading the entire file into
 memory?

 Thanks in advance.

Can you show your code? byLine should work fine. Please also post an
example of a UTF-16LE file that does not work.

Jan 04 2017

Daniel Kozák via Digitalmars-d-learn writes:

Daniel Kozák <kozzi11 gmail.com> wrote on Wed, Jan 4, 2017 at 6:33:

 Nestor via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
 wrote on Wed, Jan 4, 2017 at 12:03:
 Hi,

 I was just trying to parse a UTF-16LE file using byLine, but
 apparently this function doesn't work with anything other than
 UTF-8, because I get this error:

 "Invalid UTF-8 sequence (at index 1)"

 How can I achieve what I want, without loading the entire file into
 memory?

 Thanks in advance.

 Can you show your code? byLine should work fine. Please also post an
 example of a UTF-16LE file that does not work.

OK, I've done some testing and you are right: byLine is broken, so
please file a bug.

Jan 04 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
 OK, I've done some testing and you are right: byLine is broken, 
 so please file a bug.

A bug? I was under the impression that this function was 
*intended* to work only with UTF-8 encoded files.

Jan 04 2017

pineapple <meapineapple gmail.com> writes:

On Wednesday, 4 January 2017 at 19:20:31 UTC, Nestor wrote:
 On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák 
 wrote:
 OK, I've done some testing and you are right: byLine is broken, 
 so please file a bug.

 A bug? I was under the impression that this function was 
 *intended* to work only with UTF-8 encoded files.

I'm not sure if this works quite as intended, but I was at least 
able to produce a UTF-16 decode error rather than a UTF-8 decode 
error by setting the file orientation before reading it.

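     import std.stdio;
     import core.stdc.wchar_ : fwide;
     void main(){
         auto file = File("UTF-16LE encoded file.txt");
         // mark the underlying C stream as wide-oriented before the first read
         fwide(file.getFP(), 1);
         foreach(line; file.byLine){
             // print the line byLine just read; calling file.readln here
             // would consume and print the following line instead
             writeln(line);
         }
     }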

Jan 04 2017

rumbu <rumbu rumbu.ro> writes:

 I'm not sure if this works quite as intended, but I was at 
 least able to produce a UTF-16 decode error rather than a UTF-8 
 decode error by setting the file orientation before reading it.

     import std.stdio;
     import core.stdc.wchar_ : fwide;
     void main(){
         auto file = File("UTF-16LE encoded file.txt");
         fwide(file.getFP(), 1);
         foreach(line; file.byLine){
             writeln(file.readln);
         }
     }

fwide is not implemented in Windows: 
https://msdn.microsoft.com/en-us/library/aa985619.aspx

Jan 05 2017

pineapple <meapineapple gmail.com> writes:

On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
 I'm not sure if this works quite as intended, but I was at 
 least able to produce a UTF-16 decode error rather than a 
 UTF-8 decode error by setting the file orientation before 
 reading it.

     import std.stdio;
     import core.stdc.wchar_ : fwide;
     void main(){
         auto file = File("UTF-16LE encoded file.txt");
         fwide(file.getFP(), 1);
         foreach(line; file.byLine){
             writeln(file.readln);
         }
     }

 fwide is not implemented in Windows: 
 https://msdn.microsoft.com/en-us/library/aa985619.aspx

That's odd. It was on Windows 7 64-bit that I put together and 
tested that example, and calling fwide definitely had an effect 
on program behavior.

Jan 06 2017

Mike Wey <mike-wey example.com> writes:

On 01/06/2017 11:33 AM, pineapple wrote:
 On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
 I'm not sure if this works quite as intended, but I was at least able
 to produce a UTF-16 decode error rather than a UTF-8 decode error by
 setting the file orientation before reading it.

     import std.stdio;
     import core.stdc.wchar_ : fwide;
     void main(){
         auto file = File("UTF-16LE encoded file.txt");
         fwide(file.getFP(), 1);
         foreach(line; file.byLine){
             writeln(file.readln);
         }
     }

 fwide is not implemented in Windows:
 https://msdn.microsoft.com/en-us/library/aa985619.aspx

 That's odd. It was on Windows 7 64-bit that I put together and tested
 that example, and calling fwide definitely had an effect on program
 behavior.

Are you compiling a 32-bit binary? In that case you would be 
using the Digital Mars C runtime, which might have an implementation 
of fwide.

-- 
Mike Wey

Jan 06 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Friday, 6 January 2017 at 11:42:17 UTC, Mike Wey wrote:
 On 01/06/2017 11:33 AM, pineapple wrote:
 On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
 I'm not sure if this works quite as intended, but I was at 
 least able
 to produce a UTF-16 decode error rather than a UTF-8 decode 
 error by
 setting the file orientation before reading it.

     import std.stdio;
     import core.stdc.wchar_ : fwide;
     void main(){
         auto file = File("UTF-16LE encoded file.txt");
         fwide(file.getFP(), 1);
         foreach(line; file.byLine){
             writeln(file.readln);
         }
     }

 fwide is not implemented in Windows:
 https://msdn.microsoft.com/en-us/library/aa985619.aspx

 That's odd. It was on Windows 7 64-bit that I put together and 
 tested
 that example, and calling fwide definitely had an effect on 
 program
 behavior.

 Are you compiling a 32bit binary? Because in that case you 
 would be using the digital mars c runtime which might have an 
 implementation for fwide.

After some testing I realized that byLine was not the one 
failing, but any string manipulation done to the obtained line. 
Compile the following example with and without -debug and run to 
see what I mean:

import std.stdio, std.string;

enum
   EXIT_SUCCESS = 0,
   EXIT_FAILURE = 1;

int main() {
   version(Windows) {
     import core.sys.windows.wincon;
     SetConsoleOutputCP(65001);
   }
   auto f = File("utf16le.txt", "r");
   foreach (line; f.byLine()) try {
     string s;
     debug s = cast(string)strip(line); // this is the one causing problems
     if (1 > s.length) continue;
     writeln(s);
   } catch(Exception e) {
     writefln("Error. %s\nFile \"%s\", line %s.", e.msg, e.file, 
e.line);
     return EXIT_FAILURE;
   }
   return EXIT_SUCCESS;
}

Jan 15 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Sunday, 15 January 2017 at 14:48:12 UTC, Nestor wrote:
 After some testing I realized that byLine was not the one 
 failing, but any string manipulation done to the obtained line. 
 Compile the following example with and without -debug and run 
 to see what I mean:

 import std.stdio, std.string;

 enum
   EXIT_SUCCESS = 0,
   EXIT_FAILURE = 1;

 int main() {
   version(Windows) {
     import core.sys.windows.wincon;
     SetConsoleOutputCP(65001);
   }
   auto f = File("utf16le.txt", "r");
   foreach (line; f.byLine()) try {
     string s;
     debug s = cast(string)strip(line); // this is the one causing problems
     if (1 > s.length) continue;
     writeln(s);
   } catch(Exception e) {
     writefln("Error. %s\nFile \"%s\", line %s.", e.msg, e.file, 
 e.line);
     return EXIT_FAILURE;
   }
   return EXIT_SUCCESS;
 }

By the way, when caught, the exception says it's in file 
src/phobos/std/utf.d line 1217, but that file only has 784 lines. 
That's quite odd.

(I am compiling with dmd 2.072.2)

Jan 15 2017

Daniel Kozák via Digitalmars-d-learn writes:

On Sun, 15 Jan 2017 14:48:12 +0000
Nestor via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:

 On Friday, 6 January 2017 at 11:42:17 UTC, Mike Wey wrote:
 On 01/06/2017 11:33 AM, pineapple wrote:  
 On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:  
 I'm not sure if this works quite as intended, but I was at 
 least able
 to produce a UTF-16 decode error rather than a UTF-8 decode 
 error by
 setting the file orientation before reading it.

     import std.stdio;
     import core.stdc.wchar_ : fwide;
     void main(){
         auto file = File("UTF-16LE encoded file.txt");
         fwide(file.getFP(), 1);
         foreach(line; file.byLine){
             writeln(file.readln);
         }
     }  

 fwide is not implemented in Windows:
 https://msdn.microsoft.com/en-us/library/aa985619.aspx  

 That's odd. It was on Windows 7 64-bit that I put together and 
 tested
 that example, and calling fwide definitely had an effect on 
 program
 behavior.  

 Are you compiling a 32bit binary? Because in that case you 
 would be using the digital mars c runtime which might have an 
 implementation for fwide.  

 
 After some testing I realized that byLine was not the one 
 failing, but any string manipulation done to the obtained line. 
 Compile the following example with and without -debug and run to 
 see what I mean:
 
 import std.stdio, std.string;
 
 enum
    EXIT_SUCCESS = 0,
    EXIT_FAILURE = 1;
 
 int main() {
    version(Windows) {
      import core.sys.windows.wincon;
      SetConsoleOutputCP(65001);
    }
    auto f = File("utf16le.txt", "r");
    foreach (line; f.byLine()) try {
      string s;
      debug s = cast(string)strip(line); // this is the one causing problems
      if (1 > s.length) continue;
      writeln(s);
    } catch(Exception e) {
      writefln("Error. %s\nFile \"%s\", line %s.", e.msg, e.file, 
 e.line);
      return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
 }

This is because byLine returns a range, so until you do something with
the lines it yields, it does not cause any harm :)
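
A minimal sketch of why the error only shows up on string manipulation. It assumes (as the behaviour reported in this thread suggests) that byLine hands back the raw bytes of each line as char[] without validating them; the byte values below are simply what the start of a UTF-16LE file looks like, not data from the original test file.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // First bytes of a UTF-16LE file: BOM (FF FE), then 'a' encoded as 61 00
    char[] raw = ['\xFF', '\xFE', 'a', '\0'];

    // Measuring and indexing never decode, so nothing is checked here
    assert(raw.length == 4);
    assert(raw[2] == 'a');

    // Anything that decodes the bytes as UTF-8 rejects them
    assertThrown!UTFException(validate(raw));
}
```

Functions such as strip have to decode the text to find whitespace, which is where invalid UTF-8 gets noticed.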

Jan 15 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Sunday, 15 January 2017 at 16:29:23 UTC, Daniel Kozák wrote:
 This is because byLine returns a range, so until you do 
 something with the lines it yields, it does not cause any harm :)

I see. So, to correct my original question:

How could I parse a UTF-16LE file line by line (producing a 
proper string in each iteration) without loading the entire file 
into memory?

Jan 15 2017

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Sunday, 15 January 2017 at 19:48:04 UTC, Nestor wrote:
 I see. So, to correct my original question:

 How could I parse a UTF-16LE file line by line (producing a 
 proper string in each iteration) without loading the entire 
 file into memory?

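Could... roll your own? Although if you wanted UTF-8 output 
instead, that would require a second pass or, better yet, changing 
how i is iterated.

import std.stdio;

// Returns the raw bytes of one LF-terminated UTF-16LE line (terminator
// excluded). The result is still UTF-16, keeps the BOM on the first
// line, and keeps the "\r" of a CRLF pair.
char[] getLine16LE(File inp = stdin) {
     static char[1024*4] buffer;  //4k reusable buffer, NOT thread safe
     size_t i;
     // read one 2-byte code unit at a time; stop at EOF or a full buffer
     while(i + 2 <= buffer.length && inp.rawRead(buffer[i .. i+2]).length == 2) {
         // '\n' in UTF-16LE is the byte pair 0x0A 0x00
         if (buffer[i] == '\n' && buffer[i+1] == 0)
             break;

         i+=2;
     }

     return buffer[0 .. i];
}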

Jan 16 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Monday, 16 January 2017 at 14:47:23 UTC, Era Scarecrow wrote:
 On Sunday, 15 January 2017 at 19:48:04 UTC, Nestor wrote:
 I see. So, to correct my original question:

 How could I parse a UTF-16LE file line by line (producing a 
 proper string in each iteration) without loading the entire 
 file into memory?

 Could... roll your own? Although if you wanted it to be UTF-8 
 output instead would require a second pass or better yet 
 changing how the i iterated.

 char[] getLine16LE(File inp = stdin) {
     static char[1024*4] buffer;  //4k reusable buffer, NOT 
 thread safe
     int i;
     while(inp.rawRead(buffer[i .. i+2]) != null) {
         if (buffer[i] == '\n')
             break;

         i+=2;
     }

     return buffer[0 .. i];
 }

Thanks, but unfortunately this function does not produce proper 
UTF-8 strings; as a matter of fact, the output even starts with the 
BOM. Also, it doesn't handle CRLF, and even for LF-terminated lines 
it doesn't seem to work for lines other than the first.

I guess I have to code encoding detection, buffered reading, and 
transcoding by hand. The only problem is that the result could be 
sub-optimal, which is why I was looking for a built-in solution.
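
For illustration, a hand-rolled version could look roughly like the sketch below. Everything in it is an assumption made for the example rather than an existing Phobos facility: the type name Utf16LELines, its put/finish methods, the scratch file name, and the decision to drop a leading BOM and treat both LF and CRLF as terminators. It reads the file in fixed-size chunks, so only the current line is held in memory.

```d
import std.conv : to;
import std.stdio : File;

/// Incrementally splits a UTF-16LE byte stream into UTF-8 lines.
/// Hypothetical helper written for this example.
struct Utf16LELines
{
    private wchar[] line;        // code units collected for the current line
    private int pending = -1;    // low byte carried over from the previous chunk
    private bool atStart = true; // true until the first code unit has been seen

    /// Feeds one chunk of raw bytes; calls `sink` for each completed line.
    void put(const(ubyte)[] chunk, scope void delegate(string) sink)
    {
        foreach (b; chunk)
        {
            if (pending < 0) { pending = b; continue; }
            // little-endian: the low byte comes first
            immutable wchar unit = cast(wchar)(pending | (b << 8));
            pending = -1;
            if (atStart)
            {
                atStart = false;
                if (unit == 0xFEFF) continue; // drop the BOM
            }
            if (unit == '\n')
                emit(sink);
            else
                line ~= unit;
        }
    }

    /// Flushes a last line that has no terminator (a stray odd byte is ignored).
    void finish(scope void delegate(string) sink)
    {
        if (line.length) emit(sink);
    }

    private void emit(scope void delegate(string) sink)
    {
        if (line.length && line[$ - 1] == '\r')
            line = line[0 .. $ - 1]; // CRLF terminator
        sink(line.to!string);        // transcodes UTF-16 to a new UTF-8 string
        line.length = 0;
        line.assumeSafeAppend();     // safe: the string above is a separate copy
    }
}

void main()
{
    import std.file : remove, write;

    // "é\r\nb\n" as UTF-16LE with a BOM, written to a scratch file
    immutable ubyte[] sample = [0xFF, 0xFE, 0xE9, 0x00, 0x0D, 0x00,
                                0x0A, 0x00, 0x62, 0x00, 0x0A, 0x00];
    enum path = "utf16le_sample.txt"; // hypothetical file name
    write(path, sample);
    scope (exit) remove(path);

    string[] got;
    void collect(string s) { got ~= s; }

    Utf16LELines splitter;
    foreach (chunk; File(path, "rb").byChunk(5)) // odd size on purpose
        splitter.put(chunk, &collect);
    splitter.finish(&collect);

    assert(got == ["é", "b"]);
}
```

The odd chunk size in main is only there to exercise the carry-over of a split code unit; a real caller would pick something like 4096. How malformed UTF-16 (for example an unpaired surrogate) is treated depends on the library conversion and is not handled specially here.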

Jan 17 2017

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Tuesday, 17 January 2017 at 11:40:15 UTC, Nestor wrote:
 Thanks, but unfortunately this function does not produce proper 
 UTF8 strings, as a matter of fact the output even starts with 
 the BOM. Also it doesn't handle CRLF, and even for LF 
 terminated lines it doesn't seem to work for lines other than 
 the first.

  I thought you wanted to get the contents line by line, which 
would then remain as UTF-16. Translating between the two 
shouldn't be hard: probably to!string, or a foreach that appends 
each decoded character to a char array, would convert to UTF-8.

  Skipping the BOM is just a matter of skipping the first two 
bytes identifying it...
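
  A rough sketch of that conversion, assuming the raw bytes of one UTF-16LE line are already in hand. The helper name utf16leToString is made up for this example and is not a Phobos function.

```d
import std.conv : to;

/// Converts the raw bytes of one UTF-16LE line to a UTF-8 string.
/// Hypothetical helper, not part of Phobos.
string utf16leToString(const(ubyte)[] raw)
{
    // drop a leading UTF-16LE BOM (FF FE) if present
    if (raw.length >= 2 && raw[0] == 0xFF && raw[1] == 0xFE)
        raw = raw[2 .. $];

    // assemble 16-bit code units from little-endian byte pairs
    auto units = new wchar[raw.length / 2];
    foreach (i, ref u; units)
        u = cast(wchar)(raw[2 * i] | (raw[2 * i + 1] << 8));

    return units.to!string; // transcodes UTF-16 to UTF-8
}

void main()
{
    immutable ubyte[] raw = [0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00]; // BOM + "Hi"
    assert(utf16leToString(raw) == "Hi");
}
```

  The char[] returned by getLine16LE could be passed in after a cast to const(ubyte)[].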

 I guess I have to code encoding detection, buffered read, and 
 transcoding by hand, the only problem is that the result could 
 be sub-optimal, which is why I was looking for a built-in 
 solution.

  Maybe. Honestly I'm not nearly as familiar with the library or 
functions as I would love to be, so often home-made solutions 
seem more prevalent until I learn the lingo. A disadvantage of 
being self taught.

Jan 26 2017

Nestor <nestorperez2016 yopmail.com> writes:

On Friday, 27 January 2017 at 04:26:31 UTC, Era Scarecrow wrote:
  Skipping the BOM is just a matter of skipping the first two 
 bytes identifying it...

AFAIK in some cases the BOM takes up to 4 bytes (for UTF-32), so 
when input encoding is unknown one must perform some kind of 
detection in order to apply the correct transcoding later. I 
thought by now dmd had this functionality built-in and exposed, 
since the compiler itself seems to do it for source code units.

Jan 28 2017

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Saturday, 28 January 2017 at 15:40:24 UTC, Nestor wrote:
 On Friday, 27 January 2017 at 04:26:31 UTC, Era Scarecrow wrote:
  Skipping the BOM is just a matter of skipping the first two 
 bytes identifying it...

 AFAIK in some cases the BOM takes up to 4 bytes (for UTF-32), 
 so when input encoding is unknown one must perform some kind of 
 detection in order to apply the correct transcoding later. I 
 thought by now dmd had this functionality built-in and exposed, 
 since the compiler itself seems to do it for source code units.

On UTF-8 files the BOM is 3 bytes long.
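
The BOM lengths mentioned here (2 bytes for UTF-16, 3 for UTF-8, 4 for UTF-32) can be checked against the first few bytes of a file. The sketch below is hand-rolled for illustration: the names Bom, BomInfo, hasPrefix and detectBom are assumptions of this example, and it only covers the five common Unicode BOMs.

```d
enum Bom { none, utf8, utf16le, utf16be, utf32le, utf32be }

struct BomInfo
{
    Bom kind;      // detected encoding, or Bom.none
    size_t length; // number of BOM bytes to skip
}

private bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Inspects the first bytes of a file (4 are enough) for a Unicode BOM.
BomInfo detectBom(const(ubyte)[] head)
{
    // UTF-32LE is tested before UTF-16LE because both start with FF FE
    if (hasPrefix(head, [0xFF, 0xFE, 0x00, 0x00])) return BomInfo(Bom.utf32le, 4);
    if (hasPrefix(head, [0x00, 0x00, 0xFE, 0xFF])) return BomInfo(Bom.utf32be, 4);
    if (hasPrefix(head, [0xEF, 0xBB, 0xBF]))       return BomInfo(Bom.utf8, 3);
    if (hasPrefix(head, [0xFF, 0xFE]))             return BomInfo(Bom.utf16le, 2);
    if (hasPrefix(head, [0xFE, 0xFF]))             return BomInfo(Bom.utf16be, 2);
    return BomInfo(Bom.none, 0);
}

void main()
{
    assert(detectBom([0xEF, 0xBB, 0xBF, 0x41]) == BomInfo(Bom.utf8, 3));
    assert(detectBom([0xFF, 0xFE, 0x41, 0x00]) == BomInfo(Bom.utf16le, 2));
    assert(detectBom([0xFF, 0xFE, 0x00, 0x00]) == BomInfo(Bom.utf32le, 4));
    assert(detectBom([0x41, 0x42]) == BomInfo(Bom.none, 0));
}
```

One ambiguity remains: FF FE 00 00 could also be a UTF-16LE BOM followed by a NUL character, so the UTF-32LE result is a guess in that case. A file without any BOM needs a different heuristic.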

Jan 29 2017

Jack Applegame <japplegame gmail.com> writes:

On Monday, 16 January 2017 at 14:47:23 UTC, Era Scarecrow wrote:
     static char[1024*4] buffer;  //4k reusable buffer, NOT 
 thread safe

Maybe I'm wrong, but I think it's thread safe. Because static 
mutable non-shared variables are stored in TLS.

Jan 26 2017

Era Scarecrow <rtcvb32 yahoo.com> writes:

On Friday, 27 January 2017 at 07:02:52 UTC, Jack Applegame wrote:
 On Monday, 16 January 2017 at 14:47:23 UTC, Era Scarecrow wrote:
     static char[1024*4] buffer;  //4k reusable buffer, NOT 
 thread safe

 Maybe I'm wrong, but I think it's thread safe. Because static 
 mutable non-shared variables are stored in TLS.

  Perhaps, but fibers or other instances of sharing the buffer 
wouldn't be safe/reliable, at least not for long.
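
  Both points can be seen in a small sketch. It uses a module-level int named counter (a stand-in chosen for this example, in place of the static buffer) to show that a non-shared variable gets one copy per thread, while a fiber running on the same thread sees and modifies the same copy.

```d
import core.thread : Fiber, Thread;

int counter; // not `shared` or `__gshared`: each thread gets its own copy

void main()
{
    counter = 1;

    auto t = new Thread({
        assert(counter == 0); // fresh thread-local copy, default-initialised
        counter = 42;         // changes only this thread's copy
    });
    t.start();
    t.join();
    assert(counter == 1);     // the main thread's copy is untouched

    auto f = new Fiber({ counter = 7; });
    f.call();                 // runs on the current thread
    assert(counter == 7);     // same thread, so the same copy was modified
}
```

  So a second thread calling getLine16LE would get its own buffer, but two fibers on one thread, or a caller that keeps the returned slice across calls, would see the contents overwritten.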

Jan 27 2017

Daniel Kozák via Digitalmars-d-learn writes:

Nestor via Digitalmars-d-learn <digitalmars-d-learn puremagic.com>
wrote on Wed, Jan 4, 2017 at 8:20:
 On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
 OK, I've done some testing and you are right: byLine is broken, so
 please file a bug.

 A bug? I was under the impression that this function was *intended*
 to work only with UTF-8 encoded files.

That impression is reasonable, but the documentation says nothing about
it, so anyone who reads the docs will expect it to work with any encoding.
And from the docs I see there is a way to select the encoding, and even
the Terminator and its type, and this does not work, so I expect it
is a bug.

Another weird behaviour: when you read a file as wstring, it will try to
decode it as UTF-8 and then encode it to UTF-16. Even if that works (for
UTF-8 files) and you end up with wstring lines (wstring[]), when you try
to save them they are automatically saved as UTF-8. WTF, this is really
wrong, and if it is intended it should be documented better. Right now
it is really hard to work with dlang stdio.

But I hope it will be deprecated someday and replaced with something
that supports ranges and async IO.

Jan 04 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 1/4/17 6:03 AM, Nestor wrote:
 Hi,

 I was just trying to parse a UTF-16LE file using byLine, but apparently
 this function doesn't work with anything other than UTF-8, because I get
 this error:

 "Invalid UTF-8 sequence (at index 1)"

 How can I achieve what I want, without loading the entire file into memory?

 Thanks in advance.

I have not tested much with UTF-16 and std.stdio, but I don't believe the 
underlying FILE * being used by Phobos has good support for it.

In my testing, for instance, byLine with a non-ASCII delimiter didn't 
work at all.

On Windows 64-bit, MSVC simply ignores any attempts to change the width 
of the stream.

I wouldn't hold out much hope for this to be fixed.

-Steve

Jan 05 2017
