digitalmars.D.learn - Strange behavior in console with UTF-8

Jonathan Villa <jv_vortex msn.com> writes:
I prefer to post this here because it could be that I'm doing 
something wrong.

I'm using std.stdio -> readln() to read whatever I type in 
the console.
BUT, if the line contains any non-ASCII (multi-byte UTF-8) characters, the data 
obtained is EMPTY, and the readln() calls that follow return without waiting for input:

<code>

module runnable;

import std.stdio;
import std.string : chomp;
import std.experimental.logger;

void doSomethingElse(wchar[] data)
{
     writeln("hello!");
}

int main(string[] args)
{
     /* A fix I found for UTF-8 related problems; I'm using Windows 10 */
     version(Windows)
     {
         import core.sys.windows.windows;
         if (SetConsoleCP(65001) == 0)
             throw new Exception("failure");
         if (SetConsoleOutputCP(65001) == 0)
             throw new Exception("failure");
     }
     FileLogger fl = new FileLogger("log.log");
     wchar[] readerBuffer;

     readln(readerBuffer);
     readerBuffer = chomp(readerBuffer);

     fl.info(readerBuffer.length); /* <- if the string read contains at least
                                         one non-ASCII (UTF-8) char this prints 0,
                                         else it prints its length */

     if (readerBuffer != "exit"w)
         doSomethingElse(readerBuffer);

     /* Also, none of the following code runs as expected: the program
        doesn't wait for you, it executes readln() even without
        pressing/sending a key */
     readln(readerBuffer);
     fl.info(readerBuffer.length);
     readln(readerBuffer);
     fl.info(readerBuffer.length);
     readln(readerBuffer);
     fl.info(readerBuffer.length);
     readln(readerBuffer);
     fl.info(readerBuffer.length);
     readln(readerBuffer);
     fl.info(readerBuffer.length);

     return 0;
}
</code>
The real code is bigger, but this reproduces the bug. Printing UTF-8 
works without problems; only reading fails.

My main problem is that the line is going to be sent through a TCP 
socket, and I want it to work with UTF-8. I'm using wchar 
instead of char in the hope of having fewer problems in the future.

If you comment out the Windows-specific fix, the program crashes:
http://prntscr.com/ajmy14

I also tried stdin.flush() right after the first readln(), but 
it doesn't fix anything.

Am I doing something wrong?
Many thanks.
Mar 24 2016
Ali Çehreli <acehreli yahoo.com> writes:
On 03/24/2016 05:54 PM, Jonathan Villa wrote:

 I'm using WCHAR instead of CHAR
 with the hope to get less problems in the future.

Try char:

    char[] readerBuffer;

 Also I tried stdin.flush()

flush() has no effect on input streams.

Ali
Mar 24 2016
Jonathan Villa <jv_vortex msn.com> writes:
On Friday, 25 March 2016 at 01:03:06 UTC, Ali Çehreli wrote:
 On 03/24/2016 05:54 PM, Jonathan Villa wrote:

 Try char:

     char[] readerBuffer;
 flush() has no effect on input streams.

 Ali
Thanks for the quick reply. Unfortunately, it behaves exactly as it did with wchar.
Mar 24 2016
Jonathan Villa <jv_vortex msn.com> writes:
On Friday, 25 March 2016 at 01:03:06 UTC, Ali Çehreli wrote:

 Try char:

     char[] readerBuffer;

 Ali
I also tried with dchar ... there's no change.
Mar 24 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 3/24/16 8:54 PM, Jonathan Villa wrote:
 [...]
D's File i/o uses C's FILE * i/o system. At least on Windows, this has literally zero support for wchar (you can set stream width, and the library just ignores it).

What is likely happening is that it is putting the char code units into wchar buffer directly, which is not what you want. I am not certain of this cause, but I would steer clear of any i/o that is not char-based.

What you can do is read into a char buffer, and then re-encode using std.conv.to to get wchar strings if you need that.

-Steve
Mar 25 2016
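
A minimal sketch of the suggestion above (read into a char buffer, then re-encode with std.conv.to), assuming the input stream really delivers valid UTF-8. The helper name toUtf16Line is hypothetical and not from the thread. It only illustrates the re-encoding step; it is not claimed to cure the Windows console behaviour reported in this thread.

```d
import std.conv : to;
import std.stdio : readln, writeln;
import std.string : chomp;

/// Hypothetical helper: strip the trailing line terminator, then
/// transcode the UTF-8 code units to a UTF-16 string.
wstring toUtf16Line(const(char)[] line)
{
    return line.chomp.to!wstring;
}

void main()
{
    char[] buf;                 // char-based buffer, reused by readln
    while (readln(buf) > 0)     // readln returns 0 at end of input
    {
        wstring w = toUtf16Line(buf);
        writeln(w.length, " UTF-16 code unit(s): ", w);
    }
}
```

Keeping all i/o char-based and converting afterwards means the width of the buffer type never reaches the C FILE * layer.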
Jonathan Villa <jv_vortex msn.com> writes:
On Friday, 25 March 2016 at 13:58:44 UTC, Steven Schveighoffer 
wrote:
 On 3/24/16 8:54 PM, Jonathan Villa wrote:
 [...]
 What you can do is read into a char buffer, and then re-encode
 using std.conv.to to get wchar strings if you need that.

It's the same thing Ali suggested (if I understand it right), and the behaviour is the same. A single UTF-8 char in the input is enough to reproduce the mess, regardless of the char type you use.

JV
Mar 25 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 3/25/16 6:47 PM, Jonathan Villa wrote:
 On Friday, 25 March 2016 at 13:58:44 UTC, Steven Schveighoffer wrote:
 [...]
 It's the same thing Ali suggested (if I understand it right), and the
 behaviour is the same.

 A single UTF-8 char in the input is enough to reproduce the mess,
 regardless of the char type you use.

At this point, I think knowing exactly what input you are sending would be helpful. Can you attach a file which has the input that causes the error? Or just paste the input into your post.

-Steve
Mar 26 2016
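
One way to see exactly what input arrives is to dump the code units that readln() returns. The following is only a diagnostic sketch; the helper name hexDump is hypothetical and not from the thread. For reference, á (U+00E1) encoded as UTF-8 is the two bytes C3 A1, so a correctly delivered á should show up as C3 A1 followed by the line terminator.

```d
import std.algorithm : map;
import std.array : join;
import std.format : format;
import std.stdio : readln, stderr;
import std.string : representation;

/// Hypothetical helper: render each code unit of a line as two hex
/// digits, separated by spaces.
string hexDump(const(char)[] line)
{
    return line.representation
               .map!(b => format("%02X", b))
               .join(" ");
}

void main()
{
    char[] buf;
    auto n = readln(buf);       // number of code units read, 0 at end of input
    stderr.writeln("read ", n, " code unit(s): ", hexDump(buf));
}
```

Writing the dump to stderr (or to a log file, as the original program does) keeps the diagnosis independent of how the console renders non-ASCII output.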
Jonathan Villa <jv_vortex msn.com> writes:
On Saturday, 26 March 2016 at 16:34:34 UTC, Steven Schveighoffer 
wrote:
 On 3/25/16 6:47 PM, Jonathan Villa wrote:
 [...]
 At this point, I think knowing exactly what input you are 
 sending would be helpful. Can you attach a file which has the 
 input that causes the error? Or just paste the input into your 
 post.

 -Steve

I've tested the following chars: á, é, í, ó, ú, ñ, à, è, ì, ò, ù. Any one of those is enough to reproduce the behaviour.

JV
Mar 27 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 3/27/16 12:04 PM, Jonathan Villa wrote:
 On Saturday, 26 March 2016 at 16:34:34 UTC, Steven Schveighoffer wrote:
 [...]
 I've tested the following chars: á, é, í, ó, ú, ñ, à, è, ì, ò, ù. Any one
 of those is enough to reproduce the behaviour.

I can reproduce your issue on Windows.

It works on Mac OS X.

I see different behavior on 32-bit (DMC stdlib) vs. 64-bit (MSVC stdlib). On both, the line is not read properly (I get a length of 0). On 32-bit, the program exits immediately, indicating it cannot read any more data.

On 64-bit, the program continues to allow input.

I don't think this is normal behavior, and it should be filed as a bug. I'm not a Windows developer normally, but I would guess this is an issue with the Windows flavors of readln.

Please file here: https://issues.dlang.org under the Phobos component.

-Steve
Mar 28 2016
Jonathan Villa <jv_vortex msn.com> writes:
On Monday, 28 March 2016 at 18:28:33 UTC, Steven Schveighoffer 
wrote:
 On 3/27/16 12:04 PM, Jonathan Villa wrote:
 [...]
 Please file here: https://issues.dlang.org under the Phobos 
 component.

 -Steve

OK, I'm going to file it with your data. Thanks.

JV
Mar 28 2016
Jonathan Villa <jv_vortex msn.com> writes:
On Saturday, 26 March 2016 at 16:34:34 UTC, Steven Schveighoffer 
wrote:
 On 3/25/16 6:47 PM, Jonathan Villa wrote:
 At this point, I think knowing exactly what input you are 
 sending would be helpful. Can you attach a file which has the 
 input that causes the error? Or just paste the input into your 
 post.

 -Steve
I've tested on Debian 4.2 x64 using the char type, and it behaves correctly without any problems. Clearly this bug must be something related to the Windows console.

Here's the behaviour in Windows 10 x64: http://prntscr.com/akskt1

And here it is in Debian x64 4.2: http://prntscr.com/akskjw

JV
Mar 27 2016