digitalmars.D - Windows vs UTF-8 (issue 15845)

ag0aep6g <anonymous example.com> writes:
When trying to make sense of issue 15845 [1], I've found Windows 
behaving outright broken. I don't have a clue about Windows programming, 
though, so it's very possible that I'm just missing something. I'd hope so.

Code:
----
import std.stdio;
import std.exception: enforce;
import core.sys.windows.windows;

void main()
{
     SetConsoleCP(65001);
     SetConsoleOutputCP(65001);

     uint readBytes;
     ubyte c;
     ReadFile(GetStdHandle(STD_INPUT_HANDLE), &c, 1, &readBytes, null)
         .enforce();
     writeln(readBytes, " ", c);
}
----

This works for ASCII characters. It does not work for non-ASCII 
characters, e.g. 'ü'. ReadFile does not indicate an error, but it also 
doesn't read anything.

I can't find any explanation for this in the documentation for ReadFile 
[2] or via Google. The same happens with -m32, -m64, fgetc, fgets. It 
also happens with equivalent C programs compiled with Visual Studio 2015.

I did find out that this apparently only happens when stdin is 
considered a TTY. According to _isatty [3], stdin is not a TTY when I 
use a pipe for input, e.g. `echo ä | test`, and then it works.

Does this make sense to anyone?


[1] https://issues.dlang.org/show_bug.cgi?id=15845
[2] https://msdn.microsoft.com/en-us/library/aa365467.aspx (ReadFile)
[3] https://msdn.microsoft.com/en-us/library/f4s0ddew.aspx (_isatty)
Apr 03 2016
Martin Krejcirik <mk-junk i-line.cz> writes:
On Sunday, 3 April 2016 at 21:55:39 UTC, ag0aep6g wrote:
> Does this make sense to anyone?
Using UTF8 console via C api is broken in many ways on Windows. The problem is in C library. The only sensible way is to use Windows API.

Related issues:
https://issues.dlang.org/show_bug.cgi?id=1448
https://issues.dlang.org/show_bug.cgi?id=15761
Apr 03 2016
ag0aep6g <anonymous example.com> writes:
On 04.04.2016 00:07, Martin Krejcirik wrote:
> Using UTF8 console via C api is broken in many ways on Windows. The
> problem is in C library. The only sensible way is to use Windows API.
I'm under the impression that ReadFile is a Windows API function. Is that not so? If it isn't, what is the corresponding Windows API function?
Apr 03 2016
Martin Krejcirik <mk-junk i-line.cz> writes:
On 4. 4. 2016 at 0:11, ag0aep6g wrote:
> On 04.04.2016 00:07, Martin Krejcirik wrote:
> I'm under the impression that ReadFile is a Windows API function. Is
> that not so? If it isn't, what is the corresponding Windows API function?
Sorry, I missed that in your post. Anyway, after years of trying, I've resorted to always converting to/from OEMCP.

You can use the fromMBSz, toMBSz, GetConsoleCP and GetConsoleOutputCP functions for that. -- mk
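
A rough sketch of that conversion, assuming the two-argument forms of fromMBSz and toMBSz from std.windows.charset. The names consoleToUtf8 and utf8ToConsole are hypothetical helpers made up for illustration, and the code only does anything on Windows:

----
version (Windows)
{
    import core.stdc.string : strlen;
    import core.sys.windows.windows;
    import std.windows.charset : fromMBSz, toMBSz;

    // Hypothetical helper: bytes in the console input code page -> UTF-8.
    string consoleToUtf8(const(char)[] raw)
    {
        // fromMBSz expects a zero-terminated immutable string.
        immutable(char)[] z = (raw ~ '\0').idup;
        return fromMBSz(z.ptr, cast(int) GetConsoleCP());
    }

    // Hypothetical helper: UTF-8 -> bytes in the console output code page.
    const(char)[] utf8ToConsole(string s)
    {
        const(char)* p = toMBSz(s, GetConsoleOutputCP());
        return p[0 .. strlen(p)];
    }
}
----

Input read with ReadFile or fgets would go through consoleToUtf8 before any further processing, and output would go through utf8ToConsole just before being written.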
Apr 03 2016
ag0aep6g <anonymous example.com> writes:
On 04.04.2016 00:31, Martin Krejcirik wrote:
> Sorry, I missed that in your post. Anyway, after years of trying, I've
> resorted to always converting to/from OEMCP.
>
> You can use fromMBSz, toMBSz, GetConsoleCP, GetConsoleOutputCP functions
> for that.
I'm not really asking for myself, but more for fixing issue 15845. If reading UTF-8 is broken in Windows and there's no workaround, then issue 15845 can't be fixed, and we should stop telling people to use `chcp 65001` (and don't forget to change the font).
Apr 03 2016
Martin Krejcirik <mk-junk i-line.cz> writes:
On 4. 4. 2016 at 0:52, ag0aep6g wrote:
> reading UTF-8 is broken in Windows and there's no workaround, then issue
> 15845 can't be fixed, and we should stop telling people to use `chcp
> 65001` (and don't forget to change the font).
I think ReadConsole and WriteConsole API functions work with codepage 65001. (Sorry my previous reply went to your email). -- mk
Apr 03 2016
ag0aep6g <anonymous example.com> writes:
On Sunday, 3 April 2016 at 23:29:18 UTC, Martin Krejcirik wrote:
> I think ReadConsole and WriteConsole API functions work with 
> codepage 65001. (Sorry my previous reply went to your email).
Yeah, ReadConsole does work, somewhat. The data comes in as UTF-16, not UTF-8, though. And this time it only works when stdin is a TTY (the opposite of ReadFile). So our reading functions would have to query _isatty and choose ReadFile or ReadConsole depending on the result. When using ReadConsole, they would also have to convert from UTF-16 to UTF-8. At that point it would probably make sense to detect other code pages as well and convert from those to UTF-8. Weird how bad the support for UTF-8 seems to be in Windows.
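
A minimal sketch of that branching, assuming that a successful GetConsoleMode call is an acceptable stand-in for the _isatty check and that one fixed-size read is enough for illustration. The names utf16ToUtf8 and readStdinUtf8 are hypothetical, not Phobos functions, and a real implementation would also have to deal with a surrogate pair split across two reads:

----
import std.utf : toUTF8;

// Hypothetical helper: converts UTF-16 code units to a UTF-8 string.
string utf16ToUtf8(const(wchar)[] units)
{
    return toUTF8(units);
}

version (Windows)
{
    import core.sys.windows.windows;

    // Hypothetical helper: performs one read from stdin and returns the
    // data as UTF-8 (null on failure).
    string readStdinUtf8()
    {
        HANDLE h = GetStdHandle(STD_INPUT_HANDLE);
        DWORD mode, n;
        if (GetConsoleMode(h, &mode))
        {
            // stdin is a console: ReadConsoleW delivers UTF-16 code units.
            wchar[256] wbuf;
            if (!ReadConsoleW(h, wbuf.ptr, cast(DWORD) wbuf.length, &n, null))
                return null;
            return utf16ToUtf8(wbuf[0 .. n]);
        }
        // stdin is a pipe or file: ReadFile delivers the bytes unchanged,
        // assumed here to be UTF-8 already.
        char[256] buf;
        if (!ReadFile(h, buf.ptr, cast(DWORD) buf.length, &n, null))
            return null;
        return buf[0 .. n].idup;
    }
}
----

Converting from other code pages would slot into the ReadFile branch, in place of the assumption that piped input is already UTF-8.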
Apr 03 2016
Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 4 April 2016 at 00:06:28 UTC, ag0aep6g wrote:
> Weird how bad the support for UTF-8 seems to be in Windows.
Windows is more of a utf-16 system. It uses that internally, not utf-8, so conversions are often done anyway.
Apr 03 2016
Kagamin <spam here.lot> writes:
On Monday, 4 April 2016 at 00:06:28 UTC, ag0aep6g wrote:
> Weird how bad the support for UTF-8 seems to be in Windows.
UTF-8 is a newer technology. As early adopters of Unicode (before the Unicode 3.0 standard), Windows, OSX and Java used UCS-2 and later migrated to UTF-16.
Apr 04 2016
Martin Krejcirik <mk-junk i-line.cz> writes:
And convert to non-unicode codepage (OEMCP) ...
Apr 03 2016
Kagamin <spam here.lot> writes:
On Sunday, 3 April 2016 at 22:07:07 UTC, Martin Krejcirik wrote:
> On Sunday, 3 April 2016 at 21:55:39 UTC, ag0aep6g wrote:
>> Does this make sense to anyone?
> Using UTF8 console via C api is broken in many ways on Windows. The problem is in C library. The only sensible way is to use Windows API. Related issues: https://issues.dlang.org/show_bug.cgi?id=1448 https://issues.dlang.org/show_bug.cgi?id=15761
Last I checked Walter insisted that D I/O should be compatible with C I/O.
Apr 04 2016
Adam D. Ruppe <destructionator gmail.com> writes:
On Sunday, 3 April 2016 at 21:55:39 UTC, ag0aep6g wrote:
>     ReadFile(GetStdHandle(STD_INPUT_HANDLE), &c, 1, &readBytes, null).enforce();
I'm not sure if this is it or not, but you are asking for only one byte here, but giving it a multibyte sequence. What happens if you give it a 4 char buffer? I imagine it would work fine then in all cases. It seems to for me.

The docs say it returns when "A write operation completes on the write end of the pipe." That's probably what is happening here, and then it doesn't have enough room in your buffer to put the message, so it reads zero.

I'm not sure why it wouldn't return an error though... and it seems to remove the whole message from the buffer anyway... but still, it kinda makes sense that it wouldn't give you the partial input since it needs to be translated as a whole unit.

Regardless though, giving it a bigger buffer should work in all cases and has other benefits too, so that's probably what you should do.
Apr 03 2016
anonymous <anonymous example.com> writes:
On Sunday, 3 April 2016 at 22:48:21 UTC, Adam D. Ruppe wrote:
> What happens if you give it a 4 char buffer? I imagine it would 
> work fine then in all cases. It seems to for me.
Doesn't seem to work for me. The exact code I tested:

----
import std.stdio;
import std.exception: enforce;
import core.sys.windows.windows;

void main()
{
    SetConsoleCP(65001);
    SetConsoleOutputCP(65001);

    uint readBytes;
    ubyte[4] c;
    ReadFile(GetStdHandle(STD_INPUT_HANDLE), c.ptr, c.length, &readBytes,
        null).enforce();
    writeln(readBytes, " ", c[]);
}
----

When I enter "a", it prints "3 [97, 13, 10, 0]". When I enter "ä", it prints "0 [0, 0, 0, 0]".

I've also tried even larger buffers. Same result.
Apr 03 2016
Adam D. Ruppe <destructionator gmail.com> writes:
On Sunday, 3 April 2016 at 23:11:53 UTC, anonymous wrote:
> Doesn't seem to work for me.
Hmm, worked on my desktop but not my laptop... and I have no idea why now. ReadConsoleW works fine though in all my attempts, we should prolly just change the library to use it.
Apr 03 2016
Martin Krejcirik <mk-junk i-line.cz> writes:
On 4. 4. 2016 at 2:03, Adam D. Ruppe wrote:
> ReadConsoleW works fine though in all my attempts, we should prolly just
> change the library to use it.
Probably not, it doesn't work with pipes. Oh well ... -- mk
Apr 03 2016
Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 4 April 2016 at 00:08:54 UTC, Martin Krejcirik wrote:
> Probably not, it doesn't work with pipes. Oh well ...
It is easy to detect that though and branch accordingly.
Apr 03 2016
Martin Krejcirik <mk-junk i-line.cz> writes:
On 4. 4. 2016 at 2:22, Adam D. Ruppe wrote:
> On Monday, 4 April 2016 at 00:08:54 UTC, Martin Krejcirik wrote:
>> Probably not, it doesn't work with pipes. Oh well ...
> It is easy to detect that though and branch accordingly.
I think it's not worth it. If Phobos just converted automatically from the code page to UTF-8 for the std streams, that would be enough. CP 65001 would still not work, but no one would notice. -- mk
Apr 03 2016