digitalmars.D.learn - how to localize console and GUI apps in Windows

Andrei <aalub mail.ru> writes:
There is one everlasting problem with writing Cyrillic programs in 
Windows: Microsoft successively introduced two quite different code 
pages for Russia and other Cyrillic-alphabet countries: first 
MSDOS-866 (and the like), then Windows-1251. Nowadays MS Windows 
uses the first code page for console programs and the second for 
GUI applications, and many workarounds are always needed to get 
proper translation between them. Usually a programmer has to write 
program sources in one code page for the console and in another 
for the GUI, or use .NET, which basically uses UTF8 in sources and 
does seamless translation depending on the back end.

In the D language, which uses only UTF8 for string encoding, I can 
write program texts neither in the MS866 code page nor in 
Windows-1251: both cases end in a compiler error like "Invalid 
trailing code unit" or "Outside Unicode code space". And writing 
Cyrillic strings in UTF8 format is fatal for both console and GUI 
Windows targets.

My question is: is there any standard means in the D libraries to 
translate Cyrillic or any other localized UTF8 strings for console 
and GUI output? If so, where can I get more information and a good 
example? Google did not help.

Thanks.
Dec 28 2017
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Dec 28, 2017 at 05:56:32PM +0000, Andrei via Digitalmars-d-learn wrote:
 There is one everlasting problem with writing Cyrillic programs in
 Windows: Microsoft successively introduced two quite different code
 pages for Russia and other Cyrillic-alphabet countries: first
 MSDOS-866 (and the like), then Windows-1251. Nowadays MS Windows uses
 the first code page for console programs and the second for GUI
 applications, and many workarounds are always needed to get proper
 translation between them. Usually a programmer has to write program
 sources in one code page for the console and in another for the GUI,
 or use .NET, which basically uses UTF8 in sources and does seamless
 translation depending on the back end.
 
 In the D language, which uses only UTF8 for string encoding, I can
 write program texts neither in the MS866 code page nor in
 Windows-1251: both cases end in a compiler error like "Invalid
 trailing code unit" or "Outside Unicode code space". And writing
 Cyrillic strings in UTF8 format is fatal for both console and GUI
 Windows targets.
 
 My question is: is there any standard means in the D libraries to
 translate Cyrillic or any other localized UTF8 strings for console
 and GUI output? If so, where can I get more information and a good
 example? Google did not help.
[...]

The string / wstring / dstring types in D are intended to be Unicode
strings. If you need to use other encodings, you really should be using
ubyte[] or const(ubyte)[] or immutable(ubyte)[], instead of string.

One approach is to use UTF-8 in your code, and only translate to one of
the code pages when you need to produce output. I wrote a small module
for translating to/from KOI8-R when dealing with Russian text; you might
find it helpful:

-------------------------------------------------------------------------------
/**
 * Module to convert between UTF and KOI8-R
 */
module koi8r;

import std.string;
import std.range;

static immutable ubyte[0x450 - 0x410] utf2koi8r = [
    225, 226, 247, 231, 228, 229, 246, 250, // АБВГДЕЖЗ
    233, 234, 235, 236, 237, 238, 239, 240, // ИЙКЛМНОП
    242, 243, 244, 245, 230, 232, 227, 254, // РСТУФХЦЧ
    251, 253, 255, 249, 248, 252, 224, 241, // ШЩЪЫЬЭЮЯ
    193, 194, 215, 199, 196, 197, 214, 218, // абвгдежз
    201, 202, 203, 204, 205, 206, 207, 208, // ийклмноп
    210, 211, 212, 213, 198, 200, 195, 222, // рстуфхцч
    219, 221, 223, 217, 216, 220, 192, 209  // шщъыьэюя
];

/**
 * Translates a range of UTF characters into KOI8-R characters.
 * Returns: Range of KOI8-R characters (as ubyte).
 */
auto toKOI8r(R)(R range)
    if (isInputRange!R && is(ElementType!R : dchar))
{
    static struct Result
    {
        R _range;

        @property bool empty() { return _range.empty; }

        @property ubyte front()
        {
            dchar ch = _range.front;

            // ASCII
            if (ch < 128)
                return cast(ubyte)ch;

            // Primary alphabetic range
            if (ch >= 0x410 && ch < 0x450)
                return utf2koi8r[ch - 0x410];

            // Special case: Ё and ё are outside the usual range.
            if (ch == 0x401) return 179;
            if (ch == 0x451) return 163;

            throw new Exception(
                "Encoding error: unable to convert '%c' to KOI8-R".format(ch));
        }

        void popFront() { _range.popFront(); }

        static if (isForwardRange!R)
        {
            @property Result save()
            {
                Result copy;
                copy._range = _range.save;
                return copy;
            }
        }
    }
    return Result(range);
}

unittest
{
    import std.string;
    import std.algorithm : equal;

    assert("юабцдефгхийклмнопярстужвьызшэщчъ".toKOI8r.equal(iota(192, 224)));
    assert("ЮАБЦДЕФГХИЙКЛМНОПЯРСТУЖВЬЫЗШЭЩЧЪ".toKOI8r.equal(iota(224, 256)));
}

unittest
{
    auto r = "abc абв".toKOI8r;
    static assert(isForwardRange!(typeof(r)));

    import std.algorithm.comparison : equal;
    assert(r.equal(['a', 'b', 'c', ' ', 193, 194, 215]));
}

static dchar[0x100 - 0xC0] koi8r2utf = [
    'ю', 'а', 'б', 'ц', 'д', 'е', 'ф', 'г', // 192-199
    'х', 'и', 'й', 'к', 'л', 'м', 'н', 'о', // 200-207
    'п', 'я', 'р', 'с', 'т', 'у', 'ж', 'в', // 208-215
    'ь', 'ы', 'з', 'ш', 'э', 'щ', 'ч', 'ъ', // 216-223
    'Ю', 'А', 'Б', 'Ц', 'Д', 'Е', 'Ф', 'Г', // 224-231
    'Х', 'И', 'Й', 'К', 'Л', 'М', 'Н', 'О', // 232-239
    'П', 'Я', 'Р', 'С', 'Т', 'У', 'Ж', 'В', // 240-247
    'Ь', 'Ы', 'З', 'Ш', 'Э', 'Щ', 'Ч', 'Ъ'  // 248-255
];

/**
 * Translates a range of KOI8-R characters to UTF.
 * Returns: Range of UTF characters (as dchar).
 */
auto fromKOI8r(R)(R range)
    if (isInputRange!R && is(ElementType!R : ubyte))
{
    static struct Result
    {
        R _range;

        @property bool empty() { return _range.empty; }

        @property dchar front()
        {
            ubyte b = _range.front;
            if (b < 128)
                return b;
            if (b >= 192)
                return koi8r2utf[b - 192];

            switch (b)
            {
                case 128: return '─';
                case 152: return '≤';
                case 153: return '≥';
                case 163: return 'ё';
                case 179: return 'Ё';
                default:
                    import std.string : format;
                    throw new Exception(
                        "KOI8-R character %d not implemented yet".format(b));
            }
        }

        void popFront() { _range.popFront(); }

        static if (isForwardRange!R)
        {
            @property Result save()
            {
                Result copy;
                copy._range = _range.save;
                return copy;
            }
        }
    }
    return Result(range);
}

unittest
{
    import std.algorithm.comparison : equal;

    ubyte[] lower = [
        193, 194, 215, 199, 196, 197, 163, 214, 218, 201, 202,
        203, 204, 205, 206, 207, 208, 210, 211, 212, 213, 198,
        200, 195, 222, 219, 221, 223, 217, 216, 220, 192, 209
    ];
    assert(lower.fromKOI8r.equal("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"));

    ubyte[] upper = [
        225, 226, 247, 231, 228, 229, 179, 246, 250, 233, 234,
        235, 236, 237, 238, 239, 240, 242, 243, 244, 245, 230,
        232, 227, 254, 251, 253, 255, 249, 248, 252, 224, 241
    ];
    assert(upper.fromKOI8r.equal("АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"));
}
-------------------------------------------------------------------------------

As the unittests show, you just call toKOI8r or fromKOI8r to translate
between encodings. All non-Unicode strings are traded as ubyte[], so
that you won't accidentally mix up a Unicode string with a KOI8-R
string. And the code should be straightforward enough to be adapted for
other encodings as well.

Hope this helps.

T

-- 
For every argument for something, there is always an equal and opposite argument against it. Debates don't give answers, only wounded or inflated egos.
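
To illustrate the remark that the module can be adapted for other encodings, here is a minimal sketch for CP866, the console code page mentioned in the original question. It is not part of the module above: the names toCP866 and fromCP866 are made up for this example, the functions are eager rather than lazy ranges, and only ASCII plus the Russian letters are handled (box-drawing and other CP866 characters are left out).

```d
import std.format : format;

/// Encodes a UTF-8 string as CP866 bytes (ASCII and Russian letters only).
ubyte[] toCP866(string s)
{
    ubyte[] result;
    foreach (dchar ch; s) // foreach over dchar decodes the UTF-8 input
    {
        if (ch < 0x80)
            result ~= cast(ubyte) ch;                       // ASCII
        else if (ch >= 0x410 && ch <= 0x43F)
            result ~= cast(ubyte)(ch - 0x410 + 0x80);       // А..Я, а..п
        else if (ch >= 0x440 && ch <= 0x44F)
            result ~= cast(ubyte)(ch - 0x440 + 0xE0);       // р..я
        else if (ch == 0x401)
            result ~= 0xF0;                                 // Ё
        else if (ch == 0x451)
            result ~= 0xF1;                                 // ё
        else
            throw new Exception(
                format("cannot encode U+%04X in this CP866 sketch", cast(uint) ch));
    }
    return result;
}

/// Decodes CP866 bytes (ASCII and Russian letters only) into a UTF-8 string.
string fromCP866(const(ubyte)[] bytes)
{
    string result;
    foreach (b; bytes)
    {
        dchar ch;
        if (b < 0x80)
            ch = cast(dchar) b;
        else if (b <= 0xAF)
            ch = cast(dchar)(0x410 + (b - 0x80));
        else if (b >= 0xE0 && b <= 0xEF)
            ch = cast(dchar)(0x440 + (b - 0xE0));
        else if (b == 0xF0)
            ch = cast(dchar) 0x401;
        else if (b == 0xF1)
            ch = cast(dchar) 0x451;
        else
            throw new Exception(
                format("CP866 byte %d is not handled in this sketch", b));
        result ~= ch; // appending a dchar to a string encodes it as UTF-8
    }
    return result;
}
```

As with the KOI8-R module, the encoded text is kept as ubyte[] so it cannot be mistaken for a Unicode string.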
Dec 28 2017
Andrei <aalub mail.ru> writes:
On Thursday, 28 December 2017 at 18:45:39 UTC, H. S. Teoh wrote:
 On Thu, Dec 28, 2017 at 05:56:32PM +0000, Andrei via 
 Digitalmars-d-learn wrote:
 ...
 The string / wstring / dstring types in D are intended to be 
 Unicode strings.  If you need to use other encodings, you 
 really should be using ubyte[] or const(ubyte)[] or 
 immutable(ubyte)[], instead of string.
Thank you Teoh for the advice and the good example! I was looking towards writing something like that if no solution exists.

Still, this way of deliberate translation does not seem to be the best. It requires an explicit workaround for every little thing in Russian and constant conversion from ubyte[] to string and back. No formatting gems, no simple and elegant I/O statements or string/char comparisons. This may be endurable if you write an application where Russian is only one of several rarely used options, but what if your whole environment is totally Russian? Or some other non-ASCII locale... Many other cultures have the same mix of DOS/Windows/Unix code pages.

The decision to use only Unicode for strings in the D language seems excellent precisely because of this, but the realization turns out to be disappointing. Folks in such countries won't appreciate a language which is elegant only for English-language communication. This problem is common to most programming languages and runtimes I know of. The only system which has solved the whole problem is .NET, I think.

The way proposed by zabruk70 below seems more appropriate, though more specific too: I feel it suits only console applications. Alas, this type of application proved to be buggy too.

On Thursday, 28 December 2017 at 22:49:30 UTC, zabruk70 wrote:
 you can just set console CP to UTF-8:

 https://github.com/CyberShadow/ae/blob/master/sys/console.d
Yes! This seems to be what is required, thank you very much! Though it is not suitable for a GUI type of Windows application.

Still, some testing showed that this way handles only console output. The simple read/write/compare script listed below works very well until the user enters something in Russian. It then prints an **empty** response and falls into an infinite loop, printing the prompt and then immediately an empty response without actually reading anything. But I think this is a subject for the "Issues" part of this forum.

p.s. I've found that I can set the "Consolas" font for a console window and then output properly localized UTF8 strings without any special code in the D script for managing code pages. Still, this does not solve the localized input problem: any localized input throws an exception "std.utf.UTFException... Invalid UTF-8 sequence".

The script:

import core.sys.windows.windows;
import std.stdio;
import std.string;

int main(string[] args)
{
    const UTF8CP = 65001;
    UINT oldCP, oldOutputCP;
    oldCP = GetConsoleCP();
    oldOutputCP = GetConsoleOutputCP();

    SetConsoleCP(UTF8CP);
    SetConsoleOutputCP(UTF8CP);

    writeln("hello world, привет всем!");

    bool quit = false;
    string response;
    while (!quit)
    {
        write("responde something: ");
        response = readln().strip();
        writefln("your response is \"%s\"", response);
        if (response == "quit")
        {
            writeln("good buy then!");
            quit = true;
        }
    }

    SetConsoleCP(oldCP);
    SetConsoleOutputCP(oldOutputCP);
    return 0;
}
Dec 29 2017
zabruk70 <sorry noem.ail> writes:
On Friday, 29 December 2017 at 10:35:53 UTC, Andrei wrote:
 Though it is not suitable for GUI type of a Windows application.
AFAIK, the Windows GUI has no ANSI/OEM problem. You can use Unicode.

For the Windows ANSI/OEM problem you can also use https://dlang.org/phobos/std_windows_charset.html
Dec 29 2017
Andrei <aalub mail.ru> writes:
On Friday, 29 December 2017 at 11:14:39 UTC, zabruk70 wrote:
 On Friday, 29 December 2017 at 10:35:53 UTC, Andrei wrote:
 Though it is not suitable for GUI type of a Windows 
 application.
AFAIK, Windows GUI have no ANSI/OEM problem. You can use Unicode.
Partly, yes. Just as a test I tried to "russify" the example Windows GUI program that comes with the D installation pack (samples\d\winsamp.d). Window captions, button captions and message box texts written in UTF8 all show fine. But the direct text output functions CreateFont()/TextOut() render all Cyrillic from UTF8 strings as garbage.
 For Windows ANSI/OEM problem you can use also
 https://dlang.org/phobos/std_windows_charset.html
Thank you very much, toMBSz() does the required translation for the TextOut() function, with some workarounds.
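
For reference, a minimal sketch of how the std.windows.charset functions mentioned here could be wired up. The helper names textOutAnsi and oemToUtf8 are invented for this example, and the choice of code pages 0 (system ANSI) and 1 (system OEM) assumes a Russian Windows installation where they map to 1251 and 866. The code is Windows-only, so it is wrapped in version (Windows).

```d
version (Windows)
{
    import core.stdc.string : strlen;
    import core.sys.windows.windows;
    import std.windows.charset : fromMBSz, toMBSz;

    /// Draws UTF-8 text through the ANSI entry point by converting it first.
    void textOutAnsi(HDC hdc, int x, int y, string utf8Text)
    {
        // Code page 0 (CP_ACP) selects the system ANSI code page,
        // which is assumed to be Windows-1251 here.
        const(char)* ansi = toMBSz(utf8Text, 0);
        TextOutA(hdc, x, y, ansi, cast(int) strlen(ansi));
    }

    /// Converts OEM (console) bytes, assumed to be CP866, to a UTF-8 string.
    string oemToUtf8(const(char)[] oemBytes)
    {
        // fromMBSz expects a zero-terminated string;
        // code page 1 (CP_OEMCP) selects the system OEM code page.
        return fromMBSz((oemBytes ~ '\0').idup.ptr, 1);
    }
}
```

Characters that the target code page cannot represent are lost in the toMBSz direction, which is one reason the wide-character API is usually preferable for GUI output.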
Jan 02
thedeemon <dlang thedeemon.com> writes:
On Wednesday, 3 January 2018 at 06:42:42 UTC, Andrei wrote:
 AFAIK, Windows GUI have no ANSI/OEM problem.
 You can use Unicode.
Partly, yes. Just for a test I tried to "russify" the example Windows GUI program that comes with D installation pack (samples\d\winsamp.d). Window captions, button captions, message box texts written in UTF8 all shows fine. But direct text output functions CreateFont()/TextOut() render all Cyrillic from UTF8 strings into garbage.
The Windows API contains two sets of functions: those whose names end with A (meaning ANSI) and those whose names end with W (wide characters, meaning Unicode). The sample uses TextOutA, a function that expects an 8-bit encoding. Properly, you need to use TextOutW, which accepts 16-bit Unicode, so just convert your UTF-8 D strings to 16-bit Unicode wstrings; there are appropriate conversion functions in Phobos.
Jan 03
thedeemon <dlang thedeemon.com> writes:
On Wednesday, 3 January 2018 at 09:11:32 UTC, thedeemon wrote:
 you need to use TextOutW that accepts 16-bit Unicode, so just 
 convert your UTF-8 D strings to 16-bit Unicode wstrings, there 
 are appropriate conversion functions in Phobos.
Some details:

import std.utf : toUTF16z;
...
string s = "привет";
TextOutW(s.toUTF16z);
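
The snippet above leaves out the remaining TextOutW arguments. A fuller sketch, assuming a device context is already available (for example inside a WM_PAINT handler), might look like the following; textOutUtf8 is a hypothetical helper name, and toUTF16 is used instead of toUTF16z because TextOutW takes an explicit length.

```d
version (Windows)
{
    import core.sys.windows.windows;
    import std.utf : toUTF16;

    /// Draws a UTF-8 D string at (x, y) using the wide-character GDI call.
    void textOutUtf8(HDC hdc, int x, int y, string utf8Text)
    {
        // Convert UTF-8 to UTF-16; the length is counted in UTF-16 code units.
        wstring wide = utf8Text.toUTF16;
        TextOutW(hdc, x, y, wide.ptr, cast(int) wide.length);
    }
}
```

A call such as textOutUtf8(hdc, 10, 10, "привет") would then replace the TextOutA call in the sample.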
Jan 03
Andrei <aalub mail.ru> writes:
On Wednesday, 3 January 2018 at 09:11:32 UTC, thedeemon wrote:
 Windows API contains two sets of functions: those whose names 
 end with A (meaning ANSI), the other where names end with W 
 (wide characters, meaning Unicode). The sample uses TextOutA, 
 this function that expects 8-bit encoding.
Gosh, I should have known this :)) Thanks for the pointer! TextOutW() works fine with wstring texts in this example and no more changes are needed. That's just enough for this example. Thank you!

Yet my particular interest is console interaction. With the help of this forum I've learned the console settings needed to write Cyrillic properly and simply to the console using UTF8 encoding. One thing that remains is to read and process the user's input.

For now, in the example I've cited above, the response=readln(); statement returns an empty string, in a console set to the UTF8 code page, if the user's input contains any Cyrillic letters. Then the program's behavior differs depending on the compiler (or more likely on the runtime library): the one compiled with ldc continues to read and returns empty lines instead of the user's input, and the one compiled with dmd only returns empty lines, not waiting for the user's input and not actually reading anything (i.e. it falls into an infinite loop, busily printing an empty response hundreds of times a second).

That's only for localized input. With ASCII input the same program works fine.

Maybe there are some more settings I must learn to make the console properly read non-ASCII input?
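
One workaround that sidesteps the console input code page altogether is to read UTF-16 directly with the Windows API and convert it to UTF-8. The following is only a sketch under stated assumptions: readConsoleLine is a made-up helper name, the 1024-character buffer is an arbitrary limit, and ReadConsoleW works only when standard input is a real console, not a redirected file or pipe.

```d
version (Windows)
{
    import core.sys.windows.windows;
    import std.string : stripRight;
    import std.utf : toUTF8;

    /// Reads one line from the console as UTF-16 and returns it as UTF-8.
    string readConsoleLine()
    {
        HANDLE input = GetStdHandle(STD_INPUT_HANDLE);
        wchar[1024] buffer; // longer lines are cut off in this sketch
        DWORD charsRead;

        if (!ReadConsoleW(input, buffer.ptr, cast(DWORD) buffer.length,
                          &charsRead, null))
            throw new Exception("ReadConsoleW failed (is stdin redirected?)");

        // The console appends "\r\n" to the line; stripRight removes it.
        return buffer[0 .. charsRead].toUTF8.stripRight;
    }
}
```

In the script above, response = readConsoleLine(); would take the place of readln().strip().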
Jan 03
Martin Krejcirik <mk-junk i-line.cz> writes:
On Friday, 29 December 2017 at 11:14:39 UTC, zabruk70 wrote:
 AFAIK, Windows GUI have no ANSI/OEM problem.
 You can use Unicode.
Be advised that there are some problems with console UTF-8 input/output in Windows. The most usable option is the new Win10 console window, but I recommend using the Windows API (WriteConsole) instead. It works correctly regardless of the code page setting, OS version and C library.
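
A minimal sketch of the WriteConsole approach, assuming standard output is attached to a console (WriteConsoleW fails when output is redirected to a file or pipe); the helper name writeConsole is invented for this example.

```d
version (Windows)
{
    import core.sys.windows.windows;
    import std.utf : toUTF16;

    /// Writes a UTF-8 D string to the console as UTF-16,
    /// independent of the console code page.
    void writeConsole(string utf8Text)
    {
        HANDLE output = GetStdHandle(STD_OUTPUT_HANDLE);
        wstring wide = utf8Text.toUTF16;
        DWORD written;
        // The length is given in UTF-16 code units, not bytes.
        WriteConsoleW(output, wide.ptr, cast(DWORD) wide.length,
                      &written, null);
    }
}
```

A program that must also work with redirected output would need to fall back to writeln or similar when the call returns zero.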
Jan 03
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Dec 29, 2017 at 10:35:53AM +0000, Andrei via Digitalmars-d-learn wrote:
 On Thursday, 28 December 2017 at 18:45:39 UTC, H. S. Teoh wrote:
 On Thu, Dec 28, 2017 at 05:56:32PM +0000, Andrei via Digitalmars-d-learn
 wrote:
 ...
 The string / wstring / dstring types in D are intended to be Unicode
 strings.  If you need to use other encodings, you really should be
 using ubyte[] or const(ubyte)[] or immutable(ubyte)[], instead of
 string.
Thank you Teoh for the advice and the good example! I was looking towards writing something like that if no solution exists. Still, this way of deliberate translation does not seem to be the best. It requires an explicit workaround for every little thing in Russian and constant conversion from ubyte[] to string and back. No formatting gems, no simple and elegant I/O statements or string/char comparisons. This may be endurable if you write an application where Russian is only one of several rarely used options, but what if your whole environment is totally Russian?
You mean if your environment uses a non-UTF encoding? If your environment uses UTF, there is no problem. I have code with strings in Russian (and other languages) embedded, and it's no problem because everything is in Unicode, all input and all output.

But I understand that in Windows you may not have this luxury. So you have to deal with codepages and what-not.

Converting back and forth is not a big problem, and it actually also solves the problem of string comparisons, because std.uni provides utilities for collating strings, etc. But it only works for Unicode, so you have to convert to Unicode internally anyway. Also, for static strings, it's not hard to make the codepage mapping functions CTFE-able, so you can actually write string literals in a codepage and have the compiler automatically convert it to UTF-8.

The other approach, if you don't like the idea of converting codepages all the time, is to explicitly work in ubyte[] for all strings. Or, preferably, create your own string type with ubyte[] representation underneath, and implement your own comparison functions, etc., then use this type for all strings. Better yet, contribute this to code.dlang.org so that others who have the same problem can reuse your code instead of needing to write their own.

[...]
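
As a small illustration of a CTFE-able mapping function, the sketch below converts a UTF-8 literal to Windows-1251 bytes at compile time (the opposite direction works the same way). The function name toCP1251 and the variable greeting are made up for this example, and only ASCII plus the Russian letters are handled.

```d
/// Encodes UTF-8 text as Windows-1251 bytes (ASCII and Russian letters only).
/// The function is simple enough to be evaluated at compile time.
ubyte[] toCP1251(string s) pure
{
    ubyte[] result;
    foreach (dchar ch; s) // decodes the UTF-8 input code point by code point
    {
        if (ch < 0x80)
            result ~= cast(ubyte) ch;                   // ASCII
        else if (ch >= 0x410 && ch <= 0x44F)
            result ~= cast(ubyte)(ch - 0x410 + 0xC0);   // А..я
        else if (ch == 0x401)
            result ~= 0xA8;                             // Ё
        else if (ch == 0x451)
            result ~= 0xB8;                             // ё
        else
            throw new Exception("character not handled by this sketch");
    }
    return result;
}

// A static initializer forces compile-time evaluation, so the binary
// contains the converted bytes and no conversion happens at run time.
static immutable ubyte[] greeting = toCP1251("Привет");

static immutable ubyte[] expectedGreeting = [0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2];
static assert(greeting == expectedGreeting);
```

A literal containing a character outside the handled set turns into a compile error, because the exception is raised during compile-time evaluation.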
 p.s. I’ve found that I may set “Consolas” font for a console window
 and then you can output properly localized UTF8 strings without any
 special code in D script managing code pages. Still this does not
 decide localized input problem: any localized input throws an
 exception “std.utf.UTFException...  Invalid UTF-8 sequence”.
Is the exception thrown in readln() or in writeln()? If it's in writeln(), it shouldn't be a big deal, you just have to pass the data returned by readln() to fromKOI8 (or whatever other codepage you're using).

If the problem is in readln(), then you probably need to read the input in binary (i.e., as ubyte[]) and convert it manually.

Unfortunately, there's no other way around this if you're forced to use codepages. The ideal situation is if you can just use Unicode throughout your environment. But of course, sometimes you have no choice.

T

-- 
Heuristics are bug-ridden by definition. If they didn't have bugs, they'd be algorithms.
Dec 29 2017
Andrei <aalub mail.ru> writes:
On Friday, 29 December 2017 at 18:13:04 UTC, H. S. Teoh wrote:
 On Fri, Dec 29, 2017 at 10:35:53AM +0000, Andrei via 
 Digitalmars-d-learn wrote:
 This may be endurable if you write an application where 
 Russian is only one of rare options, and what if your whole 
 environment is totally Russian?
You mean if your environment uses a non-UTF encoding? If your environment uses UTF, there is no problem. I have code with strings in Russian (and other languages) embedded, and it's no problem because everything is in Unicode, all input and all output.
No, I mean the difficulties of writing a program based on non-ASCII locales. Learning every programming language since C starts with a "hello world" program, which every non-English programmer naturally tries to translate into their native language, and gets an unreadable mess on the screen. Thousands try, hundreds look for a solution, dozens find it, and a few continue with the new language.

That's not because these programmers cannot read English textbooks; they can. That's because they want to write non-English programs for non-English people, and that's essential.

And there are many programming languages (or rather their runtimes) which do not suffer from such a deficiency. That's the reason for Unicode adoption all over the programming world, including the D language, but what good is it to me if I can write a UTF8 string with text in my native language in a D program and still get the same unreadable mess on the screen?

Yes, a new language in development can lack support for some features, but this forum thread shows that a simple and handy solution exists, yet nobody cares to bring it to the first pages of every textbook for beginners, at least as a footnote. Thus thousands of potential fans of the new language are lost from the start.
 But I understand that in Windows you may not have this luxury. 
 So you have to deal with codepages and what-not.

 Converting back and forth is not a big problem, and it actually 
 also solves the problem of string comparisons, because std.uni 
 provides utilities for collating strings, etc.. But it only 
 works for Unicode, so you have to convert to Unicode internally 
 anyway.  Also, for static strings, it's not hard to make the 
 codepage mapping functions CTFE-able, so you can actually write 
 string literals in a codepage and have the compiler 
 automatically convert it to UTF-8.

 The other approach, if you don't like the idea of converting 
 codepages all the time, is to explicitly work in ubyte[] for 
 all strings. Or, preferably, create your own string type with 
 ubyte[] representation underneath, and implement your own 
 comparison functions, etc., then use this type for all strings. 
 Better yet, contribute this to code.dlang.org so that others 
 who have the same problem can reuse your code instead of 
 needing to write their own.
I'd definitely try this if I decide to use the D language for my purposes (which is not settled yet). But to decide I need some experience, and for now it has stopped at reading the user's input (for training I intend to translate my recent, rather complex interactive C# program into D).
 Still this does not decide localized input problem: any 
 localized input throws an exception “std.utf.UTFException...  
 Invalid UTF-8 sequence”.
Is the exception thrown in readln() or in writeln()? If it's in writeln(), it shouldn't be a big deal, you just have to pass the data returned by readln() to fromKOI8 (or whatever other codepage you're using). If the problem is in readln(), then you probably need to read the input in binary (i.e., as ubyte[]) and convert it manually. Unfortunately, there's no other way around this if you're forced to use codepages. The ideal situation is if you can just use Unicode throughout your environment. But of course, sometimes you have no choice.
It depends. If I skip the proper console code page initialization, I see in the debugger that the runtime reads the user's input as CP866 (MS DOS) Cyrillic and then throws the exception "Invalid UTF-8 sequence" when trying to handle it as a UTF8 string (in particular in the strip() or writeln() functions).

This situation seems quite manageable with the code page conversions you've mentioned above. I've tried the first library function I found (std.windows.charset), and got a rather fanciful working statement:

response = fromMBSz((readln()~"\0").ptr, 1).strip();

which assigns correct Latin/Cyrillic contents to the response variable.

And if I initialize the console with the SetConsoleCP(65001) statement, things get worse, as I've said above. Then the readln() statement returns an empty string and something gets broken inside the runtime, because any further readln() statements do not wait for user input and return empty strings immediately.
Jan 04
Andrei <aalub mail.ru> writes:
On Friday, 29 December 2017 at 18:13:04 UTC, H. S. Teoh wrote:
 If the problem is in readln(), then you probably need to read 
 the input in binary (i.e., as ubyte[]) and convert it manually.
Could you kindly explain how I can read console input into binary ubyte[]?
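
One possible answer, offered only as a sketch: File.rawRead reads bytes without any UTF interpretation, so a line can be collected byte by byte until the newline. The helper name readRawLine is invented here, reading one byte at a time is slow but keeps the example short, and it assumes the console input code page is left at its default (for example CP866), since the thread reports that input breaks once the console is switched to code page 65001.

```d
import std.stdio : File, stdin;

/// Reads one line from the given file as raw bytes, without the line ending.
/// Returns an empty array at end of input.
ubyte[] readRawLine(File input)
{
    ubyte[] line;
    ubyte[1] one;

    while (true)
    {
        auto got = input.rawRead(one[]); // returns an empty slice at EOF
        if (got.length == 0 || one[0] == '\n')
            break;
        line ~= one[0];
    }

    // In binary mode a Windows line ends with "\r\n"; drop the '\r'.
    if (line.length && line[$ - 1] == '\r')
        line = line[0 .. $ - 1];
    return line;
}
```

The bytes returned by readRawLine(stdin) can then be passed to a decoder for the console code page, such as the fromMBSz call shown earlier or a table-driven converter in the style of fromKOI8r.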
Jan 04
zabruk70 <sorry noem.ail> writes:
you can just set console CP to UTF-8:

https://github.com/CyberShadow/ae/blob/master/sys/console.d
Dec 28 2017