digitalmars.D.bugs - [Issue 3193] New: Wrong processing by DMD.exe of Russian Windows-1251 character set: "invalid UTF-8 sequence"

d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=3193

           Summary: Wrong processing by DMD.exe of Russian Windows-1251
                    character set: "invalid UTF-8 sequence"
           Product: D
           Version: unspecified
          Platform: x86
               URL: http://picasaweb.google.ru/ohalzov/DBugs#
        OS/Version: Windows
            Status: NEW
          Keywords: diagnostic, wrong-code
          Severity: critical
          Priority: P2
         Component: DMD
        AssignedTo: nobody puremagic.com
        ReportedBy: ok96 mail.ru


If you compile the hello.d example with Russian Win1251 characters in this line:
 printf("Привет, D!\n");
dmd.exe reports an error:
D:\Apps\Prog_D\dmd\samples\d>dmd hello.d
hello.d(5): invalid UTF-8 sequence
hello.d(5): invalid UTF-8 sequence
hello.d(5): invalid UTF-8 sequence
hello.d(5): invalid UTF-8 sequence
hello.d(5): invalid UTF-8 sequence
hello.d(5): invalid UTF-8 sequence
If you save hello.d in UTF-8, dmd.exe still compiles it wrong (see the URL
link above).

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 20 2009
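
The six identical errors in the report above are consistent with the six Cyrillic letters of "Привет": in Windows-1251 each letter is a single byte at or above 0x80, and none of those bytes is followed by a valid UTF-8 continuation byte. The sketch below is written in Python rather than D and uses Python's own UTF-8 decoder as a stand-in, so it only illustrates the encoding mismatch and says nothing about how DMD's lexer is implemented. `count_invalid_utf8` is a hypothetical helper name.

```python
# Illustration only: Python's UTF-8 decoder stands in for the compiler's check.

def count_invalid_utf8(data: bytes) -> int:
    """Count the malformed sequences a UTF-8 decoder finds in `data`."""
    # errors="replace" substitutes U+FFFD for every malformed sequence.
    # Assumes the text itself contains no genuine U+FFFD characters.
    return data.decode("utf-8", errors="replace").count("\ufffd")

line = 'printf("Привет, D!\\n");'

as_cp1251 = line.encode("cp1251")  # what Notepad saves as "ANSI" on Russian Windows
as_utf8 = line.encode("utf-8")     # the same line saved as UTF-8

print(as_cp1251[8:14].hex(" "))    # the six bytes of "Привет" in Windows-1251
print(count_invalid_utf8(as_cp1251))
print(count_invalid_utf8(as_utf8))
```

Under CPython 3.8 or later, this should report six malformed sequences for the Windows-1251 bytes and none for the UTF-8 bytes.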
d-bugmail puremagic.com writes:

Jarrett Billingsley <jarrett.billingsley gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jarrett.billingsley gmail.c
                   |                            |om




--- Comment #1 from Jarrett Billingsley <jarrett.billingsley gmail.com> 
2009-07-20 06:09:18 PDT ---
The compiler does not understand Windows-1251, so this is according to spec.

However, you say the compiler compiles it wrong if it's in UTF-8; where's the
link?

Jul 20 2009
d-bugmail puremagic.com writes:

--- Comment #2 from Oleg Halzov <ok96 mail.ru>  2009-07-20 06:13:25 PDT ---
Created an attachment (id=428)
 --> (http://d.puremagic.com/issues/attachment.cgi?id=428)
This screenshot is from Chris Miller

Jul 20 2009
d-bugmail puremagic.com writes:

--- Comment #3 from Jarrett Billingsley <jarrett.billingsley gmail.com> 
2009-07-20 07:52:38 PDT ---
Sorry, this is invalid.  To solve this, you have to do the following:

1) Set cmd.exe's font to Lucida Console.
2) Execute 'chcp 65001'.

Then run your program.

Jul 20 2009
d-bugmail puremagic.com writes:

--- Comment #4 from Jarrett Billingsley <jarrett.billingsley gmail.com> 
2009-07-20 07:53:22 PDT ---
Created an attachment (id=429)
 --> (http://d.puremagic.com/issues/attachment.cgi?id=429)
Correct Russian output

Here's an image that shows it working properly.

Jul 20 2009
d-bugmail puremagic.com writes:

Matti Niemenmaa <matti.niemenmaa+dbugzilla iki.fi> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID




Jul 20 2009
d-bugmail puremagic.com writes:

Oleg Halzov <ok96 mail.ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|INVALID                     |




--- Comment #5 from Oleg Halzov <ok96 mail.ru>  2009-07-20 22:28:13 PDT ---
But Jarrett, almost everybody who codes in Russian needs the Windows-1251
codepage by default. If we need to compile a small program and we don't have a
robust IDE, we use notepad.exe (or something like it), which saves Russian text
in Windows-1251.
And nobody will change the default font in "Command Prompt" to Lucida
Console just for my small program, I swear!
All other compilers (Pascal, C, C++) understand that Russian text in
Windows is in Windows-1251! Currently I don't have any good editor for D where I
can edit Russian text normally in UTF-8. Entice Designer has a bug confirmed
by Chris Miller: you cannot enter Russian text, only copy and paste it.
Therefore, if you build a D compiler for the Win32 platform, you have to make it
work with widely used regional codepages, because the entire world is not
English-only and not fully UTF-8!

Jul 20 2009
d-bugmail puremagic.com writes:

Jarrett Billingsley <jarrett.billingsley gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|diagnostic, wrong-code      |
            Summary|Wrong processing by DMD.exe |Support Windows-1251 as a
                   |of Russian Windows-1251     |source encoding
                   |character set: "invalid     |
                   |UTF-8 sequence"             |
           Severity|critical                    |enhancement




--- Comment #6 from Jarrett Billingsley <jarrett.billingsley gmail.com> 
2009-07-20 23:11:34 PDT ---
What you're basically asking for is an enhancement.  I'm sorry, but that's the
way it works.

Jul 20 2009
d-bugmail puremagic.com writes:

Stewart Gordon <smjg iname.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |smjg iname.com




--- Comment #7 from Stewart Gordon <smjg iname.com>  2009-07-21 14:43:27 PDT ---
Why not make this enhancement request "Write a decent, free, Unicode-compatible
code editor that syntax-highlights D properly"?

Jul 21 2009
d-bugmail puremagic.com writes:

--- Comment #8 from Jarrett Billingsley <jarrett.billingsley gmail.com> 
2009-07-21 17:07:58 PDT ---
(In reply to comment #7)
> Why not make this enhancement request "Write a decent, free, Unicode-compatible
> code editor that syntax-highlights D properly"?

Why not be a sarcastic ass _all the time_?
Jul 21 2009
d-bugmail puremagic.com writes:

--- Comment #9 from Oleg Halzov <ok96 mail.ru>  2009-07-22 01:19:42 PDT ---
Dear friends, D has really good ideas behind it, but Unicode support
(UTF-16) in the compiler instead of old UTF-8 is a "MUST HAVE" feature. It's a
great need in non-Latin languages. The Windows-1251 codepage is #1 for Russian
Windows programmers.
Otherwise the D compiler will stay an experiment forever.

Jul 22 2009
d-bugmail puremagic.com writes:

--- Comment #10 from Stewart Gordon <smjg iname.com>  2009-07-22 01:47:17 PDT ---
(In reply to comment #9)
> Dear friends, D has really good ideas behind it, but Unicode support
> (UTF-16) in the compiler instead of old UTF-8 is a "MUST HAVE" feature.

DMD already supports UTF-16.  Even UTF-32.  Why do you want UTF-8 support
removed?

> It's a great need in non-Latin languages. The Windows-1251 codepage is #1
> for Russian Windows programmers. Otherwise the D compiler will stay an
> experiment forever.

How would supporting codepages work anyway?  Would they be converted to UTF-8
at compile time?  In this case, D would need some form of character encoding
declaration.  Or would they be left as they are, and be rejected only in wchar,
wchar[], dchar and dchar[] literals?  What about all the D features and APIs
that rely on char[] being UTF-8?

Seriously, if you're going to code in D and need to use non-ASCII characters,
it goes without saying that you should have a Unicode-compatible editor.  The
lack of good D editors may be a real issue at the moment, but AISI it makes
little sense to try to work around it.  No programming language is born with
high-quality development tools.  People need to write them.  (That said, there
have been a few dedicated D IDE projects.  What's the highest stage of
development any of them is at?)
Jul 22 2009
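
One way to read the "converted to UTF-8 at compile time" option in comment #10 is as a conversion step that runs before the compiler ever sees the file, with the source encoding stated explicitly by the user. The following is a minimal sketch of such a pre-build step in Python. It is not a DMD feature; `transcode_to_utf8`, its `source_encoding` parameter, and the file names are all hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path

def transcode_to_utf8(src: Path, dst: Path, source_encoding: str = "cp1251") -> None:
    """Rewrite `src` (stored in a legacy code page) as UTF-8 in `dst`."""
    raw = src.read_bytes()
    # Strict decoding: a byte that is undefined in the declared code page
    # raises UnicodeDecodeError instead of silently corrupting the text.
    text = raw.decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))

# Demonstration with a throwaway file standing in for a Notepad "ANSI" save.
source_line = 'printf("Привет, D!\\n");\n'
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "hello_cp1251.d"
    converted = Path(tmp) / "hello.d"
    legacy.write_bytes(source_line.encode("cp1251"))
    transcode_to_utf8(legacy, converted)
    print(converted.read_bytes() == source_line.encode("utf-8"))
```

The encoding has to be passed in by hand. That is the "character encoding declaration" the comment says would be needed, moved out of the language and into the build step.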
d-bugmail puremagic.com writes:

--- Comment #11 from Oleg Halzov <ok96 mail.ru>  2009-07-22 01:57:48 PDT ---
Stewart, Windows compilers SHOULD understand and correctly convert regional
characters for console and dialogs (from resource files). The simplest test for
the compiler in Windows is to enter text in notepad.exe in regional language
and try to compile the file. MS VCPP compiler, BCC compiler and any other C++
compiler do it.
And if DMD supports UTF-16 then how to make it work with UTF-16 Russian text
entered in the simplest Notepad editor?

Jul 22 2009
d-bugmail puremagic.com writes:

Walter Bright <bugzilla digitalmars.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugzilla digitalmars.com




--- Comment #12 from Walter Bright <bugzilla digitalmars.com>  2009-07-22 02:32:23 PDT ---
> And if DMD supports UTF-16 then how to make it work with UTF-16 Russian text
> entered in the simplest Notepad editor?

DMD will automatically detect and work correctly with UTF-16 and UTF-32 encoded
source files. The logic to do this is in module.c of the compiler source code.
If it does not work with a particular UTF-16 encoded file, please attach that
file to this bug report.

Note that UTF-16 encoded files are not encoded using a code page.

If a source file is encoded with a particular code page, there is no way for
the compiler to automatically detect it. C compilers often have a command line
flag which is used to tell it what code page to use. Using code pages,
therefore, makes your source code completely non-portable which is one of the
reasons why D uses Unicode instead.
Jul 22 2009
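
Comment #12 draws a line between Unicode encodings, which can announce themselves, and code pages, which cannot. The sketch below illustrates that distinction in Python (not D). It recognizes only a leading byte-order mark, which is a simplifying assumption and not a description of the actual logic in DMD's module.c. `detect_unicode_encoding` is a hypothetical name.

```python
import codecs

# Longest marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_unicode_encoding(data: bytes):
    """Return the encoding named by a leading byte-order mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

word = "Привет"
utf16_file = codecs.BOM_UTF16_LE + word.encode("utf-16-le")
cp1251_file = word.encode("cp1251")

print(detect_unicode_encoding(utf16_file))   # marked file: encoding is recoverable
print(detect_unicode_encoding(cp1251_file))  # code-page file: no mark to go on

# The same unmarked bytes decode without error under two different code pages,
# but only one of the results is the word that was typed.
print(cp1251_file.decode("cp1251") == word)
print(cp1251_file.decode("cp866") == word)
```

Both decodes of the unmarked bytes succeed, and nothing in the bytes themselves says which reading was intended. That is why a compiler would need a flag or declaration to support code pages.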
d-bugmail puremagic.com writes:

--- Comment #13 from Oleg Halzov <ok96 mail.ru>  2009-07-22 03:42:38 PDT ---
Created an attachment (id=431)
 --> (http://d.puremagic.com/issues/attachment.cgi?id=431)
D Windows Unicode text - edit in Notapad and compile result in Console

Jul 22 2009
d-bugmail puremagic.com writes:

--- Comment #14 from Oleg Halzov <ok96 mail.ru>  2009-07-22 03:43:10 PDT ---
Created an attachment (id=432)
 --> (http://d.puremagic.com/issues/attachment.cgi?id=432)
D UTF-8 text - edit in Notapad and compile result in Console

Jul 22 2009
d-bugmail puremagic.com writes:

--- Comment #15 from Oleg Halzov <ok96 mail.ru>  2009-07-22 03:44:11 PDT ---
Created an attachment (id=433)
 --> (http://d.puremagic.com/issues/attachment.cgi?id=433)
D ANSI text with Russian letters - edit in Notapad and compile result in
Console

Jul 22 2009
d-bugmail puremagic.com writes:

Oleg Halzov <ok96 mail.ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
    Attachment #431|D Windows Unicode text -    |D Windows Unicode text -
        description|edit in Notapad and compile |edit in Notepad and compile
                   |result in Console           |result in Console




Jul 22 2009
d-bugmail puremagic.com writes:

Oleg Halzov <ok96 mail.ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
    Attachment #432|D UTF-8 text - edit in      |D UTF-8 text - edit in
        description|Notapad and compile result  |Notepad and compile result
                   |in Console                  |in Console




Jul 22 2009
d-bugmail puremagic.com writes:

Oleg Halzov <ok96 mail.ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
    Attachment #433|D ANSI text with Russian    |D ANSI text with Russian
        description|letters - edit in Notapad   |letters - edit in Notepad
                   |and compile result in       |and compile result in
                   |Console                     |Console




Jul 22 2009
d-bugmail puremagic.com writes:

--- Comment #16 from Oleg Halzov <ok96 mail.ru>  2009-07-22 03:51:47 PDT ---
Dear Walter,
Please take a close look at my last 3 attachments having "edit in Notepad and
compile result in Console" text in descriptions.
Note that all Russians have the 866 codepage by default in Windows Command Prompt.
Nobody will switch from 866 to any other codepage for a console application.

Jul 22 2009
d-bugmail puremagic.com writes:

--- Comment #17 from Stewart Gordon <smjg iname.com>  2009-07-22 04:32:31 PDT ---
(In reply to comment #16)
> Dear Walter,
> Please take a close look at my last 3 attachments having "edit in Notepad and
> compile result in Console" text in descriptions.

Going by your screenshots and their descriptions, DMD is behaving correctly.

> Note that all Russians have 866 codepage by default in Windows Command Prompt.

You mean it's hard-coded for each language's edition of Windows?  That's
something else that ought to change.

> Nobody will be switching 866 to any other codepage for console application.

Console output is an entirely separate issue from source encoding.  I feel that
D's stdio ought to support codepages, but it doesn't (aside from the fact that
printf isn't part of D's stdio).  Meanwhile, please check out my utility library
http://pr.stewartsplace.org.uk/d/sutil/
Jul 22 2009
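
The separation made in comment #17 between source encoding and console output can be shown with the two code pages named in this thread: Windows-1251 for files and 866 for the Russian console. The sketch below is in Python, not D, and is not the API of the utility library linked above. `to_console_bytes` is a hypothetical helper that assumes the program holds Unicode text and that the console code page is known.

```python
def to_console_bytes(text: str, console_codepage: str = "cp866") -> bytes:
    """Encode program text for a console that uses `console_codepage`."""
    # Characters the code page lacks become '?' instead of raising an error.
    return text.encode(console_codepage, errors="replace")

greeting = "Привет, D!"

# Bytes written as Windows-1251 and passed through untouched: a cp866
# console interprets the same bytes as different characters.
shown_untranslated = greeting.encode("cp1251").decode("cp866")
print(shown_untranslated == greeting)   # False

# Bytes converted to the console's own code page display as intended.
shown_translated = to_console_bytes(greeting).decode("cp866")
print(shown_translated == greeting)     # True
```

In this model, decoding with cp866 plays the role of the console: whatever bytes the program writes, the console reads them through its own code page, so the conversion has to happen on the output side regardless of how the source file was saved.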
d-bugmail puremagic.com writes:

Andrei Alexandrescu <andrei metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrei metalanguage.com




--- Comment #18 from Andrei Alexandrescu <andrei metalanguage.com>  2009-07-24
18:10:55 PDT ---
I think support for codepages and other character types could be implemented in
a library. That was the ambitious purpose behind std.encoding. Yet another
great project for someone interested.

Jul 24 2009
d-bugmail puremagic.com writes:

--- Comment #19 from Sobirari Muhomori <maxmo pochta.ru>  2009-07-27 02:39:21
PDT ---
As to console output, it's a duplicate of (runtime) bug 2742 or bug 1448. Tango
and the C API work correctly; Phobos doesn't. As to cp1251, this ice-age
technology is definitely not the way to go; Unicode is the future. No, it's the
present. Windows works in Unicode and you should use it. As to conversion of
source from ANSI to OEM codepage, it's a valid RFE, but hardly anyone will
implement it. You can try yourself.

Jul 27 2009
d-bugmail puremagic.com writes:

--- Comment #20 from Stewart Gordon <smjg iname.com>  2009-07-27 03:34:29 PDT ---
(In reply to comment #19)
> As to console output, it's a duplicate of (runtime) bug 2742 or bug 1448.

This is getting off-topic for this bug report, but it is bug 2742 that this
conversation has drifted into.  Bug 1448 is a separate issue.

> As to conversion of source from ANSI to OEM codepage, it's a valid
> RFE, but hardly anyone will implement it. You can try yourself.

I already have.  See comment 17.
Jul 27 2009
d-bugmail puremagic.com writes:

--- Comment #21 from Sobirari Muhomori <maxmo pochta.ru>  2009-07-28 02:37:01
PDT ---
Hmm... your library is just an API, it has nothing to do with source encoding
and as far as I see it accepts utf8 text, not ANSI.

Jul 28 2009
d-bugmail puremagic.com writes:

--- Comment #22 from Stewart Gordon <smjg iname.com>  2009-07-28 03:30:01 PDT ---
(In reply to comment #21)
> Hmm... your library is just an API, it has nothing to do with source encoding

As has a lot of the discussion here from comment 13 onwards.  Maybe, to avoid
confusion, we should continue this conversation at bug 2742.  Or perhaps even
better, on the newsgroup.

> and as far as I see it accepts utf8 text, not ANSI.

Not quite.  It communicates with the console in the console codepage.
Application code communicates with it in UTF-8.
Jul 28 2009