digitalmars.D.bugs - Russian and other national languages support

zorran <zorran tut.by> writes:
Russian text does not work
in comments and strings by default
with an ANSI encoding (code page).

The compiler reports the error "invalid UTF-8 sequence".

==============
void main()
{
	string s = "&#1063;&#1090;&#1086;-&#1090;&#1086; &#1087;&#1086; &#1088;&#1091;&#1089;&#1089;&#1082;&#1080;"; // some text in Russian
	printf("hello, world!"); // &#1047;&#1076;&#1088;&#1072;&#1074;&#1089;&#1090;&#1074;&#1091;&#1081;, &#1084;&#1080;&#1088;!
}
==============

(D version 1.039)

In Delphi, C#, and many C++ compilers this all works!
Why?
This can reduce D's popularity!
Russian text does not need a two-byte code page! It's not Chinese!
Feb 03 2009
Max Samukha <samukha voliacable.com.removethis> writes:
On Tue, 3 Feb 2009 17:13:38 +0000 (UTC), zorran <zorran tut.by> wrote:

Russian text does not work
in comments and strings by default
with an ANSI encoding (code page).

The compiler reports the error "invalid UTF-8 sequence".

==============
void main()
{
	string s = "&#1063;&#1090;&#1086;-&#1090;&#1086; &#1087;&#1086; &#1088;&#1091;&#1089;&#1089;&#1082;&#1080;"; // some text in Russian
	printf("hello, world!"); // &#1047;&#1076;&#1088;&#1072;&#1074;&#1089;&#1090;&#1074;&#1091;&#1081;, &#1084;&#1080;&#1088;!
}
==============

(D version 1.039)

In Delphi, C#, and many C++ compilers this all works!
Why?
This can reduce D's popularity!
Russian text does not need a two-byte code page! It's not Chinese!

D strings are supposed to be UTF-8. Source files can be ASCII or one of the UTF encodings. To escape a Unicode code point, use \uXXXX or \UXXXXXXXX, where each X is a hexadecimal digit. Be aware that dmd/phobos still have some minor problems with Unicode support. For example, messages produced by static asserts are not output correctly.
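
A minimal sketch of the escape syntax described above, assuming a current D2 compiler rather than the D 1.039 mentioned in the thread. The variable names are illustrative only. The escaped string keeps the source file pure ASCII, while the literal form only compiles if the file is saved in a UTF encoding.

```d
import std.stdio : writeln;

void main()
{
    // "Что-то" spelled with \u escapes: the source file itself stays ASCII.
    string escaped = "\u0427\u0442\u043E-\u0442\u043E";

    // The same text typed directly; this line requires the file to be saved as UTF-8.
    string literal = "Что-то";
    assert(escaped == literal);

    // \U takes eight hex digits and reaches code points beyond U+FFFF.
    string clef = "\U0001D11E"; // MUSICAL SYMBOL G CLEF
    assert(clef.length == 4);   // four UTF-8 code units

    writeln(escaped);
}
```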
Feb 03 2009
BCS <ao pathlink.com> writes:
Reply to Zorran,

 Russian text does not work
 in comments and strings by default
 with an ANSI encoding (code page).
 The compiler reports the error "invalid UTF-8 sequence".
 
 ==============
 void main()
 {
 	string s = "&#1063;&#1090;&#1086;-&#1090;&#1086; &#1087;&#1086; &#1088;&#1091;&#1089;&#1089;&#1082;&#1080;"; // some text in Russian
 	printf("hello, world!"); // &#1047;&#1076;&#1088;&#1072;&#1074;&#1089;&#1090;&#1074;&#1091;&#1081;, &#1084;&#1080;&#1088;!
 }
 ==============
 
 (D version 1.039)
 
 In Delphi, C#, and many C++ compilers this all works!
 Why?
 This can reduce D's popularity!
 Russian text does not need a two-byte code page! It's not Chinese!

IIRC D doesn't use codepages at all, it is pure UTF-8/16/32. Code pages have all kinds of nasty side effects. For instance, the above code is a garbled mess of number codes in my NG reader. Also this kind of thing: http://www.viprasys.com/vb/f44/hole-notepad-12276/ Way back (2-3 years ago) I remember a long thread about the use of UTF in D, and the upshot was that it's not great but it's a lot better than anything else anyone has come up with.
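
To illustrate the UTF-8/16/32 point, here is a small sketch assuming a current D2 compiler and Phobos. It stores the same three-letter word in each of D's three string types and counts code units; `walkLength` counts decoded code points instead.

```d
import std.range : walkLength;

void main()
{
    string  s8  = "мир";  // UTF-8:  array of char
    wstring s16 = "мир"w; // UTF-16: array of wchar
    dstring s32 = "мир"d; // UTF-32: array of dchar

    // .length counts code units, not characters.
    assert(s8.length  == 6); // each Cyrillic letter takes two UTF-8 code units
    assert(s16.length == 3);
    assert(s32.length == 3);

    // Decoding the UTF-8 string yields three code points.
    assert(s8.walkLength == 3);
}
```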
Feb 03 2009
Stewart Gordon <smjg_1998 yahoo.com> writes:
BCS wrote:
<snip>
 IIRC D doesn't use codepages at all, it is pure UTF-8/16/32. Code pages 
 have all kinds of nasty side effects. For instance, the above code is a 
 garbled mess of number codes in my NG reader. Also this kind of thing: 
 http://www.viprasys.com/vb/f44/hole-notepad-12276/

Seems to be a bug in the web newsgroup interface. Indeed:
http://validator.w3.org/check?uri=http://www.digitalmars.com/webnews/newsgroups.php

Knowing PHP, it should be trivial to insert a meta tag to fix this. Though really, www.digitalmars.com should be configured to declare all text/* content as UTF-8 in the HTTP headers.

Meanwhile, best bet is to stop using the web interface and get oneself a newsreader.

Stewart.
Feb 03 2009
BCS <ao pathlink.com> writes:
Reply to Stewart,

 
 Meanwhile, best bet is to stop using the web interface and get oneself
 a newsreader.
 

If the web interface is the problem, then it's the posting bit, as /I'm/ not using the web interface.
Feb 03 2009
Stewart Gordon <smjg_1998 yahoo.com> writes:
BCS wrote:
 Reply to Stewart,
 
 Meanwhile, best bet is to stop using the web interface and get oneself
 a newsreader.

 If the web interface is the problem, then it's the posting bit

It can't be just the posting bit. If it doesn't declare a sensible encoding, it can't properly display UTF-8 encoded posts either. JTAI it doesn't just need to declare an encoding for the HTML output - it also needs to declare a suitable encoding when posting and handle encoding properly when displaying messages. But how easy or not is this in PHP?
 as /I'm/ not using the web interface.

My comment wasn't aimed at you particularly - I just needed somewhere to put it. Sorry if it seemed otherwise.

Stewart.
Feb 03 2009
BCS <none anon.com> writes:
Hello Stewart,

 BCS wrote:
 
 Reply to Stewart,
 
 Meanwhile, best bet is to stop using the web interface and get
 oneself a newsreader.
 


 It can't be just the posting bit. If it doesn't declare a sensible
 encoding, it can't properly display UTF-8 encoded posts either.

it could be converting client side to ASCII :)
Feb 03 2009
Stewart Gordon <smjg_1998 yahoo.com> writes:
BCS wrote:
 Hello Stewart,
 
 BCS wrote:


 If the web interface is the problem, then it's the posting bit

 It can't be just the posting bit. If it doesn't declare a sensible
 encoding, it can't properly display UTF-8 encoded posts either.

 it could be converting client side to ASCII :)

AIUI form posts are transmitted in the encoding of the HTML page containing the form. If the user supplies a character that can't be represented in this encoding, it gets converted on the client side to an HTML entity reference.

Look at
http://d.puremagic.com/issues/show_bug.cgi?id=111

When this issue was filed, Bugzilla was configured to serve pages in ISO-8859-1; hence the bug report was mangled, with

# t ~= "我";

having become

# t ~= "&#25105;";

Now our Bugzilla is on UTF-8, but this instance remains because it is what went through to the server at the time and is therefore stored in the database.

Stewart.
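
The mangling described above can be undone mechanically. The following is a rough sketch, assuming current D2 Phobos (`std.regex`, `std.utf`); the helper name `decodeNumericRefs` is hypothetical and it handles only decimal references of the form `&#NNNN;`, not hexadecimal or named entities.

```d
import std.conv : to;
import std.regex : regex, replaceAll;
import std.utf : toUTF8;

/// Replaces decimal numeric character references such as "&#25105;"
/// with the UTF-8 encoding of the code point they name.
/// (Hypothetical helper for illustration; not part of Phobos.)
string decodeNumericRefs(string text)
{
    auto re = regex(r"&#([0-9]+);");
    // m[1] is the captured run of digits; convert it to a code point,
    // then encode that single code point as UTF-8.
    return replaceAll!(m => toUTF8([cast(dchar) m[1].to!uint]))(text, re);
}

void main()
{
    // The mangled line from the Bugzilla report quoted above.
    assert(decodeNumericRefs(`t ~= "&#25105;";`) == `t ~= "我";`);

    // Part of the comment from the first post in this thread.
    assert(decodeNumericRefs("&#1084;&#1080;&#1088;!") == "мир!");
}
```

A reference naming an invalid code point would make `toUTF8` throw, which seems preferable to silently emitting bad data.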
Feb 04 2009
"Denis Koroskin" <2korden gmail.com> writes:
On Tue, 03 Feb 2009 20:13:38 +0300, zorran <zorran tut.by> wrote:

 Russian text does not work
 in comments and strings by default
 with an ANSI encoding (code page).

 The compiler reports the error "invalid UTF-8 sequence".

 ==============
 void main()
 {
 	string s = "&#1063;&#1090;&#1086;-&#1090;&#1086; &#1087;&#1086; &#1088;&#1091;&#1089;&#1089;&#1082;&#1080;"; // some text in Russian
 	printf("hello, world!"); // &#1047;&#1076;&#1088;&#1072;&#1074;&#1089;&#1090;&#1074;&#1091;&#1081;, &#1084;&#1080;&#1088;!
 }
 ==============

 (D version 1.039)

 In Delphi, C#, and many C++ compilers this all works!
 Why?
 This can reduce D's popularity!
 Russian text does not need a two-byte code page! It's not Chinese!

Just save your file as UTF-8 and you are done.
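
Most editors can do this from their "Save As" dialog. For existing files, a conversion can also be scripted. The sketch below assumes current D2 Phobos, whose `std.encoding` module supports Windows-1251, and assumes the source really was saved in that code page (the usual Cyrillic "ANSI" encoding on Windows); the function name `toUtf8` and the command-line shape are illustrative only.

```d
import std.encoding : transcode, Windows1251String;
import std.file : read, write;
import std.stdio : stderr;

/// Re-encodes Windows-1251 bytes as a UTF-8 string.
string toUtf8(immutable(ubyte)[] cp1251Bytes)
{
    string utf8;
    transcode(cast(Windows1251String) cp1251Bytes, utf8);
    return utf8;
}

int main(string[] args)
{
    if (args.length != 3)
    {
        stderr.writeln("usage: recode <cp1251-input.d> <utf8-output.d>");
        return 1;
    }
    // Assumption: the input is Windows-1251; the bytes carry no marker saying so.
    auto raw = cast(immutable(ubyte)[]) read(args[1]);
    write(args[2], toUtf8(raw));
    return 0;
}
```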
Feb 03 2009
Kagamin <spam here.lot> writes:
zorran Wrote:

 In Delphi, C#, and many C++ compilers this all works!
 Why?
 This can reduce D's popularity!
 Russian text does not need a two-byte code page! It's not Chinese!

In C# all strings are encoded as UTF-16 (two-byte code units), and in C++ L"..." strings are (usually) two-byte encoded. Delphi is a legacy technology, but some people have extended it with WideStrings and TNT, which are Unicode too. Modern projects usually use modern technologies like Unicode. If you really want to work with ANSI strings, you can do it, but then you should not use D libraries, which expect strings to be Unicode.
Feb 05 2009
zorran <zorran tut.by> writes:
I am only talking about the source code format, not the internal
representation of strings!
Feb 07 2009
Walter Bright <newshound1 digitalmars.com> writes:
zorran wrote:
 Russian text does not work
 in comments and strings by default
 with an ANSI encoding (code page).
 
 The compiler reports the error "invalid UTF-8 sequence".

D source code is expected to be in Unicode format (like UTF-8). Modern editors can be set to generate this format instead of using code pages. Having the source code in Unicode ensures global portability of the source code. If the code is written in one code page, then compiled or displayed with a different code page, the result is garbage.
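
The "garbage" effect is easy to reproduce. As a sketch, assuming current D2 Phobos (`std.encoding`, `std.utf`) and with helper names that are purely illustrative, the program below takes the three bytes an editor would write for "Что" under Windows-1251 and reads them back under two code pages and as UTF-8. The last check shows where an "invalid UTF-8 sequence" error comes from.

```d
import std.encoding : transcode, Windows1251String, Windows1252String;
import std.utf : validate, UTFException;

/// Interprets raw bytes as Windows-1251 (Cyrillic).
string as1251(immutable(ubyte)[] raw)
{
    string s;
    transcode(cast(Windows1251String) raw, s);
    return s;
}

/// Interprets the same raw bytes as Windows-1252 (Western European).
string as1252(immutable(ubyte)[] raw)
{
    string s;
    transcode(cast(Windows1252String) raw, s);
    return s;
}

/// Reports whether the bytes form well-formed UTF-8.
bool isValidUtf8(immutable(ubyte)[] raw)
{
    try
    {
        validate(cast(string) raw);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "Что" as stored by an editor using the Windows-1251 code page.
    immutable(ubyte)[] raw = [0xD7, 0xF2, 0xEE];

    assert(as1251(raw) == "Что"); // the text the author meant
    assert(as1252(raw) == "×òî"); // same bytes, different code page
    assert(!isValidUtf8(raw));    // and not well-formed UTF-8 at all
}
```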
Feb 20 2009