digitalmars.D - Character encoding problem

"Mathias Bierschenk" <Mathias.Bierschenk web.de> writes:

How can I print German characters? I've tried the following simple program:

import std.c.stdio;

int main()
{
   puts("äöüßÄÖÜ"); // German characters

   return 0;
}

As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German  
edition) I tried Mozilla to save the source code file with different  
character encodings but none worked as expected. Here's what I tried using  
the current DMD version:

MS-DOS encoding as performed by Microsoft's EDIT editor:
(5) "invalid UTF-sequence"

Western (ISO-8859-1):
(5) "invalid UTF-sequence"

Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
(1) "semicolon expected, not '.'"
(1) no identifier for declarator

Unicode (UTF-16 and UTF-8):
both compile fine but output garbage under MS-DOS
(Windows 98 SE, German edition)
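
The results listed above come down to which bytes end up in the source file and how the console interprets the bytes it receives. The following is a minimal sketch in Python, used here only for illustration; the choice of CP850 as the console code page is an assumption about a German Windows 98 setup, not something stated in the post.

```python
# Illustration only: the same seven characters in two file encodings.
text = "äöüßÄÖÜ"

latin1_bytes = text.encode("iso-8859-1")  # what a "Western" save writes
utf8_bytes = text.encode("utf-8")         # what a UTF-8 save writes


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ISO-8859-1 bytes are not valid UTF-8, which matches the
# "invalid UTF-sequence" compiler error.
latin1_ok = is_valid_utf8(latin1_bytes)

# A console that assumes an OEM code page (CP850 assumed here) shows two
# wrong characters for every UTF-8 encoded umlaut: the "garbage".
garbage = utf8_bytes.decode("cp850")
```

So a UTF-8 source file can compile cleanly and still print garbage, because the console never learns that the bytes are UTF-8.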

Nov 19 2004

Anders F Björklund <afb algonet.se> writes:

Mathias Bierschenk wrote:

 How can I print German characters? I've tried the following simple program:
 
 import std.c.stdio;
 
 int main()
 {
   puts("äöüßÄÖÜ"); // German characters
 
   return 0;
 }

D only supports Unicode, so *both* your editor and
your terminal must be set to this. (UTF-8, usually)

Does the Windows 98 SE command prompt support Unicode?
If not, you need to convert before outputting...

--anders

Nov 19 2004

"Simon Buchan" <currently no.where> writes:

On Fri, 19 Nov 2004 12:49:01 +0100, Mathias Bierschenk  
<Mathias.Bierschenk web.de> wrote:

 How can I print German characters? I've tried the following simple  
 program:

 import std.c.stdio;

 int main()
 {
    puts("äöüßÄÖÜ"); // German characters

    return 0;
 }

 As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German  
 edition) I tried Mozilla to save the source code file with different  
 character encodings but none worked as expected. Here's what I tried  
 using the current DMD version:

 MS-DOS encoding as performed by Microsoft's EDIT editor:
 (5) "invalid UTF-sequence"

 Western (ISO-8859-1):
 (5) "invalid UTF-sequence"

 Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
 (1) "semicolon expected, not '.'"
 (1) no identifier for declarator

 Unicode (UTF-16 and UTF-8):
 both compile fine but output garbage under MS-DOS
 (Windows 98 SE, German edition)

The C functions don't like non-ASCII characters very much. I had this problem
displaying a file on the console.
Currently, you are best off using either writef or (if you don't want
formatted output) std.stream's stdout.writeString and stdout.writeLine. (You
could of course use writef("%s", yourstring), but I don't like that very
much.)
Be careful: std.stdio and std.stream.stdout aren't synchronized. (I use
std.stream exclusively.)

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/

Nov 19 2004

"Mathias Bierschenk" <Mathias.Bierschenk web.de> writes:

On Sat, 20 Nov 2004 01:03:14 +1300, Simon Buchan
<currently no.where> wrote:

 The C functions don't like non-ASCII characters very much. I had this problem
 displaying a file on the console.
 Currently, you are best off using either writef or (if you don't want
 formatted output) std.stream's stdout.writeString and stdout.writeLine. (You
 could of course use writef("%s", yourstring), but I don't like that very
 much.)
 Be careful: std.stdio and std.stream.stdout aren't synchronized. (I use
 std.stream exclusively.)

Could you provide an example? I can't get it to work here. The following  
program, saved with several unicode encodings, still yields garbage:

import std.stream;

int main()
{
   stdout.writeString("äöüßÄÖÜ\n");

   return 0;
}

Nov 19 2004

Ben Hinkle <Ben_member pathlink.com> writes:

Could you provide an example? I can't get it to work here. The following  
program, saved with several unicode encodings, still yields garbage:

import std.stream;

int main()
{
   stdout.writeString("äöüßÄÖÜ\n");

   return 0;
}

Are you sure your command window is set to use UTF-8? On Windows I think you
change it by going to the "Regional Settings" control panel.

Nov 19 2004

Ilya Minkov <minkov cs.tum.edu> writes:

Ben Hinkle wrote:

 Are you sure your command window is set to use UTF-8? On Windows I think you
 change it by going to the "Regional Settings" control panel.

That doesn't matter, or rather I think there is nothing to configure.
The problem is that he is using Mozilla for something it isn't meant for. He
should rather use a programmer's editor that supports UTF-8, for example SciTE.
In SciTE, also go to File -> Encoding -> UTF-8.

The output will be another problem: either multi-character garbage (C
functions) or automatic conversion to the local code page (D native
Unicode functions).

-eye

Nov 19 2004

"Mathias Bierschenk" <Mathias.Bierschenk web.de> writes:

On Fri, 19 Nov 2004 17:03:36 +0100, Ilya Minkov <minkov cs.tum.edu> wrote:

 Are you sure your command window is set to use UTF-8? On Windows I  
 think you
 change it by going to the "Regional Settings" control panel.

 That doesn't matter - or rather i think there is nothing to configure.  
 The problem is, he misuses Mozilla for something wrong. He should rather  
 use a programmer's editor which supports UTF-8, for example SciTE. In  
 this example, also go to File -> Encoding -> UTF-8.

I've just downloaded SciTE and have done what you suggested. I admit that  
using Mozilla for encoding issues is not very elegant. SciTE doesn't  
change anything, though. I still get garbage.
By the way, is there a D plugin for SciTE?

 The output will be another problem - either multi-character garbage (C  
 functions) or automatically converted to local codepage (D native  
 Unicode functions)

Nov 19 2004

"Valéry Croizier" <valery freesurf.fr> writes:

"Mathias Bierschenk" <Mathias.Bierschenk web.de> wrote in message
news: opshp0d1h29gaiaw dialin-145-254-035-176.arcor-ip.net...

 By the way, is there a D plugin for SciTE?

You'll find it there
http://www.prowiki.org/wiki4d/wiki.cgi?EditorSupport#SciTE

Nov 19 2004

"Mathias Bierschenk" <Mathias.Bierschenk web.de> writes:

On Fri, 19 Nov 2004 22:08:56 +0100, Valéry Croizier
<valery freesurf.fr> wrote:

 By the way, is there a D plugin for SciTE?

 You'll find it there
 http://www.prowiki.org/wiki4d/wiki.cgi?EditorSupport#SciTE

Thanks!

Nov 19 2004

Ilya Minkov <minkov cs.tum.edu> writes:

Mathias Bierschenk wrote:

 I've just downloaded SciTE and have done what you suggested. I admit 
 that  using Mozilla for encoding issues is not very elegant. SciTE 
 doesn't  change anything, though. I still get garbage.

Ah, I missed that you had already got as far as garbage output. :) Well, I'll
see what could be wrong. In general, non-NT Windows has received little
consideration in the Phobos implementation, because these Windows versions
are not very Unicode compatible.

-eye

Nov 19 2004

Stewart Gordon <smjg_1998 yahoo.com> writes:

Ben Hinkle wrote:
<snip>
 Are you sure your command window is set to use UTF-8? On Windows I think you
 change it by going to the "Regional Settings" control panel.

In Windows 98, a command prompt is still a plain old MS-DOS window.  As 
such, it can't possibly use UTF-8, as this would break the essential 
one-to-one mapping between bytes and on-screen character positions.

I don't know how different this really is in Windows 2000/XP....

Stewart.

Nov 19 2004

Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:

Let's try to track down the real problem.

Change the string to "\u00E4\u00F6\u00FC\u00DF" (ae)(oe)(ue)(ss).
If the output is still garbage, try printf instead of puts.

If the problem still exists, it's an output/shell problem.

Thomas

Mathias Bierschenk wrote on Fri, 19 Nov 2004 12:49:01 +0100:
 How can I print German characters? I've tried the following simple program:

 import std.c.stdio;

 int main()
 {
    puts("äöüßÄÖÜ"); // German characters

    return 0;
 }

 As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German  
 edition) I tried Mozilla to save the source code file with different  
 character encodings but none worked as expected. Here's what I tried using  
 the current DMD version:

 MS-DOS encoding as performed by Microsoft's EDIT editor:
 (5) "invalid UTF-sequence"

 Western (ISO-8859-1):
 (5) "invalid UTF-sequence"

 Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
 (1) "semicolon expected, not '.'"
 (1) no identifier for declarator

 Unicode (UTF-16 and UTF-8):
 both compile fine but output garbage under MS-DOS
 (Windows 98 SE, German edition)

Nov 19 2004

"Mathias Bierschenk" <Mathias.Bierschenk web.de> writes:

On Fri, 19 Nov 2004 13:09:06 +0100, Thomas Kuehne
<thomas-dloop kuehne.thisisspam.cn> wrote:

 Let's try to track down the real problem.

 Change the string to "\u00E4\u00F6\u00FC\u00DF" (ae)(oe)(ue)(ss).

 If the output is still garbage, try printf instead of puts.

I've tested the above string. The result for both puts and printf is that  
either it doesn't compile or it outputs garbage:

MS-DOS/Western (ISO-8859-1), UTF-16, UTF-8
compile fine but output garbage under MS-DOS
(Windows 98 SE, German edition)

Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
(1) "semicolon expected, not '.'"
(1) no identifier for declarator

 If the problem still exists it's an output/shell problem.

Nov 19 2004

Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Mathias Bierschenk wrote:
 Let's try to track down the real problem.

 Change the string to "\u00E4\u00F6\u00FC\u00DF" (ae)(oe)(ue)(ss).

 If the output is still garbage, try printf instead of puts.

I've tested the above string. The result for both puts and printf is that  
either it doesn't compile or it outputs garbage:

MS-DOS/Western (ISO-8859-1), UTF-16, UTF-8
compile fine but output garbage under MS-DOS
(Windows 98 SE, German edition)

Clearly seems to be a shell problem.

Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
(1) "semicolon expected, not '.'"
(1) no identifier for declarator

This is a known problem. If you use UTF-16/32 without a BOM (byte order mark), the
current dmd assumes UTF-8 and subsequently fails.

http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_16be
http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_16le
http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_32be
http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_32le

Thomas
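
The BOM handling described above can be sketched as follows. This is an illustration in Python under stated assumptions, not dmd's actual implementation: `sniff_encoding` and `decode_source` are hypothetical helper names, and the no-BOM fallback assumes the first character of the file is ASCII (as it is for `import`).

```python
import codecs

# The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks,
# because FF FE 00 00 (UTF-32LE) begins with FF FE (UTF-16LE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(buf: bytes):
    """Return (encoding, bytes_to_skip) for a source buffer."""
    for bom, name in _BOMS:
        if buf.startswith(bom):
            return name, len(bom)
    # No BOM: guess from the position of the zero bytes, assuming the
    # first character is ASCII. Nothing is skipped in this case.
    if len(buf) >= 4:
        if buf[1] == 0 and buf[2] == 0 and buf[3] == 0:
            return "utf-32-le", 0
        if buf[0] == 0 and buf[1] == 0 and buf[2] == 0:
            return "utf-32-be", 0
    if len(buf) >= 2:
        if buf[1] == 0:
            return "utf-16-le", 0
        if buf[0] == 0:
            return "utf-16-be", 0
    return "utf-8", 0


def decode_source(buf: bytes) -> str:
    """Decode a source buffer, skipping a BOM only if one is present."""
    encoding, skip = sniff_encoding(buf)
    return buf[skip:].decode(encoding)
```

The point of returning the skip length separately is that a file without a BOM must be decoded from its very first byte.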

Nov 19 2004

Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:

Here is a patch that enables GDC-0.8 and DMD-0.106 to handle
UTF-8/16/32 with and without a BOM.

Thomas

--- gdc-0.8/d/dmd/module.c	2004-10-02 19:19:31.000000000 +0200
+++ gdc-0.8d/d/dmd/module.c	2004-11-19 19:19:09.522419400 +0100
@@ -241,6 +241,7 @@
 	 * EF BB BF	UTF-8
 	 */
 
+	int haveNoBom=0;
 	if (buf[0] == 0xFF && buf[1] == 0xFE)
 	{
 	    if (buflen >= 4 && buf[2] == 0 && buf[3] == 0)
@@ -257,6 +258,7 @@
 		    fatal();
 		}
 
+		pu-=haveNoBom;
 		dbuf.reserve(buflen / 4);
 		while (++pu < pumax)
 		{   unsigned u;
@@ -292,6 +294,7 @@
 		    fatal();
 		}
 
+		pu-=haveNoBom;
 		dbuf.reserve(buflen / 2);
 		while (++pu < pumax)
 		{   unsigned u;
@@ -354,6 +357,8 @@
 	     * figure out the encoding.
 	     */
 
+            haveNoBom=1;
+
 	    if (buflen >= 4)
 	    {   if (buf[1] == 0 && buf[2] == 0 && buf[3] == 0)
 		{   // UTF-32LE


Thomas Kuehne wrote on Fri, 19 Nov 2004 14:19:33 +0000 (UTC):
 Let's try to track down the real problem.

 Change the string to "\u00E4\u00F6\u00FC\u00DF" (ae)(oe)(ue)(ss).

 If the output is still garbage, try printf instead of puts.

I've tested the above string. The result for both puts and printf is that  
either it doesn't compile or it outputs garbage:

MS-DOS/Western (ISO-8859-1), UTF-16, UTF-8
compile fine but output garbage under MS-DOS
(Windows 98 SE, German edition)

 Clearly seems to be a shell problem.

Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
(1) "semicolon expected, not '.'"
(1) no identifier for declarator

 This is a known problem. If you use UTF-16/32 without a BOM (byte order mark),
 the current dmd assumes UTF-8 and subsequently fails.

 http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_16be
 http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_16le
 http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_32be
 http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_32le

Nov 19 2004

Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:

Thomas Kuehne wrote on Fri, 19 Nov 2004 19:26:25 +0100:
Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
(1) "semicolon expected, not '.'"
(1) no identifier for declarator

 This is a known problem. If you use UTF-16/32 without a BOM (byte order mark),
 the current dmd assumes UTF-8 and subsequently fails.


The real problem was that dmd stripped the bytes of a nonexistent BOM.

Thomas

Nov 19 2004

Stewart Gordon <smjg_1998 yahoo.com> writes:

Mathias Bierschenk wrote:
 How can I print German characters? I've tried the following simple program:
 
 import std.c.stdio;
 
 int main()
 {
   puts("äöüßÄÖÜ"); // German characters
 
   return 0;
 }
 

<snip>
 Unicode (UTF-16 and UTF-8):
 both compile fine but output garbage under MS-DOS
 (Windows 98 SE, German edition)

You can include MS-DOS characters in a string, but only as escape codes. 
  In your case (assuming your code page is 437, 850, 852, 853 or 857):

     puts("\x84\x94\x81\xE1\x8E\x99\x9A");

Since the whole point of this is for outputting to MS-DOS, you could 
argue that this is appropriate use of non-Unicode characters in a string.

Stewart.
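
The escape codes above are simply the OEM code page bytes for the seven characters. A small Python sketch (for illustration only; it checks just CP437 and CP850, the two code pages from the list that are verified here) shows where they come from:

```python
text = "äöüßÄÖÜ"

# Encode the characters in two DOS (OEM) code pages.
cp437_bytes = text.encode("cp437")
cp850_bytes = text.encode("cp850")

# Format the bytes the way they appear inside the D string literal.
escaped = "".join("\\x%02X" % b for b in cp437_bytes)

# The Windows "ANSI" code page uses different byte values for the same
# characters, so these escapes suit the console but not GUI text.
cp1252_bytes = text.encode("cp1252")
```

Here `escaped` holds the same sequence as the literal passed to `puts`, which is why the output is correct only on consoles whose code page agrees with CP437 for these characters.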

Nov 19 2004

"Mathias Bierschenk" <Mathias.Bierschenk web.de> writes:

On Fri, 19 Nov 2004 16:02:17 +0000, Stewart Gordon
<smjg_1998 yahoo.com> wrote:

 You can include MS-DOS characters in a string, but only as escape codes.  
   In your case (assuming your code page is 437, 850, 852, 853 or 857):

      puts("\x84\x94\x81\xE1\x8E\x99\x9A");

 Since the whole point of this is for outputting to MS-DOS, you could  
 argue that this is appropriate use of non-Unicode characters in a string.

Yep, that works. Maybe this is more portable (encoded as UTF-8):

import std.c.stdio;

int main()
{
   version(Win32)
     puts("\x84\x94\x81\xE1\x8E\x99\x9A");
   else
     puts("äöüßÄÖÜ");

   return 0;
}

What do you think?!

Nov 19 2004

"Walter" <newshound digitalmars.com> writes:

"Mathias Bierschenk" <Mathias.Bierschenk web.de> wrote in message
news:opshpm3zlo9gaiaw dialin-212-144-051-051.arcor-ip.net...
 How can I print German characters? I've tried the following simple
 program:

 import std.c.stdio;

 int main()
 {
    puts("äöüßÄÖÜ"); // German characters

    return 0;
 }

 As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German
 edition) I tried Mozilla to save the source code file with different
 character encodings but none worked as expected. Here's what I tried using
 the current DMD version:

 MS-DOS encoding as performed by Microsoft's EDIT editor:

Using Microsoft Notepad, click on "Save As" and under encoding, select
"UTF-8". Then, use std.stdio.writef() instead of std.c.stdio.puts(), and it
should work.

Nov 19 2004

"Mathias Bierschenk" <Mathias.Bierschenk web.de> writes:

On Fri, 19 Nov 2004 14:13:32 -0800, Walter
<newshound digitalmars.com> wrote:

 Using Microsoft Notepad, click on "Save As" and under encoding, select
 "UTF-8". Then, use std.stdio.writef() instead of std.c.stdio.puts(), and
 it should work.

No, that doesn't work.
Some others here have tracked down the main problem: The Win9x console  
doesn't support Unicode. Instead one can only make use of some DOS escape  
sequences. The only thing that works so far (thanks to Stewart Gordon):

puts("\x84\x94\x81\xE1\x8E\x99\x9A"); // äöüßÄÖÜ

or, more portable(?), written by myself:

import std.c.stdio;

int main()
{
   version(Win32)
     puts("\x84\x94\x81\xE1\x8E\x99\x9A");
   else
     puts("äöüßÄÖÜ");

   return 0;
}

Carlos Santander B. suggested another solution, based on Y. Tomino's Win32  
headers, that seems to convert characters at run-time. I can't get it to  
print anything at the moment, so I can't yet tell if it is better than  
what I have got so far.
Maybe someone should write a tutorial about input/output basics in D? ;-)

Nov 20 2004

Roberto Mariottini <Roberto_member pathlink.com> writes:

In article <opshrgt5ci9gaiaw dialin-212-144-051-198.arcor-ip.net>, Mathias
Bierschenk says...
[...]
Some others here have tracked down the main problem: The Win9x console  
doesn't support Unicode.

This problem is for Windows NT/2000/XP also.
Consoles use OEM character set.
D doesn't support this.

 Instead one can only make use of some DOS escape  
sequences. The only thing that works so far (thanks to Stewart Gordon):

puts("\x84\x94\x81\xE1\x8E\x99\x9A"); // äöüßÄÖÜ

These are binary encodings of OEM characters.

or, more portable(?), written by myself:

import std.c.stdio;

int main()
{
   version(Win32)
     puts("\x84\x94\x81\xE1\x8E\x99\x9A");
   else
     puts("äöüßÄÖÜ");

   return 0;
}

This is not portable at all. It works only if the OEM code page in use is compatible
with CP437 for those code points.

The solution is to use CharToOemW, a function that translates a string from
UTF-16 to OEM character set (when possible, of course).

See an example:

<code>
import std.stdio;
import std.c.stdio;
import std.c.windows.windows;

extern (Windows)
{
export BOOL CharToOemW(
LPCWSTR lpszSrc,  // string to translate
LPSTR lpszDst     // translated string
);
}

int main()
{
    puts("-- untranslated --");
    puts("äöüßÄÖÜ");
    writef("äöüßÄÖÜ\n");

    puts("-- translated --");
    wchar[] mess = "äöüßÄÖÜ";
    // One extra char: CharToOemW also writes a terminating zero.
    char[] OEMmess = new char[mess.length + 1];
    CharToOemW(mess, OEMmess);
    puts(OEMmess);
    writef(OEMmess);

    return 0;
}
</code>

This outputs:

-- untranslated --


-- translated --
äöüßÄÖÜ
Error: invalid UTF-8 sequence

Here you can note that puts() works but writef() does not. That's because writef
expects OEMmess to be UTF-8.
The result is that writef doesn't work under Windows in any case.

Note also that on Windows 95/98/Me this works only if the Microsoft Layer for
Unicode is installed.

The only alternative is to use CharToOemA, that converts the current ANSI
codepage (for most western countries: Windows-1252) to current OEM codepage.
I don't know how to translate UTF-8 to ANSI.
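
The two conversion steps mentioned above (UTF-8 to ANSI, then ANSI to OEM) can be modelled as plain transcoding. The sketch below is in Python for illustration only; it does not call the Win32 functions, the helper names are hypothetical, and the code pages Windows-1252 and CP850 are assumptions for a Western European system.

```python
def utf8_to_ansi(data: bytes, ansi_codepage: str = "cp1252") -> bytes:
    """Transcode UTF-8 bytes to the (assumed) ANSI code page."""
    return data.decode("utf-8").encode(ansi_codepage, errors="replace")


def ansi_to_oem(data: bytes,
                ansi_codepage: str = "cp1252",
                oem_codepage: str = "cp850") -> bytes:
    """Model of an ANSI-to-OEM conversion; unmappable characters become '?'."""
    text = data.decode(ansi_codepage, errors="replace")
    return text.encode(oem_codepage, errors="replace")


utf8_source = "äöüßÄÖÜ".encode("utf-8")
ansi = utf8_to_ansi(utf8_source)
oem = ansi_to_oem(ansi)
```

Going through Unicode in each step keeps the conversion table-driven; characters missing from the target code page are replaced rather than silently corrupted.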

Carlos Santander B. suggested another solution, based on Y. Tomino's Win32  
headers, that seems to convert characters at run-time. I can't get it to  
print anything at the moment, so I can't yet tell if it is better than  
what I have got so far.

I haven't tested it either.

Maybe someone should write a tutorial about input/output basics in D? ;-)

Yes, please do it.

Ciao

Nov 22 2004

Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:

Roberto Mariottini wrote on Mon, 22 Nov 2004 09:52:27 +0000 (UTC):
 Here you can note that puts() works but writef() does not. That's because writef
 expects OEMmess to be UTF-8.
 The result is that writef doesn't work under Windows in any case.

 Note also that on Windows 95/98/Me this works only if the Microsoft Layer for
 Unicode is installed.

 The only alternative is to use CharToOemA, that converts the current ANSI
 codepage (for most western countries: Windows-1252) to current OEM codepage.
 I don't know how to translate UTF-8 to ANSI.

Maybe you could take a look at dmd/src/phobos/std/c/stdio.d?

You should be able to change it in a way that - if "FILE*" equals
stdout, stderr or stdlog and the hosting environment is Windows -
CharToOemA is called before C's "fputs", "fputc", "puts" or "putw" is
called.

The consequence would be that all writef/*put* calls should produce reasonable
output. To do the same with "printf" you'd have to modify
dmd/src/phobos/internal/object.d and dmd/src/phobos/object.d.

I'm currently not running Windows, but it would be interesting to know whether
"fputws" works correctly for non-ASCII chars.

Thomas

Nov 22 2004

Anders F Björklund <afb algonet.se> writes:

Roberto Mariottini wrote:

Some others here have tracked down the main problem: The Win9x console  
doesn't support Unicode.

 
 This problem is for Windows NT/2000/XP also.
 Consoles use OEM character set.
 D doesn't support this.

Mac OS X has a similar issue (uses MacRoman/ISO-8859-1 by default),
but fortunately you can choose UTF-8 from the Terminal settings...

 This is not portable at all. It works only if the OEM code page in use is
 compatible with CP437 for those code points.
 
 The solution is to use CharToOemW, a function that translates a string from
 UTF-16 to OEM character set (when possible, of course).

Or supply similar functions in D, which could be an alternative ?

Carlos Santander B. suggested another solution, based on Y. Tomino's Win32  
headers, that seems to convert characters at run-time. I can't get it to  
print anything at the moment, so I can't yet tell if it is better than  
what I have got so far.


I have written some basic lookups (i.e. "wchar mapping[256];")
using the tables that are all available on the Unicode site:


ISO Latin-1 (simple!)
http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

DOS Latin Console
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT

Windows "Latin-1"
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

Mac OS Roman
http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT

(there are few dozen others, but I think these are the most common ?)


But it needs a more thought-through API to be really useful...
And some optimization to do the reverse lookup, I suppose ?

I'm thinking one array of char[256], and one char[] of exceptions.
(where 0x00-0xFF would use the lookup, and 0x0100-0xFFFF the hash)

--anders
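
The table layout described above (a 256-entry forward table, plus a reverse lookup split into a flat table for 0x00-0xFF and a hash for everything else) could look roughly like this. It is a Python sketch for illustration: the built-in `cp437` codec stands in for the CP437.TXT mapping file, and `to_oem` is a hypothetical name.

```python
# Forward table: byte value -> Unicode character, 256 entries.
FORWARD = [bytes([b]).decode("cp437") for b in range(256)]

# Reverse lookup: flat table for U+0000..U+00FF, dict for the rest.
LOW = [None] * 256
HIGH = {}
for byte, ch in enumerate(FORWARD):
    code_point = ord(ch)
    if code_point < 256:
        LOW[code_point] = byte
    else:
        HIGH[code_point] = byte


def to_oem(text: str, fallback: int = ord("?")) -> bytes:
    """Convert text to CP437 bytes, substituting fallback when unmappable."""
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        byte = LOW[code_point] if code_point < 256 else HIGH.get(code_point)
        out.append(fallback if byte is None else byte)
    return bytes(out)
```

The flat table keeps the common Latin-1 range a single index operation, while the dictionary handles the box-drawing and Greek characters that CP437 maps above U+00FF.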

Nov 22 2004

Roberto Mariottini <Roberto_member pathlink.com> writes:

In article <cnlrlp$14b6$1 digitaldaemon.com>, Walter says...

[...]
Using Microsoft Notepad, click on "Save As" and under encoding, select
"UTF-8". Then, use std.stdio.writef() instead of std.c.stdio.puts(), and it
should work.

The code doesn't work anyway; see my other post for details.
The biggest problem is that writef() doesn't work on Windows, whether 9x/Me or
NT/2000/XP.

Ciao

Nov 22 2004

"Carlos Santander B." <csantander619 gmail.com> writes:

"Mathias Bierschenk" <Mathias.Bierschenk web.de> wrote in message
news:opshpm3zlo9gaiaw dialin-212-144-051-051.arcor-ip.net...
| How can I print German characters? I've tried the following simple program:
|
| import std.c.stdio;
|
| int main()
| {
|   puts("äöüßÄÖÜ"); // German characters
|
|   return 0;
| }
|
| As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German
| edition) I tried Mozilla to save the source code file with different
| character encodings but none worked as expected. Here's what I tried using
| the current DMD version:
|
| MS-DOS encoding as performed by Microsoft's EDIT editor:
| (5) "invalid UTF-sequence"
|
| Western (ISO-8859-1):
| (5) "invalid UTF-sequence"
|
| Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
| (1) "semicolon expected, not '.'"
| (1) no identifier for declarator
|
| Unicode (UTF-16 and UTF-8):
| both compile fine but output garbage under MS-DOS
| (Windows 98 SE, German edition)

I was investigating the same thing recently. What I really wanted was a Windows
console that did Unicode, but I couldn't find one.
But I came across a C++ program that lets you output UTF-16 strings
(wchar_t * in C++ on Windows). Translated to D, the program looks like this:

import std.file;
import std.string;
import std.utf;

import win32.winbase;
import win32.wincon;
import win32.winnls;

void main ()
{
    wchar [] tmp_w = toUTF16(cast(char[])"carlos andrés");
    wchar *   szwOut = tmp_w;
    DWORD      dwBytesWritten;
    DWORD      fdwMode;
    HANDLE     outHandle = GetStdHandle(STD_OUTPUT_HANDLE);

    if( (GetFileType(outHandle) & FILE_TYPE_CHAR) && GetConsoleMode( outHandle, &fdwMode) )
        WriteConsoleW( outHandle, szwOut, wcslen(szwOut), &dwBytesWritten, null);
    else
    {
        int nOutputCP = GetConsoleOutputCP();
        //int charCount = WideCharToMultiByte(nOutputCP, 0, szwOut, -1, null, 0, null, null);
        //char* szaStr = new char[charCount];
        //WideCharToMultiByte( nOutputCP, 0, szwOut, -1, szaStr, charCount, null, null);
        char [] tmp = toUTF8(tmp_w);
        char * szaStr = toMBSz(tmp);
        int charCount = tmp.length;
        WriteFile(outHandle, szaStr, charCount-1, &dwBytesWritten, null);
    }

}

It uses Y. Tomino's Win32 headers. The encoding the file is saved in doesn't seem to
matter.
I really don't remember where I found the original, so you can use this code as
you want, since it's not mine.
For Linux, I don't think there's any problem, since it uses UTF-8 by default (at
least with Red Hat based distros, in my experience).
BTW, if someone knows of a Unicode console for Windows, please let me know.

-----------------------
Carlos Santander Bernal
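
The decision the program above makes (UTF-16 when writing to a real console, code page bytes when output is redirected) can be modelled without any Win32 calls. This Python sketch is an illustration only: `bytes_for_output` is a hypothetical helper, and CP850 is an assumed console output code page.

```python
def bytes_for_output(text: str, is_console: bool, codepage: str = "cp850") -> bytes:
    """Choose the byte representation for the given kind of output handle."""
    if is_console:
        # A wide-character console write takes UTF-16 code units.
        return text.encode("utf-16-le")
    # Redirected output (file or pipe): convert to the output code page,
    # replacing characters it cannot represent.
    return text.encode(codepage, errors="replace")


console_bytes = bytes_for_output("carlos andrés", is_console=True)
file_bytes = bytes_for_output("carlos andrés", is_console=False)
```

Keeping the two paths separate is what allows the console path to bypass code pages entirely.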

Nov 19 2004

Manfred Hansen <manfred toppoint.de> writes:

Hello,

I have the same problem on Debian Linux (sarge) and SUSE 9.1:
"invalid UTF-8 sequence".
The editor is vim.

Manfred

Mathias Bierschenk wrote:

 How can I print German characters? I've tried the following simple
 program:
 
 import std.c.stdio;
 
 int main()
 {
    puts("äöüßÄÖÜ"); // German characters
 
    return 0;
 }
 
 As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German
 edition) I tried Mozilla to save the source code file with different
 character encodings but none worked as expected. Here's what I tried using
 the current DMD version:
 
 MS-DOS encoding as performed by Microsoft's EDIT editor:
 (5) "invalid UTF-sequence"
 
 Western (ISO-8859-1):
 (5) "invalid UTF-sequence"
 
 Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
 (1) "semicolon expected, not '.'"
 (1) no identifier for declarator
 
 Unicode (UTF-16 and UTF-8):
 both compile fine but output garbage under MS-DOS
 (Windows 98 SE, German edition)

Nov 19 2004

Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:

Manfred Hansen wrote on Sat, 20 Nov 2004 08:53:41 +0100:
 Hello,

 i have the same problem on Linux Debian (sarge) and SUSE 9.1.
 "invalid UTF-8 sequence"
 Editor is vim .

Vim 6.2 works for me.
Are you sure your locale is set to use UTF-8?

Please send me a sample, if this problem persists.

Thomas

Nov 20 2004

Manfred Hansen <manfred toppoint.de> writes:

Thomas Kuehne wrote:

 
 Manfred Hansen wrote on Sat, 20 Nov 2004 08:53:41 +0100:
 Hello,

 i have the same problem on Linux Debian (sarge) and SUSE 9.1.
 "invalid UTF-8 sequence"
 Editor is vim .

 
 Vim 6.2 works for me.
 Are you sure your locale is set to use UTF-8?
 

 
 Please send me a sample, if this problem persists.
 
 Thomas

My locale:
hansen@hansen-lx:~/d$ locale
LANG=de_DE@euro
LC_CTYPE="de_DE@euro"
LC_NUMERIC="de_DE@euro"
LC_TIME="de_DE@euro"
LC_COLLATE="de_DE@euro"
LC_MONETARY="de_DE@euro"
LC_MESSAGES="de_DE@euro"
LC_PAPER="de_DE@euro"
LC_NAME="de_DE@euro"
LC_ADDRESS="de_DE@euro"
LC_TELEPHONE="de_DE@euro"
LC_MEASUREMENT="de_DE@euro"
LC_IDENTIFICATION="de_DE@euro"
LC_ALL=

Thank you for the advice; I'll try switching to UTF-8.

Regards, Manfred

Nov 20 2004