
digitalmars.D - Character encoding problem

reply "Mathias Bierschenk" <Mathias.Bierschenk web.de> writes:
How can I print German characters? I've tried the following simple program:

import std.c.stdio;

int main()
{
   puts("äöüßÄÖÜ"); // German characters

   return 0;
}

As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German  
edition) I tried Mozilla to save the source code file with different  
character encodings but none worked as expected. Here's what I tried using  
the current DMD version:

MS-DOS encoding as performed by Microsoft's EDIT editor:
(5) "invalid UTF-sequence"

Western (ISO-8859-1):
(5) "invalid UTF-sequence"

Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
(1) "semicolon expected, not '.'"
(1) no identifier for declarator

Unicode (UTF-16 and UTF-8):
both compile fine but output garbage under MS-DOS
(Windows 98 SE, German edition)
Nov 19 2004
next sibling parent =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Mathias Bierschenk wrote:

 How can I print German characters? I've tried the following simple program:
 
 import std.c.stdio;
 
 int main()
 {
   puts("äöüßÄÖÜ"); // German characters
 
   return 0;
 }
D only supports Unicode, so *both* your editor and your terminal must be set to it (usually UTF-8). Does the Windows 98 SE command prompt support Unicode? If not, you need to convert before outputting... --anders
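The "invalid UTF-sequence" errors reported for the EDIT and ISO-8859-1 files follow from this: a lone Latin-1 byte such as 0xE4 ("ä") is not a well-formed UTF-8 sequence. The following is only a sketch, assuming a current D2 compiler and Phobos (not the 2004 DMD discussed in this thread); `isValidUtf8` is a hypothetical helper name.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Returns true if `bytes` form a well-formed UTF-8 sequence.
/// (Hypothetical helper, not part of Phobos.)
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // std.utf.validate throws UTFException on malformed input.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "ä" saved as ISO-8859-1 is the single byte 0xE4 ...
    const(ubyte)[] latin1 = [0xE4];
    // ... while the same letter saved as UTF-8 is the pair 0xC3 0xA4.
    const(ubyte)[] utf8 = [0xC3, 0xA4];

    writeln(isValidUtf8(latin1)); // false
    writeln(isValidUtf8(utf8));   // true
}
```

A check like this only says whether the source bytes are acceptable to the compiler; it says nothing about what the console does with them afterwards.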
Nov 19 2004
prev sibling next sibling parent reply "Simon Buchan" <currently no.where> writes:
On Fri, 19 Nov 2004 12:49:01 +0100, Mathias Bierschenk  
<Mathias.Bierschenk web.de> wrote:

 How can I print German characters? I've tried the following simple  
 program:

 import std.c.stdio;

 int main()
 {
    puts("äöüßÄÖÜ"); // German characters

    return 0;
 }

 As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German  
 edition) I tried Mozilla to save the source code file with different  
 character encodings but none worked as expected. Here's what I tried  
 using the current DMD version:

 MS-DOS encoding as performed by Microsoft's EDIT editor:
 (5) "invalid UTF-sequence"

 Western (ISO-8859-1):
 (5) "invalid UTF-sequence"

 Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
 (1) "semicolon expected, not '.'"
 (1) no identifier for declarator

 Unicode (UTF-16 and UTF-8):
 both compile fine but output garbage under MS-DOS
 (Windows 98 SE, German edition)
The C functions don't like non-ASCII characters very much. I had this problem when displaying a file on the console. Currently, you are best off using either writef or (if you don't want it formatted) std.stream's stdout.writeString and stdout.writeLine. (You could of course use writef("%s", yourstring), but I don't like that very much.) Be careful: std.stdio and std.stream.stdout aren't synced. (I use std.stream exclusively.)
Nov 19 2004
parent reply "Mathias Bierschenk" <Mathias.Bierschenk web.de> writes:
On Sat, 20 Nov 2004 01:03:14 +1300, Simon Buchan
<currently no.where> wrote:

 The c functions dont like non-latin char's very much. I had this problem
 displaying a file to console.
 Currently, you are best of to use either writef or (if you dont want it
 formatted) std.stream 's stdout.writeString and stdout.writeLine. (You
 could of course use writef("%s", yourstring) , but I dont like that very
 much)
 Be careful: std.stdio and std.stream.stdout arn't sync'ed. (I use  
 std.stream
 exclusively)
Could you provide an example? I can't get it to work here. The following program, saved with several Unicode encodings, still yields garbage:

import std.stream;

int main()
{
   stdout.writeString("äöüßÄÖÜ\n");

   return 0;
}
Nov 19 2004
parent reply Ben Hinkle <Ben_member pathlink.com> writes:
Could you provide an example? I can't get it to work here. The following  
program, saved with several unicode encodings, still yields garbage:

import std.stream;

int main()
{
   stdout.writeString("äöüßÄÖÜ\n");

   return 0;
}
Are you sure your command window is set to use UTF-8? On Windows I think you change it by going to the "Regional Settings" control panel.
Nov 19 2004
next sibling parent reply Ilya Minkov <minkov cs.tum.edu> writes:
Ben Hinkle wrote:

 Are you sure your command window is set to use UTF-8? On Windows I think you
 change it by going to the "Regional Settings" control panel.
That doesn't matter, or rather I think there is nothing to configure. The problem is that he is misusing Mozilla for a job it wasn't made for. He should rather use a programmer's editor which supports UTF-8, for example SciTE. In SciTE, go to File -> Encoding -> UTF-8. The output will be another problem: either multi-character garbage (C functions) or text automatically converted to the local codepage (D native Unicode functions). -eye
Nov 19 2004
parent reply "Mathias Bierschenk" <Mathias.Bierschenk web.de> writes:
On Fri, 19 Nov 2004 17:03:36 +0100, Ilya Minkov <minkov cs.tum.edu> wrote:

 Are you sure your command window is set to use UTF-8? On Windows I  
 think you
 change it by going to the "Regional Settings" control panel.
That doesn't matter - or rather i think there is nothing to configure. The problem is, he misuses Mozilla for something wrong. He should rather use a programmer's editor which supports UTF-8, for example SciTE. In this example, also go to File -> Encoding -> UTF-8.
I've just downloaded SciTE and have done what you suggested. I admit that using Mozilla for encoding issues is not very elegant. SciTE doesn't change anything, though. I still get garbage. By the way, is there a D plugin for SciTE?
 The output will be another problem - either multi-character garbage (C  
 functions) or automatically converted to local codepage (D native  
 Unicode functions)
Nov 19 2004
next sibling parent reply "Valéry Croizier" <valery freesurf.fr> writes:
"Mathias Bierschenk" <Mathias.Bierschenk web.de> a écrit dans le message de
news: opshp0d1h29gaiaw dialin-145-254-035-176.arcor-ip.net...

 By the way, I there a D plugin for SciTE?
You'll find it there http://www.prowiki.org/wiki4d/wiki.cgi?EditorSupport#SciTE
Nov 19 2004
parent "Mathias Bierschenk" <Mathias.Bierschenk web.de> writes:
On Fri, 19 Nov 2004 22:08:56 +0100, Valéry Croizier
<valery freesurf.fr> wrote:

 By the way, I there a D plugin for SciTE?
You'll find it there http://www.prowiki.org/wiki4d/wiki.cgi?EditorSupport#SciTE
Thanks!
Nov 19 2004
prev sibling parent Ilya Minkov <minkov cs.tum.edu> writes:
Mathias Bierschenk wrote:

 I've just downloaded SciTE and have done what you suggested. I admit 
 that  using Mozilla for encoding issues is not very elegant. SciTE 
 doesn't  change anything, though. I still get garbage.
Ah, I missed that you had already got as far as the garbage output. :) Well, I'll see what could be wrong. In general, non-NT Windows has not been given much consideration in the Phobos implementation, because these Windows versions are not very Unicode compatible. -eye
Nov 19 2004
prev sibling parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Ben Hinkle wrote:
<snip>
 Are you sure your command window is set to use UTF-8? On Windows I think you
 change it by going to the "Regional Settings" control panel.
In Windows 98, a command prompt is still a plain old MS-DOS window. As such, it can't possibly use UTF-8, as this would break the essential one-to-one mapping between bytes and on-screen character positions. I don't know how different this really is in Windows 2000/XP.... Stewart.
Nov 19 2004
prev sibling next sibling parent reply Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:
Let's try to track down the real problem.

Change the string to "\u00E4\u00F6\u00FC\u00DF" (ae)(oe)(ue)(ss).
If the output is still garbage, try printf instead of puts.

If the problem still exists, it's an output/shell problem.

Thomas
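The point of the escaped string is that the source file becomes pure ASCII, so the editor's encoding can no longer be the culprit. A rough sketch of how one could confirm which bytes the program actually sends, assuming a current D2 compiler; `hexBytes` is a hypothetical helper, and U+00E4 is the code point of "ä".

```d
import std.format : format;
import std.stdio : writeln;

/// Formats the raw bytes of `s` as space-separated hex pairs.
/// (Hypothetical helper for inspection only.)
string hexBytes(string s)
{
    string result;
    foreach (i, ubyte b; cast(const(ubyte)[]) s)
    {
        if (i > 0)
            result ~= " ";
        result ~= format("%02X", b);
    }
    return result;
}

void main()
{
    // ASCII-only escapes for ä ö ü ß; the compiler stores them as UTF-8.
    string s = "\u00E4\u00F6\u00FC\u00DF";
    writeln(hexBytes(s)); // C3 A4 C3 B6 C3 BC C3 9F
}
```

If the bytes are the expected UTF-8 pairs and the screen still shows garbage, the remaining suspect is the console, which is the conclusion the thread reaches.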

Mathias Bierschenk wrote on Fri, 19 Nov 2004 12:49:01 +0100:
 How can I print German characters? I've tried the following simple program:

 import std.c.stdio;

 int main()
 {
    puts("äöüßÄÖÜ"); // German characters

    return 0;
 }

 As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German  
 edition) I tried Mozilla to save the source code file with different  
 character encodings but none worked as expected. Here's what I tried using  
 the current DMD version:

 MS-DOS encoding as performed by Microsoft's EDIT editor:
 (5) "invalid UTF-sequence"

 Western (ISO-8859-1):
 (5) "invalid UTF-sequence"

 Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
 (1) "semicolon expected, not '.'"
 (1) no identifier for declarator

 Unicode (UTF-16 and UTF-8):
 both compile fine but output garbage under MS-DOS
 (Windows 98 SE, German edition)
Nov 19 2004
parent reply "Mathias Bierschenk" <Mathias.Bierschenk web.de> writes:
On Fri, 19 Nov 2004 13:09:06 +0100, Thomas Kuehne
<thomas-dloop kuehne.thisisspam.cn> wrote:

 Let's try to track down  the real problem.

 change the string into "\u00E4\u00F6\u00FC\u00DF" (ae)(oe)(ue)(ss).

 If the output is still garbage try printf instead of puts.
I've tested the above string. The result for both puts and printf is that either it doesn't compile or it outputs garbage:

MS-DOS/Western (ISO-8859-1), UTF-16, UTF-8:
compile fine but output garbage under MS-DOS
(Windows 98 SE, German edition)

Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
(1) "semicolon expected, not '.'"
(1) no identifier for declarator
 If the problem still exists it's an output/shell problem.
Nov 19 2004
parent reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:
Mathias Bierschenk wrote:
 Let's try to track down  the real problem.

 change the string into "\u00E4\u00F6\u00FC\u00DF" (ae)(oe)(ue)(ss).

 If the output is still garbage try printf instead of puts.
I've tested the above string. The result for both puts and printf is that either it doesn't compile or it outputs garbage: MS-DOS/Western (ISO-8859-1), UTF-16, UTF-8 compile fine but output garbage under MS-DOS (Windows 98 SE, German edition)
Clearly seems to be a shell problem.
Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
(1) "semicolon expected, not '.'"
(1) no identifier for declarator
This is a known problem. If you use UTF-16/32 without a BOM (byte order mark), the current dmd assumes UTF-8 and subsequently fails.

http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_16be
http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_16le
http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_32be
http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_32le

Thomas
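For readers unfamiliar with byte order marks, here is a small sketch of the usual BOM check, assuming a current D2 compiler. It is not dmd's actual implementation; `detectBom` and its return strings are hypothetical names chosen for illustration.

```d
import std.stdio : writeln;

/// Names the encoding announced by a leading byte order mark,
/// or returns "none" if the buffer does not start with one.
/// (Hypothetical helper; not taken from the dmd sources.)
string detectBom(const(ubyte)[] buf)
{
    // Check the 4-byte UTF-32 marks first, because the UTF-32LE mark
    // FF FE 00 00 also begins with the UTF-16LE mark FF FE.
    if (buf.length >= 4)
    {
        if (buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0 && buf[3] == 0)
            return "UTF-32LE";
        if (buf[0] == 0 && buf[1] == 0 && buf[2] == 0xFE && buf[3] == 0xFF)
            return "UTF-32BE";
    }
    if (buf.length >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
    if (buf.length >= 2)
    {
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return "UTF-16LE";
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return "UTF-16BE";
    }
    return "none";
}

void main()
{
    // "im" (as in "import") stored as UTF-16LE, once with and once without a BOM.
    writeln(detectBom([0xFF, 0xFE, 0x69, 0x00, 0x6D, 0x00])); // UTF-16LE
    writeln(detectBom([0x69, 0x00, 0x6D, 0x00]));             // none
}
```

A file in the second form carries no marker at all, so a reader has to guess the encoding from the byte pattern or fall back to a default.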
Nov 19 2004
parent reply Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:
Here is a patch that enables GDC-0.8 and DMD-0.106 to handle
UTF-8/16/32 with and without a BOM.

Thomas

--- gdc-0.8/d/dmd/module.c	2004-10-02 19:19:31.000000000 +0200
+++ gdc-0.8d/d/dmd/module.c	2004-11-19 19:19:09.522419400 +0100
@@ -241,6 +241,7 @@
 	 * EF BB BF	UTF-8
 	 */
 
+	int haveNoBom=0;
 	if (buf[0] == 0xFF && buf[1] == 0xFE)
 	{
 	    if (buflen >= 4 && buf[2] == 0 && buf[3] == 0)
@@ -257,6 +258,7 @@
 		    fatal();
 		}
 
+		pu-=haveNoBom;
 		dbuf.reserve(buflen / 4);
 		while (++pu < pumax)
 		{   unsigned u;
@@ -292,6 +294,7 @@
 		    fatal();
 		}
 
+		pu-=haveNoBom;
 		dbuf.reserve(buflen / 2);
 		while (++pu < pumax)
 		{   unsigned u;
@@ -354,6 +357,8 @@
 	     * figure out the encoding.
 	     */
 
+            haveNoBom=1;
+
 	    if (buflen >= 4)
 	    {   if (buf[1] == 0 && buf[2] == 0 && buf[3] == 0)
 		{   // UTF-32LE


Thomas Kuehne wrote on Fri, 19 Nov 2004 14:19:33 +0000 (UTC):
 Let's try to track down  the real problem.

 change the string into "\u00E4\u00F6\u00FC\u00DF" (ae)(oe)(ue)(ss).

 If the output is still garbage try printf instead of puts.
I've tested the above string. The result for both puts and printf is that either it doesn't compile or it outputs garbage: MS-DOS/Western (ISO-8859-1), UTF-16, UTF-8 compile fine but output garbage under MS-DOS (Windows 98 SE, German edition)
Clearly seems to be a shell problem.
Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
(1) "semicolon expected, not '.'"
(1) no identifier for declarator
This is a known problem. If you use UTF-16/32 without a BOM(byte order mark) the current dmd assumes UTF-8 and subsequently fails. http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_16be http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_16le http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_32be http://svn.kuehne.cn/dstress/www/dstress.html#encoding_utf_32le
Nov 19 2004
parent Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:
Thomas Kuehne wrote on Fri, 19 Nov 2004 19:26:25 +0100:
Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
(1) "semicolon expected, not '.'"
(1) no identifier for declarator
This is a known problem. If you use UTF-16/32 without a BOM(byte order mark) the current dmd assumes UTF-8 and subsequently fails.
The real problem was that it stripped the bytes of a BOM that was not there. Thomas
Nov 19 2004
prev sibling next sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Mathias Bierschenk wrote:
 How can I print German characters? I've tried the following simple program:
 
 import std.c.stdio;
 
 int main()
 {
   puts("äöüßÄÖÜ"); // German characters
 
   return 0;
 }
 
<snip>
 Unicode (UTF-16 and UTF-8):
 both compile fine but output garbage under MS-DOS
 (Windows 98 SE, German edition)
You can include MS-DOS characters in a string, but only as escape codes. In your case (assuming your code page is 437, 850, 852, 853 or 857):

     puts("\x84\x94\x81\xE1\x8E\x99\x9A");

Since the whole point of this is for outputting to MS-DOS, you could argue that this is an appropriate use of non-Unicode characters in a string.

Stewart.
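Instead of hard-coding the escaped bytes, the same seven byte values can be produced from ordinary Unicode text. What follows is only a sketch under assumptions: a current D2 compiler, a console using one of the code pages listed above, and a hypothetical helper `toDosGerman` that knows nothing beyond ASCII and these seven letters.

```d
import std.stdio : stdout;

/// Encodes `text` for a DOS console, using the byte values quoted above.
/// Only ASCII and the seven German letters are handled; anything else
/// becomes '?'. (Hypothetical helper, not a general converter.)
ubyte[] toDosGerman(string text)
{
    ubyte[] result;
    foreach (dchar c; text) // foreach over dchar decodes the UTF-8 string
    {
        ubyte b;
        switch (c)
        {
            case '\u00E4': b = 0x84; break; // ä
            case '\u00F6': b = 0x94; break; // ö
            case '\u00FC': b = 0x81; break; // ü
            case '\u00DF': b = 0xE1; break; // ß
            case '\u00C4': b = 0x8E; break; // Ä
            case '\u00D6': b = 0x99; break; // Ö
            case '\u00DC': b = 0x9A; break; // Ü
            default:
                b = 0x3F;                   // '?'
                if (c < 0x80)
                    b = cast(ubyte) c;      // ASCII passes through unchanged
                break;
        }
        result ~= b;
    }
    return result;
}

void main()
{
    // rawWrite sends the bytes untouched, bypassing any UTF-8 handling.
    stdout.rawWrite(toDosGerman("\u00E4\u00F6\u00FC\u00DF\u00C4\u00D6\u00DC\n"));
}
```

The table is deliberately tiny; a real converter would cover the whole code page and would have to know which code page the console is actually using.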
Nov 19 2004
parent "Mathias Bierschenk" <Mathias.Bierschenk web.de> writes:
On Fri, 19 Nov 2004 16:02:17 +0000, Stewart Gordon
<smjg_1998 yahoo.com> wrote:

 You can include MS-DOS characters in a string, but only as escape codes.  
   In your case (assuming your code page is 437, 850, 852, 853 or 857):

      puts("\x84\x94\x81\xE1\x8E\x99\x9A");

 Since the whole point of this is for outputting to MS-DOS, you could  
 argue that this is appropriate use of non-Unicode characters in a string.
Yep, that works. Maybe this is more portable (encoded as UTF-8):

import std.c.stdio;

int main()
{
   version(Win32)
     puts("\x84\x94\x81\xE1\x8E\x99\x9A");
   else
     puts("äöüßÄÖÜ");

   return 0;
}

What do you think?!
Nov 19 2004
prev sibling next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Mathias Bierschenk" <Mathias.Bierschenk web.de> wrote in message
news:opshpm3zlo9gaiaw dialin-212-144-051-051.arcor-ip.net...
 How can I print German characters? I've tried the following simple
program:
 import std.c.stdio;

 int main()
 {
    puts("äöüßÄÖÜ"); // German characters

    return 0;
 }

 As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German
 edition) I tried Mozilla to save the source code file with different
 character encodings but none worked as expected. Here's what I tried using
 the current DMD version:

 MS-DOS encoding as performed by Microsoft's EDIT editor:
Using Microsoft Notepad, click on "Save As" and under encoding, select "UTF-8". Then, use std.stdio.writef() instead of std.c.stdio.puts(), and it should work.
Nov 19 2004
next sibling parent reply "Mathias Bierschenk" <Mathias.Bierschenk web.de> writes:
On Fri, 19 Nov 2004 14:13:32 -0800, Walter
<newshound digitalmars.com> wrote:

 Using Microsoft Notepad, click on "Save As" and under encoding, select
 "UTF-8". Then, use std.stdio.writef() instead of std.c.stdio.puts(), and  
 it
 should work.
No, that doesn't work.

Some others here have tracked down the main problem: The Win9x console doesn't support Unicode. Instead one can only make use of some DOS escape sequences. The only thing that works so far (thanks to Stewart Gordon):

puts("\x84\x94\x81\xE1\x8E\x99\x9A"); // äöüßÄÖÜ

or, more portable(?), written by myself:

import std.c.stdio;

int main()
{
   version(Win32)
     puts("\x84\x94\x81\xE1\x8E\x99\x9A");
   else
     puts("äöüßÄÖÜ");

   return 0;
}

Carlos Santander B. suggested another solution, based on Y. Tomino's Win32 headers, that seems to convert characters at run-time. I can't get it to print anything at the moment, so I can't yet tell if it is better than what I have got so far.

Maybe someone should write a tutorial about input/output basics in D? ;-)
Nov 20 2004
parent reply Roberto Mariottini <Roberto_member pathlink.com> writes:
In article <opshrgt5ci9gaiaw dialin-212-144-051-198.arcor-ip.net>, Mathias
Bierschenk says...
[...]
Some others here have tracked down the main problem: The Win9x console  
doesn't support Unicode.
This problem affects Windows NT/2000/XP as well. Consoles use the OEM character set, and D doesn't support this.
 Instead one can only make use of some DOS escape  
sequences. The only thing that works so far (thanks to Stewart Gordon):

puts("\x84\x94\x81\xE1\x8E\x99\x9A"); // äöüßÄÖÜ
These are binary encodings of OEM characters.
or, more portable(?), written by myself:

import std.c.stdio;

int main()
{
   version(Win32)
     puts("\x84\x94\x81\xE1\x8E\x99\x9A");
   else
     puts("äöüßÄÖÜ");

   return 0;
}
This is not portable at all. It works only if the OEM codepage in use is compatible with CP437 for those code points.

The solution is to use CharToOemW, a function that translates a string from UTF-16 to the OEM character set (when possible, of course). See an example:

<code>
import std.stdio;
import std.c.stdio;
import std.c.windows.windows;

extern (Windows)
{
  export BOOL CharToOemW(
    LPCWSTR lpszSrc,  // string to translate
    LPSTR lpszDst     // translated string
  );
}

int main()
{
  puts("-- untranslated --");
  puts("äöüßÄÖÜ");
  writef("äöüßÄÖÜ\n");

  puts("-- translated --");
  wchar[] mess = "äöüßÄÖÜ";
  // one extra element for the terminating zero that CharToOemW writes
  char[] OEMmess = new char[mess.length + 1];
  CharToOemW(mess, OEMmess);
  puts(OEMmess);
  writef(OEMmess);

  return 0;
}
</code>

This outputs:

-- untranslated --
-- translated --
äöüßÄÖÜ
Error: invalid UTF-8 sequence

Here you can note that puts() works, but writef() does not. That's because writef expects OEMmess to be UTF-8. The result is that writef doesn't work, in any case, under Windows.

Note also that on Windows 95/98/Me this works only if the Microsoft Layer for Unicode is installed.

The only alternative is to use CharToOemA, which converts from the current ANSI codepage (for most western countries: Windows-1252) to the current OEM codepage. I don't know how to translate UTF-8 to ANSI.
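A related way to do the run-time conversion is WideCharToMultiByte with the console's active output code page, the pair of calls that also appears (commented out) in Carlos Santander's program elsewhere in this thread. The sketch below swaps that technique in for CharToOemW; it assumes a current D2 compiler with the `core.sys.windows` bindings and an NT-based Windows, and `toConsoleBytes` is a hypothetical name.

```d
import std.stdio : stdout;

version (Windows)
{
    import core.sys.windows.windows;
    import std.utf : toUTF16;

    /// Converts UTF-8 `text` to the console's active output code page.
    /// (Hypothetical helper; characters the code page lacks are replaced
    /// by the system's default substitute.)
    ubyte[] toConsoleBytes(string text)
    {
        if (text.length == 0)
            return null;
        wstring wide = toUTF16(text);
        const cp = GetConsoleOutputCP();
        // First call: ask how many bytes the converted string needs.
        int needed = WideCharToMultiByte(cp, 0, wide.ptr, cast(int) wide.length,
                                         null, 0, null, null);
        if (needed <= 0)
            return null;
        auto result = new ubyte[needed];
        // Second call: convert into the buffer.
        WideCharToMultiByte(cp, 0, wide.ptr, cast(int) wide.length,
                            cast(char*) result.ptr, needed, null, null);
        return result;
    }
}
else
{
    /// Assumption: on other systems the terminal already accepts UTF-8,
    /// so the bytes are passed through unchanged.
    ubyte[] toConsoleBytes(string text)
    {
        return cast(ubyte[]) text.dup;
    }
}

void main()
{
    // ä ö ü ß Ä Ö Ü written as escapes so the source file stays pure ASCII.
    stdout.rawWrite(toConsoleBytes("\u00E4\u00F6\u00FC\u00DF\u00C4\u00D6\u00DC\n"));
}
```

Writing the converted bytes with rawWrite avoids the "invalid UTF-8 sequence" error described above, since nothing on the way to the console tries to interpret them as UTF-8.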
Carlos Santander B. suggested another solution, based on Y. Tomino's Win32  
headers, that seems to convert characters at run-time. I can't get it to  
print anything at the moment, so I can't yet tell if it is better than  
what I have got so far.
I haven't tested it either.
Maybe someone should write a tutorial about input/output basics in D? ;-)
Yes, please do it. Ciao
Nov 22 2004
next sibling parent Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:
Roberto Mariottini wrote on Mon, 22 Nov 2004 09:52:27 +0000 (UTC):
 Here you can not that puts() works, but writef() not. That's because writefs
 expects OEMmess to be UTF-8.
 The results are that writef doesn't work, in any case, under Windows.

 Note also that on Windows 95/98/Me this works only if the Microsoft Layer for
 Unicode is installed.

 The only alternative is to use CharToOemA, that converts the current ANSI
 codepage (for most western countries: Windows-1252) to current OEM codepage.
 I don't know how to translate UTF-8 to ANSI.
Maybe you could take a look at dmd/src/phobos/std/c/stdio.d? You should be able to change it so that, if "FILE*" equals stdout, stderr or stdlog and the hosting environment is Windows, CharToOemA is called before C's "fputs", "fputc", "puts" or "putw" is called. The consequence would be that all writef/*put* calls should produce reasonable output.

To do the same with "printf" you'd have to modify dmd/src/phobos/internal/object.d and dmd/src/phobos/object.d.

I'm currently not running Windows, but it would be interesting to know whether "fputws" works correctly for non-ASCII chars.

Thomas
Nov 22 2004
prev sibling parent =?ISO-8859-1?Q?Anders_F_Bj=F6rklund?= <afb algonet.se> writes:
Roberto Mariottini wrote:

Some others here have tracked down the main problem: The Win9x console  
doesn't support Unicode.
This problem is for Windows NT/2000/XP also. Consoles use OEM character set. D doesn't support this.
Mac OS X has a similar issue (uses MacRoman/ISO-8859-1 by default), but fortunately you can choose UTF-8 from the Terminal settings...
 This is not portable at all. It work only if the OEM codepage used is
compatible
 with CP437 for those codeponits.
 
 The solution is to use CharToOemW, a function that translates a string from
 UTF-16 to OEM character set (when possible, of course).
Or supply similar functions in D, which could be an alternative ?
Carlos Santander B. suggested another solution, based on Y. Tomino's Win32  
headers, that seems to convert characters at run-time. I can't get it to  
print anything at the moment, so I can't yet tell if it is better than  
what I have got so far.
I have written some basic lookups (i.e. "wchar mapping[256];") using the tables that are all available on the Unicode site:

ISO Latin-1 (simple!)
http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
DOS Latin Console
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT
Windows "Latin-1"
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
Mac OS Roman
http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT

(there are a few dozen others, but I think these are the most common?)

But it needs a more thought-through API to be really useful... And some optimization to do the reverse lookup, I suppose? I'm thinking one array of char[256], and one char[] of exceptions (where 0x00-0xFF would use the lookup, and 0x0100-0xFFFF the hash).

--anders
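The table idea can be sketched roughly as follows, assuming a current D2 compiler. ISO-8859-1 is used because its forward table is simply the identity mapping; the other code pages would be filled in from the mapping files listed above. All function names are hypothetical, and the reverse lookup here uses a single associative array for everything, which is simpler than the array-plus-exceptions layout proposed above.

```d
import std.stdio : writeln;

/// Forward table: byte value -> Unicode code point (ISO-8859-1).
wchar[256] makeLatin1Table()
{
    wchar[256] table;
    foreach (i, ref entry; table)
        entry = cast(wchar) i;
    return table;
}

/// Reverse table, built once from any forward table.
ubyte[wchar] makeReverse(const wchar[256] forward)
{
    ubyte[wchar] reverse;
    foreach (i, cp; forward)
        reverse[cp] = cast(ubyte) i;
    return reverse;
}

/// Encodes UTF-8 `text` using the reverse table;
/// characters without a mapping become '?'.
ubyte[] encode(string text, const ubyte[wchar] reverse)
{
    ubyte[] result;
    foreach (dchar c; text) // decodes UTF-8 into code points
    {
        ubyte b = 0x3F;     // '?'
        if (c <= 0xFFFF)
        {
            if (auto p = cast(wchar) c in reverse)
                b = *p;
        }
        result ~= b;
    }
    return result;
}

void main()
{
    auto reverse = makeReverse(makeLatin1Table());
    writeln(encode("\u00E4\u00F6\u00FC", reverse)); // [228, 246, 252]
}
```

Swapping in a CP437 or CP1252 forward table would leave `makeReverse` and `encode` unchanged, which is the main attraction of keeping the tables as data.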
Nov 22 2004
prev sibling parent Roberto Mariottini <Roberto_member pathlink.com> writes:
In article <cnlrlp$14b6$1 digitaldaemon.com>, Walter says...

[...]
Using Microsoft Notepad, click on "Save As" and under encoding, select
"UTF-8". Then, use std.stdio.writef() instead of std.c.stdio.puts(), and it
should work.
The code doesn't work anyway; see my other post for details. The biggest problem is that writef() doesn't work on Windows, on either 9x/Me or NT/2000/XP. Ciao
Nov 22 2004
prev sibling next sibling parent "Carlos Santander B." <csantander619 gmail.com> writes:
"Mathias Bierschenk" <Mathias.Bierschenk web.de> escribió en el mensaje
news:opshpm3zlo9gaiaw dialin-212-144-051-051.arcor-ip.net...
| How can I print German characters? I've tried the following simple program:
|
| import std.c.stdio;
|
| int main()
| {
|   puts("äöüßÄÖÜ"); // German characters
|
|   return 0;
| }
|
| As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German
| edition) I tried Mozilla to save the source code file with different
| character encodings but none worked as expected. Here's what I tried using
| the current DMD version:
|
| MS-DOS encoding as performed by Microsoft's EDIT editor:
| (5) "invalid UTF-sequence"
|
| Western (ISO-8859-1):
| (5) "invalid UTF-sequence"
|
| Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
| (1) "semicolon expected, not '.'"
| (1) no identifier for declarator
|
| Unicode (UTF-16 and UTF-8):
| both compile fine but output garbage under MS-DOS
| (Windows 98 SE, German edition)

I was investigating the same thing recently. What I really wanted was a Windows
console that did Unicode, but I couldn't find one.
But I came across a C++ program that lets you output UTF-16 strings
(wchar_t * in C++ on Windows). Translated to D, the program looks like this:

import std.file;
import std.string;
import std.utf;

import win32.winbase;
import win32.wincon;
import win32.winnls;

void main ()
{
    wchar [] tmp_w = toUTF16(cast(char[])"carlos andrés");
    wchar *   szwOut = tmp_w;
    DWORD      dwBytesWritten;
    DWORD      fdwMode;
    HANDLE     outHandle = GetStdHandle(STD_OUTPUT_HANDLE);

    if( (GetFileType(outHandle) & FILE_TYPE_CHAR) && GetConsoleMode( outHandle, &fdwMode) )
        WriteConsoleW( outHandle, szwOut, wcslen(szwOut), &dwBytesWritten, null);
    else
    {
        int nOutputCP = GetConsoleOutputCP();
        //int charCount = WideCharToMultiByte(nOutputCP, 0, szwOut, -1, null, 0, null, null);
        //char* szaStr = new char[charCount];
        //WideCharToMultiByte( nOutputCP, 0, szwOut, -1, szaStr, charCount, null, null);
        char [] tmp = toUTF8(tmp_w);
        char * szaStr = toMBSz(tmp);
        int charCount = tmp.length;
        WriteFile(outHandle, szaStr, charCount-1, &dwBytesWritten, null);
    }

}

It uses Y. Tomino's Win32 headers. The encoding in which the source file is saved doesn't seem to
matter.
I really don't remember where I found the original, so you can use this code as
you want since it's not mine.
For Linux, I don't think there's any problem since it uses UTF-8 by default (at
least with Red Hat based distros, in my experience).
BTW, if someone knows about a Unicode console for Windows, please let me know.

-----------------------
Carlos Santander Bernal
Nov 19 2004
prev sibling parent reply Manfred Hansen <manfred toppoint.de> writes:
Hello,

i have the same problem on Linux Debian (sarge) and SUSE 9.1.
"invalid UTF-8 sequence"
Editor is vim .

Manfred

Mathias Bierschenk wrote:

 How can I print German characters? I've tried the following simple
 program:
 
 import std.c.stdio;
 
 int main()
 {
    puts("äöüßÄÖÜ"); // German characters
 
    return 0;
 }
 
 As the normal MS-DOS EDIT encoding didn't work (Windows 98 SE, German
 edition) I tried Mozilla to save the source code file with different
 character encodings but none worked as expected. Here's what I tried using
 the current DMD version:
 
 MS-DOS encoding as performed by Microsoft's EDIT editor:
 (5) "invalid UTF-sequence"
 
 Western (ISO-8859-1):
 (5) "invalid UTF-sequence"
 
 Unicode (UTF-16 and UTF-32, each with Big Endian and Little Endian):
 (1) "semicolon expected, not '.'"
 (1) no identifier for declarator
 
 Unicode (UTF-16 and UTF-8):
 both compile fine but output garbage under MS-DOS
 (Windows 98 SE, German edition)
Nov 19 2004
parent reply Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:
Manfred Hansen wrote on Sat, 20 Nov 2004 08:53:41 +0100:
 Hello,

 i have the same problem on Linux Debian (sarge) and SUSE 9.1.
 "invalid UTF-8 sequence"
 Editor is vim .
Vim 6.2 works for me. Are you sure your locale is set to use UTF-8? Please send me a sample, if this problem persists. Thomas
Nov 20 2004
parent Manfred Hansen <manfred toppoint.de> writes:
Thomas Kuehne wrote:

 
 Manfred Hansen wrote on Sat, 20 Nov 2004 08:53:41 +0100:
 Hello,

 i have the same problem on Linux Debian (sarge) and SUSE 9.1.
 "invalid UTF-8 sequence"
 Editor is vim .
Vim 6.2 works for me. Are you sure your locale is set to use UTF-8? Please send me a sample, if this problem persists. Thomas
My locale:

hansen@hansen-lx:~/d$ locale
LANG=de_DE@euro
LC_CTYPE="de_DE@euro"
LC_NUMERIC="de_DE@euro"
LC_TIME="de_DE@euro"
LC_COLLATE="de_DE@euro"
LC_MONETARY="de_DE@euro"
LC_MESSAGES="de_DE@euro"
LC_PAPER="de_DE@euro"
LC_NAME="de_DE@euro"
LC_ADDRESS="de_DE@euro"
LC_TELEPHONE="de_DE@euro"
LC_MEASUREMENT="de_DE@euro"
LC_IDENTIFICATION="de_DE@euro"
LC_ALL=

Thank you for the advice, I will try to switch to UTF-8.

Regards,
Manfred
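The locale question can also be checked from a program. A rough sketch, assuming a current D2 compiler and the usual POSIX precedence of LC_ALL over LC_CTYPE over LANG; `isUtf8Locale` and `effectiveCtype` are hypothetical helper names.

```d
import std.algorithm.searching : canFind;
import std.process : environment;
import std.stdio : writeln;
import std.uni : toLower;

/// Returns true if a locale name such as "de_DE.UTF-8" selects UTF-8.
bool isUtf8Locale(string locale)
{
    string lower = locale.toLower;
    return lower.canFind("utf-8") || lower.canFind("utf8");
}

/// Picks the locale that governs character handling:
/// LC_ALL first, then LC_CTYPE, then LANG.
string effectiveCtype()
{
    foreach (name; ["LC_ALL", "LC_CTYPE", "LANG"])
    {
        string value = environment.get(name, "");
        if (value.length)
            return value;
    }
    return "";
}

void main()
{
    string locale = effectiveCtype();
    writeln("locale: ", locale);
    writeln("UTF-8:  ", isUtf8Locale(locale));
}
```

A name such as `de_DE@euro` carries no UTF-8 marker and on the systems of that time typically meant ISO-8859-15, which would explain why an editor running under it saves files the compiler rejects.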
Nov 20 2004