digitalmars.D.bugs - writef doesn't work on Windows XP console

Roberto Mariottini (37/37) Dec 01 2004 Hi.

Ben Hinkle (4/37) Dec 01 2004 This is expected behavior. Writef takes utf-8 strings hence the error th...

Roberto Mariottini (6/24) Dec 01 2004 ^^^^^^^^^^^^^^^^^^^^

Anders F Björklund (9/18) Dec 01 2004 It works just fine, but you *have* to set your console to UTF-8.

Stewart Gordon (10/18) Dec 01 2004

Anders F Björklund (7/13) Dec 01 2004 Sounds like a good idea. I have some very small encoding additions...

Stewart Gordon (11/19) Dec 01 2004 I wrote that, but then discovered that the 'norm' (if Phobos is anything...

Anders F Björklund (6/14) Dec 01 2004 Or one can do like Java and use wchar[] and wchar, and ignore the bloat

Stewart Gordon (8/9) Dec 01 2004

Ben Hinkle (17/30) Dec 01 2004 Note std.stream now has BOM support. Call readBOM or writeBOM in

Roberto Mariottini (5/18) Dec 01 2004 The Windows function to accomplish this task is the already cited CharTo...

Ben Hinkle (14/35) Dec 03 2004 fun -

kris (5/12) Dec 03 2004 I'd like to encourage you to do so. If you take that approach I'll write...

Roberto Mariottini (7/12) Dec 01 2004 Windows XP does *not* support UTF-8 consoles. Neither Windows NT/2000.

Anders F Björklund (12/19) Dec 01 2004 Moral of the story being that 8-bit strings should be declared ubyte[].

Roberto Mariottini (12/25) Dec 02 2004 In article ,

Anders F Björklund (22/40) Dec 02 2004 That is why it was skipped, but you still need to be aware of the

Regan Heath (20/61) Dec 02 2004 I think it's a good idea. I reckon it will initially cause people to be ...

Anders F Björklund (5/22) Dec 03 2004 I'm not using Windows, but a modern system with an UTF-8 console ;-)

Roberto Mariottini <Roberto_member pathlink.com> writes:

Hi.
I can't make writef work on Windows XP using non-7bit-ASCII characters.

The attached test program outputs:

-- untranslated --


-- translated --
�������
Error: invalid UTF-8 sequence

Test program:

import std.stdio;
import std.c.stdio;
import std.c.windows.windows;
 
extern (Windows)
{
  export BOOL CharToOemW(
    LPCWSTR lpszSrc,  // string to translate
    LPSTR lpszDst     // translated string
  );
}
 
int main()
{
   puts("-- untranslated --");
   puts("�������");
   writef("�������\n");
 
   puts("-- translated --");
   wchar[] mess = "�������";
   char[] OEMmess = new char[mess.length];
   CharToOemW(mess, OEMmess);
   puts(OEMmess);
   writef(OEMmess);
 
   return 0;
}

Dec 01 2004

"Ben Hinkle" <ben.hinkle gmail.com> writes:

"Roberto Mariottini" <Roberto_member pathlink.com> wrote in message 
news:cok6li$1pkp$1 digitaldaemon.com...
 Hi.
 I can't make writef work on Windows XP using non-7bit-ASCII characters.

 The attached test program outputs:

 -- untranslated --


 -- translated --
 �������
 Error: invalid UTF-8 sequence

 Test program:

 import std.stdio;
 import std.c.stdio;
 import std.c.windows.windows;

 extern (Windows)
 {
  export BOOL CharToOemW(
    LPCWSTR lpszSrc,  // string to translate
    LPSTR lpszDst     // translated string
  );
 }

 int main()
 {
   puts("-- untranslated --");
   puts("�������");
   writef("�������\n");

   puts("-- translated --");
   wchar[] mess = "�������";
   char[] OEMmess = new char[mess.length];
   CharToOemW(mess, OEMmess);
   puts(OEMmess);
   writef(OEMmess);

   return 0;
 }

This is expected behavior. writef takes UTF-8 strings, hence the error that
the supplied string is not UTF-8 (because it isn't).
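
As a rough illustration in current D (D2 Phobos, not the 2004 library), std.utf.validate throws a UTFException on malformed input, which is presumably the same kind of check that rejects the OEM-encoded buffer here. The helper name isValidUtf8 is made up for this sketch, and 0x84 is assumed to be the byte that IBM-850 uses for 'ä'.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if the raw bytes form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // validate() throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

The two bytes 0xC3 0xA4 (UTF-8 for 'ä') pass this check, while a lone 0x84 does not, because 0x84 can only appear in UTF-8 as a continuation byte.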

Dec 01 2004

Roberto Mariottini <Roberto_member pathlink.com> writes:

In article <coketl$25kv$1 digitaldaemon.com>, Ben Hinkle says...

[...]
 int main()
 {
   puts("-- untranslated --");
   puts("�������");
   writef("�������\n");
   ^^^^^^^^^^^^^^^^^^^^

   puts("-- translated --");
   wchar[] mess = "�������";
   char[] OEMmess = new char[mess.length];
   CharToOemW(mess, OEMmess);
   puts(OEMmess);
   writef(OEMmess);

   return 0;
 }

This is expected behavior. Writef takes utf-8 strings hence the error that 
the supplied string is not in utf-8 (because it isn't). 

The first writef uses a UTF-8 string, but it doesn't print what is expected.
At least one of the two calls should work, but neither does.

Ciao

Dec 01 2004

Anders F Björklund <afb algonet.se> writes:

Roberto Mariottini wrote:

 The first writef uses an UTF-8 string, but it doesn't print what expected.
 Either one should work, but both don't work.

It works just fine, but you *have* to set your console to UTF-8.
D does *not* support consoles or shells which are not Unicode... :(

Simple example:
 import std.stdio;
 void main()
 {
   writefln("äöüßÄÖÜ");
 }

In UTF-8 Terminal mode, this prints:
 äöüßÄÖÜ

In Latin-1 Terminal mode, you get:
 Ã¤Ã¶Ã¼ÃÃÃÃ

I'm assuming it prints similar garbage on a non-Unicode XP console ?
(being a Mac user myself I have no idea how to change it on Windows)
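
The Latin-1 garbage above can be reproduced without touching a terminal. This is a small sketch in current D (D2), with a made-up function name: it takes each byte of the UTF-8 encoding and treats it as a Latin-1 code point, which is what a Latin-1 terminal effectively does.

```d
// Hypothetical helper: reinterpret every UTF-8 code unit of `s` as one
// Latin-1 character and return the result as a normal UTF-8 string.
string utf8BytesAsLatin1(string s)
{
    string result;
    foreach (ubyte b; cast(immutable(ubyte)[]) s)
        result ~= cast(dchar) b;   // appending a dchar re-encodes it as UTF-8
    return result;
}
```

For "ä" (bytes 0xC3 0xA4) this yields "Ã¤", matching the start of the Latin-1 output shown above. Pure ASCII input comes back unchanged.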

--anders

Dec 01 2004

Stewart Gordon <smjg_1998 yahoo.com> writes:

Anders F Björklund wrote:
 Roberto Mariottini wrote:
 
 The first writef uses an UTF-8 string, but it doesn't print what 
 expected.
 Either one should work, but both don't work.

 
 It works just fine, but you *have* to set your console to UTF-8.
 D does *not* support consoles or shells which are not Unicode... :(

<snip>

A while back I suggested writing some classes to do text file I/O, which 
would have conversion capabilities built in.

http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/6089

I guess it would extend to console I/O.

Stewart.

-- 
My e-mail is valid but not my primary mailbox.  Please keep replies on 
the 'group where everyone may benefit.

Dec 01 2004

Anders F Björklund <afb algonet.se> writes:

Stewart Gordon wrote:

 A while back I suggested writing some classes to do text file I/O, which 
 would have conversion capabilities built in.
 
 http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/6089
 
 I guess it would extend to console I/O.

Sounds like a good idea. I have some very small encoding additions...
(just a lookup for each supported charset, without entire icu/iconv)

http://www.algonet.se/~afb/d/mapping.zip
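
To give an idea of what "just a lookup" means, here is a minimal sketch in current D (D2). It is not the code from the archive above: the function names are invented, and only a handful of IBM-850 entries are filled in, with everything unmapped replaced by '?'.

```d
// Hypothetical lookup: map one code point to its IBM-850 byte.
ubyte toCp850(dchar c)
{
    if (c < 0x80)
        return cast(ubyte) c;        // ASCII is identical in both encodings
    switch (c)
    {
        case '\u00E4': return 0x84;  // ä
        case '\u00F6': return 0x94;  // ö
        case '\u00FC': return 0x81;  // ü
        case '\u00DF': return 0xE1;  // ß
        case '\u00C4': return 0x8E;  // Ä
        case '\u00D6': return 0x99;  // Ö
        case '\u00DC': return 0x9A;  // Ü
        case '\u00E5': return 0x86;  // å
        default:       return 0x3F;  // '?' for anything not in this partial table
    }
}

// Hypothetical encoder: UTF-8 in, IBM-850 bytes out.
ubyte[] encodeCp850(string utf8)
{
    ubyte[] result;
    foreach (dchar c; utf8)          // foreach over dchar decodes the UTF-8
        result ~= toCp850(c);
    return result;
}
```

The result is deliberately a ubyte[] rather than a char[], since the bytes are no longer UTF-8.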

And I think it should use char[] instead of dchar/dchar[], but that's
rather minor (and it should probably overload all three string types)

--anders

Dec 01 2004

Stewart Gordon <smjg_1998 yahoo.com> writes:

Anders F Björklund wrote:
 Stewart Gordon wrote:
 
 A while back I suggested writing some classes to do text file I/O, 
 which would have conversion capabilities built in.

 http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/6089


<snip>
 And I think it should use char[] instead of dchar/dchar[], but that's
 rather minor (and it should probably overload all three string types)

I wrote that, but then discovered that the 'norm' (if Phobos is anything 
to go by) is for strings to be manipulated as UTF-8, while dchar gets 
used for individual characters.  Maybe it should 'normally' use char[]. 
  After all, that's the most compact for text in alphabets below U+0800.
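
The size argument can be checked directly. A sketch in current D (D2 Phobos), with helper names invented for the example, that reports how many bytes the same text needs in each of the three string types:

```d
import std.utf : toUTF16, toUTF32;

// Hypothetical helpers: storage size of `s` in each UTF encoding form.
size_t utf8Bytes(string s)  { return s.length * char.sizeof; }
size_t utf16Bytes(string s) { return toUTF16(s).length * wchar.sizeof; }
size_t utf32Bytes(string s) { return toUTF32(s).length * dchar.sizeof; }
```

For "abc" this gives 3, 6 and 12 bytes. For U+00E5 (å) it gives 2, 2 and 4, so UTF-8 is never larger than UTF-16 below U+0800. From U+0800 upwards, for example U+20AC (the euro sign), UTF-8 needs 3 bytes against 2 for UTF-16.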

But you're probably right that it should overload the lot.

Stewart.

-- 
My e-mail is valid but not my primary mailbox.  Please keep replies on 
the 'group where everyone may benefit.

Dec 01 2004

Anders F Björklund <afb algonet.se> writes:

Stewart Gordon wrote:

 And I think it should use char[] instead of dchar/dchar[], but that's
 rather minor (and it should probably overload all three string types)

 
 I wrote that, but then discovered that the 'norm' (if Phobos is anything 
 to go by) is for strings to be manipulated as UTF-8, while dchar gets 
 used for individual characters.  Maybe it should 'normally' use char[]. 
 After all, that's the most compact for text in alphabets below U+0800.

Or one can do like Java and use wchar[] and wchar, and ignore the bloat
for ASCII strings - and hack in support for surrogates some other way...

Most of the consoles mentioned only support old 16-bit Unicode anyway ?
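
The surrogate issue is easy to see in code. A small sketch in current D (D2 Phobos), where the wrapper name is an assumption and std.utf.codeLength does the actual work:

```d
import std.utf : codeLength;

// Hypothetical helper: how many wchar (UTF-16) code units one code point needs.
size_t wcharUnits(dchar c)
{
    return codeLength!wchar(c);
}
```

Anything in the Basic Multilingual Plane, such as U+00E5, needs one wchar. A code point above U+FFFF, such as U+1D11E, needs a surrogate pair, so indexing a wchar[] by position no longer lands on whole characters.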

 But you're probably right that it should overload the lot.

It's the D way :-)

--anders

Dec 01 2004

Stewart Gordon <smjg_1998 yahoo.com> writes:

Anders F Björklund wrote:
<snip>
 Most of the consoles mentioned only support old 16-bit Unicode anyway ?

<snip>

MS-DOS, and hence DOS windows in Win9x, only support 8-bit IBM codepages.

Stewart.

-- 
My e-mail is valid but not my primary mailbox.  Please keep replies on 
the 'group where everyone may benefit.

Dec 01 2004

"Ben Hinkle" <bhinkle mathworks.com> writes:

"Stewart Gordon" <smjg_1998 yahoo.com> wrote in message
news:cokn8o$2id0$1 digitaldaemon.com...
 Anders F Björklund wrote:
 Roberto Mariottini wrote:

 The first writef uses an UTF-8 string, but it doesn't print what
 expected.
 Either one should work, but both don't work.

 It works just fine, but you *have* to set your console to UTF-8.
 D does *not* support consoles or shells which are not Unicode... :(

 <snip>

 A while back I suggested writing some classes to do text file I/O, which
 would have conversion capabilities built in.

 http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/6089

Note std.stream now has BOM support. Call readBOM or writeBOM in
EndianStream. Now that you mention it, it might be nice to make another
Stream subclass and add support for the "native" encodings. It sounds fun -
I'll give it a shot. It should be pretty easy actually since you just
override writeString and writeStringW to call some OS function to convert
the string or char from utf to native encoding.
Supporting arbitrary encodings would probably be left for non-phobos
libraries since they would presumably require something like ICU or
libiconv. So basically what I have in mind is that to write to stdout with
native encoding you'd have to write

import std.stream;
...
stdoutn = NativeTextStream(stdout);
stdoutn.writef(<some utf encoded string>);
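
A rough sketch of that shape in current D (D2), under loud assumptions: NativeTextWriter and toLatin1 are invented names, not part of std.stream, and the transcoder targets plain ISO-8859-1 only so that the example stays portable. A real version would call an OS conversion function for the console's actual code page.

```d
import std.stdio : File, stdout;

// Hypothetical transcoder: UTF-8 to ISO-8859-1, with '?' for anything
// outside that repertoire.
ubyte[] toLatin1(string utf8)
{
    ubyte[] result;
    foreach (dchar c; utf8)          // foreach over dchar decodes the UTF-8
        result ~= c < 0x100 ? cast(ubyte) c : cast(ubyte) 0x3F;
    return result;
}

// Hypothetical stand-in for the proposed NativeTextStream: it transcodes
// each string and hands the raw bytes to the wrapped file.
struct NativeTextWriter
{
    File target;
    ubyte[] function(string) transcode;

    void write(string utf8)
    {
        target.rawWrite(transcode(utf8));
    }
}
```

Usage would then look like `auto stdoutn = NativeTextWriter(stdout, &toLatin1);` followed by `stdoutn.write("hall\u00e5 v\u00e4rlden!\n");`. Keeping the transcoder as a separate function pointer means the platform-specific part can be swapped without touching the wrapper.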

-Ben

Dec 01 2004

Roberto Mariottini <Roberto_member pathlink.com> writes:

In article <cokqtc$2okj$1 digitaldaemon.com>, Ben Hinkle says...

[...]
Now that you mention it it might be nice to make another
Stream subclass and add support for the "native" encodings. It sounds fun -
I'll give it a shot. It should be pretty easy actually since you just
override writeString and writeStringW to call some OS function to convert
the string or char from utf to native encoding.

The Windows function to accomplish this task is the already cited CharToOemW().

Supporting arbitrary encodings would probably be left for non-phobos
libraries since they would presumably require something like ICU or
libiconv. So basically what I have in mind is that to write to stdout with
native encoding you'd have to write

import std.stream;
...
stdoutn = NativeTextStream(stdout);
stdoutn.writef(<some utf encoded string>);

The default stdout should be a NativeTextStream on Windows.

Ciao

Dec 01 2004

"Ben Hinkle" <bhinkle mathworks.com> writes:

"Roberto Mariottini" <Roberto_member pathlink.com> wrote in message
news:comhrd$26ug$1 digitaldaemon.com...
 In article <cokqtc$2okj$1 digitaldaemon.com>, Ben Hinkle says...

 [...]
Now that you mention it it might be nice to make another
Stream subclass and add support for the "native" encodings. It sounds fun -
I'll give it a shot. It should be pretty easy actually since you just
override writeString and writeStringW to call some OS function to convert
the string or char from utf to native encoding.

 The Windows function to accomplish this task is the already cited CharToOemW().

Supporting arbitrary encodings would probably be left for non-phobos
libraries since they would presumably require something like ICU or
libiconv. So basically what I have in mind is that to write to stdout with
native encoding you'd have to write

import std.stream;
...
stdoutn = NativeTextStream(stdout);
stdoutn.writef(<some utf encoded string>);

 The default stdout should be a NativeTextStream on Windows.

 Ciao

Actually on second thought I'm getting hesitant to put this kind of thing
into std.stream since it is so platform specific - the Mac's iconv API is
mapped to libiconv using C preprocessor macros so the D code will have to
hard-code in those symbol names (AFAIK). Also it looks like CharToOemW might
not be on all Win95/98/Me systems. Each platform will have to get special
code for how to handle this NativeTextStream stuff. It could get pretty
messy for fairly small bang-for-buck. I'm leaning towards putting in an
outside library that can handle arbitrary encodings - like libiconv or
mango's ICU wrapper or something.

Dec 03 2004

kris <fu bar.org> writes:

Ben Hinkle wrote:
  Each platform will have to get special
 code for how to handle this NativeTextStream stuff. It could get pretty
 messy for fairly small bang-for-buck. I'm leaning towards putting in an
 outside library that can handle arbitrary encodings - like libiconv or
 mango's ICU wrapper or something.
 
 

I'd like to encourage you to do so. If you take that approach I'll write 
an adapter for Mango.io, so there's more options for everyone. I'd also 
like to see a Stream adapter for the ICU converters; perhaps there will 
be cases where the 200+ ICU transcoders cover areas that iconv does not?

Dec 03 2004

Roberto Mariottini <Roberto_member pathlink.com> writes:

In article <cokk5u$2dt9$1 digitaldaemon.com>,
Anders F Björklund says...
Roberto Mariottini wrote:

 The first writef uses an UTF-8 string, but it doesn't print what expected.
 Either one should work, but both don't work.

It works just fine, but you *have* to set your console to UTF-8.
D does *not* support consoles or shells which are not Unicode... :(

Windows XP does *not* support UTF-8 consoles. Neither do Windows NT/2000.
So the bug still applies.

I don't think D will go any further if it doesn't support non-English versions
of Windows.

Ciao

Dec 01 2004

Anders F Björklund <afb algonet.se> writes:

Ben Hinkle wrote:

 I can't make writef work on Windows XP using non-7bit-ASCII characters.

[...snip...]
 This is expected behavior. Writef takes utf-8 strings hence the error that 
 the supplied string is not in utf-8 (because it isn't). 

Moral of the story being that 8-bit strings should be declared ubyte[].
Even if it makes you cast it to a pointer, before usage with C routines:

 ubyte[] OEMmess = new ubyte[mess.length];
 CharToOemW(mess, cast(LPSTR) OEMmess);
 puts(cast(char *) OEMmess);

The "char" type in C, is known as "byte" in D. Confusingly enough.
Like Ben says, the D char type only accepts valid UTF-8 code units...

--anders

PS. No, it doesn't help that the C routines are declared as (char *)
     when they really take (ubyte *) arguments. It's just as a shortcut
     to avoid having to translate the C function declarations to D...

     And of course, it also works just fine for ASCII-only strings.
     (a char[] can be directly converted to char *, iff it is ASCII)
     With non-US-ASCII characters, it doesn't work - as you've seen.
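
In current D (D2) the same idea might look like the sketch below. The helper names are made up, and core.stdc.stdio is the present-day home of the C declarations that std.c.stdio provided in 2004.

```d
import core.stdc.stdio : puts;

// Hypothetical helper: a NUL-terminated copy of a non-UTF-8 byte string.
ubyte[] nulTerminated(const(ubyte)[] bytes)
{
    auto copy = bytes.dup;
    copy ~= cast(ubyte) 0;   // C functions expect a terminating NUL
    return copy;
}

// Hypothetical helper: hand code-page bytes to C's puts. The cast to char*
// happens only here, at the C boundary, so the data never sits in a D char[].
void putsBytes(const(ubyte)[] bytes)
{
    puts(cast(const(char)*) nulTerminated(bytes).ptr);
}
```

Because the buffer is typed ubyte[], passing it to writef by accident becomes a compile-time type question rather than a run-time "invalid UTF-8 sequence".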

Dec 01 2004

Roberto Mariottini <Roberto_member pathlink.com> writes:

In article <cokjmk$2d8u$1 digitaldaemon.com>,
Anders F Björklund says...

[...]
Moral of the story being that 8-bit strings should be declared ubyte[].
Even if it makes you cast it to a pointer, before usage with C routines:

 ubyte[] OEMmess = new ubyte[mess.length];
 CharToOemW(mess, cast(LPSTR) OEMmess);
 puts(cast(char *) OEMmess);

The "char" type in C, is known as "byte" in D. Confusingly enough.
Like Ben says, the D char type only accepts valid UTF-8 code units...

PS. No, it doesn't help that the C routines are declared as (char *)
     when they really take (ubyte *) arguments. It's just as a shortcut
     to avoid having to translate the C function declarations to D...

Sorry, I don't understand.
Are you proposing to change any C function prototype that uses "char*" to
"ubyte*"?
I agree that this would make clear that D char[] are different from C char*.
But it's a lot of work.

     And of course, it also works just fine for ASCII-only strings.
     (a char[] can be directly converted to char *, iff it is ASCII)
     With non-US-ASCII characters, it doesn't work - as you've seen.

With non-US-ASCII, but within the currently selected 8-bit OEM codepage, it
works. The problem is that UTF-8 doesn't get correctly translated to IBM-850 (or
437, or ...) on Windows.

Ciao

Dec 02 2004

Anders F Björklund <afb algonet.se> writes:

Roberto Mariottini wrote:

PS. No, it doesn't help that the C routines are declared as (char *)
    when they really take (ubyte *) arguments. It's just as a shortcut
    to avoid having to translate the C function declarations to D...

 
 Sorry, I don't understand.

I was being somewhat vague, sorry.

 Are you proposing to change any C function prototype that uses "char*" to
 "ubyte*"?
 I agree that this would make clear that D char[] are different from C char*.
 But it's a lot of work.

That is why it was skipped, but you still need to be aware of the 
difference or it will cause subtle bugs like the one you encountered...
(actually is a huge pain, as soon as you leave the old ascii strings)

Anyway, if you stick non-UTF-8 strings in char[] variables you are
setting yourself up for "invalid UTF-8 sequence". So ubyte[] is better ?
They both convert to C's (char *) in the usual way (with a NUL added)

char[] and wchar[] should be enough for any strings internal to D,
you should only need to mess with 8-bit encodings for input/output...
(and then it should preferably all be handled by a library routine)

    And of course, it also works just fine for ASCII-only strings.
    (a char[] can be directly converted to char *, iff it is ASCII)
    With non-US-ASCII characters, it doesn't work - as you've seen.

 
 With non-US-ASCII, but within the currently selected 8-bit OEM codepage, it
 works. The problem is that UTF-8 doesn't get correctly translated to IBM-850
 (or 437, or ...) on Windows.

I meant that you can output ASCII as UTF-8 and it will still work...
(mostly, except if you are stuck in EBCDIC or some other weird place)

   writefln("hello world!"); // English, works about everywhere US-ASCII

But to output to the console on Windows (or other non-Unicode platform),
it needs to be translated to the local "code page" or "charset/encoding"

Like if you want to support characters beyond the 96 or so that are
in the ASCII subset, for instance if you live in Italy or Sweden.

   writefln("hall\u00e5 v\u00e4rlden!"); // Swedish, only works in UTF-8

And there are currently no functions in D to do that, as far as I know ?


Same thing applies to console input such as the "char[] args" params...
If you just echo those args on a non-Unicode console, you get errors!
(since then they are not really UTF-8 strings, but casted ubyte[]'s)
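
One defensive option in current D (D2 Phobos) is to sanitize such arguments before echoing them. This is only a sketch: safeToEcho is an invented name, and std.encoding.sanitize replaces malformed sequences instead of recovering the original characters, so it avoids the error but does not fix the encoding.

```d
import std.encoding : isValid, sanitize;

// Hypothetical helper: return `arg` unchanged if it is well-formed UTF-8,
// otherwise a copy with the malformed sequences replaced.
string safeToEcho(string arg)
{
    if (isValid(arg))
        return arg;
    return sanitize(arg);
}
```

Recovering the real text would still need a conversion from the console's code page to UTF-8 on input, which is the missing piece discussed in this thread.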

--anders

Dec 02 2004

"Regan Heath" <regan netwin.co.nz> writes:

On Thu, 02 Dec 2004 11:11:27 +0100, Anders F Björklund <afb algonet.se> wrote:
 Roberto Mariottini wrote:

 PS. No, it doesn't help that the C routines are declared as (char *)
    when they really take (ubyte *) arguments. It's just as a shortcut
    to avoid having to translate the C function declarations to D...

  Sorry, I don't understand.

 I was being somewhat vague, sorry.

 Are you proposing to change any C function prototype that uses "char*" to
 "ubyte*"?
 I agree that this would make clear that D char[] are different from C char*.
 But it's a lot of work.


I think it's a good idea. I reckon it will initially cause people to be  
confused, i.e. they see:

int strcmp(byte *, byte *)

and think "huh? strcmp takes a char * not a byte *" but then if they look  
up byte * and/or char * in the D docs they should hopefully realise the  
difference, that C's char * is really a byte * and D's char[] is UTF  
encoded.

Oh yeah, correct me if I'm wrong but C's "char*" is really a "byte*" not a
"ubyte*" as C's chars are signed.
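
For reference, whether plain char is signed in C is implementation-defined, so both readings occur in practice. The difference only shows for bytes of 0x80 and above, which a sketch in D can make concrete (function names invented for the example):

```d
// Hypothetical helpers: the numeric value of one raw byte as seen through
// a signed (D byte) and an unsigned (D ubyte) 8-bit type.
int asSigned8(ubyte raw)   { return cast(byte) raw; }
int asUnsigned8(ubyte raw) { return raw; }
```

For 0x84 the two give -124 and 132. For any 7-bit ASCII value they agree, which is why the distinction rarely matters until accented characters appear.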

 That is why it was skipped, but you still need to be aware of the  
 difference or it will cause subtle bugs like the one you encountered...
 (actually is a huge pain, as soon as you leave the old ascii strings)

 Anyway, if you stick non-UTF-8 strings in char[] variables you are
 setting yourself up for "invalid UTF-8 sequence". So ubyte[] is better ?
 They both convert to C's (char *) in the usual way (with a NUL added)

 char[] and wchar[] should be enough for any strings internal to D,
 you should only need to mess with 8-bit encodings for input/output...
 (and then it should preferably all be handled by a library routine)

Exactly, all transcoding should be done at the input/output stage (if at  
all) internally you should use char[] wchar[] or dchar[]. Unless of course  
you have a good reason not to.

    And of course, it also works just fine for ASCII-only strings.
    (a char[] can be directly converted to char *, iff it is ASCII)
    With non-US-ASCII characters, it doesn't work - as you've seen.

  With non-US-ASCII, but within the currently selected 8-bit OEM codepage, it
  works. The problem is that UTF-8 doesn't get correctly translated to IBM-850
  (or 437, or ...) on Windows.

 I meant that you can output ASCII as UTF-8 and it will still work...
 (mostly, except if you are stuck in EBCDIC or some other weird place)

   writefln("hello world!"); // English, works about everywhere US-ASCII

 But to output to the console on Windows (or other non-Unicode platform),
 it needs to be translated to the local "code page" or "charset/encoding"

 Like if you want to support characters beyond the 96 or so that are
 in the ASCII subset, for instance if you live in Italy or Sweden.

   writefln("hall\u00e5 v\u00e4rlden!"); // Swedish, only works in UTF-8

 And there are currently no functions in D to do that, as far as I know ?

No, but you can wrap and use the C (windows) function CharToOemW.

Someone suggested that the default stdout stream should do this  
automatically, I think that's a great idea. IIRC Ben was considering  
giving this a go.

 Same thing applies to console input such as the "char[] args" params...
 If you just echo those args on a non-Unicode console, you get errors!
 (since then they are not really UTF-8 strings, but casted ubyte[]'s)

Which strikes me as ridiculous.

Regan

Dec 02 2004

Anders F Björklund <afb algonet.se> writes:

Regan Heath wrote:

 and think "huh? strcmp takes a char * not a byte *" but then if they 
 look  up byte * and/or char * in the D docs they should hopefully 
 realise the  difference, that C's char * is really a byte * and D's 
 char[] is UTF  encoded.
 
 Oh yeah, correct me if I'm wrong but C's "char*" is really a "byte*" not 
 a  "ubyte*" as C's char's are signed.

D is only concerned about "byte size", so it will remain as char*...

   writefln("hall\u00e5 v\u00e4rlden!"); // Swedish, only works in UTF-8

 And there are currently no functions in D to do that, as far as I know ?

 
 No, but you can wrap and use the C (windows) function CharToOemW.

I'm not using Windows, but a modern system with an UTF-8 console ;-)

 Same thing applies to console input such as the "char[] args" params...
 If you just echo those args on a non-Unicode console, you get errors!
 (since then they are not really UTF-8 strings, but casted ubyte[]'s)

 
 Which strikes me as ridiculous.

Either way, both stdout and stdin need to be "extended" for non-UTF-8

--anders

Dec 03 2004
