digitalmars.D - Error: invalid UTF-8 sequence

reply Carotinho <carotinobg yahoo.it> writes:
Hi all!
I'm new here and to D. I wrote a simple program:

import std.stdio;
import std.stream;

int main() {
  char[] stringa;
  stringa = std.stream.stdin.readLine();
  writefln("%s",stringa);
  return 0;
}

If I type normal characters, like a, b, c etc., everything is OK.
But when I try to type special characters like è, ò, ù I get
  Error: invalid UTF-8 sequence
when the program tries to write back the string I got.
What is this?

Thanks in advance!

Carotinho
Nov 28 2004
parent reply Anders F Björklund <afb algonet.se> writes:
Carotinho wrote:

 If I type normal characters, like a, b, c etc., everything is OK.
 But when I try to type special characters like è, ò, ù I get
   Error: invalid UTF-8 sequence
 when the program tries to write back the string I got.
 What is this?
D only works with Unicode. You need to set your shell to UTF-8.
(Or, the tricky version: you can cast(ubyte[]) and convert it?)
--anders
Nov 28 2004
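To make the error concrete: a terminal set to ISO-8859-1 sends "è" as the single byte 0xE8, which is not a complete UTF-8 sequence, while a UTF-8 terminal sends the two bytes 0xC3 0xA8. The following is a minimal sketch, assuming a current D2 compiler with Phobos (std.utf), which postdates this thread; the helper name isWellFormedUtf8 is made up for illustration.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Returns true when `bytes` form a well-formed UTF-8 sequence.
/// (Hypothetical helper, not part of Phobos.)
bool isWellFormedUtf8(const(ubyte)[] bytes)
{
    try
    {
        // validate() throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "è" as a Latin-1 terminal sends it: one byte.
    ubyte[] latin1 = [0xE8];
    // "è" as a UTF-8 terminal sends it: two bytes.
    ubyte[] utf8 = [0xC3, 0xA8];

    writeln(isWellFormedUtf8(latin1));
    writeln(isWellFormedUtf8(utf8));
}
```

The byte 0xE8 looks like the start of a three-byte UTF-8 sequence with nothing after it, which is why the runtime rejects it.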
parent reply "Simon Buchan" <currently no.where> writes:
On Mon, 29 Nov 2004 00:39:29 +0100, Anders F Björklund <afb algonet.se>  
wrote:

 Carotinho wrote:

 If I type normal characters, like a, b, c etc., everything is OK.
 But when I try to type special characters like è, ò, ù I get
   Error: invalid UTF-8 sequence
 when the program tries to write back the string I got.
 What is this?
D only works with Unicode. You need to set your shell to UTF-8. (Or, the tricky version, you can cast(ubyte[]) and convert it ?) --anders
I don't think cast works. Unfortunately, the Windows shell can't use UTF. This discussion was referenced somewhere else (maybe digitalmars.D.bugs?).

I have a project that tries to write a file with funky punctuation to the screen... the closest I got was to use read/writeString exclusively, which gives you rubbish for special characters.

There was something mentioned about a Win32 API that converted UTF to codepages and vice versa... sounded promising, but I don't know if it is currently available to D. Look around, you may get lucky. (And if you do, tell the rest of us :D)
Nov 28 2004
next sibling parent Anders F Björklund <afb algonet.se> writes:
Simon Buchan wrote:

 (Or, the tricky version, you can cast(ubyte[]) and convert it ?)
I don't think cast works. Unfortunately, the Windows shell can't use UTF. This discussion was referenced somewhere else (maybe digitalmars.D.bugs?)
It does. The problem is that D just assumes that the shell is UTF-8, and feeds you char[] that are *invalid* (as they are native-encoded). If you translate them yourself, I've found it to work just fine...

I don't have a DOS console (@echo off allergies), but it does work with a zsh console set to the ISO-8859-1 encoding (instead of UTF-8). Of course, if the console *is* Unicode, then this doesn't work...

Anyway, my test code looked like:
 void main(char[][] args)
 {
 	wchar[256] mapping = iso88591.mapping;
 
 	char[] test = cast(char[]) decode_string(cast(ubyte[]) args[1], mapping);
 	writefln("%s",test);
 	
 	static ubyte[1] z = [ 0 ];
 	printf("%s\n", cast(char*) (encode_string(test, mapping) ~ z) );
 }
Usually when you call old C functions, you want ubyte[] and not char[] since they don't handle UTF-8? The D tradition is to pretend that they have the D definition (char *) anyway, since "it is the same bit size". I use ubyte[] for legacy 8-bit encodings, and char[] for Unicode only.
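The ubyte[] versus char[] split can be sketched for the simplest legacy encoding. The code below is not the decode_string/encode_string hack from the post (that code was never shown); it is a minimal stand-in that assumes ISO-8859-1 only, where byte value N is simply code point U+00NN, and a current D2 compiler. The function names latin1ToUtf8 and utf8ToLatin1 are hypothetical. Other code pages such as CP-437 would need a real 256-entry mapping table instead of the identity mapping.

```d
import std.stdio : writeln;
import std.utf : encode;

/// Decodes ISO-8859-1 bytes (legacy data, hence ubyte[]) into UTF-8.
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    foreach (b; input)
    {
        char[4] buf;
        // Latin-1 byte N is code point U+00NN, so the mapping is the identity.
        immutable len = encode(buf, cast(dchar) b);
        result ~= buf[0 .. len];
    }
    return result.idup;
}

/// Encodes UTF-8 text as ISO-8859-1, replacing unmappable characters by '?'.
ubyte[] utf8ToLatin1(const(char)[] input)
{
    ubyte[] result;
    foreach (dchar c; input) // foreach over dchar decodes the UTF-8
        result ~= c <= 0xFF ? cast(ubyte) c : cast(ubyte) '?';
    return result;
}

void main()
{
    // "caffè" as a Latin-1 console would deliver it.
    ubyte[] raw = [0x63, 0x61, 0x66, 0x66, 0xE8];
    string text = latin1ToUtf8(raw);
    writeln(text);                     // valid UTF-8 now
    assert(utf8ToLatin1(text) == raw); // and it round-trips
}
```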
 There was something mentioned about a Win32 API that converted UTF to
 codepages and vice-versa... sounded promising, but I don't know if
 it is currently available to D. Look around, you may get lucky.
 (and if you do, tell the rest of us :D)
There is a Win32-only API, and some open source libraries (iconv, ICU):
http://msdn.microsoft.com/library/en-us/intl/unicode_19mb.asp
http://www.gnu.org/software/libiconv/
http://oss.software.ibm.com/icu/

I might share my own little hack later on too, when I've packaged it up. (It just does the 4 main mappings, not the other 200* that the above do: ISO-8859-1 [Latin-1], CP-437 [DOS], CP-1252 [Win], MacRoman [Mac OS 9].) It's a lot smaller than the real McCoy, and will be under the zlib license:
http://www.opensource.org/licenses/zlib-license.php (my usual license)

If you need the full functionality, look at Mango/ICU or iconv instead?
--anders

PS. I'm not kidding, it really has hundreds (!) of different encodings:
http://oss.software.ibm.com/icu/charset/
Nov 29 2004
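Assuming the Win32 API in question is MultiByteToWideChar (the thread never names it), a conversion from a console code page to UTF-8 might look like the sketch below. It relies on the Windows bindings that ship with current D2 compilers (core.sys.windows.windows), which did not exist in this form in 2004; codePageToUtf8 is a hypothetical helper name. On other platforms the program compiles but does nothing useful.

```d
import std.stdio : writeln;

version (Windows)
{
    import core.sys.windows.windows; // MultiByteToWideChar
    import std.utf : toUTF8;

    /// Converts bytes in the given Windows code page to a UTF-8 string.
    string codePageToUtf8(const(ubyte)[] input, uint codePage)
    {
        if (input.length == 0)
            return "";
        auto src = cast(const(char)*) input.ptr;
        immutable len = cast(int) input.length;

        // First call with no output buffer: returns the UTF-16 length needed.
        immutable needed = MultiByteToWideChar(codePage, 0, src, len, null, 0);
        if (needed <= 0)
            throw new Exception("MultiByteToWideChar failed");

        // Second call fills the buffer; toUTF8 then re-encodes UTF-16 as UTF-8.
        auto wide = new wchar[needed];
        MultiByteToWideChar(codePage, 0, src, len, wide.ptr, needed);
        return toUTF8(wide);
    }
}

void main()
{
    version (Windows)
    {
        // 0x8A is "è" in CP-437, the classic DOS/OEM code page.
        auto text = codePageToUtf8([0x8A], 437);
        assert(text == "\u00E8");
    }
    else
    {
        writeln("Win32-only sketch; nothing to do on this platform.");
    }
}
```

The opposite direction (UTF-16 to a code page) is handled by the companion call WideCharToMultiByte, which follows the same two-call pattern.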
prev sibling parent reply "Ben Hinkle" <bhinkle mathworks.com> writes:
[snip]

 There was something mentioned about a Win32 API that converted UTF to
 codepages and vice-versa... sounded promising, but I don't know if
 it is currently available to D. Look around, you may get lucky.
 (and if you do, tell the rest of us :D)
The following solution doesn't handle errors well due to some errno confusion I'm trying to figure out, but it is a start. Here's what you can do. Get iconv.dll from the zip file at
http://prdownloads.sourceforge.net/gettext/libiconv-1.9.1.bin.woe32.zip?download
and put it in the same directory as your executable. The attached libiconv.d will load the three functions you need. The attached iconv_example.d shows how to call iconv to convert utf-8 to utf-16 little endian.

I'm looking into the errno issues and will probably have to recompile libiconv with DMC or something. But for typical usage the above instructions should work. Also I'd like to put a small wrapper around the low-level API to make it easier to use for the simple cases when the input is complete.
-Ben

[attachments: libiconv.d, iconv_example.d]
Nov 29 2004
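Since the attachments did not survive, here is a rough stand-in for the kind of small wrapper described above, for the simple case where the whole input is available. It is not Ben's libiconv.d: it assumes a current D2 compiler on Linux with glibc (where iconv lives in libc and needs no DLL loading), the three extern (C) declarations are hand-written to match the C prototypes in iconv.h, and the name convert is made up.

```d
import std.stdio : writeln;
import std.string : representation, toStringz;

// Hand-written bindings, assumed to match <iconv.h> on glibc.
alias iconv_t = void*;
extern (C) nothrow @nogc
{
    iconv_t iconv_open(const(char)* tocode, const(char)* fromcode);
    size_t iconv(iconv_t cd, char** inbuf, size_t* inbytesleft,
                 char** outbuf, size_t* outbytesleft);
    int iconv_close(iconv_t cd);
}

/// Converts a complete input buffer from charset fromCode to charset toCode.
ubyte[] convert(const(ubyte)[] input, string fromCode, string toCode)
{
    iconv_t cd = iconv_open(toCode.toStringz, fromCode.toStringz);
    if (cd == cast(iconv_t) size_t.max) // C's (iconv_t) -1
        throw new Exception("iconv_open: unsupported conversion");
    scope (exit) iconv_close(cd);

    // 4 output bytes per input byte covers the 8-bit, UTF-8 and UTF-16
    // conversions discussed here; the slack leaves room for a BOM.
    auto output = new ubyte[input.length * 4 + 4];
    char* inPtr = cast(char*) input.ptr;
    size_t inLeft = input.length;
    char* outPtr = cast(char*) output.ptr;
    size_t outLeft = output.length;

    // iconv advances the pointers and decrements the byte counts.
    if (iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft) == size_t.max)
        throw new Exception("iconv: invalid or incomplete input");
    return output[0 .. $ - outLeft];
}

void main()
{
    // "è" is U+00E8: bytes C3 A8 in UTF-8, E8 00 in UTF-16LE.
    auto utf16 = convert("\u00E8".representation, "UTF-8", "UTF-16LE");
    writeln(utf16);
}
```

For the original poster's situation, the same helper called as convert(rawBytes, "ISO-8859-1", "UTF-8") would turn Latin-1 console input into valid UTF-8 before it is printed with writefln.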
next sibling parent reply Carotinho <carotinobg yahoo.it> writes:
Thank you all, I'll start experimenting!
For information, I'm running Linux, and even here I'm quite a newbie :)

Byez!
Nov 29 2004
parent "Ben Hinkle" <ben.hinkle gmail.com> writes:
"Carotinho" <carotinobg yahoo.it> wrote in message 
news:cog796$1rjr$1 digitaldaemon.com...
 Thank you all, I'll start experimenting!
 For information, I'm running Linux, and even here I'm quite a newbie :)

 Byez!
Oh, even better. You don't need the DLL then: just get the .d file that declares the iconv functions and you're all set (well, except for figuring out the API and getting the right encodings).
Nov 29 2004
prev sibling next sibling parent "Simon Buchan" <currently no.where> writes:
On Mon, 29 Nov 2004 14:14:33 -0500, Ben Hinkle <bhinkle mathworks.com>  
wrote:

 [snip]

 There was something mentioned about a Win32 API that converted UTF to
 codepages and vice-versa... sounded promising, but I don't know if
 it is currently available to D. Look around, you may get lucky.
 (and if you do, tell the rest of us :D)
The following solution doesn't handle errors well due to some errno confusion I'm trying to figure out, but it is a start. Here's what you can do. Get iconv.dll from the zip file at http://prdownloads.sourceforge.net/gettext/libiconv-1.9.1.bin.woe32.zip?download and put it in the same directory as your executable. The attached libiconv.d will load the three functions you need. The attached iconv_example.d shows how to call iconv to convert utf-8 to utf-16 little endian. I'm looking into the errno issues and will probably have to recompile libiconv with DMC or something. But for typical usage the above instructions should work. Also I'd like to put a small wrapper around the low-level API to make it easier to use for the simple cases when the input is complete. -Ben
This doesn't let you convert UTF-8 into an OEM codepage, though, does it? Linux users should be fine if they set their console to a UTF, but poor Windows users are stuck with weird codepages. (I do have the UTF codepages installed, they have to be, but I don't know how you can tell the console to use them.)
Nov 29 2004
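On the question of telling the Windows console to use a UTF code page: the Win32 call for that is SetConsoleOutputCP, and UTF-8 is code page 65001 (the same effect as typing chcp 65001 at the prompt). A minimal sketch, assuming a current D2 compiler and its bundled Windows bindings; whether the characters then display correctly also depends on the console font, which this code does not address.

```d
import std.stdio : writeln;

version (Windows)
{
    import core.sys.windows.windows; // GetConsoleOutputCP, SetConsoleOutputCP

    enum uint utf8CodePage = 65001;  // numeric value of CP_UTF8
}

void main()
{
    version (Windows)
    {
        // Remember the current code page and switch the console to UTF-8.
        immutable previous = GetConsoleOutputCP();
        SetConsoleOutputCP(utf8CodePage);
        // Restore the user's code page when main returns.
        scope (exit) SetConsoleOutputCP(previous);
    }

    // D string literals are UTF-8, so these bytes now match the console.
    writeln("\u00E8 \u00F2 \u00F9");
}
```

On Linux and other platforms the version block compiles away and the program simply prints the three characters.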
prev sibling parent reply Anders F Björklund <afb algonet.se> writes:
Ben Hinkle wrote:

 The following solution doesn't handle errors well due to some errno
 confusion I'm trying to figure out, but it is a start. Here's what you can
 do. Get iconv.dll from the zip file at
 http://prdownloads.sourceforge.net/gettext/libiconv-1.9.1.bin.woe32.zip?download
 and put it in the same directory as your executable. The attached libiconv.d
 will load the three functions you need. The attached iconv_example.d shows
 how to call iconv to convert utf-8 to utf-16 little endian.
 I'm looking into the errno issues and will probably have to recompile
 libiconv with DMC or something. But for typical usage the above instructions
 should work. Also I'd like to put a small wrapper around the low-level API
 to make it easier to use for the simple cases when the input is complete.
This code doesn't work everywhere... (POSIX?) At least not without some more modifications.
   // on POSIX systems iconv is built into libc so loading is automatic
It doesn't work on Mac OS X, unfortunately.
 /usr/bin/ld: Undefined symbols:
 _iconv
 _iconv_close
 _iconv_open
 collect2: ld returned 1 exit status
(It's being loaded from the system's /usr/lib/libiconv.dylib.) The reason is in /usr/include/iconv.h:
 #define iconv_t libiconv_t
 #ifndef LIBICONV_PLUG
 #define iconv_open libiconv_open
 #define iconv libiconv
 #define iconv_close libiconv_close
 #endif
Annoying, isn't it ? So one needs to declare the C functions with the "lib" prefix, and then do wrappers in D for the usual names...
 } else version (darwin) { 
 
   // On Mac OS X, link with -liconv (/usr/lib/libiconv.dylib)
   typedef void *libiconv_t;
 
   // allocate a converter between charsets fromcode and tocode
   extern (C) libiconv_t libiconv_open (char *tocode, char *fromcode);
   iconv_t iconv_open (char *tocode, char *fromcode)
   { return cast(iconv_t) libiconv_open(tocode, fromcode); }
 
   // convert inbuf to outbuf and set inbytesleft to unused input and
   // outbuf to unused output and return number of non-reversable 
   // conversions or -1 on error.
   extern (C) size_t libiconv (libiconv_t cd, void **inbuf,
 			   size_t *inbytesleft,
 			   void **outbuf,
 			   size_t *outbytesleft);
   size_t iconv (iconv_t cd, void **inbuf, size_t *inbytesleft,
 			   void **outbuf, size_t *outbytesleft)
   { return libiconv(cast(libiconv_t) cd, inbuf, inbytesleft, outbuf,
                     outbytesleft); }
 
   // close converter
   extern (C) int libiconv_close (libiconv_t cd);
   int iconv_close (iconv_t cd)
   { return libiconv_close(cast(libiconv_t) cd); }
 
 } else { 
And the test code assumed that everything is X86:
 version (LittleEndian)
   // convert from utf-8 to utf-16 little endian
   iconv_t cd = iconv_open("UTF-16LE","UTF-8");
 else version (BigEndian)
   // convert from utf-8 to utf-16 big endian
   iconv_t cd = iconv_open("UTF-16BE","UTF-8");
That's actually one of the biggest drawbacks of UTF-16...

Besides those little flaws, the code works just fine :-)
--anders
Nov 30 2004
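The version (LittleEndian) / version (BigEndian) choice can also be made at run time. The sketch below assumes a current D2 compiler with Phobos, where std.system exposes the host byte order as the value endian; nativeUtf16Name is a hypothetical helper that returns the iconv charset name matching how wchar values are laid out in memory.

```d
import std.stdio : writeln;
import std.system : endian, Endian;

/// Name of the iconv charset whose byte order matches in-memory wchar data.
string nativeUtf16Name()
{
    return endian == Endian.littleEndian ? "UTF-16LE" : "UTF-16BE";
}

void main()
{
    // U+00E8 (è) as a single UTF-16 code unit, viewed as raw bytes.
    wchar[1] unit = ['\u00E8'];
    auto bytes = cast(ubyte[]) unit[];

    // On x86 the low byte (0xE8) comes first; on big-endian hardware,
    // such as the PowerPC Macs of the time, the order is reversed.
    writeln(nativeUtf16Name(), " ", bytes);
}
```

Passing the returned name to iconv_open keeps the converted bytes directly usable as a wchar[] on either kind of machine.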
parent reply "Ben Hinkle" <bhinkle mathworks.com> writes:
"Anders F Björklund" <afb algonet.se> wrote in message
news:coi25c$1ijr$1 digitaldaemon.com...
 Ben Hinkle wrote:

 The following solution doesn't handle errors well due to some errno
 confusion I'm trying to figure out, but it is a start. Here's what you can
 do. Get iconv.dll from the zip file at
 http://prdownloads.sourceforge.net/gettext/libiconv-1.9.1.bin.woe32.zip?download
 and put it in the same directory as your executable. The attached libiconv.d
 will load the three functions you need. The attached iconv_example.d shows
 how to call iconv to convert utf-8 to utf-16 little endian.
 I'm looking into the errno issues and will probably have to recompile
 libiconv with DMC or something. But for typical usage the above instructions
 should work. Also I'd like to put a small wrapper around the low-level API
 to make it easier to use for the simple cases when the input is complete.
 This code doesn't work everywhere... (POSIX?)
 At least not without some more modifications.

   // on POSIX systems iconv is built into libc so loading is automatic
It doesn't work on Mac OS X, unfortunately.
 /usr/bin/ld: Undefined symbols:
 _iconv
 _iconv_close
 _iconv_open
 collect2: ld returned 1 exit status
(It's being loaded from System's /usr/lib/libiconv.dylib) in /usr/include/iconv.h:
 #define iconv_t libiconv_t
 #ifndef LIBICONV_PLUG
 #define iconv_open libiconv_open
 #define iconv libiconv
 #define iconv_close libiconv_close
 #endif
Annoying, isn't it ? So one needs to declare the C functions with the "lib" prefix, and then do wrappers in D for the usual names...
That is a bummer. Love those #defines!
 } else version (darwin) {

   // On Mac OS X, link with -liconv (/usr/lib/libiconv.dylib)
   typedef void *libiconv_t;

   // allocate a converter between charsets fromcode and tocode
   extern (C) libiconv_t libiconv_open (char *tocode, char *fromcode);
   iconv_t iconv_open (char *tocode, char *fromcode)
   { return cast(iconv_t) libiconv_open(tocode, fromcode); }

   // convert inbuf to outbuf and set inbytesleft to unused input and
   // outbuf to unused output and return number of non-reversable
   // conversions or -1 on error.
   extern (C) size_t libiconv (libiconv_t cd, void **inbuf,
    size_t *inbytesleft,
    void **outbuf,
    size_t *outbytesleft);
   size_t iconv (iconv_t cd, void **inbuf, size_t *inbytesleft,
    void **outbuf, size_t *outbytesleft)
   { return libiconv(cast(libiconv_t) cd, inbuf, inbytesleft, outbuf,
                     outbytesleft); }
   // close converter
   extern (C) int libiconv_close (libiconv_t cd);
   int iconv_close (iconv_t cd)
   { return libiconv_close(cast(libiconv_t) cd); }

 } else {
Maybe I'll try using std.loader for this case, too, and have iconv be a function pointer. Hmm...
 And the test code assumed that everything is X86:

 version (LittleEndian)
   // convert from utf-8 to utf-16 little endian
   iconv_t cd = iconv_open("UTF-16LE","UTF-8");
 else version (BigEndian)
   // convert from utf-8 to utf-16 big endian
   iconv_t cd = iconv_open("UTF-16BE","UTF-8");
That's actually one of the biggest drawbacks of UTF-16...
That's true. I was being lazy with the example. When I tried just plain-old "UTF-16" I think it used big-endian.
 Besides those little flaws, the code works just fine :-)
 --anders
Thanks for the update. I obviously hadn't tried on the Mac.
Nov 30 2004
next sibling parent Anders F Björklund <afb algonet.se> writes:
Ben Hinkle wrote:

 That's true. I was being lazy with the example. When I tried just plain-old
 "UTF-16" I think it used big-endian.
Yes, that is in the Unicode specification. BE is the default order, unless there is a BOM present to classify it as LE instead, that is...
See http://www.unicode.org/faq/utf_bom.html
--anders
Nov 30 2004
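The BOM rule can be written down in a few lines. This is a minimal sketch for a current D2 compiler, with made-up names (ByteOrder, Utf16Layout, detectUtf16): it checks the first two bytes of a buffer that is already known to be UTF-16 and falls back to big-endian when no BOM is present. It does not try to tell UTF-16 apart from UTF-32, whose little-endian BOM starts with the same two bytes.

```d
import std.stdio : writeln;

enum ByteOrder { bigEndian, littleEndian }

struct Utf16Layout
{
    ByteOrder order;
    size_t bomLength; // bytes to skip before the first code unit
}

/// Inspects the start of a UTF-16 byte stream for a byte order mark.
Utf16Layout detectUtf16(const(ubyte)[] data)
{
    if (data.length >= 2)
    {
        // U+FEFF serialized little-endian is FF FE.
        if (data[0] == 0xFF && data[1] == 0xFE)
            return Utf16Layout(ByteOrder.littleEndian, 2);
        // U+FEFF serialized big-endian is FE FF.
        if (data[0] == 0xFE && data[1] == 0xFF)
            return Utf16Layout(ByteOrder.bigEndian, 2);
    }
    // No BOM: plain "UTF-16" is taken as big-endian.
    return Utf16Layout(ByteOrder.bigEndian, 0);
}

void main()
{
    writeln(detectUtf16([0xFF, 0xFE, 0xE8, 0x00])); // BOM says little-endian
    writeln(detectUtf16([0x00, 0xE8]));             // no BOM, big-endian assumed
}
```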
prev sibling parent "Kris" <fu bar.com> writes:
"Ben Hinkle" <bhinkle mathworks.com> wrote ...

| Maybe I'll try using std.loader for this case, too, and have iconv be a
| function pointer. Hmm...

That won't work with dmd 0.107 ...
Nov 30 2004