
digitalmars.D - INVALID UTF-8 SEQUENCE!

reply Martin <Martin_member pathlink.com> writes:
I just downloaded the new dmd compiler and it tells me INVALID UTF-8 SEQUENCE
when I compile. This is when I use non-English characters in my strings (like
Ö, Ü, Ä). But I need to use them. The old version was just fine, why this change?

Most of the C compilers accept them, why not D?
Aug 18 2004
next sibling parent J C Calvarese <jcc7 cox.net> writes:
In article <cfvh55$2d5s$1 digitaldaemon.com>, Martin says...
I just downloaded the new dmd compiler and it tells me INVALID UTF-8 SEQUENCE
when I compile. This is when I use non-English characters in my strings (like
Ö, Ü, Ä). But I need to use them. The old version was just fine, why this change?

Most of the C compilers accept them, why not D?

What format is your file saved in? From http://www.digitalmars.com/d/lex.html:

  Source Text
  D source text can be in one of the following formats:
  * ASCII
  * UTF-8
  * UTF-16BE
  * UTF-16LE
  * UTF-32BE
  * UTF-32LE

jcc7
Aug 18 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cfvh55$2d5s$1 digitaldaemon.com>, Martin says...
I just downloaded the new dmd compiler and it tells me INVALID UTF-8 SEQUENCE
when I compile. This is when I use non-English characters in my strings (like
Ö, Ü, Ä). But I need to use them. The old version was just fine, why this change?

I have a sneaking suspicion you might find it will work just fine if you save your source file in UTF-8 before trying to compile it. (Save As...). So far as I know, the D compiler has not changed in this regard (except that it can now auto-detect UTF-16 and UTF-32).
Most of the C compilers accept them, why not D?

Actually, I think most C compilers simply allow a string to consist of an arbitrary sequence of bytes without any interpretation whatsoever - which just happens to appear to work whenever the source file encoding is the same as the run-time encoding. Arcane Jill
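Jill's point is easy to demonstrate outside of D. The following sketch is illustrative only (Python rather than the thread's D): a byte that is perfectly good Latin-1 text is an invalid UTF-8 sequence, so a compiler that passes bytes through uninterpreted only "works" while the writer's and reader's encodings happen to agree.

```python
# A single Latin-1 byte for 'Ö' (0xD6) round-trips fine as Latin-1...
raw = "Ö".encode("latin-1")
print(raw)                      # b'\xd6'
print(raw.decode("latin-1"))    # 'Ö' - writer and reader agree on the encoding

# ...but interpreted as UTF-8, that lone 0xD6 byte is an invalid sequence:
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8 sequence")
```

This is exactly the failure mode the compiler now reports at compile time instead of letting it surface (or not) at run time.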
Aug 18 2004
next sibling parent reply Martin <Martin_member pathlink.com> writes:
Thank you for your answer!
I have a sneaking suspicion you might find it will work just fine if you save
your source file in UTF-8 before trying to compile it. (Save As...). 

I am using the gnu midnight commander text editor, it only saves ascii.
So far as I know, the D compiler has not changed in this regard (except that it
can now auto-detect UTF-16 and UTF-32).

I think I changed from version 0.93 to 0.98. In 0.93 my files compiled fine; in
0.98 I get an error. I changed back to 0.93, because I need to use these
characters. So how do I tell dmd that the source is an ASCII file? Thank you!

In article <cfvjdr$2dr2$1 digitaldaemon.com>, Arcane Jill says...
In article <cfvh55$2d5s$1 digitaldaemon.com>, Martin says...
I just downloaded the new dmd compiler and it tells me INVALID UTF-8 SEQUENCE
when I compile. This is when I use non-English characters in my strings (like
Ö, Ü, Ä). But I need to use them. The old version was just fine, why this change?

I have a sneaking suspicion you might find it will work just fine if you save your source file in UTF-8 before trying to compile it. (Save As...). So far as I know, the D compiler has not changed in this regard (except that it can now auto-detect UTF-16 and UTF-32).
Most of the C compilers accept them, why not D?

Actually, I think most C compilers simply allow a string to consist of an arbitrary sequence of bytes without any interpretation whatsoever - which just happens to appear to work whenever the source file encoding is the same as the run-time encoding. Arcane Jill

Aug 18 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cfvlcp$2eob$1 digitaldaemon.com>, Martin says...
Thank you for your answer!

Err... Don't thank me yet. Save that until the problem's actually solved!
I have a sneaking suspicion you might find it will work just fine if you save
your source file in UTF-8 before trying to compile it. (Save As...). 

I am using the gnu midnight commander text editor, it only saves ascii.

That's not possible. In your original post you said "I use non-English
characters in my strings (like Ö, Ü, Ä)". If that statement is true, you
/cannot/ be using ASCII, since these characters do not even /exist/ in ASCII.
If your text contains any of the characters 'Ö', 'Ü' or 'Ä' then you are /not/
using ASCII. Period.

Unfortunately, I am not familiar with this text editor, so I don't know how to
determine the encoding it uses, or how to change it. I may have a fix for you
even so, however. (Read on...)
So far as I know, the D compiler has not changed in this regard (except that it
can now auto-detect UTF-16 and UTF-32).

I think I changed the 0.93 version to the 0.98. In 0.93 my files compiled fine, in 0.98 I get an error. I changed back to 0.93, because I need to use these characters.

Okay. Now, first off, the following compiles fine for me:

# void main()
# {
#     printf("hello Ö\n");
# }

using DMD v0.98, with the file saved as UTF-8. However, when I resaved the file
as ISO-8859-1 (which is an invalid thing to do) then the compiler (correctly)
gave me the compile-time error message: "Invalid UTF-8 sequence".

I believe that in the earlier version to which you refer (0.93) there was a
bug, which was fixed in 0.96 - according to the change log: "Invalid UTF
characters in string literals now diagnosed." In other words, DMD 0.93 failed
to diagnose the invalid UTF-8 characters in your source file, and so the file
compiled -- but it compiled incorrectly. The error would not have been detected
until runtime - and even then only IF you passed your string to a UTF
conversion routine. If you passed your invalid string straight to printf(), for
example, the v0.93 compiler wouldn't even have noticed. But you can bet your
life that even if your program had appeared to run correctly on your machine,
it would not necessarily have worked on anyone else's.

So tell me - what operating system are you using? The word "gnu" makes me
suspect Linux, in which case I believe you need to set the environment variable
CHARSET to the value UTF-8. (But I'm a Windows user, so I could be wrong - I'm
hoping someone will leap in here and correct me if so). Anyway, once you've set
your environment variable, everything should work with the latest DMD - and
this time, it will work for everyone, not just for you.
So how to I tell the dmd that the source is an ascii file?

There's a problem here - which is that you and I are not speaking the same language. An ASCII file is a file which DOES NOT CONTAIN any characters having codepoints outside the range 0x00 to 0x7F. DMD is perfectly happy with ASCII files, but your files are not ASCII. Sorry to be pedantic. Your file is /probably/ ISO-8859-1 (aka "Latin 1"). But it's not ASCII. Arcane Jill
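The distinction Jill is drawing is purely mechanical, and can be checked byte by byte. An illustrative sketch (Python, not part of the thread's D code) of the ASCII test she describes:

```python
def is_ascii(data: bytes) -> bool:
    """A file is ASCII only if every byte is in the range 0x00-0x7F."""
    return all(b <= 0x7F for b in data)

print(is_ascii(b"hello"))                     # True: plain ASCII
print(is_ascii("hello Ö".encode("latin-1")))  # False: the 0xD6 byte is outside ASCII
```

By this test, any file containing Ö, Ü or Ä, however it is encoded, is not an ASCII file.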
Aug 18 2004
next sibling parent reply Martin <Martin_member pathlink.com> writes:
Yes, you are probably right: it is some kind of extended ASCII - in this case I
think it is ISO-8859-1.
My problem is that the webserver that I am writing this software for uses the
same encoding.
With the old version everything worked fine. Everyone that used the server saw
the characters right.

So can I tell dmd to use ISO-8859-1, or just not to check the things it
shouldn't be checking?



In article <cfvqq8$2jhu$1 digitaldaemon.com>, Arcane Jill says...
In article <cfvlcp$2eob$1 digitaldaemon.com>, Martin says...
Thank you for your answer!

Err... Don't thank me yet. Save that until the problem's actually solved!
I have a sneaking suspicion you might find it will work just fine if you save
your source file in UTF-8 before trying to compile it. (Save As...). 

I am using the gnu midnight commander text editor, it only saves ascii.

That's not possible. In your original post you said "I use non-English
characters in my strings (like Ö, Ü, Ä)". If that statement is true, you
/cannot/ be using ASCII, since these characters do not even /exist/ in ASCII.
If your text contains any of the characters 'Ö', 'Ü' or 'Ä' then you are /not/
using ASCII. Period.

Unfortunately, I am not familiar with this text editor, so I don't know how to
determine the encoding it uses, or how to change it. I may have a fix for you
even so, however. (Read on...)
So far as I know, the D compiler has not changed in this regard (except that it
can now auto-detect UTF-16 and UTF-32).

I think I changed the 0.93 version to the 0.98. In 0.93 my files compiled fine, in 0.98 I get an error. I changed back to 0.93, because I need to use these characters.

Okay. Now, first off, the following compiles fine for me:

# void main()
# {
#     printf("hello Ö\n");
# }

using DMD v0.98, with the file saved as UTF-8. However, when I resaved the file
as ISO-8859-1 (which is an invalid thing to do) then the compiler (correctly)
gave me the compile-time error message: "Invalid UTF-8 sequence".

I believe that in the earlier version to which you refer (0.93) there was a
bug, which was fixed in 0.96 - according to the change log: "Invalid UTF
characters in string literals now diagnosed." In other words, DMD 0.93 failed
to diagnose the invalid UTF-8 characters in your source file, and so the file
compiled -- but it compiled incorrectly. The error would not have been detected
until runtime - and even then only IF you passed your string to a UTF
conversion routine. If you passed your invalid string straight to printf(), for
example, the v0.93 compiler wouldn't even have noticed. But you can bet your
life that even if your program had appeared to run correctly on your machine,
it would not necessarily have worked on anyone else's.

So tell me - what operating system are you using? The word "gnu" makes me
suspect Linux, in which case I believe you need to set the environment variable
CHARSET to the value UTF-8. (But I'm a Windows user, so I could be wrong - I'm
hoping someone will leap in here and correct me if so). Anyway, once you've set
your environment variable, everything should work with the latest DMD - and
this time, it will work for everyone, not just for you.
So how to I tell the dmd that the source is an ascii file?

There's a problem here - which is that you and I are not speaking the same language. An ASCII file is a file which DOES NOT CONTAIN any characters having codepoints outside the range 0x00 to 0x7F. DMD is perfectly happy with ASCII files, but your files are not ASCII. Sorry to be pedantic. Your file is /probably/ ISO-8859-1 (aka "Latin 1"). But it's not ASCII. Arcane Jill

Aug 18 2004
next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Martin" <Martin_member pathlink.com> wrote in message
news:cg0ggt$16f3$1 digitaldaemon.com...
 Yes you are probably right, it is some kind of extended ASCII - in this case I
 think it is ISO-8859-1.
 My problem is that the webserver that I am writing this software for uses the
 same encoding.
 With the old version everything worked fine. Everyone that used the server saw
 the characters right.

 So can I tell dmd to use ISO-8859-1, or just not to check the things it
 shouldn't be checking?

There's no way to do that right now. One of the problems with using such charsets in source code is the source code is then non-portable. Someone can just change a seemingly unrelated system setting, and poof, your builds fail. You can also use \xXX to specify the characters, though that is ugly enough to be unusable.
Aug 18 2004
next sibling parent Martin <Martin_member pathlink.com> writes:
I think I will use the \xXX. My workaround solution was much uglier, so I am
quite happy with this one.

Thanks!

In article <cg0n3l$1ln6$1 digitaldaemon.com>, Walter says...
"Martin" <Martin_member pathlink.com> wrote in message
news:cg0ggt$16f3$1 digitaldaemon.com...
 Yes you are probably right, it is some kind of extended ASCII - in this case I
 think it is ISO-8859-1.
 My problem is that the webserver that I am writing this software for uses the
 same encoding.
 With the old version everything worked fine. Everyone that used the server saw
 the characters right.

 So can I tell dmd to use ISO-8859-1, or just not to check the things it
 shouldn't be checking?

There's no way to do that right now. One of the problems with using such charsets in source code is the source code is then non-portable. Someone can just change a seemingly unrelated system setting, and poof, your builds fail. You can also use \xXX to specify the characters, though that is ugly enough to be unusable.

Aug 19 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cg0n3l$1ln6$1 digitaldaemon.com>, Walter says...

You can also use \xXX to specify the characters, though that is ugly
enough to be unusable.

Sorry, Walter - that's not right! You should not be encouraging the use of \xXX
in this context. This is wrong. Martin needs to be using \uXXXX, not \xXX.
Instead of \xD6, he needs to use \u00D6. (Martin, I hope you're listening).

Sticking \x's into a string literal is just another way to create an invalid
UTF-8 sequence. See this code:

# void main()
# {
#     char[] s1 = "\xD6";
#     char[] s2 = "\u00D6";
#
#     printf("s1.length = %d\n", s1.length);
#     printf("s2.length = %d\n", s2.length);
# }

This will output:

# s1.length = 1
# s2.length = 2

thereby proving that s1 contains an invalid UTF-8 sequence! (But s2 is
correct). Remember - \x is used to insert literal bytes. \u inserts characters.
All you've done is provided a way to get pre-DMD-0.96 behavior out of a
DMD-0.96+ compiler.

Arcane Jill
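For readers without a D compiler handy, the same byte-level fact can be checked illustratively in Python: \xD6 denotes one raw byte, while the character U+00D6 occupies two bytes once encoded as UTF-8 (which is what D stores in a char[]).

```python
s1 = b"\xd6"                    # analogue of D's "\xD6": a single raw byte
s2 = "\u00D6".encode("utf-8")   # analogue of D's "\u00D6": the character, UTF-8 encoded

print(len(s1))   # 1 - a lone 0xD6 byte, not valid UTF-8 on its own
print(len(s2))   # 2 - the bytes 0xC3 0x96, a valid UTF-8 sequence
```

The lengths match the s1.length/s2.length output Jill shows for the D program above.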
Aug 19 2004
next sibling parent reply Martin <Martin_member pathlink.com> writes:
I think I will move to UTF-8 with my next version of the program. I can't do it
right now, because then it needs some rewriting.
The UTF-8 output is not the problem, it's more like UTF-8 input. I need to read
the POST data from the user's browser to process it.
The problem with UTF-8 is that a character can be 1, 2, 3 or even 4 bytes long.
I do a lot of text processing and I need to rewrite, or at least look over, all
these functions. But I have a deadline coming...

I wrote my last web application with C++, didn't use UTF-8, and it works fine.
I am only writing an application for Estonian people.

But probably you are right, I need to move to UTF-8, but not before my next
version.

Martin



In article <cg1of6$18ss$1 digitaldaemon.com>, Arcane Jill says...
In article <cg0n3l$1ln6$1 digitaldaemon.com>, Walter says...

You can also use \xXX to specify the characters, though that is ugly
enough to be unusable.

Sorry, Walter - that's not right! You should not be encouraging the use of \xXX
in this context. This is wrong. Martin needs to be using \uXXXX, not \xXX.
Instead of \xD6, he needs to use \u00D6. (Martin, I hope you're listening).

Sticking \x's into a string literal is just another way to create an invalid
UTF-8 sequence. See this code:

# void main()
# {
#     char[] s1 = "\xD6";
#     char[] s2 = "\u00D6";
#
#     printf("s1.length = %d\n", s1.length);
#     printf("s2.length = %d\n", s2.length);
# }

This will output:

# s1.length = 1
# s2.length = 2

thereby proving that s1 contains an invalid UTF-8 sequence! (But s2 is
correct). Remember - \x is used to insert literal bytes. \u inserts characters.
All you've done is provided a way to get pre-DMD-0.96 behavior out of a
DMD-0.96+ compiler.

Arcane Jill

Aug 19 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cg1q02$1c1h$1 digitaldaemon.com>, Martin says...

The UTF-8 output is not the problem, it's more like UTF-8 input. I need to read
the POST data from users browser, to proccess it.

But I don't think you can make demands on what encoding in which the POST data is going to be presented, can you? You simply have to recognize it, and decode it. If the data is in ISO-whatever, you must decode that; if the data is in MAC-ROMAN, you must decode that; if the data is in UTF-8, you must decode that. And so on.
The problem with UTF-8 is that a character can be 1,2,3 or even 4 bytes long.

Indeed, but D has lots of handy functions to convert them. And the problem with ISO-8859-1 (Latin-1) is that characters beyond \u00FF are completely unrepresentable. Like, AT ALL. If someone wants to use a lowercase c with an acute accent ('\u0107'), you're completely screwed. UTF-8 is the solution.
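Both points are easy to demonstrate. A hedged, illustrative Python sketch (not D's std.utf API): UTF-8 spends one to four bytes per character, while Latin-1 simply has no representation for anything above U+00FF.

```python
# UTF-8 is variable-width: one to four bytes per character.
for ch in ["A", "\u00D6", "\u20AC", "\U0001D11E"]:   # A, Ö, the euro sign, a musical clef
    print(hex(ord(ch)), "->", len(ch.encode("utf-8")), "byte(s)")

# Latin-1 cannot represent codepoints above U+00FF at all:
try:
    "\u0107".encode("latin-1")    # lowercase c with acute accent
except UnicodeEncodeError:
    print("unrepresentable in ISO-8859-1")
```

This is the trade-off in the thread: UTF-8 costs variable-width handling in text-processing code, but Latin-1 costs the ability to represent most of Unicode.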
I wrote my last web with C++, didn't use UTF-8, and it works fine.

But only if /you/ compile it. If someone else, with a different default encoding, were to compile the same source code, it may fail badly. But it's nice to see you're writing for a non-English audience. I'm sure this trend will continue. Arcane Jill
Aug 19 2004
parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cg1rba$1fnh$1 digitaldaemon.com...
 But it's nice to see you're writing for a non-English audience. I'm sure

 trend will continue.

And that's great, because it helps us identify and shake out the problems with the internationalization support.
Aug 19 2004
prev sibling parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cg1of6$18ss$1 digitaldaemon.com...
 In article <cg0n3l$1ln6$1 digitaldaemon.com>, Walter says...

You can also use \xXX to specify the characters, though that is ugly
enough to be unusable.

 Sorry, Walter - that's not right! You should not be encouraging the use of
 \xXX in this context. This is wrong. Martin needs to be using \uXXXX, not \xXX.
 Instead of \xD6, he needs to use \u00D6. (Martin, I hope you're listening).
 Sticking \x's into a string literal is just another way to create an invalid
 UTF-8 sequence. See this code:

True, but if they're used to create a ubyte[] sequence (not a char[] sequence) it should work.
Aug 19 2004
prev sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cg0ggt$16f3$1 digitaldaemon.com>, Martin says...

My problem is, that the webserver that I am wrting this software for, uses the
same encoding.

I think that your statement might need some clarifying. Web servers by
definition need to do transcoding. Most programs need a concept of a "run-time
encoding" (so they can do printf(), etc.), but the run-time encoding of a web
server is no longer limited to that of one particular machine - a web server
has to deal with machines all over the internet, each possibly with its own
local encoding.

The "Accept" field in an HTTP request can act as a request from the browser to
the server that the web content be delivered in a particular encoding. For
example:

# Accept: text/plain, text/html; charset=UTF-8

When the page is delivered, a web server sends back:

# Content-type: text/html; charset=UTF-8

If the encoding is not specified then HTML is supposed to default to
ISO-8859-1, but XML (including XHTML) is supposed to default to UTF-8. A web
server which doesn't do UTF-8, or which doesn't do transcoding, is all but
useless.

That said, you may still be able to get away with it. If you send all your web
content in a particular encoding, then, as long as it is marked as such, the
user's browser /may/ be able to reinterpret the page (the Accept request header
is supposed to advise you of what the browser can or can't deal with).

So, when you say "the webserver ... uses the same encoding [ISO-8859-1]", I'm
still not clear what it uses that encoding /for/. It's the default for HTML,
but are you saying your server emits no other encoding? Not even UTF-8? That
would be weird. Any chance you could clarify?
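The transcoding being described amounts to keeping text as characters internally and encoding to the declared charset only at the output boundary. A minimal, illustrative Python sketch (not taken from any real server; the page text is invented for the example):

```python
# Internal representation: a Unicode string, independent of any wire encoding.
page = "Tere tulemast! \u00D6\u00DC\u00C4"

def respond(text: str, charset: str) -> bytes:
    # Declare the charset in the header, and encode the body to match it.
    header = f"Content-Type: text/html; charset={charset}\r\n\r\n"
    return header.encode("ascii") + text.encode(charset)

print(respond(page, "ISO-8859-1"))   # body bytes are Latin-1
print(respond(page, "UTF-8"))        # same text, different bytes on the wire
```

The point is that the declared charset and the body's actual encoding are produced from the same place, so they can never disagree - which is exactly the mismatch that bit the source files earlier in the thread.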
With the old version everything worked fine. Everyone that used the server saw
the characters right. 

Providing your server emitted "Content-type: text/html; charset=ISO-8859-1" in its response headers, (or just "Content-type: text/html" since ISO-8859-1 is the default for HTML - but that's dangerous, since not all browsers obey the W3C spec), that is likely to be true. But still, you're relying on a parochial character set, and it /is/ possible that some viewers of your server simply won't have that encoding in their browser.
So can I tell the dmd to use  ISO-8859-1, or just not to check the things it
shouldn't be checking?

No. You *MUST* save your DMD source files in either ASCII or UTF-8 before
attempting to compile them. If you wish to emit output in ISO-8859-1 then you
must ISO-8859-1-encode the output at runtime (which is easy - I can show you
how to do that).

But why is saving your source file as UTF-8 hard? I've never heard of a modern
text editor which can't do it, but if you've discovered one, why not just
change to a different text editor?

Nonetheless - if you really can't figure out how to save in UTF-8 (which would
be surprising for someone writing a web server, with all the transcoding
understanding required thereby), then your only remaining choice is to save as
ASCII. You can do this by replacing your non-ASCII characters either by Unicode
escape sequences (if you want DMD to interpret them) or HTML entities (if you
want the users' browsers to interpret them). So replace as follows:

# Character    Escape sequence    HTML entity
# ~~~~~~~~~    ~~~~~~~~~~~~~~~    ~~~~~~~~~~~
#     Ü            \u00DC          &#x00DC;
#     Ä            \u00C4          &#x00C4;
#     Ö            \u00D6          &#x00D6;

Hope that helps.
Arcane Jill
Aug 19 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cfvqq8$2jhu$1 digitaldaemon.com...
 There's a problem here - which is that you and I are not speaking the same
 language. An ASCII file is a file which DOES NOT CONTAIN any characters

 codepoints outside the range 0x00 to 0x7F. DMD is perfectly happy with

 files, but your files are not ASCII.

 Sorry to be pedantic. Your file is /probably/ ISO-8859-1 (aka "Latin 1").

 it's not ASCII.

You write well and understand the issues involved. Can I suggest that you write an article about this for, say, CUJ or DDJ? Such an article exploring this topic is sorely needed.
Aug 18 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cg0gsg$16u8$2 digitaldaemon.com>, Walter says...
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cfvqq8$2jhu$1 digitaldaemon.com...
 There's a problem here - which is that you and I are not speaking the same
 language. An ASCII file is a file which DOES NOT CONTAIN any characters

 codepoints outside the range 0x00 to 0x7F. DMD is perfectly happy with

 files, but your files are not ASCII.

 Sorry to be pedantic. Your file is /probably/ ISO-8859-1 (aka "Latin 1").

 it's not ASCII.

You write well and understand the issues involved. Can I suggest that you write an article about this for, say, CUJ or DDJ? Such an article exploring this topic is sorely needed.

Could be fun. So what are CUJ and DDJ? Could someone give me some URLs? Jill
Aug 19 2004
parent reply Jonathan Leffler <jleffler earthlink.net> writes:
Arcane Jill wrote:

 In article <cg0gsg$16u8$2 digitaldaemon.com>, Walter:
You write well and understand the issues involved. Can I suggest that you
write an article about this for, say, CUJ or DDJ? Such an article exploring
this topic is sorely needed.

Could be fun. So what are CUJ and DDJ? Could someone give me some URLs?

CUJ = C User's Journal (or possibly Users'?) http://www.cuj.com/ (where
there's no apostrophe in sight)
DDJ = Dr Dobb's Journal http://www.ddj.com/

-- 
Jonathan Leffler #include <disclaimer.h>
Email: jleffler earthlink.net, jleffler us.ibm.com
Guardian of DBD::Informix v2003.04 -- http://dbi.perl.org/
Aug 19 2004
parent "Walter" <newshound digitalmars.com> writes:
"Jonathan Leffler" <jleffler earthlink.net> wrote in message
news:cg1n1p$13qg$1 digitaldaemon.com...
 Arcane Jill wrote:

 In article <cg0gsg$16u8$2 digitaldaemon.com>, Walter:
You write well and understand the issues involved. Can I suggest that you
write an article about this for, say, CUJ or DDJ? Such an article exploring
this topic is sorely needed.

Could be fun. So what are CUJ and DDJ? Could someone give me some URLs?

CUJ = C User's Journal (or possibly Users'?) http://www.cuj.com/ (where there's no apostrophe in sight) DDJ = Dr Dobb's Journal http://www.ddj.com/

Yes, they're the two main print publications that C/C++ programmers read. The D
articles published by them have been well received, and the publisher (CMP
Media) has indicated they want more. And besides, they even pay for articles!

Getting published in CUJ or DDJ is fairly prestigious, and will look good on
any resume. Many of the top highly paid C++ professionals built their
reputation early on by writing articles. Many companies also have a policy of
giving a bonus to engineering employees who get published in a magazine;
that's worth checking out. So it's really an everybody-wins kind of situation.
Aug 19 2004
prev sibling parent Nick <Nick_member pathlink.com> writes:
In article <cfvlcp$2eob$1 digitaldaemon.com>, Martin says...
Thank you for your answer!
I have a sneaking suspicion you might find it will work just fine if you save
your source file in UTF-8 before trying to compile it. (Save As...). 

I am using the gnu midnight commander text editor, it only saves ascii.

If you are on linux you can convert from latin1 to utf8 with the command

# iconv -f latin1 -t utf8 file.d > newfile.d
# dmd newfile.d

You will probably be doing that a lot, so it's best if you can put it in a
script or something. Hope this helps :)

Nick
Aug 18 2004
prev sibling parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cfvjdr$2dr2$1 digitaldaemon.com...
Most of the C compilers accept them, why not D?

Actually, I think most C compilers simply allow a string to consist of an
arbitrary sequence of bytes without any interpretation whatsoever - which just
happens to appear to work whenever the source file encoding is the same as the
run-time encoding.

It doesn't always work, some of the code pages include multibyte sequences where " can be the second byte :-(. That's why DMC has special switches for such. This is just the sort of thing I want to move away from.
Aug 18 2004